Introduction
The ROC (Receiver Operating Characteristic) curve and AUC are often used to evaluate the performance of a binary classifier. This article will first briefly introduce ROC and AUC, and then demonstrate how to create an ROC curve graph and calculate AUC using Python.
Introduction to AUC
AUC (Area Under Curve) is a very commonly used evaluation metric for binary classification models in machine learning. Compared with the F1-Score, it is more tolerant of class imbalance. Common machine learning libraries (such as scikit-learn) generally include a calculation of this metric, but sometimes the model is standalone or self-written, and then an AUC calculation module must be written in order to evaluate how good the trained model is. While searching for material, I found that libsvm-tools contains a very easy-to-understand AUC calculation, so it is extracted here for future use.
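One way to see why AUC tolerates class imbalance: AUC equals the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one, so it depends on the ranking rather than on the class ratio. The following is only a minimal sketch of that pairwise view, not the libsvm-tools method. The function name pairwise_auc and the sample scores are made up for illustration.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Hypothetical helper for illustration; labels are 1 (positive) or 0 (negative).
    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up data: 2 positives and 2 negatives.
balanced = pairwise_auc([0.9, 0.7, 0.8, 0.6], [1, 1, 0, 0])

# Same score distribution, but with ten times as many negatives.
imbalanced = pairwise_auc([0.9, 0.7] + [0.8, 0.6] * 10, [1, 1] + [0, 0] * 10)
```

In this made-up example both calls give the same value, because repeating the negative samples changes the class ratio but not how positives rank against negatives. The double loop is quadratic, so it is only suitable for small data.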
AUC Calculation
The calculation of AUC is divided into the following three steps:
1. Prepare the data. If only a training set was used during model training, the scores are generally obtained through cross-validation; if there is a separate evaluation set, AUC can be calculated on it directly. The data generally consists of the predicted score and its target category (note: the target category, not the predicted category).
2. For each threshold, obtain a coordinate point: the horizontal coordinate X is the False Positive Rate and the vertical coordinate Y is the True Positive Rate.
3. Connect the coordinate points into a curve and calculate the area under it; this area is the value of AUC.
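The three steps above can be sketched on a tiny in-memory dataset. This is a minimal illustration under stated assumptions, not the libsvm-tools code: the function names roc_points and auc_trapezoid and the sample data are hypothetical, and the area is computed with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Step 2: one (FPR, TPR) point per distinct score threshold.

    Hypothetical helper; labels are the target categories, 1 or 0.
    """
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    # Sort by predicted score, highest first.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / float(neg), tp / float(pos)))
    return points


def auc_trapezoid(points):
    """Step 3: area under the piecewise-linear curve through the points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Step 1: made-up predicted scores and their target categories.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
auc_value = auc_trapezoid(roc_points(scores, labels))
```

For this made-up data the curve passes through (0, 0.5), (0.5, 0.5), (0.5, 1) and (1, 1), which gives an area of 0.75.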
Here is the Python code:

    # -*- coding: utf-8 -*-
    import pylab as pl

    evaluate_result = "your file path"

    db = []  # each entry: [score, nonclk, clk]
    pos, neg = 0, 0
    with open(evaluate_result, 'r') as fs:
        for line in fs:
            nonclk, clk, score = line.strip().split('\t')
            nonclk = int(nonclk)
            clk = int(clk)
            score = float(score)
            db.append([score, nonclk, clk])
            pos += clk
            neg += nonclk

    # Sort by score, highest first
    db = sorted(db, key=lambda x: x[0], reverse=True)

    # Calculate ROC coordinate points
    xy_arr = []
    tp, fp = 0., 0.
    for i in range(len(db)):
        tp += db[i][2]
        fp += db[i][1]
        xy_arr.append([fp / neg, tp / pos])

    # Calculate the area under the curve (sum of rectangles)
    auc = 0.
    prev_x = 0
    for x, y in xy_arr:
        if x != prev_x:
            auc += (x - prev_x) * y
            prev_x = x

    print("the auc is %s." % auc)

    x = [_v[0] for _v in xy_arr]
    y = [_v[1] for _v in xy_arr]
    pl.title("ROC curve of %s (AUC = %.4f)" % ('svm', auc))
    pl.xlabel("False Positive Rate")
    pl.ylabel("True Positive Rate")
    pl.plot(x, y)  # use pylab to plot x and y
    pl.show()  # show the plot on the screen
For an example of the input dataset, refer to the svm prediction result.
The format is:
nonclk \t clk \t score
Among them:
1. nonclk: the number of items that were not clicked, which can be regarded as the number of negative samples
2. clk: the number of clicks, which can be regarded as the number of positive samples
3. score: the predicted score. Grouping the positive and negative samples by this score reduces the amount of computation needed for AUC
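As an illustration of the grouping mentioned for the score field, the sketch below turns per-sample (score, label) pairs into nonclk, clk, score rows. It is an assumption-laden example: the names group_by_score and to_tsv_lines, the digits parameter, and the idea of rounding scores to merge near-identical values are all hypothetical and not taken from libsvm-tools, and the sample data is made up.

```python
from collections import defaultdict


def group_by_score(samples, digits=3):
    """Aggregate (score, label) pairs into (nonclk, clk, score) rows.

    Hypothetical helper. Scores are rounded to `digits` decimal places
    (an assumption made here so that near-identical scores share a row).
    """
    counts = defaultdict(lambda: [0, 0])  # rounded score -> [nonclk, clk]
    for score, label in samples:
        key = round(score, digits)
        counts[key][1 if label == 1 else 0] += 1
    # Highest score first, matching the order the AUC code processes rows in.
    return [(nonclk, clk, score)
            for score, (nonclk, clk) in sorted(counts.items(), reverse=True)]


def to_tsv_lines(rows):
    """Format rows as tab-separated 'nonclk, clk, score' lines."""
    return ["%d\t%d\t%s" % (nonclk, clk, score) for nonclk, clk, score in rows]


# Made-up samples: three scores that collapse to 0.9 when rounded to 2 places.
samples = [(0.9012, 1), (0.9008, 0), (0.9011, 1), (0.25, 0)]
rows = group_by_score(samples, digits=2)
lines = to_tsv_lines(rows)
```

Coarser rounding gives fewer rows and less work, at the cost of treating samples within the same rounded score as tied.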
The result of the run is:
If pylab is not installed on this machine, you can comment out the pylab import and the plotting part.
Note
The code posted above:
1. It can only evaluate binary classification results (the labels of the two classes can be mapped however you like).
2. Every score in the code above is used as a threshold, which is quite inefficient. We can instead subsample the samples, or divide the score range into equal intervals when calculating the horizontal coordinates.
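The equal-division idea in the second note could look something like the sketch below. It is only one possible reading: the function name binned_auc, the n_bins parameter, and the choice of equal-width bins between the minimum and maximum score are assumptions, and the result is an approximation whose accuracy depends on the number of bins.

```python
def binned_auc(scores, labels, n_bins=100):
    """Approximate AUC using n_bins equal-width score intervals as thresholds.

    Hypothetical helper; labels are 1 (positive) or 0 (negative).
    """
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / float(n_bins)
    if width == 0.0:
        width = 1.0  # all scores identical: everything falls into bin 0
    pos_hist = [0] * n_bins
    neg_hist = [0] * n_bins
    for s, y in zip(scores, labels):
        # Clamp so that the maximum score lands in the last bin.
        b = min(int((s - lo) / width), n_bins - 1)
        if y == 1:
            pos_hist[b] += 1
        else:
            neg_hist[b] += 1
    pos, neg = sum(pos_hist), sum(neg_hist)
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")

    auc = 0.0
    tp = fp = 0
    prev_x = prev_y = 0.0
    # Walk from the highest-score bin down, one threshold per bin.
    for b in range(n_bins - 1, -1, -1):
        tp += pos_hist[b]
        fp += neg_hist[b]
        x, y = fp / float(neg), tp / float(pos)
        auc += (x - prev_x) * (prev_y + y) / 2.0  # trapezoidal rule
        prev_x, prev_y = x, y
    return auc


# Made-up data; with enough bins each score gets its own threshold.
fine = binned_auc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0], n_bins=100)
coarse = binned_auc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0], n_bins=1)
```

The loop over thresholds now runs n_bins times regardless of how many samples there are. With a single bin the curve degenerates to the diagonal, which shows how too few bins can hide the ranking information.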
Summary
That is all for this article. I hope its content is of some help in your study or work. If you have any questions, you can leave a message to discuss them.