Using Python to Draw an ROC Curve and Calculate the AUC Value

Introduction

The ROC (Receiver Operating Characteristic) curve and AUC are often used to evaluate the performance of a binary classifier. This article will first briefly introduce ROC and AUC, and then demonstrate how to create an ROC curve graph and calculate AUC using Python.

Introduction to AUC

AUC (Area Under Curve) is a very commonly used evaluation metric for binary classification models in machine learning. Compared with the F1-Score, it is more tolerant of class imbalance. Common machine learning libraries (such as scikit-learn) generally include a calculation of this metric, but sometimes the model is standalone or self-written, and in that case an AUC calculation module must be written in order to evaluate how good the trained model is. While searching for information, I found that libsvm-tools has a very easy-to-understand AUC calculation, so it is extracted here for future use.

AUC Calculation

The calculation of AUC is divided into the following three steps:

    1. Prepare the data. If only a training set was used during model training, the data is generally obtained by cross-validation; if there is an evaluation set, it can be used directly. The data generally consists of the predicted score and its target category (note that this is the target category, not the predicted category).

    2. For each threshold, obtain the horizontal (X: False Positive Rate) and vertical (Y: True Positive Rate) coordinates of a point.

    3. After connecting the coordinate points into a curve, calculate the area under the curve, which is the value of AUC.
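The three steps above can be sketched on a tiny in-memory dataset. This is a minimal illustration, not the article's script: the function names `roc_points` and `auc_trapezoid` and the sample data are made up for the example, each sample is assumed to be a (score, label) pair with label 1 for positive, and the area is computed with the trapezoidal rule rather than the rectangle sum used in the script in this article.

```python
def roc_points(samples):
    """Return ROC points [(fpr, tpr), ...] from (score, label) pairs (label 1 = positive)."""
    pos = sum(1 for _, label in samples if label == 1)
    neg = len(samples) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    # Step 1: the prepared data, ranked by predicted score (highest first)
    ranked = sorted(samples, key=lambda s: s[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Step 2: treat each distinct score as a threshold and record (FPR, TPR)
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # emit a point only after the last sample of a group of tied scores
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Step 3: area under the piecewise-linear curve through the points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical data: (predicted score, target category)
samples = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.4, 0), (0.2, 0)]
points = roc_points(samples)
print(points)
print("AUC = %.4f" % auc_trapezoid(points))  # 8 of the 9 positive/negative pairs are ranked correctly
```

In this example 8 of the 9 (positive, negative) pairs have the positive sample scored higher, so the area comes out as 8/9.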

The Python code is given directly below:

# -*- coding: utf-8 -*-
import pylab as pl

evaluate_result = "your file path"
db = []  # each entry: [score, nonclk, clk]
pos, neg = 0, 0
# Read one "nonclk \t clk \t score" record per line
with open(evaluate_result, 'r') as fs:
    for line in fs:
        nonclk, clk, score = line.strip().split('\t')
        nonclk = int(nonclk)
        clk = int(clk)
        score = float(score)
        db.append([score, nonclk, clk])
        pos += clk
        neg += nonclk
# Sort the records by score, highest first
db = sorted(db, key=lambda x: x[0], reverse=True)
# Calculate ROC coordinate points
xy_arr = []
tp, fp = 0., 0.
for i in range(len(db)):
    tp += db[i][2]
    fp += db[i][1]
    xy_arr.append([fp / neg, tp / pos])
# Calculate the area under the curve (sum of rectangles)
auc = 0.
prev_x = 0
for x, y in xy_arr:
    if x != prev_x:
        auc += (x - prev_x) * y
        prev_x = x
print("the auc is %s." % auc)
x = [_v[0] for _v in xy_arr]
y = [_v[1] for _v in xy_arr]
pl.title("ROC curve of %s (AUC = %.4f)" % ('svm', auc))
pl.xlabel("False Positive Rate")
pl.ylabel("True Positive Rate")
pl.plot(x, y)  # use pylab to plot x and y
pl.show()  # show the plot on the screen

For an example of the input dataset, refer to the svm prediction result.

The format is:

nonclk \t clk \t score

Where:

    1. nonclk: the number of records that were not clicked, which can be regarded as the number of negative samples.

    2. clk: the number of clicks, which can be regarded as the number of positive samples.

    3. score: the predicted score. Grouping the positive and negative samples by this score reduces the amount of computation needed for the AUC.
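To make the grouped format concrete, here is a small sketch of how raw per-sample predictions might be aggregated into `nonclk \t clk \t score` lines. The helper name `group_by_score`, the rounding to two decimal places, and the sample data are assumptions made for this illustration; the article does not say how its input file was produced.

```python
from collections import defaultdict


def group_by_score(samples, ndigits=2):
    """Aggregate (score, label) pairs into 'nonclk<TAB>clk<TAB>score' lines.

    Scores are rounded to `ndigits` decimals (an assumption of this sketch),
    so samples with nearly identical scores share one line.
    """
    counts = defaultdict(lambda: [0, 0])  # rounded score -> [nonclk, clk]
    for score, label in samples:
        key = round(score, ndigits)
        counts[key][1 if label == 1 else 0] += 1
    lines = []
    for key in sorted(counts, reverse=True):  # highest score first
        nonclk, clk = counts[key]
        lines.append("%d\t%d\t%s" % (nonclk, clk, key))
    return lines


# Hypothetical raw predictions: (score, label), label 1 = clicked
samples = [(0.912, 1), (0.914, 0), (0.91, 1), (0.30, 0), (0.301, 0)]
lines = group_by_score(samples)
for line in lines:
    print(line)
```

Five raw samples collapse into two lines here, which is the kind of reduction in work that grouping by score gives the AUC loop.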

Running the script prints the AUC value and displays the ROC curve.

If pylab is not installed on your machine, you can comment out the pylab import and the plotting part.
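As an alternative to commenting lines out by hand, the plotting can be made optional. The sketch below is one possible arrangement, not part of the original script: the names `HAVE_PYLAB` and `plot_roc` are hypothetical, and it relies only on the standard library's `importlib.util.find_spec` to check whether the pylab module is available.

```python
import importlib.util

# True only if the pylab module (shipped with matplotlib) can be found
HAVE_PYLAB = importlib.util.find_spec("pylab") is not None


def plot_roc(x, y, auc):
    """Plot the ROC curve if pylab is available; return whether a plot was drawn."""
    if not HAVE_PYLAB:
        print("pylab not found, skipping the plot (AUC = %.4f)" % auc)
        return False
    import pylab as pl  # imported lazily so the rest of the script runs without it
    pl.title("ROC curve (AUC = %.4f)" % auc)
    pl.xlabel("False Positive Rate")
    pl.ylabel("True Positive Rate")
    pl.plot(x, y)
    pl.show()
    return True
```

With this in place, the script would call `plot_roc(x, y, auc)` after printing the AUC, and the AUC calculation itself no longer depends on pylab.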

Note

Regarding the code posted above:

    1. It can only calculate results for binary classification (the binary labels themselves can be encoded in any way).

    2. Every score is used as a threshold, which is actually quite inefficient. We can instead sample the data, or divide the score range into equal intervals, when calculating the x-coordinates.
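The equal-division idea in the second note can be sketched as follows. This is an illustrative approximation, not code from the article: `binned_roc`, `area_under`, and `num_bins` are hypothetical names, scores are assumed to lie in [0, 1], and the area is computed with the trapezoidal rule. Samples are counted into equal-width score bins in one pass, and the ROC points are then read off the bins from the highest score to the lowest.

```python
def binned_roc(samples, num_bins=100):
    """Approximate ROC points from (score, label) pairs using equal-width bins.

    Assumes scores lie in [0, 1]; samples in the same bin are treated as tied.
    """
    pos_hist = [0] * num_bins
    neg_hist = [0] * num_bins
    for score, label in samples:
        # clamp so that a score of exactly 1.0 falls into the last bin
        b = min(max(int(score * num_bins), 0), num_bins - 1)
        if label == 1:
            pos_hist[b] += 1
        else:
            neg_hist[b] += 1
    pos, neg = sum(pos_hist), sum(neg_hist)
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    points = [(0.0, 0.0)]
    tp = fp = 0
    for b in range(num_bins - 1, -1, -1):  # lower the threshold one bin at a time
        tp += pos_hist[b]
        fp += neg_hist[b]
        points.append((fp / neg, tp / pos))
    return points


def area_under(points):
    """Trapezoidal area under the curve through the points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical data: (predicted score, target category)
samples = [(0.95, 1), (0.85, 1), (0.75, 0), (0.65, 1), (0.45, 0), (0.25, 0)]
points = binned_roc(samples, num_bins=10)
area = area_under(points)
print("approximate AUC = %.4f" % area)
```

The number of ROC points is fixed by `num_bins` instead of growing with the number of distinct scores. The price is that samples falling into the same bin are treated as ties, so the result is only an approximation when the bins are coarse.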

Summary

That is all for this article. I hope it is of some help in your study or work. If you have any questions, feel free to leave a message.
