This article shares a concrete Python implementation of the KNN classification algorithm for reference; the details are as follows.
The KNN classification algorithm is one of the simplest classification algorithms in machine learning. KNN stands for K-Nearest Neighbor (the K nearest sample points). Before classifying anything, a KNN classifier reads in a large set of labeled samples as reference data. To classify an unknown sample, it measures how different that sample is from every reference sample, where the difference is the distance between the two points in the multi-dimensional feature space: the closer two sample points are in that space, the smaller the difference between them, and the more likely they belong to the same category. Building on this intuition, the KNN algorithm computes the distance from the point to be predicted to every sample in the reference set, finds the K reference points nearest to it, counts which category appears most often among those K neighbors, and returns that category as the prediction.
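The core of that procedure fits in a few lines of Python. The following is a minimal sketch of the idea only; the names knn_predict, samples, and dist are illustrative and not part of the full implementation given later:

from collections import Counter

def knn_predict(query, samples, labels, k, dist):
    # Sort the reference samples by their distance to the query point
    ranked = sorted(zip(samples, labels), key=lambda s: dist(query, s[0]))
    # Vote: the most common label among the k nearest samples wins
    nearest_labels = [lab for _, lab in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]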
The KNN model is very simple: it involves no model training, and every prediction requires computing the distance between the query point and all known points. The computational cost of a KNN classifier therefore grows in proportion to the size of the reference sample set, which makes KNN poorly suited to sample sets with a large number of samples. Since KNN was proposed, many improved algorithms have been published that aim to make it faster and more accurate, but they all rest on the same principle: the closer the distance, the greater the likelihood of similarity (one common speed-up is sketched below). Here the original version of the KNN algorithm is implemented in Python. The dataset is the Iris dataset commonly used in machine learning courses; in addition, I added a small amount of noise to the original data to test the robustness of the KNN algorithm.
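Many of those speed improvements replace the brute-force distance scan with a spatial index such as a k-d tree. As a point of comparison only (this is not part of the implementation below, and it assumes SciPy is installed), the K nearest neighbors of a query point can be found with scipy.spatial.cKDTree:

import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(1000, 4)  # 1000 reference samples with 4 features each
tree = cKDTree(points)            # build the spatial index once, up front
# Distances and indices of the 20 reference samples nearest to a random query
dists, idx = tree.query(np.random.rand(4), k=20)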
The dataset is the Iris dataset (download link).
The dataset contains 90 data points (training set), divided into 2 classes with 45 data points each; each data point has 4 attributes:
Sepal.Length (sepal length), in cm;
Sepal.Width (sepal width), in cm;
Petal.Length (petal length), in cm;
Petal.Width (petal width), in cm.
Classification types: Iris Setosa (Mountain Iris), Iris Versicolour (Multicolored Iris)
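The loader below expects the file to start with a header line of attribute names, followed by one comma-separated record per sample with the class label in the last field. A few records of Iris_train.txt presumably look like this (values shown for illustration only):

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor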
Previously I mostly worked in C++; I have recently been learning Python, so today I practiced by implementing KNN. Below is the code:
#coding=utf-8
import math

# Define the Iris data class
class Iris:
    data = []
    label = []

# Define a function to read the Iris dataset
def load_dataset(filename="Iris_train.txt"):
    f = open(filename)
    line = f.readline().strip()
    propty = line.split(',')  # attribute names from the header line
    dataset = []  # holds the feature data of each sample
    label = []    # holds the label of each sample
    while line:
        line = f.readline().strip()
        if not line:
            break
        temp = line.split(',')
        content = []
        for i in temp[0:-1]:
            content.append(float(i))
        dataset.append(content)
        label.append(temp[-1])
    f.close()
    total = Iris()
    total.data = dataset
    total.label = label
    return total  # return the dataset

# Define a KNN classifier class
class KnnClassifier:
    def __init__(self, k, type="Euler"):
        # Set the positive integer K and the distance metric at initialization
        self.k = k
        self.type = type
        self.dataloaded = False

    def load_traindata(self, traindata):
        # Load the reference (training) dataset
        self.data = traindata.data
        self.label = traindata.label
        self.label_set = set(traindata.label)
        self.dataloaded = True  # mark that the dataset has been loaded

    def Euler_dist(self, x, y):
        # Euclidean distance between vectors x and y
        sum = 0
        for i, j in zip(x, y):
            sum += (i - j) ** 2
        return math.sqrt(sum)

    def Manhattan_dist(self, x, y):
        # Manhattan distance between vectors x and y
        sum = 0
        for i, j in zip(x, y):
            sum += abs(i - j)
        return sum

    def predict(self, temp):
        # Prediction function; temp is the feature vector of one sample
        if not self.dataloaded:  # check that training data has been loaded
            print("No train_data load in")
            return
        distance_and_label = []
        if self.type == "Euler":  # choose the metric: Euclidean or Manhattan
            for i, j in zip(self.data, self.label):
                dist = self.Euler_dist(temp, i)
                distance_and_label.append([dist, j])
        elif self.type == "Manhattan":
            for i, j in zip(self.data, self.label):
                dist = self.Manhattan_dist(temp, i)
                distance_and_label.append([dist, j])
        else:
            print("type choice error")
            return
        # Get the distances and class labels of the K nearest samples
        neighborhood = sorted(distance_and_label, key=lambda x: x[0])[0:self.k]
        neighborhood_class = []
        for i in neighborhood:
            neighborhood_class.append(i[1])
        class_set = set(neighborhood_class)
        neighborhood_class_count = []
        print("In k nearest neighborhoods:")
        # Count the occurrences of each category among the K nearest points
        for i in class_set:
            a = neighborhood_class.count(i)
            neighborhood_class_count.append([i, a])
            print("class: ", i, " count: ", a)
        result = sorted(neighborhood_class_count, key=lambda x: x[1])[-1][0]
        print("result: ", result)
        return result  # return the predicted category

if __name__ == '__main__':
    traindata = load_dataset()                # training data
    testdata = load_dataset("Iris_test.txt")  # testing data
    # Create a KNN classifier with K=20; the Euclidean distance is the default
    kc = KnnClassifier(20)
    kc.load_traindata(traindata)
    predict_result = []
    # Predict every sample in the test set testdata
    for i, j in zip(testdata.data, testdata.label):
        predict_result.append([i, kc.predict(i), j])
    correct_count = 0
    # Compare predictions with the true labels to compute the accuracy
    for i in predict_result:
        if i[1] == i[2]:
            correct_count += 1
    ratio = float(correct_count) / len(predict_result)
    print("correct predicting ratio", ratio)
Classification results for the 11 test sample points in the test set:
In k nearest neighborhoods:
class:  Iris-setosa  count:  20
result:  Iris-setosa
In k nearest neighborhoods:
class:  Iris-setosa  count:  20
result:  Iris-setosa
In k nearest neighborhoods:
class:  Iris-setosa  count:  20
result:  Iris-setosa
In k nearest neighborhoods:
class:  Iris-setosa  count:  20
result:  Iris-setosa
In k nearest neighborhoods:
class:  Iris-setosa  count:  20
result:  Iris-setosa
In k nearest neighborhoods:
class:  Iris-versicolor  count:  20
result:  Iris-versicolor
In k nearest neighborhoods:
class:  Iris-versicolor  count:  20
result:  Iris-versicolor
In k nearest neighborhoods:
class:  Iris-versicolor  count:  20
result:  Iris-versicolor
In k nearest neighborhoods:
class:  Iris-versicolor  count:  20
result:  Iris-versicolor
In k nearest neighborhoods:
class:  Iris-versicolor  count:  20
result:  Iris-versicolor
In k nearest neighborhoods:
class:  Iris-setosa  count:  18
class:  Iris-versicolor  count:  2
result:  Iris-setosa
correct predicting ratio 0.909090909091
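For reference, the same experiment can be reproduced with scikit-learn's KNeighborsClassifier in a few lines. This is a sketch only, assuming scikit-learn is installed and that traindata and testdata have been loaded with load_dataset as above:

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=20)  # same K=20; Euclidean distance by default
clf.fit(traindata.data, traindata.label)    # "fitting" just stores the reference samples
print("correct predicting ratio", clf.score(testdata.data, testdata.label))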
There are many ways to compute distance in KNN, and different metrics suit different datasets; this code implements only two of them, the Euclidean distance and the Manhattan distance. The test set was extracted from the original dataset, and since the amount of data is small, the results cannot fully reflect the performance of KNN, so the program output is for reference only.
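Both metrics used here are in fact special cases of the Minkowski distance: p=1 gives the Manhattan distance and p=2 the Euclidean distance. A sketch of how it could be added to the classifier in the same style (the parameter p is an addition for illustration, not part of the code above):

def Minkowski_dist(self, x, y, p=2):
    # Minkowski distance between vectors x and y; p=1 is Manhattan, p=2 is Euclidean
    sum = 0
    for i, j in zip(x, y):
        sum += abs(i - j) ** p
    return sum ** (1.0 / p)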
That's all for this article. I hope it is helpful to your learning, and I also hope you will continue to support Yell Tutorial.