
Implementation of KNN classification algorithm using Python

This article shares a concrete Python implementation of the KNN classification algorithm for your reference, as follows.

The KNN classification algorithm can be regarded as one of the simplest classification algorithms in machine learning. KNN stands for K-Nearest Neighbor (the K nearest sample points). Before classifying anything, a KNN classifier reads in a large number of labeled samples as reference data. To classify an unknown sample, it computes how different that sample is from every reference sample, where the difference is measured by the distance between the points in the multi-dimensional feature space of the samples: the closer two sample points lie in that space, the smaller the difference between them, and the more likely they belong to the same category. The KNN algorithm builds on this intuition. It computes the distance between the sample to be predicted and every sample in the reference set, finds the K reference points nearest to it, determines which category occurs most frequently among those K neighbors, and takes that category as the prediction.

The KNN model is very simple: there is no training step, and every prediction requires computing the distance between the query point and all known points. As the reference sample set grows, the computational cost of the KNN classifier grows proportionally, so KNN is not suitable for sample sets with a very large number of samples. Since KNN was proposed, many improved algorithms have been put forward, aiming at both higher speed and higher accuracy, but they all rest on the same principle: the closer the distance, the more likely two samples are similar. Here the original version of the KNN algorithm is implemented in Python. The dataset is the Iris dataset commonly used in machine learning courses; I also added a small amount of noise to the original data to test the robustness of the KNN algorithm.
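The procedure just described (compute all distances, keep the K nearest, vote on their labels) fits in a few lines of modern Python. This is only an illustrative sketch, separate from the article's own implementation below; the function name `knn_predict` and the toy data are mine.

```python
import math
from collections import Counter

def knn_predict(query, samples, labels, k=3):
    """Classify `query` by majority vote among its k nearest samples."""
    # Pair each reference sample's Euclidean distance to the query with its label
    dists = sorted(
        (math.dist(query, s), lbl) for s, lbl in zip(samples, labels)
    )
    # Majority vote among the k closest labels
    votes = Counter(lbl for _, lbl in dists[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D data: two clusters around (0, 0) and (5, 5)
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict((0.5, 0.5), X, y, k=3))  # -> a
print(knn_predict((5.5, 5.5), X, y, k=3))  # -> b
```

`math.dist` requires Python 3.8+; on older versions the distance can be computed with `sum((i - j) ** 2 for i, j in zip(x, y)) ** 0.5`.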

The data come from the Iris dataset (a download link was given in the original post).

The dataset (training set) contains 90 data points, divided into 2 classes with 45 points each; each data point has 4 attributes:

Sepal.Length (sepal length), unit is cm;
Sepal.Width (sepal width), unit is cm;
Petal.Length (petal length), unit is cm;
Petal.Width (petal width), unit is cm;

Classification types: Iris Setosa (Mountain Iris), Iris Versicolour (Multicolored Iris)
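The file format assumed by the loader below (a header line of attribute names, then one comma-separated row per sample with the class label in the last column) can also be parsed with the standard `csv` module. A small sketch; the two sample rows are illustrative, not taken from the actual file:

```python
import csv
import io

# Two rows in the format the article's loader expects
sample_file = io.StringIO(
    "Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)

reader = csv.reader(sample_file)
header = next(reader)  # attribute names
data, labels = [], []
for row in reader:
    data.append([float(v) for v in row[:-1]])  # 4 numeric features
    labels.append(row[-1])                     # class label

print(data[0])   # -> [5.1, 3.5, 1.4, 0.2]
print(labels)    # -> ['Iris-setosa', 'Iris-versicolor']
```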
Previously I worked mainly in C++ and have recently been learning Python, so today I practiced by implementing KNN. Below is the code:

# coding=utf-8
import math

# Class holding the Iris data and labels
class Iris:
    def __init__(self):
        self.data = []
        self.label = []

# Read the Iris dataset from a comma-separated text file
def load_dataset(filename="Iris_train.txt"):
    f = open(filename)
    line = f.readline().strip()
    propty = line.split(',')  # attribute names from the header line
    dataset = []  # feature vectors of the samples
    label = []    # class labels of the samples
    while line:
        line = f.readline().strip()
        if not line:
            break
        temp = line.split(',')
        content = []
        for i in temp[0:-1]:
            content.append(float(i))
        dataset.append(content)
        label.append(temp[-1])
    f.close()
    total = Iris()
    total.data = dataset
    total.label = label
    return total  # return the dataset

# KNN classifier
class KnnClassifier:
    def __init__(self, k, type="Euler"):  # positive integer K and the distance metric
        self.k = k
        self.type = type
        self.dataloaded = False

    def load_traindata(self, traindata):  # load the reference (training) dataset
        self.data = traindata.data
        self.label = traindata.label
        self.label_set = set(traindata.label)
        self.dataloaded = True  # mark that training data has been loaded

    def Euler_dist(self, x, y):  # Euclidean distance between vectors x and y
        s = 0
        for i, j in zip(x, y):
            s += (i - j) ** 2
        return math.sqrt(s)

    def Manhattan_dist(self, x, y):  # Manhattan distance between vectors x and y
        s = 0
        for i, j in zip(x, y):
            s += abs(i - j)
        return s

    def predict(self, temp):  # predict the class of one sample; temp is a feature vector
        if not self.dataloaded:  # make sure training data has been loaded
            print("No train_data loaded")
            return
        distance_and_label = []
        if self.type == "Euler":  # choose the distance metric: Euclidean or Manhattan
            for i, j in zip(self.data, self.label):
                dist = self.Euler_dist(temp, i)
                distance_and_label.append([dist, j])
        elif self.type == "Manhattan":
            for i, j in zip(self.data, self.label):
                dist = self.Manhattan_dist(temp, i)
                distance_and_label.append([dist, j])
        else:
            print("type choice error")
            return
        # the K nearest samples with their distances and class labels
        neighborhood = sorted(distance_and_label, key=lambda x: x[0])[0:self.k]
        neighborhood_class = []
        for i in neighborhood:
            neighborhood_class.append(i[1])
        class_set = set(neighborhood_class)
        neighborhood_class_count = []
        print("In k nearest neighborhoods:")
        # count how often each class occurs among the K nearest points
        for i in class_set:
            a = neighborhood_class.count(i)
            neighborhood_class_count.append([i, a])
            print("class:", i, " count:", a)
        result = sorted(neighborhood_class_count, key=lambda x: x[1])[-1][0]
        print("result:", result)
        return result  # return the predicted class

if __name__ == '__main__':
    traindata = load_dataset()  # training data
    testdata = load_dataset("Iris_test.txt")  # test data
    # create a KNN classifier with K=20; Euclidean distance by default
    kc = KnnClassifier(20)
    kc.load_traindata(traindata)
    predict_result = []
    # predict every sample in the test set testdata
    for i, j in zip(testdata.data, testdata.label):
        predict_result.append([i, kc.predict(i), j])
    correct_count = 0
    # compare the predictions with the true labels to compute the accuracy
    for i in predict_result:
        if i[1] == i[2]:
            correct_count += 1
    ratio = float(correct_count) / len(predict_result)
    print("correct predicting ratio", ratio)

Classification results for the 11 sample points in the test set:

In k nearest neighborhoods:
class: Iris-setosa count: 20
result: Iris-setosa
In k nearest neighborhoods:
class: Iris-setosa count: 20
result: Iris-setosa
In k nearest neighborhoods:
class: Iris-setosa count: 20
result: Iris-setosa
In k nearest neighborhoods:
class: Iris-setosa count: 20
result: Iris-setosa
In k nearest neighborhoods:
class: Iris-setosa count: 20
result: Iris-setosa
In k nearest neighborhoods:
class: Iris-versicolor count: 20
result: Iris-versicolor
In k nearest neighborhoods:
class: Iris-versicolor count: 20
result: Iris-versicolor
In k nearest neighborhoods:
class: Iris-versicolor count: 20
result: Iris-versicolor
In k nearest neighborhoods:
class: Iris-versicolor count: 20
result: Iris-versicolor
In k nearest neighborhoods:
class: Iris-versicolor count: 20
result: Iris-versicolor
In k nearest neighborhoods:
class: Iris-setosa count: 18
class: Iris-versicolor count: 2
result: Iris-setosa
correct predicting ratio 0.909090909091

There are many ways to measure distance in KNN, and different methods suit different datasets; this code implements only two, Euclidean distance and Manhattan distance. The test set was extracted from the original dataset and is not very large, so the result does not fully reflect the performance of KNN; the program output above is for reference only.
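For reference, both metrics used here are special cases of the Minkowski distance, which reduces to Manhattan distance at p = 1 and Euclidean distance at p = 2. A minimal sketch (the function name `minkowski_dist` is my own):

```python
def minkowski_dist(x, y, p=2):
    """Minkowski distance between vectors x and y; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(i - j) ** p for i, j in zip(x, y)) ** (1.0 / p)

a, b = [0.0, 0.0], [3.0, 4.0]
print(minkowski_dist(a, b, p=1))  # -> 7.0 (Manhattan)
print(minkowski_dist(a, b, p=2))  # -> 5.0 (Euclidean)
```

Larger values of p weight the largest coordinate difference more heavily; as p goes to infinity the metric approaches the Chebyshev distance.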

That's all for this article. I hope it helps with everyone's learning, and I hope you will continue to support Yell Tutorial.

