English | 简体中文 | 繁體中文 | Русский язык | Français | Español | Português | Deutsch | 日本語 | 한국어 | Italiano | بالعربية
Clustering is an unsupervised learning method that places similar objects into the same cluster, which is somewhat like automatic classification. The better the clustering effect, the more similar the objects within the cluster, and the greater the difference between the clusters.
1K-means clustering algorithm
K-means clustering divides the data into k clusters, each cluster is described by its centroid, that is, the center of all points in the cluster. First, randomly determine k initial points as centroids, then assign the dataset to the nearest cluster. Then update the centroid of each cluster to the average of all data points. Then divide the dataset again, until the clustering result no longer changes.
Pseudocode is
Randomly create k cluster centroids
When the cluster assignment of any point changes:
For each data point in the dataset:
For each centroid:
Calculate the distance of the dataset to the centroid
Assign the dataset to the cluster corresponding to the nearest centroid distance
For each cluster, calculate the mean of all points in the cluster and use the mean as the centroid
python implementation
import numpy as np import matplotlib.pyplot as plt def loadDataSet(fileName): dataMat = [] with open(fileName) as f: for line in f.readlines(): line = line.strip().split('\t') dataMat.append(line) dataMat = np.array(dataMat).astype(np.float)32) return dataMat def distEclud(vecA, vecB): return np.sqrt(np.sum(np.power((vecA-vecB),2))) def randCent(dataSet,k): m = np.shape(dataSet)[1] center = np.mat(np.ones((k,m))) for i in range(m): centmin = min(dataSet[:,i]) centmax = max(dataSet[:,i]) center[:,i] = centmin + (centmax - centmin) * np.random.rand(k,1) return center def kMeans(dataSet,k,distMeans = distEclud,createCent = randCent): m = np.shape(dataSet)[0] clusterAssment = np.mat(np.zeros((m,2))) centroids = createCent(dataSet,k) clusterChanged = True while clusterChanged: clusterChanged = False for i in range(m): minDist = np.inf minIndex = -1 for j in range(k): distJI = distMeans(dataSet[i,:],centroids[j,:]) if distJI < minDist: minDist = distJI minIndex = j if clusterAssment[i,0] != minIndex: clusterChanged = True clusterAssment[i,:] = minIndex,minDist**2 for cent in range(k): ptsInClust = dataSet[np.nonzero(clusterAssment[:,0].A == cent)[0]] centroids[cent,:] = np.mean(ptsInClust,axis = 0) return centroids,clusterAssment data = loadDataSet('testSet.txt') muCentroids, clusterAssing = kMeans(data,4) fig = plt.figure(0) ax = fig.add_subplot(111) ax.scatter(data[:,0],data[:,1], c = clusterAssing[:,0].A) plt.show() print(clusterAssing)
2Binary K-means algorithm
The K-means algorithm may converge to a local minimum instead of a global minimum. One measure of clustering effect is the sum of squared errors (SSE). Since squares are taken, points closer to the principle center are more emphasized. To overcome the problem that the K-means algorithm may converge to a local minimum, someone has proposed the binary K-means algorithm.
First, consider all points as one cluster, then divide the cluster into two, then select the cluster that can maximize the reduction of SSE among all clusters for division, until the specified number of clusters is reached.
Pseudocode
Consider all points as one cluster
Calculate SSE
while the number of clusters is less than k:
for each cluster:
Calculate the total error
Perform k-means clustering on the given cluster (k=2)
Calculate the total error of dividing the cluster into two
Select the cluster that minimizes the error for division operation
python implementation
import numpy as np import matplotlib.pyplot as plt def loadDataSet(fileName): dataMat = [] with open(fileName) as f: for line in f.readlines(): line = line.strip().split('\t') dataMat.append(line) dataMat = np.array(dataMat).astype(np.float)32) return dataMat def distEclud(vecA, vecB): return np.sqrt(np.sum(np.power((vecA-vecB),2))) def randCent(dataSet,k): m = np.shape(dataSet)[1] center = np.mat(np.ones((k,m))) for i in range(m): centmin = min(dataSet[:,i]) centmax = max(dataSet[:,i]) center[:,i] = centmin + (centmax - centmin) * np.random.rand(k,1) return center def kMeans(dataSet,k,distMeans = distEclud,createCent = randCent): m = np.shape(dataSet)[0] clusterAssment = np.mat(np.zeros((m,2))) centroids = createCent(dataSet,k) clusterChanged = True while clusterChanged: clusterChanged = False for i in range(m): minDist = np.inf minIndex = -1 for j in range(k): distJI = distMeans(dataSet[i,:],centroids[j,:]) if distJI < minDist: minDist = distJI minIndex = j if clusterAssment[i,0] != minIndex: clusterChanged = True clusterAssment[i,:] = minIndex,minDist**2 for cent in range(k): ptsInClust = dataSet[np.nonzero(clusterAssment[:,0].A == cent)[0]] centroids[cent,:] = np.mean(ptsInClust,axis = 0) return centroids,clusterAssment def biKmeans(dataSet,k,distMeans = distEclud): m = np.shape(dataSet)[0] clusterAssment = np.mat(np.zeros((m,2))) centroid0 = np.mean(dataSet,axis=0).tolist() centList = [centroid0] for j in range(m): clusterAssment[j,1] = distMeans(dataSet[j,:],np.mat(centroid0))**2 while (len(centList)<k): lowestSSE = np.inf for i in range(len(centList)): ptsInCurrCluster = dataSet[np.nonzero(clusterAssment[:,0].A == i)[0],:] centroidMat,splitClustAss = kMeans(ptsInCurrCluster,2,distMeans) sseSplit = np.sum(splitClustAss[:,1]) sseNotSplit = np.sum(clusterAssment[np.nonzero(clusterAssment[:,0].A != i)[0],1]) if (sseSplit + sseNotSplit) < lowestSSE: bestCentToSplit = i bestNewCents = centroidMat.copy() bestClustAss = splitClustAss.copy() lowestSSE = sseSplit + sseNotSplit print('the best cent to split is ',bestCentToSplit) # print('the len of the bestClust' bestClustAss[np.nonzero(bestClustAss[:,0].A == 1)[0],0] = len(centList) bestClustAss[np.nonzero(bestClustAss[:,0].A == 0)[0],0] = bestCentToSplit clusterAssment[np.nonzero(clusterAssment[:,0].A == bestCentToSplit)[0],:] = bestClustAss.copy() centList[bestCentToSplit] = bestNewCents[0,:].tolist()[0] centList.append(bestNewCents[1,:].tolist()[0]) return np.mat(centList),clusterAssment data = loadDataSet('testSet2.txt') muCentroids, clusterAssing = biKmeans(data,3) fig = plt.figure(0) ax = fig.add_subplot(111) ax.scatter(data[:,0],data[:,1],c = clusterAssing[:,0].A,cmap=plt.cm.Paired) ax.scatter(muCentroids[:,0],muCentroids[:,1]) plt.show() print(clusterAssing) print(muCentroids)
Code and dataset download:K-means
That's all for this article. I hope it will be helpful to everyone's learning and that everyone will support the Yelling Tutorial more.
Statement: The content of this article is from the Internet, and the copyright belongs to the original author. The content is contributed and uploaded by Internet users spontaneously. This website does not own the copyright, has not been manually edited, and does not assume any relevant legal liability. If you find any content suspected of copyright infringement, please send an email to notice#w3Please send an email to codebox.com (replace # with @ when sending an email) to report any violations, and provide relevant evidence. Once verified, this site will immediately delete the content suspected of infringement.