Decision trees are commonly used for classification in machine learning.
Advantages: low computational complexity, easy-to-interpret output, robust to missing values, and able to handle irrelevant features.
Disadvantages: prone to overfitting.
Applicable data types: numeric and nominal.
1. Information gain
The purpose of dividing the dataset is to make unordered data more ordered. One way to organize chaotic, unordered data is to measure it with information theory. The usual measure is information gain: the reduction in information entropy before and after a split. The more disordered a dataset is, the higher its entropy, so the feature that produces the highest information gain is the best one to split on.
Entropy is defined as the expected value of information. The information of a symbol $x_i$ is defined as

$l(x_i) = -\log_2 p(x_i)$

where $p(x_i)$ is the probability of choosing that classification.

Entropy, that is, the expected value of the information over all $n$ classes, is:

$H = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$
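For instance, the sample dataset used later in this article has five samples, two labeled 'yes' and three labeled 'no', so its entropy is

$H = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} \approx 0.971$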
The code for calculating entropy is as follows:
import math

def calcShannonEnt(dataSet):
    # Shannon entropy of the class labels in dataSet
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]          # class label is the last column
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = labelCounts[key] / numEntries
        shannonEnt -= prob * math.log(prob, 2)   # accumulate -p * log2(p)
    return shannonEnt
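As a quick check, running it on the sample dataset from the complete code in section 6 (two 'yes' and three 'no' labels) reproduces the entropy computed above:

dataSet, label = createDataSet()
print(calcShannonEnt(dataSet))   # about 0.9710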
The dataset can thus be split according to information entropy: the split that yields the maximum information gain is the best way to divide the dataset.
2. Divide the dataset
Dividing the dataset means extracting all the elements that satisfy a given condition.
def splitDataSet(dataSet, axis, value):
    # Return the samples whose feature at position `axis` equals `value`,
    # with that feature removed from each returned vector
    retDataset = []
    for featVec in dataSet:
        if featVec[axis] == value:
            newVec = featVec[:axis]
            newVec.extend(featVec[axis+1:])
            retDataset.append(newVec)
    return retDataset
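For example, splitting the sample dataset on feature 0 with value 1 keeps the three matching samples and strips that feature from each of them:

dataSet, label = createDataSet()
print(splitDataSet(dataSet, 0, 1))   # [[1, 'yes'], [1, 'yes'], [0, 'no']]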
3. Choose the best way to divide the dataset
Information gain is the reduction in entropy, that is, the reduction in information disorder.
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1        # last column is the class label
    bestInfoGain = 0.0
    bestFeature = -1
    baseEntropy = calcShannonEnt(dataSet)
    for i in range(numFeatures):
        allValue = [example[i] for example in dataSet]   # list comprehension: create a new list
        allValue = set(allValue)             # the fastest way to get the unique values in a list
        newEntropy = 0.0
        for value in allValue:
            splitset = splitDataSet(dataSet, i, value)
            # weighted sum of the entropy of each subset
            newEntropy += len(splitset) / len(dataSet) * calcShannonEnt(splitset)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
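On the sample dataset, feature 0 ('no surfacing') gives an information gain of about 0.420 versus about 0.171 for feature 1 ('flippers'), so the function returns 0:

dataSet, label = createDataSet()
print(chooseBestFeatureToSplit(dataSet))   # 0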
4. Recursively create the decision tree
The recursion ends when the program has traversed all of the attributes used for splitting, or when all instances under a branch share the same classification.
When all attributes have been processed but the class labels are still not unique, a majority vote determines the class of the leaf node.
import operator

def majorityCnt(classList):
    # Tally each class label and return the most frequent one
    classCount = {}
    for value in classList:
        if value not in classCount:
            classCount[value] = 0
        classCount[value] += 1
    classCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return classCount[0][0]
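For example, the most frequent label wins:

print(majorityCnt(['yes', 'no', 'no']))   # 'no'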
Generate the decision tree:
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    labelsCopy = labels[:]                  # copy so the caller's list is not modified
    # Stop if every sample in this branch has the same class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # Stop if all features are used up; fall back to majority vote
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeature = chooseBestFeatureToSplit(dataSet)
    bestLabel = labelsCopy[bestFeature]
    myTree = {bestLabel: {}}
    featureValues = [example[bestFeature] for example in dataSet]
    featureValues = set(featureValues)
    del(labelsCopy[bestFeature])
    for value in featureValues:
        subLabels = labelsCopy[:]
        myTree[bestLabel][value] = createTree(splitDataSet(dataSet, bestFeature, value), subLabels)
    return myTree
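On the sample dataset this produces a nested dictionary in which each inner dict maps a feature value to either a subtree or a class label:

dataSet, label = createDataSet()
myTree = createTree(dataSet, label)
print(myTree)   # {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}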
5. Test the algorithm: using the decision tree to classify
The same recursive approach is used to obtain the classification result.
def classify(inputTree, featLabels, testVec):
    currentFeat = list(inputTree.keys())[0]      # feature name at this node
    secondTree = inputTree[currentFeat]
    try:
        featureIndex = featLabels.index(currentFeat)
    except ValueError:
        print('feature %s is not in featLabels' % currentFeat)
        return None
    for value in secondTree.keys():
        if value == testVec[featureIndex]:
            if type(secondTree[value]).__name__ == 'dict':
                # internal node: keep descending
                return classify(secondTree[value], featLabels, testVec)
            else:
                # leaf node: this is the class label
                return secondTree[value]
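Continuing from the createTree example above, classifying a test vector walks the tree from the root:

print(classify(myTree, label, [1, 0]))   # 'no'
print(classify(myTree, label, [1, 1]))   # 'yes'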
6. Complete code
import math
import operator

def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    label = ['no surfacing', 'flippers']
    return dataSet, label

def calcShannonEnt(dataSet):
    # Shannon entropy of the class labels in dataSet
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = labelCounts[key] / numEntries
        shannonEnt -= prob * math.log(prob, 2)
    return shannonEnt

def splitDataSet(dataSet, axis, value):
    # Samples whose feature at `axis` equals `value`, with that feature removed
    retDataset = []
    for featVec in dataSet:
        if featVec[axis] == value:
            newVec = featVec[:axis]
            newVec.extend(featVec[axis+1:])
            retDataset.append(newVec)
    return retDataset

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1
    bestInfoGain = 0.0
    bestFeature = -1
    baseEntropy = calcShannonEnt(dataSet)
    for i in range(numFeatures):
        allValue = [example[i] for example in dataSet]
        allValue = set(allValue)
        newEntropy = 0.0
        for value in allValue:
            splitset = splitDataSet(dataSet, i, value)
            newEntropy += len(splitset) / len(dataSet) * calcShannonEnt(splitset)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

def majorityCnt(classList):
    classCount = {}
    for value in classList:
        if value not in classCount:
            classCount[value] = 0
        classCount[value] += 1
    classCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return classCount[0][0]

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    labelsCopy = labels[:]
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeature = chooseBestFeatureToSplit(dataSet)
    bestLabel = labelsCopy[bestFeature]
    myTree = {bestLabel: {}}
    featureValues = [example[bestFeature] for example in dataSet]
    featureValues = set(featureValues)
    del(labelsCopy[bestFeature])
    for value in featureValues:
        subLabels = labelsCopy[:]
        myTree[bestLabel][value] = createTree(splitDataSet(dataSet, bestFeature, value), subLabels)
    return myTree

def classify(inputTree, featLabels, testVec):
    currentFeat = list(inputTree.keys())[0]
    secondTree = inputTree[currentFeat]
    try:
        featureIndex = featLabels.index(currentFeat)
    except ValueError:
        print('feature %s is not in featLabels' % currentFeat)
        return None
    for value in secondTree.keys():
        if value == testVec[featureIndex]:
            if type(secondTree[value]).__name__ == 'dict':
                return classify(secondTree[value], featLabels, testVec)
            else:
                return secondTree[value]

if __name__ == "__main__":
    dataset, label = createDataSet()
    myTree = createTree(dataset, label)
    a = [1, 1]
    print(classify(myTree, label, a))   # 'yes'
7. Programming Tips
Difference between extend and append
newVec.extend(featVec[axis+1:])
retDataset.append(newVec)
extend(list) appends each element of its list argument to the list one by one.
append(x) adds its argument as a single item at the end of the list.
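A minimal sketch of the difference (the variable names here are only illustrative):

a = [1, 2]
a.extend([3, 4])   # a is now [1, 2, 3, 4]
b = [1, 2]
b.append([3, 4])   # b is now [1, 2, [3, 4]]: the whole list becomes one item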
List Comprehension
Creating a New List
allValue = [example[i] for example in dataSet]
Extract Unique Elements from List
allValue = set(allValue)
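For instance, applied to the first feature column of the sample data:

allValue = [1, 1, 1, 0, 0]
allValue = set(allValue)   # {0, 1}: duplicates removed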
List/Tuple Sorting, sorted() Function
classCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
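For example, sorting a tally dictionary by count in descending order (the counts here are illustrative):

import operator
counts = {'yes': 2, 'no': 3}
ranked = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)
print(ranked)   # [('no', 3), ('yes', 2)]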
List Copy
labelsCopy = labels[:]
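A slice copy is independent of the original, which is why createTree can delete from labelsCopy without touching the caller's list:

labels = ['no surfacing', 'flippers']
labelsCopy = labels[:]
del labelsCopy[0]
print(labels)   # ['no surfacing', 'flippers'], unchanged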