
Writing code for a classification decision tree in Python

Decision trees are commonly used for classification in machine learning.

Advantages: low computational complexity, easy-to-interpret output, insensitivity to missing intermediate values, and the ability to handle irrelevant features.
Disadvantages: prone to overfitting.
Applicable data types: numeric and nominal.

1.Information gain

The purpose of dividing a dataset is to make disordered data more ordered. One way to quantify disorder is information theory. The usual measure is information gain: the reduction in information entropy before and after a split. The more disordered the data, the greater its entropy, so the feature with the highest information gain is the best one to split on.
Entropy is defined as the expected value of information. The information of a symbol $x_i$ is defined as

$$l(x_i) = -\log_2 p(x_i)$$

where $p(x_i)$ is the probability of choosing that classification.
Entropy, that is, the expected value of the information over all $n$ classes, is

$$H = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i)$$
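For example, the sample dataset used in section 6 below has two 'yes' labels and three 'no' labels out of five entries, so its entropy is

$$H = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} \approx 0.971$$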
The code for calculating entropy is as follows:

import math

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}                      # counts of each class label
    for featVec in dataSet:
        currentLabel = featVec[-1]        # the class label is the last column
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = labelCounts[key] / numEntries
        shannonEnt -= prob * math.log2(prob)   # H = -sum p * log2(p)
    return shannonEnt
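
As a quick sanity check, the function should reproduce the hand calculation above (a minimal sketch, assuming the createDataSet function defined in the complete code of section 6):

dataSet, labels = createDataSet()
print(calcShannonEnt(dataSet))   # ≈ 0.971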

Once entropy can be computed, the dataset is divided by choosing the split that yields the maximum information gain.

2.Divide the dataset

Dividing the dataset means extracting all entries whose value on a given feature (axis) equals a given value, with that feature column removed from each extracted entry.

def splitDataSet(dataSet, axis, value):
    retDataset = []
    for featVec in dataSet:
        if featVec[axis] == value:
            newVec = featVec[:axis]            # everything before the split feature
            newVec.extend(featVec[axis+1:])    # everything after it
            retDataset.append(newVec)
    return retDataset
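
For example, selecting the entries whose first feature equals 1 (again assuming the sample createDataSet from section 6) should give:

dataSet, labels = createDataSet()
print(splitDataSet(dataSet, 0, 1))   # [[1, 'yes'], [1, 'yes'], [0, 'no']]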

3.Choose the best way to divide the dataset

Information gain is the reduction of entropy, that is, the reduction of disorder, achieved by a split.

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1     # the last column is the class label
    bestInfoGain = 0.0
    bestFeature = -1
    baseEntropy = calcShannonEnt(dataSet)
    for i in range(numFeatures):
        allValue = [example[i] for example in dataSet]   # list comprehension: collect column i
        allValue = set(allValue)                         # the fastest way to get unique values in a list
        newEntropy = 0.0
        for value in allValue:
            splitset = splitDataSet(dataSet, i, value)
            # weighted entropy of the subsets after splitting on feature i
            newEntropy += len(splitset) / len(dataSet) * calcShannonEnt(splitset)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
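
On the sample dataset from section 6, splitting on feature 0 ('no surfacing') should give an information gain of about 0.420, versus about 0.171 for 'flippers', so the function returns 0:

dataSet, labels = createDataSet()
print(chooseBestFeatureToSplit(dataSet))   # 0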

4.Recursively create the decision tree

The stopping conditions are: the program has used up all the attributes of the divided dataset, or all instances under a branch have the same classification.
When all attributes have been processed but the class labels are still not unique, majority voting determines the class of the leaf node.

import operator

def majorityCnt(classList):
    classCount = {}
    for value in classList:
        if value not in classCount:
            classCount[value] = 0
        classCount[value] += 1
    # sort the (label, count) pairs by count, in descending order
    classCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return classCount[0][0]
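
For example, with two 'no' votes against one 'yes':

print(majorityCnt(['yes', 'no', 'no']))   # no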

Generate the decision tree:

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    labelsCopy = labels[:]                      # copy so the caller's list is untouched
    if classList.count(classList[0]) == len(classList):
        return classList[0]                     # all instances share one class: leaf
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)           # no features left: majority vote
    bestFeature = chooseBestFeatureToSplit(dataSet)
    bestLabel = labelsCopy[bestFeature]
    myTree = {bestLabel: {}}
    featureValues = [example[bestFeature] for example in dataSet]
    featureValues = set(featureValues)
    del(labelsCopy[bestFeature])
    for value in featureValues:
        subLabels = labelsCopy[:]
        myTree[bestLabel][value] = createTree(splitDataSet(dataSet, bestFeature, value), subLabels)
    return myTree
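
On the sample dataset, the function should build the following nested dictionary (assuming createDataSet from section 6):

dataSet, labels = createDataSet()
print(createTree(dataSet, labels))
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}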

5.Test the algorithm: using decision tree classification

The same recursive approach used to build the tree is used to obtain the classification result.

def classify(inputTree, featLabels, testVec):
    currentFeat = list(inputTree.keys())[0]        # feature tested at this node
    secondTree = inputTree[currentFeat]
    featureIndex = featLabels.index(currentFeat)   # position of that feature in testVec
    classLabel = None
    for value in secondTree.keys():
        if value == testVec[featureIndex]:
            if isinstance(secondTree[value], dict):
                # internal node: keep descending
                classLabel = classify(secondTree[value], featLabels, testVec)
            else:
                # leaf node: this is the class label
                classLabel = secondTree[value]
    return classLabel
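
For example, an animal that never surfaces (feature vector [0, 1]) should be classified as 'no', using the myTree and label built in the main block of section 6:

print(classify(myTree, label, [0, 1]))   # no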

6.Complete code

import math
import operator

def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    label = ['no surfacing', 'flippers']
    return dataSet, label

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = labelCounts[key] / numEntries
        shannonEnt -= prob * math.log2(prob)
    return shannonEnt

def splitDataSet(dataSet, axis, value):
    retDataset = []
    for featVec in dataSet:
        if featVec[axis] == value:
            newVec = featVec[:axis]
            newVec.extend(featVec[axis+1:])
            retDataset.append(newVec)
    return retDataset

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1
    bestInfoGain = 0.0
    bestFeature = -1
    baseEntropy = calcShannonEnt(dataSet)
    for i in range(numFeatures):
        allValue = [example[i] for example in dataSet]
        allValue = set(allValue)
        newEntropy = 0.0
        for value in allValue:
            splitset = splitDataSet(dataSet, i, value)
            newEntropy += len(splitset) / len(dataSet) * calcShannonEnt(splitset)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

def majorityCnt(classList):
    classCount = {}
    for value in classList:
        if value not in classCount:
            classCount[value] = 0
        classCount[value] += 1
    classCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return classCount[0][0]

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    labelsCopy = labels[:]
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeature = chooseBestFeatureToSplit(dataSet)
    bestLabel = labelsCopy[bestFeature]
    myTree = {bestLabel: {}}
    featureValues = [example[bestFeature] for example in dataSet]
    featureValues = set(featureValues)
    del(labelsCopy[bestFeature])
    for value in featureValues:
        subLabels = labelsCopy[:]
        myTree[bestLabel][value] = createTree(splitDataSet(dataSet, bestFeature, value), subLabels)
    return myTree

def classify(inputTree, featLabels, testVec):
    currentFeat = list(inputTree.keys())[0]
    secondTree = inputTree[currentFeat]
    featureIndex = featLabels.index(currentFeat)
    classLabel = None
    for value in secondTree.keys():
        if value == testVec[featureIndex]:
            if isinstance(secondTree[value], dict):
                classLabel = classify(secondTree[value], featLabels, testVec)
            else:
                classLabel = secondTree[value]
    return classLabel

if __name__ == "__main__":
    dataset, label = createDataSet()
    myTree = createTree(dataset, label)
    a = [1, 1]
    print(classify(myTree, label, a))   # yes

7.Programming Tips

Difference between extend and append

 newVec.extend(featVec[axis+1:])
 retDataset.append(newVec)

extend([]) adds each element of its list argument to the list in turn.
append() adds its argument as a single item at the end of the list.
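
A quick illustration of the difference:

a = [1, 2]
a.extend([3, 4])   # a is now [1, 2, 3, 4]
b = [1, 2]
b.append([3, 4])   # b is now [1, 2, [3, 4]]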

List Comprehension

Creating a New List

allValue = [example[i] for example in dataSet]

Extract Unique Elements from List

allValue = set(allValue)
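
For example, collecting column 0 and then deduplicating it:

dataSet = [[1, 'a'], [1, 'b'], [0, 'c']]
allValue = [example[0] for example in dataSet]   # [1, 1, 0]
print(set(allValue))                             # {0, 1}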

List/Tuple Sorting, sorted() Function

classCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
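
For example, sorting (label, count) pairs by the count in descending order:

import operator
pairs = [('yes', 1), ('no', 2)]
print(sorted(pairs, key=operator.itemgetter(1), reverse=True))   # [('no', 2), ('yes', 1)]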

List Copy

labelsCopy = labels[:]
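
The slice makes a shallow copy, so modifying labelsCopy leaves the caller's labels untouched:

labels = ['no surfacing', 'flippers']
labelsCopy = labels[:]
del(labelsCopy[0])
print(labels)       # ['no surfacing', 'flippers']
print(labelsCopy)   # ['flippers']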

Code and Dataset Download: Decision Tree

That's all for this article. I hope it is helpful for everyone's learning, and I hope everyone will continue to support Yelling Tutorial.

