This article shares working Python code for building a decision tree, for your reference. The details are as follows.
Algorithm advantages and disadvantages:
Advantages: low computational complexity, output that is easy to interpret, insensitivity to missing intermediate values, and the ability to handle irrelevant feature data
Disadvantages: prone to overfitting
Applicable data types: numerical and nominal
Algorithm idea:
1. The overall idea of decision tree construction:
A decision tree is essentially the same as an if-else structure: the result is a tree along which you keep judging and choosing, from the root down to a leaf node. The if-else tests, however, are not ones we want to set by hand; our job is to provide a method by which the computer can derive the decision tree we need. The crux of this method is how to pick the valuable features out of the many available, and how to choose them in the best order from root to leaf. With that settled, we can construct the decision tree recursively.
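To make the if-else analogy concrete, here is a minimal hand-written sketch of the tree that the code below ends up learning from its sample data (the function name classify_fish and its argument names are made up for illustration):

def classify_fish(no_surfacing, flippers):
    # Each level of the tree is one if-else test on a feature,
    # walked from the root down to a leaf holding the class label.
    if no_surfacing == 0:
        return 'no'
    if flippers == 0:
        return 'no'
    return 'yes'

print(classify_fish(1, 1))  # 'yes'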
2. Information gain
The overriding principle when splitting a data set is to make disordered data more ordered. Since this is a question of how ordered information is, it is natural to turn to information entropy, which is the measure used here (an alternative is Gini impurity). The Shannon entropy formula is H(D) = -Σ p(x_i) · log2 p(x_i), where p(x_i) is the proportion of instances belonging to class x_i.
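As a quick sanity check of the formula: the sample data set used later has two 'yes' and three 'no' labels, so its entropy works out to about 0.971 (a minimal hand computation):

from math import log

probs = [2 / 5.0, 3 / 5.0]  # proportions of 'yes' and 'no' in the sample data
H = -sum(p * log(p, 2) for p in probs)
print(H)  # ~0.9710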
Requirements the data must meet:
1 The data must be a list of lists, and every instance (row) must have the same length
2 The last column of the data, i.e. the last element of each instance, must be the class label of that instance
Functions:
calcShannonEnt(dataSet)
Calculates the Shannon entropy of the data set in two steps: first count the frequency of each class label, then apply the formula above to those frequencies.
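For instance, calling it on the sample data from the listing below should reproduce the hand-computed value:

dataSet, labels = createDataSet()
print(calcShannonEnt(dataSet))  # ~0.9710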
splitDataSet(dataSet, axis, value)
Splits the data set: gathers every instance satisfying X[axis] == value and returns them as a new set, with the axis attribute removed, since it is no longer needed after the split.
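For example, splitting the sample data on feature 0 with value 1 keeps the three matching instances and drops that column (assuming the functions from the listing below):

dataSet, labels = createDataSet()
print(splitDataSet(dataSet, 0, 1))
# [[1, 'yes'], [1, 'yes'], [0, 'no']]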
chooseBestFeatureToSplit(dataSet)
Chooses the best attribute to split on. The idea is simple: try splitting on each attribute in turn and see which split works best. A set is used here to extract the unique elements of a list, which is a very fast way to do it.
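On the sample data this should pick feature 0 ('no surfacing'), whose information gain (about 0.420) beats that of 'flippers' (about 0.171):

dataSet, labels = createDataSet()
print(chooseBestFeatureToSplit(dataSet))  # 0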
majorityCnt(classList)
Since we build the decision tree recursively by consuming attributes, we may use up the last attribute before the classification is complete. In that case, majority voting decides the class of the node.
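For example, given the leftover class labels of such a node, the vote goes to the majority class:

print(majorityCnt(['yes', 'no', 'no']))  # 'no'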
createTree(dataSet, labels)
Builds the decision tree recursively. The labels argument carries the names of the classification features, purely to make the resulting tree readable and easier to interpret.
# coding=utf-8
import operator
import time
from math import log


def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels


# Calculate the Shannon entropy of the data set
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt


# Return the instances whose value on `axis` equals `value`,
# with the axis attribute itself removed
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet


def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature


# Because the tree is built by consuming one attribute per level, the
# attributes may run out before every leaf is pure; in that case the
# class of the node is decided by majority vote
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):  # all classes identical: stop splitting
        return classList[0]
    if len(dataSet[0]) == 1:  # all features used up: fall back to majority vote
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]  # copy so the recursion does not modify the original list
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree


def main():
    data, labels = createDataSet()
    t1 = time.perf_counter()
    myTree = createTree(data, labels)
    t2 = time.perf_counter()
    print(myTree)
    print('execute for', t2 - t1)


if __name__ == '__main__':
    main()
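If everything is wired up correctly, running the script should print the learned tree as a nested dictionary (the timing value will vary by machine):

{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}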
That's all for this article. I hope it is helpful to your learning, and I hope you will continue to support Yelling Tutorial.