
Learning the decision tree classification algorithm in Python

From this chapter on, we begin studying the algorithms themselves.

First, we look at a classic and effective classification algorithm: the decision tree classification algorithm.

1. Decision Tree Algorithm

Decision trees classify samples by their attributes using a tree structure. They are the most intuitive classification algorithm and can also be used for regression. However, they have trouble with some special logical classification problems; a typical example is XOR logic, which decision trees are not good at solving.
The construction of a decision tree is not unique; unfortunately, constructing the optimal decision tree is an NP-hard problem. Therefore, how to construct a good decision tree is a focus of research.
In 1975, J. Ross Quinlan proposed introducing the concept of information entropy into decision tree construction, which became the famous ID3 algorithm. Later algorithms such as C4.5, C5.0, and CART are improvements on this approach.

Entropy measures the degree of 'disorder' or 'chaos' in a system. When you first encounter this concept, it may be a bit confusing. To quickly understand how attributes are split by information entropy gain, you can refer to this article: Python Machine Learning Decision Tree Algorithm.

If you still do not understand, please see the following example.

Suppose we want to build a decision tree that automatically selects good apples. For simplicity, I let it learn from only the following 4 samples:
Sample  Red  Large  Good Apple
0       1    1      1
1       1    0      1
2       0    1      0
3       0    0      0

The samples have 2 attributes: A0 indicates whether the apple is red, and A1 indicates whether the apple is large.

Then, the information entropy of this sample set before splitting (using log base 2) is S = -(1/2 * log(1/2) + 1/2 * log(1/2)) = 1.

An information entropy of 1 means the current state is maximally chaotic and disordered.

In this example, there are only 2 attributes, so naturally there can only be 2 possible trees, as shown in the figure below:

It is obvious that the decision tree on the left, which splits on A0 (red), is superior to the one on the right, which splits on A1 (large).
Of course, this is an intuitive understanding. For a quantitative analysis, we need to calculate the information entropy gain of each possible split.
First, choose A0 as the split attribute. The information entropy of each child node is calculated as follows:
The leaf node containing samples 0 and 1 has 2 positive examples and 0 negative examples. Its information entropy is: e1 = -(2/2 * log(2/2) + 0/2 * log(0/2)) = 0 (taking 0 * log(0) = 0).
The leaf node containing samples 2 and 3 has 0 positive examples and 2 negative examples. Its information entropy is: e2 = -(0/2 * log(0/2) + 2/2 * log(2/2)) = 0.

Therefore, the information entropy after splitting on A0 is the weighted sum of the entropies of its child nodes: E = e1*2/4 + e2*2/4 = 0.
The information entropy gain of splitting on A0 is G(S, A0) = S - E = 1 - 0 = 1.

In fact, every leaf node of a decision tree contains samples that all belong to the same category, so its information entropy is necessarily 0.

Similarly, if we first split on A1, the information entropy of each child node is calculated as follows:
The child node containing samples 0 and 2 has 1 positive example and 1 negative example. Its information entropy is: e1 = -(1/2 * log(1/2) + 1/2 * log(1/2)) = 1.
The child node containing samples 1 and 3 has 1 positive example and 1 negative example. Its information entropy is: e2 = -(1/2 * log(1/2) + 1/2 * log(1/2)) = 1.
Therefore, the information entropy after splitting on A1 is the weighted sum of the entropies of its child nodes: E = e1*2/4 + e2*2/4 = 1. In other words, splitting on A1 is no better than not splitting at all!
The information entropy gain of splitting on A1 is G(S, A1) = S - E = 1 - 1 = 0.
Therefore, before each split, we only need to find the split with the maximum information entropy gain.
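To make the arithmetic above concrete, here is a minimal sketch (my addition, not part of the original article; the helper names entropy and information_gain are mine) that recomputes these values for the 4 apple samples:

# A minimal sketch (not from the original article): recomputing the entropy and
# information gain values for the 4 apple samples above.
from math import log2

def entropy(labels):
    '''Shannon entropy of a list of class labels, in bits (0 * log(0) taken as 0).'''
    total = len(labels)
    return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                for c in set(labels))

def information_gain(samples, labels, attribute):
    '''Entropy of the parent set minus the weighted entropy of the child nodes.'''
    children = 0.0
    for value in set(s[attribute] for s in samples):
        subset = [l for s, l in zip(samples, labels) if s[attribute] == value]
        children += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - children

# Columns: A0 = red, A1 = large; labels: 1 = good apple, 0 = bad apple.
samples = [(1, 1), (1, 0), (0, 1), (0, 0)]
labels = [1, 1, 0, 0]

print(entropy(labels))                       # S = 1.0
print(information_gain(samples, labels, 0))  # G(S, A0) = 1.0
print(information_gain(samples, labels, 1))  # G(S, A1) = 0.0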

2. Dataset

For the sake of explanation and understanding, we use the following extremely simple test dataset:
1.5 50 thin 
1.5 60 fat 
1.6 40 thin 
1.6 60 fat 
1.7 60 thin 
1.7 80 fat 
1.8 60 thin 
1.8 90 fat 
1.9 70 thin 
1.9 80 fat 

This dataset has a total of 10 samples. Each sample has 2 attributes, height and weight, and the third column is the category label, indicating 'fat' or 'thin'. The data is stored in 1.txt.

Our task is to train a decision tree classifier that can classify whether a person is fat or thin based on height and weight.
(The data reflect the author's subjective judgment and follow a certain logic, but please ignore their plausibility.)
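If you want to reproduce the experiment, one way (my addition, not part of the original article) is to write the table above into data/1.txt with a short helper:

# A small helper (not from the original article) that writes the 10 samples
# above into data/1.txt so that the training script below can read them.
import os

SAMPLES = """\
1.5 50 thin
1.5 60 fat
1.6 40 thin
1.6 60 fat
1.7 60 thin
1.7 80 fat
1.8 60 thin
1.8 90 fat
1.9 70 thin
1.9 80 fat
"""

os.makedirs("data", exist_ok=True)
with open(os.path.join("data", "1.txt"), "w") as f:
    f.write(SAMPLES)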

Decision trees handle binary 'yes'/'no' logic quite naturally. But what do we do when, as in this dataset, height and weight are continuous values?

Although it is a bit troublesome, it is not a problem. We just need to find the midpoints that divide these continuous values into different intervals, and the task becomes a binary logic problem again.
In this example, the decision tree's job is to find some critical values for height and weight, split the samples into two groups according to whether they are greater or less than these critical values, and build the tree from the top down.
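To see what 'finding critical values' means in practice, here is a rough sketch of mine (not the actual scikit-learn algorithm, and limited to a single split on one feature): scan the midpoints between sorted values as candidate thresholds and keep the one with the largest information entropy gain.

# A rough sketch (not the actual scikit-learn algorithm): choose the best threshold
# for one continuous feature by scanning the midpoints between sorted values and
# keeping the split with the largest information entropy gain.
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                for c in set(labels))

def best_threshold(values, labels):
    parent = entropy(labels)
    best_gain, best_t = 0.0, None
    points = sorted(set(values))
    for lo, hi in zip(points, points[1:]):
        t = (lo + hi) / 2.0                      # candidate midpoint
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if parent - e > best_gain:
            best_gain, best_t = parent - e, t
    return best_t, best_gain

# Weight column and labels of the 10 samples above.
weights = [50, 60, 40, 60, 60, 80, 60, 90, 70, 80]
labels = ['thin', 'fat', 'thin', 'fat', 'thin', 'fat', 'thin', 'fat', 'thin', 'fat']
print(best_threshold(weights, labels))  # best single split on weight over all 10 samples (75.0)

Note that the threshold found here differs from the 55.0 that appears in tree.dot later in the article, because the article's tree is trained on a random 8-sample subset rather than on all 10 samples.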

Implementing it using Python's machine learning libraries is quite simple and elegant.

3. Python Implementation

The Python implementation is as follows:

# -*- coding: utf-8 -*-
import numpy as np
import scipy as sp
from sklearn import tree
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import classification_report
# Note: in recent scikit-learn versions train_test_split has moved to
# sklearn.model_selection; the original article used sklearn.cross_validation.
from sklearn.cross_validation import train_test_split

''' Data input '''
data = []
labels = []
with open("data\\1.txt") as ifile:
    for line in ifile:
        tokens = line.strip().split(' ')
        data.append([float(tk) for tk in tokens[:-1]])
        labels.append(tokens[-1])
x = np.array(data)
labels = np.array(labels)
y = np.zeros(labels.shape)

''' Convert the labels to 0/1 '''
y[labels == 'fat'] = 1

''' Split into training data and test data '''
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

''' Train the decision tree, using information entropy as the splitting criterion '''
clf = tree.DecisionTreeClassifier(criterion='entropy')
print(clf)
clf.fit(x_train, y_train)

''' Write the decision tree structure to a file '''
with open("tree.dot", 'w') as f:
    f = tree.export_graphviz(clf, out_file=f)

''' The feature importances reflect the influence of each feature; the larger the value, the bigger the feature's role in classification '''
print(clf.feature_importances_)

''' Print the results on the training data '''
answer = clf.predict(x_train)
print(x_train)
print(answer)
print(y_train)
print(np.mean(answer == y_train))

''' Precision and recall '''
precision, recall, thresholds = precision_recall_curve(y_train, clf.predict(x_train))
answer = clf.predict_proba(x)[:, 1]
print(classification_report(y, answer, target_names=['thin', 'fat']))

The output results are similar to the following:
[ 0.2488562  0.7511438]
array([[  1.6,  60. ],
       [  1.7,  60. ],
       [  1.9,  80. ],
       [  1.5,  50. ],
       [  1.6,  40. ],
       [  1.7,  80. ],
       [  1.8,  90. ],
       [  1.5,  60. ]])
array([ 1.,  0.,  1.,  0.,  0.,  1.,  1.,  1.])
array([ 1.,  0.,  1.,  0.,  0.,  1.,  1.,  1.])
1.0
             precision    recall  f1-score   support

       thin       0.83      1.00      0.91         5
        fat       1.00      0.80      0.89         5

avg / total       0.92      0.90      0.90        10

array([ 0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  0.])
array([ 0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.])

It can be seen that the accuracy on the data used for training is 100%. However, when all the data are tested at the end, 1 test sample is classified incorrectly.
This shows that the decision tree in this example has absorbed the rules of the training set very well, but its predictive power is slightly weaker.
There are 3 points worth noting here, which will also come up in machine learning work later on.

1. Splitting training data and test data.

This is done to facilitate cross-validation. Cross-validation is used to fully test the stability of the classifier.
In the code, test_size=0.2 means that 20% of the data is randomly selected as test data, while the remaining 80% is used to train the decision tree.
That is, 8 of the 10 samples are randomly chosen for training.
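As a quick illustration (my addition, not from the article), you can check the split sizes that test_size=0.2 produces on 10 samples:

# A quick illustration (not from the article): with test_size=0.2, 10 samples are
# split into 8 training samples and 2 test samples, chosen at random each run.
# Recent scikit-learn versions import train_test_split from sklearn.model_selection.
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(10).reshape(10, 1)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
print(len(x_train), len(x_test))  # 8 2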

2. Different features have different degrees of influence.

The influence of different features of a sample on the classification can differ greatly. It is also important, after training, to look at how much each feature contributed to the classification.
In this example, the importance of height is about 0.25 and that of weight is about 0.75. Weight clearly matters much more than height, which is quite logical for judging whether someone is fat.
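To read the importances more comfortably, a small addition of my own pairs them with the feature names (this assumes clf is the classifier fitted in the script above):

# A small addition (not from the article): print each feature's importance next to
# its name. Assumes `clf` is the DecisionTreeClassifier fitted in the script above.
for name, importance in zip(['height', 'weight'], clf.feature_importances_):
    print('%s: %.2f' % (name, importance))
# Example output (the exact values vary with the random train/test split):
# height: 0.25
# weight: 0.75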

3. Precision and recall.

These 2 metrics are important measures of how good a classification is. For example, at the end of the code, all 10 samples are tested:
Predicted results: array([ 0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  0.])
True results:      array([ 0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.])
The precision for thin is 0.83: the classifier labeled 6 samples as thin, of which 5 really are thin, so the precision is 5/6 = 0.83.
The recall for thin is 1.00: there are 5 thin samples in total and the classifier caught them all (although it also labeled one fat sample as thin!), so the recall is 5/5 = 1.
The precision for fat is 1.00, which needs no further explanation.
The recall for fat is 0.80: there are 5 fat samples in total, but the classifier only identified 4 of them (it labeled one fat sample as thin!), so the recall is 4/5 = 0.80.
In many cases, especially when the data are hard to classify, precision and recall are often in conflict. You may need to find the best trade-off according to your needs.
For example, in this case, you must decide whether your goal is to make sure the fat people you find are really fat (precision), or to find as many fat people as possible (recall).
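The same numbers can be recomputed directly from the two arrays above. Here is a small sketch of mine using sklearn.metrics, where pos_label selects which class counts as 'positive' (0 = thin, 1 = fat):

# A small sketch (not from the article): recomputing precision and recall for
# 'thin' (label 0) and 'fat' (label 1) from the predicted and true labels above.
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
y_pred = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 0])

print(precision_score(y_true, y_pred, pos_label=0))  # thin: 5/6 = 0.83
print(recall_score(y_true, y_pred, pos_label=0))     # thin: 5/5 = 1.00
print(precision_score(y_true, y_pred, pos_label=1))  # fat:  4/4 = 1.00
print(recall_score(y_true, y_pred, pos_label=1))     # fat:  4/5 = 0.80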

The code also writes the structure of the decision tree to the file tree.dot. Open this file and it is easy to draw the decision tree by hand; it also contains more information about the tree's splits.
The tree.dot for this article is as follows:

digraph Tree {
0 [label="X[1] <= 55.0000\nentropy = 0.954434002925\nsamples = 8", shape="box"] ;
1 [label="entropy = 0.0000\nsamples = 2\nvalue = [ 2.  0.]", shape="box"] ;
0 -> 1 ;
2 [label="X[1] <= 70.0000\nentropy = 0.650022421648\nsamples = 6", shape="box"] ;
0 -> 2 ;
3 [label="X[0] <= 1.6500\nentropy = 0.918295834054\nsamples = 3", shape="box"] ;
2 -> 3 ;
4 [label="entropy = 0.0000\nsamples = 2\nvalue = [ 0.  2.]", shape="box"] ;
3 -> 4 ;
5 [label="entropy = 0.0000\nsamples = 1\nvalue = [ 1.  0.]", shape="box"] ;
3 -> 5 ;
6 [label="entropy = 0.0000\nsamples = 3\nvalue = [ 0.  3.]", shape="box"] ;
2 -> 6 ;
}
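If Graphviz is installed, tree.dot can also be rendered to an image. One way (my addition, not from the article) is to call the dot command-line tool from Python:

# One way (not from the article) to render tree.dot to an image, assuming the
# Graphviz "dot" command-line tool is installed and available on the PATH.
import subprocess

subprocess.run(["dot", "-Tpng", "tree.dot", "-o", "tree.png"], check=True)
print("decision tree image written to tree.png")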

Based on this information, the decision tree should look like this:

That's all for this article. I hope it is helpful to everyone's learning.

