English | 简体中文 | 繁體中文 | Русский язык | Français | Español | Português | Deutsch | 日本語 | 한국어 | Italiano | بالعربية

Principles and methods of implementing random forest in Python

Introduction

To obtain the main features of the data through random forest

1

Random forest is a highly flexible machine learning method with broad application prospects, ranging from marketing to healthcare insurance. It can be used for modeling marketing simulations, statistics on customer sources, retention, and churn. It can also be used to predict the risk of diseases and the susceptibility of patients.

According to the generation method of individual learners, the current ensemble learning methods can be roughly divided into two major categories: the serialized methods with strong dependency between individual learners, which must be generated sequentially, and the parallelized methods with no strong dependency between individual learners, which can be generated simultaneously;

The former representative is Boosting, and the latter representatives are Bagging and 'Random Forest' (Random
Forest)

Random forest, based on the Bagging integration of decision trees as base learners, further introduces random attribute selection (i.e., random feature selection) in the training process of decision trees.

In simple terms, random forest is the integration of decision trees, but there are two differences:

(2) Feature selection difference: The n classification features of each decision tree are randomly selected from all features (n is a parameter that we need to adjust ourselves).
Random forest, simply put, for example, predicting salary, is to build multiple decision trees job, age, house, and then according to the various features of the quantity to be predicted (teacher,39, suburb) in the corresponding decision tree target value probability (salary<5000, salary>=5000), thus, determining the probability of the predicted quantity (such as, predicting P(salary<5000) = 0.3)

Random forest is capable of both regression and classification. It has the characteristics of handling big data and is very helpful for estimating or variable is an important basic data modeling.

Parameter description:

The two most important parameters are 'n_estimators' and 'max_features'.

'n_estimators': Represents the number of trees in the forest. In theory, the more, the better. However, this is accompanied by an increase in computation time. But it is not necessarily better to get larger, the best prediction effect will appear at a reasonable number of trees.

'max_features': Randomly select a subset of feature sets and use them to split nodes. The fewer the number of subsets, the faster the variance will decrease, but at the same time, the bias will increase faster. According to good practical experience. If it is a regression problem, then:

'max_features' equals 'n_features', and if it is a classification problem, 'max_features' equals the square root of 'n_features'.

If you want to get better results, you must set 'max_depth' to None, and 'min_sample_split'=1.
Also remember to perform cross_validated (cross-validation), in addition to remembering that in the random forest, 'bootstrap' should be set to True. But in the extra-In the 'trees' parameter, set 'bootstrap' to False.

2Random forest implementation in Python

2.1Demo1

Implement the basic functions of random forest

#Random Forest
from sklearn.tree import DecisionTreeRegressor 
from sklearn.ensemble import RandomForestRegressor 
import numpy as np 
from sklearn.datasets import load_iris 
iris=load_iris() 
#print iris#iris has 4 attributes: sepal width, sepal length, petal width, petal length; the label is the type of flower: setosa, versicolour, virginica 
print(iris['target'].shape)
rf=RandomForestRegressor()# Here, default parameter settings are used 
rf.fit(iris.data[:150], iris.target[:150])# Train the model 
# Randomly select two samples with different predictions 
instance=iris.data[[100,109] 
print(instance)
rf.predict(instance[[0]])
print('instance 0 prediction;', rf.predict(instance[[0]]))
print('instance 1 prediction; ',rf.predict(instance[[1]))
print(iris.target[100], iris.target[109)] 

Running results

(150,)
[[ 6.3  3.3  6.   2.5]
 [ 7.2  3.6  6.1  2.5]
instance 0 prediction; [ 2]
instance 1 prediction; [ 2]
2 2

2.2 Demo2

3Comparison of methods

# random forest test
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)
clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y)
print(scores.mean())    
clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y)
print(scores.mean())    
clf = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y)
print(scores.mean())

Run results:}

0.979408793821
0.999607843137
0.999898989899

2.3 Demo3-Implementation of feature selection

#Random Forest2
from sklearn.tree import DecisionTreeRegressor 
from sklearn.ensemble import RandomForestRegressor 
import numpy as np 
from sklearn.datasets import load_iris 
iris=load_iris() 
from sklearn.model_selection import cross_val_score, ShuffleSplit 
X = iris["data"] 
Y = iris["target"] 
names = iris["feature_names"] 
rf = RandomForestRegressor() 
scores = [] 
for i in range(X.shape[1]) 
 score = cross_val_score(rf, X[:, i:i+1], Y, scoring="r2", 
    cv=ShuffleSplit(len(X), 3, .3)) 
 scores.append((round(np.mean(score), 3), names[i])) 
print(sorted(scores, reverse=True))

Run results:}

[(0.89300000000000002, 'petal width (cm)'), (0.82099999999999995, 'petal length
(cm)'), (0.13, 'sepal length (cm)'), (-0.79100000000000004, 'sepal width (cm)')]

2.4 demo4-Random Forest

I wanted to use the following code to build a random random forest decision tree, but the problem encountered is that the program keeps running and cannot respond, and it needs to be debugged.

#Random Forest4
#coding:utf-8 
import csv 
from random import seed 
from random import randrange 
from math import sqrt 
def loadCSV(filename):#load data, store each line into a list 
 dataSet = [] 
 with open(filename, 'r') as file: 
 csvReader = csv.reader(file) 
 for line in csvReader: 
  dataSet.append(line) 
 return dataSet 
# Convert all columns except the label column to float type 
def column_to_float(dataSet): 
 featLen = len(dataSet[0]) - 1 
 for data in dataSet: 
 for column in range(featLen): 
  data[column] = float(data[column].strip()) 
# Split the data set randomly into N pieces for convenience of cross-validation, one piece is the test set, and the other four pieces are training sets 
def spiltDataSet(dataSet, n_folds): 
 fold_size = int(len(dataSet)) / n_folds) 
 dataSet_copy = list(dataSet) 
 dataSet_spilt = [] 
 for i in range(n_folds): 
 fold = [] 
 while len(fold) < fold_size: # Here we cannot use if, because if only works at the first judgment, while executes the loop until the condition is not met 
  index = randrange(len(dataSet_copy)) 
  fold.append(dataSet_copy.pop(index)) # pop() function is used to remove an element from a list (by default the last element) and return its value. 
 dataSet_spilt.append(fold) 
 return dataSet_spilt 
# Construct data subset 
def get_subsample(dataSet, ratio): 
 subdataSet = [] 
 lenSubdata = round(len(dataSet)) * ratio)#return a floating-point number 
 while len(subdataSet) < lenSubdata: 
 index = randrange(len(dataSet)) - 1) 
 subdataSet.append(dataSet[index]) 
 # print len(subdataSet) 
 return subdataSet 
# Split data set 
def data_spilt(dataSet, index, value): 
 left = [] 
 right = [] 
 for row in dataSet: 
 if row[index] < value: 
  left.append(row) 
 else: 
  right.append(row) 
 return left, right 
# Calculate the split cost 
def spilt_loss(left, right, class_values): 
 loss = 0.0 
 for class_value in class_values: 
 left_size = len(left) 
 if left_size != 0: # Prevent division by zero 
  prop = [row[-1] for row in left].count(class_value) / float(left_size) 
  loss += (prop * (1.0 - prop)) 
 right_size = len(right) 
 if right_size != 0: 
  prop = [row[-1] for row in right].count(class_value) / float(right_size) 
  loss += (prop * (1.0 - prop)) 
 return loss 
# Select any n features and find the best feature for splitting among these n features 
def get_best_spilt(dataSet, n_features): 
 features = [] 
 class_values = list(set(row[-1for row in dataSet)) 
 b_index, b_value, b_loss, b_left, b_right = 999, 999, 999, None, None 
 while len(features) < n_features: 
 index = randrange(len(dataSet[0])) - 1) 
 if index not in features: 
  features.append(index) 
 # print 'features:', features 
 for index in features: # Find the most suitable index for the node (minimum loss) 
 for row in dataSet: 
  left, right = data_spilt(dataSet, index, row[index]) # Split into left and right branches at this node 
  loss = spilt_loss(left, right, class_values) 
  if loss < b_loss: # Find the minimum split cost 
  b_index, b_value, b_loss, b_left, b_right = index, row[index], loss, left, right 
 # print b_loss 
 # print type(b_index) 
 return {'index': b_index, 'value': b_value, 'left': b_left, 'right': b_right} 
# Determine the output label 
def decide_label(data): 
 output = [row[-1] for row in data] 
 return max(set(output), key=output.count) 
# Subdivision, the process of continuously building leaf nodes 
def sub_spilt(root, n_features, max_depth, min_size, depth): 
 left = root['left'] 
 # print left 
 right = root['right'] 
 del (root['left']) 
 del (root['right']) 
 # print depth 
 if not left or not right: 
 root['left'] = root['right'] = decide_label(left + right) 
 # print 'testing' 
 return 
 if depth > max_depth: 
 root['left'] = decide_label(left) 
 root['right'] = decide_label(right) 
 return 
 if len(left) < min_size: 
 root['left'] = decide_label(left) 
 else: 
 root['left'] = get_best_spilt(left, n_features) 
 # print 'testing_left' 
 sub_spilt(root['left'], n_features, max_depth, min_size, depth + 1) 
 if len(right) < min_size: 
 root['right'] = decide_label(right) 
 else: 
 root['right'] = get_best_spilt(right, n_features) 
 # print 'testing_right' 
 sub_spilt(root['right'], n_features, max_depth, min_size, depth + 1) 
 # Construct a decision tree 
def build_tree(dataSet, n_features, max_depth, min_size): 
 root = get_best_spilt(dataSet, n_features) 
 sub_spilt(root, n_features, max_depth, min_size, 1) 
 return root 
# 预测测试集结果 
def predict(tree, row): 
 predictions = [] 
 if row[tree['index']] < tree['value']: 
 if isinstance(tree['left'], dict): 
  return predict(tree['left'], row) 
 else: 
  return tree['left'] 
 else: 
 if isinstance(tree['right'], dict): 
  return predict(tree['right'], row) 
 else: 
  return tree['right'] 
  # predictions=set(predictions) 
def bagging_predict(trees, row): 
 predictions = [predict(tree, row) for tree in trees] 
 return max(set(predictions), key=predictions.count) 
# 创建随机森林 
def random_forest(train, test, ratio, n_feature, max_depth, min_size, n_trees): 
 trees = [] 
 for i in range(n_trees): 
 train = get_subsample(train, ratio)#从切割的数据集中选取子集 
 tree = build_tree(train, n_features, max_depth, min_size) 
 # print 'tree %d: '%i,tree 
 trees.append(tree) 
 # predict_values = [predict(trees,row) for row in test] 
 predict_values = [bagging_predict(trees, row) for row in test] 
 return predict_values 
# Calculate accuracy 
def accuracy(predict_values, actual): 
 correct = 0 
 for i in range(len(actual)): 
 if actual[i] == predict_values[i]: 
  correct += 1 
 return correct / float(len(actual)) 
if __name__ == '__main__': 
 seed(1) 
 dataSet = loadCSV(r'G:\0研究生\tianchiCompetition\训练小样本2.csv') 
 column_to_float(dataSet) 
 n_folds = 5 
 max_depth = 15 
 min_size = 1 
 ratio = 1.0 
 # n_features=sqrt(len(dataSet)-1) 
 n_features = 15 
 n_trees = 10 
 folds = spiltDataSet(dataSet, n_folds)# Initially, split the dataset 
 scores = [] 
 for fold in folds: 
 train_set = folds[ 
   :] # Here, it cannot be simply used as train_set=folds, because this is a reference. When the value of train_set changes, the value of folds will also change. Therefore, a copy form should be used. (L[:]) can copy sequences, D.copy() can copy dictionaries, and list can generate copies list(L) 
 train_set.remove(fold)# Select the training set 
 # print len(folds) 
 train_set = sum(train_set, []) # Combine multiple fold lists into a single train_set list 
 # print len(train_set) 
 test_set = [] 
 for row in fold: 
  row_copy = list(row) 
  row_copy[-1] = None 
  test_set.append(row_copy) 
  # for row in test_set: 
  # print row[-1] 
 actual = [row[-1for row in fold] 
 predict_values = random_forest(train_set, test_set, ratio, n_features, max_depth, min_size, n_trees) 
 accur = accuracy(predict_values, actual) 
 scores.append(accur) 
 print ('Trees is %d' % n_trees) 
 print ('scores:%s' % scores) 
 print ('mean score:%s' % (sum(scores) / float(len(scores))) 

2.5 Random Forest classification of sonic data

# CART on the Bank Note dataset
from random import seed
from random import randrange
from csv import reader
# Load a CSV file
def load_csv(filename):
 file = open(filename, "r")
 lines = reader(file)
 dataset = list(lines)
 return dataset
# Convert string column to float
def str_column_to_float(dataset, column):
 for row in dataset:
 row[column] = float(row[column].strip())
# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
 dataset_split = list()
 dataset_copy = list(dataset)
 fold_size = int(len(dataset) / n_folds)
 for i in range(n_folds):
 fold = list()
 while len(fold) < fold_size:
  index = randrange(len(dataset_copy))
  fold.append(dataset_copy.pop(index))
 dataset_split.append(fold)
 return dataset_split
# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
 correct = 0
 for i in range(len(actual)):
 if actual[i] == predicted[i]:
  correct += 1
 return correct / float(len(actual)) * 100.0
# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
 folds = cross_validation_split(dataset, n_folds)
 scores = list()
 for fold in folds:
 train_set = list(folds)
 train_set.remove(fold)
 train_set = sum(train_set, [])
 test_set = list()
 for row in fold:
  row_copy = list(row)
  test_set.append(row_copy)
  row_copy[-1] = None
 predicted = algorithm(train_set, test_set, *args)
 actual = [row[-1for row in fold]
 accuracy = accuracy_metric(actual, predicted)
 scores.append(accuracy)
 return scores
# Split a dataset based on an attribute and an attribute value
def test_split(index, value, dataset):
 left, right = list(), list()
 for row in dataset:
 if row[index] < value:
  left.append(row)
 else:
  right.append(row)
 return left, right
# Calculate the Gini index for a split dataset
def gini_index(groups, class_values):
 gini = 0.0
 for class_value in class_values:
 for group in groups:
  size = len(group)
  if size == 0:
  continue
  proportion = [row[-1] for row in group].count(class_value) / float(size)
  gini += (proportion * (1.0 - proportion))
 return gini
# Select the best split point for a dataset
def get_split(dataset):
 class_values = list(set(row[-1] for row in dataset))
 b_index, b_value, b_score, b_groups = 999, 999, 999, None
 for index in range(len(dataset[0]))-1)
 for row in dataset:
  groups = test_split(index, row[index], dataset)
  gini = gini_index(groups, class_values)
  if gini < b_score:
  b_index, b_value, b_score, b_groups = index, row[index], gini, groups
 print ({'index':b_index, 'value':b_value})
 return {'index':b_index, 'value':b_value, 'groups':b_groups}
# Create a terminal node value
def to_terminal(group):
 outcomes = [row[-1] for row in group]
 return max(set(outcomes), key=outcomes.count)
# Create child splits for a node or make terminal
def split(node, max_depth, min_size, depth):
 left, right = node['groups']
 del(node['groups'])
 # check for a no split
 if not left or not right:
 node['left'] = node['right'] = to_terminal(left + right)
 return
 # check for max depth
 if depth >= max_depth:
 node['left'], node['right'] = to_terminal(left), to_terminal(right)
 return
 # process left child
 if len(left) <= min_size:
 node['left'] = to_terminal(left)
 else:
 node['left'] = get_split(left)
 split(node['left'], max_depth, min_size, depth)+1)
 # process right child
 if len(right) <= min_size:
 node['right'] = to_terminal(right)
 else:
 node['right'] = get_split(right)
 split(node['right'], max_depth, min_size, depth)+1)
# Build a decision tree
def build_tree(train, max_depth, min_size):
 root = get_split(train)
 split(root, max_depth, min_size, 1)
 return root
# Make a prediction with a decision tree
def predict(node, row):
 if row[node['index']] < node['value']:
 if isinstance(node['left'], dict):
  return predict(node['left'], row)
 else:
  return node['left']
 else:
 if isinstance(node['right'], dict):
  return predict(node['right'], row)
 else:
  return node['right']
# Classification and Regression Tree Algorithm
def decision_tree(train, test, max_depth, min_size):
 tree = build_tree(train, max_depth, min_size)
 predictions = list()
 for row in test:
 prediction = predict(tree, row)
 predictions.append(prediction)
 return(predictions)
# Test CART on Bank Note dataset
seed(1)
# load and prepare data
filename = r'G:\0pythonstudy\decision_tree\sonar.all-data.csv'
dataset = load_csv(filename)
# Convert string attributes to integers
for i in range(len(dataset[0]))-1)
 str_column_to_float(dataset, i)
# evaluate algorithm
n_folds = 5
max_depth = 5
min_size = 10
scores = evaluate_algorithm(dataset, decision_tree, n_folds, max_depth, min_size)
print('Scores: %s' % scores)
print('Mean Accuracy: %.3f%%' % (sum(scores)/float(len(scores)))

Run results:}

{'index': 38, 'value': 0.0894}
{'index': 36, 'value': 0.8459}
{'index': 50, 'value': 0.0024}
{'index': 15, 'value': 0.0906}
{'index': 16, 'value': 0.9819}
{'index': 10, 'value': 0.0785}
{'index': 16, 'value': 0.0886}
{'index': 38, 'value': 0.0621}
{'index': 5, 'value': 0.0226}
{'index': 8, 'value': 0.0368}
{'index': 11, 'value': 0.0754}
{'index': 0, 'value': 0.0239}
{'index': 8, 'value': 0.0368}
{'index': 29, 'value': 0.1671}
{'index': 46, 'value': 0.0237}
{'index': 38, 'value': 0.0621}
{'index': 14, 'value': 0.0668}
{'index': 4, 'value': 0.0167}
{'index': 37, 'value': 0.0836}
{'index': 12, 'value': 0.0616}
{'index': 7, 'value': 0.0333}
{'index': 33, 'value': 0.8741}
{'index': 16, 'value': 0.0886}
{'index': 8, 'value': 0.0368}
{'index': 33, 'value': 0.0798}
{'index': 44, 'value': 0.0298}
Scores: [48.78048780487805, 70.73170731707317, 58.536585365853654, 51.2195121951
2195, 39.02439024390244]
Mean Accuracy: 53.659%
Press any key to continue... . . .

Knowledge point:

1. Load CSV file

from csv import reader
# Load a CSV file
def load_csv(filename):
 file = open(filename, "r")
 lines = reader(file)
 dataset = list(lines)
 return dataset
filename = r'G:\0pythonstudy\decision_tree\sonar.all-data.csv'
dataset=load_csv(filename)
print(dataset)

2. Convert the data to float format

# Convert string column to float
def str_column_to_float(dataset, column):
 for row in dataset:
 row[column] = float(row[column].strip())
 # print(row[column])
# Convert string attributes to integers
for i in range(len(dataset[0]))-1)
 str_column_to_float(dataset, i)

3. Convert the classification string in the last column to 0,1Integer

def str_column_to_int(dataset, column):
 class_values = [row[column] for row in dataset]# Generate a list of class labels
 # print(class_values)
 unique = set(class_values)# Obtain different elements from the list using set
 print(unique)
 lookup = dict()# Define a dictionary
 # print(enumerate(unique))
 for i, value in enumerate(unique):
 lookup[value] = i
 # print(lookup)
 for row in dataset:
 row[column] = lookup[row[column]]
 print(lookup['M'])

4、Split the dataset into K parts

# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
 dataset_split = list()# Generate an empty list
 dataset_copy = list(dataset)
 print(len(dataset_copy))
 print(len(dataset))
 #print(dataset_copy)
 fold_size = int(len(dataset) / n_folds)
 for i in range(n_folds):
 fold = list()
 while len(fold) < fold_size:
  index = randrange(len(dataset_copy))
  # print(index)
  fold.append(dataset_copy.pop(index))# Use .pop() to delete the elements inside (equivalent to transferring), ensuring that these k elements are all different.
 dataset_split.append(fold)
 return dataset_split
n_folds=5 
folds = cross_validation_split(dataset, n_folds)#k份元素各不相同的训练集

5. Calculate accuracy

# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
 correct = 0
 for i in range(len(actual)):
 if actual[i] == predicted[i]:
  correct += 1
 return correct / float(len(actual)) * 100.0# This is the expression for binary classification accuracy

6. Binary classification for each column

# Split a dataset based on an attribute and an attribute value
def test_split(index, value, dataset):
 left, right = list(), list()# Initialize two empty lists
 for row in dataset:
 if row[index] < value:
  left.append(row)
 else:
  right.append(row)
 return left, right # Return two lists, each classifying the specified rows (index) based on the value.

7. Use the Gini coefficient to obtain the best split point

# Calculate the Gini index for a split dataset
def gini_index(groups, class_values):
 gini = 0.0
 for class_value in class_values:
 for group in groups:
  size = len(group)
  if size == 0:
  continue
  proportion = [row[-1] for row in group].count(class_value) / float(size)
  gini += (proportion * (1.0 - proportion))
 return gini
# Select the best split point for a dataset
def get_split(dataset):
 class_values = list(set(row[-1] for row in dataset))
 b_index, b_value, b_score, b_groups = 999, 999, 999, None
 for index in range(len(dataset[0]))-1)
 for row in dataset:
  groups = test_split(index, row[index], dataset)
  gini = gini_index(groups, class_values)
  if gini < b_score:
  b_index, b_value, b_score, b_groups = index, row[index], gini, groups
 # print(groups)
 print ({'index':b_index, 'value':b_value,'score':gini})
 return {'index':b_index, 'value':b_value, 'groups':b_groups}

This code directly applies the definition to calculate the gini index, which is not difficult to understand. Understanding the optimal split point may be more challenging. Here, a two-level iteration is used, one for different columns and another for different rows. Additionally, the gini coefficient is updated in each iteration.

8、Decision Tree Generation

# Create child splits for a node or make terminal
def split(node, max_depth, min_size, depth):
 left, right = node['groups']
 del(node['groups'])
 # check for a no split
 if not left or not right:
 node['left'] = node['right'] = to_terminal(left + right)
 return
 # check for max depth
 if depth >= max_depth:
 node['left'], node['right'] = to_terminal(left), to_terminal(right)
 return
 # process left child
 if len(left) <= min_size:
 node['left'] = to_terminal(left)
 else:
 node['left'] = get_split(left)
 split(node['left'], max_depth, min_size, depth)+1)
 # process right child
 if len(right) <= min_size:
 node['right'] = to_terminal(right)
 else:
 node['right'] = get_split(right)
 split(node['right'], max_depth, min_size, depth)+1)

Here, recursive programming is used to continuously generate left and right branches.

9.Build a decision tree

# Build a decision tree
def build_tree(train, max_depth, min_size):
 root = get_split(train)
 split(root, max_depth, min_size, 1)
 return root 
tree=build_tree(train_set, max_depth, min_size)
print(tree)

10、Predict the test set

# Build a decision tree
def build_tree(train, max_depth, min_size):
 root = get_split(train)# Obtain the best split point, index value, groups
 split(root, max_depth, min_size, 1)
 return root 
# tree=build_tree(train_set, max_depth, min_size)
# print(tree) 
# Make a prediction with a decision tree
def predict(node, row):
 print(row[node['index']])
 print(node['value'])
 if row[node['index']] < node['value']:# Use the test set to substitute the best split point trained, the split point has deviation, further compare by searching the left and right branches of the tree.
 if isinstance(node['left'], dict):# If it is a dictionary type, perform the operation
  return predict(node['left'], row)
 else:
  return node['left']
 else:
 if isinstance(node['right'], dict):
  return predict(node['right'], row)
 else:
  return node['right']
tree = build_tree(train_set, max_depth, min_size)
predictions = list()
for row in test_set:
 prediction = predict(tree, row)
 predictions.append(prediction)

11.Evaluate decision tree

# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
 folds = cross_validation_split(dataset, n_folds)
 scores = list()
 for fold in folds:
 train_set = list(folds)
 train_set.remove(fold)
 train_set = sum(train_set, [])
 test_set = list()
 for row in fold:
  row_copy = list(row)
  test_set.append(row_copy)
  row_copy[-1] = None
 predicted = algorithm(train_set, test_set, *args)
 actual = [row[-1for row in fold]
 accuracy = accuracy_metric(actual, predicted)
 scores.append(accuracy)
 return scores 

That's all for the content of this article. Hope it will be helpful to everyone's learning, and also hope everyone will support the Yelling Tutorial more.

Statement: The content of this article is from the Internet, and the copyright belongs to the original author. The content is contributed and uploaded by Internet users spontaneously. This website does not own the copyright, has not been manually edited, and does not assume any relevant legal liability. If you find any content suspected of copyright infringement, please send an email to: notice#oldtoolbag.com (When reporting, please replace # with @) for complaints, and provide relevant evidence. Once verified, this site will immediately delete the infringing content.

You May Also Like