Introduction
Obtaining the main features of a dataset with a random forest
1 Introduction to random forest
Random forest is a highly flexible machine learning method with broad application prospects, from marketing to healthcare insurance. It can be used to model marketing scenarios and to analyze customer acquisition, retention, and churn, and it can also be used to predict disease risk and patient susceptibility.
Depending on how the individual learners are generated, current ensemble learning methods fall into roughly two categories: sequential methods, in which there is a strong dependency between individual learners and they must be generated one after another, and parallel methods, in which there is no strong dependency and the learners can be generated simultaneously.
The representative of the former is Boosting; representatives of the latter are Bagging and Random Forest.
Random forest builds on Bagging with decision trees as base learners, and further introduces random attribute selection (i.e., random feature selection) into the training of each decision tree.
In simple terms, a random forest is an ensemble of decision trees, with two key differences:
(1) Sampling difference: each decision tree is trained on a bootstrap sample drawn with replacement from the original data, rather than on the full dataset.
(2) Feature selection difference: the n features used by each decision tree are randomly selected from all features (n is a parameter we need to tune ourselves).
Put simply, take predicting salary as an example: a random forest builds multiple decision trees over features such as job, age, and house. For a new instance to be predicted (teacher, 39, suburb), each tree outputs a probability for the target value (salary < 5000 or salary >= 5000), and these votes are combined into the forest's predicted probability, for example P(salary < 5000) = 0.3.
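To make this concrete, here is a minimal sketch in scikit-learn; the tiny dataset, the integer encodings for job and house, and the printed numbers are all invented for illustration only.
# Minimal sketch of the salary example above (all data here is made up for illustration).
# Features: job (encoded as an integer), age, house (0/1); label: 1 if salary < 5000 else 0.
from sklearn.ensemble import RandomForestClassifier

X = [[0, 39, 1], [1, 25, 0], [2, 45, 1], [0, 30, 0], [1, 52, 1], [2, 28, 0]]
y = [0, 1, 0, 1, 0, 1]  # hypothetical labels: 1 = salary < 5000, 0 = salary >= 5000

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# Each tree votes; predict_proba averages the votes for a 39-year-old teacher
# (job code 0) with a house in the suburbs (house=1).
print(clf.classes_)                     # [0 1]
print(clf.predict_proba([[0, 39, 1]]))  # e.g. [[0.9 0.1]]; exact numbers depend on the data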
Random forest can be used for both regression and classification. It handles large datasets well and is very helpful for estimating which variables are important, which makes it a useful tool for basic data modeling.
Parameter description:
The two most important parameters are 'n_estimators' and 'max_features'.
'n_estimators': the number of trees in the forest. In theory, more is better, but more trees also means more computation time, and beyond a reasonable number of trees the prediction quality stops improving.
'max_features': the size of the random subset of features considered when splitting a node. The smaller the subset, the faster the variance decreases, but the faster the bias increases. As a practical rule of thumb: for regression problems set 'max_features' equal to 'n_features', and for classification problems set 'max_features' equal to the square root of 'n_features'.
To get better results, it is often suggested to set 'max_depth' to None and 'min_samples_split' to its minimum value (note that current scikit-learn requires 'min_samples_split' >= 2).
Also remember to perform cross-validation. In addition, in a random forest 'bootstrap' should be set to True, while in extra-trees ('ExtraTreesClassifier'/'ExtraTreesRegressor') 'bootstrap' is set to False.
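As a rough illustration of the advice above (the synthetic dataset is a placeholder and any scores it prints are not results from this article), one way to wire these parameters together is:
# Sketch of the parameter advice above, on a synthetic dataset (values are illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,        # more trees -> more stable, but slower
    max_features="sqrt",     # classification: roughly sqrt(n_features)
    max_depth=None,          # grow trees fully
    min_samples_split=2,     # scikit-learn requires >= 2
    bootstrap=True,          # random forest samples with replacement
    random_state=0,
)
et = ExtraTreesClassifier(
    n_estimators=100, max_features="sqrt", max_depth=None,
    min_samples_split=2, bootstrap=False, random_state=0,  # extra-trees: bootstrap=False
)
print(cross_val_score(rf, X, y).mean())  # cross-validate as recommended above
print(cross_val_score(et, X, y).mean())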
2 Random forest implementation in Python
2.1 Demo1
Implement the basic functions of random forest
# Random forest: basic usage
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_iris

iris = load_iris()
# iris has 4 attributes: sepal length, sepal width, petal length, petal width;
# the label is the type of flower: setosa, versicolour, virginica
print(iris['target'].shape)

rf = RandomForestRegressor()  # default parameter settings
rf.fit(iris.data[:150], iris.target[:150])  # train the model

# Pick two samples and predict them
instance = iris.data[[100, 109]]
print(instance)
print('instance 0 prediction;', rf.predict(instance[[0]]))
print('instance 1 prediction; ', rf.predict(instance[[1]]))
print(iris.target[100], iris.target[109])
Running results
(150,)
[[ 6.3 3.3 6. 2.5]
[ 7.2 3.6 6.1 2.5]]
instance 0 prediction; [ 2]
instance 1 prediction; [ 2]
2 2
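A side note: since the iris targets are discrete class labels, a RandomForestClassifier is arguably the more natural choice than the regressor used above. A minimal variation of Demo1 might look like this (output values are indicative only):
# Same demo with a classifier (the iris targets are discrete class labels).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(iris.data, iris.target)
print(clf.predict(iris.data[[100, 109]]))        # e.g. [2 2]
print(clf.predict_proba(iris.data[[100, 109]]))  # per-class probabilities instead of a regression value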
2.2 Demo2
Comparison of three methods (decision tree, random forest, extra-trees)
# random forest test: compare three classifiers
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)

clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y)
print(scores.mean())

clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y)
print(scores.mean())

clf = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y)
print(scores.mean())
Run results:
0.979408793821
0.999607843137
0.999898989899
2.3 Demo3 - Implementation of feature selection
# Random forest: per-feature scoring for feature selection
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, ShuffleSplit

iris = load_iris()
X = iris["data"]
Y = iris["target"]
names = iris["feature_names"]

rf = RandomForestRegressor()
scores = []
for i in range(X.shape[1]):
    # Score each feature on its own with cross-validated R^2
    score = cross_val_score(rf, X[:, i:i + 1], Y, scoring="r2",
                            cv=ShuffleSplit(n_splits=3, test_size=0.3))
    scores.append((round(np.mean(score), 3), names[i]))
print(sorted(scores, reverse=True))
Run results:
[(0.89300000000000002, 'petal width (cm)'), (0.82099999999999995, 'petal length
(cm)'), (0.13, 'sepal length (cm)'), (-0.79100000000000004, 'sepal width (cm)')]
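For comparison (this is not part of the original demo), a fitted forest also exposes a feature_importances_ attribute that gives a related ranking directly; how closely it matches the per-feature cross-validation scores above depends on the data.
# Alternative feature ranking via the fitted forest's feature_importances_ attribute.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor

iris = load_iris()
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(iris["data"], iris["target"])
ranking = sorted(zip(rf.feature_importances_, iris["feature_names"]), reverse=True)
print(ranking)  # the petal measurements typically rank highest, as in the scores above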
2.4 Demo4 - Random forest implemented from scratch
I wanted to use the following code to build a random forest of decision trees, but the program hangs (it keeps running without responding) and still needs to be debugged.
# Random Forest 4
# coding: utf-8
import csv
from random import seed
from random import randrange
from math import sqrt


def loadCSV(filename):  # load the data, storing each line in a list
    dataSet = []
    with open(filename, 'r') as file:
        csvReader = csv.reader(file)
        for line in csvReader:
            dataSet.append(line)
    return dataSet


# Convert all columns except the label column to float type
def column_to_float(dataSet):
    featLen = len(dataSet[0]) - 1
    for data in dataSet:
        for column in range(featLen):
            data[column] = float(data[column].strip())


# Randomly split the data set into N folds for cross-validation: one fold is the test set, the others are training sets
def spiltDataSet(dataSet, n_folds):
    fold_size = int(len(dataSet) / n_folds)
    dataSet_copy = list(dataSet)
    dataSet_spilt = []
    for i in range(n_folds):
        fold = []
        while len(fold) < fold_size:  # while (not if), so the loop runs until the condition fails
            index = randrange(len(dataSet_copy))
            fold.append(dataSet_copy.pop(index))  # pop() removes an element from the list (by default the last one) and returns its value
        dataSet_spilt.append(fold)
    return dataSet_spilt


# Construct a data subset (bootstrap-style subsample)
def get_subsample(dataSet, ratio):
    subdataSet = []
    lenSubdata = round(len(dataSet) * ratio)  # desired subsample size
    while len(subdataSet) < lenSubdata:
        index = randrange(len(dataSet) - 1)
        subdataSet.append(dataSet[index])
    return subdataSet


# Split the data set on a feature value
def data_spilt(dataSet, index, value):
    left = []
    right = []
    for row in dataSet:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return left, right


# Calculate the split cost
def spilt_loss(left, right, class_values):
    loss = 0.0
    for class_value in class_values:
        left_size = len(left)
        if left_size != 0:  # prevent division by zero
            prop = [row[-1] for row in left].count(class_value) / float(left_size)
            loss += (prop * (1.0 - prop))
        right_size = len(right)
        if right_size != 0:
            prop = [row[-1] for row in right].count(class_value) / float(right_size)
            loss += (prop * (1.0 - prop))
    return loss


# Select n random features and find the best split among them
def get_best_spilt(dataSet, n_features):
    features = []
    class_values = list(set(row[-1] for row in dataSet))
    b_index, b_value, b_loss, b_left, b_right = 999, 999, 999, None, None
    while len(features) < n_features:
        index = randrange(len(dataSet[0]) - 1)
        if index not in features:
            features.append(index)
    for index in features:  # find the split with the minimum loss for this node
        for row in dataSet:
            left, right = data_spilt(dataSet, index, row[index])  # split into left and right branches at this node
            loss = spilt_loss(left, right, class_values)
            if loss < b_loss:  # keep the minimum split cost
                b_index, b_value, b_loss, b_left, b_right = index, row[index], loss, left, right
    return {'index': b_index, 'value': b_value, 'left': b_left, 'right': b_right}


# Determine the output label (majority class)
def decide_label(data):
    output = [row[-1] for row in data]
    return max(set(output), key=output.count)


# Recursive splitting: the process of continuously building child and leaf nodes
def sub_spilt(root, n_features, max_depth, min_size, depth):
    left = root['left']
    right = root['right']
    del (root['left'])
    del (root['right'])
    if not left or not right:
        root['left'] = root['right'] = decide_label(left + right)
        return
    if depth > max_depth:
        root['left'] = decide_label(left)
        root['right'] = decide_label(right)
        return
    if len(left) < min_size:
        root['left'] = decide_label(left)
    else:
        root['left'] = get_best_spilt(left, n_features)
        sub_spilt(root['left'], n_features, max_depth, min_size, depth + 1)
    if len(right) < min_size:
        root['right'] = decide_label(right)
    else:
        root['right'] = get_best_spilt(right, n_features)
        sub_spilt(root['right'], n_features, max_depth, min_size, depth + 1)


# Construct a decision tree
def build_tree(dataSet, n_features, max_depth, min_size):
    root = get_best_spilt(dataSet, n_features)
    sub_spilt(root, n_features, max_depth, min_size, 1)
    return root


# Predict a test row with one tree
def predict(tree, row):
    if row[tree['index']] < tree['value']:
        if isinstance(tree['left'], dict):
            return predict(tree['left'], row)
        else:
            return tree['left']
    else:
        if isinstance(tree['right'], dict):
            return predict(tree['right'], row)
        else:
            return tree['right']


# Combine the trees' predictions by majority vote
def bagging_predict(trees, row):
    predictions = [predict(tree, row) for tree in trees]
    return max(set(predictions), key=predictions.count)


# Create the random forest
def random_forest(train, test, ratio, n_features, max_depth, min_size, n_trees):
    trees = []
    for i in range(n_trees):
        subsample = get_subsample(train, ratio)  # draw a subsample from the training data
        tree = build_tree(subsample, n_features, max_depth, min_size)
        trees.append(tree)
    predict_values = [bagging_predict(trees, row) for row in test]
    return predict_values


# Calculate accuracy
def accuracy(predict_values, actual):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predict_values[i]:
            correct += 1
    return correct / float(len(actual))


if __name__ == '__main__':
    seed(1)
    dataSet = loadCSV(r'G:\0研究生\tianchiCompetition\训练小样本2.csv')
    column_to_float(dataSet)
    n_folds = 5
    max_depth = 15
    min_size = 1
    ratio = 1.0
    # n_features = sqrt(len(dataSet) - 1)
    n_features = 15
    n_trees = 10
    folds = spiltDataSet(dataSet, n_folds)  # split the dataset first
    scores = []
    for fold in folds:
        # folds[:] makes a copy; train_set = folds would only copy the reference,
        # so changing train_set would also change folds
        train_set = folds[:]
        train_set.remove(fold)  # select the training folds
        train_set = sum(train_set, [])  # combine multiple fold lists into a single train_set list
        test_set = []
        for row in fold:
            row_copy = list(row)
            row_copy[-1] = None
            test_set.append(row_copy)
        actual = [row[-1] for row in fold]
        predict_values = random_forest(train_set, test_set, ratio, n_features, max_depth, min_size, n_trees)
        accur = accuracy(predict_values, actual)
        scores.append(accur)
    print('Trees is %d' % n_trees)
    print('scores: %s' % scores)
    print('mean score: %s' % (sum(scores) / float(len(scores))))
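One possible cause of the hang, offered as an assumption from reading the code rather than a verified diagnosis: get_best_spilt keeps drawing random column indices until it has n_features distinct ones, so if the CSV has fewer than 15 feature columns, the while loop can never terminate. A small guard worth trying before building the trees:
# Hypothetical guard: never ask for more distinct features than the data actually has.
n_columns = len(dataSet[0]) - 1                 # feature columns, excluding the label
n_features = min(n_features, n_columns)         # or e.g. int(sqrt(n_columns)), the usual heuristic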
2.5 Classification of sonar data (CART decision tree)
# CART on the sonar dataset
from random import seed
from random import randrange
from csv import reader

# Load a CSV file
def load_csv(filename):
    file = open(filename, "r")
    lines = reader(file)
    dataset = list(lines)
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / n_folds)
    for i in range(n_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0

# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    folds = cross_validation_split(dataset, n_folds)
    scores = list()
    for fold in folds:
        train_set = list(folds)
        train_set.remove(fold)
        train_set = sum(train_set, [])
        test_set = list()
        for row in fold:
            row_copy = list(row)
            test_set.append(row_copy)
            row_copy[-1] = None
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]
        accuracy = accuracy_metric(actual, predicted)
        scores.append(accuracy)
    return scores

# Split a dataset based on an attribute and an attribute value
def test_split(index, value, dataset):
    left, right = list(), list()
    for row in dataset:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return left, right

# Calculate the Gini index for a split dataset
def gini_index(groups, class_values):
    gini = 0.0
    for class_value in class_values:
        for group in groups:
            size = len(group)
            if size == 0:
                continue
            proportion = [row[-1] for row in group].count(class_value) / float(size)
            gini += (proportion * (1.0 - proportion))
    return gini

# Select the best split point for a dataset
def get_split(dataset):
    class_values = list(set(row[-1] for row in dataset))
    b_index, b_value, b_score, b_groups = 999, 999, 999, None
    for index in range(len(dataset[0]) - 1):
        for row in dataset:
            groups = test_split(index, row[index], dataset)
            gini = gini_index(groups, class_values)
            if gini < b_score:
                b_index, b_value, b_score, b_groups = index, row[index], gini, groups
    print({'index': b_index, 'value': b_value})
    return {'index': b_index, 'value': b_value, 'groups': b_groups}

# Create a terminal node value
def to_terminal(group):
    outcomes = [row[-1] for row in group]
    return max(set(outcomes), key=outcomes.count)

# Create child splits for a node or make terminal
def split(node, max_depth, min_size, depth):
    left, right = node['groups']
    del(node['groups'])
    # check for a no split
    if not left or not right:
        node['left'] = node['right'] = to_terminal(left + right)
        return
    # check for max depth
    if depth >= max_depth:
        node['left'], node['right'] = to_terminal(left), to_terminal(right)
        return
    # process left child
    if len(left) <= min_size:
        node['left'] = to_terminal(left)
    else:
        node['left'] = get_split(left)
        split(node['left'], max_depth, min_size, depth + 1)
    # process right child
    if len(right) <= min_size:
        node['right'] = to_terminal(right)
    else:
        node['right'] = get_split(right)
        split(node['right'], max_depth, min_size, depth + 1)

# Build a decision tree
def build_tree(train, max_depth, min_size):
    root = get_split(train)
    split(root, max_depth, min_size, 1)
    return root

# Make a prediction with a decision tree
def predict(node, row):
    if row[node['index']] < node['value']:
        if isinstance(node['left'], dict):
            return predict(node['left'], row)
        else:
            return node['left']
    else:
        if isinstance(node['right'], dict):
            return predict(node['right'], row)
        else:
            return node['right']

# Classification and Regression Tree Algorithm
def decision_tree(train, test, max_depth, min_size):
    tree = build_tree(train, max_depth, min_size)
    predictions = list()
    for row in test:
        prediction = predict(tree, row)
        predictions.append(prediction)
    return predictions

# Test CART on the sonar dataset
seed(1)
# load and prepare data
filename = r'G:\0pythonstudy\decision_tree\sonar.all-data.csv'
dataset = load_csv(filename)
# Convert string attributes to float
for i in range(len(dataset[0]) - 1):
    str_column_to_float(dataset, i)
# evaluate algorithm
n_folds = 5
max_depth = 5
min_size = 10
scores = evaluate_algorithm(dataset, decision_tree, n_folds, max_depth, min_size)
print('Scores: %s' % scores)
print('Mean Accuracy: %.3f%%' % (sum(scores) / float(len(scores))))
Run results:
{'index': 38, 'value': 0.0894}
{'index': 36, 'value': 0.8459}
{'index': 50, 'value': 0.0024}
{'index': 15, 'value': 0.0906}
{'index': 16, 'value': 0.9819}
{'index': 10, 'value': 0.0785}
{'index': 16, 'value': 0.0886}
{'index': 38, 'value': 0.0621}
{'index': 5, 'value': 0.0226}
{'index': 8, 'value': 0.0368}
{'index': 11, 'value': 0.0754}
{'index': 0, 'value': 0.0239}
{'index': 8, 'value': 0.0368}
{'index': 29, 'value': 0.1671}
{'index': 46, 'value': 0.0237}
{'index': 38, 'value': 0.0621}
{'index': 14, 'value': 0.0668}
{'index': 4, 'value': 0.0167}
{'index': 37, 'value': 0.0836}
{'index': 12, 'value': 0.0616}
{'index': 7, 'value': 0.0333}
{'index': 33, 'value': 0.8741}
{'index': 16, 'value': 0.0886}
{'index': 8, 'value': 0.0368}
{'index': 33, 'value': 0.0798}
{'index': 44, 'value': 0.0298}
Scores: [48.78048780487805, 70.73170731707317, 58.536585365853654, 51.2195121951
2195, 39.02439024390244]
Mean Accuracy: 53.659%
Knowledge points:
1. Load CSV file
from csv import reader

# Load a CSV file
def load_csv(filename):
    file = open(filename, "r")
    lines = reader(file)
    dataset = list(lines)
    return dataset

filename = r'G:\0pythonstudy\decision_tree\sonar.all-data.csv'
dataset = load_csv(filename)
print(dataset)
2. Convert the data to float format
# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())
        # print(row[column])

# Convert all feature columns (everything except the label) to float
for i in range(len(dataset[0]) - 1):
    str_column_to_float(dataset, i)
3. Convert the class label string in the last column to the integers 0 and 1
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]  # generate a list of class labels
    unique = set(class_values)  # obtain the distinct labels using set
    print(unique)
    lookup = dict()  # dictionary mapping label -> integer
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    print(lookup['M'])
4. Split the dataset into K folds
# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
    dataset_split = list()  # generate an empty list
    dataset_copy = list(dataset)
    print(len(dataset_copy))
    print(len(dataset))
    fold_size = int(len(dataset) / n_folds)
    for i in range(n_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            # .pop() removes the chosen element (effectively transferring it),
            # which guarantees that the k folds share no elements
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

n_folds = 5
folds = cross_validation_split(dataset, n_folds)  # k folds with no overlapping elements
5. Calculate accuracy
# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0  # accuracy for binary classification
6. Binary split of the dataset on a column value
# Split a dataset based on an attribute and an attribute value
def test_split(index, value, dataset):
    left, right = list(), list()  # initialize two empty lists
    for row in dataset:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    # return two lists: rows whose value in column `index` falls below / at-or-above `value`
    return left, right
7. Use the Gini index to find the best split point
# Calculate the Gini index for a split dataset
def gini_index(groups, class_values):
    gini = 0.0
    for class_value in class_values:
        for group in groups:
            size = len(group)
            if size == 0:
                continue
            proportion = [row[-1] for row in group].count(class_value) / float(size)
            gini += (proportion * (1.0 - proportion))
    return gini

# Select the best split point for a dataset
def get_split(dataset):
    class_values = list(set(row[-1] for row in dataset))
    b_index, b_value, b_score, b_groups = 999, 999, 999, None
    for index in range(len(dataset[0]) - 1):
        for row in dataset:
            groups = test_split(index, row[index], dataset)
            gini = gini_index(groups, class_values)
            if gini < b_score:
                b_index, b_value, b_score, b_groups = index, row[index], gini, groups
    print({'index': b_index, 'value': b_value, 'score': b_score})
    return {'index': b_index, 'value': b_value, 'groups': b_groups}
This code computes the Gini index directly from its definition, which is not hard to follow. Finding the best split point is a little more involved: a two-level loop iterates over candidate columns and, within each column, over candidate row values, and the best split found so far (the one with the lowest Gini index) is updated on each iteration.
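A tiny worked example may make the Gini computation concrete; the four rows below are invented, with the class label in the last column, and gini_index is the function defined above.
# Hand-worked Gini example on made-up rows; labels are in the last column.
groups = ([[2.0, 'M'], [3.0, 'M']],      # left group: pure 'M'      -> contributes 0
          [[7.0, 'R'], [8.0, 'M']])      # right group: half 'M', half 'R' -> contributes 0.25 + 0.25
print(gini_index(groups, ['M', 'R']))    # 0.5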
8. Decision tree generation
# Create child splits for a node or make terminal
def split(node, max_depth, min_size, depth):
    left, right = node['groups']
    del(node['groups'])
    # check for a no split
    if not left or not right:
        node['left'] = node['right'] = to_terminal(left + right)
        return
    # check for max depth
    if depth >= max_depth:
        node['left'], node['right'] = to_terminal(left), to_terminal(right)
        return
    # process left child
    if len(left) <= min_size:
        node['left'] = to_terminal(left)
    else:
        node['left'] = get_split(left)
        split(node['left'], max_depth, min_size, depth + 1)
    # process right child
    if len(right) <= min_size:
        node['right'] = to_terminal(right)
    else:
        node['right'] = get_split(right)
        split(node['right'], max_depth, min_size, depth + 1)
Here, recursive programming is used to continuously generate left and right branches.
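To make the recursion easier to picture, the finished tree is just nested dictionaries: internal nodes are dicts with 'index', 'value', 'left' and 'right', and leaves are plain class labels. A made-up example of the resulting shape:
# Illustrative shape of a trained tree (the nested values are invented).
tree = {
    'index': 38, 'value': 0.0894,
    'left': {'index': 5, 'value': 0.0226, 'left': 'R', 'right': 'M'},
    'right': 'M',
}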
9. Build a decision tree
# Build a decision tree
def build_tree(train, max_depth, min_size):
    root = get_split(train)
    split(root, max_depth, min_size, 1)
    return root

tree = build_tree(train_set, max_depth, min_size)
print(tree)
10. Predict the test set
# Build a decision tree
def build_tree(train, max_depth, min_size):
    root = get_split(train)  # obtain the best split point: index, value, groups
    split(root, max_depth, min_size, 1)
    return root

# Make a prediction with a decision tree
def predict(node, row):
    print(row[node['index']])
    print(node['value'])
    # Compare the test row against the trained split point; since a single split is
    # imprecise, keep descending into the left or right branch of the tree.
    if row[node['index']] < node['value']:
        if isinstance(node['left'], dict):  # a dict means an internal node, so recurse
            return predict(node['left'], row)
        else:
            return node['left']
    else:
        if isinstance(node['right'], dict):
            return predict(node['right'], row)
        else:
            return node['right']

tree = build_tree(train_set, max_depth, min_size)
predictions = list()
for row in test_set:
    prediction = predict(tree, row)
    predictions.append(prediction)
11. Evaluate the decision tree
# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    folds = cross_validation_split(dataset, n_folds)
    scores = list()
    for fold in folds:
        train_set = list(folds)
        train_set.remove(fold)
        train_set = sum(train_set, [])
        test_set = list()
        for row in fold:
            row_copy = list(row)
            test_set.append(row_copy)
            row_copy[-1] = None  # hide the label from the test rows
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]
        accuracy = accuracy_metric(actual, predicted)
        scores.append(accuracy)
    return scores
That's all for this article. I hope it is helpful for your learning.