
Learning Data Fitting and Generalized Linear Regression Algorithm in Python

Predictive problems in machine learning are usually divided into 2 categories: regression and classification.

In simple terms, regression predicts numeric values, while classification assigns labels and categories to data.

This article discusses how to use Python for basic data fitting and how to analyze the error of fitting results.

In this example, a quadratic function plus random perturbation is used to generate 500 points, and then polynomials of degree 1, 2, and 100 are used to fit the data.

The purpose of fitting is to find a polynomial function that fits the existing training data well and can also make predictions for unknown data.

The code is as follows:

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model

# Data generation: 500 points from a quadratic function plus Gaussian noise
x = np.arange(0, 1, 0.002)
y = norm.rvs(0, size=500, scale=0.1)
y = y + x**2

# Root mean square error
def rmse(y_test, y):
    return np.sqrt(np.mean((y_test - y) ** 2))

# R-squared: how much better the prediction is than just using the mean,
# usually in [0, 1]. 0 means no better than the mean; 1 means a perfect
# prediction. This version follows the scikit-learn official documentation.
def R2(y_test, y_true):
    return 1 - ((y_test - y_true)**2).sum() / ((y_true - y_true.mean())**2).sum()

# This is the version from Conway & White's book 'Machine Learning for Hackers'
def R22(y_test, y_true):
    y_mean = np.array(y_true)
    y_mean[:] = y_mean.mean()
    return 1 - rmse(y_test, y_true) / rmse(y_mean, y_true)

plt.scatter(x, y, s=5)
degree = [1, 2, 100]
for d in degree:
    clf = Pipeline([('poly', PolynomialFeatures(degree=d)),
                    ('linear', LinearRegression(fit_intercept=False))])
    clf.fit(x[:, np.newaxis], y)
    y_test = clf.predict(x[:, np.newaxis])
    print(clf.named_steps['linear'].coef_)
    print('rmse=%.2f, R2=%.2f, R22=%.2f, clf.score=%.2f' %
          (rmse(y_test, y),
           R2(y_test, y),
           R22(y_test, y),
           clf.score(x[:, np.newaxis], y)))
    plt.plot(x, y_test, linewidth=2)
plt.grid()
plt.legend(['1', '2', '100'], loc='upper left')
plt.show()

Running the program produces the following output:

[-0.16140183  0.99268453]
rmse=0.13, R2=0.82, R22=0.58, clf.score=0.82
[ 0.00934527 -0.03591245  1.03065829]
rmse=0.11, R2=0.88, R22=0.66, clf.score=0.88
[  6.07130354e-02  -1.02247150e+00   6.66972089e+01  -1.85696012e+04
......
-9.43408707e+12  -9.78954604e+12  -9.99872105e+12  -1.00742526e+13
-1.00303296e+13  -9.88198843e+12  -9.64452002e+12  -9.33298267e+12
  -1.00580760e+12]
rmse=0.10, R2=0.89, R22=0.67, clf.score=0.89
The coef_ values displayed are the polynomial coefficients. For example, the degree-1 fitting result is
y = 0.99268453x - 0.16140183
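As a quick illustration (a minimal sketch; the coefficient values are copied from the degree-1 output above), note that PolynomialFeatures places the bias (constant) column first, so with fit_intercept=False, coef_[0] is the intercept and coef_[1] the slope:

# Minimal sketch: evaluate the degree-1 fit by hand.
# Coefficients copied from the output above.
coef = [-0.16140183, 0.99268453]
x0 = 0.5                       # an arbitrary sample point
y0 = coef[0] + coef[1] * x0    # ~0.335, the fitted line's value at x0 (the true curve gives 0.25)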
Here we need to pay attention to the following points:
1. Error analysis.
In regression analysis, the common error metrics are the root mean square error (RMSE) and R-squared (R2).
RMSE is the square root of the mean squared error between the predicted and actual values. This metric is very popular (it was the evaluation metric of the Netflix machine learning competition) and provides a quantitative measure of error.
The R2 method compares the predictions against simply using the mean, to see how much better the model can do. Its range is usually (0, 1): 0 means the model is even worse than skipping prediction and taking the mean directly, while 1 means all predictions match the actual results perfectly.
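Written out explicitly (with \hat{y}_i the predicted values, y_i the actual values, \bar{y} their mean, and n the number of samples), the two metrics computed in the code above are:

\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}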
Different literature computes R2 in slightly different ways. The R2 function above follows the calculation in the scikit-learn official documentation, and its result agrees with the clf.score function. The R22 implementation comes from Conway & White's book 'Machine Learning for Hackers'; the difference is that it uses the ratio of the two RMSE values to compute R2.
We can see that the degree-1 polynomial already reaches an R2 of 0.82. The degree-2 polynomial raises it to 0.88. At degree 100, R2 has only increased further to 0.89.
2. Overfitting.
The degree-100 polynomial does fit the data somewhat better, but the model's predictive ability is extremely poor.
Also note the polynomial coefficients: a large number of them are huge, reaching as high as 10 to the 12th power.
Here we modify the code to remove the last 2 of the 500 samples from the training set, while still testing against all 500 samples:
clf.fit(x[:498, np.newaxis], y[:498])
The polynomial fitting results after this modification are as follows:

[-0.17933531  1.0052037 ]
rmse=0.12, R2=0.85, R22=0.61, clf.score=0.85
[-0.01631935  0.01922011  0.99193521]
rmse=0.10, R2=0.9, R22=0.69, clf.score=0.90
...
rmse=0.21, R2=0.57, R22=0.34, clf.score=0.57
With just the last 2 training samples missing, the red line (the degree-100 polynomial's prediction) shows a significant deviation, and its R2 drops sharply to 0.57.
Looking back at the degree-1 and degree-2 polynomial fits, their R2 values have actually slightly increased.
This indicates that the high-degree polynomial overfits the training data, noise included, and as a result completely loses the ability to predict the data's trend. As noted above, the coefficients of the degree-100 polynomial fit are immense. A natural idea is to avoid producing such an anomalous fitting function by limiting the size of these coefficients during fitting.
The basic principle is to add either the sum of the absolute values of all polynomial coefficients (L1 regularization) or the sum of their squares (L2 regularization) to the model as a penalty term, with a specified penalty factor, to avoid the emergence of such anomalous coefficients.
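In a standard formulation (with w the vector of polynomial coefficients, X the feature matrix, and \alpha the penalty factor), the two penalized objectives are:

\min_{w} \|Xw - y\|_2^2 + \alpha \|w\|_1 \quad \text{(L1 regularization, Lasso)}
\qquad
\min_{w} \|Xw - y\|_2^2 + \alpha \|w\|_2^2 \quad \text{(L2 regularization, Ridge)}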
This idea is applied in Ridge regression (which uses L2 regularization), the Lasso method (L1 regularization), Elastic Net (L1 + L2 regularization), and other methods, all of which can effectively avoid overfitting. More on the underlying theory can be found in the relevant literature.
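For reference, here is a minimal sketch of how the Lasso and Elastic Net variants would plug into the same Pipeline as above; the alpha and l1_ratio values are illustrative assumptions, not tuned for this data set:

from sklearn.linear_model import Lasso, ElasticNet

# Drop-in replacements for the 'linear' step of the Pipeline above.
lasso_clf = Pipeline([('poly', PolynomialFeatures(degree=100)),
                      ('linear', Lasso(alpha=0.01))])                    # L1 penalty
enet_clf = Pipeline([('poly', PolynomialFeatures(degree=100)),
                     ('linear', ElasticNet(alpha=0.01, l1_ratio=0.5))])  # L1 + L2 penalty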
Let's take Ridge regression as an example and see whether it is effective for the degree-100 polynomial fit. Modify the code as follows:
clf = Pipeline([('poly', PolynomialFeatures(degree=d)),
                ('linear', linear_model.Ridge())])
clf.fit(x[:400, np.newaxis], y[:400])
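Note that Ridge() is used here with scikit-learn's default penalty strength (alpha=1.0). A larger alpha shrinks the coefficients more aggressively; the value in this sketch is illustrative only:

clf = Pipeline([('poly', PolynomialFeatures(degree=d)),
                ('linear', linear_model.Ridge(alpha=0.5))])  # 0.5 is an illustrative choice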

The results are as follows:

[ 0.          0.75873781]
rmse=0.15, R2=0.78, R22=0.53, clf.score=0.78
[ 0.          0.35936882  0.52392172]
rmse=0.11, R2=0.87, R22=0.64, clf.score=0.87
[  0.00000000e+00   2.63903249e-01   3.14973328e-01   2.43389461e-01
   1.67075328e-01   1.10674280e-01   7.30672237e-02   4.88605804e-02
   ......
   3.70018540e-11   2.93631291e-11   2.32992690e-11   1.84860002e-11
   1.46657377e-11]
rmse=0.10, R2=0.9, R22=0.68, clf.score=0.90
We can see that the degree-100 polynomial's coefficients have become very small, most of them close to 0.
It is also worth noting that after switching to penalty models such as Ridge regression, the R2 values of the degree-1 and degree-2 polynomial regressions may be slightly lower than with plain linear regression.
However, such a model, even using a degree-100 polynomial while training on 400 samples and predicting all 500, not only achieves a small error but also has excellent predictive ability.

That's all for this article. I hope it is helpful to your learning, and I also hope you will continue to support Yelling Tutorial.

