Statistical Functions in Pandas

Examples of using Pandas statistical functions

Statistical methods help you understand and analyze the behavior of data. This section covers some statistical functions that can be applied to Pandas objects.

Percentage Change

Both Series and DataFrame have a pct_change() method (the Panel class, which also had it, was removed in pandas 0.25). This method compares each element with the previous one and returns the fractional change, so a result of 1.0 means a 100% increase.

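 import pandas as pd
 import numpy as np
 # Change of each element relative to the one before it
 s = pd.Series([1, 2, 3, 4, 5, 4])
 print(s.pct_change())
 # On a DataFrame the change is computed down each column
 df = pd.DataFrame(np.random.randn(5, 2))
 print(df.pct_change())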

Running Result:

 0         NaN
 1    1.000000
 2    0.500000
 3    0.333333
 4    0.250000
 5   -0.200000
 dtype: float64
            0         1
 0        NaN       NaN
 1 -15.151902  0.174730
 2  -0.746374 -1.449088
 3  -3.582229 -3.165836
 4  15.601150 -1.860434

By default, pct_change() works down each column, comparing every value with the one in the row above. To compare values along each row instead, pass the axis=1 parameter.
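As a rough illustration of the two points above, the sketch below first rebuilds pct_change() by hand with shift(), then applies it row-wise with axis=1. The column names q1, q2 and q3 and all the values are made up for the example.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 4.0])

# pct_change() is the fractional change relative to the previous element,
# which can be written by hand as current / previous - 1.
manual = s / s.shift(1) - 1
pct = s.pct_change()
print(pct.equals(manual))

# Hypothetical quarterly figures: one row per product, one column per quarter.
df = pd.DataFrame({"q1": [100.0, 50.0], "q2": [110.0, 75.0], "q3": [121.0, 60.0]})

# axis=1 compares each value with the one in the column to its left,
# so the first column has nothing to compare against and becomes NaN.
row_wise = df.pct_change(axis=1)
print(row_wise)
```

With axis=1 the first column is all NaN, just as the first row is NaN in the default column-wise case.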

Covariance

Covariance is computed on numeric series. A Series object has a cov() method that calculates the covariance between itself and another Series. Missing values (NA) are excluded automatically.

Covariance of Series

 import pandas as pd
 import numpy as np
 s1 = pd.Series(np.random.randn(10))
 s2 = pd.Series(np.random.randn(10))
 print(s1.cov(s2))

Running Result:

   -0.12978405324
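The automatic exclusion of NA values can be checked with a small sketch like the one below. The input values are made up; the point is that cov() appears to use only the positions where both series have a value, which matches dropping the incomplete pairs by hand.

```python
import numpy as np
import pandas as pd

# Illustrative data: each series has one missing value at a different position.
s1 = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
s2 = pd.Series([2.0, 4.0, 6.0, np.nan, 10.0])

# Only positions 0, 1 and 4 have a value in both series.
cov_with_na = s1.cov(s2)

# Drop the incomplete pairs explicitly and compute the covariance again.
complete = pd.concat([s1, s2], axis=1).dropna()
cov_manual = complete[0].cov(complete[1])

print(cov_with_na, cov_manual)
```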

When applied to a DataFrame, cov() calculates the pairwise covariance between all columns and returns the result as a matrix.

 import pandas as pd
 import numpy as np
 frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
 print(frame['a'].cov(frame['b']))
 print(frame.cov())

Running Result:

 -0.58312921152741437
           a         b         c         d         e
 a  1.780628 -0.583129 -0.185575  0.003679 -0.136558
 b -0.583129  1.297011  0.136530 -0.523719  0.251064
 c -0.185575  0.136530  0.915227 -0.053881 -0.058926
 d  0.003679 -0.523719 -0.053881  1.521426 -0.487694
 e -0.136558  0.251064 -0.058926 -0.487694  0.960761

Note that the covariance between columns a and b printed by the first statement is the same as the a/b entry in the matrix returned by cov() on the DataFrame.
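That observation can also be checked in code rather than by eye. The following is a minimal sketch with a fixed random seed (an arbitrary choice so the run is repeatable); it also shows that the diagonal of the matrix holds each column's variance.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for repeatability
frame = pd.DataFrame(rng.standard_normal((10, 3)), columns=["a", "b", "c"])

cov_matrix = frame.cov()
pairwise = frame["a"].cov(frame["b"])

# The a/b entry of the matrix matches the Series-level result.
print(np.isclose(cov_matrix.loc["a", "b"], pairwise))

# The matrix is symmetric, and its diagonal is the variance of each column.
print(np.isclose(cov_matrix.loc["a", "b"], cov_matrix.loc["b", "a"]))
print(np.isclose(cov_matrix.loc["a", "a"], frame["a"].var()))
```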

Correlation

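Correlation measures the strength of the relationship between two arrays of values (series). The method parameter selects how it is calculated: pearson (the default) measures linear correlation, while spearman and kendall are rank-based and measure how consistently one value increases with the other.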

 import pandas as pd
 import numpy as np
 frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
 print(frame['a'].corr(frame['b']))
 print(frame.corr())

Running Result:

 -0.383712785514
           a         b         c         d         e
 a  1.000000 -0.383713 -0.145368  0.002235 -0.104405
 b -0.383713  1.000000  0.125311 -0.372821  0.224908
 c -0.145368  0.125311  1.000000 -0.045661 -0.062840
 d  0.002235 -0.372821 -0.045661  1.000000 -0.403380
 e -0.104405  0.224908 -0.062840 -0.403380  1.000000

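Non-numeric columns cannot be correlated. Older pandas versions excluded them automatically; from pandas 2.0 onward, numeric_only defaults to False, so pass numeric_only=True to corr() (or select the numeric columns first) if the DataFrame contains any.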
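To see how the choice of correlation method matters, here is a small sketch comparing pearson and spearman on made-up data where y always grows with x but not in a straight line. kendall is left out only to keep the example short; it is passed through the same method parameter.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
df = pd.DataFrame({"x": x, "y": x ** 3})  # monotonic, but not linear

# Pearson looks for a straight-line relationship, so it falls short of 1.
pearson = df.corr(method="pearson").loc["x", "y"]

# Spearman compares ranks only, so any strictly increasing relationship scores 1.
spearman = df.corr(method="spearman").loc["x", "y"]

print(pearson, spearman)
```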

Data ranking

Ranking assigns each element of a Series its position in sorted order, starting from 1 for the smallest value. By default, tied elements each receive the average of the ranks they span.

 import pandas as pd
 import numpy as np
 s = pd.Series(np.random.randn(5), index=list('abcde'))
 s['d'] = s['b']  # so there's a tie
 print(s.rank())

Running Result:

 a    1.0
 b    3.5
 c    2.0
 d    3.5
 e    5.0
 dtype: float64

rank() takes an optional ascending parameter, which is True by default. When it is False, the data is ranked in reverse order, so larger values are assigned smaller ranks.

rank() also accepts a method parameter that controls how tied values are ranked:

average − Each tied value gets the average rank of its group (the default).
min − Each tied value gets the lowest rank in its group.
max − Each tied value gets the highest rank in its group.
first − Tied values are ranked in the order in which they appear in the array.
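The tie-breaking options above, together with ascending=False, can be compared side by side. This is a minimal sketch on made-up values in which b and d are tied; the column labels of the ranks table are chosen only for this illustration.

```python
import pandas as pd

# b and d are tied at 30; in sorted order they occupy positions 3 and 4.
s = pd.Series([10, 30, 20, 30, 50], index=list("abcde"))

ranks = pd.DataFrame({
    "average": s.rank(method="average"),   # b and d both get 3.5
    "min": s.rank(method="min"),           # b and d both get 3
    "max": s.rank(method="max"),           # b and d both get 4
    "first": s.rank(method="first"),       # b gets 3, d gets 4 (order of appearance)
    "descending": s.rank(ascending=False), # largest value gets rank 1
})
print(ranks)
```

Only the rows for b and d differ between the first four columns, because the method parameter affects tied values only.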