Statistical Functions in Pandas

Examples of using Pandas statistical functions

Statistical methods help you understand and analyze the behavior of data. This section covers several statistical functions that can be applied to Pandas objects.

Percentage Change

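Series and DataFrame objects both have a pct_change() method (Panel, which older tutorials also list, has been removed from recent versions of pandas). The method compares each element with the element before it and returns the change as a fraction, so a result of 1.0 means a 100% increase.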
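As a rough check of what the method computes, the sketch below reproduces the result by hand with shift(). The variable names manual and builtin are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 4])

# Divide each element by the previous one and subtract 1.
# The first element has no predecessor, so its result is NaN.
manual = s / s.shift(1) - 1

builtin = s.pct_change()

# Both approaches should agree, treating the leading NaN as equal.
print(np.allclose(manual, builtin, equal_nan=True))
```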

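Example

    import pandas as pd
    import numpy as np

    # Percentage change between consecutive elements of a Series
    s = pd.Series([1, 2, 3, 4, 5, 4])
    print(s.pct_change())

    # Percentage change down each column of a DataFrame of random values
    df = pd.DataFrame(np.random.randn(5, 2))
    print(df.pct_change())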

Running Result:

    0         NaN
    1    1.000000
    2    0.500000
    3    0.333333
    4    0.250000
    5   -0.200000
    dtype: float64
               0         1
    0        NaN       NaN
    1 -15.151902  0.174730
    2  -0.746374 -1.449088
    3  -3.582229 -3.165836
    4  15.601150 -1.860434

By default, pct_change() operates down each column (axis=0). To compute the change across each row instead, pass axis=1.
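To illustrate the axis parameter, here is a small sketch on made-up data. The column names q1 to q3 and the row labels x and y are hypothetical.

```python
import pandas as pd

# Hypothetical figures: rows are products, columns are quarters.
df = pd.DataFrame(
    {"q1": [100.0, 200.0], "q2": [110.0, 150.0], "q3": [121.0, 300.0]},
    index=["x", "y"],
)

# Default (axis=0): each value is compared with the value in the row above it.
by_column = df.pct_change()

# axis=1: each value is compared with the value in the column to its left.
by_row = df.pct_change(axis=1)

print(by_column)
print(by_row)
```

With axis=1 the first column is all NaN, because it has no column to its left to compare with; with the default, the first row is all NaN for the same reason.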

Covariance

Covariance is computed between two numeric sequences. A Series object has a cov() method that calculates the covariance between it and another Series. Missing values (NA) are excluded automatically.
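The handling of missing values can be checked with a small sketch. The data is made up, and the comparison assumes that a pair is dropped when either side is missing.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
s2 = pd.Series([2.0, 4.0, 6.0, np.nan, 10.0])

# cov() ignores positions where either Series has a missing value.
with_na = s1.cov(s2)

# The same computation restricted by hand to the complete pairs.
mask = s1.notna() & s2.notna()
complete = s1[mask].cov(s2[mask])

print(with_na, complete)
```

Both values should be the same, since only the three positions where s1 and s2 are both present contribute.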

Cov Series

Example

    import pandas as pd
    import numpy as np

    s1 = pd.Series(np.random.randn(10))
    s2 = pd.Series(np.random.randn(10))
    # Covariance between the two Series
    print(s1.cov(s2))

Running Result:

    -0.12978405324

When cov() is called on a DataFrame, it calculates the covariance between every pair of columns.

Example

    import pandas as pd
    import numpy as np

    frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
    # Covariance between two columns, then the full covariance matrix
    print(frame['a'].cov(frame['b']))
    print(frame.cov())

Running Result:

    -0.58312921152741437
              a         b         c         d         e
    a  1.780628 -0.583129 -0.185575  0.003679 -0.136558
    b -0.583129  1.297011  0.136530 -0.523719  0.251064
    c -0.185575  0.136530  0.915227 -0.053881 -0.058926
    d  0.003679 -0.523719 -0.053881  1.521426 -0.487694
    e -0.136558  0.251064 -0.058926 -0.487694  0.960761

Note that the covariance between columns a and b printed by the first statement is the same as the value in row a, column b of the matrix returned by cov() on the DataFrame.
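The relationship between the pairwise value and the matrix can be verified in code. The sketch below uses a seeded NumPy generator (an assumption made for reproducibility, not part of the original example) and also checks two general properties of a covariance matrix.

```python
import numpy as np
import pandas as pd

# Seeded generator so the random data is the same on every run.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((10, 3)), columns=["a", "b", "c"])

cov = frame.cov()

# The matrix entry for (a, b) matches the pairwise Series result.
pairwise_matches = np.isclose(cov.loc["a", "b"], frame["a"].cov(frame["b"]))

# The matrix is symmetric, and its diagonal holds each column's variance.
is_symmetric = np.allclose(cov.to_numpy(), cov.to_numpy().T)
diagonal_is_variance = np.allclose(np.diag(cov.to_numpy()), frame.var().to_numpy())

print(pairwise_matches, is_symmetric, diagonal_is_variance)
```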

Correlation

Correlation measures the strength of the relationship between two arrays of values (Series). The method parameter selects how it is calculated: pearson (the default) measures linear correlation, while spearman and kendall are rank-based.
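To see how the choice of method matters, the sketch below compares pearson and spearman on made-up data where y grows with x but not linearly. The kendall method is left out because it may require SciPy to be installed.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
frame = pd.DataFrame({"x": x, "y": x ** 3})

# Pearson measures linear association, so a cubic relationship scores below 1.
pearson = frame.corr(method="pearson").loc["x", "y"]

# Spearman works on ranks, so any strictly increasing relationship scores 1.
spearman = frame.corr(method="spearman").loc["x", "y"]

# Spearman is equivalent to Pearson computed on the ranks of the data.
rank_pearson = frame.rank().corr().loc["x", "y"]

print(pearson, spearman, rank_pearson)
```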

Example

    import pandas as pd
    import numpy as np

    frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
    # Correlation between two columns, then the full correlation matrix
    print(frame['a'].corr(frame['b']))
    print(frame.corr())

Running Result:

    -0.383712785514
              a         b         c         d         e
    a  1.000000 -0.383713 -0.145368  0.002235 -0.104405
    b -0.383713  1.000000  0.125311 -0.372821  0.224908
    c -0.145368  0.125311  1.000000 -0.045661 -0.062840
    d  0.002235 -0.372821 -0.045661  1.000000 -0.403380
    e -0.104405  0.224908 -0.062840 -0.403380  1.000000

Whether non-numeric columns are excluded depends on the pandas version: older releases drop them automatically, while pandas 2.0 and later raise an error unless you pass numeric_only=True.
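One way to make the exclusion of non-numeric columns explicit is sketched below. It assumes a pandas version that accepts the numeric_only argument to corr() (1.5 or later, as far as I know); the column name label is hypothetical.

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0],
        "b": [2.0, 1.0, 4.0, 3.0],
        "label": list("wxyz"),  # non-numeric column
    }
)

# Ask corr() to ignore non-numeric columns explicitly.
corr = frame.corr(numeric_only=True)

# An alternative: select the numeric columns first, then call corr().
corr_selected = frame.select_dtypes("number").corr()

print(corr)
```

In both cases the label column does not appear in the resulting matrix.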

Data ranking

Data ranking assigns a rank to each element in an array of values. By default, tied elements are each given the average of the ranks they span.

Example

    import pandas as pd
    import numpy as np

    s = pd.Series(np.random.randn(5), index=list('abcde'))
    s['d'] = s['b']  # so there's a tie
    print(s.rank())

Running Result:

    a    1.0
    b    3.5
    c    2.0
    d    3.5
    e    5.0
    dtype: float64

rank() takes an optional ascending parameter, which defaults to True. When it is False, the data is ranked in reverse, so larger values are assigned smaller ranks.
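A minimal sketch of the ascending parameter on made-up data, with a tie between b and d:

```python
import pandas as pd

s = pd.Series([10, 30, 20, 30], index=list("abcd"))

# Default: the smallest value gets rank 1.
asc = s.rank()
print(asc.tolist())   # [1.0, 3.5, 2.0, 3.5]

# ascending=False: the largest value gets rank 1.
desc = s.rank(ascending=False)
print(desc.tolist())  # [4.0, 1.5, 3.0, 1.5]
```

The two tied values share ranks 3 and 4 in ascending order and ranks 1 and 2 in descending order, so they receive the average in each case.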

rank() supports different tie-breaking methods, selected with the method parameter:

average − The average rank of the tied group (the default).
min − The lowest rank in the group.
max − The highest rank in the group.
first − Ranks are assigned in the order in which the tied values appear in the array.
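The tie-breaking methods listed above can be compared side by side. This sketch uses made-up data in which b and d are tied; the ranks DataFrame is only a convenience for display.

```python
import pandas as pd

s = pd.Series([10, 30, 20, 30], index=list("abcd"))

# Rank the same data with each tie-breaking method.
ranks = pd.DataFrame(
    {method: s.rank(method=method) for method in ["average", "min", "max", "first"]}
)
print(ranks)

# The tied values b and d occupy ranks 3 and 4:
#   average -> 3.5 and 3.5
#   min     -> 3.0 and 3.0
#   max     -> 4.0 and 4.0
#   first   -> 3.0 (b appears first) and 4.0
```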
