
Descriptive Statistics in Pandas

Pandas Descriptive Statistics Operation Example

A DataFrame supports a large number of descriptive statistics and related operations. Most of them are aggregations, such as sum() and mean(), which return a result with fewer dimensions than the input, but some, such as cumsum() and cumprod(), produce an object of the same size. Generally, these methods take an axis argument, just like ndarray.{sum, std, ...}, and the axis can be specified by name or by integer: index (axis=0, the default) or columns (axis=1).
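As a rough illustration of the axis argument and of the difference between an aggregation and a same-size operation, here is a minimal sketch. The two-column frame and the variable names are hypothetical and are not the data used in the rest of this chapter.

```python
import pandas as pd

# Hypothetical toy frame, chosen so the results are easy to check by hand
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Aggregation down the rows: one value per column (axis=0 or axis="index")
col_totals = df.sum(axis=0)

# Aggregation across the columns: one value per row (axis=1 or axis="columns")
row_totals = df.sum(axis="columns")

# cumsum() is not a reduction: the result has the same shape as the input
running = df.cumsum()

print(col_totals)
print(row_totals)
print(running)
```

The first two results are Series, indexed by column label and by row label respectively, while the cumulative sum is still a 3 x 2 DataFrame.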

Let's create a DataFrame and use this object for all operations in this chapter.

Example

 import pandas as pd
 import numpy as np
 #Create a series dictionary
 d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
    'Lee','David','Gasper','Betina','Andres']),
    'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
    'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
 }
 # Create a DataFrame
 df = pd.DataFrame(d)
 print(df)

Running Results:

    Age    Name  Rating
0    25     Tom    4.23
1    26   James    3.24
2    25   Ricky    3.98
3    23     Vin    2.56
4    30   Steve    3.20
5    29   Smith    4.60
6    23    Jack    3.80
7    34     Lee    3.78
8    40   David    2.98
9    30  Gasper    4.80
10   51  Betina    4.10
11   46  Andres    3.65

sum()

Returns the sum of the values along the requested axis. By default, the axis is the index (axis=0).

 import pandas as pd
 import numpy as np
  
 # Create a Series dictionary
 d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
    'Lee','David','Gasper','Betina','Andres']),
    'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
    'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
 }
 #Create a DataFrame
 df = pd.DataFrame(d)
 print(df.sum())

Running Results:

Age                                                     382
Name      TomJamesRickyVinSteveSmithJackLeeDavidGasperBe...
Rating                                                44.92
dtype: object

Each column is summed individually. For the string column Name, the values are concatenated into a single string.
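If the concatenated string is not wanted, the sum can be restricted to the numerical columns. The following is a minimal sketch, assuming a three-row subset of the chapter's data; numeric_only and select_dtypes() are existing pandas features, but the variable names are only illustrative.

```python
import pandas as pd

# First three rows of the chapter's data, used here only for illustration
df = pd.DataFrame({
    "Name": ["Tom", "James", "Ricky"],
    "Age": [25, 26, 25],
    "Rating": [4.23, 3.24, 3.98],
})

# Option 1: ask sum() to ignore non-numeric columns
numeric_totals = df.sum(numeric_only=True)

# Option 2: select the numeric columns first, then aggregate
selected_totals = df.select_dtypes(include="number").sum()

print(numeric_totals)
print(selected_totals)
```

Both options leave out the Name column and return totals for Age and Rating only.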

axis=1

With axis=1, the sum is taken across the columns of each row, as the following program shows.

 import pandas as pd
 import numpy as np
  
 #Create a series dictionary
 d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
    'Lee','David','Gasper','Betina','Andres']),
    'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
    'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
 }
  
 #Create a DataFrame
 df = pd.DataFrame(d)
 # numeric_only=True skips the string column Name, which cannot be added to numbers
 print(df.sum(axis=1, numeric_only=True))

Running Results:

0     29.23
1     29.24
2     28.98
3     25.56
4     33.20
5     33.60
6     26.80
7     37.78
8     42.98
9     34.80
10    55.10
11    49.65
dtype: float64

mean()

Returns the arithmetic mean of the values along the requested axis.

 import pandas as pd
 import numpy as np
 #Create a series dictionary
 d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
    'Lee','David','Gasper','Betina','Andres']),
    'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
    'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
 }
 #Create a DataFrame
 df = pd.DataFrame(d)
 # numeric_only=True restricts the mean to the numerical columns (Age, Rating)
 print(df.mean(numeric_only=True))

Running Results:

Age       31.833333
Rating     3.743333
dtype: float64

std()

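Returns the sample standard deviation of the numerical columns, computed with Bessel's correction (the sum of squared deviations is divided by N - 1 rather than N).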

 import pandas as pd
 import numpy as np
 #Create a series dictionary
 d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
    'Lee','David','Gasper','Betina','Andres']),
    'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
    'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
 }
 #Create a DataFrame
 df = pd.DataFrame(d)
 # numeric_only=True restricts the standard deviation to the numerical columns
 print(df.std(numeric_only=True))

Running Results:

Age       9.232682
Rating    0.661628
dtype: float64
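To make the effect of Bessel's correction concrete, the following minimal sketch recomputes the Age result by hand and compares it with the uncorrected (population) version. The ddof parameter is part of pandas; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# The Age column from the chapter's DataFrame
ages = pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46])

# pandas default: ddof=1, i.e. divide by N - 1 (Bessel's correction)
sample_std = ages.std()

# ddof=0 divides by N instead, which is what numpy.std does by default
population_std = ages.std(ddof=0)

# The same sample standard deviation written out explicitly
n = len(ages)
manual_std = np.sqrt(((ages - ages.mean()) ** 2).sum() / (n - 1))

print(sample_std, population_std, manual_std)
```

The sample value matches the 9.232682 shown in the output above, and the population value is slightly smaller because it divides by 12 instead of 11.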

Functions & Description

The following table lists the most important descriptive statistics functions in Pandas:

Number  Method     Description
1       count()    Number of non-null values
2       sum()      Sum of values
3       mean()     Mean of values
4       median()   Median of values
5       mode()     Mode of values
6       std()      Standard deviation of values
7       min()      Minimum value
8       max()      Maximum value
9       abs()      Absolute value
10      prod()     Product of values
11      cumsum()   Cumulative sum
12      cumprod()  Cumulative product

Note: Because a DataFrame is a heterogeneous data structure, not every function works on every column.
  • Functions such as sum() and cumsum() work with both numerical and string data without raising an error; on strings they concatenate the values. Aggregating strings is rarely useful, but no exception is thrown.
  • When a DataFrame contains string data, functions such as abs() and cumprod() raise an exception, because these operations cannot be performed on strings.
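
The behaviour described in the note can be checked with a minimal sketch. The three-row frame is a hypothetical subset of the chapter's data, and the Name column is given an explicit object dtype as an assumption, so that the strings are handled as generic Python objects.

```python
import pandas as pd

# Hypothetical three-row frame with one string column and one numeric column
df = pd.DataFrame({
    "Name": pd.Series(["Tom", "James", "Ricky"], dtype="object"),
    "Age": [25, 26, 25],
})

# sum() and cumsum() concatenate the strings instead of failing
totals = df.sum()
running = df.cumsum()
print(totals)
print(running)

# abs() and cumprod() are undefined for strings, so pandas raises an exception
errors = {}
for method_name in ("abs", "cumprod"):
    try:
        getattr(df, method_name)()
    except Exception as exc:
        errors[method_name] = type(exc).__name__
print(errors)
```

One way to avoid the exceptions is to apply such functions to the numerical columns only, for example df[['Age']].abs().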

Summarize data

The describe() function computes summary statistics for the columns of a DataFrame.

 import pandas as pd
 import numpy as np
 #Create a series dictionary
 d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
    'Lee','David','Gasper','Betina','Andres']),
    'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
    'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
 }
 #Create a DataFrame
 df = pd.DataFrame(d)
 print(df.describe())

Running Results:

                Age         Rating
count    12.000000      12.000000
mean     31.833333       3.743333
std       9.232682       0.661628
min      23.000000       2.560000
25%      25.000000       3.230000
50%      29.500000       3.790000
75%      35.500000       4.132500
max      51.000000       4.800000

describe() reports the count, mean, standard deviation, minimum, maximum, and quartiles (25%, 50%, 75%) of each column; the IQR is the difference between the 75% and 25% values. By default, string columns are excluded and only the numerical columns are summarized. The include argument specifies which columns to consider when summarizing. It takes a list of dtypes; by default, only numerical columns are included.

  • object: summarize string columns
  • number: summarize numerical columns
  • all: summarize all columns together (pass it as the string 'all', not as a list)
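
describe() lists the quartiles rather than the IQR itself, so the IQR has to be derived from them. The following minimal sketch does this while passing include=['number'] explicitly; the variable names are only illustrative.

```python
import pandas as pd

# The chapter's data, written with plain lists for brevity
df = pd.DataFrame({
    "Name": ["Tom", "James", "Ricky", "Vin", "Steve", "Smith", "Jack",
             "Lee", "David", "Gasper", "Betina", "Andres"],
    "Age": [25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46],
    "Rating": [4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65],
})

# Summarize only the numerical columns
summary = df.describe(include=["number"])

# Interquartile range: third quartile minus first quartile, per column
iqr = summary.loc["75%"] - summary.loc["25%"]
print(iqr)
```

With the quartiles shown in the output above, this gives 35.5 - 25.0 = 10.5 for Age and 4.1325 - 3.23 = 0.9025 for Rating.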

The following program passes include=['object'] to summarize only the string columns:

 import pandas as pd
 import numpy as np
 #Create a series dictionary
 d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
    'Lee','David','Gasper','Betina','Andres']),
    'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
    'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
 }
 #Create a DataFrame
 df = pd.DataFrame(d)
 print(df.describe(include=['object']))

Running Results:

           Name
count       12
unique      12
top      Ricky
freq         1

The following program passes include='all' to summarize all columns together:

 import pandas as pd
 import numpy as np
 #Create a series dictionary
 d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
    'Lee','David','Gasper','Betina','Andres']),
    'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
    'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
 }
 #Create a DataFrame
 df = pd.DataFrame(d)
 print(df.describe(include='all'))

Running Results:

              Age   Name     Rating
count   12.000000     12  12.000000
unique        NaN     12        NaN
top           NaN  Ricky        NaN
freq          NaN      1        NaN
mean    31.833333    NaN   3.743333
std      9.232682    NaN   0.661628
min     23.000000    NaN   2.560000
25%     25.000000    NaN   3.230000
50%     29.500000    NaN   3.790000
75%     35.500000    NaN   4.132500
max     51.000000    NaN   4.800000