GroupBy in Pandas

Pandas GroupBy operation example

Any groupby operation involves the following steps on the original object:

Split the object into groups
Apply a function to each group
Combine the results

In many cases, we split the data into groups and then apply a function to each subset. In the apply step, we can perform the following operations:

Aggregation − compute a summary statistic for each group
Transformation − perform a group-specific operation that returns a result the same size as its input
Filter − keep or discard whole groups according to a condition
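The three steps can be spelled out by hand to see what groupby does in a single call. This is only a sketch on a small hypothetical frame (the four rows and the names pieces, totals and manual are made up for the illustration), not a description of how pandas implements grouping internally.

```python
import pandas as pd

# Small hypothetical frame, used only for this illustration.
df = pd.DataFrame({'Team': ['Riders', 'Devils', 'Riders', 'Devils'],
                   'Points': [876, 863, 789, 673]})

# Split: collect the rows that belong to each distinct key.
pieces = {team: df[df['Team'] == team] for team in df['Team'].unique()}

# Apply: compute one value per piece.
totals = {team: piece['Points'].sum() for team, piece in pieces.items()}

# Combine: assemble the per-group values into a single Series.
manual = pd.Series(totals, name='Points').sort_index()

# groupby performs the same three steps in one call.
grouped = df.groupby('Team')['Points'].sum()

print(manual.to_dict() == grouped.to_dict())
```

The manual version and the groupby version produce the same totals per team; groupby is simply shorter and much faster on large data.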

Now we create a DataFrame object and perform all operations on it.

Example

# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
print(df)

The output is as follows (older pandas versions sort the columns alphabetically, as shown here; recent versions keep the order in which the dictionary keys were written):

    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
2      863     2  Devils  2014
3      673     3  Devils  2015
4      741     3   Kings  2014
5      812     4   kings  2015
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
9      701     4  Royals  2014
10     804     1  Royals  2015
11     690     2  Riders  2017

Split data into groups

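A pandas object can be split along any of its axes. There are several ways to specify the grouping, such as:

obj.groupby('key')
obj.groupby(['key1', 'key2'])
obj.groupby(key, axis=1)

The axis=1 form groups columns rather than rows; it is deprecated in recent pandas versions (2.1 and later).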
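As a quick check of how the choice of key changes the split, the sketch below groups the tutorial's data by one column and by two columns and reads the ngroups attribute. The variable names by_team, by_year and by_team_year are chosen here for illustration.

```python
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

# One key: one group per distinct team name ('Kings' and 'kings' differ).
by_team = df.groupby('Team')

# One key on a different column: one group per season.
by_year = df.groupby('Year')

# Two keys: one group per distinct (team, year) combination.
by_team_year = df.groupby(['Team', 'Year'])

print(by_team.ngroups, by_year.ngroups, by_team_year.ngroups)
```

Because every (team, year) pair occurs only once in this data, grouping by both columns yields as many groups as there are rows.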

Now let's see how to create a groupby object from the DataFrame.

Example

# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
print(df.groupby('Team'))

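Printing the groupby object shows only its representation (a DataFrameGroupBy object and its memory address), not the grouped data. The following sections show how to inspect the groups themselves.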

View groups

Example

# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
# .groups maps each group name to the row labels that belong to it
print(df.groupby('Team').groups)

The output is as follows:

{'Kings': Int64Index([4, 6, 7], dtype='int64'),
 'Devils': Int64Index([2, 3], dtype='int64'),
 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'),
 'Royals': Int64Index([9, 10], dtype='int64'),
 'kings': Int64Index([5], dtype='int64')}
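The output above puts 'kings' in a group of its own because grouping is case-sensitive. If that spelling is a data-entry slip rather than a separate team (an assumption, since the source does not say), one option is to group by a normalised key instead of the raw column. The sketch below passes a Series as the key; the name team_key is hypothetical.

```python
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

# Normalise the capitalisation so 'kings' and 'Kings' share one key.
team_key = df['Team'].str.title()

# groupby accepts a Series aligned with the frame as the grouping key.
groups = df.groupby(team_key).groups

print(sorted(groups))
print(list(groups['Kings']))
```

With the normalised key there are four groups, and row 5 joins the other Kings rows. The original Team column is left unchanged.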

Group by multiple columns

Example

# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
# with several keys, each group name is a (Team, Year) tuple
print(df.groupby(['Team', 'Year']).groups)

The output is as follows:

{('Kings', 2014): Int64Index([4], dtype='int64'),
 ('Royals', 2014): Int64Index([9], dtype='int64'),
 ('Riders', 2014): Int64Index([0], dtype='int64'),
 ('Riders', 2015): Int64Index([1], dtype='int64'),
 ('Kings', 2016): Int64Index([6], dtype='int64'),
 ('Riders', 2016): Int64Index([8], dtype='int64'),
 ('Riders', 2017): Int64Index([11], dtype='int64'),
 ('Devils', 2014): Int64Index([2], dtype='int64'),
 ('Devils', 2015): Int64Index([3], dtype='int64'),
 ('kings', 2015): Int64Index([5], dtype='int64'),
 ('Royals', 2015): Int64Index([10], dtype='int64'),
 ('Kings', 2017): Int64Index([7], dtype='int64')}

Traversing groups

A groupby object can be iterated over, much like the result of itertools.groupby: each iteration yields the group name together with the rows of that group.

Example

# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Year')
# each iteration yields the group name and the sub-DataFrame for that group
for name, group in grouped:
    print(name)
    print(group)

The output is as follows:

2014
   Points  Rank    Team  Year
0     876     1  Riders  2014
2     863     2  Devils  2014
4     741     3   Kings  2014
9     701     4  Royals  2014
2015
    Points  Rank    Team  Year
1      789     2  Riders  2015
3      673     3  Devils  2015
5      812     4   kings  2015
10     804     1  Royals  2015
2016
   Points  Rank    Team  Year
6     756     1   Kings  2016
8     694     2  Riders  2016
2017
    Points  Rank    Team  Year
7      788     1   Kings  2017
11     690     2  Riders  2017

By default, each group is labelled with its group name, that is, the value of the grouping key (here, the year).
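When the grouping uses more than one key, the name yielded on each iteration is a tuple with one entry per key. The sketch below collects those names into a list; the variable names is introduced only for this illustration.

```python
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

names = []
# The group name unpacks into one value per grouping key.
for (team, year), group in df.groupby(['Team', 'Year']):
    names.append((team, int(year)))

print(len(names))
print(names[0], names[-1])
```

Groups are visited in sorted key order by default, so the capitalised team names come before 'kings'.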

Select a group

Using the get_group() method, we can select a group.

Example

# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Year')
# return only the rows whose Year is 2014
print(grouped.get_group(2014))

The output is as follows:

   Points  Rank    Team  Year
0     876     1  Riders  2014
2     863     2  Devils  2014
4     741     3   Kings  2014
9     701     4  Royals  2014
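Two details of get_group() are easy to miss: with several grouping keys the group is named by a tuple, and asking for a name that does not exist raises a KeyError. A minimal sketch, with riders_2016 and missing as illustrative variable names:

```python
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

# With two keys, the group name is a (Team, Year) tuple.
riders_2016 = df.groupby(['Team', 'Year']).get_group(('Riders', 2016))
print(riders_2016)

# A name that matches no group raises KeyError.
try:
    missing = df.groupby('Year').get_group(2020)
except KeyError:
    missing = None
print(missing)
```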

Aggregate

Aggregation functions return an aggregated value for each group. Once a group object is created, several aggregation operations can be performed on the grouped data.

The most direct way is to use the aggregate method or its equivalent shorthand, agg.

Example

# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Year')
# the name 'mean' gives the same result as np.mean and avoids the
# FutureWarning that recent pandas versions emit for NumPy callables
print(grouped['Points'].agg('mean'))

The output is as follows:

Year
2014    795.25
2015    769.50
2016    725.00
2017    739.00
Name: Points, dtype: float64
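For the common statistics, the groupby object also offers direct methods, so agg('mean') and .mean() are interchangeable. The sketch below checks that equivalence and adds a sum for comparison; the variable names are chosen here for illustration.

```python
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

points_by_year = df.groupby('Year')['Points']

# Two spellings of the same aggregation.
mean_via_agg = points_by_year.agg('mean')
mean_direct = points_by_year.mean()

# A different summary statistic on the same groups.
total = points_by_year.sum()

print(mean_via_agg.equals(mean_direct))
print(total)
```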

To see the size of each group, apply NumPy's size function through agg(); every column then reports the number of rows in the group.

Example

# import the pandas and numpy libraries
import pandas as pd
import numpy as np
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Team')
print(grouped.agg(np.size))

The output is as follows:

        Points  Rank  Year
Team
Devils       2     2     2
Kings        3     3     3
Riders       4     4     4
Royals       2     2     2
kings        1     1     1
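Since every column in the table above carries the same number, the groupby object's own size() method is often the more convenient choice: it returns a single Series with one row count per group. A minimal sketch, with group_sizes as an illustrative name:

```python
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

# size() counts the rows of each group and returns one Series.
group_sizes = df.groupby('Team').size()
print(group_sizes)
```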

Applying multiple aggregation functions at once

With a grouped Series, you can also pass a list of functions to agg() and get a DataFrame with one column per function. (Older pandas versions also accepted a dictionary here; recent versions raise an error for it.)

Example

# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Team')
# function names replace np.sum, np.mean and np.std; the result is the same
print(grouped['Points'].agg(['sum', 'mean', 'std']))

The output is as follows:

         sum        mean         std
Team
Devils  1536  768.000000  134.350288
Kings   2285  761.666667   24.006943
Riders  3049  762.250000   88.567771
Royals  1505  752.500000   72.831998
kings    812  812.000000         NaN
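The NaN in the std column for 'kings' is not an error. pandas computes the sample standard deviation (ddof=1) by default, which is undefined for a group with a single row. The sketch below compares it with the population standard deviation (ddof=0); the variable names are illustrative.

```python
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

points_by_team = df.groupby('Team')['Points']

# Default: sample standard deviation, dividing by (n - 1).
sample_std = points_by_team.std()

# Population standard deviation, dividing by n.
population_std = points_by_team.std(ddof=0)

print(sample_std['kings'], population_std['kings'])
```

Which of the two is appropriate depends on whether the rows are treated as a sample or as the whole population; the tutorial's output uses the default.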

Transformation

A transformation on a group or a column returns an object that is indexed the same way, and has the same size, as the object being grouped. The function passed to transform must therefore return a result that is the same size as the group chunk it receives.

Example

# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Team')
# standardize each value within its team, then scale by 10
score = lambda x: (x - x.mean()) / x.std() * 10
print(grouped.transform(score))

The output is as follows:

       Points       Rank       Year
0   12.843272 -15.000000 -11.618950
1    3.020286   5.000000  -3.872983
2    7.071068  -7.071068  -7.071068
3   -7.071068   7.071068   7.071068
4   -8.608621  11.547005 -10.910895
5         NaN        NaN        NaN
6   -2.360428  -5.773503   2.182179
7   10.969049  -5.773503   8.728716
8   -7.705963   5.000000   3.872983
9   -7.071068   7.071068  -7.071068
10   7.071068  -7.071068   7.071068
11  -8.157595   5.000000  11.618950
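Because a transformation keeps the original index, its result lines up row for row with the source frame and can be combined with it directly. As a sketch, the code below broadcasts each team's mean Points to every row of that team and subtracts it; team_mean and diff are names introduced for the illustration.

```python
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

# One value per original row: the mean Points of that row's team.
team_mean = df.groupby('Team')['Points'].transform('mean')

# Same index as df, so the subtraction aligns row by row.
diff = df['Points'] - team_mean

print(len(team_mean) == len(df))
print(diff)
```

An aggregation such as mean() would instead return one row per team, which is the practical difference between the two operations.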

Filter

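Filtering keeps or discards whole groups based on a condition and returns a subset of the original data. The filter() method takes a function that returns True or False for each group; the example below keeps only the teams that appear three or more times.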

Example

# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
# keep the rows of every team that has at least three entries
print(df.groupby('Team').filter(lambda x: len(x) >= 3))

The output is as follows:

    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
4      741     3   Kings  2014
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
11     690     2  Riders  2017
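The filter condition can use any property of the group, not only its length. As a further sketch, the code below keeps the teams whose average Points exceed a threshold; the value 765 and the name strong are arbitrary choices for the illustration.

```python
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

# The function receives each group as a DataFrame and returns True or False.
strong = df.groupby('Team').filter(lambda g: g['Points'].mean() > 765)

print(strong)
```

The surviving rows keep their original order and index labels, so the result can be used like any other slice of the frame.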
