English | 简体中文 | 繁體中文 | Русский язык | Français | Español | Português | Deutsch | 日本語 | 한국어 | Italiano | بالعربية

SQL Operations in Pandas

Specific examples of Pandas SQL operations

Since many potential Pandas users are familiar with SQL, this page aims to provide some examples to illustrate how to perform various SQL operations using Pandas.

 import pandas as pd
 url = 'https://raw.github.com/pandasdev/
 pandas/master/pandas/tests/data/tips.csv
 tips=pd.read_csv(url)
 print(tips.head())

The running result is as follows:

      total_bill    tip    sex    smoker    day    time    size
0        16.99  1.01   Female    No    Sun    Dinner      2
1        10.34  1.66     Male    No    Sun    Dinner      3
2        21.01  3.50    Male    No    Sun    Dinner      3
3        23.68  3.31     Male    No    Sun    Dinner      2
4        24.59  3.61   Female    No    Sun    Dinner      4

query

In SQL, selection is done using a comma-separated list of the columns you have selected (or using*to select all columns):

 SELECT total_bill, tip, smoker, time
 from tips
 LIMIT 5;

Column selection can be completed by passing a list of column names to DataFrame using Pandas:

 tips[['total_bill', 'tip', 'smoker', 'time']].head(5)

Let's look at a complete example:

 import pandas as pd
 url = 'https://raw.github.com/pandasdev/
 pandas/master/pandas/tests/data/tips.csv
  
 tips=pd.read_csv(url)
 print(tips[['total_bill', 'tip', 'smoker', 'time']].head(5)

The running result is as follows:

   total_bill    tip    smoker    time
0       16.99  1.01      No    Dinner
1       10.34  1.66      No    Dinner
2       21.01  3.50    No    Dinner
3       23.68  3.31      No    Dinner
4       24.59  3.61      No    Dinner

Calling a DataFrame without a column name list will display all columns (similar to the comma-separated list of columns in SQL's*).

WHERE clause query

Filtering in SQL is done through the WHERE clause.

 SELECT * from tips WHERE time = 'Dinner' LIMIT 5;

DataFrame can be filtered in many ways. The most intuitive method is to use boolean indexing.

 tips[tips['time'] == 'Dinner'].head(5)

Let's look at a complete example

 import pandas as pd
 url = 'https://raw.github.com/pandasdev/
 pandas/master/pandas/tests/data/tips.csv
 tips=pd.read_csv(url)
 print(tips[tips['time'] == 'Dinner']).head(5)

The running result is as follows:

     total_bill    tip    sex    smoker    day    time    size
0       16.99  1.01   Female    No    Sun    Dinner    2
1       10.34  1.66     Male    No    Sun    Dinner    3
2       21.01  3.50    Male    No    Sun    Dinner    3
3       23.68  3.31     Male    No    Sun    Dinner    2
4       24.59  3.61   Female    No    Sun    Dinner    4

The above statement will generate a series of True / Pass a False object to DataFrame and return all rows with True.

GroupBy grouping

This operation retrieves the number of records in each group of the entire dataset. For example, query the gender grouping and count:

 SELECT sex, count(*)
 from tips
 GROUP BY sex;

The operation in Pandas is as follows:

 tips.groupby('sex').size()

Let's look at a complete example

 import pandas as pd
 url = 'https://raw.github.com/pandasdev/
 pandas/master/pandas/tests/data/tips.csv
 tips=pd.read_csv(url)
 print tips.groupby('sex').size()

The running result is as follows:

 sex
 Female 87
 Male 157
 dtype: int64

Query the number of rows

SQL uses LIMIT to return N rows:

 SELECT * from tips
 LIMIT 5 ;

Operations in Pandas are as follows:

 tips.head(5)

Let's look at a complete example

 import pandas as pd
 url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv
 tips=pd.read_csv(url)
 tips = tips[['smoker', 'day', 'time']].head(5)
 print tips

The running result is as follows:

     smoker day time
0 No Sun Dinner
1      No Sun Dinner
2      No Sun Dinner
3      No Sun Dinner
4      No Sun Dinner