
Merging and Joining in Pandas

Pandas join operation example

Pandas has a comprehensive set of high-performance in-memory join operations that are very similar to the joins of SQL-based relational databases.
Pandas provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects:

 pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
 left_index=False, right_index=False, sort=False)

Here, we use the following parameters:

left − A DataFrame object.
right − Another DataFrame object.
on − The column name(s) to join on. They must be found in both the left and right DataFrame objects.
left_on − The columns of the left DataFrame to use as keys. They can be column names or arrays of length equal to the length of the DataFrame.
right_on − The columns of the right DataFrame to use as keys. They can be column names or arrays of length equal to the length of the DataFrame.
left_index − If True, the index (row labels) of the left DataFrame is used as its join key. If the DataFrame has a MultiIndex (hierarchical), the number of levels must match the number of join keys in the right DataFrame.
right_index − The same usage as left_index, for the right DataFrame.
how − One of 'left', 'right', 'outer', 'inner'. The default is 'inner'. Each method is described below.
sort − Sort the result DataFrame by the join keys in lexicographical order. In current pandas releases the default is False; leaving it False often improves performance substantially.
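The key parameters above can be sketched in a short example. The frames and column names here (students, scores, student_id, sid, score) are hypothetical and chosen only to illustrate left_on, right_on and right_index; they are not part of the examples that follow.

```python
import pandas as pd

# Hypothetical frames: the key column has a different name on each side.
students = pd.DataFrame({
    'student_id': [1, 2, 3],
    'Name': ['Alex', 'Amy', 'Allen']})
scores = pd.DataFrame({
    'sid': [2, 3, 4],
    'score': [88, 92, 75]})

# left_on / right_on name the key column of each side separately.
by_column = pd.merge(students, scores, left_on='student_id', right_on='sid')
print(by_column)

# right_index=True uses the index of the right frame as its join key.
scores_by_index = scores.set_index('sid')
by_index = pd.merge(students, scores_by_index,
                    left_on='student_id', right_index=True)
print(by_index)
```

Both calls use the default how='inner', so only the keys present on both sides (2 and 3) should appear in the results.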

Now let's create two different DataFrames and perform merge operations on them.

# import the pandas library
import pandas as pd
left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
print(left)
print(right)

The running result is as follows:

     Name  id subject_id
0    Alex   1       sub1
1     Amy   2       sub2
2   Allen   3       sub4
3   Alice   4       sub6
4  Ayoung   5       sub5
    Name  id subject_id
0  Billy   1       sub2
1  Brian   2       sub4
2   Bran   3       sub3
3  Bryce   4       sub6
4  Betty   5       sub5

Merging two dataframes on a single key

import pandas as pd
left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
# join on the shared 'id' column
print(pd.merge(left, right, on='id'))

The running result is as follows:

   Name_x  id subject_id_x Name_y subject_id_y
0    Alex   1         sub1  Billy         sub2
1     Amy   2         sub2  Brian         sub4
2   Allen   3         sub4   Bran         sub3
3   Alice   4         sub6  Bryce         sub6
4  Ayoung   5         sub5  Betty         sub5
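The _x and _y endings in the result come from merge's suffixes parameter, which defaults to ('_x', '_y') and is applied to non-key columns that exist in both frames. The following is a minimal sketch of overriding it; the suffix strings '_left' and '_right' are arbitrary choices made for illustration.

```python
import pandas as pd

left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

# Overlapping non-key columns get the given suffixes instead of _x / _y.
merged = pd.merge(left, right, on='id', suffixes=('_left', '_right'))
print(merged.columns.tolist())
```

The key column id is shared, so it appears once and without a suffix.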

Merging two dataframes on multiple keys

import pandas as pd
left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
# a row matches only when both 'id' and 'subject_id' agree
print(pd.merge(left, right, on=['id', 'subject_id']))

The running result is as follows:

      Name_x   id   subject_id   Name_y
0    Alice    4         sub6    Bryce
1   Ayoung    5         sub5    Betty

Merging using the 'how' parameter

The 'how' parameter of merge determines which keys are included in the result table. If a key combination does not appear in the left or the right table, the corresponding values in the joined table are NA.

Here is a summary of the 'how' options and their SQL equivalents:

Merge Method   SQL Equivalent     Description
left           LEFT OUTER JOIN    Use keys from the left object
right          RIGHT OUTER JOIN   Use keys from the right object
outer          FULL OUTER JOIN    Use the union of keys
inner          INNER JOIN         Use the intersection of keys
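As a quick way to compare the four methods, the sketch below merges the same two frames with each 'how' value and records the number of rows. It also uses merge's indicator=True option, which adds a _merge column showing whether each row came from the left frame, the right frame, or both. The row_counts and tagged names are illustrative only.

```python
import pandas as pd

left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

# Count the rows produced by each merge method on the same key.
row_counts = {}
for how in ['left', 'right', 'outer', 'inner']:
    result = pd.merge(left, right, on='subject_id', how=how)
    row_counts[how] = len(result)
print(row_counts)

# indicator=True adds a '_merge' column: left_only, right_only or both.
tagged = pd.merge(left, right, on='subject_id', how='outer', indicator=True)
print(tagged[['subject_id', '_merge']])
```

Here sub1 exists only in the left frame and sub3 only in the right one, which is why the outer join has one row more than either input and the inner join one row fewer.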

Left Join

import pandas as pd
left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
# keep every row of the left frame
print(pd.merge(left, right, on='subject_id', how='left'))

The running result is as follows:

   Name_x  id_x subject_id Name_y  id_y
0    Alex     1       sub1    NaN   NaN
1     Amy     2       sub2  Billy   1.0
2   Allen     3       sub4  Brian   2.0
3   Alice     4       sub6  Bryce   4.0
4  Ayoung     5       sub5  Betty   5.0

Right Join

import pandas as pd
left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
# keep every row of the right frame
print(pd.merge(left, right, on='subject_id', how='right'))

The running result is as follows:

   Name_x  id_x subject_id Name_y  id_y
0     Amy   2.0       sub2  Billy     1
1   Allen   3.0       sub4  Brian     2
2   Alice   4.0       sub6  Bryce     4
3  Ayoung   5.0       sub5  Betty     5
4     NaN   NaN       sub3   Bran     3

Outer Join

import pandas as pd
left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
# keep the rows of both frames
print(pd.merge(left, right, how='outer', on='subject_id'))

The running result is as follows:

   Name_x  id_x subject_id Name_y  id_y
0    Alex   1.0       sub1    NaN   NaN
1     Amy   2.0       sub2  Billy   1.0
2   Allen   3.0       sub4  Brian   2.0
3   Alice   4.0       sub6  Bryce   4.0
4  Ayoung   5.0       sub5  Betty   5.0
5     NaN   NaN       sub3   Bran   3.0

Inner Join

An inner join keeps only the keys that appear in both frames. Note that the separate DataFrame.join method joins on the index and honors the object on which it is called, so a.join(b) is not equal to b.join(a).

import pandas as pd
left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
# keep only the keys found in both frames
print(pd.merge(left, right, on='subject_id', how='inner'))

The running result is as follows:

   Name_x  id_x subject_id Name_y  id_y
0     Amy     2       sub2  Billy     1
1   Allen     3       sub4  Brian     2
2   Alice     4       sub6  Bryce     4
3  Ayoung     5       sub5  Betty     5
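The remark in this section that a.join(b) is not equal to b.join(a) can be illustrated with a small sketch. The frames a and b, their columns x and y, and the index labels are hypothetical and unrelated to the left and right frames used above.

```python
import pandas as pd

# Two small frames with partially overlapping index labels.
a = pd.DataFrame({'x': [1, 2, 3]}, index=['k1', 'k2', 'k3'])
b = pd.DataFrame({'y': [10, 20]}, index=['k2', 'k4'])

# DataFrame.join defaults to how='left', so the index of the
# calling object decides which rows are kept.
ab = a.join(b)
ba = b.join(a)
print(ab)
print(ba)
```

Because join keeps the caller's index by default, ab has the rows of a and ba has the rows of b, with NaN wherever the other frame has no matching label.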