Text Processing in Pandas

Pandas text processing operation examples

In this chapter, we discuss string operations on a basic Series / Index. In the following chapters, we will learn how to apply these string functions to DataFrames.

Pandas provides a set of string functions that make it easy to manipulate string data. Most importantly, these functions ignore (or exclude) missing/NaN values: a missing entry stays missing in the result instead of raising an error.
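As a small illustration of the missing-value behaviour, here is a minimal sketch. The three-element Series and the names `names` and `upper_names` are made up for this example and are not part of the examples that follow.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value in the middle.
names = pd.Series(['Tom', np.nan, 'John'])

# The missing entry does not raise an error; it simply stays missing.
upper_names = names.str.upper()
print(upper_names)
```

Only the two real strings are converted; the middle entry is still reported as missing by `upper_names.isna()`.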

Almost all of these methods correspond to Python's built-in string methods (see: https://docs.python.org/3/library/stdtypes.html#string-methods). To use them, access the Series through its str accessor, which treats each element as a string, then perform the operation.
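The relationship to Python's own string methods can be sketched as follows. The variable names are illustrative only; the point is that a call through `.str` is applied element by element and gives the same values as calling the plain Python method on each item.

```python
import pandas as pd

names = pd.Series(['Tom', 'William Rick', 'John'])

# Element-wise call through the .str accessor ...
via_accessor = names.str.upper().tolist()

# ... produces the same values as Python's str.upper() on each item.
via_python = [name.upper() for name in names]

print(via_accessor == via_python)
```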

Let's see how each operation is executed.

Method                Description
lower()               Convert the strings in the Series/Index to lowercase.
upper()               Convert the strings in the Series/Index to uppercase.
len()                 Calculate the length of each string.
strip()               Remove whitespace (including newline characters) from both sides of each string in the Series/Index.
split(' ')            Split each string with the given pattern.
cat(sep=' ')          Concatenate the Series/Index elements with the given separator.
get_dummies()         Return a DataFrame with one-hot encoded values.
contains(pattern)     Return True for each element that contains the substring, otherwise False.
replace(a,b)          Replace the value a with the value b.
repeat(value)         Repeat each element the specified number of times.
count(pattern)        Return the number of times the pattern appears in each element.
startswith(pattern)   Return True if the element in the Series/Index starts with the pattern.
endswith(pattern)     Return True if the element in the Series/Index ends with the pattern.
find(pattern)         Return the position of the first occurrence of the pattern.
findall(pattern)      Return a list of all occurrences of the pattern.
swapcase()            Swap the case: lowercase becomes uppercase and vice versa.
islower()             Check whether all characters in each string in the Series/Index are lowercase. Returns a boolean value.
isupper()             Check whether all characters in each string in the Series/Index are uppercase. Returns a boolean value.
isnumeric()           Check whether all characters in each string in the Series/Index are numeric. Returns a boolean value.
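The descriptions above mention both Series and Index, while the examples in this chapter use a Series. As a minimal sketch of the Index case (the labels and variable names are made up for illustration), the same accessor can be used like this:

```python
import pandas as pd

# Hypothetical labels with inconsistent case and stray whitespace.
labels = pd.Index([' Tom', 'William Rick ', 'JOHN'])

# The .str accessor is available on an Index just as on a Series,
# and calls can be chained.
cleaned = labels.str.strip().str.lower()
print(cleaned.tolist())
```

Chaining keeps each step readable: whitespace is removed first, then the case is normalized.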

Let's create a Series to see how all the above functions work.

 import pandas as pd
 import numpy as np

 # Sample Series with several strings and one missing value
 s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])
 print(s)

Running Result:

 0             Tom
 1    William Rick
 2            John
 3         Alber@t
 4             NaN
 5            1234
 6      SteveSmith
 dtype: object

lower()

 import pandas as pd
 import numpy as np
 s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])
 # Convert every string to lowercase; NaN is left as NaN
 print(s.str.lower())

Running Result:

 0             tom
 1    william rick
 2            john
 3         alber@t
 4             NaN
 5            1234
 6      stevesmith
 dtype: object

upper()

 import pandas as pd
 import numpy as np
 s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])
 # Convert every string to uppercase; NaN is left as NaN
 print(s.str.upper())

Running Result:

 0             TOM
 1    WILLIAM RICK
 2            JOHN
 3         ALBER@T
 4             NaN
 5            1234
 6      STEVESMITH
 dtype: object

len()

 import pandas as pd
 import numpy as np
 s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])
 # Length of each string; the missing entry yields NaN, so the result is float64
 print(s.str.len())

Running Result:

 0     3.0
 1    12.0
 2     4.0
 3     7.0
 4     NaN
 5     4.0
 6    10.0
 dtype: float64
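The lengths above are floats because the missing entry has to be represented as NaN. If whole-number lengths are preferred, one option is to drop the missing values first. This is a sketch under the assumption that discarding the missing row is acceptable for your data; the name `lengths` is illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])

# Drop the missing entry first so that no NaN is needed in the result.
lengths = s.dropna().str.len()
print(lengths.tolist())
```

The original index labels are kept, so the result has labels 0, 1, 2, 3, 5 and 6, with label 4 absent.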

strip()

 import pandas as pd
 # Note the trailing space in 'Tom ' and the leading spaces in '  William Rick'
 s = pd.Series(['Tom ', '  William Rick', 'John', 'Alber@t'])
 print(s)
 print("After Stripping:")
 print(s.str.strip())

Running Result:

 0              Tom 
 1      William Rick
 2              John
 3           Alber@t
 dtype: object
 After Stripping:
 0             Tom
 1    William Rick
 2            John
 3         Alber@t
 dtype: object

split(pattern)

 import pandas as pd
 s = pd.Series(['Tom ', '  William Rick', 'John', 'Alber@t'])
 print(s)
 print('Split Pattern:')
 # Splitting on a single space keeps empty strings for leading/trailing spaces
 print(s.str.split(' '))

Running Result:

 0              Tom 
 1      William Rick
 2              John
 3           Alber@t
 dtype: object
 Split Pattern:
 0                [Tom, ]
 1    [, , William, Rick]
 2                 [John]
 3              [Alber@t]
 dtype: object
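Splitting on a literal space produces empty strings wherever there are leading or trailing spaces. A minimal sketch of one alternative: calling `split()` with no pattern splits on runs of whitespace, and the pieces can then be picked out with `.str[0]`. The names `parts` and `first_words` are illustrative.

```python
import pandas as pd

s = pd.Series(['Tom ', '  William Rick', 'John', 'Alber@t'])

# With no pattern, split() breaks on any run of whitespace,
# so leading and trailing spaces do not create empty pieces.
parts = s.str.split()

# .str[0] picks the first item of each list.
first_words = parts.str[0]

print(parts.tolist())
print(first_words.tolist())
```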

cat(sep=pattern)

 import pandas as pd
 s = pd.Series(['Tom ', '  William Rick', 'John', 'Alber@t'])
 # Join all elements into one string, separated by '_'
 print(s.str.cat(sep='_'))

Running Result:

 Tom _  William Rick_John_Alber@t
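The stray spaces in the joined string come from the original elements. As a sketch of how the operations combine (the name `joined` is illustrative), stripping each element before concatenating gives a cleaner result:

```python
import pandas as pd

s = pd.Series(['Tom ', '  William Rick', 'John', 'Alber@t'])

# Strip each element first, then join everything with '_'.
joined = s.str.strip().str.cat(sep='_')
print(joined)
```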

get_dummies()

 import pandas as pd
 s = pd.Series(['Tom ', '  William Rick', 'John', 'Alber@t'])
 # One column per distinct value, with 1 marking the row where it occurs
 print(s.str.get_dummies())

Running Result:

      William Rick  Alber@t  John  Tom 
 0               0        0     0     1
 1               1        0     0     0
 2               0        0     1     0
 3               0        1     0     0

contains(pattern)

 import pandas as pd
 s = pd.Series(['Tom ', '  William Rick', 'John', 'Alber@t'])
 # True where the string contains a space
 print(s.str.contains(' '))

Running Result:

 0     True
 1     True
 2    False
 3    False
 dtype: bool
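A boolean result like this is typically used as a mask to select rows. The following is a minimal sketch; the names `mask` and `with_space` are illustrative, and `regex=False` is passed so that the pattern is treated as a plain substring rather than a regular expression.

```python
import pandas as pd

s = pd.Series(['Tom ', '  William Rick', 'John', 'Alber@t'])

# Treat the pattern as a literal substring, not a regular expression.
mask = s.str.contains(' ', regex=False)

# Keep only the elements for which the mask is True.
with_space = s[mask]
print(with_space.tolist())
```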

replace(a,b)

 import pandas as pd
 s = pd.Series(['Tom ', '  William Rick', 'John', 'Alber@t'])
 print(s)
 print('After replacing @ with $:')
 print(s.str.replace('@', '$'))

Running Result:

 0              Tom 
 1      William Rick
 2              John
 3           Alber@t
 dtype: object
 After replacing @ with $:
 0              Tom 
 1      William Rick
 2              John
 3           Alber$t
 dtype: object
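`replace` can also work with regular expressions. Because the default of the `regex` parameter has differed between pandas versions, the sketch below passes it explicitly. The pattern and the name `letters_only` are illustrative choices, not part of the original example.

```python
import pandas as pd

s = pd.Series(['Tom ', '  William Rick', 'John', 'Alber@t'])

# Remove every character that is not an ASCII letter or a space.
letters_only = s.str.replace(r'[^A-Za-z ]', '', regex=True)
print(letters_only.tolist())
```

Only the last element changes, because it is the only one containing a character outside the allowed set.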

repeat(value)

 import pandas as pd
 s = pd.Series(['Tom ', '  William Rick', 'John', 'Alber@t'])
 # Repeat each string twice
 print(s.str.repeat(2))

Running Result:

 0                        Tom Tom 
 1      William Rick  William Rick
 2                        JohnJohn
 3                  Alber@tAlber@t
 dtype: object

count(pattern)

 import pandas as pd
 s = pd.Series(['Tom ', '  William Rick', 'John', 'Alber@t'])
 print("The number of 'm' in each string:")
 print(s.str.count('m'))

Running Result:

 The number of 'm' in each string:
 0    1
 1    1
 2    0
 3    0
 dtype: int64

startswith(pattern)

 import pandas as pd
 s = pd.Series(['Tom ', '  William Rick', 'John', 'Alber@t'])
 print("Strings that start with 'T':")
 print(s.str.startswith('T'))

Running Result:

 Strings that start with 'T':
 0     True
 1    False
 2    False
 3    False
 dtype: bool

endswith(pattern)

 import pandas as pd
 s = pd.Series(['Tom ', '  William Rick', 'John', 'Alber@t'])
 print("Strings that end with 't':")
 print(s.str.endswith('t'))

Running Result:

 Strings that end with 't':
 0    False
 1    False
 2    False
 3     True
 dtype: bool

find(pattern)

 import pandas as pd
 s = pd.Series(['Tom ', '  William Rick', 'John', 'Alber@t'])
 # Position of the first 'e' in each string
 print(s.str.find('e'))

Running Result:

 0   -1
 1   -1
 2   -1
 3    3
 dtype: int64

"-1" indicates that no match was found in the element.

findall(pattern)

 import pandas as pd
 s = pd.Series(['Tom ', '  William Rick', 'John', 'Alber@t'])
 # List of all occurrences of 'e' in each string
 print(s.str.findall('e'))

Running Result:

 0     []
 1     []
 2     []
 3    [e]
 dtype: object

An empty list ([]) indicates that no matches were found in the element.
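The pattern passed to `findall` is a regular expression, so it can match more than a fixed substring. A minimal sketch, with an illustrative pattern and variable name, that collects the uppercase letters in each element:

```python
import pandas as pd

s = pd.Series(['Tom ', '  William Rick', 'John', 'Alber@t'])

# [A-Z] matches any single uppercase ASCII letter.
capitals = s.str.findall(r'[A-Z]')
print(capitals.tolist())
```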

swapcase()

 import pandas as pd
 s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
 # Lowercase letters become uppercase and vice versa
 print(s.str.swapcase())

Running Result:

 0             tOM
 1    wILLIAM rICK
 2            jOHN
 3         aLBER@T
 dtype: object

islower()

 import pandas as pd
 s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
 print(s.str.islower())

Running Result:

 0    False
 1    False
 2    False
 3    False
 dtype: bool

isupper()

 import pandas as pd
 s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
 print(s.str.isupper())

Running Result:

 0    False
 1    False
 2    False
 3    False
 dtype: bool

isnumeric()

 import pandas as pd
 s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
 print(s.str.isnumeric())

Running Result:

 0    False
 1    False
 2    False
 3    False
 dtype: bool
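Since every element in the Series above is mixed-case text, all three checks return False. To see the checks return True as well, here is a sketch with made-up sample values (the names `samples` and `checks` are illustrative) that runs the three checks side by side:

```python
import pandas as pd

# Hypothetical values chosen so that each check is True exactly once.
samples = pd.Series(['tom', 'TOM', 'Tom', '1234'])

# Collect the results of the three checks in one table.
checks = pd.DataFrame({
    'value': samples,
    'islower': samples.str.islower(),
    'isupper': samples.str.isupper(),
    'isnumeric': samples.str.isnumeric(),
})
print(checks)
```

Note that '1234' is neither lowercase nor uppercase, because it contains no cased characters.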