Pandas text processing operation examples
In this chapter, we discuss string operations on a basic Series / Index. In the following chapters, we will learn how to apply these string functions to DataFrames.
Pandas provides a set of string functions that make it easy to manipulate string data. Most importantly, these functions ignore (or exclude) missing/NaN values.
Almost all of these methods mirror Python's built-in string methods (see: https://docs.python.org/3/library/stdtypes.html#string-methods). To use them, access the Series through its .str accessor, which treats each element as a string, and then call the operation.
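To illustrate how the .str accessor relates to Python's own string methods, here is a minimal sketch. The Series name `names` and its values are made up for illustration and are not part of the examples in this chapter.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value.
names = pd.Series(['Tom', np.nan, 'John'])

# Vectorized: .str applies lower() to every element and leaves NaN untouched.
vectorized = names.str.lower()

# Hand-written equivalent: call Python's str.lower only on real strings.
manual = names.map(lambda v: v.lower() if isinstance(v, str) else v)

print(vectorized.tolist())
print(manual.tolist())
```

Both approaches produce the same values; the accessor simply saves writing the missing-value check by hand.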
Let's see how each operation is executed.
| Method | Description |
| --- | --- |
| lower() | Convert the strings in the Series/Index to lowercase. |
| upper() | Convert the strings in the Series/Index to uppercase. |
| len() | Calculate the length of each string. |
| strip() | Remove whitespace (including newline characters) from both sides of each string in the Series/Index. |
| split(' ') | Split each string on the given pattern. |
| cat(sep=' ') | Concatenate the Series/Index elements with the given separator. |
| get_dummies() | Return a DataFrame of one-hot encoded values. |
| contains(pattern) | Return True for each element that contains the pattern, otherwise False. |
| replace(a,b) | Replace the value a with the value b. |
| repeat(value) | Repeat each element the specified number of times. |
| count(pattern) | Return the number of times the pattern appears in each element. |
| startswith(pattern) | Return True if the element in the Series/Index starts with the pattern. |
| endswith(pattern) | Return True if the element in the Series/Index ends with the pattern. |
| find(pattern) | Return the position of the first occurrence of the pattern. |
| findall(pattern) | Return a list of all occurrences of the pattern. |
| swapcase() | Swap the case of each character (lowercase becomes uppercase and vice versa). |
| islower() | Check whether all characters in each string in the Series/Index are lowercase. Returns a boolean value. |
| isupper() | Check whether all characters in each string in the Series/Index are uppercase. Returns a boolean value. |
| isnumeric() | Check whether all characters in each string in the Series/Index are numeric. Returns a boolean value. |
Let's create a Series to see how all the above functions work.
import pandas as pd
import numpy as np

# Sample Series containing strings and one missing value
s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])
print(s)
Running Result:
0             Tom
1    William Rick
2            John
3         Alber@t
4             NaN
5            1234
6      SteveSmith
dtype: object
import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])
# lower(): convert every string to lowercase
print(s.str.lower())
Running Result:
0             tom
1    william rick
2            john
3         alber@t
4             NaN
5            1234
6      stevesmith
dtype: object
import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])
# upper(): convert every string to uppercase
print(s.str.upper())
Running Result:
0             TOM
1    WILLIAM RICK
2            JOHN
3         ALBER@T
4             NaN
5            1234
6      STEVESMITH
dtype: object
import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])
# len(): length of each string; NaN stays NaN
print(s.str.len())
Running Result:
0     3.0
1    12.0
2     4.0
3     7.0
4     NaN
5     4.0
6    10.0
dtype: float64
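The lengths above are reported as float64 because a NaN cannot be stored in a plain integer column. As a small sketch (the variable names and the shortened sample are illustrative), dropping the missing values first yields whole-number lengths:

```python
import numpy as np
import pandas as pd

s = pd.Series(['Tom', 'William Rick', np.nan])

# NaN is propagated, so the lengths come back as floating point numbers.
lengths = s.str.len()

# With the missing value removed, every element has a length.
int_lengths = s.dropna().str.len()

print(lengths.tolist())
print(int_lengths.tolist())
```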
import pandas as pd

# 'Tom ' has trailing whitespace and ' William Rick' has leading whitespace
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s)
print("After Stripping:")
print(s.str.strip())
Running Result:
0             Tom 
1     William Rick
2             John
3          Alber@t
dtype: object
After Stripping:
0             Tom
1    William Rick
2            John
3         Alber@t
dtype: object
import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s)
print('Split Pattern:')
# split(' '): every single space is a separator, so leading/trailing spaces yield empty strings
print(s.str.split(' '))
Running Result:
0             Tom 
1     William Rick
2             John
3          Alber@t
dtype: object
Split Pattern:
0              [Tom, ]
1    [, William, Rick]
2               [John]
3            [Alber@t]
dtype: object
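Splitting on a single space keeps an empty string for every leading or trailing space. A hedged sketch of two common alternatives follows: calling split() with no pattern splits on runs of whitespace, and expand=True spreads the pieces into DataFrame columns. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

# No pattern: split on any run of whitespace and drop empty pieces.
on_whitespace = s.str.split()

# expand=True: return a DataFrame with one column per piece;
# rows with fewer pieces are padded with missing values.
as_columns = s.str.split(expand=True)

print(on_whitespace.tolist())
print(as_columns.shape)
```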
import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
# cat(sep='_'): join all elements into one string, separated by '_'
print(s.str.cat(sep='_'))
Running Result:
Tom _ William Rick_John_Alber@t
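When the Series contains missing values, cat() skips them unless a placeholder is supplied through the na_rep parameter. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
letters = pd.Series(['a', 'b', np.nan, 'd'])

skipped = letters.str.cat(sep=' ')               # missing value is left out
filled = letters.str.cat(sep=' ', na_rep='?')    # missing value shown as '?'

print(skipped)
print(filled)
```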
import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
# get_dummies(): one column per distinct value, 1 where the row matches
print(s.str.get_dummies())
Running Result:
    William Rick  Alber@t  John  Tom 
0              0        0     0     1
1              1        0     0     0
2              0        0     1     0
3              0        1     0     0
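get_dummies() can also split each string on a separator before encoding, which helps when one cell holds several labels. The sketch below relies on the default separator '|'; the sample data is invented for illustration.

```python
import pandas as pd

# Hypothetical data: each cell holds one or more labels separated by '|'.
tags = pd.Series(['a|b', 'a', 'a|c'])

# Each distinct label becomes its own indicator column.
dummies = tags.str.get_dummies(sep='|')

print(dummies)
```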
import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
# contains(' '): True where the string contains a space
print(s.str.contains(' '))
Running Result:
0     True
1     True
2    False
3    False
dtype: bool
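contains() treats its pattern as a regular expression by default, and it returns a missing value for missing elements unless the na parameter says otherwise. A small sketch, reusing the Series with a NaN from earlier in this chapter (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])

# na=False: report missing elements as False instead of NaN.
has_i = s.str.contains('i', na=False)

# Regular expression: strings made up entirely of digits.
all_digits = s.str.contains(r'^\d+$', na=False)

print(has_i.tolist())
print(all_digits.tolist())
```

With na=False the result is a clean boolean Series, so it can be used directly as a filter, for example s[has_i].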
import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s)
print('After replacing @ with $:')
# replace('@', '$'): substitute every '@' with '$'
print(s.str.replace('@', '$'))
Running Result:
0             Tom 
1     William Rick
2             John
3          Alber@t
dtype: object
After replacing @ with $:
0             Tom 
1     William Rick
2             John
3          Alber$t
dtype: object
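Whether replace() interprets the pattern as a regular expression is controlled by the regex parameter, and its default has differed between pandas versions, so passing it explicitly is a reasonable precaution. A minimal sketch (variable names are illustrative):

```python
import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

# Literal replacement: '@' is matched as a plain character.
literal = s.str.replace('@', '$', regex=False)

# Regular expression: remove everything that is not a letter or a space.
cleaned = s.str.replace(r'[^A-Za-z ]', '', regex=True)

print(literal.tolist())
print(cleaned.tolist())
```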
import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
# repeat(2): repeat each string twice
print(s.str.repeat(2))
Running Result:
0                      Tom Tom 
1     William Rick William Rick
2                      JohnJohn
3                Alber@tAlber@t
dtype: object
import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print("The number of 'm' in each string:")
# count('m'): number of occurrences of 'm' in each string
print(s.str.count('m'))
Running Result:
The number of 'm' in each string:
0    1
1    1
2    0
3    0
dtype: int64
import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print("Strings that start with 'T':")
# startswith('T'): True where the string begins with 'T'
print(s.str.startswith('T'))
Running Result:
Strings that start with 'T':
0     True
1    False
2    False
3    False
dtype: bool
import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print("Strings that end with 't':")
# endswith('t'): True where the string ends with 't'
print(s.str.endswith('t'))
Running Result:
Strings that end with 't':
0    False
1    False
2    False
3     True
dtype: bool
import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
# find('e'): zero-based position of the first 'e' in each string
print(s.str.find('e'))
Running Result:
0   -1
1   -1
2   -1
3    3
dtype: int64
"-1" indicates that no match was found in the element.
import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
# findall('e'): list of every match of 'e' in each string
print(s.str.findall('e'))
Running Result:
0     []
1     []
2     []
3    [e]
dtype: object
An empty list ([]) indicates that no match was found in the element.
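Because findall() accepts a regular expression, it can collect more than a fixed substring. As a hedged sketch (the pattern and variable name are chosen for illustration), the following gathers every uppercase letter in each element:

```python
import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])

# [A-Z] matches any single uppercase ASCII letter.
capitals = s.str.findall(r'[A-Z]')

print(capitals.tolist())
```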
import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
# swapcase(): lowercase becomes uppercase and vice versa
print(s.str.swapcase())
Running Result:
0             tOM
1    wILLIAM rICK
2            jOHN
3         aLBER@T
dtype: object
import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
# islower(): True only if all cased characters in the string are lowercase
print(s.str.islower())
Running Result:
0    False
1    False
2    False
3    False
dtype: bool
import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
# isupper(): True only if all cased characters in the string are uppercase
print(s.str.isupper())
Running Result:
0    False
1    False
2    False
3    False
dtype: bool
import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
# isnumeric(): True only if every character in the string is numeric
print(s.str.isnumeric())
Running Result:
0    False
1    False
2    False
3    False
dtype: bool
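Every check above returns False because each sample string mixes upper- and lowercase letters. To show when the checks return True, here is a small sketch with made-up values (the Series name `samples` is illustrative):

```python
import pandas as pd

# Hypothetical values chosen so that each check is True exactly once.
samples = pd.Series(['tom', 'TOM', '1234', 'Tom1'])

lower_flags = samples.str.islower()
upper_flags = samples.str.isupper()
numeric_flags = samples.str.isnumeric()

print(lower_flags.tolist())
print(upper_flags.tolist())
print(numeric_flags.tolist())
```

Note that '1234' is neither lowercase nor uppercase, since it contains no cased characters at all.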