Python Regular Expressions (RegEx)

In this tutorial, you will learn about regular expressions (RegEx) and use Python's re module with RegEx (with the help of examples).

A regular expression (RegEx) is a sequence of characters that defines a search pattern. For example,

^a...s$

The above code defines a RegEx pattern. The pattern is: any five-letter string starting with a and ending with s.

Patterns defined using RegEx can be used for string matching.

Expression    String       Match?
^a...s$       abs          No match
              alias        Match
              abyss        Match
              Alias        No match
              An abacus    No match

Python has a module named re to work with RegEx. Here is an example:

import re
pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)
if result:
  print("Search successful.")
else:
  print("Search not successful.")

Here, we use the re.match() function to search for patterns in the test string. If the search is successful, this method will return a match object. If not, it will return None.

The re module defines several other functions that can be used with RegEx. Before we explore those, let's learn about regular expressions themselves.

If you are already familiar with the basics of RegEx, skip to Python RegEx.

Specify the pattern using regular expressions

To specify a regular expression, meta-characters are used. In the above example, ^ and $ are meta-characters.

Meta characters

Meta characters are characters that the RegEx engine interprets in a special way. Here is a list of meta characters:

[] . ^ $ * + ? {} () \ |

[] - Square brackets

Square brackets specify the set of characters you want to match.

Expression    String       Match?
[abc]         a            1 match
              ac           2 matches
              Hey Jude     No match
              abc de ca    5 matches

Here, [abc] matches if the string you are trying to match contains any of a, b, or c.

You can also specify a range of characters using - inside the square brackets.

  • [a-e] is the same as [abcde].

  • [1-4] is the same as [1234].

  • [0-39] is the same as [01239].

You can complement (invert) the character set by using the caret symbol ^ at the start of the square brackets.

  • [^abc] means any character except a, b, or c.

  • [^0-9] indicates any non-digit character.
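The character-set rules above can be checked with a short script. This is a minimal sketch: the sample strings come from the table, while the variable names and the strings "badge" and "a1b2" are illustrative assumptions.

```python
import re

# Count how many characters in each sample belong to the set [abc].
samples = ["a", "ac", "Hey Jude", "abc de ca"]
counts = {text: len(re.findall(r"[abc]", text)) for text in samples}
print(counts)

# A range, and a complemented (inverted) set.
in_range = re.findall(r"[a-e]", "badge")    # letters a through e only
non_digits = re.findall(r"[^0-9]", "a1b2")  # everything that is not a digit
print(in_range, non_digits)
```

Each count corresponds to the number of matches listed for that string in the table.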

. - Dot

The dot matches any single character (except newline '\n').

Expression    String    Match?
..            a         No match
              ac        1 match
              acd       1 match
              acde      2 matches (contains 4 characters)

^ - Caret

The caret symbol ^ is used to check if a string starts with a certain character.

Expression    String    Match?
^a            a         1 match
              abc       1 match
              bac       No match
^ab           abc       1 match
              acb       No match (starts with a but not followed by b)

$ - Dollar

The dollar sign $ is used to check if a string ends with a certain character.

Expression    String     Match?
a$            a          1 match
              formula    1 match
              cab        No match
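As a quick check of both anchors, the sketch below runs ^ab and a$ against the strings from the two tables. The dictionary names are hypothetical and only serve the illustration.

```python
import re

# ^ab succeeds only when the string begins with "ab".
starts = {text: bool(re.search(r"^ab", text)) for text in ["abc", "acb", "bac"]}

# a$ succeeds only when the string ends with "a".
ends = {text: bool(re.search(r"a$", text)) for text in ["a", "formula", "cab"]}

print(starts)
print(ends)
```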

* - Asterisk

The asterisk symbol * matches zero or more occurrences of the pattern to its left.

Expression    String    Match?
ma*n          mn        1 match
              man       1 match
              maaan     1 match
              main      No match (a is not followed by n)
              woman     1 match

+ - Plus sign

The plus sign + matches one or more occurrences of the pattern to its left.

Expression    String    Match?
ma+n          mn        No match (no a character)
              man       1 match
              maaan     1 match
              main      No match (a is not followed by n)
              woman     1 match

? - Question mark

The question mark ? matches zero or one occurrence of the pattern to its left.

Expression    String    Match?
ma?n          mn        1 match
              man       1 match
              maaan     No match (more than one a character)
              main      No match (a is not followed by n)
              woman     1 match
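The three quantifiers differ only in how many repetitions they allow, which is easiest to see side by side. The following sketch, with hypothetical variable names, runs all three patterns over the words used in the tables.

```python
import re

words = ["mn", "man", "maaan", "main", "woman"]
patterns = {"*": r"ma*n", "+": r"ma+n", "?": r"ma?n"}

# For each quantifier, record which words contain a match.
matched = {
    symbol: [word for word in words if re.search(pattern, word)]
    for symbol, pattern in patterns.items()
}

for symbol, hits in matched.items():
    print(symbol, hits)
```

Only * accepts both mn and maaan, because it allows zero as well as many repetitions of a.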

{} - Braces

Consider this code: {n,m}. This means at least n, and at most m, repetitions of the pattern to its left.

Expression    String         Match?
a{2,3}        abc dat        No match
              abc daat       1 match (at daat)
              aabc daaat     2 matches (at aabc and daaat)
              aabc daaaat    2 matches (at aabc and daaaat)

Let's try one more example. The RegEx [0-9]{2,4} matches at least 2 digits but not more than 4 digits.

Expression    String           Match?
[0-9]{2,4}    ab123csde        1 match (at ab123csde)
              12 and 345673    3 matches (12, 3456, 73)
              1 and 2          No match
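To see exactly which substrings a bounded repetition picks up, re.findall() can list them. This is a small sketch using strings from the tables; the variable names are illustrative.

```python
import re

# a{2,3} takes as many a's as it can, up to three, at each position.
runs_of_a = re.findall(r"a{2,3}", "aabc daaaat")

# [0-9]{2,4} takes two to four consecutive digits.
digit_runs = re.findall(r"[0-9]{2,4}", "12 and 345673")

print(runs_of_a)
print(digit_runs)
```

In daaaat the engine consumes three a's first, and the single a that remains is too short to match again.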

| - Vertical bar

The vertical bar | is used for alternation (or operator).

Expression    String    Match?
a|b           cde       No match
              ade       1 match (at ade)
              acdbea    3 matches (at acdbea)

Here, a|b matches any string that contains either a or b.

() - Parentheses

Parentheses () are used to group sub-patterns. For example, (a|b|c)xz matches any string that contains a, b, or c followed by xz.

Expression    String       Match?
(a|b|c)xz     ab xz        No match
              abxz         1 match (at abxz)
              axz cabxz    2 matches (axz, and bxz in cabxz)
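Alternation and grouping can be explored with re.findall() and re.finditer(). One detail worth knowing, not covered in the tables, is that re.findall() returns the text captured by the group when the pattern contains one. The names below are illustrative.

```python
import re

# Plain alternation: every a or b in the string.
either = re.findall(r"a|b", "acdbea")

# With one capturing group, findall returns what the group captured.
captured = re.findall(r"(a|b|c)xz", "axz cabxz")

# finditer yields match objects, so group() gives the whole match.
whole = [m.group() for m in re.finditer(r"(a|b|c)xz", "axz cabxz")]

print(either, captured, whole)
```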

\ - Backslash

The backslash \ is used to escape various characters, including all meta-characters. For example,

\$a matches if a string contains $ followed by a. Here, $ is not interpreted by the RegEx engine in a special way.

If you are unsure whether a character has a special meaning, you can place a \ in front of it. This ensures that the character is not treated as a special character.
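The effect of escaping can be seen by comparing an escaped and an unescaped pattern. The sketch below also uses re.escape(), a standard-library helper not mentioned above that inserts the backslashes for you. The sample string and variable names are assumptions made for illustration.

```python
import re

text = "Total: $a and $b"

# Escaped: a literal dollar sign followed by a.
escaped = re.findall(r"\$a", text)

# Unescaped: $ means "end of string", so nothing can follow it.
unescaped = re.findall(r"$a", text)

# re.escape() builds the escaped pattern from a plain string.
auto = re.escape("$a")

print(escaped, unescaped, auto)
```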

Special sequences

Special sequences make common patterns easier to write. Here is a list of special sequences:

\A - Matches if the specified characters are at the start of a string.

Expression    String        Match?
\Athe         the sun       Match
              In the sun    No match

\b - Matches if the specified characters are at the beginning or end of a word.

Expression    String           Match?
\bfoo         football         Match
              a football       Match
              afootball        No match
foo\b         the foo          Match
              the afoo test    Match
              the afootest     No match

\B - Opposite of \b. Matches if the specified characters are not at the beginning or end of a word.

Expression    String           Match?
\Bfoo         football         No match
              a football       No match
              afootball        Match
foo\B         the foo          No match
              the afoo test    No match
              the afootest     Match
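The following sketch checks \b and \B against the strings from the tables. Note that the patterns are written as raw strings: in an ordinary Python string literal, '\b' is the backspace character, not a word boundary. The list names are illustrative.

```python
import re

texts = ["football", "a football", "afootball"]

# \bfoo: "foo" must begin at a word boundary.
at_boundary = [t for t in texts if re.search(r"\bfoo", t)]

# \Bfoo: "foo" must NOT begin at a word boundary.
inside_word = [t for t in texts if re.search(r"\Bfoo", t)]

print(at_boundary)
print(inside_word)
```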

\d - Matches any decimal digit. Equivalent to [0-9]

Expression    String    Match?
\d            12abc3    3 matches (at 12abc3)
              Python    No match

\D - Matches any character that is not a decimal digit. Equivalent to [^0-9]

Expression    String      Match?
\D            1ab34"50    3 matches (at 1ab34"50)
              1345        No match

\s - Matches where a string contains any whitespace character. Equivalent to [ \t\n\r\f\v].

Expression    String          Match?
\s            Python RegEx    1 match
              PythonRegEx     No match

\S - Matches where a string contains any non-whitespace character. Equivalent to [^ \t\n\r\f\v].

Expression    String           Match?
\S            a b              2 matches (at a b)
              (only spaces)    No match

\w - Matches any alphanumeric character (digits and letters). Equivalent to [a-zA-Z0-9_]. By the way, the underscore _ is also considered an alphanumeric character.

Expression    String      Match?
\w            12&": ;c    3 matches (at 12&": ;c)
              %"> !       No match

\W - Matches any non-alphanumeric character. Equivalent to [^a-zA-Z0-9_]

Expression    String    Match?
\W            1a2%c     1 match (at 1a2%c)
              Python    No match

\Z - Matches if the specified characters are at the end of a string.

Expression    String                       Match?
Python\Z      I like Python                1 match
              I like Python Programming    No match
              Python is fun.               No match
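The character-class sequences and \Z can be verified with a few calls. This is a minimal sketch over strings taken from the tables above; the variable names are illustrative.

```python
import re

digits = re.findall(r"\d", "12abc3")
non_digits = re.findall(r"\D", '1ab34"50')
word_chars = re.findall(r"\w", '12&": ;c')
non_word_chars = re.findall(r"\W", "1a2%c")

# \Z anchors the pattern to the very end of the string.
at_end = bool(re.search(r"Python\Z", "I like Python"))
not_at_end = bool(re.search(r"Python\Z", "Python is fun."))

print(digits, non_digits, word_chars, non_word_chars)
print(at_end, not_at_end)
```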

Tip: To build and test regular expressions, you can use a RegEx tester tool such as regex. This tool not only helps you create regular expressions, but also helps you learn them.

Now that you have learned the basics of RegEx, let's discuss how to use RegEx in Python code.

Python regular expressions

Python has a module named re for regular expressions. To use it, we need to import the module.

import re

This module defines some functions and constants that can be used with RegEx.

re.findall()

The re.findall() method returns a list of strings containing all matches.

Example 1: re.findall()

# Program to extract numbers from a string
import re
string = 'hello 12 hi 89. Howdy 34'
pattern = r'\d+'
result = re.findall(pattern, string)
print(result)
# Output: ['12', '89', '34']

If the pattern is not found, re.findall() returns an empty list.
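Because re.findall() always returns a list, the result can be used directly in a truth test or a comprehension. The sketch below assumes an illustrative second string with no digits and converts the found strings to integers.

```python
import re

found = re.findall(r"\d+", "hello 12 hi 89. Howdy 34")
missing = re.findall(r"\d+", "no digits here")

# The matches are strings; convert them if numbers are needed.
numbers = [int(n) for n in found]

print(numbers)
print("any digits?", bool(missing))
```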

re.split()

The re.split() method splits the string wherever there is a match and returns a list of the resulting strings.

Example 2: re.split()

import re
string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+'
result = re.split(pattern, string)
print(result)
# Output: ['Twelve:', ' Eighty nine:', '.']

If the pattern is not found, re.split() returns a list containing the original string.

You can pass the maxsplit parameter to the re.split() method. This is the maximum number of splits to be performed.

import re
string = 'Twelve:12 Eighty nine:89 Nine:9.'
pattern = r'\d+'
# maxsplit = 1
# Split only at the first occurrence
result = re.split(pattern, string, maxsplit=1)
print(result)
# Output: ['Twelve:', ' Eighty nine:89 Nine:9.']

By the way, the default value of maxsplit is 0, which means all possible splits are made.
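To compare different maxsplit values on the same input, consider the sketch below. The variable names are illustrative, and maxsplit is passed by keyword for readability.

```python
import re

text = "Twelve:12 Eighty nine:89 Nine:9."

all_splits = re.split(r"\d+", text)              # maxsplit defaults to 0
one_split = re.split(r"\d+", text, maxsplit=1)   # stop after the first match
two_splits = re.split(r"\d+", text, maxsplit=2)  # stop after the second match

print(all_splits)
print(one_split)
print(two_splits)
```

Whatever remains after the last permitted split is returned unchanged as the final list element.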

re.sub()

Syntax of re.sub():

re.sub(pattern, replace, string)

This method returns a string in which the matched items are replaced with the content of the replace variable.

Example 3: re.sub()

# Program to remove all spaces
import re
# Multiline string
string = 'abc 12\
de 23 \n f45 6'
# Match all whitespace characters
pattern = r'\s+'
# Empty string
replace = ''
new_string = re.sub(pattern, replace, string) 
print(new_string)
# Output: abc12de23f456

If the pattern is not found, re.sub() returns the original string.

You can pass count as the fourth argument to the re.sub() method. If omitted, it defaults to 0, which replaces all occurrences of the match.

import re
# Multiline string
string = 'abc 12\
de 23 \n f45 6'
# Match all whitespace characters
pattern = r'\s+'
replace = ''
# Replace only the first match
new_string = re.sub(pattern, replace, string, count=1)
print(new_string)
# Output:
# abc12de 23
#  f45 6
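The three behaviours of re.sub() described in this section (replace everything, replace a limited number of matches, and leave the string alone when nothing matches) can be compared on one short input. The sample text and names are assumptions for illustration.

```python
import re

text = "abc 12 de 23"

all_removed = re.sub(r"\s+", "", text)          # count defaults to 0: replace all
first_only = re.sub(r"\s+", "", text, count=1)  # replace only the first match
unchanged = re.sub(r"\d+x", "", text)           # pattern never matches

print(all_removed)
print(first_only)
print(unchanged == text)
```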

re.subn()

re.subn() is similar to re.sub(), except that it returns a tuple of 2 items containing the new string and the number of substitutions made.

Example 4: re.subn()

# Program to remove all spaces
import re
# Multiline string
string = 'abc 12\
de 23 \n f45 6'
# Match all whitespace characters
pattern = r'\s+'
# Empty string
replace = ''
new_string = re.subn(pattern, replace, string)
print(new_string)
# Output: ('abc12de23f456', 4)
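Since re.subn() returns a tuple, it is often convenient to unpack it into two names. A minimal sketch, with an illustrative input string:

```python
import re

# Unpack the new string and the number of substitutions made.
cleaned, replacements = re.subn(r"\s+", "", "abc 12 de 23")

print(cleaned)
print(replacements)
```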

re.search()

The re.search() method takes two parameters: pattern and string. This method searches for the first occurrence of the RegEx pattern in the string.

If the search is successful, re.search() returns a match object. If not, it returns None.

match = re.search(pattern, str)

Example 5: re.search()

import re
string = "Python is fun"
# Check if "Python" is at the beginning
match = re.search(r'\APython', string)
if match:
  print("pattern found inside the string")
else:
  print("pattern not found")
# Output: pattern found inside the string

Here, match contains a match object.
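It may help to contrast re.search() with re.match(), which was used in the first example of this tutorial. re.match() only tries the pattern at the beginning of the string, while re.search() scans the whole string. The sample text below is an illustrative assumption.

```python
import re

text = "I like Python"

from_start = re.match(r"Python", text)  # "Python" is not at position 0
anywhere = re.search(r"Python", text)   # found later in the string

print(from_start)
print(anywhere.group(), anywhere.start())
```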

match object

You can use the dir() function to get the methods and attributes of a match object.

Some commonly used methods and properties of the match object are:

match.group()

The group() method returns the matching part of the string.

Example 6: Match object

import re
string = '39801 356, 2102 1111'
# Three digits, followed by a space, then two digits
pattern = r'(\d{3}) (\d{2})'
# The match variable contains a Match object.
match = re.search(pattern, string) 
if match:
  print(match.group())
else:
  print("pattern not found")
# Output: 801 35

Here, the match variable contains a match object.

Our pattern (\d{3}) (\d{2}) has two subgroups, (\d{3}) and (\d{2}). You can get the part of the string matched by each of these parenthesized subgroups. Here's how:

>>> match.group(1)
'801'
>>> match.group(2)
'35'
>>> match.group(1, 2)
('801', '35')
>>> match.groups()
('801', '35')

match.start(), match.end() and match.span()

The start() function returns the index of the start of the matched substring. Similarly, end() returns the end index of the matched substring.

>>> match.start()
2
>>> match.end()
8

The span() function returns a tuple containing the start and end indices of the matched part.

>>> match.span()
(2, 8)

match.re and match.string

The re attribute of a match object returns a regular expression object. Similarly, the string attribute returns the passed string.

>>> match.re
re.compile('(\\d{3}) (\\d{2})')
>>> match.string
'39801 356, 2102 1111'
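The match-object methods can also be used in a script rather than at the interactive prompt. The sketch below uses re.finditer(), which is not covered above, as one way to obtain a match object for every occurrence instead of only the first. The names are illustrative.

```python
import re

text = "39801 356, 2102 1111"
pattern = re.compile(r"(\d{3}) (\d{2})")

# Collect (whole match, subgroups, span) for every match in the text.
summary = [(m.group(), m.groups(), m.span()) for m in pattern.finditer(text)]

for row in summary:
    print(row)
```

The first row corresponds to the match object returned by re.search() in Example 6.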

We have covered all the commonly used methods defined in the re module. If you want to learn more, visit Python 3 re module.

Use the r prefix before RegEx

When the r or R prefix is used before a regular expression, it means a raw string. For example, '\n' is a new line, whereas r'\n' means two characters: a backslash \ followed by n.

The backslash \ is used to escape various characters, including all meta-characters. However, using the r prefix makes Python treat \ as a normal character.

Example 7: Raw string using the r prefix

import re
string = '\n and \r are escape sequences.'
result = re.findall(r'[\n\r]', string) 
print(result)
# Output: ['\n', '\r']
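The difference between a normal and a raw string literal is easy to confirm by looking at lengths. This short sketch, with illustrative names, also shows that a raw string is just a more readable way of writing doubled backslashes.

```python
import re

normal = '\n'   # one character: a newline
raw = r'\n'     # two characters: a backslash and the letter n

lengths = (len(normal), len(raw))
same_pattern = (r'\d+' == '\\d+')  # both spell backslash, d, plus

digits = re.findall(r'\d+', 'room 101')

print(lengths, same_pattern, digits)
```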