Regular Expressions in Python: Match and Process Text Efficiently

Introduction

In this chapter, you will learn regular expressions (regex) in Python, a powerful way to search, validate, and transform text. Regex is widely used for tasks like checking emails, extracting numbers, and cleaning raw text data. Once you understand the basics, many text-processing problems become much easier to solve.

Prerequisites

  • Python 3.10+ installed
  • Basic understanding of strings, conditions, and loops
  • Ability to run .py files in a terminal or an IDE

What Is Regex

Regex is a pattern language used to match text.

In Python, regex functions are provided by the built-in re module.

Common scenarios:

  • validate input format
  • extract target content
  • replace matched text

1) Import re and Basic Match

Use re.search() to find the first place where a pattern matches in a string. It returns a match object, or None if nothing matches.

python
import re
 
text = "My score is 95."
 
# Search first number sequence
match = re.search(r"\d+", text)
 
if match:
    print(match.group())  # 95

The r"..." prefix marks a raw string, in which Python leaves backslashes untouched. Raw strings are recommended for regex patterns.
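To make the effect of the r prefix concrete, the following minimal sketch compares a normal string with a raw string. The variable names are chosen here for illustration only.

```python
import re

# In a normal string, "\n" is a single newline character.
newline_len = len("\n")
print(newline_len)  # 1

# In a raw string, r"\n" is two characters: a backslash and the letter n.
raw_len = len(r"\n")
print(raw_len)  # 2

# A raw pattern equals its double-escaped spelling in a normal string.
same_pattern = r"\d+" == "\\d+"
print(same_pattern)  # True

# Both spellings therefore find the same match.
match = re.search("\\d+", "My score is 95.")
print(match.group())  # 95
```

The raw form is shorter and keeps the pattern looking the way regex references write it.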

2) Core Regex Symbols (Beginner Set)

  • . : any character (except newline)
  • \d : digit (0-9; for str patterns, also other Unicode decimal digits unless re.ASCII is set)
  • \w : word character (letter, digit, underscore)
  • \s : whitespace
  • + : one or more
  • * : zero or more
  • ? : zero or one
  • ^ : start of string (with re.MULTILINE, also the start of each line)
  • $ : end of string, or just before a trailing newline
  • [] : character class
  • () : group

Example:

python
import re
 
print(bool(re.search(r"^Hello", "Hello Python")))  # True
print(bool(re.search(r"\d+$", "Room 12")))         # True
print(bool(re.search(r"\w+", "abc_123")))          # True
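The example above covers ^, $, \d, and \w. As a rough companion sketch, the snippet below exercises the remaining symbols from the list (., ?, *, [], and ()). The sample strings and variable names are made up for illustration.

```python
import re

# . matches any single character except a newline
any_char = re.findall(r"c.t", "cat cot cut ct")
print(any_char)  # ['cat', 'cot', 'cut']

# ? makes the preceding item optional
optional = re.findall(r"colou?r", "color colour")
print(optional)  # ['color', 'colour']

# * allows zero or more repetitions of the preceding item
repeated = re.findall(r"ab*", "a ab abbb")
print(repeated)  # ['a', 'ab', 'abbb']

# [] matches exactly one character from the listed set
vowels = re.findall(r"[aeiou]", "regex")
print(vowels)  # ['e', 'e']

# () captures part of the match so it can be read separately
m = re.search(r"(\w+)@(\w+)", "alice@example")
print(m.group(1), m.group(2))  # alice example
```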

3) findall() and finditer()

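Use these when you need every match, not just the first. findall() returns a list of matched strings; if the pattern contains groups, it returns the group contents instead.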

python
import re
 
text = "Math: 88, English: 92, Chinese: 95"
 
# Get all number substrings
all_scores = re.findall(r"\d+", text)
print(all_scores)  # ['88', '92', '95']

finditer() yields match objects, which also expose the position of each match:

python
import re
 
text = "A1 B22 C333"
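for m in re.finditer(r"\d+", text):
    # start() is the index of the first matched character; end() is one past the last
    print(m.group(), m.start(), m.end())
# 1 1 2
# 22 4 6
# 333 8 11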

4) sub() for Text Replacement

Use re.sub() to replace matched patterns.

python
import re
 
text = "Contact: 13812345678"
 
# Mask middle digits in phone number
masked = re.sub(r"(\d{3})\d{4}(\d{4})", r"\1****\2", text)
print(masked)  # Contact: 138****5678

This technique is useful for keeping sensitive values such as phone numbers out of logs.

Tip

Pattern Debug Habit

Start with a simple pattern first, confirm results, then add complexity step by step.
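One way this habit might look in practice is sketched below: each step tightens the pattern and prints the result before moving on. The sample line and the step names are assumptions made for this illustration.

```python
import re

line = "Order #1042 shipped on 2024-03-15"

# Step 1: match any run of digits and look at what comes back
step1 = re.findall(r"\d+", line)
print(step1)  # ['1042', '2024', '03', '15']

# Step 2: narrow the pattern to the date shape (4 digits, 2 digits, 2 digits)
step2 = re.findall(r"\d{4}-\d{2}-\d{2}", line)
print(step2)  # ['2024-03-15']

# Step 3: add groups to pull out year, month, and day separately
step3 = re.search(r"(\d{4})-(\d{2})-(\d{2})", line)
print(step3.groups())  # ('2024', '03', '15')
```

Checking the output at every step makes it clear which addition caused a pattern to stop matching.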

5) Common Validation Patterns

Email (Simple Demo Pattern)

python
import re
 
email = "user@example.com"
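# Word characters, dots, or hyphens; then @; then a domain ending in a dot plus word characters
is_valid = bool(re.fullmatch(r"[\w\.-]+@[\w\.-]+\.\w+", email))
print(is_valid)  # True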

Phone Number (Simple Demo Pattern)

python
import re
 
phone = "13812345678"
# The digit 1 followed by exactly 10 more digits (11 in total)
is_valid = bool(re.fullmatch(r"1\d{10}", phone))
print(is_valid)  # True

Username (Letters, digits, underscore, 4-16 chars)

python
import re
 
username = "python_user1"
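# \w also accepts non-ASCII letters and digits; pass flags=re.ASCII to allow only A-Z, a-z, 0-9, _
is_valid = bool(re.fullmatch(r"\w{4,16}", username))
print(is_valid)  # True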

6) Real Mini Example: Student Text Parser

Parse score lines from messy text input.

python
import re
 
raw_text = """
Emma:95
Liam:88
Noah:  92
Olivia:76
"""
 
# Extract name-score pairs
matches = re.findall(r"([A-Za-z]+)\s*:\s*(\d+)", raw_text)
 
# Convert to dictionary
score_map = {name: int(score) for name, score in matches}
 
print(score_map)  # {'Emma': 95, 'Liam': 88, 'Noah': 92, 'Olivia': 76}

The same extract-and-convert approach is common in log parsing and simple ETL (extract, transform, load) tasks.
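As a rough illustration of the log-parsing case, the sketch below applies the same findall() approach to a few log lines. The log format, the sample lines, and the variable names are all hypothetical; real logs will need their own pattern.

```python
import re

# Hypothetical log lines; the format is an assumption for illustration.
log_text = """
2024-03-15 10:02:11 ERROR Disk usage at 91%
2024-03-15 10:02:15 INFO Backup finished
2024-03-15 10:03:40 ERROR Connection timed out
"""

# Capture the time and message of ERROR lines only.
# re.MULTILINE lets ^ and $ match at the start and end of each line.
pattern = r"^\S+ (\d{2}:\d{2}:\d{2}) ERROR (.*)$"
errors = re.findall(pattern, log_text, flags=re.MULTILINE)

print(errors)
# [('10:02:11', 'Disk usage at 91%'), ('10:03:40', 'Connection timed out')]
```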

Warning

Regex can become unreadable if overcomplicated.
If a pattern is too hard to understand, split logic into smaller steps.
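A minimal sketch of splitting the work, assuming a made-up "key=value; key=value" record format: plain string methods break the record into fields, and one small pattern handles a single field at a time.

```python
import re

# Hypothetical record format used only for this illustration.
record = "name=Emma; score=95; email=emma@example.com"

# Step 1: string methods split the record into "key=value" fields
fields = [part.strip() for part in record.split(";")]

# Step 2: a small pattern parses one field at a time
field_pattern = re.compile(r"(\w+)=(.+)")
parsed = {}
for field in fields:
    m = field_pattern.fullmatch(field)
    if m:
        parsed[m.group(1)] = m.group(2)

print(parsed)
# {'name': 'Emma', 'score': '95', 'email': 'emma@example.com'}
```

Each piece can be tested on its own, which is harder to do with a single long pattern that tries to describe the whole record.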

Common Beginner Mistakes

Mistake 1: Forgetting Raw String Prefix r

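Without the r prefix, Python processes backslash escapes before the regex engine sees the pattern. Some sequences silently change meaning, and every regex backslash has to be doubled, which is error-prone.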
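The sketch below shows one way this mistake can surface. It uses \b, the regex word-boundary symbol, which is not in the beginner set above but collides with Python's backspace escape. The sample text and variable names are illustrative.

```python
import re

text = "the cat sat"

# Without r, Python turns "\b" into a backspace character (\x08),
# so the pattern searches for backspaces that are not in the text.
no_raw = re.search("\bcat\b", text)
print(no_raw)  # None

# With r, the regex engine receives \b and treats it as a word boundary.
with_raw = re.search(r"\bcat\b", text)
print(with_raw.group())  # cat
```

No error is raised in the first case, so the bug only shows up as a missing match.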

Mistake 2: Using Greedy Patterns Unintentionally

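The quantifiers * and + are greedy: they match as much text as possible. A pattern like .* can therefore run past the intended end of a match. Use the non-greedy form .*? or a narrower character class to limit it.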
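A short sketch of the difference, using a made-up snippet of tagged text. It compares the greedy form with two common ways of constraining it.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the last > in the string
greedy = re.findall(r"<.*>", html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .*? stops at the first > it can
lazy = re.findall(r"<.*?>", html)
print(lazy)  # ['<b>', '</b>', '<i>', '</i>']

# Constrained class: [^>]* cannot cross a > at all
constrained = re.findall(r"<[^>]*>", html)
print(constrained)  # ['<b>', '</b>', '<i>', '</i>']
```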

Mistake 3: Treating Demo Patterns as Production-Safe

Real-world email/URL validation can be more complex than beginner regex examples.
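To see the gap, the sketch below runs the simple demo email pattern from this chapter against a few strings that most mail systems would likely reject. The candidate strings and variable names are invented for illustration.

```python
import re

# The simple demo pattern from the validation section
demo_email = r"[\w\.-]+@[\w\.-]+\.\w+"

candidates = ["user@example.com", "user@example..com", ".@-.x"]
results = {c: bool(re.fullmatch(demo_email, c)) for c in candidates}

for candidate, accepted in results.items():
    print(candidate, accepted)
# user@example.com True
# user@example..com True
# .@-.x True
```

The pattern accepts all three, including the doubled dot and the punctuation-only address, which is why it should be treated as a teaching example only.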

Surprise Practice Challenge

Build a "Message Sanitizer":

  1. Input a sentence containing phone numbers and emails
  2. Mask phone middle digits
  3. Replace email usernames with ***
  4. Extract all numbers from original text
  5. Print sanitized result and extracted number list

If you finish this, you can already use regex for practical data-cleaning workflows.
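For comparison after you have attempted the steps yourself, here is one possible sketch. The function name sanitize and the sample message are hypothetical, and the patterns reuse the simple demo patterns from this chapter, so they carry the same limitations.

```python
import re

def sanitize(message: str) -> tuple[str, list[str]]:
    """Hypothetical helper: return (sanitized text, numbers found in the original)."""
    # Step 4: extract numbers from the original text before masking anything
    numbers = re.findall(r"\d+", message)
    # Step 2: keep the first 3 and last 4 digits of 11-digit runs
    masked = re.sub(r"(\d{3})\d{4}(\d{4})", r"\1****\2", message)
    # Step 3: replace everything before @ in an email with ***
    masked = re.sub(r"[\w.-]+@", "***@", masked)
    return masked, numbers

sanitized, numbers = sanitize("Call 13812345678 or mail emma@example.com")
print(sanitized)  # Call 138****5678 or mail ***@example.com
print(numbers)    # ['13812345678']
```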

FAQ

Should I memorize all regex syntax?

No. Learn common symbols first, then check references when needed.

Is regex always the best text-processing tool?

Not always. For simple fixed formats, normal string methods may be clearer.
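As a small illustration, the fixed-format checks below need no regex at all. The sample values and variable names are made up for this sketch.

```python
# A fixed "name:score" format splits cleanly on the colon
line = "Emma:95"
name, score = line.split(":")
print(name, int(score))  # Emma 95

# Checking a known suffix or substring is a one-liner
filename = "report.csv"
is_csv = filename.endswith(".csv")
print(is_csv)  # True

has_score = "score" in "My score is 95."
print(has_score)  # True
```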

What is the difference between search() and fullmatch()?

search() finds a match anywhere; fullmatch() requires the entire string to match.
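The difference is easiest to see side by side. The sketch below also includes re.match(), which this chapter does not otherwise cover; it anchors only at the start of the string. Variable names are illustrative.

```python
import re

text = "abc123"

# search(): digits appear somewhere in the string
found_anywhere = bool(re.search(r"\d+", text))
print(found_anywhere)  # True

# match(): the string does not start with digits
found_at_start = bool(re.match(r"\d+", text))
print(found_at_start)  # False

# fullmatch(): not every character is a digit
whole_string = bool(re.fullmatch(r"\d+", text))
print(whole_string)  # False

# fullmatch() succeeds only when the entire string fits the pattern
all_digits = bool(re.fullmatch(r"\d+", "123"))
print(all_digits)  # True
```

For validation, fullmatch() is usually the safer choice because trailing junk cannot slip through.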

Why do regex patterns look hard to read?

They are compact by design. Use comments, small patterns, and helper functions to keep code maintainable.
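One way to apply this advice is the re.VERBOSE flag, which lets a pattern span several lines with inline comments. The sketch below rewrites the phone-masking pattern from this chapter that way; the names PHONE and mask_phone are hypothetical.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows # comments
PHONE = re.compile(
    r"""
    (\d{3})   # first three digits, kept
    \d{4}     # middle four digits, masked
    (\d{4})   # last four digits, kept
    """,
    re.VERBOSE,
)

def mask_phone(text: str) -> str:
    """Hypothetical helper: hide the middle digits of 11-digit numbers."""
    return PHONE.sub(r"\1****\2", text)

masked = mask_phone("Contact: 13812345678")
print(masked)  # Contact: 138****5678
```

Wrapping the compiled pattern in a named function also gives callers a readable name instead of a raw regex.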