Regular Expressions in Python: Match and Process Text Efficiently
Introduction
In this chapter, you will learn regular expressions (regex) in Python, a powerful way to search, validate, and transform text. Regex is widely used for tasks like checking emails, extracting numbers, and cleaning raw text data. Once you understand the basics, many text-processing problems become much easier to solve.
Prerequisites
- Python 3.10+ installed
- Basic understanding of strings, conditions, and loops
- Ability to run .py files in a terminal or IDE
What Is Regex
Regex is a pattern language used to match text.
In Python, regex functions are provided by the built-in re module.
Common scenarios:
- validate input format
- extract target content
- replace matched text
1) Import re and Basic Match
Use re.search() to find the first occurrence of a pattern in a string. It returns a match object, or None if nothing matches.
import re
text = "My score is 95."
# Search first number sequence
match = re.search(r"\d+", text)
if match:
    print(match.group())  # 95
The r"..." prefix marks a raw string, which is recommended for regex patterns.
2) Core Regex Symbols (Beginner Set)
- .: any character (except newline)
- \d: digit (0-9, plus other Unicode digits unless re.ASCII is used)
- \w: word character (letter, digit, underscore)
- \s: whitespace
- +: one or more
- *: zero or more
- ?: zero or one
- ^: start of string
- $: end of string
- []: character class
- (): group
Example:
import re
print(bool(re.search(r"^Hello", "Hello Python"))) # True
print(bool(re.search(r"\d+$", "Room 12"))) # True
print(bool(re.search(r"\w+", "abc_123")))  # True
3) findall() and finditer()
Use these to get multiple matches.
import re
text = "Math: 88, English: 92, Chinese: 95"
# Get all number substrings
all_scores = re.findall(r"\d+", text)
print(all_scores)  # ['88', '92', '95']
finditer() gives match objects with positions:
import re
text = "A1 B22 C333"
for m in re.finditer(r"\d+", text):
    print(m.group(), m.start(), m.end())
4) sub() for Text Replacement
Use re.sub() to replace matched patterns.
import re
text = "Contact: 13812345678"
# Mask middle digits in phone number
masked = re.sub(r"(\d{3})\d{4}(\d{4})", r"\1****\2", text)
print(masked)  # Contact: 138****5678
This is very useful for privacy-safe logging.
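When the masked part has a variable length, a fixed replacement string is not enough. re.sub() also accepts a function as the replacement. The sketch below assumes a made-up rule (keep the last four digits of any run of eight or more digits); the helper name mask_digits and the sample text are invented for illustration.

```python
import re

def mask_digits(match: re.Match) -> str:
    """Keep the last four digits and replace the rest with asterisks."""
    digits = match.group()
    return "*" * (len(digits) - 4) + digits[-4:]

# Sample text invented for this example
text = "Card 1234567812345678, phone 13812345678"

# Only runs of at least 8 digits are masked; shorter numbers are left alone
masked = re.sub(r"\d{8,}", mask_digits, text)
print(masked)  # Card ************5678, phone *******5678
```

The function receives each match object and returns the text to put in its place, so the masking rule can be ordinary Python.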
Tip
Pattern Debug Habit
Start with a simple pattern first, confirm results, then add complexity step by step.
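As a rough illustration of that habit, here is one way to grow a date pattern in three small steps. The sample text and variable names are made up for this sketch.

```python
import re

# Sample text invented for this example
text = "Order #A-1024 shipped on 2024-05-17"

# Step 1: match any run of digits and look at what comes back
step1 = re.findall(r"\d+", text)
print(step1)  # ['1024', '2024', '05', '17']

# Step 2: require the year-month-day shape
step2 = re.findall(r"\d{4}-\d{2}-\d{2}", text)
print(step2)  # ['2024-05-17']

# Step 3: add groups to pull out the individual parts
date_parts = re.findall(r"(\d{4})-(\d{2})-(\d{2})", text)
print(date_parts)  # [('2024', '05', '17')]
```

Printing the result after each step makes it obvious which change introduced a problem.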
5) Common Validation Patterns
Email (Simple Demo Pattern)
import re
email = "user@example.com"
is_valid = bool(re.fullmatch(r"[\w\.-]+@[\w\.-]+\.\w+", email))
print(is_valid)
Phone Number (Simple Demo Pattern)
import re
phone = "13812345678"
is_valid = bool(re.fullmatch(r"1\d{10}", phone))
print(is_valid)
Username (Letters, digits, underscore, 4-16 chars)
import re
username = "python_user1"
is_valid = bool(re.fullmatch(r"\w{4,16}", username))
print(is_valid)
6) Real Mini Example: Student Text Parser
Parse score lines from messy text input.
import re
raw_text = """
Emma:95
Liam:88
Noah: 92
Olivia:76
"""
# Extract name-score pairs
matches = re.findall(r"([A-Za-z]+)\s*:\s*(\d+)", raw_text)
# Convert to dictionary
score_map = {name: int(score) for name, score in matches}
print(score_map)
This pattern is common in log parsing and simple ETL tasks.
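For the log-parsing case, the same idea might look like the sketch below. The log format is hypothetical and invented for this example; named groups and re.MULTILINE are standard features of the re module, but real logs will need their own pattern.

```python
import re

# Hypothetical log lines, invented for this example
log_text = """
2024-05-17 10:02:11 ERROR Disk usage at 91%
2024-05-17 10:02:15 INFO Backup finished
2024-05-17 10:03:40 ERROR Connection timed out
"""

# re.MULTILINE lets ^ and $ match at the start and end of every line
LOG_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<message>.+)$",
    re.MULTILINE,
)

# Keep only the messages of ERROR lines
errors = [
    m.group("message")
    for m in LOG_LINE.finditer(log_text)
    if m.group("level") == "ERROR"
]
print(errors)  # ['Disk usage at 91%', 'Connection timed out']
```

Named groups such as (?P<level>...) let the rest of the code refer to fields by name instead of by position.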
Warning
Regex can become unreadable if overcomplicated.
If a pattern is too hard to understand, split logic into smaller steps.
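One possible way to split the logic: build the pattern from small named pieces, and leave checks that regex handles poorly (such as numeric ranges) to plain Python. The names NAME, SCORE, LINE and parse_line are hypothetical, and the 0-100 score range is an assumption made for this sketch.

```python
import re
from typing import Optional

# Small named pieces are easier to read and test than one long pattern
NAME = r"[A-Za-z]+"
SCORE = r"\d{1,3}"
LINE = re.compile(rf"({NAME})\s*:\s*({SCORE})")

def parse_line(line: str) -> Optional[tuple[str, int]]:
    """Return (name, score) for a line like 'Emma: 95', or None."""
    m = LINE.fullmatch(line.strip())
    if m is None:
        return None
    score = int(m.group(2))
    # The range check is plain Python instead of more regex (assumed range)
    if not 0 <= score <= 100:
        return None
    return m.group(1), score

print(parse_line("Emma: 95"))  # ('Emma', 95)
print(parse_line("Liam:188"))  # None
```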
Common Beginner Mistakes
Mistake 1: Forgetting Raw String Prefix r
Without raw strings, escaping backslashes becomes error-prone.
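A small demonstration of the problem, using the word-boundary escape \b (not covered in the symbol list above, but part of standard regex syntax). In a normal Python string, "\b" is the backspace character, so the pattern silently means something else.

```python
import re

text = "cat concat cat."

# Without r, "\b" is the backspace character (\x08), not a word boundary
without_raw = re.findall("\bcat\b", text)
print(without_raw)  # []

# With r, the backslash reaches the regex engine unchanged
with_raw = re.findall(r"\bcat\b", text)
print(with_raw)  # ['cat', 'cat']

# The two strings do not even have the same length
print(len("\b"), len(r"\b"))  # 1 2
```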
Mistake 2: Using Greedy Patterns Unintentionally
Patterns like .* can match too much if not constrained.
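The sketch below shows the effect on a short made-up string, along with two common ways to constrain the match: the lazy quantifier .*? and a negated character class.

```python
import re

# Sample string invented for this example
html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the last ">" in the string
greedy = re.findall(r"<.*>", html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Lazy: .*? stops at the first ">" it can
lazy = re.findall(r"<.*?>", html)
print(lazy)  # ['<b>', '</b>', '<i>', '</i>']

# Constrained: a negated class gives the same result without laziness
constrained = re.findall(r"<[^>]*>", html)
print(constrained)  # ['<b>', '</b>', '<i>', '</i>']
```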
Mistake 3: Treating Demo Patterns as Production-Safe
Real-world email/URL validation can be more complex than beginner regex examples.
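To make this concrete, the snippet below runs the demo email pattern from section 5 against a few hand-picked inputs. It rejects an address with a plus sign, which many mail providers accept, and accepts strings that are clearly not usable addresses. The sample inputs are chosen for illustration only.

```python
import re

# The demo email pattern from section 5
DEMO_EMAIL = r"[\w\.-]+@[\w\.-]+\.\w+"

samples = [
    "user@example.com",      # ordinary address
    "user+tag@example.com",  # plus addressing, widely accepted in practice
    "user@-.com",            # not a usable domain
    "..@...com",             # only dots around the @
]

results = {s: bool(re.fullmatch(DEMO_EMAIL, s)) for s in samples}
for address, ok in results.items():
    print(address, ok)
# user@example.com True
# user+tag@example.com False
# user@-.com True
# ..@...com True
```

For production use, a dedicated validation library or a confirmation email is usually a safer choice than a hand-written pattern.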
Surprise Practice Challenge
Build a "Message Sanitizer":
- Input a sentence containing phone numbers and emails
- Mask phone middle digits
- Replace email usernames with ***
- Extract all numbers from the original text
- Print sanitized result and extracted number list
If you finish this, you can already use regex for practical data-cleaning workflows.
FAQ
Should I memorize all regex syntax?
No. Learn common symbols first, then check references when needed.
Is regex always the best text-processing tool?
Not always. For simple fixed formats, normal string methods may be clearer.
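A quick side-by-side comparison, using made-up inputs, of string methods and regex on two fixed-format tasks:

```python
import re

line = "Emma:95"

# Fixed delimiter: str.split says exactly what it does
name, score = line.split(":")
print(name, int(score))  # Emma 95

# Same result with regex, but harder to read for this simple case
m = re.fullmatch(r"([^:]+):(\d+)", line)
print(m.group(1), int(m.group(2)))  # Emma 95

filename = "report_2024.csv"
# Fixed suffix: str.endswith versus an escaped, anchored pattern
print(filename.endswith(".csv"))             # True
print(bool(re.search(r"\.csv$", filename)))  # True
```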
What is the difference between search() and fullmatch()?
search() finds a match anywhere; fullmatch() requires the entire string to match.
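The difference is easy to see on a small example. The snippet also includes re.match(), which is not covered in this chapter but is often confused with the other two: it only anchors at the start of the string.

```python
import re

pattern = r"\d{4}"
text = "PIN: 1234!"

found_anywhere = bool(re.search(pattern, text))
at_start = bool(re.match(pattern, text))
whole_string = bool(re.fullmatch(pattern, text))
print(found_anywhere, at_start, whole_string)  # True False False

# fullmatch succeeds only when the whole string fits the pattern
print(bool(re.fullmatch(pattern, "1234")))  # True

# search with ^...$ is close to fullmatch, but $ also matches just
# before a trailing newline, so the two are not identical
print(bool(re.search(r"^\d{4}$", "1234\n")))   # True
print(bool(re.fullmatch(r"\d{4}", "1234\n")))  # False
```

This is why the validation examples in this chapter use fullmatch().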
Why do regex patterns look hard to read?
They are compact by design. Use comments, small patterns, and helper functions to keep code maintainable.
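One way to apply this advice is the re.VERBOSE flag, which lets a pattern span several lines and carry its own comments. The sketch below rewrites the phone-masking pattern from section 4 in that style; the names PHONE and mask_phone are hypothetical.

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and # starts a comment
PHONE = re.compile(
    r"""
    (\d{3})   # first three digits, kept visible
    \d{4}     # middle four digits, masked
    (\d{4})   # last four digits, kept visible
    """,
    re.VERBOSE,
)

def mask_phone(text: str) -> str:
    """Mask the middle four digits of any 11-digit run in the text."""
    return PHONE.sub(r"\1****\2", text)

print(mask_phone("Contact: 13812345678"))  # Contact: 138****5678
```

Wrapping the pattern in a small helper function also gives it a name that explains its purpose at the call site.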