Introduction
Regular expressions — commonly called regex or regexp — are sequences of characters that define a search pattern. They are one of the most powerful tools in a developer's toolkit, enabling you to search, validate, extract, and transform text with a single compact expression.
Whether you are validating an email address in a web form, extracting data from logs, or performing a complex find-and-replace across thousands of files, regular expressions let you express what you are looking for in a concise, portable notation understood by virtually every programming language and most text editors.
At their core, regular expressions describe a language — a set of strings that match the pattern. The expression \d{3}-\d{4} describes any string containing three digits, a hyphen, and four digits. The expression ^[A-Z] describes any string that starts with an uppercase letter. Once you understand the vocabulary, you can compose arbitrarily complex patterns to match exactly what you need.
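As a minimal sketch in Python (the sample strings and variable names are invented for illustration), the two patterns above can be tried with the standard re module:

```python
import re

# \d{3}-\d{4}: three digits, a hyphen, four digits, anywhere in the string
phone_like = re.compile(r"\d{3}-\d{4}")
# ^[A-Z]: the string must start with an uppercase ASCII letter
starts_upper = re.compile(r"^[A-Z]")

samples = ["Call 555-1234 today", "call me maybe", "Room 12-3456"]

for s in samples:
    has_number = phone_like.search(s) is not None
    capitalized = starts_upper.search(s) is not None
    print(f"{s!r}: number={has_number}, capitalized={capitalized}")
```

Only the first sample contains three digits directly before the hyphen, while both the first and the third start with an uppercase letter.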
A Brief History of Regular Expressions
The story of regular expressions stretches back to the foundations of theoretical computer science:
- 1951: Mathematician Stephen Kleene formalizes the concept of regular languages and introduces the Kleene star (*) notation as part of his work on neural nets and automata theory.
- 1968: Ken Thompson implements regular expressions in the QED text editor; they later reach Unix tools like grep, sed, and awk, bringing regex into everyday developer practice.
- 1992: POSIX.2 standardizes two flavors: BRE (Basic Regular Expressions) and ERE (Extended Regular Expressions), ensuring interoperability across Unix systems.
- 1997: Philip Hazel creates the PCRE (Perl Compatible Regular Expressions) library, which becomes the de-facto standard for powerful regex features including lookaheads, lookbehinds, and named captures.
- 1999: ECMAScript 3 standardizes JavaScript's RegExp object, bringing regex to the browser.
- 2015: ES6 adds the u (Unicode) and y (sticky) flags.
- 2018: ES2018 adds named capture groups ((?<name>...)) and lookbehind assertions ((?<=...) / (?<!...)).
POSIX vs PCRE vs JavaScript RegExp
| Feature | BRE/ERE (POSIX) | PCRE | JavaScript RegExp |
|---|---|---|---|
| Lookahead | ✗ | ✓ | ✓ |
| Lookbehind | ✗ | ✓ | ✓ (ES2018+) |
| Named groups | ✗ | ✓ | ✓ (ES2018+) |
| Non-greedy | ✗ | ✓ | ✓ |
| Unicode | Limited | ✓ | ✓ (with u flag) |
| Backreferences | ✓ (BRE; not specified for ERE) | ✓ | ✓ |
Core Syntax Reference
Quick Reference Table
| Pattern | Meaning |
|---|---|
| . | Any character except newline |
| ^ | Start of string / start of line (with m flag) |
| $ | End of string / end of line (with m flag) |
| \d | Any digit [0-9] |
| \D | Any non-digit |
| \w | Word character [a-zA-Z0-9_] |
| \W | Non-word character |
| \s | Whitespace (space, tab, newline, …) |
| \S | Non-whitespace |
| [abc] | Character class: matches a, b, or c |
| [^abc] | Negated class: matches anything except a, b, c |
| [a-z] | Character range |
| * | 0 or more (greedy) |
| + | 1 or more (greedy) |
| ? | 0 or 1 (greedy) |
| {n} | Exactly n times |
| {n,m} | Between n and m times (greedy) |
| *? +? ?? | Lazy (non-greedy) equivalents |
| (abc) | Capturing group |
| (?:abc) | Non-capturing group |
| (?<name>abc) | Named capturing group |
| \| | Alternation: matches left OR right |
| (?=...) | Positive lookahead |
| (?!...) | Negative lookahead |
| (?<=...) | Positive lookbehind |
| (?<!...) | Negative lookbehind |
Character Classes
Character classes let you match one character from a set. [aeiou] matches any single vowel. [a-zA-Z] matches any letter. [^0-9] matches any character that is not a digit.
Shorthand classes are very common:
- \d is equivalent to [0-9]
- \w is equivalent to [a-zA-Z0-9_]
- \s matches space, tab (\t), newline (\n), carriage return (\r), and other whitespace
Quantifiers: Greedy vs Lazy
By default, quantifiers are greedy — they match as much as possible. Consider the HTML string <b>bold</b> and <i>italic</i>:
<.*> → matches the entire string from <b> to </i> (greedy)
<.*?> → matches <b>, then </b>, then <i>, then </i> (lazy)
Adding ? after a quantifier makes it lazy (non-greedy): it matches as little as possible while still allowing the overall pattern to succeed.
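The difference can be checked directly. The sketch below uses Python's re module on the same HTML string. The third pattern, a negated character class, is an addition for comparison and not part of the example above.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r"<.*>", html)      # one match spanning the whole string
lazy = re.findall(r"<.*?>", html)       # one match per tag
negated = re.findall(r"<[^>]*>", html)  # same result as lazy, without backtracking into the tag

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
print(negated)  # ['<b>', '</b>', '<i>', '</i>']
```

The negated class is often preferable to a lazy dot because it states exactly which characters may appear inside a tag.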
Anchors
- ^ matches the start of the string (or line with the m flag).
- $ matches the end of the string (or line with m).
- \b matches a word boundary: the transition between a word character and a non-word character, where the start and end of the string count as non-word.
- \B matches a non-word boundary.
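A short Python sketch of how these anchors behave. The sample text is made up for illustration, and re.MULTILINE is Python's spelling of the m flag.

```python
import re

text = "cat\nconcat cat"

# Without MULTILINE, ^ only matches at the very start of the string.
first_only = re.findall(r"^\w+", text)
# With MULTILINE, ^ also matches right after each newline.
each_line = re.findall(r"^\w+", text, re.MULTILINE)
# \b keeps "cat" from matching inside "concat".
whole_words = [m.start() for m in re.finditer(r"\bcat\b", text)]
# \B requires that there is NO word boundary before "cat".
inside_words = [m.start() for m in re.finditer(r"\Bcat", text)]

print(first_only)    # ['cat']
print(each_line)     # ['cat', 'concat']
print(whole_words)   # [0, 11]
print(inside_words)  # [7]
```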
Groups and Backreferences
Capturing groups (...) capture the matched text, which you can refer to later as \1, \2, etc. (backreferences), or access via the match result array.
Non-capturing groups (?:...) group subpatterns without creating a capture, which is more efficient when you do not need the captured value.
Named groups (?<year>\d{4}) let you reference captures by name (match.groups.year in JavaScript), making patterns far more readable.
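The three kinds of group can be sketched in Python as follows. Note that Python's re module writes named groups as (?P<name>...) rather than (?<name>...). The sample strings are illustrative only.

```python
import re

# Backreference: \1 must repeat whatever group 1 captured (a doubled word).
doubled = re.compile(r"\b(\w+)\s+\1\b")
m = doubled.search("this is is a test")
print(m.group(0), "|", m.group(1))   # is is | is

# Non-capturing group: groups the alternation without creating a capture,
# so findall returns the whole match.
extensions = re.findall(r"(?:jpg|png)$", "photo.png")
print(extensions)                    # ['png']

# Named groups are available by name on the match object.
date = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "released 2024-03")
print(date.group("year"), date.groupdict())
```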
Lookahead and Lookbehind
These zero-width assertions match a position, not a character:
\d+(?= dollars) → matches digits only if followed by " dollars"
\d+(?! dollars) → matches digits NOT followed by " dollars"
(?<=\$)\d+ → matches digits only if preceded by "$"
(?<!\$)\d+ → matches digits NOT preceded by "$"
Lookaheads and lookbehinds do not consume characters, so the matched text does not include the lookahead/lookbehind portion.
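A hedged Python sketch of the four assertions on invented sample text. It also shows a common surprise: with a negative lookaround, \d+ may backtrack to a shorter run of digits that satisfies the assertion, so adding \b is often needed to get whole numbers.

```python
import re

text = "5 dollars, 100 euros, $20"

followed = re.findall(r"\d+(?= dollars)", text)
preceded = re.findall(r"(?<=\$)\d+", text)
print(followed)  # ['5']
print(preceded)  # ['20']

# Gotcha: "100" is followed by " dollars", so the engine backs off to "10",
# which is followed by "0" and therefore passes the negative lookahead.
partial = re.findall(r"\d+(?! dollars)", "100 dollars")
print(partial)   # ['10']

# Word boundaries force the match to cover the whole number.
whole = re.findall(r"\b\d+\b(?! dollars)", "100 dollars, 7 euros")
print(whole)     # ['7']
```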
Flags
| Flag | Name | Effect |
|---|---|---|
| i | Case-insensitive | [a-z] also matches [A-Z] |
| g | Global | Find all matches, not just the first |
| m | Multiline | ^ and $ match line boundaries |
| s | dotAll | . matches newline characters too |
| u | Unicode | Enables full Unicode matching; required for \p{} |
| y | Sticky | Match only at lastIndex position |
| x | Verbose/Extended | Allow whitespace and comments (PCRE, Python, Java, and others; not JavaScript) |
The x (verbose) flag is especially valuable for documenting complex patterns:
import re
pattern = re.compile(r"""
^ # start of string
(?P<year>\d{4}) # 4-digit year
-
(?P<month>0[1-9]|1[0-2]) # month 01–12
-
(?P<day>0[1-9]|[12]\d|3[01]) # day 01–31
$
""", re.VERBOSE)
Common Patterns with Real Examples
Email Validation
^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
Breakdown:
- ^[a-zA-Z0-9._%+-]+ → local part (letters, digits, and select special chars)
- @ → literal at sign
- [a-zA-Z0-9.-]+ → domain name
- \.[a-zA-Z]{2,}$ → TLD of 2+ letters
Note: The RFC 5322 email specification is far more complex. This pattern handles the vast majority of real-world addresses but does not cover edge cases like quoted strings in the local part.
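A minimal sketch of using this pattern in Python. The helper name looks_like_email and the sample addresses are hypothetical. It uses fullmatch because Python's $ also matches just before a trailing newline.

```python
import re

EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

def looks_like_email(value: str) -> bool:
    """Return True if value passes the simplified pattern (not full RFC 5322)."""
    # fullmatch rejects a trailing newline that match() with $ would accept.
    return EMAIL_RE.fullmatch(value) is not None

candidates = [
    "ada@example.com",
    "first.last+tag@mail.example.org",
    "no-at-sign.example.com",
    "user@localhost",          # no dot-separated TLD, so the pattern rejects it
]
for candidate in candidates:
    print(candidate, looks_like_email(candidate))
```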
URL Matching
https?://(?:www\.)?[a-zA-Z0-9-]+(?:\.[a-zA-Z]{2,})+(?:/[^\s]*)?
ISO Date (YYYY-MM-DD)
\b\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])\b
This validates month 01–12 and day 01–31. Note it does not validate month-specific day ranges (e.g., February 30 would pass).
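One way to close that gap, sketched here under the assumption that Python is available, is to let the regex find candidates and let datetime.date reject impossible days. The groups are made capturing for this purpose, and find_real_dates is a hypothetical helper name.

```python
import re
from datetime import date

# Same pattern as above, but with capturing groups so the parts can be reused.
ISO_DATE_RE = re.compile(r"\b(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])\b")

def find_real_dates(text: str) -> list:
    """Return calendar-valid dates found in text; impossible ones are skipped."""
    found = []
    for year, month, day in ISO_DATE_RE.findall(text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            # e.g. February 30: matches the regex but is not a real date
            continue
    return found

print(find_real_dates("Due 2024-02-29, not 2023-02-30 or 2024-13-01."))
```

Here 2024-13-01 is rejected by the regex itself, while 2023-02-30 passes the regex and is dropped by the calendar check.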
IPv4 Address
\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b
Breakdown:
- 25[0-5] → 250–255
- 2[0-4]\d → 200–249
- [01]?\d\d? → 0–199
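The sketch below applies the pattern to an invented log line in Python. For validating a single value (as opposed to extracting addresses from text), the standard ipaddress module is an alternative. The is_ipv4 helper is a hypothetical name.

```python
import re
import ipaddress

IPV4_RE = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b"
)

log_line = "client 192.168.1.10 -> 10.0.0.256 via 8.8.8.8"
addresses = IPV4_RE.findall(log_line)
print(addresses)  # ['192.168.1.10', '8.8.8.8']

def is_ipv4(value: str) -> bool:
    """Validate one candidate string using the standard library parser."""
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        return False

print(is_ipv4("8.8.8.8"), is_ipv4("10.0.0.256"))
```

The out-of-range octet in 10.0.0.256 is skipped because the trailing \b prevents a partial match on 10.0.0.25.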
US Phone Number
\+?1?\s?[\(]?\d{3}[\)]?[-.\s]?\d{3}[-.\s]?\d{4}
Matches formats like (555) 123-4567, 555-123-4567, +1 555 123 4567.
HTML Tag
<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>(.*?)<\/\1>
Uses a backreference \1 to ensure the closing tag matches the opening tag.
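A brief Python sketch of the backreference at work, using an invented snippet. It is meant for simple, flat markup only, since nested or malformed HTML needs a real parser.

```python
import re

TAG_RE = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>(.*?)<\/\1>")

snippet = '<b>bold</b> <a href="/x">link</a> <i>broken</b>'

# Each result is (tag name, inner text); the mismatched <i>...</b> pair is skipped
# because \1 requires the closing tag to repeat the opening tag's name.
pairs = [(m.group(1), m.group(2)) for m in TAG_RE.finditer(snippet)]
print(pairs)  # [('b', 'bold'), ('a', 'link')]
```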
Hex Color Code
#(?:[0-9A-Fa-f]{3}){1,2}\b
Matches both 3-digit (#F00) and 6-digit (#FF0000) hex colors.
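As a quick check in Python (the CSS string is made up), the pattern picks out 3-digit and 6-digit codes and skips a 5-digit one, because the trailing \b rejects a partial match:

```python
import re

HEX_COLOR_RE = re.compile(r"#(?:[0-9A-Fa-f]{3}){1,2}\b")

css = "a { color: #F00; background: #ff0000; border-color: #12345; }"
colors = HEX_COLOR_RE.findall(css)
print(colors)  # ['#F00', '#ff0000']
```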
Regex in Different Programming Languages
JavaScript
// Literal syntax with flags
const regex = /^hello\s+world$/im;
const match = "Hello World".match(regex);
// Constructor syntax (useful for dynamic patterns)
const term = "world";
const dynamic = new RegExp(`hello\\s+${term}`, "im");
// Replace all occurrences
const result = "foo bar foo".replaceAll(/foo/g, "baz");
// Named captures (ES2018+)
const dateRegex = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const { year, month, day } = "2024-03-15".match(dateRegex).groups;
Python
import re
# Compile for reuse
pattern = re.compile(r'^hello\s+world$', re.IGNORECASE | re.MULTILINE)
match = pattern.match("Hello World")
# Find all matches (text is sample input for illustration)
text = "Released 2024-03-15, patched 2024-04-01"
dates = re.findall(r'\d{4}-\d{2}-\d{2}', text)
# Substitute with a function
result = re.sub(r'\b\d+\b', lambda m: str(int(m.group()) * 2), "1 plus 2 is 3")
# Named groups
m = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})', "2024-03")
print(m.group('year')) # 2024
Java
import java.util.regex.*;
Pattern p = Pattern.compile("^hello\\s+world$",
Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
Matcher m = p.matcher("Hello World");
boolean found = m.matches();
// Extract groups
Pattern datePattern = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");
Matcher dm = datePattern.matcher("Today is 2024-03-15");
if (dm.find()) {
String year = dm.group(1);
}
Go
import (
    "fmt"
    "regexp"
)
re := regexp.MustCompile(`(?im)^hello\s+world$`)
match := re.FindString("Hello World")
// Find all submatches (text is sample input for illustration)
text := "Released 2024-03-15, patched 2024-04-01"
dateRe := regexp.MustCompile(`(\d{4})-(\d{2})-(\d{2})`)
all := dateRe.FindAllStringSubmatch(text, -1)
for _, m := range all {
year, month, day := m[1], m[2], m[3]
_ = year; _ = month; _ = day
}
// Named groups
namedRe := regexp.MustCompile(`(?P<year>\d{4})-(?P<month>\d{2})`)
match2 := namedRe.FindStringSubmatch("2024-03")
yearIdx := namedRe.SubexpIndex("year")
fmt.Println(match2[yearIdx]) // 2024
Performance and Catastrophic Backtracking
How Backtracking Works
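Most regex engines, including PCRE and the built-in engines of Perl, Python, Java, and JavaScript, use backtracking NFA (Non-deterministic Finite Automaton) based matching: they follow one path through the pattern and, when it fails, return to an earlier choice point and try another. Backtracking is what makes features like backreferences and lookarounds practical, but it can also be a performance trap. Automata-based engines such as RE2 and Go's regexp package avoid backtracking and guarantee linear-time matching, at the cost of not supporting backreferences or lookarounds.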
The Catastrophic Case
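Consider the pattern (a+)+$ applied to the string "aaaaaX":
- The outer + tries to match as many groups as possible.
- When the engine reaches X and the $ anchor fails, it backtracks and tries different ways to split the a characters among the repetitions of the group.
- For a run of n a characters, there are 2^(n-1) possible splits, leading to exponential time complexity.
The trailing $ matters: without it, (a+)+ would simply match the run of a characters and stop, and no backtracking would be triggered.
(a+)+$ on "aaaaaaaaaaaaaaaaaX" → already tens of thousands of paths; with about 30 a characters it may take seconds or minutes!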
Other dangerous patterns include (a|aa)+, (\w+\s*)+, and anything with nested quantifiers over overlapping character classes.
How to Avoid It
- Avoid nested quantifiers over the same character set: (a+)+ → use a+ instead.
- Use atomic groups (?>...) or possessive quantifiers a++ (PCRE) to prevent backtracking into already-matched groups.
- Be specific: replace .* with a character class that excludes delimiters (e.g., [^"]* inside quoted strings).
- Anchor your patterns where possible so the engine fails fast.
- Use a timeout in production code when processing untrusted input (Java's Pattern does not natively support this; consider a separate thread).
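The first rule can be observed with a small timing sketch in Python. The helper time_match is a hypothetical name, the input sizes are kept small on purpose, and the absolute timings depend on the machine and engine version, so none are shown.

```python
import re
import time

risky = re.compile(r"^(a+)+$")   # nested quantifiers over the same character
safe = re.compile(r"^a+$")       # accepts the same strings, no nesting

def time_match(pattern, text):
    """Return (match result, elapsed seconds) for one match attempt."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start

for n in (10, 14, 18):
    text = "a" * n + "X"          # the trailing X forces the match to fail
    _, risky_s = time_match(risky, text)
    _, safe_s = time_match(safe, text)
    print(f"n={n:2d}  risky={risky_s:.4f}s  safe={safe_s:.6f}s")
```

On a backtracking engine the risky column is expected to grow roughly geometrically with n while the safe column stays flat. Both patterns accept and reject exactly the same inputs.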
How to Read a Complex Regex
Break ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$ into pieces:
^ → anchor: start of string
[a-zA-Z0-9._%+-]+ → one or more allowed local-part characters
@ → literal @
[a-zA-Z0-9.-]+ → one or more domain characters
\. → literal dot (escaped)
[a-zA-Z]{2,} → TLD: 2 or more letters
$ → anchor: end of string
Tip: Use a tool like this regex tester to highlight each captured group and see exactly which part of the input each token matches. Build your pattern incrementally — start with the simplest valid piece, verify it, then extend.
Best Practices
- Compile once, use many times. Pre-compiling a pattern (e.g., re.compile() in Python, Pattern.compile() in Java) is far more efficient than re-parsing it on every call.
- Prefer non-capturing groups (?:...) when you do not need the captured value. It signals intent and avoids unnecessary memory allocation.
- Use raw strings for patterns. In Python, use r'\d+' instead of '\\d+' to avoid double-escaping. In JavaScript, the literal syntax /\d+/ handles this automatically.
- Name your captures. (?<year>\d{4}) is far more maintainable than relying on group index \1.
- Test with edge cases: empty strings, strings that are almost-but-not-quite matches, Unicode characters, and very long inputs.
- Document complex patterns with the x (verbose) flag in Python/PCRE, or with inline comments in your code.
- Never use regex for full HTML or XML parsing. Use a proper parser library instead.
- Validate inputs on the server side. Client-side regex validation improves UX but must not be the only line of defense.
FAQ
Q: What is the difference between match() and search() in Python?
A: re.match() only matches at the beginning of the string. re.search() scans the entire string for a match. Use re.fullmatch() to require the pattern to match the entire string.
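A small sketch of the three functions side by side (the sample string is illustrative):

```python
import re

pattern = re.compile(r"\d+")
s = "order 66 shipped"

at_start = pattern.match(s)        # None: the string does not start with a digit
anywhere = pattern.search(s)       # finds "66" in the middle
whole = pattern.fullmatch(s)       # None: the whole string is not digits
whole_ok = pattern.fullmatch("66") # matches: the entire string is digits

print(at_start)            # None
print(anywhere.group())    # 66
print(whole)               # None
print(whole_ok.group())    # 66
```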
Q: Why does ^ inside a character class [^abc] mean something different?
A: Inside a character class, ^ as the first character negates the class — it matches any character not in the set. Outside a character class, ^ is an anchor for the start of the string.
Q: Can I use regex to parse HTML?
A: For simple, well-defined extractions from known HTML structures, regex can work. But HTML is not a regular language — it allows arbitrary nesting and optional closing tags. Use a proper HTML parser (e.g., BeautifulSoup in Python, DOMParser in JS) for robust parsing.
Q: What is the difference between greedy and possessive quantifiers?
A: Greedy quantifiers backtrack — they try the maximum match and give back characters if needed. Possessive quantifiers (e.g., a++ in PCRE) never give back — once they match, the match is locked. This prevents catastrophic backtracking but can also cause a match to fail when a greedy quantifier would have succeeded.
Q: How do I match a literal dot . or parenthesis (?
A: Escape them with a backslash: \. matches a literal dot, \( matches a literal open parenthesis.
Q: Is regex case-sensitive by default?
A: Yes. Use the i flag (/pattern/i in JavaScript, re.IGNORECASE in Python) to enable case-insensitive matching.
Q: What does \b match?
A: \b is a zero-width word boundary assertion. It matches the position between a word character (\w) and a non-word character (\W), with the start and end of the string counting as non-word. For example, it matches between the d in word and the space after it.
Q: How do I test if an entire string matches a pattern?
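A: Anchor with ^ and $: ^pattern$. In Python, you can also use re.fullmatch(). In JavaScript, call .test() on the anchored pattern; checking match()[0].length === input.length is less reliable, because match() returns null when nothing matches and an alternation can stop at a shorter match.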