If you're not fluent, like me, this website has been a big help.
regex101 is an exceptional resource. The Owl book is great as well; that's how I learned initially.
Some tips I’ve picked up along the way:
- Be careful of backtracking. Branches (`|`) should be sorted by likelihood: once a regex matches, it stops evaluating alternative branches.
- Protect against exponential backtracking with atomic grouping. A regex will do its best to match a pattern, and that means trying every permutation.
- Regexes aren't always the most efficient option. Globs (shell wildcards) are about 8x faster than regexes for basic pattern matching (`\.php$` vs `*.php`).
- PCREs are the most common implementation of regular expressions. `grep` by default uses a very old variant of regex; `alias grep="grep -E"` avoids having to escape branches, or even `alias grep="grep -P"` for PCREs.
- "regexp" came from Tcl. Perl greatly improved on the feature set. PCREs are what you should focus on in most situations.
- Regex engines can be separated into NFA and DFA families. DFA is the original engine; it doesn't support backtracking or lookarounds, but it's fast. Intel's Hyperscan is one such implementation. NFAs offer the ability to look behind and ahead of matches, which can blow up to exponential runtime if you're not careful.
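A quick illustration of the first point, using Python's `re` (the behavior is the same in PCRE): alternation is first-match-wins, so branch order matters.

```python
import re

# Alternation is first-match-wins: branches are tried left to right,
# and the engine commits to the first branch that matches.
m = re.match(r"foo|foobar", "foobar")
print(m.group())  # "foo" -- the longer branch is never considered

m = re.match(r"foobar|foo", "foobar")
print(m.group())  # "foobar" -- reordering changes the result
```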
Always run your expressions through regex101 with a variety of data to approximate best and worst runtimes.
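To make the worst case concrete, here's a minimal sketch using Python's backtracking `re` engine. The pattern `(a+)+b` is a classic pathological case: nested quantifiers give the engine exponentially many ways to carve up a run of "a"s, all of which it tries before giving up on a non-matching input. An unambiguous rewrite matches the same strings in linear time.

```python
import re
import time

# Classic catastrophic-backtracking pattern: nested quantifiers.
evil = re.compile(r"(a+)+b")
text = "a" * 22 + "c"  # no "b", so every split of the "a"s is attempted

t0 = time.perf_counter()
evil.match(text)       # returns None, but only after ~2^22 attempts
slow = time.perf_counter() - t0

# Unambiguous rewrite of the same language: one way to match, linear time.
safe = re.compile(r"a+b")
t0 = time.perf_counter()
safe.match(text)
fast = time.perf_counter() - t0

print(f"evil: {slow:.3f}s, safe: {fast:.6f}s")
```

Add one more "a" to the input and the evil pattern roughly doubles in runtime; the rewrite barely notices.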
The guy that wrote the O'Reilly "Regular Expressions Cookbook" used to be my manager at Facebook.
Just be careful when writing scripts you want to share, since other people won’t have your aliases
At work, for internal systems that take a regular expression as input, we use Google's re2, which uses a finite state machine with a fixed stack and predictable linear runtime (based on input size), rather than something like PCRE, which uses backtracking, a large recursive stack, and potentially exponential runtime. You really don't want an employee's bad regex to take down an internal system!
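This isn't re2's actual implementation, just a toy sketch of the idea: instead of backtracking, track the *set* of NFA states reachable after each character. One pass over the input with bounded work per character, so runtime is linear in input size no matter how the pattern is written. (The NFA below is hand-built for the equivalent of `a+b`.)

```python
# Hand-built NFA: 0 --a--> 1, 1 --a--> 1, 1 --b--> 2 (accepting).
TRANSITIONS = {
    (0, "a"): {1},
    (1, "a"): {1},
    (1, "b"): {2},
}
ACCEPT = {2}

def nfa_match(text: str) -> bool:
    """Simulate the NFA by advancing all live states at once."""
    states = {0}
    for ch in text:
        # Every live state steps forward together; no backtracking needed.
        states = set().union(*(TRANSITIONS.get((s, ch), set()) for s in states))
        if not states:  # no state survives: fail immediately
            return False
    return bool(states & ACCEPT)

print(nfa_match("aaab"))          # matches
print(nfa_match("a" * 22 + "c"))  # rejects -- instantly, unlike a backtracker
```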