Are you good with regular expressions?

imok · November 18, 2021, 1:26am

If, like me, you are not, this website just helped me out.

SonOfAMotherlessGoat · November 18, 2021, 2:08am

nem · November 19, 2021, 4:33pm

regex101 is an exceptional resource. The Owl book (Jeffrey Friedl’s Mastering Regular Expressions) is a great resource as well; that’s how I learned initially.

Some tips I’ve picked up along the way:

Be careful with backtracking. Order alternation branches (|) from most to least likely: a backtracking engine tries them left to right and stops evaluating the remaining branches once one of them leads to a match.
Protect against exponential backtracking with atomic grouping. A regex engine will do its best to find a match, and that means trying every permutation of how the pattern could apply before giving up.
Regexes aren’t always the most efficient tool. For basic pattern matching (\.php$ vs *.php), globs (shell wildcards) are about 8x faster than regexes.
PCRE is the most common flavor of regular expressions. By default grep uses POSIX basic regular expressions (BRE), a very old variant. Use alias grep="grep -E" to avoid escaping branches, or even alias grep="grep -P" for PCRE (GNU grep only).
The “regexp” library (Henry Spencer’s) was used by Tcl and by early versions of Perl. Perl then greatly expanded the feature set. PCRE is what you should focus on in most situations.
Regex engines fall into two families, NFA and DFA. DFA is the original engine type; it doesn’t support backtracking or lookarounds, but it’s rather fast. Intel’s Hyperscan is one such implementation. NFA (backtracking) engines can look behind and ahead of matches, but backtracking can result in exponential time if you’re not careful.
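To make the first and third tips concrete, here is a minimal Python sketch. It assumes Python's built-in `re` module (a backtracking engine) and `fnmatch` for shell-style globs; the file names and the `get`/`getter` patterns are made up for illustration. Note that `fnmatch` translates the glob to a regex internally, so this only shows that the two patterns select the same names. It says nothing about the speed difference.

```python
import fnmatch
import re

# Alternation is ordered: the engine tries branches left to right and
# keeps the first one that lets the overall match succeed.
short_first = re.match(r"get|getter", "getter")
long_first = re.match(r"getter|get", "getter")
print(short_first.group())  # → get
print(long_first.group())   # → getter

# A simple suffix test written as a regex and as a glob.
names = ["index.php", "style.css", "php.ini", "admin.php"]  # hypothetical file names
by_regex = [n for n in names if re.search(r"\.php$", n)]
by_glob = [n for n in names if fnmatch.fnmatchcase(n, "*.php")]
print(by_regex)  # → ['index.php', 'admin.php']
print(by_glob)   # → ['index.php', 'admin.php']
```

The `get|getter` case is why branch order matters for correctness as well as speed: with the short branch first, the longer alternative is never tried unless something later in the pattern forces the engine to backtrack.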

Always run your expressions through regex101 with a variety of data to approximate best and worst runtimes.
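If you want the same kind of best/worst check outside the browser, a rough sketch in Python might look like this. The patterns and input sizes are my own illustrative choices, not anything from regex101. Python's `re` only gained atomic groups in 3.11, so instead of an atomic group the sketch compares a nested quantifier against an equivalent flat rewrite.

```python
import re
import time

NESTED = re.compile(r"(a+)+$")  # nested quantifiers: prone to catastrophic backtracking
FLAT = re.compile(r"a+$")       # accepts the same strings, without the nesting


def time_search(pattern, text):
    """Return (match_or_None, elapsed_seconds) for one search."""
    start = time.perf_counter()
    result = pattern.search(text)
    return result, time.perf_counter() - start


def worst_case_report(sizes=(8, 12, 16, 20)):
    """Time both patterns on inputs built to fail: n 'a's followed by one 'b'."""
    rows = []
    for n in sizes:
        text = "a" * n + "b"  # the trailing 'b' forces every attempt to fail
        nested_result, nested_t = time_search(NESTED, text)
        flat_result, flat_t = time_search(FLAT, text)
        rows.append((n, nested_result, nested_t, flat_result, flat_t))
    return rows


if __name__ == "__main__":
    for n, _, nested_t, _, flat_t in worst_case_report():
        print(f"n={n:2d}  nested={nested_t:.6f}s  flat={flat_t:.6f}s")
```

On a backtracking engine the nested column should grow much faster than the flat one as n increases. Keep n small when experimenting, because each extra character roughly doubles the work for the nested pattern.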

Daniel · November 21, 2021, 1:58am

The guy that wrote the O’Reilly “Regular Expressions Cookbook” used to be my manager at Facebook.

Just be careful when writing scripts you want to share, since other people won’t have your aliases.

At work, for internal systems that take a regular expression as input, we use Google’s RE2, which uses a finite state machine with a fixed-size stack and predictable linear runtime (based on input size), rather than something like PCRE, which uses backtracking, a large recursive stack, and potentially exponential runtime. You really don’t want an employee’s bad regex to take down an internal system!