What the regex?! A Practical Guide to Regular Expressions

The hairiness of the regular expression syntax is mind-boggling but if you can master a few of the core concepts, you have another powerful tool in your toolbox.

Published Nov 9, 2023
My first real programming job was as an intern with the University of Minnesota - Duluth IT department (eons ago). My job was to convert all of the University department data from a SIR database to load into MySQL and to build new web interfaces for each with Perl and CGI scripts. Not only did I get to learn the ins and outs of Perl and regular expressions, but I got my first exposure to writing code to rewrite code. The Perl scripts I was writing used regular expressions to rewrite the old scripts.
The hairiness of the regular expression syntax was mind-boggling at the time (and sometimes still is), but I gained a love for it as a tool and the power it gave me. When I get to work with someone who is new to regular expressions or on a new pattern, I jump at the opportunity, which is what brought me to write this post today.

A regular expression (regex or regexp) is a sequence of characters defining a pattern to search against. Each character in the pattern is either a metacharacter with a special meaning (e.g. match the start of a string) or a regular character with a literal meaning (e.g. match the literal letter a).
Regex is a more precise way of specifying the possible variations of a string. For example, the words:
donut
doughnut
can be specified more precisely with:
do(ugh)?nut
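Here is a minimal sketch of trying that pattern in Ruby, the language I use for the examples later in this post. The constant name DONUT and the misspelled sample word are my own assumptions for illustration.

```ruby
# Hypothetical constant name; the pattern itself is the one shown above.
DONUT = /do(ugh)?nut/

# Regexp#match? returns true or false for each candidate word.
%w[donut doughnut dognut].each do |word|
  puts "#{word}: #{DONUT.match?(word)}"
end
# prints:
# donut: true
# doughnut: true
# dognut: false
```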

What do we use regex for? Here are a few practical applications:
  • creating a search algorithm to find text on a page
  • validating data like a URL format, telephone number, or password
  • parsing a string into separate parts like an area code and a phone number
  • replacing a generic error message with something more specific and on-brand
  • searching your hard drive for files that mention a specific string
  • searching in your IDE through files
  • validating a markdown document format

I use these five concepts of regex regularly and over the years have (mostly) memorized them for when I need to do pattern matching. For each, I’ll explain what it does, give an example or two, and then you can try it out yourself.

The boolean OR operator | matches either the character sequence on its left or the one on its right. For instance, if you wanted to test the user agent string to make sure it was an allowable user agent or to run customized logic for a specific user agent, you could use the | operator.
Example:
iPhone|iPad|Android
will match any of the following:
iPhone
iPad
Android
but not:
BlackBerry
iPod
Need to match a string case-insensitively? Each programming language that supports regex might do it in slightly different ways, so be sure to check yours. In Ruby, it is done with the i character at the end of the regex pattern (surrounded by /).
/iPhone|iPad|Android/i
will match:
iPhone
IPAD
anDROID
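As a rough sketch, the user agent check could look something like this in Ruby. The helper name mobile_agent? and the shortened user agent strings are assumptions made up for this example, not real browser output.

```ruby
# Case-insensitive alternation: any one of the three names may appear.
MOBILE_AGENT = /iPhone|iPad|Android/i

# Hypothetical helper name, for illustration only.
def mobile_agent?(user_agent)
  MOBILE_AGENT.match?(user_agent)
end

# Shortened, made-up user agent strings.
puts mobile_agent?("Mozilla/5.0 (iPhone; CPU iPhone OS 17_0)") # true
puts mobile_agent?("Mozilla/5.0 (Linux; ANDROID 14)")          # true (the i flag ignores case)
puts mobile_agent?("BlackBerry9700/5.0.0.862")                 # false
```

Because the pattern has no anchors, it matches when one of the names appears anywhere in the string.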

To match any character, we can use the wildcard. The wildcard is specified by . and matches any character except newlines.
Example:
.+ing
will match:
boating
sailing
kayaking
surfing
but not:
boat
sail
kayak
surf
What’s that +? We’ll cover that in the Quantifiers section.
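To see the wildcard in action, here is a small Ruby sketch that runs the pattern over the words above. The constant name ING and the extra test word "ing" are my own additions.

```ruby
# Hypothetical constant name; the pattern is the one from the example above.
ING = /.+ing/

words = %w[boating sailing kayaking surfing boat sail kayak surf ing]
words.each { |word| puts "#{word}: #{ING.match?(word)}" }
# The first four words print true and the rest print false.
# "ing" on its own is false because .+ needs at least one character before "ing".
```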

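To match the start or end of a string or line, we can use the anchors ^ and $ respectively. In many languages these anchor to the whole string unless a multiline flag is set. In Ruby they match at the start and end of every line (just after or before a newline), and \A and \z anchor to the whole string.
We could use anchors to validate that a URL starts with https:// (not http://) and ends with .com.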
Example:
^https:\/\/.+\.com$
will match:
https://jennapederson.com
but not:
http://jennapederson.com
https://jennapederson.dev
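Note the \ before each / and before the last . in the pattern. We use the backslash \ to escape a metacharacter so that it matches the literal character. The metacharacters include [ \ ^ $ . | ? * + ( ). The / is not a metacharacter, but it needs escaping in Ruby, where it delimits a regex literal.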
Wondering about that +? We’ll talk about that in the next section on Quantifiers!
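Here is a hedged Ruby sketch of the URL check. It also shows one thing to watch for in Ruby specifically: ^ and $ anchor to lines, so a multi-line input can slip through, while \A and \z anchor to the whole string. The constant names and the multi-line sample are assumptions for illustration.

```ruby
# Line anchors, as in the example above.
LINE_ANCHORED = /^https:\/\/.+\.com$/
# Whole-string anchors, a stricter variant for validation.
STRING_ANCHORED = /\Ahttps:\/\/.+\.com\z/

puts LINE_ANCHORED.match?("https://jennapederson.com") # true
puts LINE_ANCHORED.match?("http://jennapederson.com")  # false
puts LINE_ANCHORED.match?("https://jennapederson.dev") # false

# A two-line string whose second line looks like a valid URL.
multi = "not a url\nhttps://jennapederson.com"
puts LINE_ANCHORED.match?(multi)   # true, because ^ matches after the newline
puts STRING_ANCHORED.match?(multi) # false
```

If the input can contain newlines, the \A and \z version is usually the safer choice for validation.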

Quantifiers follow either a character or a group and specify how many times to match that character or group.
? will match zero or one occurrence
* will match zero or more occurrences
+ will match one or more occurrences
Less common quantifiers like these allow you to match exactly n times or between min and max times:
{n} will match exactly n occurrences
{min,} will match at least min occurrences
{,max} will match up to max occurrences (Ruby supports this form; some other regex engines require {0,max} instead)
{min,max} will match between min and max occurrences
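Example:
\d{2}-\d{2}-\d{4}
will match:
02-02-2020
but not:
12 May 2020
Example:
^(\d{1,3})(\d{0,3})(\d{0,4})$
will match a line made up of one to ten digits, such as 1234567890, but not 02-02-2020, because the pattern leaves no room for the hyphens.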
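The two patterns above can be tried out with a short Ruby sketch like the one below. The constant names DATE and DIGITS and the extra sample strings are assumptions I picked for the example.

```ruby
# Exactly two digits, a hyphen, two digits, a hyphen, four digits.
DATE = /\d{2}-\d{2}-\d{4}/

puts DATE.match?("02-02-2020")      # true
puts DATE.match?("12 May 2020")     # false
puts DATE.match?("2-2-2020")        # false, {2} requires exactly two digits
puts DATE.match?("due 02-02-2020!") # true, the pattern is not anchored

# One to three digits, then up to three more, then up to four more.
DIGITS = /^(\d{1,3})(\d{0,3})(\d{0,4})$/

puts DIGITS.match?("1234567890")  # true, ten digits
puts DIGITS.match?("12345678901") # false, eleven digits is one too many
puts DIGITS.match?("02-02-2020")  # false, hyphens are not digits
```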

Grouping allows us to match groups of characters. We use parentheses ( and ) to open and close a group in our pattern. Capturing groups let us operate on each group individually. These groups will be captured in an array and can be accessed by index.
Example:
(\d{2})-(\d{2})-(\d{4})
will match:
02-02-2020
and will capture three groups:
02
02
2020
Note the \d is a character class representing a larger set of characters, in this case, any digit. Other examples would be [0-9] to represent any digit or [A-Z] to represent any upper case letter.
You can also name groups using ?<group name> and access each group by its group name:
(?<month>\d{2})-(?<day>\d{2})-(?<year>\d{4})
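In Ruby, a sketch of reading those groups back could look like this. Index 0 of the match holds the whole matched text and the captured groups start at index 1. The constant names and the second sample date are assumptions for illustration.

```ruby
# Numbered groups: access by index or all at once with captures.
DATE_GROUPS = /(\d{2})-(\d{2})-(\d{4})/
m = DATE_GROUPS.match("02-02-2020")

puts m[0]      # 02-02-2020 (the whole match)
puts m[3]      # 2020 (the third group)
p m.captures   # ["02", "02", "2020"]

# Named groups: access by name instead of position.
DATE_NAMED = /(?<month>\d{2})-(?<day>\d{2})-(?<year>\d{4})/
named = DATE_NAMED.match("02-14-2020")

puts named[:month] # 02
puts named[:day]   # 14
puts named[:year]  # 2020
```

Named groups tend to hold up better when the pattern changes, since adding a group does not shift the names the way it shifts the indexes.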

For a more complex pattern, I will test it using Rubular where I can shove in a bunch of strings to match and fiddle with it until it’s right. This is specific to Ruby and there will be differences if you’re working in other programming languages, but Regex101 can come in handy for that and it provides some pretty handy explanations.
Don’t tell anyone, but I usually just use Rubular because it’s so fantastic with its cheat sheet right there for me. Occasionally I have to drop out to figure out a specific variation for the language I’m writing my regex for.

You could also use an AI coding companion, like Amazon CodeWhisperer, to help you write your regular expressions (and accompanying tests). Check out this post from my colleague Romain Jourdan on AI coding companions and how he uses them for regular expressions.

Regex pattern matching is a prime candidate for a unit test. I can’t tell you how many times I’ve written a regex pattern, wrapped it in a test, deployed to production, and days, weeks, or months later, I find out there’s another variation of that string we have to match. With the unit test in place, I can start by writing a failing test, fix the regex pattern, and then run my test to make sure I’ve fixed the problem.
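As a rough, framework-free sketch of that workflow in Ruby, the cases can live in a table of inputs and expected results. In a real project the same table would sit inside a Minitest or RSpec test. The pattern, constant names, and cases here are assumptions for illustration, not from a real codebase.

```ruby
# Hypothetical pattern under test: the date format from earlier, anchored to the whole string.
DATE_PATTERN = /\A\d{2}-\d{2}-\d{4}\z/

# Each input is paired with whether the pattern is expected to match it.
# When a new variation turns up in production, add it here first and watch it fail.
CASES = {
  "02-02-2020"  => true,
  "12 May 2020" => false,
  "2-2-2020"    => false
}

# Collect every case where the actual result differs from the expectation.
failures = CASES.reject { |input, expected| DATE_PATTERN.match?(input) == expected }

if failures.empty?
  puts "all #{CASES.size} cases pass"
else
  failures.each_key { |input| puts "unexpected result for #{input.inspect}" }
end
```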

There are plenty of challenges and practice tools out there, but Regex Crossword and HackerRank are my favorites.

Have you had to write a particularly hairy regex pattern to match a string, find text in a document or codebase, or validate some data? What other uses for regex have you seen? I’d love to hear what others have experienced and what they are using regex for. Follow me on Threads or LinkedIn and share it with me there!