What the regex?! A Practical Guide to Regular Expressions

My first real programming job was as an intern with the University of Minnesota - Duluth IT department (eons ago). My job was to convert all of the University department data from a SIR database to load into MySQL and to build new web interfaces for each with Perl and CGI scripts. Not only did I get to learn the ins and outs of Perl and regular expressions, but I got my first exposure to writing code to rewrite code. The Perl scripts I was writing used regular expressions to rewrite the old scripts.

The hairiness of the regular expression syntax was mind-boggling at the time (and sometimes still is), but I gained a love for it as a tool and the power it gave me. When I get to work with someone who is new to regular expressions or on a new pattern, I jump at the opportunity. Which brought me to writing this post today.

What are regular expressions?

A regular expression (regex or regexp) is a sequence of characters defining a pattern to search for. Each character in the pattern is either a metacharacter with a special meaning (e.g. match the start of a string) or a regular character with a literal meaning (e.g. match the literal letter a).

Regex is a concise, precise way of specifying the possible variations of a string. For example, the words:

donut
doughnut

can be specified more concisely with:

do(ugh)?nut
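
Here is a minimal Ruby sketch of that pattern in action. The \A and \z anchors are my addition so that only the whole word counts, and "dounut" is a made-up misspelling to show a miss.

```ruby
# The (ugh)? group is optional, so both spellings match.
# \A and \z (an assumption added here) pin the match to the whole string.
DONUT = /\Ado(ugh)?nut\z/

%w[donut doughnut dounut].each do |word|
  puts "#{word}: #{DONUT.match?(word)}"
end
# donut: true
# doughnut: true
# dounut: false
```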

Practical Applications

What do we use regex for? Here are a few practical applications:

creating a search algorithm to find text on a page
validating data like a URL format, telephone number, or password
parsing a string into separate parts like an area code and a phone number
replacing a generic error message with something more specific and on-brand
searching your hard drive for files that mention a specific string
searching in your IDE through files
validating a markdown document format
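
To make two of these concrete, here is a rough Ruby sketch of parsing and replacing. The phone format, the error text, and the helper names parse_phone and friendly_error are all assumptions for illustration; the syntax is explained in the sections that follow.

```ruby
# Assumed input format: 218-555-0123 (three digits, three digits, four digits).
PHONE = /\A(\d{3})-(\d{3}-\d{4})\z/

# Hypothetical helper: split a phone number into an area code and the rest.
# Returns nil when the text is not in the assumed format.
def parse_phone(text)
  match = PHONE.match(text)
  match && { area_code: match[1], number: match[2] }
end

# Hypothetical helper: swap a generic error message for a more specific one.
def friendly_error(message)
  message.sub(/something went wrong/i, "we couldn't save your changes")
end

parts = parse_phone("218-555-0123")
puts parts[:area_code] # 218
puts parts[:number]    # 555-0123
puts friendly_error("Error: Something went wrong")
# Error: we couldn't save your changes
```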

Five concepts

I use these five concepts of regex regularly and over the years have (mostly) memorized them for when I need to do pattern matching. For each, I’ll explain what it does, give an example or two, and then you can try it out yourself.

Boolean Or

The boolean | operator will match either of the characters or character sequences on each side of it. For instance, if you wanted to test the user agent string to make sure it was an allowable user agent or to run customized logic for a specific user agent, you could use the boolean | operator.

Example:

iPhone|iPad|Android

will match any of the following:

iPhone
iPad
Android

but not:

BlackBerry
iPod

Need to match a string case-insensitively? Each programming language that supports regex might do it in slightly different ways, so be sure to check yours. In Ruby, it is done with the i character at the end of the regex pattern (surrounded by /).

/iPhone|iPad|Android/i

will match:

iPhone
IPAD
anDROID

Try it out!
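
In Ruby, the user agent check might look something like this sketch. The helper name mobile_agent? and the sample user agent string are made up for illustration.

```ruby
# Alternation plus the case-insensitive i flag.
MOBILE_AGENT = /iPhone|iPad|Android/i

# Hypothetical helper: true when the user agent mentions one of the devices.
def mobile_agent?(user_agent)
  MOBILE_AGENT.match?(user_agent)
end

puts mobile_agent?("Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X)") # true
puts mobile_agent?("anDROID")    # true
puts mobile_agent?("BlackBerry") # false
puts mobile_agent?("iPod")       # false
```

Because the pattern has no anchors, it matches anywhere inside a longer user agent string, which is usually what you want for this kind of check.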

Wildcard

To match any character, we can use the wildcard. The wildcard is specified by . and matches any character except newlines.

Example:

.+ing

will match:

boating
sailing
kayaking
surfing

but not:

boat
sail
kayak
surf

Try it out!

What’s that +? We’ll cover that in the Quantifiers section.
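
Here is a small Ruby sketch of the wildcard, including the newline exception. The constant names are my own, and the m flag shown at the end is Ruby's way of letting . match newlines too; other languages spell that option differently.

```ruby
ING_WORD = /.+ing/
ING_ACROSS_LINES = /.+ing/m # Ruby's m flag lets . match newlines as well

%w[boating sailing boat surf].each do |word|
  puts "#{word}: #{ING_WORD.match?(word)}"
end
# boating: true
# sailing: true
# boat: false
# surf: false

puts ING_WORD.match?("ing")           # false: .+ needs at least one character first
puts ING_WORD.match?("\ning")         # false: . does not match the newline
puts ING_ACROSS_LINES.match?("\ning") # true
```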

Anchors

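To match the start or end of a line (just after or before a newline), we can use the anchors ^ and $ respectively. In Ruby, these anchor to each line of the text; to anchor to the whole string, use \A and \z instead.

We could use anchors to validate that a URL starts with https://, not http://, and ends with .com.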

Example:

^https:\/\/.+\.com$

will match:

https://jennapederson.com

but not:

http://jennapederson.com
https://jennapederson.dev

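Note the \ before the two / characters and the . character. We use the backslash \ to escape metacharacters so they match as literal characters. The list of metacharacters includes [ \ ^ $ . | ? * + ( ). The / is not a metacharacter itself, but it needs escaping in Ruby because it delimits the /pattern/ literal.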

Wondering about that +? We’ll talk about that in the next section on Quantifiers!

Try it out!
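
One thing worth seeing in code: in Ruby, ^ and $ anchor to lines, so a multi-line string can slip past a check like this. The sketch below compares the pattern above with a stricter variant using \A and \z, which anchor to the whole string. The constant names and the sneaky input are assumptions for illustration.

```ruby
LINE_ANCHORED = /^https:\/\/.+\.com$/
STRING_ANCHORED = /\Ahttps:\/\/.+\.com\z/ # assumed stricter variant

puts LINE_ANCHORED.match?("https://jennapederson.com") # true
puts LINE_ANCHORED.match?("http://jennapederson.com")  # false
puts LINE_ANCHORED.match?("https://jennapederson.dev") # false

# A second line after a valid first line still passes the line-anchored check.
sneaky = "https://jennapederson.com\nanything else"
puts LINE_ANCHORED.match?(sneaky)   # true
puts STRING_ANCHORED.match?(sneaky) # false
```

For validating user input in Ruby, the \A and \z form is usually the safer choice.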

Quantifiers

Quantifiers follow either a character or a group and specify how many times to match that character or group.

? will match zero or one occurrence

* will match zero or more occurrences

+ will match one or more occurrences

Less common quantifiers like these allow you to match exactly n times or between min and max times:

{n} will match exactly n occurrences

{min,} will match at least min occurrences

{,max} will match up to max occurrences (Ruby accepts this form; some other engines require {0,max})

{min,max} will match between min and max occurrences

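Example:

\d{2}-\d{2}-\d{4}

will match:

02-02-2020

but not:

12 May 2020

Example:

^(\d{1,3})(\d{0,3})(\d{0,4})$

will match a run of one to ten digits and nothing else, split into groups of up to three, three, and four digits. (The parentheses are groups, covered in the next section.)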

Try it out!
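
Here is a rough Ruby sketch of the quantifiers. The two patterns come from the examples in this section; the word "goal", the sample digit strings, and the helper name digit_groups are made up for illustration.

```ruby
# ?, * and + on a made-up word.
puts "gal".match?(/go?al/)    # true: zero o's is allowed
puts "goooal".match?(/go?al/) # false: more than one o
puts "goooal".match?(/go*al/) # true: any number of o's
puts "gal".match?(/go+al/)    # false: needs at least one o

# Patterns from the examples in this section.
LOOSE_DATE = /\d{2}-\d{2}-\d{4}/
DIGIT_GROUPS = /^(\d{1,3})(\d{0,3})(\d{0,4})$/

puts LOOSE_DATE.match?("02-02-2020")  # true
puts LOOSE_DATE.match?("12 May 2020") # false

# Hypothetical helper: split a run of up to ten digits into three groups.
def digit_groups(text)
  match = DIGIT_GROUPS.match(text)
  match ? match.captures : nil
end

p digit_groups("2185550123") # ["218", "555", "0123"]
p digit_groups("21855")      # ["218", "55", ""]
p digit_groups("02-02-2020") # nil
```

The quantifiers are greedy, so each group grabs as many digits as it is allowed before the next group gets a turn.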

Grouping

Grouping allows us to match groups of characters. We use parentheses ( and ) to open and close a group in our pattern. Capturing groups let us operate on them individually. These groups will be captured in an array and can be accessed by index.

Example:

(\d{2})-(\d{2})-(\d{4})

will match:

02-02-2020

and will capture three groups: 02, 02, and 2020.

Try it out!
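
In Ruby, the captured groups come back on the match object. A minimal sketch, where the helper name date_parts is an assumption:

```ruby
DASHED_DATE = /(\d{2})-(\d{2})-(\d{4})/

# Hypothetical helper: return the captured groups, or nil when nothing matches.
def date_parts(text)
  match = DASHED_DATE.match(text)
  match ? match.captures : nil
end

match = DASHED_DATE.match("02-02-2020")
puts match[0] # 02-02-2020 (index 0 is the whole match)
puts match[1] # 02
puts match[3] # 2020

p date_parts("02-02-2020")  # ["02", "02", "2020"]
p date_parts("12 May 2020") # nil
```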

Note that \d is a character class, a shorthand for a larger set of characters, in this case any digit. Other examples are [0-9] for any digit or [A-Z] for any uppercase letter.

You can also name groups with ?<group name> and access each group by its group name:

(?<month>\d{2})-(?<day>\d{2})-(?<year>\d{4})
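
A short Ruby sketch of reading named groups off the match. The helper name parse_date and the sample date are assumptions for illustration.

```ruby
NAMED_DATE = /(?<month>\d{2})-(?<day>\d{2})-(?<year>\d{4})/

# Hypothetical helper: return a hash of group name to captured text, or nil.
def parse_date(text)
  match = NAMED_DATE.match(text)
  match ? match.named_captures : nil
end

match = NAMED_DATE.match("02-14-2020")
puts match[:month] # 02
puts match[:day]   # 14
puts match[:year]  # 2020

# named_captures returns a hash keyed by the group names as strings.
puts parse_date("02-14-2020")["year"] # 2020
```

Named groups cost a few extra characters in the pattern but save you from counting parentheses later.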

Regex Tools

For a more complex pattern, I will test it using Rubular where I can shove in a bunch of strings to match and fiddle with it until it’s right. This is specific to Ruby and there will be differences if you’re working in other programming languages, but Regex101 can come in handy for that and it provides some pretty handy explanations.

Don’t tell anyone, but I usually just use Rubular because it’s so fantastic with its cheat sheet right there for me. Occasionally I have to drop out to figure out a specific variation for the language I’m writing my regex for.

Using an AI Coding Companion

You could also use an AI coding companion, like Amazon CodeWhisperer, to help you write your regular expressions (and accompanying tests). Check out this post from my colleague Romain Jourdan on AI coding companions and how he uses it for regular expressions.

Write a Unit Test

Regex pattern matching is a prime candidate for a unit test. I can’t tell you how many times I’ve written a regex pattern, wrapped it in a test, deployed to production, and days, weeks, or months later, I find out there’s another variation of that string we have to match. With the unit test in place, I can start by writing a failing test, fix the regex pattern, and then run my test to make sure I’ve fixed the problem.
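
As a rough sketch, a test like that could look like this with Minitest, which normally ships with Ruby. The pattern under test and the test names are assumptions; swap in whatever pattern you are actually shipping.

```ruby
require "minitest/autorun"

# Hypothetical pattern under test: a dashed date, anchored to the whole string.
DATE_PATTERN = /\A(\d{2})-(\d{2})-(\d{4})\z/

class DatePatternTest < Minitest::Test
  def test_matches_a_dashed_date
    assert_match DATE_PATTERN, "02-02-2020"
  end

  def test_rejects_a_spelled_out_month
    refute_match DATE_PATTERN, "12 May 2020"
  end

  def test_rejects_trailing_text
    refute_match DATE_PATTERN, "02-02-2020 and more"
  end
end
```

When a new variation shows up in production, add it as another test method first, watch it fail, and then adjust the pattern.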

For Practice

There are plenty of challenges and practice tools out there, but Regex Crossword and HackerRank are my favorites.

Share Your Favorite Regex Patterns & Uses

Have you had to write a particularly hairy regex pattern to match a string, to find text in a document or codebase, or to validate some data? What other uses have you seen for regex? I’d love to see what others have experienced and are using them for. Follow me on Threads or LinkedIn and share it with me there!
