Regex is a powerful tool for day-to-day programming tasks. It can match text against a set of patterns or search for specific values in a text, for example to find all occurrences of a certain word or phrase, or all the lines that contain a certain string. Many ready-made regex patterns are available in online resources, and you can try out your own patterns with command-line tools such as grep and sed. When you use regex in your day-to-day programming tasks, it is important to be aware of the following:
• The ^ character is an anchor. Outside of square brackets it does not match a character at all; it matches the start of the string (or the start of each line in multiline mode). As the first character inside square brackets, it negates the set instead.
• The $ character is also an anchor. It matches the end of the string (or the end of each line in multiline mode), which lets you require that a match finishes where the text does.
Regex, short for regular expression, is often used in programming languages for matching patterns in strings, find and replace, input validation, and reformatting text. Learning how to properly use Regex can make working with text much easier.
Regex Syntax, Explained
Regex has a reputation for having horrendous syntax, but it’s much easier to write than it is to read. For example, here is a general regex for an RFC 5322-compliant email validator:
If you think it looks like someone smashed their face into the keyboard, you’re not alone. But under the hood, all of this mess is actually programming a finite-state machine. This machine steps through the input one character at a time, chugging along and matching based on rules you’ve set. Plenty of online tools will render railroad diagrams, showing how your Regex machine works. Here’s that same Regex in visual form:
Still very confusing, but it’s a lot more understandable. It’s a machine with moving parts that have rules defining how it all fits together. You can see how someone assembled this; it’s not just a big glob of text.
First Off: Use a Regex Debugger
Before we begin, unless your Regex is particularly short or you’re particularly proficient, you should use an online debugger when writing and testing it. It makes understanding the syntax much easier. We recommend Regex101 and RegExr, both of which offer testing and a built-in syntax reference.
How Does Regex Work?
For now, let’s focus on something much simpler. This is a diagram from Regulex for a very short (and definitely not RFC 5322 compliant) email-matching Regex:
The Regex engine starts at the left and travels down the lines, matching characters as it goes. Group #1 matches any character except a line break, and will continue to match characters until the next block finds a match. In this case, it stops when it reaches an @ symbol, which means Group #1 captures the name of the email address and everything after matches the domain.
The Regex that defines Group #1 in our email example is: (.+)
The parentheses define a capture group, which tells the Regex engine to include the contents of this group’s match in a special variable. When you run a Regex on a string, the default return is the entire match (in this case, the whole email). But it also returns each capture group, which makes this Regex useful for pulling names out of emails.
The period is the symbol for “Any Character Except Newline.” This matches everything on a line, so if you passed this email Regex an address with %$#^&%*#%$#^ in front of the @ symbol, it would match %$#^&%*#%$#^ as the name, even though that’s ludicrous.
The plus (+) symbol is a control structure that means “match the preceding character or group one or more times.” It ensures that the whole name is matched, and not just the first character. This is what creates the loop found on the railroad diagram.
The rest of the Regex is fairly simple to decipher:
The first group stops when it hits the @ symbol. The next group then starts, which again matches multiple characters until it reaches a period character.
Because characters like periods, parentheses, and slashes are used as part of the syntax in Regex, anytime you want to match those characters you need to escape them with a backslash. In this example, to match the period we write \. and the parser treats it as one symbol meaning “match a period.”
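The walkthrough above can be sketched with Python's re module. The full pattern is not reproduced in this text, so (.+)@(.+\..+) is an assumption based on the description (two capture groups, a literal @, and an escaped period), and the sample address is hypothetical:

```python
import re

# Assumed pattern: name group, literal @, then a domain group
# that contains an escaped (literal) period.
EMAIL = re.compile(r"(.+)@(.+\..+)")

# Hypothetical sample address, used only for illustration.
match = EMAIL.fullmatch("jane.doe@example.com")

whole = match.group(0)   # the entire match
name = match.group(1)    # capture group #1: everything before the @
domain = match.group(2)  # capture group #2: everything after the @

print(whole, name, domain)
```

Group 0 is the default "entire match" result, while groups 1 and 2 are the capture groups defined by the parentheses.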
Character Matching
If you have non-control characters in your Regex, the Regex engine treats them as literal characters that form a matching block. For example, the Regex he+llo
Will match the word “hello” with one or more e’s. Any character that is part of the Regex syntax needs to be escaped to be matched literally.
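Here is a minimal Python sketch of literal matching and escaping. The pattern he+llo and the sample strings are assumptions chosen to illustrate the point, not taken from the original example:

```python
import re

# Literal characters match themselves; + repeats the preceding "e".
pattern = re.compile(r"he+llo")  # assumed example pattern

one_e = bool(pattern.fullmatch("hello"))      # a single e
many_e = bool(pattern.fullmatch("heeeello"))  # several e's
no_e = bool(pattern.fullmatch("hllo"))        # no e at all

# Syntax characters such as $ and . must be escaped to match literally.
price = re.search(r"\$5\.00", "Total: $5.00")

# re.escape adds the backslashes for you.
escaped = re.escape("a.b")

print(one_e, many_e, no_e, price.group(0), escaped)
```

Using re.escape is handy when the literal text comes from user input and may contain syntax characters.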
Regex also has character classes, which act as shorthand for a set of characters. These can vary based on the Regex implementation, but these few are standard:
. – matches anything except newline.
\w – matches any “word” character, including digits and underscores.
\d – matches digits.
\s – matches whitespace characters (i.e., space, tab, newline).
The last three all have uppercase counterparts that invert their function. For example, \D matches anything that isn’t a digit.
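As a short illustration in Python, the standard shorthand classes are written with a leading backslash (\w, \d, \s), and their uppercase forms invert them. The sample text below is made up for the example:

```python
import re

text = "Room 42\tfloor_3"  # hypothetical sample text

digits = re.findall(r"\d+", text)  # runs of digits
words = re.findall(r"\w+", text)   # runs of letters, digits, underscores
spaces = re.findall(r"\s", text)   # individual whitespace characters
non_digits = re.findall(r"\D+", "abc123def")  # inverted class

print(digits, words, spaces, non_digits)
```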
Regex also has character-set matching. For example: [abc]
Will match either a, b, or c. This acts as one block, and the square brackets are just control structures. Alternatively, you can specify a range of characters: [a-c]
Or negate the set, which will match any character that isn’t in the set: [^a-c]
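The three forms of character set can be compared side by side in Python. The patterns [abc], [a-c], and [^a-c] and the sample string are illustrative assumptions:

```python
import re

text = "abcdef"  # hypothetical sample text

either = re.findall(r"[abc]", text)    # explicit set: a, b, or c
in_range = re.findall(r"[a-c]", text)  # the same set written as a range
negated = re.findall(r"[^a-c]", text)  # anything NOT in the range a-c

print(either, in_range, negated)
```

Each set still consumes exactly one character per match, which is why findall returns single-character strings here.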
Quantifiers
Quantifiers are an important part of Regex. They let you match strings where you don’t know the exact format, but you have a pretty good idea.
The + operator from the email example is a quantifier, specifically the “one or more” quantifier. If we don’t know how long a certain string is, but we know it’s made up of word characters (letters, digits, and underscores) and isn’t empty, we can write: \w+
In addition to +, there’s also:
• The * operator, which matches “zero or more.” Essentially the same as +, except it has the option of not finding a match.
• The ? operator, which matches “zero or one.” It has the effect of making a character optional; either it’s there or it isn’t, and it won’t match more than once.
• Numerical quantifiers. These can be a single number like {3}, which means “exactly 3 times,” or a range like {3,6}. You can leave out the second number to make it unlimited. For example, {3,} means “3 or more times.” Oddly enough, many Regex flavors don’t let you leave out the first number, so if you want “3 or fewer times,” write the range explicitly as {0,3}.
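The quantifiers above can be checked with a few small Python experiments. All patterns and inputs here are made-up illustrations:

```python
import re

# * allows zero repetitions, + requires at least one.
star = re.fullmatch(r"ab*", "a")  # matches: zero b's is fine
plus = re.fullmatch(r"ab+", "a")  # None: at least one b is required

# ? makes the preceding character optional (zero or one).
optional = [bool(re.fullmatch(r"colou?r", w))
            for w in ("color", "colour", "colouur")]

# Numerical quantifiers: exact count, bounded range, open-ended range.
exactly = re.fullmatch(r"\d{3}", "123")
ranged = [bool(re.fullmatch(r"\d{3,6}", s))
          for s in ("12", "1234", "1234567")]
at_least = re.fullmatch(r"\d{3,}", "1234567")
at_most = re.fullmatch(r"\d{0,3}", "12")  # explicit 0 as the lower bound

print(bool(star), bool(plus), optional, ranged)
```

Writing the lower bound explicitly, as in {0,3}, keeps the pattern portable across Regex flavors.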
Greedy and Lazy Quantifiers
Under the hood, the * and + operators are greedy. They match as much as possible, then give back only what is needed for the next block to match. This can be a massive problem.
Here’s an example: say you’re trying to match HTML, or anything else with closing brackets. Your input text is: <div>Hello World</div>
And you want to match everything within the brackets. You may write something like: <.*>
This is the right idea, but it fails for one crucial reason: the Regex engine matches “div>Hello World