
Regular Expressions in Salesforce Apex®

Regular expressions (regexes) are a powerful tool that can be used in many applications, yet many programmers seem to be a bit afraid of them. Below is a short tutorial on what regular expressions are and how to use them in Salesforce Apex.

But before we get to the theory, let's start with practice: what can regexes be used for? A few examples:

  • Searches with and/or/not operators, such as finding words that end with "a" OR "e" but do NOT start with "g".
  • Searches using classes of characters, such as finding every 5-digit number in a file.
  • More advanced text replacement, for example finding every 5-digit number and removing its last digit.
  • Summing up all of the above, we can build very advanced text parsers that find (and replace) specific strings.
  • Validating strings against a specified format, such as postal codes, emails or telephone numbers.
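
As a rough preview of the 5-digit-number bullets, here is a minimal sketch. It is written in Java, because Apex regexes are based on Java's java.util.regex; in Apex the string literals would use single quotes instead. The class name FiveDigitDemo, its method names and the use of \b word boundaries are assumptions made for illustration, and the syntax itself is explained in the sections that follow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of Apex or Java.
final class FiveDigitDemo {
    // \b marks a word boundary, \d{5} means exactly five digits.
    private static final Pattern FIVE_DIGITS = Pattern.compile("\\b\\d{5}\\b");

    // Returns every standalone 5-digit number found in the text.
    static List<String> findFiveDigitNumbers(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = FIVE_DIGITS.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Keeps the first four digits (group 1) and drops the fifth.
    static String dropLastDigit(String text) {
        return text.replaceAll("\\b(\\d{4})\\d\\b", "$1");
    }
}
```

Shorter and longer numbers are left alone, because the word boundaries only allow runs of exactly five digits.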

What are the limitations of regexes?

  • Because of the way regexes work, they are a poor fit for parsing HTML/XML files (and generally any format with nested opening/closing tags). Of course we could write our own state machine built from multiple regular expressions, but it is really better to use a proper parser, for example the DOM, for such files.

OK, but what is a regex? A regular expression is an abstract way of describing a set of character strings. A regex can have matches in a string: a match is a portion of the text that meets the criteria of the regular expression.

Let's look at a quick example. We will create an expression that searches for every word ending with the letter 'e', and then gradually extend it. Let's assume a word is a run of letters with whitespace before and after it. The regex for such a family of strings will be:

[a-zA-Z]*e

[ ] - square brackets denote a character class. A class matches exactly one character, which can be any of the characters listed in it.
- - inside a class, means a range of characters. a-z means a or b or c or d or … z
* - means 0 or more occurrences of the previous element. Here the previous element is the whole [ ] class, so the parser will take any run of characters from that class
e – just a literal letter e

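Running our regex against the text below:

Here there, here there and again here e e e! The here is everywhere

gives the following matches: Here, there, here, there, here, e, e, e, The, here, everywhere.

It looks fine: every word with 'e' at the end was found. By the way, the standalone 'e' occurrences were also found, because the * quantifier means ZERO or more occurrences. Note that nothing in the pattern ties a match to the end of a word, so in a text containing "hello" it would also match "he".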
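
To check the matches of this first regex yourself, a small sketch like the one below may help. It is written in Java, because Apex regexes are based on Java's java.util.regex (Apex has Pattern and Matcher classes with the same find() and group() methods, and uses single-quoted strings). The class name MatchLister and the method allMatches are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Apex or Java.
final class MatchLister {
    // Collects every non-overlapping match of regex in text, left to right.
    static List<String> allMatches(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }
}
```

Calling MatchLister.allMatches("[a-zA-Z]*e", "hello") returns the single match "he", which shows that the pattern itself does not care where a word ends. Adding \b word boundaries around the pattern is one possible way to tighten it.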

Greediness of *

When writing regexes, please remember that the * quantifier is greedy: it always takes as many characters as possible. If we add a space to our [ ] class in the regex:

[a-z A-Z]*e

The matches would be different:

Here there, here there and again here e e e! The here is everywhere

There are now only three matches, each spanning several words: "Here there", " here there and again here e e e" and " The here is everywhere".

We included the space in our class, so every run of spaces and letters ending with 'e' was found. Because of the greediness of *, it is really important to keep your character class to the absolute minimum. Of course you can use non-greedy quantifiers, but that is out of the scope of this article.
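
The effect of greediness is easiest to see by comparing the first match of both patterns. The following is a minimal Java sketch (Apex regexes are based on Java's, so the same patterns should behave the same way in Apex); GreedyDemo and firstMatch are made-up names for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Apex or Java.
final class GreedyDemo {
    // Returns the first match of regex in text, or null when there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }
}
```

For a text starting with "Here there, here", the pattern [a-zA-Z]*e stops at "Here", while [a-z A-Z]*e runs on across the space and returns "Here there". The comma is not in the class, so it ends the match.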

Groups and back-references

Imagine we have the following problem: a file provided by company X has an invalid country code in all mobile numbers (but not in landline numbers). Let's say it's +84 instead of +48. We want to replace all occurrences of +84XXX-XXX-XXX with +48XXX-XXX-XXX. We will use character grouping for that.

A group is a way of organizing a regex into logical parts. It can be used for back-references, a very powerful tool for reusing parts of a match in the replacement text. Let's get straight to the point; we need something like this:

\+84(\d{3}-\d{3}-\d{3})

+ - just a regular plus; because it is also a special character in regex, it needs to be escaped with '\'
( ) - a group; the parentheses are transparent to the parser in the sense that they are not treated literally (to find a literal '(', you would have to use '\(')
\d – digit class, a shorthand for [0-9]
{3} – means exactly 3 occurrences of the previous element
- - just a regular dash

Now, to use the replace:

'...'.replaceAll('\\+84(\\d{3}-\\d{3}-\\d{3})','+48$1')

If we use this code on the following string:

+84123-123-123 +84111-222-333

We will receive:

+48123-123-123 +48111-222-333

Notes:

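When using a regex in Apex code, we have to escape every '\' with another '\', because the backslash is also the escape character in Apex string literals.

Groups are numbered from 0. Group 0 is always the whole match. The groups in parentheses are numbered from 1, from left to right, in the order of their opening parentheses.

If you wish to use a group in the replacement string, just write the group number preceded by '$'.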
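
The replacement described above can be sketched as a small, testable helper. The version below is Java rather than Apex (Apex regexes are based on Java's, and Apex's String.replaceAll takes the same regex and replacement arguments); the class name CountryCodeFix and the method fixPrefix are hypothetical.

```java
// Hypothetical helper for illustration; not part of Apex or Java.
final class CountryCodeFix {
    // A literal +84 followed by a captured XXX-XXX-XXX block (group 1).
    // Each regex backslash is doubled because of string-literal escaping.
    private static final String WRONG_PREFIX = "\\+84(\\d{3}-\\d{3}-\\d{3})";

    // Swaps the +84 prefix for +48 and keeps the captured digits via $1.
    static String fixPrefix(String text) {
        return text.replaceAll(WRONG_PREFIX, "+48$1");
    }
}
```

Numbers that do not follow the XXX-XXX-XXX layout are left untouched, which is how the pattern avoids changing differently formatted numbers.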

Usage in Apex

We have already covered the String.replaceAll method, but there is more than that. There are also classes called Pattern and Matcher. See the example:

Pattern p = Pattern.compile('fo*bar');
Matcher m = p.matcher('foobar');
m.matches(); // returns true: the whole string matches the pattern
m.group(0); // returns group 0 (the whole match) = 'foobar'
m.groupCount(); // returns 0: the pattern has no capturing groups, and group 0 is not counted

You can find more information in the Apex documentation for the Pattern, Matcher and String classes.
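
To see group numbering with real capturing groups, here is a minimal sketch that splits a phone number in the +48XXX-XXX-XXX format into its parts. It is written in Java; Apex's Pattern and Matcher are based on the Java classes and offer the same matches(), group() and groupCount() methods. The class name PhoneParts and the method groups are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Apex or Java.
final class PhoneParts {
    // Four capturing groups: the country code and three blocks of three digits.
    private static final Pattern PHONE =
            Pattern.compile("\\+(\\d{2})(\\d{3})-(\\d{3})-(\\d{3})");

    // Returns groups 0..groupCount() when the whole input matches,
    // or an empty list when it does not.
    static List<String> groups(String phone) {
        List<String> result = new ArrayList<>();
        Matcher m = PHONE.matcher(phone);
        if (m.matches()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                result.add(m.group(i));
            }
        }
        return result;
    }
}
```

For "+48123-456-789" the list holds the whole match first (group 0), followed by "48", "123", "456" and "789". Here groupCount() is 4, because group 0 is not included in the count.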

More about regular expressions

Of course, we have covered only the tip of the iceberg. Here are some more examples of regex syntax:

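(foo|bar) – search for 'foo' OR 'bar'
[a-z&&[^kb]] – search for any lowercase letter, but not k and not b (a ^ negates a class only when it is the first character inside the brackets, so [a-z^kb] would instead match a lowercase letter or a literal '^')
^foo – search for 'foo', but only at the beginning of the input (or of each line, when multiline mode is enabled with (?m))
bar$ - search for 'bar', but only at the end of the input (or of each line in multiline mode)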
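
These constructs can be tried out with a tiny helper like the one below. It is a Java sketch (Apex regexes are based on Java's, so the patterns should carry over, with single-quoted strings in Apex); MoreSyntaxDemo and contains are hypothetical names. The "letter but not k or b" case is written with Java's class intersection, [a-z&&[^kb]], and (?m) turns on multiline mode so that ^ and $ also match at line breaks.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Apex or Java.
final class MoreSyntaxDemo {
    // Returns true when regex matches anywhere inside text.
    static boolean contains(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }
}
```

For example, contains("^foo", "x\nfoo") is false because 'foo' is not at the start of the input, while contains("(?m)^foo", "x\nfoo") is true because 'foo' starts the second line.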

You can read the full specification of the Java regular expression language in the Java documentation for the java.util.regex.Pattern class (Apex uses the same implementation).

If you feel you need more practice, a great way to check how your regexes work is to use a dedicated application such as RegexBuddy. Unfortunately, it is not free. You can use the Notepad++ regex search instead, but please note that some elements may use a different notation.


Łukasz - Salesforce technical consultant/developer with over 3 years of experience. Python evangelist when it comes to non-SF solutions. Crazy about writing clean, readable and reusable code when not chased by the deadlines
