Regular Expressions Not Equal to a Smile or a Frown

One of the nifty and powerful tools in our litigation support tool belt is “regular expressions.”

We use regular expressions to match patterns of data that we need to search for and then perhaps manipulate in some way. A few examples of easy-to-recognize “data patterns” are social security numbers (999-99-9999) and phone numbers (999-999-9999). Both values are consistent in their “data patterns” and both include a specific number of digits. If we need to search for a social security number that ends with 2334 or a United States phone number that begins with the area code 301, we can use regular expressions to (1) isolate the correct position in the value and (2) match the exact numeric value there.

Let's work through one example together.

For this example, we need to search for US phone numbers that look like either of these two patterns:

(999) 999-9999

999-999-9999

Here is an example of a regular expression that finds any US phone number that includes an area code, with or without parentheses surrounding the area code:

(\([0-9]{3}\) |[0-9]{3}-)[0-9]{3}-[0-9]{4}
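To see the expression in action, here is a minimal Python sketch. The pattern is copied unchanged from above; the sample sentence and the phone numbers in it are made up for illustration, and Python is just one of many tools that accept this syntax.

```python
import re

# The article's pattern, copied unchanged.
PHONE = re.compile(r"(\([0-9]{3}\) |[0-9]{3}-)[0-9]{3}-[0-9]{4}")

# Hypothetical sample text, invented for this sketch.
sample = "Call (301) 555-0142 or 202-555-0187; 555-0199 has no area code."

# finditer + group(0) returns each whole match; findall would return
# only the contents of the parenthesized group.
matches = [m.group(0) for m in PHONE.finditer(sample)]
print(matches)  # prints ['(301) 555-0142', '202-555-0187']
```

The seven-digit number is skipped because the expression requires an area code in one of the two forms.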

In terms of syntax, let's define the types of items used in this regular expression. As you read through this list, try to stare at the entire regular expression as a whole and remember the two patterns we need to find.

  1. Brackets ([ ]) define a set of characters; exactly one character from that set must match at that position. Here, [0-9] matches any single digit.
  2. Braces ({ }) mean that the preceding item (here, the bracketed set) must occur exactly the number of times given inside the braces.
  3. One set of parentheses is being used to encapsulate an OR statement.
  4. Another set of parentheses is being used to search for the literal parentheses surrounding an area code.
  5. A pipe (|) is being used as an OR operator.
  6. Backslashes signify that we literally want to search for the next character; in other words, the next character is not part of the regular expression syntax but part of the text string we are searching for.
  7. The hyphens are literally part of the text string.
  8. There is an intentional space in the text string, right after the closing parenthesis of the area code.
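The list above can also be written into the expression itself. The sketch below assumes Python's verbose mode (`re.VERBOSE`), which ignores whitespace in the pattern and allows `#` comments; the item numbers in the comments refer to the list above, and the test strings are hypothetical.

```python
import re

PHONE_VERBOSE = re.compile(
    r"""
    (                    # item 3: parentheses that hold the OR statement
        \( [0-9]{3} \)   # items 4 and 6: escaped, literal parentheses around 3 digits
        [ ]              # item 8: the intentional space (bracketed so verbose mode keeps it)
      |                  # item 5: the OR operator
        [0-9]{3} -       # item 7: area code without parentheses, then a literal hyphen
    )
    [0-9]{3} -           # items 1 and 2: one digit from the set, exactly 3 times, then a hyphen
    [0-9]{4}             # items 1 and 2 again: exactly 4 digits
    """,
    re.VERBOSE,
)

# Hypothetical test strings; fullmatch requires the whole string to fit the pattern.
candidates = ["(301) 555-0142", "301-555-0142", "(301)555-0142", "301 555 0142"]
results = {text: bool(PHONE_VERBOSE.fullmatch(text)) for text in candidates}
print(results)
```

Only the first two candidates match: the third lacks the space after the closing parenthesis, and the fourth uses spaces where the pattern expects hyphens.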

Okay, when you look at the regular expression as a whole, can you visualize where the two dashes are in the second phone number pattern (999-999-9999)?

regex-graphic-1

Keep visualizing the regular expression as a whole. Can you see the counts of 3, 3 and 4 that match both phone number patterns? Each phone number is 3 digits plus 3 digits plus 4 digits, right?

regex-graphic-2

Now focus in on the wider instance of open parentheses and closing parentheses.

regex-graphic-3

Inside this wider parenthetical, since we know the pipe character (|) represents the OR operator, can you visualize how it is searching for the area code with or without surrounding parentheses?

regex-graphic-4

This visualization technique leaves us with one last item to understand: each digit within the phone number itself can be any digit from 0 to 9.

regex-graphic-5

How did you do? I bet that you can stare at that regular expression now and understand exactly what it means. No more complexity.

Learning how to use regular expressions can be intimidating because, at first glance, it seems similar to trying to learn a programming language. Yes, formulating a regular expression can get complex, but spending the time to gain a basic understanding of the syntax (like you just did with me) will assist with many searching scenarios.

If you had any aha moments during this exercise, let me know in the comments area below, okay?

Bonus Points: At the beginning of this article, I mentioned that if we wanted to search for a US phone number with an area code of 301, we could use a regular expression. Now that you understand the example above, how would you edit it to only search for phone numbers with a 301 area code? You can do this! Trust your instincts.
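If you would like to check your answer after trying it yourself, here is one possible edit, sketched in Python. It is only one of several ways to do it: the digit sets for the area code are replaced with the literal digits 301 and everything else is left alone. The sample numbers are invented.

```python
import re

# One possible answer: swap [0-9]{3} in both area-code alternatives for 301.
AREA_301 = re.compile(r"(\(301\) |301-)[0-9]{3}-[0-9]{4}")

# Hypothetical sample numbers.
numbers = ["(301) 555-0142", "301-555-0142", "202-555-0187", "(202) 555-0187"]
hits = [n for n in numbers if AREA_301.fullmatch(n)]
print(hits)  # prints ['(301) 555-0142', '301-555-0142']
```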

NOTE: In case you're interested, I wrote another article entitled Regular Expressions – Searching for SSN Numbers Across Text Files, where I explained step by step how I used regular expressions on the job.

4 Comments

  1. I absolutely love using RegEx. A great, fast way to grab delimited data and group it is to use (.*?)
    For example, a load file of
    "Info","More Info","Even More Info"
    could be quickly matched with
    ^"(.*?)","(.*?)","(.*?)"$
    (^ = beginning of line $ = ending of line)

    The contents of each set of information within parentheses are assigned to group numbers, which can be manipulated when replacing. This is based on the order they appear, via the usage of \# where # is the number of a group. The contents of the first (.*?) are group 1, the second group 2, and the third group 3.

    From there, you can swap out all you want. If you wanted the contents of the third group to be in the place of the first, and the contents of the first group to be in the place of the third group, you’d

    SEARCH:
    ^"(.*?)","(.*?)","(.*?)"$

    REPLACE:
    "\3","\2","\1"

    This tends to be handy when having to swap out dates.
    2016-10-05

    This is YYYY-MM-DD, but what if you wanted it MM-DD-YYYY?

    SEARCH:
    ^(.*?)-(.*?)-(.*?)$ (2016 is group 1, 10 is group 2, and 05 is group 3; the ^ and $ anchors keep the last lazy group from matching nothing)
    REPLACE:
    \2-\3-\1

    Once you get more savvy, you can be more efficient with something like
    (\d{4})[\.\- \\]?(\d{1,2})[\.\- \\]?(\d{1,2})
    This finds a pattern of 4 digits followed by the usual suspects for date delimiters (period, dash, space, backslash, and no delimiter), then 1 or 2 digits, then delimiters, then one more set of 1 or 2 digits.

    NOTE: Not all text editors support features such as “grouping”. I personally prefer EditPad Pro for my RegEx work.

    Anyway, great article! RegEx is fantastic!

    -Jared

  2. Excellent, Amy. They look daunting when you see someone else’s example; however, as you say, they are straightforward once you understand what they are doing.
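The group-swapping replacements described in the first comment can also be tried outside a text editor. The sketch below assumes Python's `re.sub`, where `\1`, `\2`, and `\3` in the replacement string refer to the numbered groups; the load-file line and the date are the commenter's own examples, and the date pattern uses digit counts rather than (.*?) as an illustrative choice.

```python
import re

# Swap the first and third quoted fields of a load-file line.
load_line = '"Info","More Info","Even More Info"'
swapped = re.sub(r'^"(.*?)","(.*?)","(.*?)"$', r'"\3","\2","\1"', load_line)
print(swapped)  # prints "Even More Info","More Info","Info"

# Reorder a date from YYYY-MM-DD to MM-DD-YYYY.
date = "2016-10-05"
reordered = re.sub(r"^(\d{4})-(\d{1,2})-(\d{1,2})$", r"\2-\3-\1", date)
print(reordered)  # prints 10-05-2016
```

Other tools may write group references differently (for example `$1` instead of `\1`), so check the documentation for the editor you use.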
