Regular Expressions: Searching for SSNs Across Text Files

I believe that the best way to learn litigation support is by working through real-life, on-the-job scenarios. This article is about a scenario that landed on my plate a few years ago.

In a previous article, I talked about the importance of creating cheat sheets in litigation support. After I figured out a solution to handle this scenario, I documented the steps in a cheat sheet that I could refer to later.

Before I share this scenario and solution with you, please keep in mind two of the cool things about a litigation support career, in my opinion: (1) there are multiple solutions to every scenario and (2) the solution you choose will depend on your expertise, the tools available to you and perhaps the timeline.

The Scenario

We were about to prepare a production using the Concordance database software, and we knew that some of the documents contained PII (Personally Identifiable Information) such as Social Security numbers (SSNs). We wanted to redact the SSNs in the production set.

For a reason I no longer remember, I did not trust the results I was getting when I searched the Concordance database specifically for SSNs.

For those of you who are not familiar with Concordance, each record (document) has a database field that contains all of the text of that document, and we use that field to search for words within the document.

[Image: Concordance text field]

On the back end, the document text for that database field comes from a single text file, which we refer to as a “document-level” text file. The file has a .TXT file extension. The text file contains either the “extracted text” from a “native” electronic document (Word, Excel, Email, PDF) or the “OCR text” from an “imaged” electronic document (scanned PDF, TIFF). Through an import process, we load all of those document-level text files into the database.

The Thought Process

I knew that all of those text files were sitting on the server within subdirectories. It occurred to me that I could search the text files directly, instead of searching the text in the Concordance database.

What tool could I use to search across text files? My first thought was “text editors”. I use both TextPad and UltraEdit. I chose TextPad for this project.

In order to search specifically for SSNs, I would need to tell the text editor what an SSN looks like. That brings us to the world of “regular expressions”. One reason I was excited to do this project is that I usually don't have time to learn more in-depth ways of using regular expressions, and I wish I did. On a daily basis, I only use regular expressions in very basic ways.

What are regular expressions, you ask? Simply put, they give us a way to identify text based on its pattern. For instance, an SSN written with dashes is always three digits, a dash, two digits, a dash and four digits (for example, 123-45-6789).
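
To make that concrete, here is a minimal sketch of the same pattern in Python. Python is used only for illustration (it was not part of the original project), and the sample text and the name SSN_PATTERN are made up for the example.

```python
import re

# Three digits, a dash, two digits, a dash, four digits.
# \b stops the match from starting or ending in the middle of a
# longer run of digits or letters.
SSN_PATTERN = re.compile(r"\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b")

# Hypothetical sample text: one SSN-shaped string and one phone number.
sample = "Employee SSN: 123-45-6789. Phone: 555-867-5309."

matches = SSN_PATTERN.findall(sample)
print(matches)  # ['123-45-6789']
```

The phone number is skipped because its middle group has three digits instead of two. A pattern like this only finds SSNs written with dashes; it will not catch nine digits run together or separated by spaces.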

I enjoyed solving this puzzle, at least as it pertains to TextPad. Keep in mind that each tool we use has a slightly different way of accomplishing the same task. I had to figure out how TextPad does it.

The Solution

Below are the steps I put into my cheat sheet:

[Image: Find In Files]

From the Search menu, select Find In Files.

[Image: SSN regular expression]

TextPad uses a different syntax for the repetition count. Instead of {count}, the syntax is \{count\}, so {3} becomes \{3\}. The regular expression that works in TextPad has three parts, separated by dashes:

1. Any digit 0-9, repeated exactly 3 times

2. Any digit 0-9, repeated exactly 2 times

3. Any digit 0-9, repeated exactly 4 times
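
The expression itself is not reproduced in the text here, so the one in the sketch below is a reconstruction from the three parts just described. Treat it as an assumption, not a copy of the original. The sketch also shows how the TextPad-style count syntax maps to the more common {count} form that Python uses; the names TEXTPAD_STYLE and to_common_syntax are hypothetical.

```python
import re

# Reconstructed TextPad-style expression (an assumption based on the
# three parts described in the list, joined by literal dashes).
TEXTPAD_STYLE = r"[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}"


def to_common_syntax(pattern: str) -> str:
    """Drop the backslashes before braces so the counts read as {3}, {2}, {4}."""
    return pattern.replace(r"\{", "{").replace(r"\}", "}")


common = to_common_syntax(TEXTPAD_STYLE)
print(common)  # [0-9]{3}-[0-9]{2}-[0-9]{4}

# The converted pattern matches an SSN-shaped string in full.
print(re.fullmatch(common, "123-45-6789") is not None)  # True
```

Whether a given tool expects the escaped or unescaped braces depends on the tool and on its settings, so it is worth testing the expression on a file where you already know the answer.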

[Image: SSN regular expression in TextPad]

1. Find what: Enter the search criteria using regular expression syntax.

2. In files: Leave the default file extensions or edit as necessary.

3. In folder: Browse to the folder that contains the text files that will be searched.

4. Conditions: Select Regular expression.

5. Report detail: Select either All matching lines to see the detail or select File counts only if that will suffice.

6. If the text files are stored in subfolders, select Search subfolders.

7. Click the Find button.
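
If you do not have TextPad, the same kind of search can be sketched in a short script. The version below is only an illustration of the steps above, not what I ran on the project: the function name find_in_files and its parameters are made up, and it assumes the text files are UTF-8 (or close enough that replacing unreadable bytes is acceptable).

```python
import re
from pathlib import Path
from typing import Iterator, Tuple

SSN_PATTERN = re.compile(r"\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b")


def find_in_files(folder: str, glob: str = "*.txt",
                  subfolders: bool = True) -> Iterator[Tuple[Path, int, str]]:
    """Yield (file, line number, line) for each line with an SSN-shaped string."""
    root = Path(folder)
    # rglob walks subfolders; glob stays in the top-level folder only.
    paths = root.rglob(glob) if subfolders else root.glob(glob)
    for path in sorted(paths):
        if not path.is_file():
            continue
        # errors="replace" keeps stray bytes in OCR text from stopping the search.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if SSN_PATTERN.search(line):
                    yield path, number, line.rstrip("\n")
```

Calling find_in_files on the folder that holds the document-level text files and printing each result gives roughly the "All matching lines" view. On a case-sensitive file system, the glob may need to be "*.TXT" to match the file extension.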

[Image: search results]

The search results pane will appear. The individual files will be listed, and the total number of occurrences and the total number of files will be displayed.
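
A per-file tally like that can also be sketched in a few lines of Python. This is an illustration only: the function name summarize, the file names and the sample lines are all made up, and the printed layout is not meant to mirror TextPad's.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

SSN_PATTERN = re.compile(r"\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b")


def summarize(lines_by_file: Iterable[Tuple[str, str]]) -> Counter:
    """Count SSN-shaped strings per file from (file name, line) pairs."""
    counts: Counter = Counter()
    for name, line in lines_by_file:
        found = len(SSN_PATTERN.findall(line))
        if found:  # only list files that actually contain a match
            counts[name] += found
    return counts


# Made-up sample lines standing in for real document text.
sample = [
    ("DOC0001.txt", "SSN 123-45-6789 and spouse SSN 987-65-4321"),
    ("DOC0001.txt", "Confirmed 123-45-6789 by phone"),
    ("DOC0002.txt", "Applicant SSN: 111-22-3333"),
    ("DOC0003.txt", "Nothing sensitive on this line"),
]

counts = summarize(sample)
for name, count in sorted(counts.items()):
    print(f"{name}: {count}")
print(f"Total: {sum(counts.values())} occurrence(s) in {len(counts)} file(s)")
```

With the sample data, DOC0001.txt has three occurrences, DOC0002.txt has one, and DOC0003.txt is left out because it has none.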

[Image: Word Wrap]

1. Select Word Wrap from the toolbar to display all of the text.

2. Click within the search results pane to make it the active window.

[Image: searching within the results]

From the Search menu, select Find.

[Image: searching within the results, continued]

1. Find what: Enter the search criteria using regular expression syntax.

2. Conditions: Select Regular expression.

3. Click the Find Next button.

The term will be highlighted within the line in which it appears.

In the end, I was able to confirm which documents contained SSNs and then redact those documents in the production set.

One of the benefits of learning on-the-job in litigation support is that there are plenty of scenarios like this that will test your ability to come up with a solution. Personally, I find that part of the job challenging and fun.

4 Comments

  1. Amy, great article. Wondering if this approach could be used to pull attorney names before first-pass review and final production.

  2. We use regular expressions heavily with TextPad, particularly coupled with block select and the end-of-line character.

    I’m sure you already know this, but Adobe Acrobat Professional has the ability to use regular expressions (or at least Adobe’s implementation of regex) to search and redact, and there is a canned / preloaded regex specifically for SSNs (that is, US SSNs). We have utilised this Acrobat Pro search-and-redact function with regex for a variety of tasks. It’s slow and cumbersome, but it does a pretty solid job, and it’s significantly faster than our puny human redaction capabilities!
