Databases

OCR Text vs. Extracted Text

By Amy Bowser-Rollins 04/30/2015

Another topic I cover when training a litigation support newbie is how we go about getting searchable text for the documents in our document databases.

One of the most important reasons we provide litigators with document databases is so the legal team can search across all of the documents to find the (a) relevant documents, (b) harmful documents and (c) helpful documents.

Since searching the documents is such a high priority, we need to find ways to make sure there is “searchable text” for most of the documents.

The first way we accomplish this is to simply “extract the text” from the native electronic files. There are many software programs that perform this task. Extracted text is considered nearly 100% accurate in terms of performing searches.

The second way is a little more work. There will be native electronic files that do not play nice with the text extraction software. Sometimes it is because there simply isn't any text to be extracted. Other times a technical issue with the file prevents the text from being extracted. We have workarounds for some of these technical issues, and we eventually succeed in extracting text from some of these files.

There are other electronic files that will require a process referred to as “optical character recognition” (OCR) before we can get searchable text for the file. The OCR software can be run against a batch of files, but it can also take a while to complete the process. In the end, we get OCR text that we consider to be about 85% accurate.
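The decision between keeping the extracted text and sending a file to OCR can be sketched roughly as follows. This is a minimal illustration, not how any particular processing tool works: the `ExtractionResult` record, the `needs_ocr` and `triage` helpers, and the 25-characters-per-page threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical threshold (an assumption, not an industry standard): fewer than
# this many non-whitespace characters per page suggests an image-based file.
MIN_CHARS_PER_PAGE = 25


@dataclass
class ExtractionResult:
    control_number: str   # unique identifier for the record
    page_count: int       # number of pages in the file
    extracted_text: str   # whatever the text extraction step returned


def needs_ocr(result: ExtractionResult,
              min_chars_per_page: int = MIN_CHARS_PER_PAGE) -> bool:
    """Return True when extraction produced too little text to search."""
    visible_chars = sum(1 for ch in result.extracted_text if not ch.isspace())
    pages = max(result.page_count, 1)  # guard against a zero page count
    return visible_chars / pages < min_chars_per_page


def triage(results):
    """Split a batch into (searchable, ocr_queue) lists of control numbers."""
    searchable, ocr_queue = [], []
    for result in results:
        target = ocr_queue if needs_ocr(result) else searchable
        target.append(result.control_number)
    return searchable, ocr_queue
```

A per-page threshold, rather than a simple "is the text empty" test, also catches files where the only extractable text is something short like a stamped number. The right threshold would depend on the document set.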

We must not forget that some of the documents collected in litigation matters are hardcopy documents. They need to go through the process of being scanned and then OCR'd. There are several quality check stages during this process. Dealing with the conversion of hardcopy documents into a format we can use in a document database can take a good deal of time.

When we advise the legal team, we make sure they are aware of what percentage of the database contains extracted text versus OCR text. They need to understand any potential limitations of their search results.
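As a rough sketch of how that percentage might be calculated, suppose each record in the database carries a field that says where its searchable text came from. The function name `text_source_breakdown` and the labels `"extracted"` and `"ocr"` are assumptions made for this example; your database system may track this differently or not at all.

```python
from collections import Counter


def text_source_breakdown(text_sources):
    """Return the percentage of records for each text source label.

    `text_sources` is an iterable with one label per record,
    for example "extracted" or "ocr" (hypothetical labels).
    """
    counts = Counter(text_sources)
    total = sum(counts.values())
    if total == 0:
        return {}  # empty database: nothing to report
    return {source: round(100 * n / total, 1) for source, n in counts.items()}


# Example: three records with extracted text and one with OCR text.
print(text_source_breakdown(["extracted", "extracted", "extracted", "ocr"]))
# prints {'extracted': 75.0, 'ocr': 25.0}
```

A report like this gives the legal team a quick sense of how much of the database relies on the less accurate OCR text.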

3 Comments

mgolab says:

04/30/2015 at 6:54 pm

Thanks, Amy.

To me this is a key issue when clients conduct their own collection, in that you have to weigh up the risk posed by the percentage of image-based attachments versus searchable ones. With the [theoretical at least] trend of companies generating less paper, there is a pretty big trend of scanners emailing you scanned docs, which then sit in mailboxes and email archives as inert image-based files that can’t be searched in their native environments.

There is also the slight ‘elephant in the room’ issue of handwriting and the [to me at least] relatively poor capability of the standard OCR tools to handle handwriting.

Finally, there is also the consideration of the original language of the image and the capabilities of the OCR system in terms of language identification.

It would be interesting to hear your thoughts on the OCR tools out there such as Adobe Acrobat and ABBYY etc.

1. Amy Bowser-Rollins says:
  
  05/28/2015 at 1:22 pm
  
  Thanks for sharing your wisdom, Matthew. I have found Adobe’s OCR results to be better than some other tools and I use it often. It is also a good solution for a small firm that can’t afford to purchase litigation support processing tools. I haven’t used ABBYY in a while, but I know some people do.
  
Patt F. says:

08/18/2015 at 3:16 pm

Another problem is when you get a production in PDF format and the Bates numbers have been applied using Adobe. In some processing software, when you try to OCR those documents, your text file will only contain the Bates number! 🙁
