In a previous article on the topic of electronic discovery, I began the conversation about a term we use called “processing”. As I mentioned in that article, the term “processing” encompasses many steps, which can make it one of the most difficult topics to teach to newbies in litigation support.
Some of the “processing” steps that I've covered so far are DeNIST, Deduplication, Embedded Objects, Exceptions, Password Protected Files, and Time Zone.
Indexing is another step in the “processing” stage, and it can take some time to complete, depending on the volume of data that needs to be indexed.
If an attorney is anxious (read: impatient) about the turnaround time to process a new set of electronic data, I usually explain to them that there are many steps in the process and, more specifically, we will not be able to search against the document database until the indexing step has been completed. If they are interested in hearing more, I explain what indexing is, and how it improves database searches.
In simple terms, the indexing process reads every word in every document and then generates a list of those words, sorted in alphanumeric order, along with a record of where each word appears. This makes information retrieval faster and more accurate when we perform a full-text search across all of the documents in the database.
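To make that concrete, here is a minimal sketch of the idea in Python. It is not how any particular review platform builds its index; the document IDs, the sample text, and the `build_index` function are all made up for illustration, and the rule for what counts as a "word" (runs of letters and digits, lowercased) is an assumption.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each lowercased word to the sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        # Assumption: a "word" is any run of letters or digits.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            postings[word].add(doc_id)
    # Store the words in alphanumeric order (digits sort before letters).
    return {word: sorted(ids) for word, ids in sorted(postings.items())}

# Hypothetical documents, standing in for a processed data set.
documents = {
    "DOC-001": "Maintenance schedule for pump 7",
    "DOC-002": "Invoice for pump repair",
    "DOC-003": "Pump maintenance log",
}

index = build_index(documents)
for word, doc_ids in index.items():
    print(word, doc_ids)
```

Each line of the printout is one entry in the sorted word list, followed by the documents that contain that word. That pairing is what a search can later look up instead of rereading the documents.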
A database index is comparable to an index at the end of a book or to a concordance (index) at the end of a transcript.
For example, let's say we need to search for the term “maintenance”. If there is no index, the search has to read one document after another, looking for the term we want, so it takes much longer. If an index exists, the search is much faster because it can jump right to the term within the sorted list of words, and the index already records every document where that term appears.
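The difference between the two approaches can be sketched in a few lines of Python. Everything here is hypothetical (the documents, the function names, the tokenizing rule), and real search engines use far more sophisticated structures, but the contrast is the same: one search reads every document, the other jumps into a sorted word list using a binary search.

```python
import bisect
import re

def tokenize(text):
    # Assumption: a "word" is any run of letters or digits, lowercased.
    return re.findall(r"[a-z0-9]+", text.lower())

def scan_search(documents, term):
    """No index: read every document, one after another."""
    return sorted(doc_id for doc_id, text in documents.items()
                  if term in tokenize(text))

def build_sorted_index(documents):
    """Build a sorted word list plus a parallel list of document IDs per word."""
    postings = {}
    for doc_id, text in documents.items():
        for word in tokenize(text):
            postings.setdefault(word, set()).add(doc_id)
    words = sorted(postings)
    return words, [sorted(postings[w]) for w in words]

def index_search(words, doc_lists, term):
    """With an index: binary-search the sorted word list, then read the answer."""
    pos = bisect.bisect_left(words, term)
    if pos < len(words) and words[pos] == term:
        return doc_lists[pos]
    return []

# Hypothetical documents for illustration only.
documents = {
    "DOC-001": "Maintenance schedule for pump 7",
    "DOC-002": "Invoice for pump repair",
    "DOC-003": "Pump maintenance log",
}

words, doc_lists = build_sorted_index(documents)
slow = scan_search(documents, "maintenance")
fast = index_search(words, doc_lists, "maintenance")
print(slow, fast)
```

Both searches return the same documents. The cost of building the index is paid once, up front, which is why indexing adds to processing time but every later search benefits.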
Keep in mind that if an index already exists and we add new electronic data, we need to re-index so that all of the new words are added and sorted along with the words already in the index. Likewise, if documents are removed from a database, we re-index so the index no longer points to them.
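Here is a small Python sketch of why the re-index matters. The documents and the `build_index` function are hypothetical, and this version simply rebuilds the whole index from scratch; many search engines can update an index incrementally instead, so treat this as an illustration of the effect rather than of any product's method.

```python
import re

def build_index(documents):
    """Map each lowercased word to the sorted list of document IDs containing it."""
    postings = {}
    for doc_id, text in documents.items():
        # Assumption: a "word" is any run of letters or digits.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            postings.setdefault(word, set()).add(doc_id)
    return {word: sorted(ids) for word, ids in sorted(postings.items())}

# Hypothetical starting database with one document.
documents = {"DOC-001": "Pump maintenance log"}
index = build_index(documents)

# New data arrives, but the old index has not been refreshed yet.
documents["DOC-002"] = "Warranty claim for pump"
stale_hit = "warranty" in index          # the stale index misses the new word

index = build_index(documents)           # re-index after adding documents
fresh_hit = "warranty" in index
pump_docs = index["pump"]

# A document is removed, so we re-index again.
del documents["DOC-001"]
index = build_index(documents)
removed_word_gone = "maintenance" not in index

print(stale_hit, fresh_hit, pump_docs, removed_word_gone)
```

Until the re-index runs, a search for a word that exists only in the new documents finds nothing, and a word that existed only in a removed document would still point to a document that is no longer there.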
Now, there is much more to this topic, and the details depend on which search engine is used to create the index. Some of the search engines you might hear about from a software provider are dtSearch, Lucene, SQL Server and Elasticsearch. There are also different types of indices, such as an “inverted index” and “latent semantic indexing”. In addition, many of our document databases have multiple indices that are used in different types of targeted searches.
For now, I want you to focus on understanding that indexing is one of the electronic discovery “processing” steps, and I want you to be able to explain to an attorney how indexing improves the legal team's database searching.
Don't forget – if you're trying to search your database and you're getting frustrated with the results, you might want to learn about Stop Words.
Very good. I like the paragraph that starts with “If an attorney is anxious”, as it gives me hope that somewhere out there is a world where lawyers/attorneys aren’t anxious, actually take what you say at face value, and don’t keep pestering you to tell them how long it ‘really takes’, i.e. they appear to operate under the assumption that we’re outright lying about how long things take – after all, you just click a button, right!?
Hey Matthew – That’s why I’m here, to give you some hope. Ha! You described the reality of our role, for sure.