How Data In The Cloud Will Affect eDiscovery

David Carns is an attorney, technologist, and fan of all things eDiscovery. Follow him on LinkedIn and Twitter. My friend David is much geekier than me. This is a topic that many litigation support professionals probably haven't had the time to grasp yet, but it is coming up more often during data collections.

Cloud computing is taking off at a meteoric rate. More businesses are moving to services such as Microsoft 365, Google for Work, or Dropbox to store their business documents and email. Cloud computing offers security, redundancy, and ease-of-use that few internal IT departments can match – all at a lower price point. No wonder the enterprise is making its way to the cloud.

How will data in the cloud affect eDiscovery? First and foremost, collecting data from cloud sources requires a new way of approaching data collections. In order to collect data from the cloud you need to understand what kind of cloud computing service you are collecting from and choose appropriate techniques for each type.

Kinds of Cloud Computing

Cloud computing is a broad term that means different things to different people. It can be broken down into three distinct types: Software as a Service (SaaS), Infrastructure as a Service (IaaS), and Platform as a Service (PaaS).

Software as a Service (SaaS)

Software as a Service (SaaS) is the most common kind of cloud computing. Any software application you access as a web page is SaaS. Examples include Gmail, SharePoint and even Twitter. SaaS offers both an interface and backend service in a single, easy-to-access package.

How does SaaS affect eDiscovery?

SaaS tools like Gmail are popular with individuals and companies alike. Millions of people rely on Gmail for its email, contacts and instant messaging functionality. Getting data out of Gmail for eDiscovery, however, comes with complications.

Many eDiscovery experts swear by extracting data from Gmail over the POP3 or IMAP protocols with an email client like Microsoft Outlook. POP3 downloads all emails as if they were in the inbox proper, which is less than desirable for eDiscovery, since the storage path may be important metadata in a case. IMAP preserves folder information, but Gmail does not actually have folders; it has labels. With folders, an email can only be in one folder at a time, which means you must make a copy of an email if you want it to exist in multiple folders. With labels, an email can carry as many labels as you like without making copies. IMAP, however, presents each label as a folder, which means a collection over Gmail's IMAP implementation may download multiple copies of an email that the user never created. These extra copies can dramatically increase the volume of downloaded data.
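To make the label problem concrete, here is a minimal Python sketch of how the extra copies from an IMAP download could be collapsed back into one record per email, with the folder names kept as labels. The function name, the sample messages, and the choice to match copies on the Message-ID header are all assumptions made for illustration; a defensible workflow would likely also compare content hashes before treating two copies as the same email.

```python
import hashlib
from collections import defaultdict
from email import message_from_bytes


def collapse_label_copies(downloads):
    """Group IMAP-downloaded copies by Message-ID.

    `downloads` is an iterable of (folder_name, raw_bytes) pairs, one per
    copy found in each IMAP folder. Returns a dict mapping each Message-ID
    to the sorted list of folders (that is, Gmail labels) it appeared in.
    """
    labels_by_id = defaultdict(set)
    for folder, raw in downloads:
        msg = message_from_bytes(raw)
        message_id = msg.get("Message-ID")
        if message_id is None:
            # No Message-ID header: fall back to a content hash so that
            # only byte-identical copies are treated as the same email.
            message_id = "<no-message-id:" + hashlib.sha256(raw).hexdigest() + ">"
        labels_by_id[message_id.strip()].add(folder)
    return {mid: sorted(folders) for mid, folders in labels_by_id.items()}


# Hypothetical download: one email carries two labels, so IMAP returns it twice.
sample = [
    ("INBOX", b"Message-ID: <a1@example.com>\r\nSubject: Q3 forecast\r\n\r\nbody"),
    ("Finance", b"Message-ID: <a1@example.com>\r\nSubject: Q3 forecast\r\n\r\nbody"),
    ("INBOX", b"Message-ID: <b2@example.com>\r\nSubject: Lunch\r\n\r\nbody"),
]
collapsed = collapse_label_copies(sample)
```

In this made-up sample, three downloaded copies reduce to two emails, and the first one keeps both of its labels instead of appearing as two separate documents.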

A newer technique is to extract data from SaaS sources such as Gmail by using their official application APIs, when available. This provides a complete copy of the data and is officially supported by each SaaS provider. SaaS APIs also often authenticate using the OAuth protocol. OAuth supports encrypted authentication and never reveals custodian passwords to the person collecting the data.
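As a rough illustration of why the API route is attractive, the Python sketch below decodes a single message as an API might return it: one copy of the email, with its labels attached as a list. The dictionary shape (an "id", a "labelIds" list, and a base64url-encoded "raw" field) follows my understanding of the Gmail API's raw message format and should be checked against the provider's documentation. The function name and the sample values are hypothetical, and the OAuth-authenticated request that would fetch the message is deliberately left out.

```python
import base64
from email import message_from_bytes


def decode_api_message(resource):
    """Turn one API message resource into key headers plus its label list.

    `resource` is assumed to be a dict with "id", "labelIds", and a
    base64url-encoded "raw" field holding the full RFC 822 message.
    """
    raw = resource["raw"]
    # base64url payloads may arrive without padding; restore it before decoding.
    padded = raw + "=" * (-len(raw) % 4)
    raw_bytes = base64.urlsafe_b64decode(padded)
    msg = message_from_bytes(raw_bytes)
    return {
        "id": resource["id"],
        "labels": list(resource.get("labelIds", [])),
        "subject": msg.get("Subject"),
        "message_id": msg.get("Message-ID"),
        "size_bytes": len(raw_bytes),
    }


# Hypothetical resource, built locally so the sketch runs without network access.
rfc822 = b"Message-ID: <a1@example.com>\r\nSubject: Q3 forecast\r\n\r\nbody"
sample_resource = {
    "id": "17c0ffee",
    "labelIds": ["INBOX", "Label_12"],
    "raw": base64.urlsafe_b64encode(rfc822).decode("ascii").rstrip("="),
}
decoded = decode_api_message(sample_resource)
```

Because the labels travel with the single copy of the message, there is nothing to de-duplicate afterward, and the label information is preserved as metadata.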

Before you embark on a SaaS data collection, make sure you understand all of the protocols available to extract data, and use the protocol that is best for your situation. If you're not sure, be sure to consult with a qualified eDiscovery expert before embarking down the path of eDiscovery.

Infrastructure as a Service (IaaS)

Infrastructure as a Service (IaaS) allows providers to give access to servers, routers, storage and other “back office” systems over the Internet. IaaS providers can create the virtual servers needed to run new software within minutes and with no capital investment. If you find you need one terabyte of disk space, an IaaS provider can provide it at a moment's notice. Examples of IaaS include Amazon's Elastic Compute Cloud (EC2) and Microsoft's Azure.

How does IaaS affect eDiscovery?

In an IaaS infrastructure, creating and destroying virtual servers is effortless. Disk space used by these servers can also be created and removed at will. So, the first eDiscovery challenge in an IaaS scenario is verifying that you have accounted for all of the target servers and disk space and that no data source has been destroyed before you perform a collection. Another complication in IaaS collections is obtaining data in a usable format. Since all of the servers and disk space are virtual, it is often not possible to perform a block-level collection remotely. There are methods, however, to obtain the virtual images and virtual disk space using the IaaS provider's custom environment, but your mileage may vary. It is important to understand what is possible before you extract any particular type of data from an IaaS provider.

Ask each custodian’s IT organization about which cloud computing resource it is using and request a complete itemization of all cloud resources that have been created and destroyed during the time of interest in an eDiscovery matter. If you're not getting the information you need, be sure to consult with a qualified eDiscovery expert before embarking down the path of eDiscovery.
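Once you have that itemization, checking it against the time of interest is mostly bookkeeping. The Python sketch below shows one way it might be done, assuming the IT organization can supply a creation date and, where applicable, a destruction date for each resource. The record layout, resource names, and dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class CloudResource:
    resource_id: str
    kind: str                      # e.g. "server" or "disk"
    created: datetime
    destroyed: Optional[datetime]  # None means the resource still exists


def resources_in_scope(inventory, start, end):
    """Split resources that existed at any point in [start, end] into
    those still live and those that have since been destroyed."""
    live, gone = [], []
    for res in inventory:
        existed_before_end = res.created <= end
        survived_past_start = res.destroyed is None or res.destroyed >= start
        if existed_before_end and survived_past_start:
            (live if res.destroyed is None else gone).append(res.resource_id)
    return live, gone


# Hypothetical itemization supplied by the custodian's IT organization.
inventory = [
    CloudResource("vm-001", "server", datetime(2023, 1, 10), None),
    CloudResource("vm-002", "server", datetime(2023, 2, 1), datetime(2023, 3, 15)),
    CloudResource("disk-003", "disk", datetime(2022, 5, 1), datetime(2022, 12, 1)),
    CloudResource("disk-004", "disk", datetime(2023, 8, 1), None),
]
live, gone = resources_in_scope(
    inventory, start=datetime(2023, 3, 1), end=datetime(2023, 6, 30)
)
```

Anything that lands in the second list existed during the time of interest but is no longer available to collect, which is exactly the situation you want to know about before a collection begins.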

Platform as a Service (PaaS)

Platform as a Service (PaaS) gives you access to functional computing services on demand. You get the benefit of cloud scale in your custom application, without the headaches of managing the actual software and hardware that run the system. Examples of PaaS include Amazon's Simple Storage Service (S3), Google's BigQuery and Microsoft's SQL Azure Database.

How does PaaS affect eDiscovery?

PaaS environments may be the most difficult of all the cloud computing platforms from which to extract reliable eDiscovery data. For instance, Amazon S3, one of the world's most popular file storage services, provides developers with the ability to store photos, office documents, videos and database files all in one location. The popular Dropbox service was originally built on top of Amazon S3, as are many other popular tools. Amazon S3 is a PaaS service because it does not give developers a file server to store data, but rather an object-based interface to Amazon's own file servers. That level of abstraction allows Amazon to scale the service very broadly and across geographies.

For eDiscovery purposes, this raises some important issues. When collecting data from Amazon you must download not only the actual files but also all of the associated metadata, as it is all part of the object. Most tools that download data out of Amazon S3 do not provide the metadata by default, which means that it is easy to miss relevant data when downloading from Amazon S3. It takes a great deal of scrutiny to make sure all data has been collected.
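One way to keep the metadata from getting lost is to record it next to a hash of the file contents at the moment of collection. The Python sketch below assumes the object's bytes and its header information have already been retrieved, and that the header information arrives as a dictionary with "Metadata", "ContentType", "LastModified", and "ETag" keys, which is my understanding of what S3 client libraries return for a HEAD request. The function name, bucket, key, and values are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def manifest_entry(bucket, key, body, head):
    """Pair an object's content hash with the metadata stored alongside it.

    `head` is assumed to be a dict shaped like an S3 HEAD-object response.
    """
    last_modified = head.get("LastModified")
    return {
        "bucket": bucket,
        "key": key,
        "sha256": hashlib.sha256(body).hexdigest(),
        "size_bytes": len(body),
        "content_type": head.get("ContentType"),
        # ETags are commonly returned wrapped in double quotes; strip them.
        "etag": head.get("ETag", "").strip('"'),
        "last_modified": last_modified.isoformat() if last_modified else None,
        # User-defined metadata is part of the object and easy to overlook.
        "user_metadata": dict(head.get("Metadata", {})),
    }


# Hypothetical object and header values, for illustration only.
body = b"quarterly numbers"
head = {
    "ContentType": "text/plain",
    "ETag": '"9a0364b9e99bb480dd25e1f0284c8555"',
    "LastModified": datetime(2023, 4, 2, 15, 30, tzinfo=timezone.utc),
    "Metadata": {"uploaded-by": "jsmith"},
}
entry = manifest_entry("example-bucket", "reports/q1.txt", body, head)
```

A manifest built this way gives you something to point to later when asked whether the file and its metadata were collected together.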

When collecting data from PaaS services, ask thorough questions and expect broad answers. Many IT professionals do not understand the eDiscovery ramifications of the technology they have selected for their IT infrastructure. When asking legal questions about cloud computing, try to be specific and be willing to ask follow-up questions to general responses. If you're not confident with their answers, be sure to consult with a qualified eDiscovery expert before embarking down the path of eDiscovery.

Conclusion

eDiscovery and cloud computing are both here to stay. There are different kinds of cloud environments that affect how you perform eDiscovery collections. Understanding what type of cloud computing environment you are dealing with will enable you to make the best, informed decisions about data collections.

Once you establish whether you are collecting data from a SaaS, IaaS, or PaaS cloud environment you can choose the method appropriate for each situation. There are few easy answers when dealing with cloud-based data, but being informed will enable you to make the best decisions for your case.