Processing of Electronic Data Discovery

EDD stands for Electronic Data Discovery. In my last post I talked about processing hard copy; this post looks at how we collect and process electronic documents such as emails and their attachments.

Given that a lawyer has a duty to make a reasonable search for relevant documents in a case, you can see how the arrival of email opened up new horizons in the field of discovery. There is an argument that the first electronic mail was sent around 1971, but realistically it was in the early nineties that email came into everyday use. Over the following years it was used more and more, and litigation software developers saw an opportunity to ride the roller coaster. Lawyers and their clients were printing emails, then scanning and coding them as described in my post on digitising documents.

There were many problems with this practice, and it took technology experts and lawyers some time to recognise what they were. For example, if someone was “bcc’d” on an email, the printed email would not show the blind copy recipient, so a lawyer would not know that this recipient knew of its contents. Deleted emails would never be printed, obviously. People wised up, realising that if you changed the text of an email from black to white you would not see it when the email was printed. Ever tried printing an Excel spreadsheet? It is so difficult, and you can get pages and pages of blank cells and hard-to-read chronological entries. From all of these issues, and more, came technology to process electronic documents. The revolution started in the USA in the mid nineties and the UK became seriously interested around 2000. I spoke at the first EDD seminar in the UK, and I had an article on the subject published in Technolawyer in February 2003 (contact me if you would like a copy of this, as well as other articles I have had published).

Before dealing with EDD processing, I want to spend a few minutes on the subject of data collection. When a lawyer begins to learn from his client the whereabouts of what could be relevant data, he has to ask about PCs, laptops, servers, email archive systems and procedures, and other devices such as cell phones and tablets. Then a decision has to be made about how to obtain the data. In essence there are 3 possible methods:

1. Self collection by the client (which I do not recommend!). How can the lawyer know if everything has been collected? How can the client know how to collect, for example, deleted emails?
2. Engaging a professional forensic services provider (which I do recommend).
3. What I call a hybrid collection, whereby a forensic provider works with a competent IT department at the client company to organise the collection (which I sometimes recommend!).

Forensics is a whole topic on its own so, for the time being, that is as much as I will say here.

Once the data has been collected it needs to be processed so that its contents can be searched and prepared for discovery and/or trial. As mentioned in my previous post, it is neither practical nor efficient to print all of these electronic documents, partly because some documents would be missed, partly because we would be printing documents that turn out to be irrelevant, but mostly because corporate clients would not wish to pay for an unnecessary and unreliable exercise. Now, there are many software solutions which process electronic documents for our purposes and I cannot list them all, nor do I know them all, but I will mention the two that I am most familiar with and like. One is the Australian product Nuix, and the other is eCapture by Ipro in the USA. These solutions process data at varying speeds and extract the metadata and text of each document.

Pausing there, you may have heard of metadata, you may know what it is or you may have some idea of what it is. When I first started in this industry, I kept hearing about metadata and I asked a number of people in IT or within the industry to explain it to me. Almost 8 out of 10 said exactly the same thing: “It is data about data”. Well, as a simple non-technical practising litigator, that meant precisely nothing to me, so I learned about it and then came up with my own practical description. It tells you “WHO knew WHAT WHEN”. In any matter that is precisely what the lawyer needs to know. It includes fields that I mentioned in my previous description of coding, but geared to the way electronic documents are created, such as From, To, CC, BCC, date and time, subject, and type of doc. It goes further, too, and can include such fields as the date the doc was last modified and many, many more. These systems can handle hundreds, even thousands, of file types and extract literally hundreds of metadata fields. As with coding, once the processed data is exported into a database solution, the metadata fields can be searched.
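The text extraction allows for text searching but, unlike OCR, text extraction from electronic documents is 100% accurate, because the text is read directly from the file rather than recognised from a scanned image.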
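To make “WHO knew WHAT WHEN” a little more concrete, here is a minimal sketch in Python of pulling those fields out of a single email. It uses only Python's standard `email` module and is not how Nuix or eCapture work internally; the sample message, the addresses and the `extract_metadata` function name are all invented for illustration.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical sample message, invented purely for illustration.
RAW_EMAIL = """\
From: Alice Example <alice@example.com>
To: Bob Example <bob@example.com>
Cc: carol@example.com
Bcc: dave@example.com
Date: Mon, 03 Mar 2003 09:15:00 +0000
Subject: Q1 figures

Please see the attached spreadsheet.
"""


def extract_metadata(raw: str) -> dict:
    """Return the WHO / WHAT / WHEN fields of one plain-text email."""
    msg = message_from_string(raw)

    def addresses(header: str) -> list:
        # get_all() returns every occurrence of the header (or [] if absent).
        return [addr for _name, addr in getaddresses(msg.get_all(header, []))]

    return {
        # WHO
        "from": addresses("From"),
        "to": addresses("To"),
        "cc": addresses("Cc"),
        "bcc": addresses("Bcc"),  # never visible on a printed copy
        # WHEN
        "sent": parsedate_to_datetime(msg["Date"]),
        # WHAT
        "subject": msg["Subject"],
        "text": msg.get_payload().strip(),  # extracted body text, no OCR needed
    }


metadata = extract_metadata(RAW_EMAIL)
for field_name, value in metadata.items():
    print(f"{field_name:8} {value}")
```

Each key in the resulting dictionary corresponds to a searchable field once the data is loaded into a review database. A real processing engine does the same kind of thing across many file types and far more fields.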

During processing there are multiple features available, including “de-NISTing”, which is the removal of unwanted files such as system files, and de-duplication. The latter is almost a topic in itself, but for the purposes of this post the name is self explanatory. What is clever is that even when duplicates are removed, the metadata of the removed docs is captured for use if necessary. These systems will also always maintain the family relationships between parent emails and their child attachments. The net amount of data can then be speedily exported ready for review, either in native format or converted to TIFF or PDF as required. Native format means what it says: if a Word document is being processed, it is exported as a Word document. Converting to TIFF or PDF at this stage is now rare. I advocate keeping documents native for as long as possible: review is aided as the lawyer sees the document in its native format, and the client is happy as it saves the cost and time of converting irrelevant documents to TIFF or PDF. In the early days of EDD processing everything was converted to TIFF and there were some horror stories, such as a US client receiving a bill at a per-page TIFF cost for over 800,000 pages of blank-cell spreadsheets. The industry and clients learned, and I have seen huge strides over the last decade.
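For readers who like to see the mechanics, the following Python sketch shows one simple way these three ideas can fit together. It rests on several assumptions: files are compared by a SHA-1 hash of their raw content (commercial tools typically hash a combination of fields for emails), the “known system file” list is a plain set of hashes standing in for the reference hash list published by NIST from which de-NISTing takes its name, and the `Doc` class, `process` function and sample data are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Doc:
    doc_id: str
    custodian: str
    content: bytes
    parent_id: Optional[str] = None  # set on attachments, None on parents
    duplicate_custodians: List[str] = field(default_factory=list)


def sha1_of(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def process(docs: List[Doc], known_system_hashes: Set[str]) -> List[Doc]:
    """De-NIST and de-duplicate parents; attachments follow their parent."""
    kept_by_hash = {}
    kept_ids = set()
    for doc in docs:
        if doc.parent_id is not None:
            continue  # attachments are decided by their parent, below
        digest = sha1_of(doc.content)
        if digest in known_system_hashes:
            continue  # de-NISTing: drop known system files
        original = kept_by_hash.get(digest)
        if original is not None:
            # Duplicate: drop the copy but keep a record of who held it.
            original.duplicate_custodians.append(doc.custodian)
            continue
        kept_by_hash[digest] = doc
        kept_ids.add(doc.doc_id)
    # Keep each surviving parent together with its own attachments.
    return [d for d in docs if d.doc_id in kept_ids or d.parent_id in kept_ids]


# Hypothetical sample data.
docs = [
    Doc("E1", "alice", b"email body A"),
    Doc("A1", "alice", b"spreadsheet bytes", parent_id="E1"),
    Doc("E2", "bob", b"email body A"),  # same email held by a second custodian
    Doc("A2", "bob", b"spreadsheet bytes", parent_id="E2"),
    Doc("S1", "alice", b"system dll bytes"),
]
known_system_hashes = {sha1_of(b"system dll bytes")}

result = process(docs, known_system_hashes)
kept_ids = [d.doc_id for d in result]
print(kept_ids, result[0].duplicate_custodians)
```

In this sketch the second copy of the email and its attachment are dropped as a family, yet the surviving email still records that “bob” also held a copy, which is the kind of captured metadata described above.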

I will end this post with the best success story I have ever had relating to EDD processing. A few years ago we were working on what was then the world’s largest internal financial investigation. We scanned millions of pages on site in Germany and processed terabytes of data on a daily basis in London for over 2 years. One day we received a CD from our client, an international law firm, and they asked us to process it as a matter of urgency as it belonged to one of the most senior executives of the global client organisation. They said it would not take long as it was only a CD and could not contain more than around 300MB. We were using eCapture at the time and loaded the data, whereupon we immediately discovered that there was actually 2GB. The lawyers could not understand how it was possible to fit 2GB on a 300MB CD, and we explained that there were likely to be some embedded items, so we processed the data. We began to be suspicious about its contents as we saw some strange happenings, such as Word docs which had had their file extensions changed to .exe (no system would be fooled that easily!), but then we found an innocuous PowerPoint presentation which our system did not like. There was a tiny icon on one page, almost invisible to the naked eye, but the system caught it and drilled down to discover literally thousands of pages of Excel spreadsheets containing this guy’s embezzled accounts. Either he or someone on his behalf had managed to embed an .xls inside a .ppt. In those days (about 10 years ago) this was really clever stuff, but even better was that our system found it. Naturally we reported it to our clients and disclosed the data, and the outcome was that the executive spent the next few years languishing in a German prison! I was proud that we discovered this, and incidentally we tried the data through 2 other systems which did not detect the spreadsheets. Our clients felt that we had broken new ground.
Now, of course, several years on, I would expect any self-respecting system to find this, but way back then... I have dined out on this story many times and I am very happy to share it with you.
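As an aside on why a renamed file extension fools nobody, here is a small Python sketch of the general idea: look at the first few bytes of the file (its “signature”) rather than trusting its name. This is an illustration only, not a description of how eCapture or any other product does it. The signature values shown are the well-known ones for these formats, while the table layout and the `sniff` and `extension_mismatch` function names are hypothetical. Finding an object embedded inside a presentation takes a good deal more than this and is not covered here.

```python
import os
from typing import Optional

# Leading bytes that identify a few common file formats.
SIGNATURES = [
    (bytes.fromhex("D0CF11E0A1B11AE1"), "ole2"),  # legacy Word/Excel/PowerPoint
    (b"PK\x03\x04", "zip"),                       # zip, and newer Office files
    (b"%PDF", "pdf"),
    (b"MZ", "exe"),                               # Windows executable
]

# What each extension is expected to contain (illustrative subset only).
EXPECTED_KIND = {
    ".doc": "ole2", ".xls": "ole2", ".ppt": "ole2",
    ".docx": "zip", ".xlsx": "zip", ".pptx": "zip", ".zip": "zip",
    ".pdf": "pdf",
    ".exe": "exe",
}


def sniff(header: bytes) -> Optional[str]:
    """Identify a file from its first bytes, or return None if unknown."""
    for magic, kind in SIGNATURES:
        if header.startswith(magic):
            return kind
    return None


def extension_mismatch(filename: str, header: bytes) -> bool:
    """True when both the name and the content are recognised but disagree."""
    extension = os.path.splitext(filename)[1].lower()
    expected = EXPECTED_KIND.get(extension)
    detected = sniff(header)
    return expected is not None and detected is not None and expected != detected


# A legacy Word document that has been renamed to look like a program.
word_header = bytes.fromhex("D0CF11E0A1B11AE1") + b"\x00" * 8
print(extension_mismatch("minutes.exe", word_header))
print(extension_mismatch("minutes.doc", word_header))
```

The sketch is deliberately conservative: it flags a file only when it recognises both the extension and the signature, so unknown file types are left alone rather than reported as suspicious.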

These last 2 posts have dealt with processing paper and electronic documents. In my next post we will look at what we do with these documents to assist the lawyers in reviewing them.