Digitising Paper in Litigation Discovery

A few years ago, I was speaking at a Litigation Technology Conference in Phoenix, Arizona, and I was asked the question, “Will paper ever go away?” My reply was that it would... when Mars becomes a holiday destination!

When I worked in law firms dealing with litigation, we would search for paper documents in filing cabinets, storage warehouses etc. and disclose relevant documents to the other side in the form of copies. However, when I entered the litigation support industry I realised that scanning and coding was a better way of dealing with high volumes of hard copy.

First, let me explain that way back in 2000 in the UK, paper was everything. Emails etc. were only just being considered as “documents” for discovery purposes, and even then they were printed. We would advise lawyers about searching for relevant paper documents in order to prepare and understand their client’s case. The problem arose when the volumes were so high that, realistically, a team was needed to review the documents for relevance. For example, if you have 20 boxes of paper and 4 people in the team, you may think it is simple maths to give them 5 boxes each. Fine, but how does one person know whether any of the documents in his boxes relate to those in another box being reviewed by a colleague? Furthermore, the only way of dealing with the boxes was to copy them several times, once for each reviewer, and this was costly, quite apart from the issues of space!

So, the answer was, and is, to scan the documents and enable them to be reviewed using technology, so that several reviewers can work at the same time with only one reproduction of the documents. Sounds simple? Certainly it is logical and cost effective in most cases, but way back then it still required a great deal of education for lawyers and their clients. By nature lawyers like paper, and the fear and mistrust of technology rears its head. People like myself had to work very hard to assure the legal world that this was the way forward, and the only realistic method of meeting deadlines and satisfying their clients with regard to costs.

This is what I see now in South Africa – a market, just like the UK some years ago, with a requirement for education, information and trust.

So, let us look at how we can digitise paper properly in litigation or investigation cases. Firstly, you must choose a proper, experienced provider, because scanning for these purposes is unlike commercial scanning. We want scanning software that is built specifically for the purpose, and this incorporates: strict elements of QC (quality control); the ability to recognise and store boundaries (where a document starts and ends); and the ability to record bindings (whether a document is stapled, has a paper clip etc.). I am a huge fan of the Ipro suite of litigation software, produced by a Phoenix-based company that I have known and worked with for over 15 years. We also want people who are trained and have the necessary skills, not just with the software and the scanners, but who also have knowledge of the litigation process and are familiar with documents. Handling original documents belonging to a client corporation is not something to be taken lightly - documents should be handled with care and respect. Once the scanning process is completed, the documents MUST be returned in exactly the same order and with the same bindings as they were found. We call this re-assembly or reconstruction. One of the benefits of the Ipro software is that it captures all of this information, so even if the documents become separated or moved, reassembly can be completed from the stored information. Traditionally, the pages are scanned to TIFF (Tagged Image File Format) but they can just as easily be scanned to PDF (Portable Document Format).

I mentioned QC and this is absolutely crucial. Checks should be made to ensure: the quality of the scanned images; that no pages have been missed; that there are no blanks (unless they are deliberate); and that there are no duplicated scans. The software and many scanners also have the ability to de-skew (straighten crooked images), de-speckle (remove unwanted marks on the paper to assist reviewing) and detect misfeeds. QC in Ipro, for example, has two stages: the first is performed by operatives, and the second is a statistical QC, usually performed by a senior person, who tends to look for any images which may have been deleted, inadvertently or otherwise.
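
To make the checks above concrete, here is a minimal sketch in Python of how missing pages, probable blanks and duplicated scans might be flagged automatically. It is an illustration only and is not how Ipro or any other product works internally. The ScannedPage record, the sequence counter, the ink_ratio figure and the blank threshold are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ScannedPage:
    sequence: int        # hypothetical page counter assigned at scan time
    image_bytes: bytes   # raw bytes of the scanned image
    ink_ratio: float     # hypothetical precomputed fraction of dark pixels (0.0 to 1.0)


def qc_report(pages, expected_count, blank_threshold=0.005):
    """Flag missing pages, probable blanks and byte-identical duplicate scans."""
    # Missing pages: sequence numbers we expected but never received.
    seen_sequences = {p.sequence for p in pages}
    missing = [n for n in range(1, expected_count + 1) if n not in seen_sequences]

    # Probable blanks: almost no ink on the page (threshold is an assumption).
    blanks = [p.sequence for p in pages if p.ink_ratio < blank_threshold]

    # Duplicates: two pages whose image bytes hash to the same value.
    first_seen = {}
    duplicates = []
    for p in pages:
        digest = hashlib.sha256(p.image_bytes).hexdigest()
        if digest in first_seen:
            duplicates.append((first_seen[digest], p.sequence))
        else:
            first_seen[digest] = p.sequence

    return {"missing": missing, "blanks": blanks, "duplicates": duplicates}


pages = [
    ScannedPage(1, b"letter page one", 0.08),
    ScannedPage(2, b"", 0.001),                 # probable blank
    ScannedPage(4, b"letter page one", 0.08),   # same bytes as page 1
]
report = qc_report(pages, expected_count=4)
print(report)
```

A report like this only tells the operative where to look. A flagged blank may be deliberate, and the same sheet scanned twice will rarely produce byte-identical images, so a person still has to make the final decision.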

I have gone into some detail here to show how seriously I take this process and I will only work with providers who follow these procedures.

So, we are at the stage where we have scans of the original pages. What do we need to do next? Firstly, we need to OCR the images. OCR (optical character recognition) is the electronic conversion of images of typewritten or printed text into machine-encoded text. This is necessary as it enables the text to be searched. No OCR software is 100% accurate, as the result is largely governed by the quality of the original printed text. Poor quality text (e.g. faded or handwritten) will therefore not produce good or reliable OCR. For that reason I want a provider who has excellent OCR software such as Ipro, Abbyy (www.abbyy.com) or LexisNexis LAW. These companies produce OCR-capable software that is as good as it can get!

Now I want to talk about the next stage, which is called unitisation – in my view the most important aspect of the entire exercise. This process determines the beginning and end of each document logically, rather than merely relying upon physical boundaries such as staples etc. If you think about it, a single staple could actually hold several separate documents, and unless these are separated logically, the individual documents will never be isolated and found. Once again good quality software helps, but essentially this process is done by humans, albeit electronically.
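
As a rough illustration of the idea, the Python sketch below takes one stapled bundle of pages and splits it into logical documents. It assumes that a human reviewer has already marked, for each page, whether a new document starts there. The Page record, the starts_document flag and the image identifiers are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Page:
    image_id: str          # hypothetical identifier of the scanned image
    starts_document: bool  # reviewer's decision: does a new logical document begin here?


def unitise(pages):
    """Group an ordered run of pages into logical documents."""
    documents = []
    for page in pages:
        # Open a new document at every marked start (and for the very first page).
        if page.starts_document or not documents:
            documents.append([])
        documents[-1].append(page.image_id)
    return documents


# One physical bundle (a single staple) holding three logical documents.
stapled_bundle = [
    Page("IMG0001", True),   # covering letter, page 1
    Page("IMG0002", False),  # covering letter, page 2
    Page("IMG0003", True),   # attached invoice
    Page("IMG0004", True),   # attached memo, page 1
    Page("IMG0005", False),  # attached memo, page 2
]
documents = unitise(stapled_bundle)
print(documents)
# [['IMG0001', 'IMG0002'], ['IMG0003'], ['IMG0004', 'IMG0005']]
```

The grouping itself is trivial. The skill lies in deciding where each document starts, which is why the flags here come from a person rather than from the staple.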

Once we have scanned and unitised images, the documents then need to be coded. This process extracts essential information from each document and populates it into fields: author; recipient; document date; document type; document subject; as well as other relevant data. With electronic documents these fields are called metadata, which I will deal with in another post. We need this information so that reviewers can find individual documents by searching, for example, for all letters written by A to B dated between 1 Jan 2010 and 1 Jan 2014 with a subject of “Bank dispute”. The coded fields will ensure that only those documents matching the search criteria are returned from the corpus of documents. Searching, and the type of software that enables it, is yet another topic which will also be covered in a future post.
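
The example search above can be sketched in a few lines of Python. This is only an illustration of why coded fields matter and is not the way any particular review platform stores or queries its data. The CodedDocument record, the field names and the sample documents are all made up for the purpose.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CodedDocument:
    doc_id: str      # hypothetical document identifier
    author: str
    recipient: str
    doc_date: date
    doc_type: str
    subject: str


def search(corpus, author, recipient, doc_type, date_from, date_to, subject):
    """Return only the documents whose coded fields match every criterion."""
    return [
        d for d in corpus
        if d.author == author
        and d.recipient == recipient
        and d.doc_type == doc_type
        and date_from <= d.doc_date <= date_to
        and subject.lower() in d.subject.lower()
    ]


corpus = [
    CodedDocument("DOC001", "A", "B", date(2012, 3, 5), "Letter", "Bank dispute"),
    CodedDocument("DOC002", "A", "B", date(2009, 6, 1), "Letter", "Bank dispute"),  # too early
    CodedDocument("DOC003", "B", "A", date(2012, 4, 2), "Letter", "Bank dispute"),  # written by B to A
    CodedDocument("DOC004", "A", "B", date(2013, 1, 9), "Memo", "Bank dispute"),    # not a letter
]

# All letters written by A to B between 1 Jan 2010 and 1 Jan 2014 about "Bank dispute".
hits = search(corpus, "A", "B", "Letter", date(2010, 1, 1), date(2014, 1, 1), "Bank dispute")
print([d.doc_id for d in hits])
# ['DOC001']
```

Without the coded fields, the only way to answer the same question would be for someone to read every document.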

I have gone into some detail here about the process of dealing with hard copy documents quite deliberately, as it is crucial - I've seen so many cases go wrong, or parties and their lawyers face criticism from Judges (and clients), because of defective processing of documents.

If I am involved in a matter incorporating hard copy, then I have documented procedures and methodologies which I insist are followed to ensure accuracy. The entire process must be defensible and carry an audit trail. Failures can result in criticism from the Court, Orders for costs or even the loss of a case. That is why I want and expect it to be done properly.