Deduplication Demystified

Blog Post
Posted on 1 July 2019

Litigation can attract significant costs for law firms, and 70–75 per cent of that cost is usually attributed to the review of discovery documents. Efficiently reducing the volume of documents that need to be reviewed from the outset can save significant time and money.

Within any set of data, there will usually be duplicate documents. One of the easiest ways to quickly reduce volume is to remove the duplicated documents.

How Does Deduplication Work?

Deduplication (or deduping) is a common process often used in eDiscovery to reduce the amount of data to be searched or reviewed.

Each electronic file or email is assigned an identifier known as an “MD5” hash value, which is calculated from the file’s raw data (bit by bit) and used for deduplication. Files whose raw data is identical receive the same MD5 value.
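As a rough illustration of how an MD5 value is derived from raw data, here is a minimal sketch using Python’s standard hashlib module. The function name md5_of_bytes and the sample byte strings are made up for this example; a real processing platform would read the bytes from the files themselves.

```python
import hashlib


def md5_of_bytes(data: bytes) -> str:
    """Return the MD5 value (as hexadecimal text) of a file's raw bytes."""
    return hashlib.md5(data).hexdigest()


# Hypothetical file contents, used only for illustration.
original = b"Quarterly report, final version."
exact_copy = b"Quarterly report, final version."
edited = b"Quarterly report, final version!"

# Identical bytes give the same MD5; a one-character change gives a different one.
print(md5_of_bytes(original) == md5_of_bytes(exact_copy))  # True
print(md5_of_bytes(original) == md5_of_bytes(edited))      # False
```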

Theoretically, when there are multiple identical documents that have been allocated the same MD5 value in the data set, only the first loaded document will be available for search or review. The other identical documents processed after will be “deduped” out. But hold on, that’s not quite true.
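The theoretical “first loaded wins” behaviour, before any of the options discussed below are applied, might be sketched like this. The deduplicate function, the file names and the contents are hypothetical and are not taken from any particular eDiscovery platform.

```python
import hashlib


def deduplicate(documents):
    """Keep the first document loaded for each MD5 value.

    `documents` is a list of (name, raw_bytes) pairs in load order.
    Returns two lists of names: (kept, removed).
    """
    seen = set()
    kept, removed = [], []
    for name, data in documents:
        digest = hashlib.md5(data).hexdigest()
        if digest in seen:
            removed.append(name)  # same MD5 already loaded: "deduped" out
        else:
            seen.add(digest)
            kept.append(name)
    return kept, removed


# Hypothetical load order, for illustration only.
load_order = [
    ("contract_v1.docx", b"contract text"),
    ("copy_of_contract.docx", b"contract text"),
    ("invoice.pdf", b"invoice text"),
]
kept, removed = deduplicate(load_order)
print(kept)     # ['contract_v1.docx', 'invoice.pdf']
print(removed)  # ['copy_of_contract.docx']
```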

Why am I Seeing Duplicated Documents After Applying Deduplication?

As every discovery has different requirements, there are options and flexibility built into the deduplication process.

Global Deduplication vs. Custodian Deduplication

In global deduplication, each document is compared against every other document in the data set. In custodian deduplication, each document is compared only against the other documents belonging to the same ‘custodian’, so each custodian retains their own copy of a document that several custodians hold.
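The difference between the two scopes can be shown with a small sketch. It assumes each document is tagged with a custodian name; the per_custodian flag, the custodians and the file names are invented for this example.

```python
import hashlib


def deduplicate(documents, per_custodian=False):
    """Return the (custodian, name) pairs that survive deduplication.

    `documents` is a list of (custodian, name, raw_bytes) in load order.
    Global scope compares MD5 values across the whole data set;
    custodian scope only compares documents held by the same custodian.
    """
    seen = set()
    kept = []
    for custodian, name, data in documents:
        digest = hashlib.md5(data).hexdigest()
        key = (custodian, digest) if per_custodian else digest
        if key not in seen:
            seen.add(key)
            kept.append((custodian, name))
    return kept


# Hypothetical data set: the same memo held by two custodians.
docs = [
    ("Alice", "memo.docx", b"memo text"),
    ("Alice", "memo_copy.docx", b"memo text"),
    ("Bob", "memo.docx", b"memo text"),
]
print(deduplicate(docs))
# [('Alice', 'memo.docx')]
print(deduplicate(docs, per_custodian=True))
# [('Alice', 'memo.docx'), ('Bob', 'memo.docx')]
```

With custodian scope, Bob’s copy survives even though Alice already holds an identical file, which is one reason duplicates can still appear in review.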

Parent Document vs. Child Document

For emails, the email itself is the parent document and its attachments are the child documents. To keep the full family relationship intact, the deduplication process does not compare the MD5 identifiers of attachments or ‘child’ documents. An email attachment is therefore not “deduped” out, even if a file with the same MD5 identifier has already been loaded.
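One way to picture family-level deduplication is the sketch below. It assumes that only top-level (parent) documents are compared by MD5 and that attachments simply follow their parent. How a platform treats the attachments of a removed parent is not covered above, so dropping them with the parent is an assumption made here. The Document class and the sample names are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    name: str
    data: bytes
    attachments: List["Document"] = field(default_factory=list)  # child documents


def deduplicate_families(top_level_docs):
    """Compare parent documents only; children are never compared."""
    seen = set()
    kept = []
    for doc in top_level_docs:
        digest = hashlib.md5(doc.data).hexdigest()
        if digest in seen:
            continue  # duplicate parent: assumed to drop the whole family
        seen.add(digest)
        kept.append(doc.name)
        # Attachments stay with their parent without any MD5 comparison.
        kept.extend(child.name for child in doc.attachments)
    return kept


# A loose file is loaded first, then an email carrying the same file.
loose_file = Document("report.pdf", b"report bytes")
email = Document(
    "email.msg",
    b"email bytes",
    [Document("report.pdf (attachment)", b"report bytes")],
)
print(deduplicate_families([loose_file, email]))
# ['report.pdf', 'email.msg', 'report.pdf (attachment)']
```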

Why Do Computers See Differences in Files that Look Alike?

In theory, all the parent documents with the same MD5 identifiers should be “deduped” out in the process described above. However, a document will not be identified as a duplicate if its content looks identical but its MD5 identifier is different.

The most common reason for this is a difference in document format. Each file format (e.g. PDF, DOC, DOCX) may contain metadata unique to its application, so the same content saved in different formats produces files with different MD5 identifiers.

Another example of look-alike files involves images. The computer treats a text-searchable document (i.e. you can select the text, then copy and paste it) and a non-searchable document (i.e. the text is an image within the document) as different, even when the two documents are in the same file format (e.g. PDF) and their contents look identical.
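To see why content that reads the same can still produce different MD5 identifiers, consider this small sketch. Real PDF or DOCX files are far more complex, so two text encodings are used here purely as a stand-in for “same words, different bytes on disk”.

```python
import hashlib

# The same sentence, stored as two different byte sequences.
# The encodings are a simplified stand-in for different file formats.
text = "Meeting moved to 3pm."
stored_as_utf8 = text.encode("utf-8")
stored_as_utf16 = text.encode("utf-16")

# A reader sees identical content in both...
print(stored_as_utf8.decode("utf-8") == stored_as_utf16.decode("utf-16"))  # True

# ...but the raw bytes differ, so the MD5 values differ too.
md5_utf8 = hashlib.md5(stored_as_utf8).hexdigest()
md5_utf16 = hashlib.md5(stored_as_utf16).hexdigest()
print(md5_utf8 == md5_utf16)  # False
```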

Near-Deduplication[1]

Near-Deduplication, also called textual-deduplication, is a method of grouping together “nearly identical” documents based on their content (i.e. the extracted text from each document).

Before applying near-deduplication, all non-searchable documents (images and documents without text) need to be sent for a process of “Imaging”.

Near-deduplication reports the percentage of similarity between a document and other nearly identical documents by comparing their extracted text (document content or email body). It is typically used to reduce review costs and to ensure consistent coding during review.
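A similarity percentage between two pieces of extracted text can be illustrated with Python’s standard difflib module. This is only a sketch: review platforms use their own near-duplicate algorithms, which are not described here, and the similarity_percent function and sample sentences are invented for the example.

```python
from difflib import SequenceMatcher


def similarity_percent(text_a: str, text_b: str) -> float:
    """Compare two extracted texts word by word and return a percentage."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    ratio = SequenceMatcher(None, words_a, words_b).ratio()
    return round(ratio * 100, 1)


# Hypothetical email bodies that differ by a single word.
draft = "Please review the attached contract before Friday"
final = "Please review the attached contract before Monday"

print(similarity_percent(draft, draft))  # 100.0
print(similarity_percent(draft, final))  # 85.7
```

Documents scoring above a chosen threshold could then be grouped so that the same reviewer sees them together and codes them consistently.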

Custom-Deduplication

As mentioned earlier, the MD5 is an identifier produced from the document in its raw data format, and two documents are identical if their MD5 values match. But what if we know two files are different at the raw data level, yet identical in nature?

Original emails vs. Archived emails

Most email archiving systems archive a complete version of an email (with attachments) and keep a trimmed version of the email (with the attachments removed) in the mailbox. This is typically done to reduce the storage size in the email system.

Custom-deduplication uses selected metadata from the email. In the above case, most of the metadata from the two emails should be identical. A custom MD5 identifier can be created from the metadata “Subject”, “From”, “To” and “Date Sent” to address this situation.
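A custom identifier of this kind could be built along the following lines. The field names, the separator character and the normalisation step (trimming spaces and ignoring letter case) are assumptions made for this sketch, not a description of any specific tool.

```python
import hashlib

# Metadata fields used for the custom identifier (names are hypothetical).
FIELDS = ("subject", "from", "to", "date_sent")


def custom_md5(email: dict) -> str:
    """Build an MD5 value from selected email metadata only."""
    # Assumed normalisation: trim whitespace and ignore letter case.
    parts = [str(email.get(name, "")).strip().lower() for name in FIELDS]
    # Join with a separator that is unlikely to appear in the values.
    joined = "\x1f".join(parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Hypothetical archived email (complete) and mailbox stub (attachments removed).
archived = {
    "subject": "Signed agreement",
    "from": "alice@example.com",
    "to": "bob@example.com",
    "date_sent": "2019-06-28T09:15:00",
    "attachments": ["agreement.pdf"],
}
mailbox_stub = {
    "subject": "Signed agreement",
    "from": "alice@example.com",
    "to": "bob@example.com",
    "date_sent": "2019-06-28T09:15:00",
    "attachments": [],
}

# The raw files differ, but the custom identifiers match.
print(custom_md5(archived) == custom_md5(mailbox_stub))  # True
```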

Conclusion

Nowadays, almost every eDiscovery matter requires a different level of deduplication to reduce the volume of documents and, subsequently, the overall review cost. Deduplication also ensures coding consistency during review. However, deduplication can be a double-edged sword: it may have unintended consequences if not planned thoroughly. Consult your eDiscovery expert at Law In Order to understand which approach is suitable for your matter.

[1] Reference: https://www.edrm.net/glossary/near-duplicate-detection/