Legal professionals new to eDiscovery and data analytics often confuse the various types of eDiscovery analytics and when to use them. I think this comes down to understanding two primary concepts – structured vs. conceptual analytics.
The most commonly used structured analytics tools in eDiscovery are Email Threading, Near-Duplicate Identification/Grouping, Repeated Content Filtering and Language Detection. These four tools are pervasive in EDD processing and review workflows and, in my opinion, should always be used.
Email Threading
Email Threading is used, you guessed it, when you have email collections. Threading can be used in two ways – as either a visual grouping tool or as a suppression tool. As a visual tool, pretty much every processing tool on the market can identify email conversations as members of a thread group. And in review, you can group these thread families together for efficient review and visually display the relationship, usually in an indented and color-coded fashion or in a group batch. This allows a single reviewer to review all emails in the context of the conversation and make fast, consistent decisions on each individual email as it relates to the conversation.
The second method for using Threading is to suppress for review (and possibly production) all members of a thread family, except for the most inclusive email – the one that includes all other emails in the conversation. This can drastically cut down your review population (sometimes up to 40 or 50% depending on your collection) and therefore your review cost. This process entails making a decision on the conversation, rather than on the individual emails, within the thread family.
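To make the suppression idea concrete, here is a minimal Python sketch. It assumes a hypothetical list of email dicts, groups messages into threads by normalized subject, and approximates "most inclusive" as the longest body (real threading engines compare quoted content segments, not length, so treat this purely as an illustration):

```python
from collections import defaultdict

def most_inclusive(emails):
    """Group emails into threads by normalized subject and keep only one
    representative per thread. 'Most inclusive' is approximated here as
    the longest body, since a reply usually quotes everything before it."""
    threads = defaultdict(list)
    for email in emails:
        # Strip reply/forward prefixes to recover the base subject.
        subject = email["subject"].lower()
        while subject[:4] in ("re: ", "fw: ") or subject[:5] == "fwd: ":
            subject = subject.split(": ", 1)[1]
        threads[subject].append(email)
    # One representative per thread; the rest are suppressed from review.
    return [max(msgs, key=lambda e: len(e["body"])) for msgs in threads.values()]
```

Everything not returned by this function is what a real workflow would suppress from the review (and possibly production) population.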
Near-Duplicate Identification
In most systems, Near-Duplicate Identification groups together documents that are 90% or more alike based on the text contained in the document. Using extracted or OCR text, processing systems can identify very textually similar documents and display/group them in review as related sets of documents in a batch or check-out group.
Near-Duplicate ID is best suited for e-docs, as Email Threading will already get you to a similar grouping with email collections. In e-docs, I’ve found the best use is with document drafts, whether they’re contracts, agreements, sales collateral, or you name it. Any time you have versioning of documents during the authoring phase, Near-Duplicate ID and grouping can drastically improve review speeds and consistency of coding simply by grouping for review all versions of the document into a single batch.
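The 90%-similarity idea can be sketched with Python's standard difflib. This is only an illustration of the concept; commercial processing engines use their own (usually proprietary) similarity scoring, not this ratio:

```python
import difflib

def near_duplicates(docs, threshold=0.90):
    """Pair up documents whose extracted text is at least `threshold`
    similar, using difflib's ratio as a stand-in for the similarity
    engines inside processing tools."""
    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            score = difflib.SequenceMatcher(None, docs[i], docs[j]).ratio()
            if score >= threshold:
                pairs.append((i, j, round(score, 2)))
    return pairs
```

A contract draft and its redline would land in the same pair list here, while unrelated documents would not.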
Repeated Content Filtering
Searching, concept searching and predictive analytics all depend on the quality of the text in the various indexes created for a matter. Ensuring the indexes are as clean as possible is a step I always recommend, especially when using conceptual analytics on a project.
Repeated Content Filtering is a suppression technique that removes disclaimers, footers, headers and boilerplate legal/sales/marketing text from the conceptual index so they do not create noise within it. Noise is extra concepts (words) pulled into the index from these repeated text blocks, artificially inflating the weight of certain terms.
For example, if you have an email disclaimer attached to your email signature that contains language about emails being ‘privileged and confidential,’ those two terms will create noise in your index simply because of the number of times they appear at the bottom of every email. This will skew your concept clusters. Removing those phrases from your concept index will clean up the index and create better clustering, categorization and predictions in your documents.
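One simple way to approximate this filtering, assuming we have the extracted text of each document as a string, is to drop any line that repeats across most of the collection before the text reaches the conceptual index. Real filtering engines work on repeated text segments rather than whole lines, so this is a simplified sketch:

```python
from collections import Counter

def strip_repeated_content(texts, min_fraction=0.8):
    """Drop any line that appears in at least `min_fraction` of the
    documents (disclaimers, signatures, footers) before the text is
    handed to the conceptual index. The 0.8 cutoff is an illustrative
    assumption, not a vendor default."""
    line_counts = Counter()
    for text in texts:
        line_counts.update(set(text.splitlines()))
    cutoff = min_fraction * len(texts)
    repeated = {line for line, n in line_counts.items()
                if n >= cutoff and line.strip()}
    return ["\n".join(l for l in t.splitlines() if l not in repeated)
            for t in texts]
```

After this pass, the ‘privileged and confidential’ boilerplate no longer contributes false weight to the index.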
Language Detection
The final frequently used Structured Analytics tool is detection of languages within the population. Most processing tools or review platforms have some version of this tool available for the processing of foreign language text and grouping/segregation of foreign languages from English language documents for review. Many systems can do a primary language identification to detect the language that occurs most in the document while others have primary, secondary or all language detection to further assist in the grouping/batching of documents.
This detection is important both for processing/indexing data (Unicode, LATAM, EMEA, APAC languages) and for review strategy and planning. This tool allows for decisions on review resources, machine translation and overall effort/cost strategy planning.
All of the above tools are text- or field-based analytics and are designed to optimize sorting and grouping of data. Other examples of Structured Analytics include communication mapping/social networking visuals, timelines and geo-tag maps.
Another type of analytics is based on the meanings of words and documents, using the content and context, rather than the explicit text. These are called Conceptual Analytics, and I think some combination of them should also always be used in EDD.
The most commonly used conceptual analytics tools are Concept Clustering, Find Similar, Categorization, Key Word Expansion/Concept Searching and Predictive Coding/Prioritized Review. These tools help us better understand the data in ways similar to how we would normally think about it – not limited by the exact text or the fielded data in the documents themselves.
Concept Clustering/Find Similar
Concepts are simply words in documents. In the context of conceptual analytics, these words become topics or ideas in the document and are pulled out as Concepts. These Concepts are then weighted against every other Concept in the document based on things like location (a Re: line word is more heavily weighted than a word in paragraph 112) and frequency (a word that shows up 600 times in a document will be more heavily weighted than a word that shows up twice).
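The frequency-plus-location weighting can be sketched in a few lines. The subject-line boost factor here is an invented illustration, not any vendor's actual weighting scheme:

```python
from collections import Counter

def concept_weights(subject, body, subject_boost=3.0):
    """Weight each term by frequency, boosting terms that appear in the
    subject (Re:) line. The boost factor is an illustrative assumption."""
    weights = Counter()
    for word in body.lower().split():
        weights[word] += 1.0          # frequency component
    for word in subject.lower().split():
        weights[word] += subject_boost  # location component
    return dict(weights)
```

A term appearing in both the Re: line and the body ends up with a higher weight than either signal alone, which is the intuition behind location-sensitive Concept extraction.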
Concept Clustering then groups together documents with similar, and similarly weighted, Concepts into groupings of documents for review. These groupings are designed to better present documents to Reviewers (any reader really) so that we can review likely similar documents, or at least documents with similar themes or ideas, together. This is unsupervised machine learning at its finest. With no human interaction at all, this tool can show me groups or themes of documents. Concept Clustering is a great tool to use with a received production or when you are starting on strategy planning or review management planning to get a quick idea of the types/themes of documents you have in your collection.
Find Similar also uses the Concepts to show similarly themed documents on demand. At any time when you are looking at a singular document of interest, you can use Find Similar to create a Concept Cluster spontaneously. This can be a great investigative tool.
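Under the hood, Find Similar is usually some form of vector similarity over the Concept weights. A minimal sketch using cosine similarity over simple bag-of-words vectors (real engines operate on the weighted Concept space, not raw word counts):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words term vectors."""
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def find_similar(query_doc, corpus, top_n=3):
    """Rank corpus documents by closeness to the document on screen."""
    q = Counter(query_doc.lower().split())
    scored = [(cosine(q, Counter(d.lower().split())), d) for d in corpus]
    return [d for s, d in sorted(scored, reverse=True)[:top_n] if s > 0]
```

Running this from a single hot document is the spontaneous, on-demand clustering the tool provides during investigation.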
Keyword Expansion/Concept Searching
Two other related conceptual analytics tools commonly used are for searching purposes. Boolean searching is limited by our knowledge of the case – we try to come up with terms that we think will find the documents we are seeking. The inherent flaw is that we don’t all think, speak, or write exactly the same way. I may say ‘stock’ when you say ‘investment’. I may say ‘quick’ when you say ‘fast’. Boolean searching will not find ‘fast’ when I search for ‘quick’. Conceptual Search and Keyword Expansion will.
These tools are designed to help us learn the way people communicated in the documents that we are searching. I know generally what I’m looking for, but I don’t know exactly how it was talked about in these documents. Boolean search penalizes me for this because of its limited abilities. These conceptual analytics tools do not.
Categorization/Prioritized Review/Predictive Coding
While the above tools are prime examples of unsupervised machine learning, other tools are considered supervised machine learning. In these workflows, human training/interaction is necessary to the process. Document or text examples must be submitted to the system in order for the system to return results or next steps.
Categorization, which can also be very helpful in received productions, requires documents to be submitted as examples in various categories (like issues). Then the system can go ‘Find Similar’ and group or tag found documents the same as the examples. This takes Concept Clustering to the next level by having informed document decisions assist with the groupings, and unlike Concept Clustering, it allows the user to define the grouping categories.
Prioritized Review also requires human training. My goal with Prioritized Review is to have two simultaneous review tracks going to group documents into Likely Responsive and Likely Not Responsive buckets. In order to do this, I first have to give the system examples of Responsive and Not Responsive (trash or business blast) docs. Once I do this, the system can then use the Concept groupings, as well as Near-Duplicate ID and Threading, to suggest either a Not Responsive or Responsive tag for review assignment. Using this workflow, I can send the Likely Responsive documents to attorneys and the Likely Not Responsive (low risk, low value) documents to law students or paralegals for half the cost.
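The train-then-suggest loop can be illustrated with a toy nearest-centroid classifier. The labels and example texts are hypothetical, and real predictive-coding engines use far richer models, but the supervised shape (human-coded examples in, suggested tags out) is the same:

```python
from collections import Counter

def train_centroids(examples):
    """Build one aggregate term-frequency vector per label from
    reviewer-coded example texts, e.g. {'likely_responsive': [...],
    'likely_not_responsive': [...]}."""
    centroids = {}
    for label, texts in examples.items():
        total = Counter()
        for text in texts:
            total.update(text.lower().split())
        centroids[label] = total
    return centroids

def suggest_tag(doc, centroids):
    """Suggest the label whose centroid shares the most term mass with
    the new document, a toy stand-in for the classifiers inside
    predictive-coding engines."""
    words = Counter(doc.lower().split())
    def overlap(c):
        return sum(min(words[w], c[w]) for w in words)
    return max(centroids, key=lambda label: overlap(centroids[label]))
```

In the two-track workflow, documents tagged likely responsive route to the attorney queue and the rest to the lower-cost queue.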
Predictive Coding is exactly the same as Prioritized Review above, except the workflow usually involves stopping review at certain thresholds and reporting on various statistics, like precision and recall. The idea is once the system helps you identify all (well, most of) the Responsive documents, within certain statistical parameters, you can defensibly stop reviewing or sampling the Likely Responsive documents.
Both structured analytics and conceptual analytics have their place at various stages of the discovery process, and hopefully the above has provided more insight into when and how to use these tools.