“It’s the data.” Throughout my career I have found myself repeating those words many times to explain everything from bad database queries to poor analytics results. So when I began using GenAI tools for eDiscovery and Litigation Management and got a few disappointing results in areas where I had high expectations, I should not have been surprised by the culprit. Yet there I was again, repeating it to myself like a 1990s political catchphrase: “It’s the data, stupid.”
What often makes or breaks eDiscovery and Litigation Management projects – whether you’re talking about keyword searches in 2005 or large language models in 2025 – isn’t the technological solution. It’s the data.
Cliché Alert: Garbage In, Garbage Out
We’ve all said it, and if you work with technology resources or database administrators then you’ve heard it many times: garbage in, garbage out. The Large Language Models (LLMs) behind GenAI are no exception to this principle. LLMs do not just process data; they learn patterns, generate context, and attempt to summarize meaning. If the input text is riddled with text extraction or OCR errors (“Wine transfer scheduled for Friday” instead of “Wire transfer scheduled for Friday”), you will not get the reliable summaries or insights you were hoping for.
It’s easy to focus on a shiny GenAI interface or an impactful demo, but underneath it all, quality data still rules the day. Models can’t compensate for missing context, broken threads, or poor metadata management.
Why Data Quality Matters More Than Ever
GenAI doesn’t reduce the importance of clean data. It multiplies it. In eDiscovery and Litigation Management, data quality impacts searchability, review efficiency, and critical case and data analysis. Bad text means missed hits, false positives, and cranky platform users. In GenAI-driven workflows, the stakes are even higher, because the model is not just surfacing search terms. It is rewriting, summarizing, and contextualizing evidence for human validation.
A few key reasons why quality data is critical for eDiscovery and Litigation Management:
- Full Text Completeness: Scanned images with poor OCR used to just frustrate reviewers. Now, they actively mislead GenAI engines. Bad data results in bad summaries. Period.
- Metadata Accuracy: Dates, custodians, file paths, and email threading still matter. Feed sloppy metadata into GenAI, and your timeline analysis looks like a Marvel multiverse.
- Normalization and Consistency: Standardizing formats, deduplication, and clean threading ensure that GenAI isn’t wasting cycles summarizing the same “Lunch at Chipotle” email 500 times. No one needs that much Chipotle, not even Chipotle.
- Context Preservation: Fragmented data sources (scattered chats, docs, and emails) need careful unification. Otherwise, the AI generates output based on incomplete information. Try putting a puzzle together without the picture on the box.
In short:
Good data + GenAI = Faster insight, accurate summaries, happier clients and attorneys.
Bad data + GenAI = Nonsense at scale, bigger risks, unhappy clients and attorneys.
GenAI is like an enthusiastic co-worker. It’ll produce something no matter what. Your job is to make sure it has the right material to work with.
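To make the normalization and deduplication point concrete, here is a minimal sketch of suppressing near-identical copies of a message before anything is sent to a model. It assumes each document is a simple dictionary with hypothetical `doc_id` and `text` keys, and it hashes normalized body text only. Production eDiscovery platforms typically deduplicate on file hashes or a combination of email fields, so treat this as an illustration of the idea rather than a recommended method.

```python
import hashlib
import re


def normalize_text(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting
    differences do not defeat deduplication."""
    return re.sub(r"\s+", " ", text).strip().lower()


def content_hash(text: str) -> str:
    """SHA-256 digest of the normalized text."""
    return hashlib.sha256(normalize_text(text).encode("utf-8")).hexdigest()


def deduplicate(documents):
    """Keep the first document seen for each content hash.

    Returns (unique_documents, duplicates), where duplicates maps the
    doc_id that was kept to the doc_ids that were suppressed.
    The 'doc_id' and 'text' keys are assumptions for this sketch.
    """
    seen = {}        # content hash -> doc_id of the copy we kept
    unique = []
    duplicates = {}
    for doc in documents:
        digest = content_hash(doc["text"])
        if digest in seen:
            duplicates.setdefault(seen[digest], []).append(doc["doc_id"])
        else:
            seen[digest] = doc["doc_id"]
            unique.append(doc)
    return unique, duplicates


docs = [
    {"doc_id": "A", "text": "Lunch at Chipotle?"},
    {"doc_id": "B", "text": "  lunch at   CHIPOTLE?\n"},
    {"doc_id": "C", "text": "Wire transfer scheduled for Friday"},
]
unique, duplicates = deduplicate(docs)
print([d["doc_id"] for d in unique])
print(duplicates)
```

Keeping a record of which copies were suppressed, rather than silently dropping them, preserves the ability to show where each duplicate came from if that is questioned later.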
Building Data Discipline
So, what can we do as legal services professionals to keep data from sabotaging our GenAI dreams?
- Invest Early in Data Hygiene & Management: Collection and ingestion aren’t just technical steps. They’re the foundation, so treat them as mission critical.
- Prioritize GenAI OCR and Text Extraction: If the text isn’t searchable, it isn’t usable. Spend the time and dollars to do it right.
- Audit Your Data Regularly: Spot check for redundant data, missing metadata, broken families, format inconsistencies, and other data anomalies. If you have thousands of database fields in a litigation management system, that is common. But if you have thousands of fields in an eDiscovery database, then you are doing something special, and not in a good way.
- Educate Stakeholders: Remind clients and counsel that GenAI isn’t a magic box. It’s a tool that amplifies the quality, or chaos, of the data they hand you.
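The audit step in the list above lends itself to a simple automated spot check. The sketch below is one possible shape for it, not a description of any particular platform. The field names (`doc_id`, `parent_id`, `custodian`, `date_sent`, `file_path`, `text`) and the 20-character threshold for "little or no text" are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical metadata fields every record is expected to carry.
REQUIRED_FIELDS = ("custodian", "date_sent", "file_path")


def audit_records(records, min_text_chars=20):
    """Flag records with missing metadata, broken families, or thin text.

    Returns (summary, issues): a Counter of issue types and a dict
    mapping doc_id to the list of issues found on that record.
    """
    known_ids = {r["doc_id"] for r in records}
    issues = {}
    for r in records:
        found = []
        # Missing or empty metadata fields.
        for field in REQUIRED_FIELDS:
            if not r.get(field):
                found.append(f"missing_{field}")
        # An attachment whose parent is not in the set is a broken family.
        parent = r.get("parent_id")
        if parent and parent not in known_ids:
            found.append("broken_family")
        # Very short extracted text often signals failed OCR or extraction.
        if len((r.get("text") or "").strip()) < min_text_chars:
            found.append("little_or_no_text")
        if found:
            issues[r["doc_id"]] = found
    summary = Counter(i for f in issues.values() for i in f)
    return summary, issues


records = [
    {"doc_id": "1", "parent_id": None, "custodian": "Smith",
     "date_sent": "2024-01-05", "file_path": "/mail/1.msg",
     "text": "Wire transfer scheduled for Friday."},
    {"doc_id": "2", "parent_id": "99", "custodian": "",
     "date_sent": "2024-01-05", "file_path": "/mail/2.pdf", "text": ""},
]
summary, issues = audit_records(records)
print(summary)
print(issues)
```

A check like this will not catch a plausible-looking OCR error such as "Wine" for "Wire", so it complements human spot checks without replacing them.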
Shiny AI features are fun, but they don’t fix messy inputs. If you want GenAI to deliver real value, start with disciplined, organized, high-quality data. That’s the foundation. Ignore this, and you’re just automating chaos. Embrace it, and you’re unlocking the real promise of AI-powered solutions. And that’s anything but stupid.
Dave York
Author
Dave oversees TCDI’s Litigation Services team involved in projects and data relating to eDiscovery, litigation management, incident response, investigations and special data projects. Since his start in the industry in 1998, Dave has made the rounds working on the law firm, client, and now provider side of the industry, successfully supporting, executing and managing all phases of diverse legal and technical projects and solutions. During his career he has been an NC State Bar Certified Paralegal, earned a certification in Records Management, become a Certified eDiscovery Specialist (ACEDS), and completed Black Belt Lean Six Sigma training.