Most AI Tools Say They Verify Case Law. Here's a Six-Minute Test to Find Out If Yours Actually Does.

AI in Forensic Practice | May 10, 2026 | 8 min read

Dr. Aubree Harrington, Psy.D.

The number of sanctions cases involving fabricated AI citations keeps climbing.

Most of the early ones involved attorneys. A lawyer would let a chatbot draft a brief, file it without checking the citations, and end up explaining to a federal judge why three of the cases he cited had never been decided. Mata v. Avianca was the first big public one. By the time the courts started tracking the pattern, the count was over a thousand. The most recent tracker I looked at had more than 1,200 individual cases of fabricated citations in legal filings, and the number is still going up.

For a while this was an attorney problem. Forensic psychologists watched from the sidelines.

It isn't only an attorney problem anymore.

More of us are using AI tools to help with forensic reports. Some of us use them to flag methodological gaps. Some use them to surface case law. Some use them to prepare for cross-examination. Whatever the use case, the moment we start relying on an AI tool to bring case law into a report or a courtroom, we inherit the same fabricated-citation risk that has been embarrassing attorneys since 2023.

Most tools say they "verify" their citations.

That word is doing a lot of work. It is being stretched to cover a step most tools never actually perform, and that step is the one that matters most.

The hallucination that isn't a fake case

The publicly visible AI hallucinations are mostly fake cases. A fabricated case name, a fabricated citation, sometimes an entire fabricated procedural history. These show up in attorney sanctions cases because they are the easy ones to catch. Run the citation through any case law database. The case either exists or it doesn't. If it doesn't, the brief gets struck.

There is a second kind of AI hallucination in legal contexts that gets much less public attention because it is much harder to catch. It is not a fake case. It is a real case cited for a holding the case never reached.

This failure mode is built into how language models work. A language model is trained to produce text that sounds plausible. Plausible includes generating a confident-sounding summary of a holding for a real case the model has limited information about. The case is real. The reporter cite is real. The case name is canonical. Every existence check passes. The holding the model attributes to the case is invented.

When this happens, every shallow verification layer comes back green. CourtListener confirms the case exists. The reporter cite parses correctly. A second language model checking the work agrees the citation is plausible. Nothing fails. The packet ships.

Then opposing counsel pulls the actual opinion. The language the AI tool cited is absent. The cross-examination question that follows is the one no expert wants to answer.
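To make that gap concrete, here is a minimal Python sketch. It assumes the tool attributes specific language to the opinion. The opinion text and the attributed sentence are invented placeholders, not excerpts from any real case, and the helper names `normalize` and `language_appears_in` are hypothetical.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks don't defeat the match."""
    return re.sub(r"\s+", " ", text).strip().lower()

def language_appears_in(opinion_text: str, attributed_language: str) -> bool:
    """True only if the attributed language literally appears in the opinion."""
    return normalize(attributed_language) in normalize(opinion_text)

# Placeholder standing in for an opinion pulled from a reporter (invented text).
opinion_text = """The court holds that the evaluator's methodology must
be disclosed to the opposing party before trial."""

# What a tool might claim the case said (also invented for this sketch).
attributed = "An expert report is inadmissible unless every test is normed locally."

case_exists = True  # the existence check passes
holding_supported = language_appears_in(opinion_text, attributed)

print(case_exists, holding_supported)  # existence passes, the language check does not
```

A literal match like this only catches quoted language that is missing from the opinion. A paraphrased holding still needs a person to read the operative paragraphs.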

The word "verified" has five meanings, and only one of them is enough

When a vendor tells you their AI verifies case law citations, here is what they could be doing.

1. The model believes the citation is real. This is the weakest version. The same language model that generated the citation is the one being asked whether the citation is real. That's a closed loop. The model has no mechanism to know whether the case exists, so when you ask, the answer is almost always yes.

2. The citation matches a plausible format. The system checks whether the case name, year, court, and reporter all conform to what citations of that type usually look like. This is slightly stronger, but a fabricated case can pass it easily.

3. The citation has been cross-checked by another language model. A few platforms run citations through a second AI as a "verification" step. This is marginally better than the closed loop, but it has the same underlying problem. Language models do not know which cases exist. They know which citations are statistically likely. Two models agreeing on plausibility is not the same as one database confirming reality.

4. The citation has been checked against an authoritative case law database. This is what most vendors mean when they say "verified." It is necessary, but not sufficient. A tool can pass this check on every citation it produces and still ship the second class of hallucination. The case exists. The holding does not.

5. The citation has been verified against an authoritative case law database AND the holding the tool attributes to the case has been independently confirmed against the actual opinion.

The fifth standard is the only one that catches both classes of hallucination. Until you can answer "the case exists AND the holding the tool described matches what the case actually held," you are not verifying citations. You are verifying that cases exist.
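The difference between the second and fourth meanings can be sketched in a few lines of Python. This is an illustration under stated assumptions: the regular expression is a deliberately simplified citation pattern, `KNOWN_CASES` is a hard-coded stand-in for an authoritative database, and the second citation is invented for the example.

```python
import re

# Simplified pattern for cites shaped like "A v. B, 509 U.S. 579 (1993)".
# Real citation grammars are far more varied; this is illustration only.
CITE_PATTERN = re.compile(
    r"^.+ v\. .+, \d+ [A-Za-z0-9. ]+? \d+ \([^)]*\d{4}\)$"
)

def passes_format_check(citation: str) -> bool:
    """Meaning 2: does the string look like a citation?"""
    return CITE_PATTERN.match(citation) is not None

# Stand-in for an authoritative database (hypothetical). A real system
# would query an actual case law source instead of a hard-coded set.
KNOWN_CASES = {
    "Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579 (1993)",
}

def found_in_database(citation: str) -> bool:
    """Meaning 4: is the citation present in the reference set?"""
    return citation in KNOWN_CASES

real = "Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579 (1993)"
fake = "Hollis v. Ardmore Behavioral Health, 999 F.4th 1184 (10th Cir. 2024)"  # invented

for cite in (real, fake):
    print(passes_format_check(cite), found_in_database(cite))
```

Both citations pass the format check, and only the real one survives the lookup. Neither check says anything about what the real case held, which is why the fourth meaning still falls short of the fifth.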

How to test any AI tool's verification in six minutes

This is a practical exercise. It works on any AI tool that produces case law citations as part of its output.

Step one. Pick a real case you've cited before. Daubert, Frye, something in your jurisdiction.

Step two. Make up a plausible-sounding case. Real-sounding party name, a year in the last decade, citation format that matches your jurisdiction. Don't make it absurd. Make it the kind of case opposing counsel could plausibly believe is real.

Step three. Submit both. Ask the tool to confirm them.

Step four. Watch what happens with the fake one. A real verification system flags it as not found in its case law database. A weaker one confidently confirms it and may even generate a summary of what the case held, because the language model is happy to invent a holding for any plausible case name you give it.

Step five. Ask the vendor (or the tool itself) whether the verification step actually involves a lookup against a structured legal database, and what the source of that database is. The answer should be specific and identifiable. If the answer is vague (the phrase "our proprietary system" is the giveaway), or if the vendor can't or won't tell you, you have your answer.

Step six. Pick one of the citations the tool returned for a real case. Open the actual opinion in your jurisdiction's reporter or in a public legal database. Read the operative paragraphs. Compare them to what the tool said the case held.

If the tool's summary matches what the opinion actually says, the verification has both legs. If the tool's summary is plausible but the language isn't in the operative paragraphs, you are looking at the second class of hallucination. The case is real, the citation is real, the holding is invented.

The whole exercise takes about six minutes. Skipping the last step is how the harder hallucinations reach courtrooms.
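For anyone who wants to repeat steps one through four across several tools, the real-case and fake-case probe can be sketched as a small Python harness. Everything here is an assumption for illustration: `ConfirmFn` is a hypothetical interface (no actual product exposes one this tidy), the two toy "tools" are stand-ins, and the second citation is invented.

```python
from typing import Callable

# Hypothetical interface: a tool takes a citation string and returns True
# if it "confirms" the citation.
ConfirmFn = Callable[[str], bool]

def run_fake_case_test(confirm: ConfirmFn, real_cite: str, fake_cite: str) -> str:
    """Steps one through four: submit a real and an invented citation."""
    if not confirm(real_cite):
        return "inconclusive: tool did not confirm the real case"
    if confirm(fake_cite):
        return "fail: tool confirmed an invented case"
    return "pass: tool flagged the invented case"

REAL = "Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579 (1993)"
FAKE = "Hollis v. Ardmore Behavioral Health, 999 F.4th 1184 (10th Cir. 2024)"  # invented

def plausibility_tool(cite: str) -> bool:
    # Stands in for a model that "confirms" anything shaped like a citation.
    return " v. " in cite

def lookup_tool(cite: str) -> bool:
    # Stands in for a lookup against a database (here, a one-item set).
    return cite in {REAL}

print(run_fake_case_test(plausibility_tool, REAL, FAKE))
print(run_fake_case_test(lookup_tool, REAL, FAKE))
```

A "pass" here only covers case existence. Steps five and six, the database question and the read of the operative paragraphs, still have to be done by a person.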

Why this matters more for us than for attorneys

When an attorney is sanctioned for a fabricated citation, the damage is concentrated. The brief gets struck. There's an apology to the court, a fine, sometimes professional discipline, an embarrassing news cycle. Then things mostly move on.

When a forensic psychologist's report cites a real case for a fabricated holding, the damage radiates outward.

Cross-examination is structurally an attack on credibility. The cross-examining attorney isn't trying to find the one true error in the report. They're trying to surface enough small errors that the report's reliability becomes the issue, instead of the evaluee's mental state. One citation in which the case is real but the holding is invented is exactly the kind of finding that lets opposing counsel rewrite the entire narrative of the testimony.

The follow-up questions write themselves. If you cited this case for this proposition and the case does not say this, what else have you cited for things the cases do not say? Did you read the cases you relied on? What other citations should we look at?

Ten correct citations don't offset one in which the holding is fabricated, because the fabricated holding reframes everything else as suspect.

The honest tradeoffs of two-leg verification

There are tradeoffs, and pretending otherwise would be dishonest.

Two-leg verification is slower than one-leg verification. Looking up a citation against a structured database takes time. Independently confirming the holding against the actual operative paragraphs of the opinion takes more time. A tool that does both is going to feel less responsive in the moment.

Two-leg verification produces "we couldn't independently confirm this" results, sometimes on real cases. That feels like a failure, but a tool returning "this citation could not be independently confirmed against the actual opinion, please verify before quoting" is doing exactly what we want it to do. The competing tool that returns a confident summary of a holding nobody confirmed is the one that's failing. It just fails in a way that feels useful until cross-examination.

Two-leg verification covers fewer obscure jurisdictions. Some opinions exist in case law databases but their full text is sparse or paywalled, which limits how much independent confirmation any tool can do automatically. A real verification system tells you when this is the case so you can compensate. The competing kind doesn't, and the practitioner finds out only when the citation gets challenged.
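One way to picture these tradeoffs is as a result type with more than two states. The Python sketch below is a hypothetical illustration, not a description of any vendor's implementation. The names `Status` and `two_leg_status` are invented for the example, and the two inputs are assumed to come from checks run elsewhere.

```python
from enum import Enum
from typing import Optional

class Status(Enum):
    CONFIRMED = "confirmed"
    NOT_FOUND = "case not found"
    HOLDING_MISMATCH = "holding not supported by the opinion"
    INCONCLUSIVE = "could not be independently confirmed"

def two_leg_status(case_found: bool, holding_matches: Optional[bool]) -> Status:
    """Combine the existence leg and the holding leg.

    holding_matches is None when the opinion's full text was unavailable
    (sparse or paywalled), so the second leg could not be run.
    """
    if not case_found:
        return Status.NOT_FOUND
    if holding_matches is None:
        return Status.INCONCLUSIVE
    return Status.CONFIRMED if holding_matches else Status.HOLDING_MISMATCH

print(two_leg_status(True, True).value)   # both legs pass
print(two_leg_status(True, None).value)   # real case, full text unavailable
print(two_leg_status(True, False).value)  # real case, invented holding
```

The design choice worth noticing is that existence alone never produces `CONFIRMED`. A real case with no readable opinion text comes back as inconclusive instead of being rounded up to a confident answer.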

The version of AI assistance forensic psychologists should be willing to adopt is the one that admits when it doesn't know. The other kind ends up in a Rule 702 challenge.

What ForensicShield does on this

Briefly, because the buying criteria are the point of this post, not the product.

ForensicShield's citation verification covers both legs. The case has to exist in an authoritative database, and the holding the platform attributes to the case has to be independently confirmed against the actual opinion. When either leg comes back inconclusive, the platform tells you so, with a specific caveat in the analysis instead of a confident summary.

The architecture that runs the second leg is proprietary, and it is going to stay that way. What I will say publicly is that the standard a forensic psychologist should hold any AI tool to is the same standard we hold ourselves to: case existence, holding accuracy, honest "we couldn't independently confirm this" results when one of the legs comes back inconclusive, and a clear story about what the word "verified" means when you ask.

What to do tomorrow

Three things, in order of how much they protect you.

1. Run the six-minute test on every AI tool you are evaluating. Real case, fake case, ask which database covers existence, ask whether holdings are independently verified, then verify one yourself by reading the operative paragraphs of an opinion the tool returned. If the tool fails any of those steps, walk away.

2. If you have already used a tool that does not verify holdings, audit your recent reports. Pull every case law citation that came from an AI source. Read the actual opinion's operative paragraphs. Compare them to what the AI tool said the case held. If there is a mismatch, replace the citation before the next motion deadline.

3. Add holding verification to the questions you ask any vendor. Alongside HIPAA compliance, BAA availability, data retention, and database existence checks, the fifth question should be this one. How do you confirm that the holdings your tool describes match what the cited cases actually held, and what happens when the two do not match?

The answer to that question tells you whether the vendor has thought about both classes of hallucination, or only the easier one. The vendor that has only thought about the easier one is the vendor whose pipeline can ship a real case attached to a fabricated holding without anything in the verification stack catching it.

The fabricated-citation problem isn't going away. The harder version of it, where the case is real but the holding isn't, isn't going away either. Most discussion of AI hallucinations in legal contexts is still focused on the first kind. The second kind is the one we should be planning for.

It's worth running the six-minute test before uploading your next report.

Dr. Aubree Harrington is a licensed forensic psychologist (Psy.D.) and the founder of ForensicShield, a HIPAA-aligned QA platform for forensic reports. She works in hospital and jail-based competency restoration and conducts forensic evaluations across criminal and civil cases.

If you would like to see how ForensicShield handles citation verification on one of your own reports, your first analysis is free at forensicshield.net.
