A Member of the Law Professor Blogs Network

Can AI Help with Citations?

Like all law professors this time of year, I’m in the throes of grading and wishing there were a quick and easy way to reach the finish line. Year after year, my colleagues and I have lamented the fact that we have not yet discovered a way to automate citation grading. It seems like a task a computer program could handle easily: citations follow detailed, strict, and largely consistent formal rules. But, to date, nothing like that exists.

Yet we now have generative AI, which is capable of all kinds of neat tricks. One of my colleagues decided to try her hand this semester at creating a custom GPT to help assess citation formatting, but she was ultimately disappointed. While the custom GPT began on an upward trajectory, improving with each tweaking prompt she provided, its performance eventually declined: it inserted new errors after correcting old ones and then failed to consistently apply rules it had already “learned.”

It turns out that her experience was not unique.  In a new empirical study, Matthew Dahl, a JD/PhD student at Yale, tested whether large language models like GPT-4 can comply with the formal citation rules that structure legal writing. The findings are both fascinating and cautionary.

Dahl constructed a dataset of 866 citation tasks drawn from the Bluepages (i.e., practitioner-oriented) portion of The Bluebook and ran them through five major LLMs from OpenAI, Anthropic, Google, Meta, and DeepSeek. The results? In zero-shot testing, the models produced fully compliant citations only 69–74% of the time. And even when given the full text of the citation rules (90,000 tokens’ worth), accuracy rose to only 77%.

Citation is one of the few parts of legal practice where there is a right answer. Yet, even in this rules-based domain, LLMs underperform. They do best with common case law citations (likely because they’ve memorized them) but perform significantly worse on citations to statutes, regulations, and secondary sources.

The study also reveals that, when LLMs make citation errors, those errors can be significant. The LLMs occasionally confuse parties, attribute decisions to the wrong court, misstate subsequent history, and invent nonexistent citations. These are not mere typos; they are substantive errors with real professional consequences.

So what are the takeaways?

Don’t delegate citation compliance to AI—yet. While LLMs are promising as drafting assistants, they remain unreliable as cite-checkers. The risk of an AI-generated error making it into a filed brief is too high.

But AI may be a useful teaching tool. Instructors can show students flawed citations generated by LLMs and ask them to diagnose and correct the mistakes. That kind of exercise reinforces both citation knowledge and professional skepticism.

For those interested in delving deeper into the study, you can access the full paper here: Matthew Dahl, Bye-Bye, Bluebook? Automating Legal Procedure with Large Language Models.