
Coase-Sandor Working Paper Series in Law and Economics
Publication Date: 2025
Abstract
Can large language models (LLMs) replace human judges? By replicating a prior 2 × 2 factorial experiment conducted on 31 U.S. federal judges, we evaluate the legal reasoning of OpenAI’s GPT-4o. The experiment involves a simulated appeal in an international war crimes case with two manipulated variables: the degree to which the defendant is portrayed sympathetically and the consistency of the lower court’s decision with precedent. We find that GPT-4o is strongly affected by precedent but not by sympathy. In this respect, it resembles the students who served as subjects in the same experiment and is the opposite of the professional judges, who were influenced by sympathy. We try prompt-engineering techniques to spur the LLM to act more like human judges, but with little success. “Judge AI” is a formalist judge, not a human judge.
"I predict that human judges will be around for a while.” – Chief Justice John G. Roberts, Jr. (2025)
Number: 25-03
Recommended Citation
Posner, Eric A. and Saran, Shivam, "Judge AI: Assessing Large Language Models in Judicial Decision-Making" (2025). Coase-Sandor Working Paper Series in Law and Economics. 25-03.
https://chicagounbound.uchicago.edu/law_and_economics/1044