Large Language Models Falter at Sudoku and Transparent Reasoning, Study Shows

Key Points
- University of Colorado researchers tested large language models on Sudoku puzzles.
- Models struggled with both 6x6 and 9x9 grids, often using trial‑and‑error.
- Explanations provided by the models were frequently inaccurate or irrelevant.
- One model responded to a reasoning query with a weather forecast for Denver.
- Findings raise concerns for AI use in high‑stakes areas like driving and tax preparation.
- A Ziff Davis lawsuit against OpenAI over training data is noted in the study.
Researchers at the University of Colorado at Boulder tested popular large language models, including OpenAI's ChatGPT and its reasoning variants, on Sudoku puzzles and their ability to explain solutions. The models struggled with both 6x6 and 9x9 puzzles, often resorting to trial‑and‑error and producing inaccurate explanations. In some cases, the models gave unrelated answers, such as a weather forecast. The findings raise concerns about AI transparency, especially as the technology moves into high‑stakes domains like driving, tax preparation, and business decision‑making. The study also notes a pending Ziff Davis lawsuit against OpenAI over training data.
Background and Test Setup
Scientists from the University of Colorado at Boulder examined how large language models handle logic puzzles and self‑explanations. They focused on Sudoku, testing both the standard 9x9 grid and a simpler 6x6 version. The models evaluated included OpenAI's ChatGPT and its newer reasoning models such as o1‑preview and o4.
Performance on Sudoku Puzzles
The models frequently failed to solve the puzzles outright. When they did produce an answer, it often took multiple attempts that resembled trial‑and‑error guessing rather than systematic logical deduction. The models struggled with the 6x6 puzzles unless they had external tools, and the standard 9x9 grids proved harder still.
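For context, systematic deduction in Sudoku means eliminating candidates using the row, column, and box constraints rather than guessing and checking. The sketch below is purely illustrative and not taken from the study; it shows a single "naked singles" elimination pass on a 6x6 grid (the function names and grid convention are hypothetical, with 0 marking an empty cell).

```python
# Illustrative only: one constraint-propagation pass ("naked singles")
# on a 6x6 Sudoku with 2x3 boxes. Not code from the study.

def candidates(grid, r, c):
    """Digits 1-6 not already used in row r, column c, or the 2x3 box containing (r, c)."""
    used = set(grid[r]) | {grid[i][c] for i in range(6)}
    br, bc = (r // 2) * 2, (c // 3) * 3  # top-left corner of the 2x3 box
    used |= {grid[i][j] for i in range(br, br + 2) for j in range(bc, bc + 3)}
    return {d for d in range(1, 7) if d not in used}

def fill_naked_singles(grid):
    """Repeatedly place any empty cell whose constraints leave exactly one candidate."""
    progress = True
    while progress:
        progress = False
        for r in range(6):
            for c in range(6):
                if grid[r][c] == 0:
                    opts = candidates(grid, r, c)
                    if len(opts) == 1:
                        grid[r][c] = opts.pop()
                        progress = True
    return grid
```

A trial‑and‑error approach, by contrast, places a tentative digit and backtracks when a contradiction appears, which is the pattern the researchers observed in the models' answers.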
Quality of Explanations
Beyond solving the puzzles, the researchers asked the models to explain each step. The explanations were often inaccurate or irrelevant, and in one instance a model answered a follow‑up question with a weather forecast for Denver instead of a logical justification. The study highlighted that the models tend to generate explanations that sound plausible but do not reflect the reasoning actually used to reach the answer.
Implications for Real‑World Use
These shortcomings are concerning as AI systems are being positioned for tasks such as autonomous driving, tax filing, business strategy formulation, and document translation. The inability to provide trustworthy, transparent reasoning could undermine confidence and safety in these applications.
Legal and Ethical Context
The research also references a lawsuit filed by Ziff Davis against OpenAI, alleging that OpenAI used copyrighted material to train its AI models. This legal dispute adds another layer of scrutiny to the development and deployment of large language models.
Conclusion
The study underscores the gap between impressive language generation and genuine logical problem‑solving ability. It calls for greater transparency and rigor in AI reasoning, especially as the technology moves into domains where accurate explanations are essential.