Frontier AI models lose money on soccer betting, study shows

Ars Technica

Key Points

  • Study evaluated 9 leading AI models on a season‑long soccer betting task.
  • All models started with a £100,000 bankroll and ended with losses.
  • Anthropic Claude Opus 4.6 lost the least, ending with £89,035.
  • Google Gemini 3.1 Pro and Flash 3.1 LP each suffered a total loss in their worst runs.
  • xAI Grok 4.20 and Acree Trinity lost their entire bankrolls.
  • Authors claim current AI benchmarks miss real‑world, long‑term challenges.
  • Ross Taylor, General Reasoning CEO, warned against over‑hype of AI automation.

A new paper from General Reasoning finds that leading AI models, including Anthropic's Claude Opus, OpenAI's GPT, and Google's Gemini, all lost money when tasked with betting on a full season of soccer matches. Each system started with a £100,000 bankroll and ended the season with significant losses; some were wiped out entirely. The authors say the results expose a gap between hype‑driven claims of AI automation and real‑world performance on long‑term, dynamic tasks.

General Reasoning released a paper that puts several high‑profile artificial‑intelligence models to the test on a real‑world problem: betting on a season’s worth of soccer matches. The study gave each model a normalized £100,000 bankroll and let it place bets across three simulated attempts. All of the systems lost money, and a handful went bust.

Anthropic’s Claude Opus 4.6 posted the smallest loss, with an average return on investment (ROI) of –11.0 percent and a final bankroll of £89,035. Its best run barely lost money at –0.2 percent, while its worst dropped 18.8 percent. OpenAI’s GPT fared worse, averaging –13.6 percent ROI and ending with £86,365 after its poorest run sank 31.6 percent.

Google’s Gemini series performed dramatically worse. Gemini 3.1 Pro recorded a –43.3 percent average ROI, managing a +33.7 percent gain in its most successful trial but a total loss in its worst, and ended with £56,715. The lighter‑weight Gemini Flash 3.1 LP posted –58.4 percent average ROI, with a best‑case gain of 24.7 percent, a complete wipe‑out in another run, and a final bankroll of £41,605.

Other contenders struggled even more. Z.AI’s GLM‑5 posted –58.8 percent ROI, ending with £41,221. Moonshot’s Kimi K2.5 recorded a –68.3 percent average loss and finished with just £7,420. Both xAI’s Grok 4.20 and Acree’s Trinity failed to survive any of the three attempts, each ending with a £0 bankroll.
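The quoted averages and final bankrolls are consistent with simple arithmetic: final bankroll ≈ starting bankroll × (1 + average ROI). Below is a minimal Python sketch, not taken from the paper, that applies that relationship to the figures quoted above; treating each quoted final bankroll as the product of the £100,000 stake and the model's average ROI is an assumption made here for illustration.

# Illustrative sanity check (not from the General Reasoning paper):
# final bankroll ~= starting bankroll * (1 + average ROI).
# ROI values are the averages quoted in the article; how the paper
# combines its three attempts into a single figure is assumed here.

STARTING_BANKROLL = 100_000  # pounds, per the study's setup

reported_avg_roi = {
    "Claude Opus 4.6": -0.110,
    "GPT": -0.136,
    "Gemini 3.1 Pro": -0.433,
    "Gemini Flash 3.1 LP": -0.584,
    "GLM-5": -0.588,
}

for model, roi in reported_avg_roi.items():
    final_bankroll = STARTING_BANKROLL * (1 + roi)
    print(f"{model}: £{final_bankroll:,.0f}")

Run as written, this lands within a few tens of pounds of the final bankrolls quoted above for those five models.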

“There is so much hype about AI automation, but there’s not a lot of measurement of putting AI into a long time‑horizon setting,” said Ross Taylor, chief executive of General Reasoning and co‑author of the paper. He added that many existing benchmarks test AI in static environments that do not reflect the chaos of real‑world decision making.

The authors argue that while AI has made impressive strides in tasks like code generation, its performance on complex, dynamic activities remains unproven. “If you try AI on some real‑world tasks, it does really badly,” Taylor noted. “Software engineering is very important and economically valuable, but there are lots of other activities with longer time horizons that are important to look at.”

General Reasoning’s findings, which have not yet undergone peer review, provide a sobering counterpoint to the optimism that often surrounds AI breakthroughs. The study suggests that businesses and professionals should temper expectations when considering AI for high‑stakes, long‑term decision making.

#artificial intelligence #machine learning #sports betting #soccer #AI model performance #General Reasoning #Ross Taylor #Anthropic #OpenAI #Google #xAI #Moonshot #Z.AI
Generated with News Factory - Source: Ars Technica
