OpenAI Claims GPT-5 Nears Human Performance on New GDPval Benchmark

OpenAI says GPT-5 stacks up to humans in a wide range of jobs
TechCrunch

Key Points

  • OpenAI introduced the GDPval benchmark to compare AI models with human experts across 44 occupations.
  • GPT‑5‑high achieved a win rate of about 40.6% against human professionals.
  • Anthropic’s Claude Opus 4.1 recorded a win rate near 49% in the same test.
  • The benchmark focuses on key U.S. economic sectors such as healthcare, finance, and manufacturing.
  • OpenAI sees the results as a sign AI can begin offloading routine work for many jobs.
  • Current testing scope is limited; OpenAI plans to expand GDPval to cover more tasks and workflows.
  • Analysts view GDPval as a step toward realistic measurement of AI’s economic impact.

OpenAI introduced a new benchmark called GDPval that pits its AI models against human experts across dozens of occupations. In the initial rollout, GPT-5‑high was judged better than or on par with professionals in about 40.6% of tasks, while Anthropic’s Claude Opus 4.1 achieved roughly a 49% win rate. The test covered 44 roles spanning key sectors such as healthcare, finance, and manufacturing. OpenAI says the results show AI can start offloading routine work for many jobs, though it acknowledges the current scope is limited and plans to expand the benchmark’s coverage.

OpenAI Launches GDPval Benchmark to Measure AI Against Human Professionals

OpenAI announced a new benchmark named GDPval, designed to compare the output of its AI models with that of seasoned professionals across a wide range of industries and occupations. The benchmark focuses on sectors that contribute heavily to the U.S. economy, including healthcare, finance, manufacturing, and government, and evaluates performance in forty‑four distinct jobs.

For the first version, dubbed GDPval‑v0, OpenAI asked experienced workers to review AI‑generated reports alongside human‑generated ones and choose the better piece. The model’s “win rate” represents the percentage of times its work is judged equal to or superior to the human baseline across all occupations.
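The win-rate arithmetic described above can be sketched in a few lines. This is a hypothetical illustration with made-up judgment data, not GDPval's actual scoring pipeline, which the article does not detail:

```python
# Hypothetical per-task judgments: for each task, an expert grader marks
# whether the model's deliverable was a "win", "tie", or "loss"
# against the human professional's version.
judgments = ["win", "tie", "loss", "win", "loss",
             "tie", "loss", "win", "loss", "loss"]

def win_rate(judgments):
    """Win rate = share of tasks where the model's work is judged
    equal to or better than the human baseline (wins + ties)."""
    favorable = sum(1 for j in judgments if j in ("win", "tie"))
    return favorable / len(judgments)

print(f"{win_rate(judgments):.1%}")  # 3 wins + 2 ties out of 10 -> 50.0%
```

Under this definition, a 40.6% win rate means graders preferred (or could not distinguish) the model's deliverable in roughly four out of ten tasks.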

Results Show GPT‑5‑high and Claude Opus Making Strides

In the initial run, OpenAI's GPT‑5‑high model, a more powerful variant of GPT‑5, was judged better than or on par with experts in about 40.6% of the tasks. Anthropic's Claude Opus 4.1 performed notably better, achieving a win rate near 49%. By contrast, OpenAI's earlier GPT‑4o model scored roughly 13.7%.

OpenAI noted that Claude's strong showing may owe something to its ability to produce visually pleasing graphics rather than to raw task performance alone. Even so, both models demonstrate notable progress over earlier releases.

Implications for the Workforce

The company frames the benchmark as evidence that AI systems are becoming capable enough to assist professionals in routine aspects of their work, potentially freeing up time for higher‑value activities. OpenAI’s chief economist highlighted that as models improve, workers can offload more tasks to AI, enhancing productivity across sectors.

Nevertheless, OpenAI cautions that GDPval‑v0 tests a limited set of tasks and does not capture the full complexity of many jobs. The firm plans to broaden the benchmark to cover more interactive workflows and a wider array of occupations.

Industry Perspective

Analysts see the GDPval results as a step toward more realistic assessments of AI’s economic impact. While the benchmark’s current scope is narrow, it offers a concrete way to gauge progress toward artificial general intelligence, a core goal of OpenAI’s mission.

Future iterations of GDPval are expected to incorporate additional industries and more comprehensive task sets, providing deeper insight into how AI can complement, rather than replace, human expertise.

#OpenAI #GPT5 #AIBenchmark #GDPval #Anthropic #ClaudeOpus #AIPerformance #Jobs #ArtificialGeneralIntelligence #EconomicImpact
