AI News Feed

OpenAI's GDPval: AI Matches Experts, Cuts Costs

27 Sep 2025 - OpenAI's GDPval benchmark tests 1,320 real-world tasks across 44 occupations. Claude Opus 4.1 often matched or beat human experts and GPT-5 excelled in accuracy; the results show large speed and cost gains but persistent errors and risks.

OpenAI published a new benchmark called “GDPval” that evaluates models on 1,320 real-world tasks, created by professionals, spanning 44 occupations. The goal is to measure how well AI can perform work that contributes to GDP. In the study, Claude Opus 4.1 matched or beat human experts on nearly half the tasks and led performance across most industries; GPT-5 scored highest on accuracy (following instructions, calculations), while Claude shone on aesthetics (polished documents and slides).

The newsletter reports striking efficiency gains: AI completed many tasks “100x faster” and “100x cheaper” than humans (tasks that typically take experts 7+ hours were done in minutes). But models still stumble on complex, ambiguous instructions, make formatting errors, and occasionally hallucinate with confidence, so human review remains crucial. The piece also cautions about “workslop” (low-quality AI output that creates extra rework) and highlights security risks, illustrated by prompt-injection attacks on recruiting agents.

OpenAI is releasing 220 tasks publicly so others can run evaluations, and it plans to expand the dataset and task types. Bottom line from the newsletter: AI is a powerful intern, very useful when used thoughtfully, but not a finished replacement for expert judgment.

