18 Nov 2025
Anthropic has published an open, reproducible evaluation of political bias across major chat models — including its own Claude family plus GPT‑5, Gemini, Grok and Llama. The company frames the work as an “Ideological Turing Test”: can a model describe political viewpoints so well that people holding those views would agree with the description? Rather than single prompts, Anthropic used “Paired Prompts” that ask models to present opposing takes on the same political topic and scored responses on even‑handedness, acknowledgement of opposing perspectives, and refusal rates.
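To make the setup concrete, here is a minimal sketch in Python of what a paired-prompt evaluation could look like. The schema, the `query_model` stub, the `grade_pair` rubric and the example prompts are illustrative assumptions, not Anthropic's actual dataset or grader prompts (those are in the released materials).

```python
from dataclasses import dataclass

@dataclass
class PairedPrompt:
    """One topic asked from two opposing political framings.
    (Hypothetical schema; the released dataset's layout may differ.)"""
    topic: str
    prompt_a: str   # e.g. argue in favour of the policy
    prompt_b: str   # e.g. argue against the same policy

def query_model(prompt: str) -> str:
    """Placeholder for a chat-model call; swap in a real API client."""
    raise NotImplementedError

def grade_pair(reply_a: str, reply_b: str) -> dict:
    """Placeholder AI grader for the three criteria named above.
    In practice this would itself be an LLM prompted with a rubric."""
    return {
        "even_handed": True,          # both stances argued with comparable depth and quality
        "acknowledges_other": False,  # replies note counterarguments to their own stance
        "refused": False,             # the model declined to engage with the topic
    }

pair = PairedPrompt(
    topic="carbon tax",
    prompt_a="Write a persuasive case for a national carbon tax.",
    prompt_b="Write a persuasive case against a national carbon tax.",
)
```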
They ran 1,350 prompt pairs over 150 topics and used AI graders to evaluate thousands of replies. Key results: Claude Sonnet 4.5 scored 94% on even‑handedness and Claude Opus 4.1 hit 95%; Gemini 2.5 Pro (97%) and Grok 4 (96%) were marginally higher. GPT‑5 scored 89% and Llama 4 trailed at 66%. On acknowledging counterarguments, Opus 4.1 led at 46%, ahead of Grok 4 (34%), Llama 4 (31%) and Sonnet 4.5 (28%). Refusal rates were low for the Claude models (3–5%), near zero for Grok, and highest for Llama (9%).
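The headline percentages are just aggregates of per-pair grader labels across the 1,350 pairs. A hedged sketch of that aggregation, assuming grader output shaped like the dict in the previous snippet; the real pipeline may weight task types or grade each reply separately:

```python
def aggregate(grades: list[dict]) -> dict:
    """Turn per-pair grader labels into benchmark-level percentages."""
    n = len(grades)
    return {
        "even_handedness_pct": 100 * sum(g["even_handed"] for g in grades) / n,
        "acknowledgement_pct": 100 * sum(g["acknowledges_other"] for g in grades) / n,
        "refusal_pct": 100 * sum(g["refused"] for g in grades) / n,
    }

# e.g. scores = aggregate(all_graded_pairs)
```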
Crucially, Anthropic open‑sourced the dataset, grader prompts and methodology on GitHub to let other labs reproduce, challenge, or improve the benchmark — arguing a shared standard for measuring political bias benefits the whole industry.
Source