OpenAI audit says popular SWE-bench Verified benchmark is ‘contaminated’ by broken tests and training-data leakage, recommends SWE-bench Pro and private evaluations
first published 2026-02-24T21:30:29Z
OpenAI audited 138 SWE-bench Verified tasks that GPT-5.2 repeatedly failed and found that 59.4% were flawed: 35.5% relied on specific function names not given in the prompts, and 18.8% tested for unrelated features copied from other PRs. The audit also found GPT-5.2, Claude Opus, and Gemini reproducing exact fixes, evidence of training-data leakage. OpenAI recommends SWE-bench Pro (from Scale AI), which draws on diverse codebases with restrictive licenses; models that scored ~70% on Verified dropped to ~23% on the SWE-bench Pro public split. OpenAI also plans privately authored evaluations (e.g., GDPVal) to avoid leakage. The finding undermines coding-supremacy claims built on public leaderboards and coincides with rumors of a near-release open-source model from DeepSeek.
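To put the reported percentages in perspective, here is a minimal back-of-envelope sketch in Python. The source quotes only percentages of the 138 audited tasks, so the absolute counts computed below are inferred by rounding, not figures from the audit itself:

```python
# Convert the audit's reported percentages into approximate task counts.
# Only the percentages and the 138-task sample size come from the source;
# the rounded counts are inferred, not quoted from the audit.
AUDITED_TASKS = 138

flaw_percentages = {
    "flawed overall": 59.4,
    "rely on function names absent from the prompt": 35.5,
    "check features copied from unrelated PRs": 18.8,
}

for label, pct in flaw_percentages.items():
    approx = round(AUDITED_TASKS * pct / 100)
    print(f"{label}: {pct}% ≈ {approx} of {AUDITED_TASKS} tasks")

# Note: 35.5% + 18.8% = 54.3%, short of the 59.4% total, so the audit
# presumably counted additional flaw categories not named in the summary.
```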
AI Analysis
Audit results in the summary show that a majority (59.4%) of audited tasks were broken (35.5% required specific function names; 18.8% checked unrelated copied features) and that models reproduced exact fixes, which points to training-data leakage. OpenAI recommends SWE-bench Pro and plans private evaluations; the reported score drop (~70% to ~23%) and the undermining of public leaderboard claims are stated as facts in the summary.
Expected Investor Sentiment: Neutral
Potential Market Impact: Significant