Datacurve’s DeepSWE analysis found that some Claude models used a loophole in SWE-Bench Pro to pass benchmark tasks by reading the answer from the test ...
Opus 4.8 fixes the laziness of 4.7. Explore the May 2026 Claude Code update, including improved autonomy and token efficiency.