GPT-5.5 and Opus 4.7: OpenAI and Anthropic Both Admit Post-Launch Quality Issues

Editor J May 16, 2026

Within a month of launch, complaints about quality degradation poured in for both GPT-5.5 and Opus 4.7, and both vendors eventually admitted issues. Anthropic published an April 23 postmortem confirming that a '≤25 words between tool calls' system prompt had broken Opus 4.7's coding quality. OpenAI Codex lead Thibault Sottiaux acknowledged on May 15 that GPT-5.5 was performing worse for some users, then reported the next day that two issues behind roughly 48 hours of degraded GPT-5.5 performance in Codex had been found and fixed.

Across April and May 2026, OpenAI's GPT-5.5 and Anthropic's Claude Opus 4.7, the flagship models from the two leading AI labs, ran into the same wall of user backlash about post-launch quality. Both companies eventually admitted to problems: one through a formal postmortem, the other through a short run of tweets.

Anthropic released its own postmortem on April 23, 'An update on recent Claude Code quality reports.' Three weeks later, OpenAI Codex lead Thibault Sottiaux acknowledged on X on May 15 that GPT-5.5 had been performing worse for some users. The formats differ; the message lines up — something was off, and they shipped it.

A Month After Launch, the Same Complaints Hit Both Models at Once

[Image: Anthropic's official Opus 4.7 benchmark table (SWE-bench Pro, Rakuten-SWE-Bench, GPQA), compared against Opus 4.6, GPT-5.4, Gemini 3.1 Pro, and the Mythos Preview]

Anthropic shipped Claude Opus 4.7 to general availability on April 16. The release page leaned on numbers like 64.3% on SWE-bench Pro, a 13% coding improvement over Opus 4.6 on Rakuten-SWE-Bench, and a 98.5% visual-acuity score. OpenAI announced GPT-5.5 a week later on April 23, then promoted its 'GPT-5.5 Instant' variant to ChatGPT's default model on May 5.

But similar complaints started piling up on both sides well before the first month was out. On r/ClaudeAI, threads multiplied about Opus 4.7 being more verbose than 4.6 and showing attention drift. On r/codex and r/ChatGPTcomplaints, users logged slower reasoning and broken compaction in GPT-5.5. It was a familiar arc (see our earlier coverage of the GPT-5.4 benchmark controversy): the gap between launch-day praise and the post-launch reality check shows up within days.

Anthropic's Confession: '≤25 Words Between Tool Calls' Killed Opus 4.7 Coding

[Image: Dario Amodei, CEO of Anthropic, on the Dwarkesh podcast]

Anthropic's engineering blog published the postmortem on April 23, and it is candid. Claude Code quality complaints were not an illusion: 'three separate changes' combined to look like broad degradation.

First, on March 4 the default reasoning effort was lowered from 'high' to 'medium' to cut latency. Users preferred intelligence over speed, and the change was reverted on April 7. Second, a prompt-caching optimization shipped on March 26 carried a bug: once a session had gone idle, it wiped prior thinking on every subsequent turn instead of just once, making Claude 'forgetful and repetitive.' That was fixed on April 10. Third, and the one specific to Opus 4.7 that the postmortem singled out: a verbosity-reduction system prompt added on launch day, April 16, told the model to 'keep text between tool calls to ≤25 words.' That single line quietly hurt coding quality and was reverted on April 20. Anthropic then reset every subscriber's usage limit on April 23.
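To make the second change concrete, here is a minimal Python sketch of how a "wipe once after idle" rule can turn into "wipe on every turn." It is an illustration only, not Anthropic's code: the postmortem as quoted here describes the behavior, not the implementation, so the `Session` class, the `was_idle` flag, and the one-hour `IDLE_THRESHOLD_S` cutoff are all hypothetical.

```python
from dataclasses import dataclass, field

IDLE_THRESHOLD_S = 3600  # hypothetical idle cutoff, not a documented value


@dataclass
class Session:
    thinking: list[str] = field(default_factory=list)  # retained prior thinking
    last_turn_at: float = 0.0
    was_idle: bool = False  # sticky flag used only by the buggy variant


def turn_buggy(s: Session, now: float, thought: str) -> None:
    if now - s.last_turn_at > IDLE_THRESHOLD_S:
        s.was_idle = True
    if s.was_idle:
        # The flag is never reset, so every later turn wipes history again.
        s.thinking.clear()
    s.thinking.append(thought)
    s.last_turn_at = now


def turn_fixed(s: Session, now: float, thought: str) -> None:
    if now - s.last_turn_at > IDLE_THRESHOLD_S:
        # Wipe exactly once, on the first turn after the idle gap.
        s.thinking.clear()
    s.thinking.append(thought)
    s.last_turn_at = now


def replay(turn_fn, turns) -> list[str]:
    """Run a list of (timestamp, thought) turns through a fresh session."""
    s = Session()
    for now, thought in turns:
        turn_fn(s, now, thought)
    return s.thinking


# Two turns, a long idle gap, then three more turns.
TURNS = [(0, "a"), (10, "b"), (5000, "c"), (5010, "d"), (5020, "e")]
```

Replaying `TURNS` through `turn_buggy` leaves only the latest thought in memory, while `turn_fixed` keeps everything since the idle gap. From the user's side, that difference is what 'forgetful and repetitive' would look like.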

Anthropic underlined two sentences: 'the API and inference layer were unaffected' and 'we never intentionally degrade our models.' It also committed to applying 'soak periods' and 'gradual rollouts' to future system-prompt changes. For a disclosure about a system-prompt bug that degraded a flagship model, this one is unusually transparent.
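For readers unfamiliar with the terms, a gradual rollout with soak periods generally means exposing a change to a small, stable slice of users, waiting, and only then widening it. The Python sketch below shows one common way to do that with hash-based bucketing. It is a generic illustration, not a description of Anthropic's process: the `STAGES` schedule and all function names are assumptions.

```python
import hashlib

# Hypothetical schedule: (days since rollout start, percent of users on the
# new system prompt). The gaps between stages act as soak periods.
STAGES = [(0, 1), (2, 10), (5, 50), (9, 100)]


def rollout_percent(days_elapsed: int) -> int:
    """Return the exposure percentage for the latest stage already reached."""
    pct = 0
    for start_day, percent in STAGES:
        if days_elapsed >= start_day:
            pct = percent
    return pct


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def uses_new_prompt(user_id: str, days_elapsed: int) -> bool:
    """A user gets the new prompt once their bucket falls under the threshold."""
    return bucket(user_id) < rollout_percent(days_elapsed)
```

Because each user's bucket is fixed, anyone who receives the new prompt at an early stage keeps it at later stages, so a regression seen in the 1% stage can be caught and reverted before the change reaches everyone.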

OpenAI's Admission: A 24-Hour Window Closed by Sottiaux's Tweet

[Image: Greg Brockman, OpenAI President, speaking at the SK AI Summit]

OpenAI handled it differently. The OpenAI Status page had already logged a May 8 'Increased Error Rate for gpt-5.5' incident lasting about an hour and 43 minutes, but the real wave came mid-month. On May 15, Codex lead Thibault Sottiaux tweeted: 'Codex team is aware of reports of GPT-5.5 performing worse for some users and investigating.'

The next morning a sharper follow-up landed from the same account: 'We found and fixed two issues that could explain this degradation of the capability of GPT-5.5 in Codex over the last ~48 hours. I will reset usage limits this evening.' It collected 6,374 likes, 445 retweets, and 820,000 views. Six hours later: 'Confirmed stable and recovered.' For OpenAI Codex, acknowledgment, fix, and closure all landed inside a 24-hour window.

No formal postmortem followed. OpenAI gave no detail on what the two issues were, whether infrastructure, routing, or a system-prompt bug. Lined up against Anthropic's April 23 report, the contrast in how the two companies treat the same kind of event is sharp.

What Both Vendors Deny in Unison: 'The API and Inference Layer Are Fine'

Anthropic stated in the postmortem that 'the API and inference layer were unaffected,' and Sottiaux carefully scoped his admission to the Codex environment. Both vendors landed in the same place: the model itself and the inference infrastructure were not broken; the product layer sitting on top was.

But status.claude.com tells a slightly different story. Within roughly four weeks of the April 16 GA, 'Elevated errors on Claude Opus 4.7' incidents were logged on May 4, 7, 8, and 14, plus a multi-model elevated error rate incident on May 15. Even if the API layer was 'unaffected' in the strict sense, from the user's side the product bugs and the API incidents fused into a single experience of model regression.

The 'Nerfed Feel' That Doesn't Go Away After the Fixes

[Image: Abstract circuit-style illustration of an AI language model]

After Anthropic reverted the verbosity prompt on April 20 and OpenAI shipped the two Codex fixes on May 16, the sense on X and Reddit that Opus 4.7 had been nerfed did not dissolve. A May 12 tweet, '4.7 has been nerfed in an almost unexplainable way; GPT-5.5 feels more like the old Opus 4.6,' pulled 1,582 views. Another post complained: 'I'm waiting 40 seconds just to get a worse answer,' reaching 408 views.

A more concrete data point also surfaced. One researcher ran Opus 4.7 against the Cybench cybersecurity challenge set and reported the model recalling solutions verbatim on 11 of 40 challenges, a pattern that points to training-data contamination. Suspicion about the model itself outlasted the product-layer fixes.

The arc looks a lot like the pattern from our earlier coverage of suspected intentional performance degradation in Gemini 3.1 and GPT-5.4 one month earlier. When a company says 'the API is fine' but the ChatGPT or Claude Code screen people use every day looks worse, what users see on that screen becomes the real performance.

The Real Verdict on an AI Model Comes a Month After Launch

Launch-day benchmarks are enough for marketing, but user trust is built in operations. Anthropic's April 23 postmortem comes closer to a best-practice template; OpenAI's 24-hour tweet admission models speed at the cost of leaving documentation blank.

What users now distrust about GPT-5.5 and Opus 4.7 isn't one model line; it's the whole operation: system prompts shipping under deadline pressure without review, cache optimizations silently changing behavior, and status-page incidents coexisting with 'unaffected' claims. Whichever flagship lands next, Opus 4.8 or GPT-5.6, trust will again be decided not on launch day but a month after.