Tags: GPT-5.5 · Codex · token clustering · model performance · reasoning tokens · OpenAI · AI degradation · bug report

GPT-5.5 Codex reasoning-token clustering at 516/1034/1552 may be leading to degraded performance on complex tasks

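An analysis of 390,195 Codex responses shows that GPT-5.5 responses are disproportionately concentrated at exactly 516 reasoning tokens (82% of all exact-516 events), with additional spikes around 1034 and 1552. Over the same window, mean reasoning-token intensity fell sharply, from 268.1 tokens in February 2026 to 106.9 tokens in May 2026. Together these may point to a reasoning-budget or truncation mechanism that degrades performance on complex tasks.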

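Key points

  • GPT-5.5 accounts for only 19.3% of responses but 82.0% of exact-516 events; its exact-516 / >=516 ratio (44.0%) is about 33.6x the ratio for all other models (1.3%)
  • Monthly clustering at exactly 516 (exact-516 / >=516) rose from 0.11% in February 2026 to 53.30% in May 2026, while mean reasoning tokens fell from 268.1 to 106.9 over the same period
  • The fixed token values (516, 1034, 1552) suggest threshold-based reasoning-budget behavior rather than natural variation, and call for an investigation into a possible truncation or scheduler mechanism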
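The headline ratios in the key points can be sanity-checked from the counts published in the issue. The sketch below uses only the rounded figures quoted there; the variable names are illustrative. Recomputing the multiple from the rounded 44.0% and 1.3% gives roughly 33.8x rather than the reported 33.6x, which is presumably because the issue author worked from unrounded ratios (an assumption, the issue does not say).

```python
# Figures quoted in the issue (rounded as published).
total_records = 390_195
gpt55_records = 75_401
exact_516_events = 3_363
gpt55_share_of_exact_516 = 0.820   # 82.0% of exact-516 events
gpt55_ratio = 0.440                # exact-516 / >=516 for gpt-5.5
other_ratio = 0.013                # exact-516 / >=516 for all other models

# Share of all responses that come from gpt-5.5.
gpt55_share = gpt55_records / total_records

# How over-represented gpt-5.5 is among exact-516 events,
# relative to its share of all responses.
over_representation = gpt55_share_of_exact_516 / gpt55_share

# Approximate number of exact-516 events attributable to gpt-5.5.
gpt55_exact_516 = round(exact_516_events * gpt55_share_of_exact_516)

# Ratio multiple recomputed from the rounded percentages.
ratio_multiple = gpt55_ratio / other_ratio

print(f"gpt-5.5 share of responses: {gpt55_share:.1%}")
print(f"over-representation at exactly 516: {over_representation:.2f}x")
print(f"approx. gpt-5.5 exact-516 events: {gpt55_exact_516}")
print(f"ratio multiple from rounded inputs: {ratio_multiple:.1f}x")
```

The over-representation factor (about 4.2x) and the ratio multiple measure different things: the first compares shares of events, the second compares how often a response that reaches 516 stops exactly there.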


Source: Hacker News

Full text (original)

GPT-5.5 Codex reasoning-token clustering at 516/1034/1552 may be leading to degraded performance on complex tasks · Issue #30364 · openai/codex · GitHub

Status: Open
Labels: bug, model-behavior, rate-limits
Opened by vguptaa45 on Jun 27, 2026

Summary

I found an aggregate pattern in Codex token_count metadata: gpt-5.5 responses disproportionately land at exactly reasoning_output_tokens = 516, with additional fixed-boundary spikes around 1034 and 1552.

This appears model-specific and coincides with lower overall reasoning-token intensity, which may help explain degraded performance on complex/high-stakes Codex tasks. This is related to #29353 , which reported a task-level reproduction where gpt-5.5 runs ending at exactly 516 reasoning tokens returned the wrong answer. This issue adds aggregate evidence across a larger Feb-Jun window. I am not claiming this proves hidden chain-of-thought truncation.

The narrower claim is that Codex telemetry shows a GPT-5.5-specific fixed-token clustering anomaly that looks consistent with thresholded reasoning-budget behavior.

Environment

  • Product: Codex
  • Model most implicated: gpt-5.5
  • Data source: Codex token_count metadata
  • Time window analyzed: Feb 1-Jun 27, 2026 UTC
  • Related issue: #29353 (gpt-5.5 xhigh sometimes short-circuits with reasoning_output_tokens=516 and wrong final_answer in Codex Desktop)

Evidence

| Metric | Value |
|---|---|
| Response-level token records analyzed | 390,195 |
| Sessions represented | 865 |
| Exact reasoning_output_tokens = 516 events | 3,363 |
| GPT-5.5 share of all responses | 19.3% |
| GPT-5.5 share of exact-516 events | 82.0% |
| GPT-5.5 exact-516 / >=516 ratio | 44.0% |
| Non-GPT-5.5 exact-516 / >=516 ratio | 1.3% |

Model-level result:

| Model | Response records | Exact 516 / >=516 |
|---|---|---|
| gpt-5.5 | 75,401 | 44.0% |
| gpt-5.4 | 25,214 | 19.8% |
| gpt-5.2 | 247,575 | 0.34% |
| gpt-5.3-codex | 13,333 | 0.0% |
| gpt-5.3-codex-spark | 26,179 | 0.0% |

Monthly exact-516 clustering increased sharply:

| Month | Exact 516 / >=516 |
|---|---|
| Feb 2026 | 0.11% |
| Mar 2026 | 2.45% |
| Apr 2026 | 4.25% |
| May 2026 | 53.30% |
| Jun 2026 | 35.84% |

At the same time, overall reasoning-token intensity decreased:

| Month | Mean reasoning tokens | P90 reasoning tokens |
|---|---|---|
| Feb 2026 | 268.1 | 772 |
| Mar 2026 | 256.8 | 723 |
| Apr 2026 | 228.7 | 669 |
| May 2026 | 106.9 | 344 |
| Jun 2026 | 168.5 | 515 |

Why this looks suspicious

The anomaly is not simply higher reasoning-token usage overall.

Mean and P90 reasoning-token intensity fell from February-April to May-June, while exact-516 clustering rose sharply. The clustering is also not evenly distributed across models. gpt-5.5 accounts for only 19.3% of responses but 82.0% of exact-516 events. Its exact-516 / >=516 ratio is about 33.6x higher than the non-GPT-5.5 baseline.
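To put numbers on the opposing trends described above, the following sketch computes each month's change in mean and P90 reasoning tokens relative to February, next to the exact-516 ratio. The monthly values are copied from the issue's tables; the dictionary layout and the helper name `relative_change` are illustrative choices, not part of the original analysis.

```python
# Monthly figures copied from the issue's tables.
monthly = {
    "Feb 2026": {"exact_516_ratio": 0.11, "mean": 268.1, "p90": 772},
    "Mar 2026": {"exact_516_ratio": 2.45, "mean": 256.8, "p90": 723},
    "Apr 2026": {"exact_516_ratio": 4.25, "mean": 228.7, "p90": 669},
    "May 2026": {"exact_516_ratio": 53.30, "mean": 106.9, "p90": 344},
    "Jun 2026": {"exact_516_ratio": 35.84, "mean": 168.5, "p90": 515},
}


def relative_change(new, old):
    """Percentage change from `old` to `new`."""
    return (new - old) / old * 100.0


baseline = monthly["Feb 2026"]
changes = {}
for month, row in monthly.items():
    changes[month] = {
        "mean_change_pct": relative_change(row["mean"], baseline["mean"]),
        "p90_change_pct": relative_change(row["p90"], baseline["p90"]),
    }
    print(
        f"{month}: exact-516 ratio {row['exact_516_ratio']:.2f}%, "
        f"mean {changes[month]['mean_change_pct']:+.1f}% vs Feb, "
        f"P90 {changes[month]['p90_change_pct']:+.1f}% vs Feb"
    )
```

On these figures, May shows the largest drop in both mean and P90 and also the highest exact-516 ratio, which is the co-movement the issue highlights. Correlation across five monthly points does not by itself establish a cause.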

The fixed values are also notable: 516, 1034, and 1552 look like repeated threshold boundaries rather than a naturally varying reasoning-token distribution.

Expected behavior

Reasoning-token counts for complex Codex tasks should vary naturally with task complexity and should not disproportionately cluster at exact fixed values for one model family.

Actual behavior

gpt-5.5 responses cluster heavily at exactly 516 reasoning tokens, with related spikes around 1034 and 1552. This pattern is much weaker or absent in several other models.
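One arithmetic observation supports the "repeated threshold" reading of the three values: they are evenly spaced, 518 tokens apart. The issue does not state this spacing explicitly, and it says nothing about the cause. The sketch below checks the spacing and then shows one simple way to flag exact-value spikes in a list of reasoning-token counts. The `find_spikes` heuristic, its parameters, and the synthetic data are all assumptions made for illustration, not the method used in the issue.

```python
from collections import Counter

# Boundary values reported in the issue.
boundaries = [516, 1034, 1552]

# Gap between consecutive boundaries; equal gaps hint at a repeating step.
gaps = [b - a for a, b in zip(boundaries, boundaries[1:])]
print("gaps:", gaps)


def find_spikes(values, window=5, factor=10.0, min_count=20):
    """Return values whose count is more than `factor` times the mean count
    of their neighbours within +/- `window` tokens (illustrative heuristic)."""
    counts = Counter(values)
    spikes = []
    for value, count in sorted(counts.items()):
        if count < min_count:
            continue
        neighbours = [counts.get(value + d, 0)
                      for d in range(-window, window + 1) if d != 0]
        neighbour_mean = sum(neighbours) / len(neighbours)
        # Floor the neighbour mean at 1 so an empty neighbourhood
        # does not make the comparison trivially true.
        if count > factor * max(neighbour_mean, 1.0):
            spikes.append(value)
    return spikes


# Synthetic data only: one response at every count from 400 to 1699,
# plus extra mass injected at the three reported boundaries.
synthetic = list(range(400, 1700)) + [516] * 50 + [1034] * 30 + [1552] * 25
spikes = find_spikes(synthetic)
print("spikes:", spikes)
```

A naturally varying distribution would be expected to produce few or no such isolated spikes, so a detector of this kind run per model is one way to compare gpt-5.5 against the other models.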

Ask

Could the Codex team investigate whether gpt-5.5 has a reasoning-budget, routing, truncation, fallback, or scheduler behavior that causes responses to terminate around 516/1034/1552 reasoning tokens? If this is expected behavior, it would be useful to know whether exact 516 indicates a normal stopping point, a budget cap, a degraded tier, or another internal threshold.

Useful internal validation checks:

  • Query token_count events with reasoning_output_tokens by model.
  • Compare exact-value counts for 0, 516, 1034, and 1552.
  • Compute count(reasoning_output_tokens = 516) / count(reasoning_output_tokens >= 516) by model and day.
  • Compare gpt-5.5 against gpt-5.2, gpt-5.4, and Codex-specific variants.
  • Replay matched complex tasks across GPT-5.2 and GPT-5.5 with quality evals, especially separating exact-516 responses from longer-reasoning responses.
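The first three checks in the list above can be sketched in a few lines. This is a minimal illustration, assuming each telemetry record can be reduced to a `(model, day, reasoning_output_tokens)` tuple. That record shape, the function names, and the sample rows are hypothetical, since the issue does not describe the actual schema of Codex token_count events.

```python
from collections import defaultdict


def exact_ratio_by_model_day(records, threshold=516):
    """count(tokens == threshold) / count(tokens >= threshold) per (model, day).

    Groups with no record at or above the threshold are omitted."""
    exact = defaultdict(int)
    at_or_above = defaultdict(int)
    for model, day, tokens in records:
        if tokens >= threshold:
            at_or_above[(model, day)] += 1
            if tokens == threshold:
                exact[(model, day)] += 1
    return {key: exact[key] / n for key, n in at_or_above.items()}


def exact_value_counts(records, values=(0, 516, 1034, 1552)):
    """Per-model count of responses landing exactly on each value of interest."""
    counts = defaultdict(lambda: dict.fromkeys(values, 0))
    for model, _day, tokens in records:
        per_model = counts[model]
        if tokens in per_model:
            per_model[tokens] += 1
    return {model: dict(per_model) for model, per_model in counts.items()}


# Hypothetical sample rows, not real telemetry.
records = [
    ("gpt-5.5", "2026-05-01", 516),
    ("gpt-5.5", "2026-05-01", 516),
    ("gpt-5.5", "2026-05-01", 900),
    ("gpt-5.5", "2026-05-01", 120),
    ("gpt-5.5", "2026-05-01", 1034),
    ("gpt-5.2", "2026-05-01", 516),
    ("gpt-5.2", "2026-05-01", 700),
    ("gpt-5.2", "2026-05-01", 810),
    ("gpt-5.2", "2026-05-01", 640),
    ("gpt-5.2", "2026-05-01", 0),
]

ratios = exact_ratio_by_model_day(records)
value_counts = exact_value_counts(records)
for (model, day), ratio in sorted(ratios.items()):
    print(f"{model} {day}: exact-516 / >=516 = {ratio:.2%}")
print(value_counts)
```

Using >=516 as the denominator restricts the comparison to responses that reached the boundary at all, so the ratio is not diluted by short responses that never approach it.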
