OpenAI Unveils GPT-5.5: A New Era of Agentic AI Capabilities

OpenAI has launched GPT-5.5, which it touts as a new class of agentic AI designed for independent planning and tool use. The model shows significant gains across several benchmarks and is available to a wide range of users, though its premium pricing calls for careful evaluation against real-world workloads.

Uche Emeka • AI • 3 months ago • 4 minute read

OpenAI unveiled GPT-5.5 on April 23, describing it as a “new class of intelligence for real work and powering agents.” The framing reflects its design: OpenAI presents it as its most capable agentic model to date, built to plan independently, use tools, check its own outputs, and carry out complex tasks. GPT-5.5 is the first retrained base model since GPT-4.5 and was co-designed with NVIDIA’s GB200 and GB300 NVL72 rack-scale systems. According to OpenAI, the most significant practical difference is autonomy: the model can handle tasks that previously required multiple prompts and human course-correction.

The model is rolling out progressively to Plus, Pro, Business, and Enterprise subscribers across ChatGPT and Codex. API access followed on April 24, the day after launch, allowing developers to integrate the model into their applications. OpenAI has also published benchmark results to support its performance claims.

On Terminal-Bench 2.0, which evaluates command-line workflows that require planning and tool coordination in a sandboxed environment, GPT-5.5 scored 82.7%, ahead of GPT-5.4’s 75.1% and Claude Opus 4.7’s 69.4%. On SWE-Bench Pro, which tests GitHub issue resolution, GPT-5.5 reached 58.6%, resolving more issues in a single pass than its predecessors. On Expert-SWE, an internal benchmark of tasks with a median estimated human completion time of 20 hours, GPT-5.5 scored 73.1%, up from GPT-5.4’s 68.5%. In long-context reasoning, GPT-5.5 registered 74.0% on MRCR v2 at one million tokens, a retrieval benchmark that measures how well a model locates specific answers embedded in very long documents; that is roughly double GPT-5.4’s 36.6%. On Scale AI’s Model Context Protocol (MCP) Atlas, a tool-use benchmark, Claude Opus 4.7 currently leads with 79.1%, while no score was recorded for GPT-5.5. That OpenAI shows this gap in its own benchmark table suggests confidence in the model’s overall performance profile.
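For readers who want to compare the generational gains side by side, the reported scores can be turned into percentage-point and relative improvements. The following is a minimal sketch that uses only the GPT-5.5 and GPT-5.4 figures quoted above; the dictionary layout and function names are illustrative and do not come from OpenAI.

```python
# Scores in percent, as quoted in the article: (GPT-5.5, GPT-5.4).
# Only benchmarks where both scores are reported are included.
scores = {
    "Terminal-Bench 2.0": (82.7, 75.1),
    "Expert-SWE": (73.1, 68.5),
    "MRCR v2 (1M tokens)": (74.0, 36.6),
}


def point_gain(new: float, old: float) -> float:
    """Absolute improvement in percentage points."""
    return round(new - old, 1)


def relative_gain(new: float, old: float) -> float:
    """Improvement as a percentage of the older score."""
    return round((new - old) / old * 100, 1)


gains = {
    name: (point_gain(new, old), relative_gain(new, old))
    for name, (new, old) in scores.items()
}

for name, (points, relative) in gains.items():
    print(f"{name}: +{points} points ({relative}% relative)")
```

Viewed this way, the long-context result stands apart from the others: the MRCR v2 score roughly doubles, while the coding and terminal benchmarks improve by single-digit percentage points.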

API access for GPT-5.5 costs US$5 per million input tokens and US$30 per million output tokens, double the GPT-5.4 rates. OpenAI argues that GPT-5.5 completes the same Codex tasks with fewer tokens than GPT-5.4, so the effective cost rises by roughly 20% rather than 100% once that efficiency is factored in. Artificial Analysis has independently validated this claim. Pro, Business, and Enterprise users can also access GPT-5.5 Pro, priced at US$30 per million input tokens and US$180 per million output tokens. The Pro version applies additional parallel test-time compute to harder problems and leads publicly available models on BrowseComp, OpenAI’s agentic web-browsing benchmark, with a score of 90.1%.

Teams should stress-test token efficiency against their actual workloads before switching models. At 10 million output tokens per month, for instance, standard GPT-5.5 costs US$300 against US$250 for Claude Opus 4.7. That 20% premium pays off only if GPT-5.5’s stronger agentic performance translates into fewer task iterations and retries, and the size of the benefit will vary by use case.
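The pricing arithmetic is easy to reproduce and adapt to a team's own token volumes. The sketch below is illustrative only. It assumes Claude Opus 4.7's output price is US$25 per million tokens, which is inferred from the US$250 figure for 10 million tokens rather than stated in the article, and it treats the "double the price, roughly 20% higher effective cost" claim as implying that GPT-5.5 uses about 60% of GPT-5.4's tokens on the same task. All names are hypothetical.

```python
# Output prices in US$ per million tokens.
GPT_55_OUTPUT = 30.0   # quoted in the article
OPUS_47_OUTPUT = 25.0  # assumption: inferred from US$250 per 10M tokens


def monthly_output_cost(price_per_million: float, output_tokens: int) -> float:
    """Monthly output-token cost in US$ at a given price and volume."""
    return price_per_million * output_tokens / 1_000_000


tokens = 10_000_000
gpt_cost = monthly_output_cost(GPT_55_OUTPUT, tokens)
opus_cost = monthly_output_cost(OPUS_47_OUTPUT, tokens)

# Premium of GPT-5.5 over Opus 4.7 at equal token volume.
premium = (gpt_cost - opus_cost) / opus_cost

# Break-even: the fraction of Opus 4.7's output tokens GPT-5.5 may use
# (through fewer retries or shorter outputs) and still cost the same.
break_even = OPUS_47_OUTPUT / GPT_55_OUTPUT

# Inference from the article's GPT-5.4 comparison: the price is 2x and the
# effective cost about 1.2x, so token usage is about 1.2 / 2 of GPT-5.4's.
implied_token_ratio = 1.20 / 2.0

print(f"GPT-5.5: ${gpt_cost:.0f}, Opus 4.7: ${opus_cost:.0f}")
print(f"Premium at equal volume: {premium:.0%}")
print(f"Break-even token ratio vs Opus 4.7: {break_even:.2f}")
print(f"Implied token ratio vs GPT-5.4: {implied_token_ratio:.2f}")
```

Under these assumptions, GPT-5.5 would need to finish the same work with about 83% or less of the output tokens Claude Opus 4.7 uses before the higher price pays for itself. Substituting measured token counts from a real workload for the 10 million placeholder gives a more reliable answer than the headline prices alone.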

OpenAI reports that more than 85% of its employees, across departments including engineering and marketing, now use Codex weekly. In one example, the communications team used GPT-5.5 to process six months of speaking-request data and build a scoring and risk framework that automated a significant share of low-risk approvals. Greg Brockman, President of OpenAI, characterized the release as “a real step forward towards the kind of computing that we expect in the future,” while chief scientist Jakub Pachocki observed that model progress over the past two years had felt “surprisingly slow.” OpenAI also says GPT-5.5 matches GPT-5.4’s per-token latency in production serving while delivering higher intelligence, which is notable because larger, more capable models are often slower to serve.

The key question for the coming weeks is whether these benchmark leads translate into production gains for teams running real agentic pipelines, particularly unattended terminal agents and DevOps automation, where Terminal-Bench scores are most relevant. Teams focused on tool-use orchestration should watch the reported gap on MCP Atlas closely.