For years, voice AI has been the feature that almost worked. Fast enough to feel snappy in a demo, but brittle in production, stumbling over interruptions, losing context mid-conversation, and lacking the reasoning to handle anything more complex than a weather query. Today, OpenAI took a direct swing at that problem.
On May 7, 2026, OpenAI announced three new realtime audio models through its API: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. Taken together, they represent the most ambitious voice infrastructure push the company has made since the original Realtime API launched.
This wasn't a consumer product announcement. It was a developer infrastructure play, and that framing matters.
The Three Models, Explained
GPT-Realtime-2: Voice With a Brain
The flagship of the trio is GPT-Realtime-2, and the headline is straightforward: this is OpenAI's first voice model built on GPT-5-class reasoning. Previous voice models could respond; this one can think.
The practical difference shows up in benchmarks. GPT-Realtime-2 scored 15.2% higher on Big Bench Audio than its predecessor, GPT-Realtime-1.5. But the architectural changes are just as notable:
- Context window expanded from 32K to 128K tokens, enough to hold a long, complex conversation with full recall
- Adjustable reasoning levels, from minimal (low latency, simple tasks) to extra-high (complex multi-step requests)
- Designed to "carry the conversation forward naturally": not just responding, but anticipating what comes next and recovering when the conversation goes off track
The adjustable reasoning dial is particularly significant. It lets developers tune the model for their use case: a fast-response customer service bot doesn't need the same reasoning depth as a medical intake assistant. One model, configurable for both.
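As a rough illustration of that tuning decision, the sketch below maps an application profile to a reasoning level. Only the two endpoint levels, minimal and extra-high, come from the announcement. The `AppProfile` type, the helper functions, the `"reasoning_level"` key, and the model string are hypothetical names chosen for illustration, not the Realtime API's actual schema.

```python
from dataclasses import dataclass

# Only these two endpoint levels are named in the announcement; any
# intermediate levels are left out here rather than guessed.
MINIMAL = "minimal"
EXTRA_HIGH = "extra-high"


@dataclass(frozen=True)
class AppProfile:
    """Hypothetical description of a voice application's needs."""
    name: str
    multi_step_requests: bool  # True if requests need several reasoning steps


def choose_reasoning_level(profile: AppProfile) -> str:
    """Prefer depth when the task needs it, otherwise prefer low latency."""
    return EXTRA_HIGH if profile.multi_step_requests else MINIMAL


def build_session_config(profile: AppProfile) -> dict:
    """Assemble an illustrative config dict (key names are assumptions)."""
    return {
        "model": "gpt-realtime-2",  # assumed identifier, not confirmed
        "reasoning_level": choose_reasoning_level(profile),
    }


support_bot = AppProfile("customer-service", multi_step_requests=False)
intake = AppProfile("medical-intake", multi_step_requests=True)

print(build_session_config(support_bot)["reasoning_level"])  # minimal
print(build_session_config(intake)["reasoning_level"])       # extra-high
```

Keeping the choice in one small function means the same application code can serve both profiles, which mirrors the "one model, configurable for both" point.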
GPT-Realtime-Translate: Speaking Across Languages, Live
The second model tackles one of the most requested enterprise use cases in voice AI: real-time multilingual translation.
GPT-Realtime-Translate supports 70+ input languages translated into 13 output languages, all while keeping pace with the speaker. There is no waiting for the sentence to finish: it translates as you talk.
OpenAI cited Deutsche Telekom as an early partner, building customer support experiences where callers speak in their native language and agents hear a translated response in real time. The implications extend well beyond telecom: healthcare, travel, legal services, and any customer-facing industry dealing with a multilingual user base all stand to benefit immediately.
This is the kind of capability that previously required purpose-built translation infrastructure bolted onto a voice stack. It's now a single API call.
GPT-Realtime-Whisper: Transcription That Doesn't Wait
The third model is the most focused: streaming speech-to-text that works as you speak.
GPT-Realtime-Whisper doesn't wait for a pause or a sentence break. It transcribes live, turning spoken words into text in real time. For developers building voice agents, meeting tools, medical dictation software, or any product where latency in transcription creates friction, this closes a meaningful gap.
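To make the contrast with pause-based transcription concrete, the sketch below simulates a client that updates its transcript as each partial piece of text arrives, rather than once per utterance. The event stream is simulated locally, and `TranscriptDelta`, `fake_stream`, and `live_transcript` are hypothetical names. The announcement does not describe the real event format of the Realtime API, so none is reproduced here.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class TranscriptDelta:
    """Hypothetical incremental transcription event."""
    text: str


def fake_stream() -> Iterator[TranscriptDelta]:
    """Stand-in for a live audio session: yields text as it is 'spoken'."""
    for piece in ["Patient ", "reports ", "mild ", "headache."]:
        yield TranscriptDelta(piece)


def live_transcript(deltas: Iterable[TranscriptDelta]) -> Iterator[str]:
    """Yield the running transcript after every delta, not only at the end."""
    so_far = ""
    for delta in deltas:
        so_far += delta.text
        yield so_far


# Each snapshot could be rendered immediately, e.g. in a dictation UI.
for snapshot in live_transcript(fake_stream()):
    print(snapshot)
```

A dictation or meeting tool built this way can show text while the speaker is still mid-sentence, which is the latency gap the model is meant to close.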
What Developers Are Actually Building
OpenAI framed the release around three emerging patterns in voice AI:
Voice-to-action: describe what you need, and the system does it. No forms, no menus.
Systems-to-voice: software speaks to you proactively with relevant context. Imagine a travel app that says: "Your inbound flight is delayed, but you can still make your connection. I found the new gate and your bag is still expected to transfer." No screen required.
Voice-to-voice: live conversations that cross languages, tasks, and shifting context without breaking.
Priceline is already working on a full trip management experience built entirely on voice: searching for flights, adjusting hotel bookings after delays, getting real-time TSA wait time updates, all conversationally, through these models.
Pricing
All three models are available now through the Realtime API. Pricing breaks down as follows:
| Model | Pricing |
|---|---|
| GPT-Realtime-2 | $32 / million audio input tokens · $64 / million audio output tokens |
| GPT-Realtime-Translate | $0.034 / minute |
| GPT-Realtime-Whisper | $0.017 / minute |
The per-minute pricing for Translate and Whisper keeps cost predictable for high-volume use cases, a deliberate choice for enterprise buyers who need to forecast spend.
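To show how those rates translate into a forecast, here is a minimal cost estimator. The rates are the ones quoted in the pricing table; the function names and the example volumes are illustrative assumptions, and the sketch ignores any charges the announcement does not mention.

```python
from decimal import Decimal

# Rates quoted in the pricing table (USD).
REALTIME_2_INPUT_PER_MILLION = Decimal("32")
REALTIME_2_OUTPUT_PER_MILLION = Decimal("64")
TRANSLATE_PER_MINUTE = Decimal("0.034")
WHISPER_PER_MINUTE = Decimal("0.017")

MILLION = Decimal(1_000_000)
CENTS = Decimal("0.01")


def realtime_2_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Token-based cost for GPT-Realtime-2 audio, rounded to cents."""
    total = (
        Decimal(input_tokens) * REALTIME_2_INPUT_PER_MILLION
        + Decimal(output_tokens) * REALTIME_2_OUTPUT_PER_MILLION
    ) / MILLION
    return total.quantize(CENTS)


def per_minute_cost(minutes: int, rate_per_minute: Decimal) -> Decimal:
    """Duration-based cost for Translate or Whisper, rounded to cents."""
    return (Decimal(minutes) * rate_per_minute).quantize(CENTS)


# Example volumes (made up for illustration).
print(per_minute_cost(10_000, TRANSLATE_PER_MINUTE))  # 340.00
print(per_minute_cost(10_000, WHISPER_PER_MINUTE))    # 170.00
print(realtime_2_cost(1_000_000, 500_000))            # 64.00
```

The per-minute models need only a call-duration forecast, whereas the GPT-Realtime-2 estimate depends on token counts, which vary with how much each side talks.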
Safety and Compliance
OpenAI noted that the API includes active classifiers that can halt conversations violating content guidelines, and the service supports EU data residency requirements, a prerequisite for any meaningful European enterprise deployment.
Why This Matters Now
The timing isn't coincidental. Voice is becoming the dominant interface layer for ambient computing: smart glasses, in-car assistants, and screenless devices. OpenAI's own hardware ambitions (reportedly audio-first, in development with former Apple design chief Jony Ive) require a voice stack that can actually handle the complexity of real-world conversations.
Today's release is that stack.
GPT-Realtime-2 alone shifts what's possible: a voice agent that can reason through hard requests, maintain context across a long interaction, and adjust its intelligence level based on what the task demands. Paired with live translation and real-time transcription, the infrastructure for a genuinely useful voice-first product is now available through three API calls.
For developers who have been waiting for voice AI to catch up to text, it just did.
All three models are available today through the OpenAI Realtime API.
Reference:
https://openai.com/index/advancing-voice-intelligence-with-new-models-in-the-api/