
OpenAI released a new generation of voice models in its API on Wednesday, giving developers tools to build apps that can reason through spoken requests, translate across more than 70 languages, and transcribe speech as it happens.
The three models are named GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. They move AI voice interfaces beyond simple Q&A exchanges into territory where an AI agent can listen, think, and act mid-conversation.
GPT-Realtime-2 brings sharper reasoning to voice
GPT-Realtime-2 is the flagship. OpenAI says it offers GPT-5-class reasoning, a significant step up from its predecessor, GPT-Realtime-1.5.
The model scored 15.2% higher on Big Bench Audio, a benchmark for audio intelligence, and 13.8% higher on Audio MultiChallenge, which tests instruction following in multi-turn spoken dialogue.
The practical upgrades target developers building production voice agents. The model now supports a 128K context window, quadrupled from the previous 32K limit, and offers five tiers of adjustable reasoning effort from “minimal” to “xhigh.”
It can call multiple tools simultaneously, recover from errors with spoken acknowledgments, and produce short bridging phrases like “let me check that” while processing a request.
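Put together, those capabilities would be configured when a session is opened. The sketch below is hypothetical: the field names (`reasoning_effort`, the `session.update` shape) and the two tool names are assumptions for illustration, not the confirmed Realtime API schema; only the five effort tiers come from the article.

```python
def build_session_config(effort: str = "minimal") -> dict:
    """Build a hypothetical session.update payload for GPT-Realtime-2."""
    tiers = ("minimal", "low", "medium", "high", "xhigh")  # five tiers per the article
    if effort not in tiers:
        raise ValueError(f"effort must be one of {tiers}")
    return {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-2",
            "reasoning_effort": effort,  # adjustable per request
            "tools": [
                # Multiple tools can be exposed so the model may call them
                # simultaneously; names here are invented examples.
                {"type": "function", "name": "search_listings",
                 "description": "Search real-estate listings"},
                {"type": "function", "name": "check_availability",
                 "description": "Check a listing's availability"},
            ],
        },
    }

config = build_session_config("xhigh")
print(config["session"]["reasoning_effort"])
```

The point of the sketch is that reasoning effort is a per-session dial, so a latency-sensitive agent can run at "minimal" while a complex multi-tool workflow opts into "xhigh".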
GPT-Realtime-Translate handles live speech translation. It accepts more than 70 input languages and outputs in 13, and is designed to keep pace with a speaker in real time.
GPT-Realtime-Whisper provides streaming speech-to-text (STT), transcribing words as they are spoken rather than waiting for a completed utterance.
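Streaming STT means a consumer accumulates partial results as they arrive instead of waiting for one final string. The sketch below simulates that pattern; the event names (`transcript.delta`, `transcript.done`) are assumptions for illustration, not the published GPT-Realtime-Whisper event schema.

```python
from typing import Iterable

def assemble_transcript(events: Iterable[dict]) -> str:
    """Accumulate partial transcript deltas as they stream in."""
    parts = []
    for event in events:
        if event["type"] == "transcript.delta":
            parts.append(event["text"])  # words arrive while speech continues
        elif event["type"] == "transcript.done":
            break  # utterance finished
    return "".join(parts)

# Simulated event stream, as a WebSocket consumer might receive it:
stream = [
    {"type": "transcript.delta", "text": "let me "},
    {"type": "transcript.delta", "text": "check that"},
    {"type": "transcript.done"},
]
print(assemble_transcript(stream))  # → "let me check that"
```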
Zillow, Deutsche Telekom test the models in production
Several companies got early access. Zillow is building a voice assistant that can process complex real estate queries, handle tool calls to search listings, and comply with Fair Housing regulations.
The company reported a 26-point improvement in call success rate on its hardest adversarial benchmark after prompt optimization with GPT-Realtime-2, reaching 95% compared to 69% previously.
Deutsche Telekom is testing real-time translation for customer support, allowing callers to speak in their preferred language while the model handles the conversion on both sides.
Priceline is exploring a voice-based travel assistant that could manage flight searches, hotel changes, and on-the-ground translation in a single session.
OpenAI pitched the models at companies looking to expand customer service capabilities, but also noted potential applications across education, media, events, and creator platforms.
OpenAI said it built content moderation into the new models, with triggers that can halt conversations detected as violating its harmful-content guidelines. The company framed the guardrails as protection against spam, fraud, and other forms of abuse.
On pricing, the Translate and Whisper models bill by the minute. GPT-Realtime-2 bills by token consumption. All three are available through OpenAI’s Realtime API, accessible via WebRTC, WebSocket, and SIP connection methods.
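For the WebSocket path, a client opens an authenticated connection and names the model in the query string. This minimal sketch builds those connection parameters without connecting; the endpoint path and `model` parameter follow the existing Realtime API convention, while the model name is taken from the article and may differ in practice.

```python
import os
from urllib.parse import urlencode

def realtime_ws_request(model: str, api_key: str) -> tuple[str, dict]:
    """Return the WebSocket URL and auth header for a Realtime API session."""
    url = "wss://api.openai.com/v1/realtime?" + urlencode({"model": model})
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers

url, headers = realtime_ws_request(
    "gpt-realtime-2",
    os.environ.get("OPENAI_API_KEY", "sk-placeholder"),
)
print(url)  # wss://api.openai.com/v1/realtime?model=gpt-realtime-2
```

From here a WebSocket client library would open the connection and exchange JSON events; WebRTC and SIP use the same session model over different transports.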
Source: https://www.cryptopolitan.com/openai-voice-models-reason-translate/