By Sagar Shankaran, Founder of CallSphere
Gemini 3.1 Ultra ships with a 2-million token context window and full text, image, audio, and video multimodality. What changes and how to build for it.
Key takeaways
At Cloud Next 2026, Google announced Gemini 3.1 Ultra: a 2-million token context window with full multimodality across text, image, audio, and video, plus Gemini 3.1 Flash-Lite for cheap high-volume workloads. This post is a builder-facing deep dive on what 2M context actually unlocks (and what it does not), how multimodality changes agent design, and how to think about cost and latency in the new model lineup. It also names where a focused voice/chat product like CallSphere fits — the front door does not need Ultra; the analyst workflow downstream might.
The numbers matter less than the shape of the lineup: an Ultra-class model for hard problems, a Pro-class model for everyday work, and a Flash-Lite-class model for the long tail. That tier shape is now common across frontier vendors.
Two million tokens is roughly 1.5 million English words, or 3,000–4,000 single-spaced pages, or a few hundred contracts, or a year's worth of board minutes. The practical implication is that for the first time, "just put everything in the context" is a real design choice for whole document corpora.
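The arithmetic behind those figures can be sketched in a few lines. This is a rough estimate, not a tokenizer: the 0.75 words-per-token ratio and the 500 words-per-page figure are assumptions chosen to match the numbers above, and the `corpus_fits` helper and its `reserve_tokens` parameter are hypothetical names for illustration.

```python
import math

CONTEXT_TOKENS = 2_000_000   # window size quoted in this post
WORDS_PER_TOKEN = 0.75       # assumption: rough average for English text
WORDS_PER_PAGE = 500         # assumption: one single-spaced page


def tokens_to_words(tokens: int) -> int:
    """Approximate English word count for a given token count."""
    return round(tokens * WORDS_PER_TOKEN)


def tokens_to_pages(tokens: int) -> int:
    """Approximate single-spaced page count for a given token count."""
    return round(tokens_to_words(tokens) / WORDS_PER_PAGE)


def corpus_fits(doc_word_counts: list[int], reserve_tokens: int = 50_000) -> bool:
    """Estimate whether a set of documents fits in the window.

    reserve_tokens is a hypothetical allowance for the prompt and the answer.
    """
    needed = sum(math.ceil(words / WORDS_PER_TOKEN) for words in doc_word_counts)
    return needed + reserve_tokens <= CONTEXT_TOKENS


print(tokens_to_words(CONTEXT_TOKENS))   # 1500000
print(tokens_to_pages(CONTEXT_TOKENS))   # 3000
print(corpus_fits([5_000] * 200))        # True: 200 contracts of 5,000 words
print(corpus_fits([5_000] * 300))        # False: 300 of them overflow the window
```

For a real corpus, a provider's token-counting endpoint would give a more reliable number than a words-per-token ratio, especially for non-English text or code.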
That sounds liberating. It mostly is. But it does not mean retrieval is dead. Three reasons:
The right pattern is to retrieve narrow context first and fall back to long context when retrieval fails. Neither "always retrieve" nor "always long context" is the answer.
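A minimal sketch of that retrieve-first, fall-back-to-long-context pattern follows. Nothing here is a real SDK call: `retrieve`, `ask_model`, `min_score`, and `top_k` are hypothetical names, and the convention that `ask_model` returns `None` when the supplied context is insufficient is an assumption made so the fallback has a trigger.

```python
from typing import Callable, Optional, Sequence

Hit = tuple[float, str]  # (relevance score, passage text)


def answer_hybrid(
    question: str,
    retrieve: Callable[[str], Sequence[Hit]],        # hypothetical retriever
    ask_model: Callable[[str, str], Optional[str]],  # hypothetical model call
    full_corpus: str,
    min_score: float = 0.5,
    top_k: int = 5,
) -> tuple[str, Optional[str]]:
    """Return (path_taken, answer), trying narrow retrieval before long context."""
    # Keep only passages the retriever is reasonably confident about.
    hits = [h for h in retrieve(question) if h[0] >= min_score]
    hits = sorted(hits, key=lambda h: h[0], reverse=True)[:top_k]

    if hits:
        narrow_context = "\n\n".join(text for _, text in hits)
        result = ask_model(question, narrow_context)
        if result is not None:
            return "retrieval", result

    # Retrieval found nothing usable, or the model could not answer from it:
    # pay for the long-context call.
    return "long_context", ask_model(question, full_corpus)
```

Returning the path alongside the answer makes it easy to log how often the expensive branch fires, which is the number that determines whether the hybrid design is paying for itself.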
A handful of genuinely new affordances:
These are real new capabilities. They also have a real new cost shape — see the cost section below.
Full multimodality means text, image, audio, and video are first-class inputs (and outputs, for the modalities Google supports). The non-obvious implication is for agents, not chatbots. A real-estate agent can take a photo from a prospect and reason about the property. An IT-helpdesk agent can read a screenshot of an error. A healthcare agent can read a wound photo (with appropriate clinical guardrails) and route appropriately. A field-service agent can watch a customer's 30-second video and decide what to send.
The right design pattern is to expose multimodality at the capture layer. Let the customer send the photo, the screenshot, the audio note. Then route to the right downstream agent.
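One way to picture routing at the capture layer is a small dispatcher keyed on the attachment's media type. This is a sketch under assumptions: the `Inbound` type, the agent names in `ROUTES`, and the choice to send unrecognized attachment types to the text agent are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Inbound:
    """A customer message as captured at the front door (hypothetical shape)."""
    text: str = ""
    mime_type: Optional[str] = None  # e.g. "image/png"; None for text-only


# Hypothetical downstream agents, one per modality.
ROUTES = {
    "image": "vision_agent",
    "audio": "audio_agent",
    "video": "video_agent",
    "text": "text_agent",
}


def modality(msg: Inbound) -> str:
    """Classify a message by the major part of its media type."""
    if msg.mime_type:
        major = msg.mime_type.split("/", 1)[0].lower()
        if major in ROUTES:
            return major
    # Assumption: text-only and unrecognized attachments go to the text agent.
    return "text"


def route(msg: Inbound) -> str:
    """Pick the downstream agent for a captured message."""
    return ROUTES[modality(msg)]


print(route(Inbound(mime_type="image/png")))   # vision_agent
print(route(Inbound(text="My router is down")))  # text_agent
```

The capture layer stays thin: it accepts whatever the customer sends and decides only where it goes, leaving the reasoning to the downstream agent.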
Honest framing:
A reasonable budget split: 70% Flash-Lite, 25% Pro, 5% Ultra. Tune from there.
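As a worked example, the split can be turned into dollar allocations per tier. The sketch reads the percentages as shares of spend (they could equally be read as shares of requests), and the per-million-token price used in the usage line is a made-up placeholder, not a published Gemini price.

```python
# Shares of monthly spend per tier, taken from the split above.
SPLIT = {"flash_lite": 0.70, "pro": 0.25, "ultra": 0.05}


def allocate(monthly_budget: float, split: dict[str, float] = SPLIT) -> dict[str, float]:
    """Divide a monthly budget (in dollars) across model tiers."""
    if abs(sum(split.values()) - 1.0) > 1e-9:
        raise ValueError("tier shares must sum to 1.0")
    return {tier: round(monthly_budget * share, 2) for tier, share in split.items()}


def affordable_tokens(dollars: float, price_per_million: float) -> int:
    """Tokens a dollar amount buys at a given price per 1M tokens."""
    return int(dollars / price_per_million * 1_000_000)


budget = allocate(1_000.0)
print(budget)  # {'flash_lite': 700.0, 'pro': 250.0, 'ultra': 50.0}

# Hypothetical price of $20 per 1M tokens, for illustration only.
print(affordable_tokens(budget["ultra"], price_per_million=20.0))  # 2500000
```

With real prices plugged in, the same two functions show quickly whether a 5% Ultra share covers the number of long-context calls a workload actually needs.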
A practical decision rule:
Most production agents end up hybrid.
Three pragmatic recommendations:
CallSphere is the AI voice and chat front door: voice/chat/SMS/WhatsApp, 57+ languages, six verticals (healthcare, real estate, sales, salon, IT helpdesk, after-hours), HIPAA-friendly, $149/$499/$1,499 per month, 3–5 day launch. CallSphere does not need Ultra for the call itself; tight-latency voice agents are best served by purpose-built voice models. But the downstream analyst workflow (the agent that reads the call transcript plus the customer's whole history and writes the case note) is exactly where Gemini 3.1 Ultra's 2M context shines. CallSphere captures; Ultra synthesizes.
If you already built on Gemini 2.x or Claude long-context:
A short list before you commit to Ultra for a workload:
If the answer to any of those is uncertain, start with Pro or hybrid.
Is 2M tokens always better? No. Long context is more expensive and slower than retrieval. Use it where it changes the answer.
Does Ultra replace retrieval-augmented generation? No. Most production agents will run hybrid — retrieve narrow, fall back to long context when needed.
Should I use Ultra for voice agents? Generally no. Voice agents need sub-300ms latency; purpose-built voice models win. Use Ultra for the analyst-style downstream reasoning the call feeds into.
Written by
Sagar Shankaran · Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.