# Ambient Advantage — May 13, 2026

*Wednesday · May 13, 2026 · [Episode page](https://podcast.ambient-advantage.ai/episodes/2026-05-13.html) · [Audio](https://storage.googleapis.com/ambient-advantage-podcast/2026-05-13-ambient-advantage.mp3)*

[AVA]
Mira Murati's new lab just shipped its first model and it responds as fast as a blink. Literally. Zero point four seconds. If that number holds up, the walkie-talkie era of voice AI is over.

[JON]
Oh, we are starting there. Welcome to Ambient Advantage — I'm Jon, and this is Ava. It's Wednesday, May 13, 2026, and here's what matters in AI today. We have a packed show. Mira Murati's Thinking Machines Lab drops its first real product. Cerebras is about to become the biggest IPO of the year. Google has been caught hiding seven models ahead of I/O next week. And Ben Thompson wrote something that should change how every CIO thinks about their inference bill. Ava, let's get into it.

[AVA]
Let's start with Thinking Machines Lab, because this one has been the most anticipated debut in AI since... honestly, since OpenAI itself. Mira Murati left OpenAI, founded this company in early 2025, raised two billion dollars in a seed round at a twelve billion dollar valuation, and then went quiet. Today we know what they were building.

[JON]
And it is not what most people expected. This is not another chatbot. This is not another reasoning model. What exactly did they ship?

[AVA]
They released a research preview of something called TML-Interaction-Small. It is a 276 billion parameter mixture-of-experts model, but only 12 billion parameters are active per token. And the key innovation is architectural. Instead of the traditional pattern where you talk, the AI waits for you to finish, then it thinks, then it responds... this model processes audio, video, and text in continuous 200 millisecond chunks. It is always listening, always watching, always ready.

[JON]
So what does that actually feel like in practice?

[AVA]
It means the turn-taking latency drops to 0.40 seconds. For context, OpenAI's GPT-Realtime-2 is at 1.18 seconds. Google's Gemini Live is at 0.57. This is not an incremental improvement, Jon. This is a different category of interaction. And the clever bit is the dual architecture. There is a fast interaction model that stays present in the conversation, and a background reasoning model that handles the heavy thinking asynchronously. So it can answer your follow-up question while simultaneously working on the complex analysis you asked for thirty seconds ago.
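
*[Show notes] For the curious, a minimal asyncio sketch of that split, assuming a simple queue handoff between a fast loop and a slow one. Every function and name below is our placeholder; Thinking Machines Lab has not published an API.*

```python
import asyncio

# Hypothetical sketch of the fast-interaction / background-reasoning split.
# Nothing below is Thinking Machines Lab's actual code or API.

def quick_answer(chunk: str) -> str:
    return f"instant reply to {chunk!r}"

async def slow_analysis(chunk: str) -> str:
    await asyncio.sleep(2.0)          # stand-in for heavyweight reasoning
    return f"deep analysis of {chunk!r}"

async def interaction_loop(chunks, handoff: asyncio.Queue):
    # Fast path: consume each ~200 ms chunk and respond immediately,
    # deferring anything heavy so the conversation never stalls.
    async for chunk in chunks:
        if chunk.startswith("analyze"):
            await handoff.put(chunk)  # hand off, stay conversational
            print("on it -- ask me something else in the meantime")
        else:
            print(quick_answer(chunk))
    await handoff.put(None)           # end-of-stream signal

async def reasoning_loop(handoff: asyncio.Queue):
    # Slow path: heavy thinking runs off the latency-critical path and
    # surfaces its answer whenever it is ready.
    while (chunk := await handoff.get()) is not None:
        print(await slow_analysis(chunk))

async def chunk_stream():
    for chunk in ["hello", "analyze last quarter", "what's for lunch"]:
        yield chunk
        await asyncio.sleep(0.2)      # the 200 ms chunk cadence

async def main():
    handoff = asyncio.Queue()
    await asyncio.gather(interaction_loop(chunk_stream(), handoff),
                         reasoning_loop(handoff))

asyncio.run(main())
```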

[JON]
That is genuinely new. Now I want to connect this to the business world because I know our listeners are thinking about voice agents, customer service, real-time monitoring. What changes if this latency number is real?

[AVA]
Everything about enterprise voice AI has been constrained by that one to two second delay. It makes conversations feel robotic. It makes real-time use cases like monitoring a factory floor or a surgical procedure basically impossible. At 0.4 seconds, you are inside the window of natural human conversation. That unlocks live translation where you can actually have a dinner conversation through an interpreter AI. It unlocks industrial safety monitoring where the model is watching a video feed and can interrupt the moment something goes wrong. And it unlocks customer service agents that don't make your callers want to throw their phone across the room.

[JON]
Big caveat though. These are self-reported benchmarks from a company that has every incentive to impress.

[AVA]
Absolutely. Research preview means exactly that. We need independent benchmarks, we need real-world testing under load, and we need to see pricing. But even if the numbers soften by twenty or thirty percent, this architecture — the continuous chunking, the split between interaction and reasoning — that is a design pattern the entire industry is going to copy. I will drop the link to their blog post in the show notes.

[JON]
Alright, let's move into the rundown. Ava, Cerebras is pricing tonight. What do people need to know?

[AVA]
Cerebras Systems raised its IPO price range to 150 to 160 dollars a share, up from the original 115 to 125. At the top of that range they would raise 4.8 billion dollars at a fully diluted valuation of nearly 49 billion. The IPO is reportedly 20 times oversubscribed. Trading starts tomorrow on Nasdaq under ticker CBRS. This is the largest IPO globally so far this year and the first major AI hardware pure-play to go public.
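
*[Show notes] Quick sanity math on those offering terms, assuming the top of the range; these are rounded, back-of-the-envelope figures, not filing-level numbers.*

```python
# Back-of-the-envelope check on the Cerebras offering terms above.
top_price = 160                          # top of the raised range, $/share
raise_usd = 4.8e9                        # reported max raise
fd_valuation = 49e9                      # "nearly 49 billion", rounded up

shares_offered = raise_usd / top_price   # ~30M shares in the offering
fd_shares = fd_valuation / top_price     # ~306M shares fully diluted

print(f"~{shares_offered / 1e6:.0f}M shares offered, "
      f"~{fd_shares / 1e6:.0f}M fully diluted; "
      f"the offering is ~{shares_offered / fd_shares:.0%} of the company")
```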

[JON]
And the financials are actually... good?

[AVA]
Shockingly good for a company at this stage. 510 million in revenue last year, up 76 percent year over year, with a 47 percent net margin. They have a 24.6 billion dollar revenue backlog. And the customer list now includes OpenAI and Amazon, which was not the case during their earlier failed IPO attempt. The market is saying very clearly: AI infrastructure is where we want to put money right now.
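
*[Show notes] The same arithmetic applied to the financials Ava quotes; a sketch from the rounded numbers in the episode, so expect the filing figures to differ slightly.*

```python
# Implied figures from the reported Cerebras financials.
revenue = 510e6                       # last year's revenue
growth = 0.76                         # 76% year over year
net_margin = 0.47
backlog = 24.6e9

prior_year = revenue / (1 + growth)   # ~$290M the year before
net_income = revenue * net_margin     # ~$240M in profit
backlog_multiple = backlog / revenue  # backlog ~48x current annual revenue

print(f"prior-year revenue ~${prior_year / 1e6:.0f}M, "
      f"net income ~${net_income / 1e6:.0f}M, "
      f"backlog ~{backlog_multiple:.0f}x annual revenue")
```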

[JON]
Next up, Google got caught with its hand in the cookie jar. What happened?

[AVA]
Reddit users and app researchers found a model card labeled Gemini Omni inside the Gemini app on Sunday. The description reads, and I am quoting directly, "Create with Gemini Omni, meet our new video model, remix your videos, edit directly in chat, try templates, and more." Separately, researchers found a hidden seven-model selector in the Gemini Live interface, including a thinking variant with enhanced reasoning capabilities. Two of the seven models are already at release candidate two stage.

[JON]
And Google I/O is next Monday, May 18th.

[AVA]
Exactly. So this is clearly staged for a big reveal. But the important thing for enterprise buyers is chat-based video editing. You describe the change you want in plain language and the model regenerates the footage. That is a direct threat to the entire post-production toolchain. And the thinking variant for Gemini Live suggests Google is about to combine real-time conversation with deep reasoning in one interface. Sound familiar? It is the same split architecture Thinking Machines Lab just announced, which tells you this is where the whole industry is converging.

[JON]
Now here is a fun one. DeepMind wants to turn your mouse cursor into an AI collaborator.

[AVA]
This is experimental research, no product announcement, no release date. But the concept is powerful. DeepMind published demos of an AI Pointer system powered by Gemini that treats your cursor position as a real-time contextual signal. Instead of describing what you are looking at so the AI can help you, the AI already sees what you are pointing at. It captures both visual and semantic context around the cursor.

[JON]
Why does that matter beyond being a cool demo?

[AVA]
Because the single biggest friction point in AI-assisted work today is context assembly. You spend more time explaining what you need help with than actually getting help. If cursor position becomes a first-class input alongside text and voice, that entire friction layer disappears. For enterprise software vendors, this is a signal to start rethinking UI patterns now. The era of the prompt box as the primary AI interface is ending. Ambient context is replacing it.
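
*[Show notes] A toy sketch of what "cursor position as a first-class input" could look like, assuming a payload that pairs pixel coordinates with the semantic element under the pointer. The structure is our invention; DeepMind has published demos, not an API.*

```python
from dataclasses import dataclass

# Hypothetical payload pairing visual and semantic cursor context.
@dataclass
class CursorContext:
    x: int
    y: int
    hovered_element: str    # semantic context: what the cursor is over
    screenshot_crop: bytes  # visual context: pixels around the pointer

def build_prompt(utterance: str, ctx: CursorContext) -> str:
    # The model already sees what you are pointing at, so the user
    # skips the "context assembly" step entirely.
    return (f"User said: {utterance!r}\n"
            f"Cursor at ({ctx.x}, {ctx.y}) over: {ctx.hovered_element}\n"
            f"[attached: {len(ctx.screenshot_crop)} bytes near cursor]")

ctx = CursorContext(412, 873, "chart: Q1 revenue by region", b"\x89PNG...")
print(build_prompt("why did this dip in March?", ctx))
```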

[JON]
Alright, two more quick ones. The Neuron newsletter ran a feature on enterprise AI costs quietly growing. Ava, what is driving that?

[AVA]
Three things. First, silent model upgrades — vendors shift users to more capable but more expensive models without explicit notification. Second, organic usage expansion as more teams discover and start using AI tools. Third, and this is the sneaky one, hidden agentic task costs. When an agent runs a loop of fifteen API calls to complete one task, your bill reflects all fifteen calls, not the one task. CIOs who locked in contracts at 2024 usage rates are seeing real overages in 2026. The fix is a usage governance layer. Track which teams run which models, flag agentic loops, and negotiate model-tier flexibility into new contracts.
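
*[Show notes] A minimal sketch of that usage governance layer: tag every call with team, model, and task, then flag tasks that fan out into agentic loops. The model names and per-token prices are made-up placeholders.*

```python
from collections import defaultdict

# Invented price book; real vendor pricing varies by model and tier.
PRICE_PER_1K_TOKENS = {"fast-premium": 0.010, "batch-cheap": 0.002}

calls = []  # in production this would be a metering/logging pipeline

def record_call(team: str, model: str, task_id: str, tokens: int):
    calls.append({"team": team, "model": model,
                  "task_id": task_id, "tokens": tokens})

def flag_agentic_loops(threshold: int = 10) -> dict:
    # One "task" fanning out into many calls is the hidden-cost pattern.
    per_task = defaultdict(int)
    for c in calls:
        per_task[c["task_id"]] += 1
    return {t: n for t, n in per_task.items() if n >= threshold}

# One agent task quietly making fifteen premium calls:
for _ in range(15):
    record_call("ops", "fast-premium", "ticket-123", tokens=800)
record_call("marketing", "batch-cheap", "summary-9", tokens=2_000)

print(flag_agentic_loops())  # {'ticket-123': 15}
bill = sum(c["tokens"] / 1000 * PRICE_PER_1K_TOKENS[c["model"]] for c in calls)
print(f"total: ${bill:.3f}")  # the bill reflects all 15 calls, not 1 task
```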

[JON]
And there was a great piece in Ben's Bites about agentic coding being a trap. What is the takeaway there?

[AVA]
The argument is that teams are outsourcing understanding to AI coding agents — tools like Copilot, Claude Code, Cursor — and in the process, they stop comprehending the systems they are building. The framing is "learn the system." And this applies far beyond developers. For any workflow you are handing to an agent, if you do not deeply understand the process before you automate it, you inherit an opaque system you cannot maintain, audit, or fix when it breaks. Automation without understanding is just technical debt with a chatbot on top.

[JON]
Alright, let's step back. The bigger picture. Ava, what connects all of today's stories?

[AVA]
There is a thread running through every single story today, and Ben Thompson named it perfectly in his Stratechery piece this week. He calls it "the inference shift." His argument is that agentic inference is fundamentally different from interactive inference. When a human is in the loop, latency is everything — which is exactly what Thinking Machines Lab optimized for. But when agents are running autonomously in the background, latency is irrelevant. What matters is throughput and cost per token. And that is exactly the market Cerebras is going public to serve.

[JON]
So we are heading toward a world where different AI workloads run on fundamentally different infrastructure.

[AVA]
Exactly. Heterogeneous compute. Your real-time voice agent runs on one architecture optimized for sub-second latency. Your batch summarization pipeline runs on a completely different architecture optimized for throughput. Your background coding agent runs on something else entirely. And here is the actionable insight for every enterprise leader listening: you are probably paying premium low-latency prices for workloads where latency does not matter. Nightly data pipelines, background agents, batch processing — none of that needs the fastest inference. As companies like Cerebras scale and alternative inference providers multiply, there is going to be meaningful cost arbitrage. But you can only capture it if you have mapped your workloads to their actual latency requirements. Most companies have not done that work yet.
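
*[Show notes] The audit itself is simple enough to sketch: classify each workload as human-facing or background, then price the background ones at a throughput tier. Every workload, spend figure, and the 50 percent discount below are invented for illustration.*

```python
# Hypothetical inference-workload audit.
workloads = [
    # (name, human_facing, monthly premium-tier spend in $)
    ("support voice agent",     True,  40_000),
    ("nightly data pipeline",   False, 25_000),
    ("background coding agent", False, 15_000),
    ("exec chat assistant",     True,  10_000),
]
BATCH_DISCOUNT = 0.5  # assume a throughput tier at half the premium price

misplaced = [(name, spend) for name, human, spend in workloads if not human]
savings = sum(spend for _, spend in misplaced) * BATCH_DISCOUNT
total = sum(spend for *_, spend in workloads)

for name, spend in misplaced:
    print(f"move '{name}' off the low-latency tier (${spend:,}/mo)")
print(f"estimated savings: ${savings:,.0f}/mo, "
      f"{savings / total:.0%} of current inference spend")
```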

[JON]
So the homework is... audit your inference workloads.

[AVA]
That is the homework. Know which of your AI workloads are human-facing and latency-sensitive, and which are background tasks where you are paying for speed you do not need. That single exercise could cut your inference bill by thirty to fifty percent over the next twelve months as the market diversifies.

[JON]
What should people be watching this week?

[AVA]
Two things. First, Cerebras starts trading tomorrow, May 14th. Watch the opening price action, but more importantly watch the analyst coverage that follows — it will tell you how Wall Street is modeling the inference compute market. Second, Google I/O kicks off Monday, May 18th. Given what has already leaked — Gemini Omni, the seven hidden Live models, the thinking variant — this could be the most consequential I/O in years. If Google demonstrates real-time video editing and switchable reasoning modes in a live demo, the competitive landscape for enterprise AI shifts overnight.

[JON]
It is going to be a big week. Ava, take us home.

[AVA]
That is your Ambient Advantage for Wednesday, May 13, 2026.

[JON]
Share it with a colleague figuring out what AI means for their business. See you tomorrow.
