In 2025, OpenAI posted $13B in revenue — and burned $22B doing it. A $9B annual loss. And $8.4B of that spend went to inference alone — more than double 2024.
That math reveals an uncomfortable truth: every dollar of cloud AI inference you buy today is subsidized by venture capital. The price tag on your Twilio + OpenAI voice AI stack isn't the real price. It's a market-share land grab. And when the music stops, your AI bill is going to be 3-5x what it is now.
INFINITO CLOUD started building voice AI on-prem for the day after the subsidy runs out.
The math nobody at the cloud companies wants you to do
Let's run the numbers for a real voice AI deployment.
A mid-sized telecom operator running conversational voice agents at scale processes maybe 50,000 minutes/month per tenant. Today's pricing for the "easy" stack — Twilio Media Streams + OpenAI Realtime — runs $0.018-$0.06/min effective once you bundle voice transport, STT, LLM tokens, TTS, and connection overhead.
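Now multiply by 50 tenants. That's 30 million minutes a year, or roughly $540K to $1.8M/year across that price range, and about $1.2M/year at $0.04/min. Just for inference. Just for one product line.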
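The arithmetic behind that figure can be sketched in a few lines. The function name is illustrative, and the $0.04/min midpoint is an assumption (it is the rate at which the total lands on $1.2M); the volume, tenant count, and price range come from the text above.

```python
def annual_inference_cost(minutes_per_month, tenants, price_per_minute):
    """Yearly per-minute spend across all tenants, in dollars."""
    return minutes_per_month * tenants * 12 * price_per_minute

MINUTES_PER_MONTH = 50_000   # per tenant, from the text
TENANTS = 50                 # from the text

low = annual_inference_cost(MINUTES_PER_MONTH, TENANTS, 0.018)
high = annual_inference_cost(MINUTES_PER_MONTH, TENANTS, 0.06)
mid = annual_inference_cost(MINUTES_PER_MONTH, TENANTS, 0.04)  # assumed midpoint

print(f"low:  ${low:,.0f}")   # low:  $540,000
print(f"high: ${high:,.0f}")  # high: $1,800,000
print(f"mid:  ${mid:,.0f}")   # mid:  $1,200,000
```

Plug in your own minutes, tenant count, and blended rate to see where your deployment sits in that range.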
You don't even need to scale to enterprise to feel this. The data is brutal:
- Enterprise AI budgets exploded from $1.2M (2024) to $7M (2026) (FourWeekMBA)
- Inference is now 85% of the enterprise AI budget — not training, not seat licenses. Inference. (Constellation Research)
- A 5,000-employee mid-market company with 10 AI use cases is staring down $9-19M/year in total AI cost (FourWeekMBA)
"OpenAI is losing $1.35 for every $1 of inference revenue they collect."
And the kicker: these prices are artificially low (AI Automation Global). Google, Anthropic, and Meta are doing the same thing: pricing below cost to capture market share, funded by money that won't be there forever.
When the subsidy ends — and it will — every voice AI bill at scale is going to look very different.
Why we built on-prem when cloud was cheaper
I'll be honest about how I got here. It wasn't a Big Strategic Bet. We already had clients running serious voice/audio workloads. DGX Spark dropped at $4,699. We just… built it.
What we found surprised us. We could cut per-tenant cloud spend by ~$18K/year at production scale — and pay back the full deployment in month 8. Then pure margin every month after.
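A rough way to check a payback claim like this against your own numbers is sketched below. Every input is an assumption: the upfront cost adds the $4,699 hardware price to the $5,000 deployment fee quoted later in this article, the $18K/year is spread evenly over 12 months, and power, hosting, and staff time are left out. The month-8 figure above rests on our own inputs, which are not itemized here; these two simplified scenarios happen to bracket it.

```python
import math

def payback_month(upfront_cost, monthly_cloud_spend, monthly_onprem_cost=0.0):
    """First month in which cumulative savings cover the upfront cost.

    Returns None when on-prem running costs eat all of the savings.
    """
    net_monthly_saving = monthly_cloud_spend - monthly_onprem_cost
    if net_monthly_saving <= 0:
        return None
    return math.ceil(upfront_cost / net_monthly_saving)

# Assumed inputs: hardware plus deployment fee, and ~$18K/year of
# avoided cloud spend spread evenly over 12 months.
upfront = 4_699 + 5_000
cloud_per_month = 18_000 / 12

no_support = payback_month(upfront, cloud_per_month)
with_support = payback_month(upfront, cloud_per_month, monthly_onprem_cost=500)
print(no_support, with_support)  # 7 10
```

The spread between the two results shows how sensitive payback is to recurring costs, so it is worth running with your real figures.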
The full stack works in production. NVIDIA NeMo handles speech in and out. Qwen 35B runs the language model on vLLM. Sub-second latency. 184 voices, Spanish and English, native. Running on a single NVIDIA DGX Spark that costs $4,699 today (ToolHalla).
That $4,699 number matters. Five years ago, equivalent inference compute meant a $200K NVIDIA DGX A100 server, or nothing at all. The Blackwell generation isn't just an incremental upgrade. It's NVIDIA quietly democratizing inference hardware. RTX PRO 6000 Blackwell (96GB), DGX Spark (128GB unified memory), the DGX Station tier: all aimed at putting AI on your desk, not in someone else's cloud.
And my aha moment? It came when I actually wrote out the per-minute math at scale. For a single voice line: cheap. For 50,000 minutes/month: brutal. I've built my entire career around helping clients spend less on infrastructure. The voice AI cloud cost curve was a cliff hiding in plain sight.
What "the day after" looks like
Here's what I think is going to happen in the next 18-24 months:
- The big AI labs run out of subsidy runway. VC money is patient but not infinite. When a repricing at the next funding round forces honest unit economics, cloud AI prices rise.
- Enterprise CFOs notice. A 50% price increase on a $7M annual AI bill is $3.5M/year. That's a budget meeting, not a renegotiation.
- Hardware costs keep falling. DGX Spark is $4,699 today. The next Blackwell consumer-class part will be cheaper. Per-flop costs on owned hardware drop ~30%/year. Cloud subsidies don't.
- The break-even math flips. Already, at >80% sustained GPU utilization, on-prem wins on 3-year TCO (Spheron). When the subsidy ends, that threshold collapses to ~50% utilization. Suddenly on-prem makes sense for most mid-market voice AI workloads.
- The market splits. Cloud AI becomes the right answer for sporadic, low-utilization, R&D workloads. On-prem becomes the obvious answer for predictable production traffic. Voice AI for telecom operators, healthcare SaaS, BPOs, contact centers — those are predictable production workloads.
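Two of the numbers in that list can be reproduced with a simplified model. This is a sketch, not the methodology behind the cited TCO analysis: it assumes owned hardware costs the same every hour whether busy or idle, that cloud is billed only for hours used, and it uses normalised, hypothetical prices. Under those assumptions, a drop in the break-even threshold from 80% to 50% corresponds to cloud prices rising about 60%.

```python
def added_annual_cost(annual_bill, price_increase):
    """Extra yearly spend after a fractional price increase (0.5 = 50%)."""
    return annual_bill * price_increase

def breakeven_utilization(owned_cost_per_hour, cloud_price_per_hour):
    """Utilization above which owned hardware beats pay-per-use cloud.

    Simplified model: owned hardware costs the same every hour, busy or
    idle, while cloud is billed only for the hours actually used.
    """
    return owned_cost_per_hour / cloud_price_per_hour

extra = added_annual_cost(7_000_000, 0.5)
print(f"${extra:,.0f}")  # $3,500,000

# Hypothetical normalised prices: cloud at 1.00 per GPU-hour and owned
# hardware at 0.80 reproduce the 80% threshold cited in the list.
today = breakeven_utilization(0.80, 1.00)
after = breakeven_utilization(0.80, 1.60)  # assumed 60% cloud price rise
print(f"{today:.0%} -> {after:.0%}")  # 80% -> 50%
```

In this model the threshold scales inversely with the cloud price, which is why even a moderate repricing moves a lot of steady workloads across the line.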
This isn't a contrarian thesis. It's just the math, run forward 24 months.
What we're betting on
INFINITO CLOUD builds Nemo-RT Pro.
Voice AI runs on your NVIDIA hardware. Tenants stay isolated by default. Spanish and English work natively, not translated. The per-minute meter — gone.
Our explicit goal is to replace your Twilio + OpenAI Realtime + Vapi + Retell stack for production voice workloads. Not augment. Replace.
Pricing is brutally simple: $5,000 once to deploy, plus optional support at $500/month. You own the deployment forever. Pay once, scale tenants, never see a per-minute meter again.
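To compare flat pricing with a per-minute meter, it helps to convert it into an effective per-minute rate. The sketch below makes several assumptions that are not part of the pricing itself: a 36-month horizon, a single tenant at 50,000 minutes/month, hardware bought separately at $4,699, support taken every month, and no power, hosting, or staff costs.

```python
def effective_cost_per_minute(deploy_fee, hardware_cost, support_per_month,
                              minutes_per_month, months):
    """Total owned-stack spend divided by total minutes served."""
    total_cost = deploy_fee + hardware_cost + support_per_month * months
    total_minutes = minutes_per_month * months
    return total_cost / total_minutes

rate = effective_cost_per_minute(
    deploy_fee=5_000,          # from the pricing above
    hardware_cost=4_699,       # assumed separate DGX Spark purchase
    support_per_month=500,     # optional support, assumed taken
    minutes_per_month=50_000,  # assumed single-tenant volume
    months=36,                 # assumed horizon
)
print(f"${rate:.4f}/min")  # $0.0154/min
```

Because the cost is fixed, the effective rate falls as volume or the time horizon grows, which is the opposite of how a per-minute meter behaves.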
We're already in production. 200+ concurrent end-users on a single NVIDIA DGX. Bilingual Spanish/English workflows. No per-minute meter.
We're actively working with telecom operators, healthcare SaaS platforms, and BPOs across LATAM, Spain, and US Hispanic markets. We're an NVIDIA Inception portfolio member and part of Microsoft for Startups Founders Hub.
Are we early? Yes. Are we right? The math says yes. Are we positioned for what comes next? That's the bet.
Two ways to engage
🟢 OSS Community v2 — pre-release on github.com/infinitocloud/nemo-rt-community. Single-tenant version of the stack, Apache 2.0 license, for self-hosters and SIP integrators. ⭐ Star the repo to get notified the moment the code drops (W26 2026).
🟡 Discovery call (20 min). If you're running voice AI in production at >1,000 min/month and the cost curve worries you, let's run the TCO math against your actual numbers. No sales pitch. We tell you honestly if Nemo-RT Pro saves you money or not. → Book a slot
We're not building on the assumption that cloud AI stays cheap. We're building for the day after.
Yan Frank builds voice AI that runs in your own datacenter. Founder of INFINITO CLOUD LLC. Built Nemo-RT Pro. Spent the last 10 years writing telephony infrastructure (Asterisk, SIP, voice). NVIDIA Inception portfolio member. infinitocloud.com