See Nemo-RT Pro in action
Watch our bilingual voice AI handle a full booking flow end-to-end in 2:30 — same agent, same session, two languages, with full context retention.
Spanish + English in the same agent · Sub-second time-to-first-audio · Running on NVIDIA DGX Spark
Why Nemo-RT Pro
Voice AI that works like software, not metered water. Benchmarked and tuned for production workloads — comparable to OpenAI Realtime, Gemini Live, Vapi, and Retell AI.
Zero per-minute fees
One deployment, one monthly support fee. 5,000 minutes or 500,000 — your cost doesn't change.
Full data residency
Runs on your NVIDIA hardware in your datacenter. Patient PHI and customer data never leave it for a US-hosted cloud.
Sub-second latency
Sub-second TTFA at single-user benchmarks. Comparable to the best cloud voice APIs — without the cost curve.
Multi-tenant by default
One NVIDIA DGX serves multiple clients with isolated prompts, RAG, and MCP tools per tenant.
How it works
SIP call in. Voice AI agent that actually does things out. All on your hardware.
SIP / Asterisk
Your Asterisk PBX (or compatible SIP trunk) bridges the call via ARI WebSocket to Nemo-RT Pro.
VAD + STT
Silero VAD + NeMo Conformer CTC bilingual EN-ES. Acoustic and lexical language detection per turn.
LLM + MCP
Qwen3.6-35B-A3B-FP8 on vLLM with full MCP tool-calling framework and per-tenant RAG.
TTS back out
NeMo FastPitch + HiFiGAN streaming audio back to the caller. 174 Latin Spanish voices plus English.
No data leaves your datacenter. No cloud API fees. Full stack ownership — ASR, LLM, TTS, SIP integration, and MCP tool calling delivered as one coherent product.
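The stages above can be pictured as a simple chain that each caller turn passes through. The following is a minimal sketch only: the `Turn` type, the stage function names, and the placeholder values are hypothetical and do not reflect the product's actual code or APIs. No real model is called.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Turn:
    """State carried through one caller turn (all fields are illustrative)."""
    audio_in: bytes
    language: str = ""
    transcript: str = ""
    reply_text: str = ""
    audio_out: bytes = b""
    trace: List[str] = field(default_factory=list)


Stage = Callable[[Turn], Turn]


def vad_stt(turn: Turn) -> Turn:
    # Placeholder for VAD + speech-to-text with per-turn language detection.
    turn.trace.append("vad_stt")
    turn.language = "es"
    turn.transcript = "<transcript>"
    return turn


def llm_mcp(turn: Turn) -> Turn:
    # Placeholder for the LLM step, including any MCP tool calls and RAG lookups.
    turn.trace.append("llm_mcp")
    turn.reply_text = "<reply>"
    return turn


def tts(turn: Turn) -> Turn:
    # Placeholder for text-to-speech synthesis streamed back to the caller.
    turn.trace.append("tts")
    turn.audio_out = b"<audio>"
    return turn


# Stage order follows the "How it works" description: STT, then LLM, then TTS.
PIPELINE: List[Stage] = [vad_stt, llm_mcp, tts]


def handle_turn(audio_in: bytes) -> Turn:
    """Run one chunk of caller audio through every stage in order."""
    turn = Turn(audio_in=audio_in)
    for stage in PIPELINE:
        turn = stage(turn)
    return turn
```

Keeping each stage behind the same signature is one way to let a component be swapped or benchmarked on its own without touching the rest of the chain.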
Built for production
Benchmark-driven engineering. Every component chosen and tuned for real traffic.
TTFA at single-user benchmarks
Concurrent conversations tested, zero errors
End-users at 10:1 overbooking on one DGX
TTFA improvement via TTS batch sizing
Model stack
- LLM: Qwen3.6-35B-A3B-FP8 via vLLM (FP8 quantized · MoE 3B active per token)
- STT: NeMo Conformer CTC bilingual EN-ES
- TTS: NeMo FastPitch + HiFiGAN (174 Latin Spanish + 10 English voices)
- VAD: Silero with configurable thresholds
- MCP framework: 5 tool-call parsing formats, multi-turn sequences, RAG per tenant
Hardware and capacity
- Recommended hardware: NVIDIA DGX Spark (GB10 Blackwell, 128GB unified memory) — ideal for the 35B MoE model at FP8. ~$4,700 MSRP from any NVIDIA partner. Compatible: any NVIDIA Hopper (H100/H200) or Blackwell (B100, DGX Station) with FP8 support and ≥80GB VRAM.
- Capacity per DGX Spark: 20 concurrent live conversations · 10:1 overbooking ratio → ~200 tenants registered per box (standard voice trunking economics — register your full tenant base, real-time capacity scales to peak hour).
- Scale-up path to thousands of customers: horizontal scaling via DGX Spark clusters (10-20 nodes for ~200-400 concurrent calls / 2,000-4,000 tenants) — or vertical scaling via datacenter-grade hardware: NVIDIA H100 (80GB), H200 (141GB), RTX PRO 6000 Blackwell (96GB), or DGX B200 cluster for enterprise multi-region deployments serving 5,000+ concurrent conversations. Cluster management (load balancing, failover, auto-scaling, zero-downtime updates) included in Enterprise tier.
- Hardware sourcing: BYO from any NVIDIA reseller, or we connect you with a validated distributor in your country as a free service with no markup from us (local currency · local warranty · local logistics). We are building a vetted reseller network across LATAM and US Hispanic markets.
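The capacity figures above follow from simple overbooking arithmetic. As a rough sketch (the function names are illustrative, and the defaults of 20 concurrent conversations per node and a 10:1 ratio are taken from the figures quoted in this section):

```python
from typing import Tuple


def registered_capacity(concurrent_calls: int, overbooking_ratio: int) -> int:
    """Registrations a box can carry if only 1 in `overbooking_ratio` is live at peak."""
    return concurrent_calls * overbooking_ratio


def cluster_capacity(
    nodes: int,
    concurrent_per_node: int = 20,
    overbooking_ratio: int = 10,
) -> Tuple[int, int]:
    """Return (concurrent calls, registered capacity) for a cluster of identical nodes."""
    concurrent = nodes * concurrent_per_node
    return concurrent, registered_capacity(concurrent, overbooking_ratio)


single_box = registered_capacity(20, 10)  # one DGX Spark
small_cluster = cluster_capacity(10)      # 10-node cluster
large_cluster = cluster_capacity(20)      # 20-node cluster
```

This assumes capacity scales linearly with node count; real clusters would need headroom for failover and uneven load.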
Nemo-RT Pro vs cloud voice AI
Same conversational quality. Better economics from day one, plus compliance and data sovereignty you can't get from cloud-only vendors.
| Dimension | Nemo-RT Pro | OpenAI Realtime | Vapi | Retell AI |
|---|---|---|---|---|
| Deployment | On-prem, your hardware | Cloud only | Cloud only | Cloud only |
| Pricing model | One-time + optional Support | ~$0.30/min | $0.20 - $0.33/min | ~$0.11/min |
| Data residency | Your datacenter | US-hosted | US-hosted | US-hosted |
| Multi-tenant native | Yes (per-tenant RAG + MCP) | No (build yourself) | Limited | Limited |
| LATAM Spanish quality | 174 Latin voices | Generic multilingual | Castilian default | Castilian default |
| Year 1 at 10K min/mo | $5K Starter (DIY) | ~$36K | $24K - $40K | ~$13K |
Nemo-RT Pro is cheaper than every major cloud provider from as little as 5,000 min/mo (see the savings table below). Plus: your data never leaves your infra, you own your model, and there's zero risk of per-minute price hikes from a third party.
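The "Year 1 at 10K min/mo" row for the cloud vendors is the per-minute rate multiplied by 10,000 minutes and 12 months. A small sketch of that arithmetic, using the rates from the table expressed in whole cents to avoid floating-point rounding (the names are illustrative):

```python
from typing import Dict, Tuple

MINUTES_PER_MONTH = 10_000
MONTHS = 12

# Per-minute prices from the comparison table, in US cents (low, high).
RATES_CENTS: Dict[str, Tuple[int, int]] = {
    "OpenAI Realtime": (30, 30),
    "Vapi": (20, 33),
    "Retell AI": (11, 11),
}


def year_one_cost_usd(cents_per_minute: int, minutes_per_month: int = MINUTES_PER_MONTH) -> int:
    """Metered spend over 12 months, in whole dollars."""
    return cents_per_minute * minutes_per_month * MONTHS // 100


def year_one_range(vendor: str) -> Tuple[int, int]:
    """Low and high year-one cost for a vendor at the assumed monthly volume."""
    low, high = RATES_CENTS[vendor]
    return year_one_cost_usd(low), year_one_cost_usd(high)
```

For example, `year_one_range("Vapi")` gives `(24000, 39600)`, which the table rounds to $24K - $40K.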
Production at scale since Q1 2026 — multi-tenant voice AI in LATAM
Deployed in LATAM serving 5 healthcare organizations as tenants — 24/7 automated appointment booking, triage, and patient surveys in native LATAM Spanish, all on a single NVIDIA DGX.
- 5 clinic tenants on a single NVIDIA DGX, multi-tenant from day one
- 200+ end-users served with 10:1 overbooking, zero errors in stress tests
- 3 productized MCPs: scheduling (mcp-citas), human escalation (mcp-transferencias), voice surveys (mcp-encuestas) — plus custom integrations (civil registry, national ID lookup)
- Per-tenant RAG documents, system prompts, and API keys
- Zero cloud AI per-minute fees — eliminated the entire variable cost line item (no Vapi / Retell / Twilio / OpenAI bills)
- 100% data residency — patient PHI never leaves their datacenter
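Per-tenant isolation of prompts, RAG documents, API keys, and MCP tools can be pictured as one configuration record per tenant, looked up by the caller's API key. The sketch below is an assumption about how such a registry might look, not the product's actual schema: `TenantConfig`, `TenantRegistry`, the field names, and the example values are all hypothetical. Only the MCP names come from the list above.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class TenantConfig:
    """Everything that must stay isolated per tenant (hypothetical fields)."""
    tenant_id: str
    system_prompt: str
    rag_collection: str          # name of this tenant's document collection
    api_key: str
    mcp_tools: Tuple[str, ...] = ()


class TenantRegistry:
    """In-memory lookup from API key to tenant configuration."""

    def __init__(self) -> None:
        self._by_key: Dict[str, TenantConfig] = {}

    def register(self, config: TenantConfig) -> None:
        # Refuse to share one API key between tenants.
        if config.api_key in self._by_key:
            raise ValueError("API key already registered")
        self._by_key[config.api_key] = config

    def resolve(self, api_key: str) -> TenantConfig:
        # Raises KeyError for unknown keys, so a call never falls back
        # to another tenant's prompt or documents.
        return self._by_key[api_key]


registry = TenantRegistry()
registry.register(TenantConfig(
    tenant_id="clinic-a",
    system_prompt="<clinic A prompt>",
    rag_collection="clinic-a-docs",
    api_key="key-a",
    mcp_tools=("mcp-citas", "mcp-transferencias"),
))
registry.register(TenantConfig(
    tenant_id="clinic-b",
    system_prompt="<clinic B prompt>",
    rag_collection="clinic-b-docs",
    api_key="key-b",
    mcp_tools=("mcp-citas", "mcp-encuestas"),
))
```

Making the record immutable (`frozen=True`) is a design choice in this sketch so that one tenant's settings cannot be edited in place while a call is in flight.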
"We evaluated every major voice AI platform and none of them fit: on-premise deployment, native LATAM Spanish, multi-tenant for our clinic customers, and no per-minute fees. Nemo-RT Pro was the only option that checked all those boxes."
Capabilities running in production today
Built for these operators
Where the economics and compliance story of on-prem voice AI lands hardest.
Medical appointment booking
Clinics and health networks automating 24/7 scheduling in native LATAM Spanish with RAG and scheduling MCPs.
Call center automation
Contact centers deflecting routine calls from human agents, with call-transfer MCPs to escalate when needed.
AI-powered SIP trunks
SIP operators reselling multi-tenant voice AI to their client base without per-minute margin pressure.
Government voice services
Municipalities and public entities needing Spanish-native voice access with strict data residency requirements.
Here's what you'd save vs cloud.
Pay once, own the deployment, skip per-minute fees forever. Numbers below assume you run it yourself (DIY) — optional managed Support extends the comparison.
| Your monthly voice volume | Cloud cost (3 years): Vapi/Retell + LLM, ~$0.14/min all-in | Nemo-RT Pro one-time, you run it (DIY) | You save (3 yr) | Pays for itself |
|---|---|---|---|---|
| 5,000 min/mo (small / SMB) | $25,200 | Starter $5,000 | $20,200 | Month 8 |
| 25,000 min/mo (mid-market) | $126,000 | Professional $9,000 | $117,000 | Month 3 |
| 50,000+ min/mo (enterprise scale) | $252,000+ | Enterprise from $18,000 | $234,000+ | Month 2 |
Numbers assume you run the deployment yourself (DIY). Optional managed Support ($500 / $1,000 / $2,000 per month) covers model updates, SLA-backed engineer time, and capacity guidance — useful at higher volumes or for teams without dedicated ops.
Plus: you own your model, your data never leaves your infra, and there's zero risk of per-minute price hikes from a third party.
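The savings table can be reproduced with a few lines of arithmetic. This sketch mirrors the table's DIY framing: it counts only the one-time fee against a flat ~$0.14/min cloud rate and leaves out hardware and optional Support. The function names are illustrative, and amounts are kept in whole cents to avoid rounding issues.

```python
CLOUD_CENTS_PER_MIN = 14  # ~$0.14/min all-in, as stated in the table header


def cloud_cost_usd(minutes_per_month: int, months: int = 36) -> int:
    """Metered cloud spend over the period, in whole dollars."""
    return minutes_per_month * months * CLOUD_CENTS_PER_MIN // 100


def savings_usd(one_time_usd: int, minutes_per_month: int, months: int = 36) -> int:
    """Cloud spend avoided minus the one-time fee."""
    return cloud_cost_usd(minutes_per_month, months) - one_time_usd


def break_even_month(one_time_usd: int, minutes_per_month: int) -> int:
    """First month in which cumulative cloud spend reaches the one-time fee."""
    monthly_cents = minutes_per_month * CLOUD_CENTS_PER_MIN
    # Ceiling division on integers: -(-a // b)
    return -((-one_time_usd * 100) // monthly_cents)


starter = (cloud_cost_usd(5_000), savings_usd(5_000, 5_000), break_even_month(5_000, 5_000))
professional = (cloud_cost_usd(25_000), savings_usd(9_000, 25_000), break_even_month(9_000, 25_000))
```

Here `starter` evaluates to `(25200, 20200, 8)` and `professional` to `(126000, 117000, 3)`, matching the first two rows. The 50,000+ row is open-ended, so its break-even month depends on the actual volume and the quoted Enterprise price.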
Transparent pricing. No per-minute fees.
On-prem voice AI deployed in 10-30 days. One-time delivery + optional monthly Support. The platform runs with unlimited calls, users, minutes, and tenants in every tier; tiers differ in delivery scope, not platform features.
Starter
One use-case live in 10 days. Fast start to production.
Delivery scope (10 days):
- 1 use-case deployed end-to-end (FAQ, booking, triage — your choice)
- 3 standard MCPs: call transfers, scheduling, surveys
- SIP integration with your PBX (Asterisk/FreePBX/3CX/Avaya)
- Qwen 35B FP8 model on-prem (Spanish-native + English)
- Code + weights + docs delivered (you own it)
- 30 days Slack support post-launch
- Runbook + handoff call (1h)
Professional
Production multi-use-case with your brand identity. CRM/ERP integrated.
Everything in Starter, plus (12 days):
- 2 use-cases deployed (vs 1 in Starter)
- 4 MCPs total: 3 standard + 1 custom integration to your CRM/ERP/Helpdesk
- Voice cloning for your brand identity
- 45 days Slack support (vs 30)
- 2h training session for your team
Enterprise
Multi-department, multi-tenant deployments with SLA, dedicated success manager, cluster management.
Everything in Professional, plus (20-30 days):
- 3+ use-cases deployed (custom scope)
- Unlimited custom MCPs (any API in your stack)
- Multi-tenant setup: departments / brands isolated on same hardware
- Cluster management: multi-GPU/multi-node, load balancing, failover, auto-scaling, zero-downtime model updates
- Multi-language deployment (ES + EN + 1-3 more)
- Multi-voice cloning per brand or department
- 75 days white-glove support
- Dedicated success manager (weekly calls, 1 person accountable)
- SLA contract: uptime + response time guarantees
- Compliance documentation: HIPAA-friendly · GDPR audit trail · LGPD
Hardware: Customer-provided NVIDIA GPU recommended (DGX Spark ~$4,700 MSRP available as bundle). See Hardware and capacity section above for specs and procurement options.
Built by certified specialists.
INFINITO CLOUD ships voice and telecom systems to operators, clinics, and SaaS platforms — multi-tenant, on-prem, and bilingual.
US LLC
Incorporated 2024
200+
End-users served in production today
5
Healthcare tenants on a single NVIDIA DGX
Active member · Approved May 2026
AI Infrastructure & Operations
Generative AI Essentials
Speech API certified
Multicloud Network Associate
Frequently asked questions
Ready to see if Nemo-RT Pro fits your operation?
30-minute discovery call. No slides. We look at your volume, your compliance needs, your hardware, and tell you honestly whether this is a fit — or what would be better.