TL;DR: I ran head-to-head benchmarks comparing local inference on an RTX 6000 Blackwell Pro Q-Max against OpenRouter’s cloud APIs. For single queries, OpenRouter is often cheaper—Gemma 3 27B costs just $0.06/M tokens in the cloud vs $1.66/M locally. But batched inference changes everything: vLLM with 4 concurrent requests drops local costs to $0.21/M tokens, saving 59% vs OpenRouter. The real wins for local are latency (88ms vs 760ms TTFT for Gemma) and privacy. At ~$8,500 for the GPU, break-even requires 28 billion tokens of batched throughput—about 5 years at 8 hours/day. For most SMBs, local makes sense only for high-volume batch processing, privacy-critical workloads, or when you need sub-100ms response times.
It was a cold January night—early January 2026—and my wife asked me to turn up the heat. I obliged by heading to the basement and firing up that 2,500W space heater positioned strategically under the family room. The one made by NVIDIA.
The rack holds three CPUs, five GPUs, and enough spinning rust to make the floor warm. Tonight’s excuse for running it at full tilt: answering a question that keeps coming up in client conversations.
“Should I buy local hardware or just use OpenRouter?”
Fair question. With the RTX 6000 Blackwell Pro Q-Max now available and open-weight models like GPT-OSS 120B running comfortably on a single card, the economics have shifted. But have they shifted enough?
The Experiment
I set up a direct comparison: three models, two deployment methods, real measurements.
Models tested:
- GPT-OSS 120B (OpenAI’s open-weight model, 117B params, ~65GB in MXFP4)
- Gemma 3 27B (Google’s multimodal model, 27B params, ~18GB in Q4)
- Qwen 3 VL 30B (Alibaba’s vision-language MoE, 31B params, ~19GB in Q4)
Local setup:
- RTX 6000 Blackwell Pro Q-Max (~$8,500)
- Ollama for single-query inference
- vLLM for batched throughput testing
- Power draw: 300W GPU + 150W system = 450W total
Cloud setup:
- OpenRouter API
- Same models via their hosted endpoints
- Local machine at 150W for the API client
Network: 25 Gbps LAN, 2.5 Gbps WAN with 5ms latency—fast enough that cloud performance reflects API overhead, not network bottlenecks.
I measured two critical metrics: Time to First Token (TTFT) for latency, and tokens per second for throughput. Then I ran the numbers on cost.
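For anyone who wants to reproduce this, here's a minimal sketch of how both metrics can be captured against an OpenAI-compatible endpoint (Ollama and vLLM each expose one). The base URL, model tag, and the one-chunk-per-token approximation are simplifications, not the exact harness behind the numbers below.

```python
# Minimal TTFT / throughput probe against an OpenAI-compatible endpoint.
# Assumptions: Ollama's default port, a placeholder model tag, and
# "one streamed chunk ~= one token" as a rough token counter.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

def measure(model: str, prompt: str) -> tuple[float, float]:
    start = time.perf_counter()
    first = None
    n_chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first is None:
                first = time.perf_counter()  # time to first token
            n_chunks += 1
    end = time.perf_counter()
    ttft = first - start
    tps = n_chunks / (end - first) if end > first else float("nan")
    return ttft, tps

ttft, tps = measure("gemma3:27b", "Explain machine learning in about 50 words.")
print(f"TTFT: {ttft * 1000:.0f} ms, decode throughput: {tps:.0f} tok/s")
```

Point `base_url` at the vLLM server instead of Ollama and the same probe works for the batched setup.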
The Cost Model
Before looking at benchmarks, let’s establish the math.
RTX 6000 Blackwell Pro Q-Max hourly costs:
| Component | Calculation | Cost/Hour |
|---|---|---|
| Hardware depreciation | $8,500 / (3 years × 8,760 hours/year) | $0.32 |
| Electricity | 450W × $0.15/kWh | $0.07 |
| Total | | $0.39 |
For comparison, the RTX 5090 at $3,200 and 750W comes out to $0.23/hour—but it can’t run the 120B model with its 32GB VRAM limit.
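As a sanity check, the hourly figure is just straight-line depreciation plus electricity; here's the table's formula as a small helper, with the three-year lifetime and $0.15/kWh rate as the stated assumptions:

```python
def hourly_cost(price_usd: float, watts: float,
                lifetime_years: float = 3.0, usd_per_kwh: float = 0.15) -> float:
    """Hourly cost of ownership: straight-line depreciation plus electricity."""
    depreciation = price_usd / (lifetime_years * 8760)  # 8,760 hours per year
    electricity = (watts / 1000) * usd_per_kwh          # kW × $/kWh
    return depreciation + electricity

print(f"{hourly_cost(8500, 450):.2f}")  # RTX 6000 Blackwell Pro Q-Max -> 0.39
print(f"{hourly_cost(3200, 750):.2f}")  # RTX 5090 (full system draw)  -> 0.23
```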
OpenRouter pricing (per 1M tokens):
| Model | Input | Output | Blended (80% output) |
|---|---|---|---|
| GPT-OSS 120B | $0.15 | $0.60 | $0.51 |
| Gemma 3 27B | $0.036 | $0.064 | $0.058 |
| Qwen 3 VL 30B | $0.15 | $0.60 | $0.51 |
The blended rate assumes 20% input tokens and 80% output tokens—typical for generation tasks.
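In code, the blend is just a weighted average, with the 20/80 split as the assumption above:

```python
def blended_rate(input_per_m: float, output_per_m: float,
                 output_share: float = 0.8) -> float:
    """Blended $/M tokens, weighting input and output prices by token share."""
    return (1 - output_share) * input_per_m + output_share * output_per_m

print(f"{blended_rate(0.15, 0.60):.3f}")    # GPT-OSS 120B -> 0.510
print(f"{blended_rate(0.036, 0.064):.3f}")  # Gemma 3 27B  -> 0.058
```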
Local Throughput Results
Here’s what I measured on the RTX 6000 Blackwell Pro Q-Max via Ollama (single-query, after warmup):
| Model | Tokens/sec | Local $/M Tokens |
|---|---|---|
| GPT-OSS 120B | 159 | $0.68 |
| Gemma 3 27B | 65 | $1.66 |
| Qwen 3 VL 30B | 186 | $0.58 |
Wait—local is more expensive than cloud for all three models? At single-query throughput, yes. The Gemma result is especially stark: $1.66/M locally vs $0.058/M on OpenRouter. That’s a 29x difference.
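The local figures are derived the obvious way: the $0.39/hour cost of ownership divided by how many tokens the card produces in an hour. Gemma's number, for example, is $0.39 spread over roughly 234,000 tokens (the rounded 65 tok/s used here lands a cent above the table's $1.66):

```python
def local_cost_per_m(cost_per_hour: float, tokens_per_sec: float) -> float:
    """$/M tokens for a local GPU generating a single stream flat out."""
    tokens_per_hour = tokens_per_sec * 3600
    return cost_per_hour / (tokens_per_hour / 1_000_000)

print(f"{local_cost_per_m(0.39, 65):.2f}")   # Gemma 3 27B   -> 1.67
print(f"{local_cost_per_m(0.39, 159):.2f}")  # GPT-OSS 120B  -> 0.68
print(f"{local_cost_per_m(0.39, 186):.2f}")  # Qwen 3 VL 30B -> 0.58
```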
But single-query throughput isn’t the whole story.
The Batching Multiplier
Running vLLM with concurrent requests reveals the GPU’s true potential:
| Concurrency | Tokens/sec | Local $/M | vs Cloud |
|---|---|---|---|
| 1 query | 175 | $0.62 | 22% more expensive |
| 4 concurrent | 505 | $0.21 | 59% cheaper |
| 16 concurrent | 513 | $0.21 | 59% cheaper |
| 64 concurrent | 515 | $0.21 | 59% cheaper |
At 4 concurrent requests, the GPU hits ~505 tokens/second aggregate throughput. That drops local cost to $0.21/M tokens—well below OpenRouter’s $0.51/M for the same model.
The sweet spot is 4-16 concurrent requests. Beyond that, the GPU is already saturated: extra concurrency just queues requests without adding throughput.
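Reproducing the sweep is straightforward: fire N requests at vLLM's OpenAI-compatible server concurrently and divide total completion tokens by wall-clock time. A rough sketch, with the port and model tag as placeholders rather than my exact configuration:

```python
# Sketch of a concurrency sweep against a vLLM OpenAI-compatible server.
# Aggregate throughput = total completion tokens / wall-clock time.
import asyncio
import time
from openai import AsyncOpenAI

PROMPT = "Explain machine learning in about 50 words."

async def main() -> None:
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")
    model = "openai/gpt-oss-120b"  # placeholder: whatever tag vLLM is serving

    async def one_request() -> int:
        resp = await client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": PROMPT}]
        )
        return resp.usage.completion_tokens

    for n in (1, 4, 16, 64):
        start = time.perf_counter()
        counts = await asyncio.gather(*(one_request() for _ in range(n)))
        elapsed = time.perf_counter() - start
        print(f"{n:>3} concurrent: {sum(counts) / elapsed:.0f} tok/s aggregate")

asyncio.run(main())
```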
The Latency Story
Cost per token is only half the equation. For interactive applications, time to first token matters:
| Model | Local TTFT | Cloud TTFT | Improvement |
|---|---|---|---|
| GPT-OSS 120B | 352ms | 638-7,520ms* | 45-95% faster |
| Gemma 3 27B | 88ms | 760ms | 88% faster |
*Cloud providers have cold starts too—GPT-OSS hit 638ms when warm, 10+ seconds when cold.
After warmup, local inference delivers sub-400ms TTFT. Gemma 3 at 88ms is nearly 9x faster than cloud—that’s the difference between a responsive assistant and one that makes users wait.
With vLLM batching (4 concurrent), local TTFT drops even further to 37ms—faster than any cloud option.
The Bottom Line
Part 1’s key insight: single-query economics favor cloud, but batch processing favors local.
If you’re running occasional queries—answering customer questions, drafting emails, one-off code generation—OpenRouter wins on pure cost. Gemma 3 27B at $0.058/M tokens is essentially free.
But if you’re running batch jobs—processing documents, analyzing datasets, running evaluations—local inference with batched requests drops your cost to $0.21/M tokens. That’s real savings at scale.
In Part 2, I’ll calculate break-even points for different usage scenarios and make concrete recommendations for when to buy hardware vs. rent by the token.
Benchmarks run January 5, 2026 on RTX 6000 Blackwell Pro Q-Max. Ollama for single-query inference, vLLM for batched throughput. OpenRouter pricing retrieved via API. All models in 4-bit quantization. Network: 25 Gbps LAN, 2.5 Gbps WAN with 5ms latency. Test prompt: 50-word ML explanation task generating ~150-350 output tokens.