I Spent $8,500 on a GPU to Beat Cloud AI. Here's What Happened.

TL;DR: I ran head-to-head benchmarks comparing local inference on an RTX 6000 Blackwell Pro Q-Max against OpenRouter’s cloud APIs. For single queries, OpenRouter is often cheaper—Gemma 3 27B costs just $0.06/M tokens in the cloud vs $1.66/M locally. But batched inference changes everything: vLLM with 4 concurrent requests drops local costs to $0.21/M tokens, saving 59% vs OpenRouter. The real wins for local are latency (88ms vs 760ms TTFT for Gemma) and privacy. At ~$8,500 for the GPU, break-even requires 28 billion tokens of batched throughput—about 5 years at 8 hours/day. For most SMBs, local makes sense only for high-volume batch processing, privacy-critical workloads, or when you need sub-100ms response times.


It was a cold night in early January 2026, and my wife asked me to turn up the heat. I obliged by heading to the basement and firing up the 2,500W space heater positioned strategically under the family room. The one made by NVIDIA.

The rack holds three CPUs, five GPUs, and enough spinning rust to make the floor warm. Tonight’s excuse for running it at full tilt: answering a question that keeps coming up in client conversations.

“Should I buy local hardware or just use OpenRouter?”

Fair question. With the RTX 6000 Blackwell Pro Q-Max now available and open-weight models like GPT-OSS 120B running comfortably on a single card, the economics have shifted. But have they shifted enough?

The Experiment

I set up a direct comparison: three models, two deployment methods, real measurements.

Models tested:

  • GPT-OSS 120B (OpenAI’s open-weight model, 117B params, ~65GB in MXFP4)
  • Gemma 3 27B (Google’s multimodal model, 27B params, ~18GB in Q4)
  • Qwen 3 VL 30B (Alibaba’s vision-language MoE, 31B params, ~19GB in Q4)

Local setup:

  • RTX 6000 Blackwell Pro Q-Max (~$8,500)
  • Ollama for single-query inference
  • vLLM for batched throughput testing
  • Power draw: 300W GPU + 150W system = 450W total

Cloud setup:

  • OpenRouter API
  • Same models via their hosted endpoints
  • Local machine at 150W for the API client

Network: 25 Gbps LAN, 2.5 Gbps WAN with 5ms latency—fast enough that cloud performance reflects API overhead, not network bottlenecks.

I measured two critical metrics: Time to First Token (TTFT) for latency, and tokens per second for throughput. Then I ran the numbers on cost.
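The measurement itself is simple: stream a completion and time it. Here's a minimal sketch against an OpenAI-compatible endpoint (Ollama, vLLM, and OpenRouter all expose one); the base URL and model tag are placeholders, and tokens are approximated by counting streamed chunks rather than running a tokenizer.

```python
import time

from openai import OpenAI  # pip install openai

# Ollama's OpenAI-compatible API lives at http://localhost:11434/v1 by default;
# for cloud runs, point base_url at https://openrouter.ai/api/v1 with a real key.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

def measure(model: str, prompt: str) -> tuple[float, float]:
    """Return (TTFT in ms, tokens/sec) for one streamed completion."""
    start = time.perf_counter()
    ttft_ms = 0.0
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=350,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if chunks == 0:
                ttft_ms = (time.perf_counter() - start) * 1000  # time to first token
            chunks += 1  # each content chunk is roughly one token
    elapsed = time.perf_counter() - start
    return ttft_ms, chunks / elapsed

print(measure("gemma3:27b", "Explain machine learning in about 50 words."))
```

Run it a few times and discard the first result so model load time doesn't pollute the warm numbers.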

The Cost Model

Before looking at benchmarks, let’s establish the math.

RTX 6000 Blackwell Pro Q-Max hourly costs:

Component               Calculation                         Cost/Hour
Hardware depreciation   $8,500 / (3 years × 8,760 hours)    $0.32
Electricity             450W × $0.15/kWh                    $0.07
Total                                                       $0.39

For comparison, the RTX 5090 at $3,200 and 750W of total system draw comes out to $0.23/hour, but its 32GB of VRAM can't hold the 120B model.
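Both per-hour figures are just straight-line depreciation plus electricity; here's a quick sketch that reproduces them, using the same three-year window and $0.15/kWh assumption as the table:

```python
def hourly_cost(hardware_usd: float, watts: float,
                years: float = 3.0, kwh_price: float = 0.15) -> float:
    """Straight-line depreciation over `years` plus electricity cost per hour."""
    depreciation = hardware_usd / (years * 8760)   # 8,760 hours per year
    electricity = (watts / 1000) * kwh_price       # kW times $/kWh
    return depreciation + electricity

print(round(hourly_cost(8500, 450), 2))  # RTX 6000 Blackwell Pro Q-Max: $0.39/hour
print(round(hourly_cost(3200, 750), 2))  # RTX 5090 system: $0.23/hour
```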

OpenRouter pricing (per 1M tokens):

Model           Input    Output   Blended (80% output)
GPT-OSS 120B    $0.15    $0.60    $0.51
Gemma 3 27B     $0.036   $0.064   $0.058
Qwen 3 VL 30B   $0.15    $0.60    $0.51

The blended rate assumes 20% input tokens and 80% output tokens—typical for generation tasks.
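The blended column is just a weighted average of the input and output rates:

```python
def blended_rate(input_per_m: float, output_per_m: float,
                 output_share: float = 0.8) -> float:
    """Blended $/M tokens, assuming `output_share` of tokens are output."""
    return (1 - output_share) * input_per_m + output_share * output_per_m

print(round(blended_rate(0.15, 0.60), 3))    # GPT-OSS 120B / Qwen 3 VL 30B: $0.51
print(round(blended_rate(0.036, 0.064), 3))  # Gemma 3 27B: $0.058
```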

Local Throughput Results

Here’s what I measured on the RTX 6000 Blackwell Pro Q-Max via Ollama (single-query, after warmup):

Model           Tokens/sec   Local $/M Tokens
GPT-OSS 120B    159          $0.68
Gemma 3 27B     65           $1.66
Qwen 3 VL 30B   186          $0.58

Wait—local is more expensive than cloud for all three models? At single-query throughput, yes. The Gemma result is especially stark: $1.66/M locally vs $0.058/M on OpenRouter. That’s a 29x difference.
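Those local figures fall straight out of the $0.39/hour rate divided by how many tokens the card produces in an hour:

```python
GPU_HOURLY_COST = 0.39  # from the cost model above

def local_cost_per_m(tokens_per_sec: float) -> float:
    """$ per million tokens at a given sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return GPU_HOURLY_COST / tokens_per_hour * 1_000_000

for name, tps in [("GPT-OSS 120B", 159), ("Gemma 3 27B", 65), ("Qwen 3 VL 30B", 186)]:
    print(f"{name}: ${local_cost_per_m(tps):.2f}/M tokens")
# Prints ~$0.68, ~$1.67, ~$0.58, matching the table above within rounding.
```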

But single-query throughput isn’t the whole story.

The Batching Multiplier

Running vLLM with concurrent requests reveals the GPU’s true potential:

Concurrency     Tokens/sec   Local $/M   vs Cloud
1 query         175          $0.62       22% more expensive
4 concurrent    505          $0.21       59% cheaper
16 concurrent   513          $0.21       59% cheaper
64 concurrent   515          $0.21       59% cheaper

At 4 concurrent requests, the GPU hits ~505 tokens/second aggregate throughput. That drops local cost to $0.21/M tokens—well below OpenRouter’s $0.51/M for the same model.

The sweet spot is 4-16 concurrent requests. Beyond that, you’re saturating the GPU without gaining throughput.
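The concurrency sweep is easy to reproduce against vLLM's OpenAI-compatible server. This is a simplified sketch rather than my exact harness; the endpoint, model name, prompt, and round count are placeholders, and it reports aggregate output tokens per second at each concurrency level.

```python
import asyncio
import time

from openai import AsyncOpenAI  # pip install openai

# vLLM's OpenAI-compatible server defaults to http://localhost:8000/v1.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MODEL = "openai/gpt-oss-120b"  # whatever name the server was launched with
PROMPT = "Explain machine learning in about 50 words."

async def one_request() -> int:
    """Run a single completion and return its output token count."""
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=300,
    )
    return resp.usage.completion_tokens

async def sweep(concurrency: int, rounds: int = 5) -> float:
    """Aggregate output tokens/sec with `concurrency` requests in flight."""
    start = time.perf_counter()
    total_tokens = 0
    for _ in range(rounds):
        counts = await asyncio.gather(*(one_request() for _ in range(concurrency)))
        total_tokens += sum(counts)
    return total_tokens / (time.perf_counter() - start)

async def main() -> None:
    for c in (1, 4, 16, 64):
        print(f"{c:>2} concurrent: {await sweep(c):.0f} tokens/sec aggregate")

asyncio.run(main())
```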

The Latency Story

Cost per token is only half the equation. For interactive applications, time to first token matters:

Model           Local TTFT   Cloud TTFT      Improvement
GPT-OSS 120B    352ms        638-7,520ms*    45-95% faster
Gemma 3 27B     88ms         760ms           88% faster

*Cloud providers have cold starts too: GPT-OSS hit 638ms when warm and over 7 seconds when cold.

After warmup, local inference delivers sub-400ms TTFT. Gemma 3 at 88ms is nearly 9x faster than cloud—that’s the difference between a responsive assistant and one that makes users wait.

With vLLM batching (4 concurrent), local TTFT drops even further to 37ms—faster than any cloud option.

The Bottom Line

Part 1’s key insight: single-query economics favor cloud, but batch processing favors local.

If you’re running occasional queries—answering customer questions, drafting emails, one-off code generation—OpenRouter wins on pure cost. Gemma 3 27B at $0.058/M tokens is essentially free.

But if you’re running batch jobs—processing documents, analyzing datasets, running evaluations—local inference with batched requests drops your cost to $0.21/M tokens. That’s real savings at scale.

In Part 2, I’ll calculate break-even points for different usage scenarios and make concrete recommendations for when to buy hardware vs. rent by the token.


Benchmarks run January 5, 2026 on RTX 6000 Blackwell Pro Q-Max. Ollama for single-query inference, vLLM for batched throughput. OpenRouter pricing retrieved via API. All models in 4-bit quantization. Network: 25 Gbps LAN, 2.5 Gbps WAN with 5ms latency. Test prompt: 50-word ML explanation task generating ~150-350 output tokens.