1,000x Cheaper: Why Local RAG Changes Everything

TL;DR: RAG pipelines flip the local vs cloud economics completely. vLLM embeddings hit 8,091 chunks/sec—385x faster than cloud and 1,000x cheaper ($0.01 vs $10 per million chunks). Even Ollama manages 117 chunks/sec, roughly 6x faster than OpenRouter. Reranking on a $1,000 RTX 5070 Ti scores 329 docs/sec. For enterprise RAG embedding 100M chunks/month, the RTX 6000 Blackwell Pro Q-Max pays for itself in about 8.5 months on embedding savings alone. The choice between Ollama and vLLM matters enormously: vLLM’s native batching delivers 70x better throughput than Ollama’s sequential processing on the same hardware.


In Parts 1 and 2, I benchmarked LLM generation—the part of AI that writes responses. The verdict: cloud wins on cost for low-volume, local wins on latency and privacy.

But most production AI systems aren’t just generating text. They’re doing Retrieval-Augmented Generation (RAG)—searching a knowledge base, ranking results, then generating answers grounded in retrieved documents. RAG pipelines have two additional compute-intensive steps: embedding and reranking.

The question: how do embedding and reranking change those economics?

The RAG Pipeline

A typical RAG system:

  1. Embed the query → Convert user question to a vector
  2. Vector search → Find similar documents (handled by vector DB)
  3. Rerank results → Score relevance of top-K candidates
  4. Generate response → LLM synthesizes answer from context

Steps 1 and 3 hit your inference hardware. Step 2 is your vector database. Step 4 we covered in Parts 1-2. Tonight we’re benchmarking embedding and reranking.
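
To make the moving parts concrete, here’s a minimal sketch of that four-step flow in Python. It assumes OpenAI-compatible local endpoints (vLLM exposes /v1/embeddings and /v1/chat/completions out of the box), a Cohere/Jina-style rerank route, and a numpy cosine search standing in for the vector database; host names, ports, and model tags are placeholders.

```python
# Minimal RAG pipeline sketch. Host names, ports, and model tags are
# placeholders; the rerank endpoint shape assumes a Cohere/Jina-style API.
import numpy as np
import requests

EMBED_URL = "http://gpu-server:8000/v1/embeddings"
RERANK_URL = "http://ultra9:8001/v1/rerank"
CHAT_URL = "http://gpu-server:8002/v1/chat/completions"

def embed(texts):
    """Step 1: embed the query (or chunks) in one batched request."""
    r = requests.post(EMBED_URL, json={"model": "Qwen/Qwen3-Embedding-0.6B",
                                       "input": texts})
    return np.array([d["embedding"] for d in r.json()["data"]])

def vector_search(query_vec, chunk_vecs, chunks, k=100):
    """Step 2: cosine similarity as a stand-in for the vector DB."""
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec))
    return [chunks[i] for i in np.argsort(-sims)[:k]]

def rerank(query, docs, top_n=5):
    """Step 3: cross-encoder scoring of the top-k candidates."""
    r = requests.post(RERANK_URL, json={"model": "Qwen/Qwen3-Reranker-4B",
                                        "query": query, "documents": docs,
                                        "top_n": top_n})
    return [docs[item["index"]] for item in r.json()["results"]]

def generate(query, context_docs):
    """Step 4: the LLM answers, grounded in the reranked context."""
    system = "Answer using only this context:\n" + "\n---\n".join(context_docs)
    r = requests.post(CHAT_URL, json={"model": "local-llm",  # placeholder
                                      "messages": [{"role": "system", "content": system},
                                                   {"role": "user", "content": query}]})
    return r.json()["choices"][0]["message"]["content"]
```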

The Setup

Embedding model: Qwen3-Embedding 0.6B (1024-dim vectors, BF16)

  • Local: Ollama on RTX 6000 Blackwell Pro Q-Max
  • Cloud: OpenRouter → OpenAI text-embedding-3-small (1536-dim)
  • Test chunks: ~500 tokens each (typical RAG chunk size)

Reranking model: Qwen3-Reranker 4B

  • Local: vLLM on RTX 5070 Ti ($1,000 GPU, ultra9 server)
  • Cloud: Qwen3-Reranker wasn’t responding on OpenRouter during testing (Cohere Rerank used for cost comparison)

This brings up another advantage of local: model availability on cloud is outside your control. OpenRouter handles provider outages better than most by routing to alternatives, but it remains an issue—your production RAG pipeline shouldn’t fail because a cloud endpoint went down.

Network: 2.5 Gbps bidirectional WAN, 5ms latency (to Cloudflare), 25 Gbps LAN. That’s faster than most office connections—offices with slower links or higher latency would see an even larger advantage for local embedding and reranking.
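
The throughput numbers below come from sweeping concurrency against a fixed chunk corpus and timing the whole run. Here’s a minimal sketch of that kind of measurement, not the exact harness; the URL, model tag, and dummy chunks are placeholders.

```python
# Sketch of a chunks/sec measurement against an OpenAI-compatible
# /v1/embeddings endpoint. URL, model tag, and chunk text are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://gpu-server:8000/v1/embeddings"   # Ollama or vLLM
MODEL = "qwen3-embedding:0.6b"                 # placeholder model tag
CHUNKS = ["sample chunk text " * 100] * 80     # stand-in for ~500-token chunks

def embed_one(chunk):
    requests.post(URL, json={"model": MODEL, "input": chunk}).raise_for_status()

def throughput(concurrency):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(embed_one, CHUNKS))
    return len(CHUNKS) / (time.perf_counter() - start)

for c in (1, 4, 16, 64):
    print(f"concurrency={c}: {throughput(c):.0f} chunks/sec")
```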

Embedding Results

Backend | Concurrency | Chunks/sec | Latency/chunk
Ollama (local) | 1 | 24 | 42ms
Ollama (local) | 4 | 67 | 15ms
Ollama (local) | 16 | 114 | 9ms
Ollama (local) | 64 | 117 | 9ms
vLLM (local) | 4 batches | 8,091 | 0.1ms
vLLM (local) | 16 batches | 3,586 | 0.3ms
OpenRouter (cloud) | batch | 21 | 48ms

The vLLM result is stunning: 8,091 chunks/second—that’s 70x faster than Ollama and 385x faster than cloud.

vLLM’s native batching is the key. While Ollama processes embeddings one at a time (even with concurrent requests), vLLM batches them efficiently on the GPU. The 0.6B model in BF16 is small enough that vLLM can process massive batches with near-zero latency per chunk.

Ollama tops out around 117 chunks/sec regardless of concurrency—the bottleneck is its sequential processing, not the GPU. OpenRouter’s cloud batching hits 21 chunks/sec, limited by network round-trips and API overhead.
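
The practical upshot: when bulk-indexing with vLLM, don’t send one chunk per request; pack the whole batch into a single call and let the server batch it on the GPU. A sketch against the OpenAI-compatible /v1/embeddings endpoint (URL, model name, and corpus are placeholders):

```python
# One batched request instead of thousands of sequential ones: the GPU sees
# a single large embedding batch. Endpoint and model name are placeholders.
import time
import requests

URL = "http://gpu-server:8000/v1/embeddings"
chunks = ["sample chunk text " * 100] * 2048   # stand-in corpus

start = time.perf_counter()
resp = requests.post(URL, json={"model": "Qwen/Qwen3-Embedding-0.6B",
                                "input": chunks}, timeout=300)
vectors = [d["embedding"] for d in resp.json()["data"]]
elapsed = time.perf_counter() - start
print(f"{len(vectors)} chunks in {elapsed:.1f}s "
      f"({len(vectors)/elapsed:.0f} chunks/sec)")
```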

Reranking Results

Backend | Concurrency | Docs/sec | Latency/doc
vLLM ultra9 (RTX 5070 Ti) | 1 | 104 | 10ms
vLLM ultra9 (RTX 5070 Ti) | 8 | 329 | 3.0ms

The RTX 5070 Ti—a $1,000 consumer GPU—reranks 329 documents per second with 8 concurrent requests. That’s 3.0ms per document.

For context: a typical RAG query retrieves 20-100 candidate documents from vector search, then reranks them. At 329 docs/sec, reranking 100 documents takes 300ms. That’s imperceptible in a user-facing application.
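
Measuring that end to end is just a timer around the rerank call. The sketch below assumes a Cohere/Jina-style /v1/rerank route like the one vLLM can serve; if your reranking server exposes a different scoring API, adjust the path and payload. Host, port, query, and documents are placeholders.

```python
# Time one rerank pass over 100 candidates, the worst case discussed above.
# Host, port, endpoint shape, and documents are placeholders/assumptions.
import time
import requests

URL = "http://ultra9:8001/v1/rerank"
query = "How do I rotate the service API keys?"
candidates = [f"candidate document {i} about key rotation policy..." for i in range(100)]

start = time.perf_counter()
resp = requests.post(URL, json={"model": "Qwen/Qwen3-Reranker-4B",
                                "query": query,
                                "documents": candidates,
                                "top_n": 10})
results = resp.json()["results"]   # e.g. [{"index": 17, "relevance_score": 0.93}, ...]
print(f"reranked {len(candidates)} docs in "
      f"{(time.perf_counter() - start) * 1000:.0f} ms")
```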

The Cost Analysis

OpenAI text-embedding-3-small pricing: $0.02 per 1M tokens

A typical RAG chunk is ~500 tokens. At $0.02/M tokens:

  • Cost per 1M chunks embedded: $10
  • At 21 chunks/sec cloud throughput: 13 hours to embed 1M chunks

Local embedding costs:

RTX 6000 Blackwell Pro Q-Max hourly cost: $0.39 (from Part 1)

Backend | Chunks/sec | Time for 1M chunks | Cost per 1M chunks
OpenRouter | 21 | 13.2 hours | $10.00
Ollama | 117 | 2.4 hours | $0.94
vLLM | 8,091 | 2 minutes | $0.01

With vLLM, embedding 1 million chunks costs about a penny and takes 2 minutes. That’s 1,000x cheaper than cloud.

Metric | Cloud | vLLM Local | Advantage
Cost per 1M chunks | $10.00 | $0.01 | 1,000x cheaper
Time to embed 1M | 13.2 hours | 2 minutes | 385x faster
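
Those figures fall straight out of the pricing and throughput numbers above; here’s the back-of-the-envelope check, using only values already quoted in this post:

```python
# Back-of-the-envelope check on the table above, using only numbers from
# this post: 500-token chunks, $0.02 per 1M tokens for OpenAI
# text-embedding-3-small, $0.39/hour for the RTX 6000 (from Part 1).
TOKENS_PER_CHUNK = 500
CLOUD_PRICE_PER_M_TOKENS = 0.02
GPU_HOURLY_COST = 0.39

tokens_per_m_chunks = 1_000_000 * TOKENS_PER_CHUNK                  # 500M tokens
cloud_per_m_chunks = tokens_per_m_chunks / 1_000_000 * CLOUD_PRICE_PER_M_TOKENS
print(f"cloud: ${cloud_per_m_chunks:.2f} per 1M chunks")            # $10.00

for backend, chunks_per_sec in [("Ollama", 117), ("vLLM", 8091)]:
    hours = 1_000_000 / chunks_per_sec / 3600
    print(f"{backend}: {hours:.2f} h, ${hours * GPU_HOURLY_COST:.2f} per 1M chunks")
    # Ollama: 2.37 h, $0.93   |   vLLM: 0.03 h, $0.01
```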

For reranking, cloud options like Cohere Rerank cost ~$1 per 1K searches. Local reranking on a $1,000 RTX 5070 Ti is essentially free after hardware costs.

Break-Even for RAG Workloads

The RTX 6000 Blackwell Pro Q-Max at ~$8,500 breaks even on embedding alone at:

$8,500 / ($10 - $0.01 per M chunks) = 850M chunks

With vLLM at 8,091 chunks/sec, that’s just 29 hours of continuous embedding.
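
The table below is the same arithmetic spread across monthly volumes; a quick script to reproduce it from the numbers above:

```python
# Break-even calendar time at different monthly volumes, using the numbers
# above: $8,500 for the GPU, $10 vs $0.01 per 1M chunks embedded.
GPU_PRICE = 8_500
SAVINGS_PER_M_CHUNKS = 10.00 - 0.01

breakeven_chunks = GPU_PRICE / SAVINGS_PER_M_CHUNKS * 1_000_000   # ~851M chunks

volumes = {"Light": 100_000, "Medium": 1_000_000,
           "Heavy": 10_000_000, "Enterprise": 100_000_000}
for label, chunks_per_month in volumes.items():
    months = breakeven_chunks / chunks_per_month
    print(f"{label:<10} {months:>9,.1f} months")
# Light 8,508.5 | Medium 850.9 | Heavy 85.1 | Enterprise 8.5
```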

Usage | Chunks/month | Break-even
Light | 100K | ~700 years
Medium | 1M | ~71 years
Heavy | 10M | ~7 years
Enterprise | 100M | ~8.5 months

For enterprise RAG deployments processing 100M+ chunks monthly—customer support knowledge bases, legal document search, enterprise wikis—the GPU pays for itself within months on embedding savings alone.

The RTX 5070 Ti Sweet Spot

The $1,000 RTX 5070 Ti running reranking:

  • 329 docs/sec reranking throughput (8 concurrent)
  • 16GB VRAM fits the 4B reranker model
  • Power: 300W GPU + 150W system = 450W total
  • Hourly cost: ~$0.11 (electricity + 3yr depreciation)
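
That hourly figure is easy to sanity-check. The electricity price below is an assumption ($0.15/kWh); plug in your own rate:

```python
# Sanity check on the ~$0.11/hour figure: 450 W at an assumed $0.15/kWh,
# plus straight-line 3-year depreciation of the $1,000 card.
WATTS = 450
KWH_PRICE = 0.15                  # assumption; plug in your local rate
GPU_PRICE = 1_000
HOURS_3Y = 3 * 365 * 24

hourly = WATTS / 1000 * KWH_PRICE + GPU_PRICE / HOURS_3Y
print(f"${hourly:.3f}/hour")      # ≈ $0.106/hour
```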

Break-even vs Cohere Rerank (~$1/1K searches):

Daily Queries | Cloud Cost/Day | Break-even
1,000 | $1 | 2.7 years
10,000 | $10 | 3.3 months
50,000 | $50 | 20 days

For high-volume RAG (50K+ queries/day), the RTX 5070 Ti pays for itself in under a month.

The Latency Advantage

Beyond cost, local RAG components transform user experience:

Operation | Cloud | Ollama | vLLM
Embed query | 48ms | 9ms | 0.1ms
Rerank 100 docs | ~500ms | ~300ms (local vLLM reranker) | ~300ms
Total RAG overhead | ~548ms | ~309ms | ~300ms

With vLLM, embedding latency is effectively zero. Combined with the LLM latency advantages from Part 1 (88ms vs 760ms TTFT for Gemma), local RAG delivers noticeably snappier responses.

When Local RAG Wins

Always wins:

  • High-volume embedding (10M+ chunks/month)
  • Reranking-heavy workloads (10K+ queries/day)
  • Latency-sensitive applications
  • Privacy-critical document search

Cloud might win:

  • Low-volume, occasional indexing
  • Burst capacity for one-time migrations
  • When you need OpenAI’s larger embedding models

The Hybrid Architecture

My recommendation for most RAG deployments:

  1. vLLM for embeddings—the 1,000x cost advantage over cloud is too large to ignore
  2. Local reranking on even modest hardware—a $1,000 RTX 5070 Ti handles enterprise load
  3. Local or cloud LLM depending on volume and privacy needs (see Parts 1-2)
  4. Ollama for convenience, vLLM for performance—same model, 70x throughput difference
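
Wired together, that split can be as simple as a routing table. Everything below (host names, ports, the volume threshold) is a placeholder meant to show the shape, not a prescription:

```python
# Hybrid routing sketch: embeddings and reranking always go local, generation
# goes local or cloud depending on privacy and volume. All host names, ports,
# and thresholds are placeholders.
from dataclasses import dataclass

@dataclass
class RagEndpoints:
    embed_url: str
    rerank_url: str
    generate_url: str

LOCAL_LLM = "http://gpu-server:8002/v1/chat/completions"    # vLLM or Ollama
CLOUD_LLM = "https://openrouter.ai/api/v1/chat/completions"

def endpoints_for(sensitive: bool, daily_queries: int) -> RagEndpoints:
    # Generation is the only step worth sending to the cloud, and only when
    # the data isn't sensitive and the volume is low (see Parts 1-2).
    use_local_llm = sensitive or daily_queries > 50_000
    return RagEndpoints(
        embed_url="http://gpu-server:8000/v1/embeddings",   # vLLM, RTX 6000
        rerank_url="http://ultra9:8001/v1/rerank",          # vLLM, RTX 5070 Ti
        generate_url=LOCAL_LLM if use_local_llm else CLOUD_LLM,
    )
```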

The GPU rack in my basement now runs triple duty: LLM inference, embedding, and reranking. The space heater metaphor from Part 1 keeps getting more literal—but the economics finally work out.


Benchmarks run January 5, 2026. Embedding: Qwen3-Embedding-0.6B (BF16) via Ollama and vLLM on RTX 6000 Blackwell Pro Q-Max. Reranking: Qwen3-Reranker-4B via vLLM on RTX 5070 Ti. Cloud: OpenRouter → OpenAI text-embedding-3-small. Network: 2.5 Gbps WAN, 5ms latency, 25 Gbps LAN. Test corpus: 80 chunks (~500 tokens each). Concurrency tested: 1, 4, 16, 64 workers.