Friday Morning Space Heaters: The Real Cost of Quantization

It all started when my wife woke up this morning and asked, “Alexa, what is the temperature outside?”

Alexa answered 21 degrees, and she was not talking about Celsius.

The heat pumps were not keeping up. Overnight, indoor temps had dropped three degrees below their set points. The house had that particular chill that makes you question every life decision that led to living in New England. Except I don’t live in New England, I live in Virginia, and it’s not supposed to be this cold here. At least not in my mind.

I had two options: turn on the oil furnace, or fire up the NVIDIA space heaters.

So here we are, running another batch of benchmarks while NVIDIA heats the house. My wife gets a warm house, I get data, and the electric company gets paid. Everyone wins. (The oil company does not win. The oil company can deal with it.)

TL;DR

I benchmarked three Qwen3 MoE models (30B, 30B-VL, and 80B variants) at both 4-bit and 8-bit quantization across speed, MMLU, and GSM8K math reasoning tests. The results: 4-bit quantization is ready for production. Speed improved 9-15% with 4-bit, while accuracy differences were 0-3%—essentially noise for most applications. The 40% memory savings from 4-bit is basically free.

The interesting finding was about AWQ vs GGUF quantization methods. AWQ calibrated on math-heavy data outperformed GGUF on math benchmarks but underperformed on general knowledge. Quantization isn’t just about bit depth—the calibration dataset shapes which capabilities survive. Choose your quantization method based on your use case, not just the bit count.

If you’re still defaulting to 8-bit “just to be safe,” you’re leaving performance on the table for minimal benefit.

The Quantization Question

Since I was going to burn the electricity anyway, I decided to tackle a question that’s been nagging at me: what do you actually lose when you quantize a model down to 4 bits?

The marketing materials all promise “minimal quality loss,” but I’ve been running Q8_0 models out of some vague sense that 8-bit must be meaningfully better than 4-bit. That’s a lot of VRAM I’m leaving on the table if the difference turns out to be noise.

Time to find out.

The Setup

I tested three Qwen3 model variants, each in both Q4_K_M and Q8_0 quantization:

  • Qwen3-30B-A3B-Instruct: The standard 30B parameter model with 3B active parameters (MoE architecture)
  • Qwen3-VL-30B-A3B-Instruct: The vision-language variant, also 30B with 3B active
  • Qwen3-Next-80B-A3B-Instruct: The larger 80B parameter model with 3B active parameters

I also threw in the Qwen3-VL model running through vLLM with AWQ 4-bit quantization. This led to some unexpected discoveries that made the whole frozen-morning exercise worthwhile.

The Benchmarks

Three tests, each designed to stress different model capabilities:

Speed Test: Generate 256 tokens about neural network backpropagation. Five iterations per model after a warmup query. The warmup matters—without it, you’re measuring model loading time, not inference speed.
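For the curious, the speed harness is nothing fancy. Here's a minimal sketch of the idea against Ollama's /api/generate endpoint (not my exact script, and the model tag is a placeholder you'd swap for whatever `ollama list` shows on your machine):

```python
import time
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen3:30b-a3b-q4_K_M"   # placeholder tag; use whatever you have pulled locally
PROMPT = "Explain how backpropagation works in neural networks."

def tokens_per_second(num_predict=256):
    """One non-streaming generation; Ollama returns its own eval stats."""
    resp = requests.post(OLLAMA_URL, json={
        "model": MODEL,
        "prompt": PROMPT,
        "stream": False,
        "options": {"num_predict": num_predict},
    }, timeout=600)
    resp.raise_for_status()
    data = resp.json()
    # eval_count = tokens generated, eval_duration = generation time in nanoseconds
    return data["eval_count"] / (data["eval_duration"] / 1e9)

tokens_per_second(num_predict=8)                  # warmup so we don't time the model load
runs = [tokens_per_second() for _ in range(5)]    # five timed iterations
print(f"mean: {sum(runs) / len(runs):.1f} tok/s")
```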

MMLU: 100 multiple-choice questions across 10 subjects. Abstract algebra, anatomy, astronomy, business ethics, clinical knowledge, college biology, college chemistry, college physics, computer security, world religions. Tests factual recall and reasoning.

GSM8K: 100 grade-school math word problems. Multi-step reasoning required. Simple problems that reveal whether quantization has damaged the model’s ability to chain thoughts together.
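Scoring is the only fiddly part. Roughly, the harness asks for a single letter on MMLU and a final number on GSM8K, then extracts answers like this (a simplified sketch of the extraction logic, not the exact code I ran):

```python
import re

def score_mmlu(model_output: str, correct_letter: str) -> bool:
    """Treat the first standalone A-D in the reply as the model's choice."""
    match = re.search(r"\b([ABCD])\b", model_output.upper())
    return match is not None and match.group(1) == correct_letter

def score_gsm8k(model_output: str, gold_answer: str) -> bool:
    """GSM8K gold answers end in '#### <number>'; compare it against the
    last number the model produced."""
    gold = gold_answer.split("####")[-1].strip().replace(",", "")
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output.replace(",", ""))
    return bool(numbers) and float(numbers[-1]) == float(gold)
```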

Speed Results: 4-bit Wins, But By How Much?

Model            Q4_K_M         Q8_0           Speedup
Qwen3-30B        196.6 tok/s    170.7 tok/s    +15%
Qwen3-VL-30B     193.0 tok/s    168.5 tok/s    +15%
Qwen3-Next-80B   89.0 tok/s     81.6 tok/s     +9%

The 30B models showed a consistent 15% speed improvement at 4-bit. The 80B model showed a smaller 9% improvement—at that scale, memory bandwidth becomes less of a bottleneck relative to compute.

Then there’s vLLM. The AWQ 4-bit model hit 236.1 tokens per second—22% faster than Ollama’s Q4_K_M on the same model architecture. Time-to-first-token of just 40ms compared to Ollama’s 100ms.
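Measuring that is straightforward once vLLM is serving the model on its OpenAI-compatible endpoint. Here's a sketch (the model name should match whatever you passed to vllm serve, and the token rate is approximate because it counts stream chunks rather than true tokens):

```python
import time
from openai import OpenAI

# vLLM's OpenAI-compatible server; the API key is required by the client but ignored
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
ttft = None
chunks = 0
stream = client.chat.completions.create(
    model="cpatonn/Qwen3-VL-30B-A3B-Instruct-AWQ-4bit",
    messages=[{"role": "user", "content": "Explain how backpropagation works."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if ttft is None:
            ttft = time.perf_counter() - start   # time to first generated token
        chunks += 1
elapsed = time.perf_counter() - start
print(f"TTFT: {ttft * 1000:.0f} ms, ~{chunks / elapsed:.1f} tok/s")
```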

If raw throughput is your priority and you can use AWQ quantization, vLLM is the clear winner. But speed isn’t everything. The real question is what you sacrifice for those extra tokens per second.

Accuracy Results: The Surprising Part

Going in, I expected 8-bit to show meaningfully better accuracy. I was wrong.

MMLU Results

Model            Q4_K_M   Q8_0   Difference
Qwen3-30B        72%      72%    0%
Qwen3-VL-30B     71%      73%    +2%
Qwen3-Next-80B   75%      78%    +3%

The 30B base model showed zero difference between quantization levels. The VL variant scored 2 points higher at 8-bit, and the 80B model showed the largest gap at 3 points.

GSM8K Results

Model            Q4_K_M   Q8_0   Difference
Qwen3-30B        92%      92%    0%
Qwen3-VL-30B     92%      93%    +1%
Qwen3-Next-80B   92%      91%    -1%

Math reasoning was even more striking. All models scored within a single percentage point regardless of quantization. The 80B model actually scored lower at 8-bit, though that’s likely statistical noise on a 100-question test.

Bottom line: For practical purposes, 4-bit and 8-bit produce nearly identical results. The 15% speed improvement from 4-bit comes essentially for free.

The AWQ Anomaly: When Calibration Data Matters More Than Bits

Here’s where things got interesting. The vLLM AWQ model showed a peculiar pattern:

Benchmark   AWQ 4-bit   GGUF Q4_K_M   Difference
GSM8K       94%         92%           AWQ +2%
MMLU        69%         71%           GGUF +2%

AWQ outperformed GGUF on math while underperforming on general knowledge. A trade-off, not uniform degradation.

I initially suspected prompt formatting differences between vLLM’s OpenAI-compatible templates and Ollama’s format. So I tested vLLM through the raw completions API, bypassing chat templates entirely, and MMLU dropped to 66%. If the template had been dragging scores down, removing it should have helped, so that wasn’t the culprit.
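The raw-completions check looked roughly like this: same server, but the /v1/completions route, so the prompt goes in verbatim with no chat template applied. (The question shown is illustrative, not pulled from my test set.)

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Plain completions: no chat template, the model sees exactly this text
prompt = (
    "Question: Which of the following is a group under addition?\n"
    "A. The odd integers\n"
    "B. The integers\n"
    "C. The positive integers\n"
    "D. The irrational numbers\n"
    "Answer:"
)

resp = client.completions.create(
    model="cpatonn/Qwen3-VL-30B-A3B-Instruct-AWQ-4bit",
    prompt=prompt,
    max_tokens=4,
    temperature=0,
)
print(resp.choices[0].text.strip())
```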

The answer turned out to be the calibration dataset.

I dug into the model card for the AWQ quantization I was using (cpatonn/Qwen3-VL-30B-A3B-Instruct-AWQ-4bit) and found it was quantized using nvidia/Llama-Nemotron-Post-Training-Dataset as calibration data. That dataset is heavily weighted toward mathematical reasoning, code generation, and instruction following.

AWQ (Activation-aware Weight Quantization) analyzes which weights matter most based on activation patterns during inference on the calibration dataset. Calibrate on math-heavy data, and the quantization preserves weights important for math at the expense of weights important for factual recall.
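To make that concrete, here's roughly what an AWQ quantization run looks like with the AutoAWQ library. This is a sketch of the general recipe, not the exact pipeline behind the cpatonn quant; the base model path and the calibration texts are stand-ins, and in practice you'd pass thousands of samples from your target domain (or a dataset like the Nemotron set).

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

base_model = "path/to/base-model"        # stand-in for the full-precision checkpoint
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# The calibration samples drive which weights AWQ protects: activation statistics
# collected on these texts decide the per-channel scaling before 4-bit rounding.
calib_samples = [
    "If a train travels 120 miles in 2 hours, what is its average speed? ...",
    "Explain the difference between a list and a tuple in Python. ...",
]

model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_samples)
model.save_quantized("model-awq-4bit")
tokenizer.save_pretrained("model-awq-4bit")
```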

GGUF’s Q4_K_M, by contrast, uses a static mixed-precision scheme that doesn’t depend on calibration data. It applies the same strategy regardless of use case, which preserves capabilities more evenly.

Practical implication: Quantization isn’t just about bit depth—the calibration dataset shapes what capabilities survive. Building a math tutoring app? AWQ calibrated on mathematical data might outperform higher-bit static quantization. Building a general assistant? GGUF might be safer.

The Experiment That Didn’t Work

I originally planned to compare GGUF quantizations from multiple sources: unsloth, official Qwen, and bartowski. I pulled four additional models:

  • unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF:Q4_K_S
  • unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF:IQ4_XS
  • Qwen/Qwen3-VL-30B-A3B-Instruct-GGUF:Q4_K_M
  • bartowski/Qwen_Qwen3-VL-30B-A3B-Instruct-GGUF:Q4_K_M

All four crashed. “Model runner has unexpectedly stopped.” Before crashing, they ran 64x slower than working models—8 seconds per question versus 0.12 seconds—suggesting Ollama was falling back to CPU inference.

Ollama reported "quantization_level": "unknown" for these HuggingFace-sourced models while correctly identifying the native model as Q4_K_M. It looks like a compatibility issue between Ollama and HuggingFace-hosted GGUFs for Qwen3-VL specifically.
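That field is easy to check programmatically before wasting a benchmark run. A small sketch against Ollama's /api/show endpoint (the tags are illustrative; use whatever ollama list reports on your machine):

```python
import requests

TAGS = [
    "hf.co/unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF:Q4_K_S",   # HuggingFace-sourced pull
    "qwen3-vl:30b-a3b-q4_K_M",                                # illustrative native tag
]

for tag in TAGS:
    # /api/show returns model metadata, including what Ollama thinks the quantization is
    resp = requests.post("http://localhost:11434/api/show", json={"model": tag})
    resp.raise_for_status()
    details = resp.json().get("details", {})
    print(f"{tag}: quantization_level = {details.get('quantization_level', '?')}")
```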

The tooling around local LLM deployment is still maturing. What works in one configuration fails in another. Error messages don’t always point to root cause. This is the reality of running your own models.

Why MoE Models Might Be Quantization-Friendly

All three Qwen3 models use a Mixture of Experts architecture with 3B active parameters out of their total count. For any given inference, only a fraction of the weights are actually used.

With a dense model, every weight matters for every token. Quantization errors accumulate across the entire forward pass. With MoE, the router selects which expert networks to activate. Quantization errors in inactive experts don’t affect output at all.

This might explain why the 80B model showed similar quantization sensitivity to the 30B models despite having more total parameters. Same active parameter count (3B), same effective “quantization surface area.”

It suggests MoE models might be inherently more quantization-friendly than dense models of equivalent capability. Sparse activation provides natural error isolation—though I’d need to run head-to-head comparisons with dense models to confirm that hypothesis.

Memory Footprint

Model            Q4_K_M Size   Q8_0 Size   Savings
Qwen3-30B        ~18.6 GB      ~32.5 GB    43%
Qwen3-VL-30B     ~19.6 GB      ~33.5 GB    42%
Qwen3-Next-80B   ~50.1 GB      ~84.8 GB    41%

4-bit models consistently use about 40% less storage and VRAM. For the 80B model, that’s the difference between fitting on a single high-end GPU versus requiring multiple cards or CPU offloading.
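The savings column is just arithmetic on the file sizes, and the same numbers explain why it's roughly 40% rather than 50%: Q4_K_M isn't literally 4 bits per weight, and Q8_0 isn't literally 8. A quick back-of-the-envelope check using the rounded sizes above (so the output won't match the table to the exact percent):

```python
# (total params in billions, Q4_K_M size in GB, Q8_0 size in GB), from the table above
models = {
    "Qwen3-30B":      (30, 18.6, 32.5),
    "Qwen3-VL-30B":   (30, 19.6, 33.5),
    "Qwen3-Next-80B": (80, 50.1, 84.8),
}

for name, (params_b, q4_gb, q8_gb) in models.items():
    savings = 1 - q4_gb / q8_gb
    bpw_q4 = q4_gb * 8 / params_b   # GB * 8 bits, divided by billions of params ~ bits/weight
    bpw_q8 = q8_gb * 8 / params_b
    print(f"{name}: {savings:.0%} smaller, ~{bpw_q4:.1f} vs ~{bpw_q8:.1f} bits per weight")
```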

Faster loading times too—Q4_K_M models loaded in roughly half the time of Q8_0 equivalents in my tests. For interactive applications where users switch between models, that matters.

What I’m Actually Going to Do Differently

Based on this data, I’m switching my defaults:

4-bit (Q4_K_M) becomes my standard for:

  • Speed-sensitive work
  • Memory-constrained hardware
  • Batch processing
  • Math/code/structured reasoning tasks

8-bit (Q8_0) only when:

  • Maximum accuracy on knowledge tasks is genuinely critical
  • I have VRAM to burn
  • The 3% accuracy difference actually matters for the application

vLLM with AWQ when:

  • Raw throughput is the primary metric
  • Time-to-first-token needs to be minimal
  • Workload skews toward math and code

Ollama-native models when:

  • Simplicity matters more than peak performance
  • I need guaranteed compatibility
  • I don’t want to debug mysterious crashes

The Bigger Picture

The most important finding isn’t any single number. It’s that 4-bit quantization has matured to the point where it’s the sensible default.

The days of dramatic quality loss from aggressive quantization are behind us, at least for models properly quantized with modern techniques. A 15% speed improvement compounds across every request; for a service handling thousands of queries daily, that can be the difference between three GPUs and four. The accuracy trade-off—0-3% depending on task and model—is small enough that most applications won’t notice.

What requires careful thought is the choice of quantization method and calibration data. AWQ and GGUF aren’t interchangeable. They make different trade-offs that interact with your use case in ways that aren’t obvious without testing.

If you’re still defaulting to 8-bit “just to be safe,” you’re leaving performance on the table for minimal benefit.


By the time I finished running these benchmarks, the family room above the basement was a comfortable 72 degrees. My wife approved.

“Did you learn anything useful?” she asked.

“4-bit is fine. I’ve been wasting VRAM for months.”

“That’s nice, dear. I meant about setting up the oil furnace to come on when the heat pumps can’t keep up.”

“…I’ll look into that tomorrow.”

She went back to her book. The GPUs spun down. The heat pumps finally caught up as the afternoon sun warmed the south-facing windows.

I’m calling it a productive morning.

These NVIDIA space heaters have been great this winter, but I’m beginning to wonder how they’re going to work out in the spring. Oh well, a problem for a few months from now.


Benchmarks conducted January 2026 using Ollama 0.14.1 and vLLM 0.13.0 on RTX 6000 Blackwell QMax. Test datasets: hails/mmlu_no_train (100 questions) and openai/gsm8k (100 questions).