
Quantization Techniques: Running Large Models on Smaller Hardware Without Losing Accuracy | CallSphere Blog

Quantization enables deploying large language models on constrained hardware by reducing numerical precision. Learn about FP4, FP8, INT8, and GPTQ techniques with practical accuracy trade-off analysis.

Why Quantization Matters

A 70-billion parameter model stored in standard FP16 precision requires approximately 140 GB of GPU memory just for the weights — before accounting for the KV cache, activations, and framework overhead. That exceeds the capacity of any single consumer GPU and requires multiple enterprise-grade GPUs.

Quantization reduces the numerical precision of model weights (and sometimes activations) from 16-bit floating point to lower-precision formats like 8-bit integers or 4-bit floats. The result: a 70B model that required 140 GB in FP16 fits in 35 GB at INT4 — runnable on a single high-end consumer GPU.
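The memory figures above are just parameter count times bytes per weight. A quick sanity check of the arithmetic (the helper name is illustrative):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Weights-only memory in decimal GB; KV cache, activations,
    and framework overhead come on top of this figure."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(weight_memory_gb(70, 16))  # FP16: 140.0 GB
print(weight_memory_gb(70, 8))   # FP8/INT8: 70.0 GB
print(weight_memory_gb(70, 4))   # INT4: 35.0 GB
```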

The engineering challenge is doing this without meaningful quality degradation. Modern quantization techniques have gotten remarkably good at this trade-off.

Numerical Formats Explained

Understanding the available formats is the foundation for choosing a quantization strategy.

FP16 (16-bit Floating Point)

The standard training and serving precision for most models. Provides a good balance between range and precision with 1 sign bit, 5 exponent bits, and 10 mantissa bits.

BF16 (Brain Floating Point 16)

Same total bits as FP16 but with 8 exponent bits and 7 mantissa bits. Larger dynamic range at the cost of precision. Preferred for training because gradient values span a wide range.

FP8 (8-bit Floating Point)

Two variants: E4M3 (4 exponent, 3 mantissa) for forward pass and E5M2 (5 exponent, 2 mantissa) for gradients. Halves memory compared to FP16 with minimal quality loss — typically less than 0.5% degradation on standard benchmarks.

INT8 (8-bit Integer)

Maps floating-point values to 256 integer levels. Requires calibration to determine the scaling factor that maps the float range to integers. Highly hardware-efficient — most modern GPUs have dedicated INT8 compute units.
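A minimal sketch of symmetric "absmax" INT8 quantization, showing the calibration step the paragraph describes. This is one common calibration scheme, not the only one; function names are illustrative:

```python
import numpy as np

def calibrate_scale(calibration_values: np.ndarray) -> float:
    # Symmetric absmax calibration: map the largest observed
    # magnitude to the edge of the signed 8-bit range [-127, 127].
    return float(np.max(np.abs(calibration_values))) / 127.0

def quantize_int8(x: np.ndarray, scale: float) -> np.ndarray:
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)
scale = calibrate_scale(w)
w_hat = dequantize(quantize_int8(w, scale), scale)
# Round-to-nearest bounds the per-weight error by half a step.
print(np.max(np.abs(w - w_hat)))
```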

INT4 / FP4 (4-bit)

Extreme compression: each weight uses only 4 bits. Quality preservation depends heavily on the quantization algorithm. Naive round-to-nearest INT4 quantization typically degrades quality to the point of being unusable; advanced methods like GPTQ and AWQ make it practical.

Quantization Methods

Post-Training Quantization (PTQ)

PTQ quantizes a pre-trained model without additional training. It is fast and requires only a small calibration dataset (typically 128 to 512 examples).

# Example: Quantizing a model with bitsandbytes
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 - optimal for normally distributed weights
    bnb_4bit_compute_dtype="bfloat16",     # Compute in BF16 for accuracy
    bnb_4bit_use_double_quant=True,        # Quantize the quantization constants too
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-70B",
    quantization_config=quantization_config,
    device_map="auto",
)
# 70B model now fits in ~35GB VRAM

GPTQ (Generative Pre-trained Transformer Quantization)

GPTQ is a one-shot weight quantization method that minimizes the layer-wise reconstruction error. For each layer, it finds the quantized weights that produce the most similar output to the original FP16 weights when given calibration data.

Key advantages:


  • Produces high-quality INT4 quantized models
  • One-time cost: quantization takes hours, but the resulting model serves indefinitely
  • Broad hardware compatibility
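GPTQ's full algorithm uses second-order (Hessian) information and is beyond a short sketch, but the objective it minimizes is easy to state: for each layer with weights W and calibration inputs X, find quantized weights that minimize the reconstruction error on W·X. A toy illustration of measuring that error under a naive round-to-nearest 4-bit baseline (sizes and names are illustrative, and this is the baseline GPTQ improves on, not GPTQ itself):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0, 0.02, size=(128, 64)).astype(np.float32)  # layer weights
X = rng.normal(0, 1.0, size=(64, 256)).astype(np.float32)   # calibration activations

def rtn_quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    # Per-output-row round-to-nearest: the naive baseline.
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w), axis=1, keepdims=True) / qmax
    return np.round(w / scale) * scale

W_hat = rtn_quantize(W)
# GPTQ's layer-wise objective is the error on calibration outputs,
# not raw weight error: weights that barely affect W @ X may be
# quantized coarsely without hurting the layer's output.
output_err = np.linalg.norm(W @ X - W_hat @ X) / np.linalg.norm(W @ X)
weight_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(output_err, weight_err)
```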

AWQ (Activation-Aware Weight Quantization)

AWQ observes that not all weights are equally important. Weights corresponding to large activations contribute more to the output. AWQ protects these salient weights by keeping them at higher precision while aggressively quantizing less important weights.
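The saliency idea can be sketched with a toy mixed-precision variant: rank input channels by calibration activation magnitude and keep the weights feeding the largest activations at full precision. Note that real AWQ avoids mixed-precision kernels by rescaling salient channels instead; this sketch only illustrates why protecting them helps. All sizes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(0, 0.02, size=(128, 64)).astype(np.float32)  # layer weights
X = rng.normal(0, 1.0, size=(64, 256)).astype(np.float32)   # calibration activations
X[:4] *= 20.0  # a handful of input channels carry much larger activations

def rtn(w: np.ndarray, bits: int = 4) -> np.ndarray:
    # Per-output-row round-to-nearest 4-bit quantization.
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w), axis=1, keepdims=True) / qmax
    return np.round(w / scale) * scale

# Rank input channels by mean activation magnitude; protect the top few.
salience = np.mean(np.abs(X), axis=1)
salient = np.argsort(salience)[-4:]

W_q = rtn(W)
W_q[:, salient] = W[:, salient]  # keep salient weight columns at full precision

err_naive = np.linalg.norm(W @ X - rtn(W) @ X)
err_aware = np.linalg.norm(W @ X - W_q @ X)
print(err_aware < err_naive)  # protecting salient weights cuts output error
```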

GGUF / llama.cpp Quantization

The GGUF format (used by llama.cpp) supports a variety of quantization levels from Q2_K (2-bit) through Q8_0 (8-bit). It uses a block-wise quantization scheme where each block of weights gets its own scaling factor.

# Common GGUF quantization levels and their trade-offs:
Q2_K   - 2.63 bpw - ~60% quality retention   - extreme compression
Q3_K_M - 3.07 bpw - ~75% quality retention   - aggressive but usable
Q4_K_M - 4.83 bpw - ~92% quality retention   - best balance for most use cases
Q5_K_M - 5.69 bpw - ~96% quality retention   - high quality
Q6_K   - 6.56 bpw - ~99% quality retention   - near-lossless
Q8_0   - 8.50 bpw - ~99.5% quality retention - minimal compression
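The block-wise idea itself fits in a few lines. Per-block absmax scaling is an assumption of this sketch, not llama.cpp's exact K-quant layout, and the names are illustrative:

```python
import numpy as np

def quantize_blockwise(w: np.ndarray, block_size: int = 32, bits: int = 4):
    # Each block of `block_size` weights gets its own absmax scale, so
    # one outlier only degrades its own block, not the whole tensor.
    qmax = 2 ** (bits - 1) - 1
    blocks = w.reshape(-1, block_size)
    scales = np.max(np.abs(blocks), axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    return np.round(blocks / scales).astype(np.int8), scales

def dequantize_blockwise(q: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(64, 64)).astype(np.float32)
w[0, 0] = 1.0  # a single outlier weight
q, scales = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, scales, w.shape)
# Only the outlier's block pays for the large scale; other blocks stay accurate.
print(np.abs(w - w_hat).max())
```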

Accuracy Trade-offs in Practice

The theoretical information loss from quantization does not always translate into meaningful quality degradation. Here are measured results from a representative 70B model:

Precision       Memory (GB)   MMLU    HumanEval   MT-Bench   Throughput vs FP16
FP16            140           82.1%   81.7%       8.9        1.0x
FP8             70            81.8%   81.5%       8.9        1.4x
INT8            70            81.5%   80.9%       8.8        1.6x
INT4 (GPTQ)     35            80.3%   79.2%       8.6        1.8x
INT4 (AWQ)      35            80.7%   79.8%       8.7        1.8x
Q4_K_M (GGUF)   38            80.1%   78.5%       8.5        1.5x

The pattern is clear: FP8 and INT8 quantization are nearly lossless for most applications. INT4 introduces measurable but often acceptable degradation.

Mixed-Precision Strategies

The most sophisticated deployments do not apply uniform quantization. Instead, they use different precision for different components:

  • Attention layers: Keep at FP8 or higher — these are critical for quality
  • FFN layers: Quantize more aggressively to INT4 — these tolerate compression better
  • Embedding layers: Keep at FP16 — quantization here disproportionately hurts quality
  • KV cache: Quantize to FP8 — saves memory at long context with minimal impact
# Mixed-precision quantization configuration example
layer_quant_config = {
    "attention.q_proj": "fp8",
    "attention.k_proj": "fp8",
    "attention.v_proj": "fp8",
    "attention.o_proj": "fp8",
    "mlp.gate_proj": "int4",
    "mlp.up_proj": "int4",
    "mlp.down_proj": "int4",
    "embed_tokens": "fp16",
    "lm_head": "fp16",
}

Quantization-Aware Training (QAT)

For teams willing to invest in retraining, QAT simulates quantization during the training process, allowing the model to adapt its weights to perform well at lower precision. QAT models consistently outperform post-training quantized models at the same bit width, typically by 1-3 percentage points.
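The core mechanism is "fake quantization" inserted into the forward pass: weights are rounded to the target grid so the network adapts to them, while gradients flow through the rounding unchanged (the straight-through estimator). A minimal, framework-agnostic sketch of the forward simulation; the function name is illustrative:

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    # Forward pass: quantize-then-dequantize, so downstream layers train
    # against the rounded weights they will actually see at inference.
    # Backward pass (in a real framework) uses the straight-through
    # estimator: gradients pass through the rounding as if it were the
    # identity, e.g. the PyTorch idiom w + (fake_quantize(w) - w).detach().
    return np.round(w / scale) * scale

w = np.linspace(-0.1, 0.1, 9)
print(fake_quantize(w))  # every value snapped to the signed 4-bit grid
```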


The cost is significant — QAT requires a full or partial training run — but for models being deployed at massive scale, the per-query savings from serving a QAT INT4 model vs a PTQ INT4 model can justify the upfront investment.

Practical Deployment Recommendations

  1. Start with FP8: It is nearly lossless, halves memory, and is natively supported on modern GPU architectures. This should be the default for production serving.

  2. Use INT4 for cost-constrained or edge deployments: When GPU budget is limited, GPTQ or AWQ INT4 quantization provides the best quality at 4-bit precision.

  3. Benchmark on your actual task: Academic benchmarks may not reflect your specific use case. Always evaluate quantized models on representative examples from your production workload.

  4. Quantize the KV cache separately: Even if you serve weights in FP8, quantizing the KV cache to FP8 saves substantial memory at long context lengths with minimal quality impact.

  5. Consider the full serving stack: Quantization interacts with other optimizations (batching, speculative decoding, paged attention). Test the complete pipeline, not just isolated components.
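To put recommendation 4 in numbers, KV cache size scales linearly with bytes per element, so FP8 halves it. A rough sizing formula, with model dimensions that are illustrative of a 70B-class architecture rather than any specific model:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int) -> float:
    # Factor of 2 covers both keys and values.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

# Illustrative config: 80 layers, 8 KV heads (grouped-query attention),
# head_dim 128, 32K context, batch of 8 concurrent requests.
fp16 = kv_cache_gb(80, 8, 128, seq_len=32768, batch=8, bytes_per_elem=2)
fp8 = kv_cache_gb(80, 8, 128, seq_len=32768, batch=8, bytes_per_elem=1)
print(fp16, fp8)  # FP8 halves KV-cache memory at long context
```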

Quantization is not a compromise — at FP8, it is essentially free performance. At INT4, it is an engineering trade-off that, when done correctly, enables deployments that would otherwise require 4x the hardware budget.

Frequently Asked Questions

What is model quantization in AI?

Quantization reduces the numerical precision of model weights and activations from higher-precision formats like FP16 to lower-precision formats like INT8, FP8, or INT4. A 70-billion parameter model that requires approximately 140 GB of GPU memory in FP16 can fit in just 35 GB at INT4 precision. Modern quantization techniques achieve this compression with minimal quality degradation, making large models deployable on significantly less expensive hardware.

What is the difference between FP8, INT8, and INT4 quantization?

FP8 retains floating-point representation at 8 bits and is widely considered the new default serving precision, delivering near-zero quality loss with 2x memory savings. INT8 uses integer representation and reduces memory by 2x with slightly more quality risk than FP8. INT4 achieves 4x memory reduction but requires calibration-based techniques like GPTQ or AWQ to maintain acceptable output quality, with typical quality degradation of 1 to 3 percentage points on benchmarks.

How does quantization affect model performance and accuracy?

At FP8 precision, quantization is essentially free performance with quality indistinguishable from FP16 on most tasks. At INT8, quality loss is under 1 percentage point for well-calibrated models. At INT4, quality degradation ranges from 1 to 5 percentage points depending on the technique used and model architecture. Post-training quantization methods like GPTQ and AWQ minimize this loss by calibrating on representative data, and mixing precision levels across different layers can further optimize the accuracy-efficiency trade-off.


Written by

CallSphere Team
