
Quantization Techniques: Running Large Models on Smaller Hardware Without Losing Accuracy | CallSphere Blog

Quantization enables deploying large language models on constrained hardware by reducing numerical precision. Learn about FP4, FP8, INT8, and GPTQ techniques with practical accuracy trade-off analysis.

Why Quantization Matters

A 70-billion parameter model stored in standard FP16 precision requires approximately 140 GB of GPU memory just for the weights — before accounting for the KV cache, activations, and framework overhead. That exceeds the capacity of any single consumer GPU and requires multiple enterprise-grade GPUs.

Quantization reduces the numerical precision of model weights (and sometimes activations) from 16-bit floating point to lower-precision formats like 8-bit integers or 4-bit floats. The result: a 70B model that required 140 GB in FP16 fits in 35 GB at INT4 — runnable on a single high-end consumer GPU.
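The memory figures above are just parameter count times bytes per weight. A quick sanity check of the arithmetic (the helper name is illustrative):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Weights-only memory in decimal GB; KV cache, activations,
    and framework overhead come on top of this figure."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(weight_memory_gb(70, 16))  # FP16: 140.0 GB
print(weight_memory_gb(70, 8))   # FP8/INT8: 70.0 GB
print(weight_memory_gb(70, 4))   # INT4: 35.0 GB
```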

The engineering challenge is doing this without meaningful quality degradation. Modern quantization techniques have gotten remarkably good at this trade-off.

Numerical Formats Explained

Understanding the available formats is the foundation for choosing a quantization strategy.

FP16 (16-bit Floating Point)

The standard training and serving precision for most models. Provides a good balance between range and precision with 1 sign bit, 5 exponent bits, and 10 mantissa bits.

BF16 (Brain Floating Point 16)

Same total bits as FP16 but with 8 exponent bits and 7 mantissa bits. Larger dynamic range at the cost of precision. Preferred for training because gradient values span a wide range.

FP8 (8-bit Floating Point)

Two variants: E4M3 (4 exponent, 3 mantissa) for forward pass and E5M2 (5 exponent, 2 mantissa) for gradients. Halves memory compared to FP16 with minimal quality loss — typically less than 0.5% degradation on standard benchmarks.

INT8 (8-bit Integer)

Maps floating-point values to 256 integer levels. Requires calibration to determine the scaling factor that maps the float range to integers. Highly hardware-efficient — most modern GPUs have dedicated INT8 compute units.
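A minimal sketch of symmetric "absmax" INT8 quantization, showing the calibration step the paragraph describes. This is one common calibration scheme, not the only one; function names are illustrative:

```python
import numpy as np

def calibrate_scale(calibration_values: np.ndarray) -> float:
    # Symmetric absmax calibration: map the largest observed
    # magnitude to the edge of the signed 8-bit range [-127, 127].
    return float(np.max(np.abs(calibration_values))) / 127.0

def quantize_int8(x: np.ndarray, scale: float) -> np.ndarray:
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)
scale = calibrate_scale(w)
w_hat = dequantize(quantize_int8(w, scale), scale)
# Round-to-nearest bounds the per-weight error by half a step.
print(np.max(np.abs(w - w_hat)))
```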

INT4 / FP4 (4-bit)

Extreme compression: each weight uses only 4 bits. Quality preservation depends heavily on the quantization algorithm. Naive round-to-nearest INT4 quantization typically degrades quality to the point of being unusable; advanced methods like GPTQ and AWQ make it practical.

Quantization Methods

Post-Training Quantization (PTQ)

PTQ quantizes a pre-trained model without additional training. It is fast and requires only a small calibration dataset (typically 128 to 512 examples).

# Example: Quantizing a model with bitsandbytes
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 - optimal for normally distributed weights
    bnb_4bit_compute_dtype="bfloat16",     # Compute in BF16 for accuracy
    bnb_4bit_use_double_quant=True,        # Quantize the quantization constants too
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-70B",
    quantization_config=quantization_config,
    device_map="auto",
)
# 70B model now fits in ~35GB VRAM

GPTQ (Generative Pre-trained Transformer Quantization)

GPTQ is a one-shot weight quantization method that minimizes the layer-wise reconstruction error. For each layer, it finds the quantized weights that produce the most similar output to the original FP16 weights when given calibration data.

Key advantages:


  • Produces high-quality INT4 quantized models
  • One-time cost: quantization takes hours, but the resulting model serves indefinitely
  • Broad hardware compatibility
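GPTQ's full algorithm uses second-order (Hessian) information and is beyond a short sketch, but the objective it minimizes is easy to state: for each layer with weights W and calibration inputs X, find quantized weights that minimize the reconstruction error on W·X. A toy illustration of measuring that error under a naive round-to-nearest 4-bit baseline (sizes and names are illustrative, and this is the baseline GPTQ improves on, not GPTQ itself):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0, 0.02, size=(128, 64)).astype(np.float32)  # layer weights
X = rng.normal(0, 1.0, size=(64, 256)).astype(np.float32)   # calibration activations

def rtn_quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    # Per-output-row round-to-nearest: the naive baseline.
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w), axis=1, keepdims=True) / qmax
    return np.round(w / scale) * scale

W_hat = rtn_quantize(W)
# GPTQ's layer-wise objective is the error on calibration outputs,
# not raw weight error: weights that barely affect W @ X may be
# quantized coarsely without hurting the layer's output.
output_err = np.linalg.norm(W @ X - W_hat @ X) / np.linalg.norm(W @ X)
weight_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(output_err, weight_err)
```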

AWQ (Activation-Aware Weight Quantization)

AWQ observes that not all weights are equally important. Weights corresponding to large activations contribute more to the output. AWQ protects these salient weights by keeping them at higher precision while aggressively quantizing less important weights.
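The saliency idea can be sketched with a toy mixed-precision variant: rank input channels by calibration activation magnitude and keep the weights feeding the largest activations at full precision. Note that real AWQ avoids mixed-precision kernels by rescaling salient channels instead; this sketch only illustrates why protecting them helps. All sizes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(0, 0.02, size=(128, 64)).astype(np.float32)  # layer weights
X = rng.normal(0, 1.0, size=(64, 256)).astype(np.float32)   # calibration activations
X[:4] *= 20.0  # a handful of input channels carry much larger activations

def rtn(w: np.ndarray, bits: int = 4) -> np.ndarray:
    # Per-output-row round-to-nearest 4-bit quantization.
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w), axis=1, keepdims=True) / qmax
    return np.round(w / scale) * scale

# Rank input channels by mean activation magnitude; protect the top few.
salience = np.mean(np.abs(X), axis=1)
salient = np.argsort(salience)[-4:]

W_q = rtn(W)
W_q[:, salient] = W[:, salient]  # keep salient weight columns at full precision

err_naive = np.linalg.norm(W @ X - rtn(W) @ X)
err_aware = np.linalg.norm(W @ X - W_q @ X)
print(err_aware < err_naive)  # protecting salient weights cuts output error
```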

GGUF / llama.cpp Quantization

The GGUF format (used by llama.cpp) supports a variety of quantization levels from Q2_K (2-bit) through Q8_0 (8-bit). It uses a block-wise quantization scheme where each block of weights gets its own scaling factor.

# Common GGUF quantization levels and their trade-offs:
Q2_K   - 2.63 bpw - ~60% quality retention   - extreme compression
Q3_K_M - 3.07 bpw - ~75% quality retention   - aggressive but usable
Q4_K_M - 4.83 bpw - ~92% quality retention   - best balance for most use cases
Q5_K_M - 5.69 bpw - ~96% quality retention   - high quality
Q6_K   - 6.56 bpw - ~99% quality retention   - near-lossless
Q8_0   - 8.50 bpw - ~99.5% quality retention - minimal compression
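The block-wise idea itself fits in a few lines. Per-block absmax scaling is an assumption of this sketch, not llama.cpp's exact K-quant layout, and the names are illustrative:

```python
import numpy as np

def quantize_blockwise(w: np.ndarray, block_size: int = 32, bits: int = 4):
    # Each block of `block_size` weights gets its own absmax scale, so
    # one outlier only degrades its own block, not the whole tensor.
    qmax = 2 ** (bits - 1) - 1
    blocks = w.reshape(-1, block_size)
    scales = np.max(np.abs(blocks), axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    return np.round(blocks / scales).astype(np.int8), scales

def dequantize_blockwise(q: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(64, 64)).astype(np.float32)
w[0, 0] = 1.0  # a single outlier weight
q, scales = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, scales, w.shape)
# Only the outlier's block pays for the large scale; other blocks stay accurate.
print(np.abs(w - w_hat).max())
```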

Accuracy Trade-offs in Practice

The theoretical information loss from quantization does not always translate into meaningful quality degradation. Here are measured results from a representative 70B model:

Precision       Memory (GB)   MMLU    HumanEval   MT-Bench   Throughput vs FP16
FP16            140           82.1%   81.7%       8.9        1.0x
FP8             70            81.8%   81.5%       8.9        1.4x
INT8            70            81.5%   80.9%       8.8        1.6x
INT4 (GPTQ)     35            80.3%   79.2%       8.6        1.8x
INT4 (AWQ)      35            80.7%   79.8%       8.7        1.8x
Q4_K_M (GGUF)   38            80.1%   78.5%       8.5        1.5x

The pattern is clear: FP8 and INT8 quantization are nearly lossless for most applications. INT4 introduces measurable but often acceptable degradation.

Mixed-Precision Strategies

The most sophisticated deployments do not apply uniform quantization. Instead, they use different precision for different components:

  • Attention layers: Keep at FP8 or higher — these are critical for quality
  • FFN layers: Quantize more aggressively to INT4 — these tolerate compression better
  • Embedding layers: Keep at FP16 — quantization here disproportionately hurts quality
  • KV cache: Quantize to FP8 — saves memory at long context with minimal impact
# Mixed-precision quantization configuration example
layer_quant_config = {
    "attention.q_proj": "fp8",
    "attention.k_proj": "fp8",
    "attention.v_proj": "fp8",
    "attention.o_proj": "fp8",
    "mlp.gate_proj": "int4",
    "mlp.up_proj": "int4",
    "mlp.down_proj": "int4",
    "embed_tokens": "fp16",
    "lm_head": "fp16",
}

Quantization-Aware Training (QAT)

For teams willing to invest in retraining, QAT simulates quantization during the training process, allowing the model to adapt its weights to perform well at lower precision. QAT models consistently outperform post-training quantized models at the same bit width, typically by 1-3 percentage points.
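The core mechanism is "fake quantization" inserted into the forward pass: weights are rounded to the target grid so the network adapts to them, while gradients flow through the rounding unchanged (the straight-through estimator). A minimal, framework-agnostic sketch of the forward simulation; the function name is illustrative:

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    # Forward pass: quantize-then-dequantize, so downstream layers train
    # against the rounded weights they will actually see at inference.
    # Backward pass (in a real framework) uses the straight-through
    # estimator: gradients pass through the rounding as if it were the
    # identity, e.g. the PyTorch idiom w + (fake_quantize(w) - w).detach().
    return np.round(w / scale) * scale

w = np.linspace(-0.1, 0.1, 9)
print(fake_quantize(w))  # every value snapped to the signed 4-bit grid
```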


The cost is significant — QAT requires a full or partial training run — but for models being deployed at massive scale, the per-query savings from serving a QAT INT4 model vs a PTQ INT4 model can justify the upfront investment.

Practical Deployment Recommendations

  1. Start with FP8: It is nearly lossless, halves memory, and is natively supported on modern GPU architectures. This should be the default for production serving.

  2. Use INT4 for cost-constrained or edge deployments: When GPU budget is limited, GPTQ or AWQ INT4 quantization provides the best quality at 4-bit precision.

  3. Benchmark on your actual task: Academic benchmarks may not reflect your specific use case. Always evaluate quantized models on representative examples from your production workload.

  4. Quantize the KV cache separately: Even if you serve weights in FP8, quantizing the KV cache to FP8 saves substantial memory at long context lengths with minimal quality impact.

  5. Consider the full serving stack: Quantization interacts with other optimizations (batching, speculative decoding, paged attention). Test the complete pipeline, not just isolated components.
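To put recommendation 4 in numbers, KV cache size scales linearly with bytes per element, so FP8 halves it. A rough sizing formula, with model dimensions that are illustrative of a 70B-class architecture rather than any specific model:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int) -> float:
    # Factor of 2 covers both keys and values.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

# Illustrative config: 80 layers, 8 KV heads (grouped-query attention),
# head_dim 128, 32K context, batch of 8 concurrent requests.
fp16 = kv_cache_gb(80, 8, 128, seq_len=32768, batch=8, bytes_per_elem=2)
fp8 = kv_cache_gb(80, 8, 128, seq_len=32768, batch=8, bytes_per_elem=1)
print(fp16, fp8)  # FP8 halves KV-cache memory at long context
```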

Quantization is not a compromise — at FP8, it is essentially free performance. At INT4, it is an engineering trade-off that, when done correctly, enables deployments that would otherwise require 4x the hardware budget.

Frequently Asked Questions

What is model quantization in AI?

Quantization reduces the numerical precision of model weights and activations from higher-precision formats like FP16 to lower-precision formats like INT8, FP8, or INT4. A 70-billion parameter model that requires approximately 140 GB of GPU memory in FP16 can fit in just 35 GB at INT4 precision. Modern quantization techniques achieve this compression with minimal quality degradation, making large models deployable on significantly less expensive hardware.

What is the difference between FP8, INT8, and INT4 quantization?

FP8 retains floating-point representation at 8 bits and is widely considered the new default serving precision, delivering near-zero quality loss with 2x memory savings. INT8 uses integer representation and reduces memory by 2x with slightly more quality risk than FP8. INT4 achieves 4x memory reduction but requires calibration-based techniques like GPTQ or AWQ to maintain acceptable output quality, with typical quality degradation of 1 to 3 percentage points on benchmarks.

How does quantization affect model performance and accuracy?

At FP8 precision, quantization is essentially free performance with quality indistinguishable from FP16 on most tasks. At INT8, quality loss is under 1 percentage point for well-calibrated models. At INT4, quality degradation ranges from 1 to 5 percentage points depending on the technique used and model architecture. Post-training quantization methods like GPTQ and AWQ minimize this loss by calibrating on representative data, and mixing precision levels across different layers can further optimize the accuracy-efficiency trade-off.


Written by

CallSphere Team
