---
title: "LoRA and QLoRA: Parameter-Efficient Fine-Tuning for Open-Source LLMs"
description: "Understand how LoRA and QLoRA enable fine-tuning of large language models on consumer hardware by training only a small fraction of parameters, with practical examples using Hugging Face and PEFT."
canonical: https://callsphere.ai/blog/lora-qlora-parameter-efficient-fine-tuning-open-source-llms
category: "Learn Agentic AI"
tags: ["LoRA", "QLoRA", "PEFT", "Fine-Tuning", "Open Source LLMs", "Hugging Face"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T02:35:51.898Z
---

# LoRA and QLoRA: Parameter-Efficient Fine-Tuning for Open-Source LLMs

> Understand how LoRA and QLoRA enable fine-tuning of large language models on consumer hardware by training only a small fraction of parameters, with practical examples using Hugging Face and PEFT.

## The Problem with Full Fine-Tuning

Full fine-tuning updates every parameter in a model. For a 7-billion-parameter model, that means storing 7 billion gradients, two Adam optimizer states per parameter, and the updated weights themselves during training. A single full fine-tuning run for Llama 3 8B requires roughly 60-80 GB of GPU memory, well beyond any single consumer GPU.

LoRA (Low-Rank Adaptation) solves this by freezing the original model weights and injecting small trainable matrices into specific layers. Instead of updating all 7 billion parameters, you train a few million to a few tens of millions, well under 1% of the total. QLoRA goes further by quantizing the frozen base model to 4-bit precision, shrinking its memory footprint roughly fourfold compared to 16-bit.

## How LoRA Works

LoRA decomposes weight updates into two small matrices. Instead of computing a full weight update matrix W (dimensions d x d, potentially millions of parameters), LoRA computes two matrices: A (d x r) and B (r x d), where r (the rank) is much smaller than d — typically 8, 16, or 32.

```mermaid
flowchart LR
    DATA[("Curated dataset
instruction or chat")]
    CLEAN["Clean and dedupe
PII filter"]
    TOK["Tokenize and pack"]
    METHOD{"Method"}
    LORA["LoRA or QLoRA
adapters only"]
    SFT["Full SFT
all params"]
    DPO["DPO or RLHF
preference learning"]
    EVAL["Held out eval
plus regression suite"]
    DEPLOY[("Adapter or
merged model")]
    DATA --> CLEAN --> TOK --> METHOD
    METHOD --> LORA --> EVAL
    METHOD --> SFT --> EVAL
    METHOD --> DPO --> EVAL
    EVAL --> DEPLOY
    style METHOD fill:#4f46e5,stroke:#4338ca,color:#fff
    style EVAL fill:#f59e0b,stroke:#d97706,color:#1f2937
    style DEPLOY fill:#059669,stroke:#047857,color:#fff
```

The effective weight update is the product A * B, which has the same dimensions as W but is parameterized by far fewer values. After training, the low-rank matrices can be merged back into the base weights, so inference incurs zero additional latency.

```python
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    def __init__(self, original_layer: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.original = original_layer
        self.scaling = alpha / rank
        for p in self.original.parameters():
            p.requires_grad = False  # Freeze original weight (and bias, if present)
        d_in, d_out = original_layer.in_features, original_layer.out_features
        # A starts small-random, B starts at zero, so the update begins as a no-op
        self.lora_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x):
        return self.original(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# A 4096x4096 layer = 16.7M params. LoRA rank 16 = 131K params (0.78%)
```
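The parameter arithmetic in the closing comment is easy to verify directly; a quick back-of-the-envelope check:

```python
d, r = 4096, 16

full_update = d * d              # a dense d x d weight update
lora_update = r * d + d * r      # matrix A (r x d) plus matrix B (d x r)

print(f"full: {full_update:,}")  # full: 16,777,216
print(f"lora: {lora_update:,}")  # lora: 131,072
print(f"ratio: {lora_update / full_update:.2%}")  # ratio: 0.78%
```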

## QLoRA: Adding Quantization

QLoRA combines LoRA with 4-bit quantization of the base model. The frozen weights are stored in NF4 (NormalFloat4) format, which is specifically designed for normally distributed neural network weights. This reduces the base model memory footprint by roughly 4x compared to 16-bit.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# QLoRA configuration: 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # Nested quantization for extra savings
)

model_name = "meta-llama/Llama-3.1-8B-Instruct"

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Prepare model for k-bit training (handles gradient checkpointing)
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    r=16,                          # Rank
    lora_alpha=32,                 # Scaling factor
    target_modules=[               # Which layers to apply LoRA to
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Apply LoRA adapters
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Roughly 42M trainable params with these seven target modules, about 0.5% of
# the 8B base (exact reported totals vary with the quantization backend)
```
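From here the PEFT-wrapped model trains like any other Hugging Face model. Below is a minimal sketch of the training arguments; the hyperparameters are common QLoRA starting points rather than tuned values, and the dataset and `Trainer` wiring are assumed to exist elsewhere:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./qlora-out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # effective batch size of 16
    learning_rate=2e-4,              # LoRA tolerates higher LRs than full fine-tuning
    num_train_epochs=3,
    bf16=True,
    logging_steps=10,
    optim="paged_adamw_8bit",        # paged 8-bit optimizer from the QLoRA paper
)
# Pass training_args, the PEFT model, and a tokenized dataset to transformers.Trainer
```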

## Choosing the Right Rank

The rank (r) controls the capacity of the LoRA adaptation. Higher ranks can learn more complex transformations but use more memory and risk overfitting.

```python
def estimate_lora_params(
    hidden_size: int,
    num_layers: int,
    rank: int,
    num_target_modules: int = 7,  # q, k, v, o, gate, up, down
) -> dict:
    """Estimate trainable parameters for different LoRA ranks.

    Approximates every target projection as hidden_size x hidden_size;
    real models (GQA attention, wider MLP layers) deviate from this.
    """
    params_per_layer = num_target_modules * 2 * hidden_size * rank
    total_params = params_per_layer * num_layers

    return {
        "rank": rank,
        "params_per_layer": f"{params_per_layer:,}",
        "total_trainable": f"{total_params:,}",
        "total_mb": f"{total_params * 2 / 1024**2:.1f} MB",  # bf16
    }

# Llama 3.1 8B: hidden_size=4096, 32 layers
for r in [4, 8, 16, 32, 64]:
    result = estimate_lora_params(4096, 32, r)
    print(f"Rank {r:2d}: {result['total_trainable']:>12s} params ({result['total_mb']})")

# Rank  4:    7,340,032 params (14.0 MB)
# Rank  8:   14,680,064 params (28.0 MB)
# Rank 16:   29,360,128 params (56.0 MB)
# Rank 32:   58,720,256 params (112.0 MB)
# Rank 64:  117,440,512 params (224.0 MB)
```

**Practical guidelines:** Use rank 8 for simple style and format tasks. Use rank 16-32 for moderate domain adaptation. Use rank 64 only for complex tasks with abundant training data.
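These guidelines can be folded into a small, admittedly crude heuristic; the thresholds below are illustrative defaults, not empirical rules:

```python
def suggest_rank(task_complexity: str, num_examples: int) -> int:
    """Illustrative LoRA rank heuristic based on task type and data volume."""
    if task_complexity == "style":   # tone, formatting, persona
        return 8
    if task_complexity == "domain":  # new vocabulary, moderate adaptation
        return 32 if num_examples >= 10_000 else 16
    # complex reasoning or heavy domain shift: only go high with abundant data
    return 64 if num_examples >= 50_000 else 32

print(suggest_rank("style", 2_000))    # 8
print(suggest_rank("domain", 25_000))  # 32
print(suggest_rank("complex", 5_000))  # 32
```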

## Memory Requirements Comparison

| Configuration | Base Model | Adapters | Gradients + Optimizer | Total GPU RAM |
| --- | --- | --- | --- | --- |
| Full fine-tune (bf16) | 16 GB | — | 48 GB | ~64 GB |
| LoRA (bf16 base) | 16 GB | 56 MB | 168 MB | ~18 GB |
| QLoRA (4-bit base) | 4.5 GB | 56 MB | 168 MB | ~6 GB |

QLoRA makes it possible to fine-tune an 8B model on a single GPU with 8 GB of VRAM, such as a consumer RTX 3070, or on a free Google Colab T4 (16 GB). Activations still add overhead on top of the table above, so short sequence lengths and gradient checkpointing matter at the low end.
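The table's totals follow from simple arithmetic. Here is a sketch using rough rule-of-thumb constants (bf16 gradients plus two 16-bit optimizer states, activations excluded); the adapter count assumes the rank-16, seven-module estimate from earlier:

```python
def training_gb(total_params: int, trainable: int, weight_bytes: float) -> float:
    """Weights + bf16 gradients + two 16-bit optimizer states, in decimal GB.
    Ignores activations, which add a few more GB depending on sequence length."""
    return (total_params * weight_bytes + trainable * 2 + trainable * 4) / 1e9

EIGHT_B = 8_000_000_000
LORA_R16 = 29_360_128  # rank-16 adapters across 7 modules

print(f"Full FT: {training_gb(EIGHT_B, EIGHT_B, 2):.0f} GB")    # Full FT: 64 GB
print(f"LoRA   : {training_gb(EIGHT_B, LORA_R16, 2):.1f} GB")   # LoRA   : 16.2 GB
print(f"QLoRA  : {training_gb(EIGHT_B, LORA_R16, 0.5):.1f} GB") # QLoRA  : 4.2 GB
```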

## Merging and Deploying LoRA Adapters

After training, merge the LoRA weights back into the base model for deployment with zero overhead.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in full precision (merging into 4-bit weights is lossy)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Load and merge LoRA adapter
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")
merged_model = model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")
```

## FAQ

### What is the difference between LoRA rank and LoRA alpha?

Rank (r) determines the size of the low-rank matrices and thus the capacity of the adaptation. Alpha controls the scaling factor applied to the LoRA output. The effective scaling is alpha/rank. A common pattern is to set alpha to 2x the rank (e.g., r=16, alpha=32). Higher alpha amplifies the LoRA contribution relative to the base model.
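The interaction is easy to see numerically: holding alpha fixed while doubling the rank halves the effective scaling, which is why alpha is usually raised alongside the rank.

```python
def lora_scaling(alpha: float, rank: int) -> float:
    """Effective multiplier applied to the LoRA output."""
    return alpha / rank

for r in [8, 16, 32]:
    print(f"r={r:2d}, alpha=32 -> scaling {lora_scaling(32, r)}")
# r= 8, alpha=32 -> scaling 4.0
# r=16, alpha=32 -> scaling 2.0
# r=32, alpha=32 -> scaling 1.0
```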

### Can I apply multiple LoRA adapters to the same model?

Yes. You can train separate LoRA adapters for different tasks and switch between them at inference time without reloading the base model. Libraries like PEFT support loading multiple adapters and selecting which one is active. You can even merge multiple adapters, though this requires care to avoid conflicting weight updates.

### Is QLoRA quality worse than full LoRA due to the 4-bit quantization?

Research shows that QLoRA matches full-precision LoRA quality in most benchmarks. The key insight is that quantization only affects the frozen base weights, not the trainable LoRA parameters, which remain in bfloat16. The double quantization technique in QLoRA further reduces the quantization error. In practice, the quality difference is negligible for most fine-tuning tasks.

---

#LoRA #QLoRA #PEFT #FineTuning #OpenSourceLLMs #HuggingFace #AgenticAI #LearnAI #AIEngineering

