
Fine-Tuning with Hugging Face Transformers and PEFT: Complete Tutorial

A hands-on tutorial for fine-tuning open-source LLMs using Hugging Face Transformers, PEFT, and TRL libraries, covering setup, training configuration, evaluation, and pushing to the Hugging Face Hub.

The Hugging Face Fine-Tuning Stack

Hugging Face provides a complete stack for fine-tuning open-source models. The core libraries are:

  • transformers — model loading, tokenization, and inference
  • peft — parameter-efficient fine-tuning (LoRA, QLoRA)
  • trl — training utilities specifically for LLMs, including SFTTrainer
  • datasets — data loading and preprocessing
  • bitsandbytes — quantization support for QLoRA

Together, these libraries handle everything from data loading to model deployment. This tutorial walks through a complete fine-tuning workflow from start to finish.

Environment Setup

# Install required packages
# pip install torch transformers peft trl datasets bitsandbytes accelerate

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import LoraConfig, TaskType
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

# Verify GPU availability
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

Loading the Base Model with QLoRA

model_name = "meta-llama/Llama-3.1-8B-Instruct"

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",  # requires the flash-attn package; remove on unsupported GPUs
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
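To see why the 4-bit configuration above matters, a back-of-the-envelope sketch of weight memory at different precisions (weights only — activations, LoRA parameters, and quantization constants add overhead on top):

```python
def quantized_weight_memory_gb(num_params: float, bits: int) -> float:
    """Approximate weight memory for a model at a given precision,
    ignoring quantization constants, activations, and optimizer state."""
    return num_params * bits / 8 / 1e9

# An 8B-parameter model at different precisions:
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{quantized_weight_memory_gb(8e9, bits):.1f} GB")
# → 16-bit: ~16.0 GB, 8-bit: ~8.0 GB, 4-bit: ~4.0 GB
```

At 4 bits the 8B weights fit comfortably on a single 24 GB consumer GPU, which is the point of QLoRA.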

Preparing the Dataset

The SFTTrainer works best with datasets in conversational format — a messages column containing lists of role/content dicts.

from datasets import Dataset
import json

def load_training_data(filepath: str) -> Dataset:
    """Load JSONL training data into a Hugging Face Dataset."""
    examples = []
    with open(filepath, "r") as f:
        for line in f:
            data = json.loads(line)
            examples.append({"messages": data["messages"]})
    return Dataset.from_list(examples)

# Load and split dataset
full_dataset = load_training_data("training_data.jsonl")
split = full_dataset.train_test_split(test_size=0.1, seed=42)

train_dataset = split["train"]
eval_dataset = split["test"]

print(f"Training examples: {len(train_dataset)}")
print(f"Evaluation examples: {len(eval_dataset)}")

# Inspect one example
print(json.dumps(train_dataset[0]["messages"], indent=2))
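Before handing data to the trainer, a quick schema check can catch malformed records early. A minimal sketch — the validation rules here are illustrative, not exhaustive:

```python
def validate_example(messages: list[dict]) -> list[str]:
    """Return a list of problems with one training example (empty = OK)."""
    problems = []
    valid_roles = {"system", "user", "assistant"}
    for i, msg in enumerate(messages):
        if set(msg) != {"role", "content"}:
            problems.append(f"message {i}: keys must be exactly role/content")
        elif msg["role"] not in valid_roles:
            problems.append(f"message {i}: unknown role {msg['role']!r}")
    if not messages or messages[-1].get("role") != "assistant":
        problems.append("conversation must end with an assistant message")
    return problems

example = [
    {"role": "user", "content": "Code E11.9"},
    {"role": "assistant", "content": "Type 2 diabetes without complications"},
]
print(validate_example(example))  # []
```

Running this over the full dataset before training is cheap insurance against a crash (or silently corrupted loss) halfway through an epoch.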

Configuring LoRA

# LoRA configuration
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
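With r=16 and these seven target modules, the trainable parameter count can be estimated analytically: LoRA adds matrices A (r × d_in) and B (d_out × r) per module, i.e. r · (d_in + d_out) parameters each. A sketch using Llama-3.1-8B's published dimensions (hidden size 4096, MLP intermediate size 14336, grouped-query K/V projections of width 1024, 32 layers):

```python
def lora_param_count(r: int, shapes: list[tuple[int, int]], num_layers: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) per target module,
    so each module contributes r * (d_in + d_out) trainable params."""
    return num_layers * sum(r * (d_in + d_out) for d_in, d_out in shapes)

# (d_in, d_out) for the targeted projections in Llama-3.1-8B
shapes = [
    (4096, 4096),   # q_proj
    (4096, 1024),   # k_proj (grouped-query attention)
    (4096, 1024),   # v_proj
    (4096, 4096),   # o_proj
    (4096, 14336),  # gate_proj
    (4096, 14336),  # up_proj
    (14336, 4096),  # down_proj
]
total = lora_param_count(r=16, shapes=shapes, num_layers=32)
print(f"~{total / 1e6:.0f}M trainable parameters")  # ~42M
```

This should roughly match what `print_trainable_parameters()` reports later — on the order of 0.5% of the 8B base parameters.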

Setting Up the SFT Trainer

The SFTTrainer from TRL handles chat template formatting, packing, and training loop management.

# Training configuration
training_args = SFTConfig(
    output_dir="./llama3-finetune",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,    # Effective batch size: 4 * 4 = 16
    gradient_checkpointing=True,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    bf16=True,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    save_total_limit=3,
    max_seq_length=2048,
    packing=False,                    # Set True to pack multiple examples
    report_to="none",                 # Use "wandb" for experiment tracking
)

# Initialize trainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)

# Check trainable parameters
trainer.model.print_trainable_parameters()
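It helps to sanity-check eval_steps and save_steps against the total number of optimizer steps the configuration implies. A quick sketch — the dataset size of 1,000 here is hypothetical:

```python
import math

def training_steps(num_examples: int, per_device_batch: int,
                   grad_accum: int, epochs: int, num_gpus: int = 1) -> int:
    """Total optimizer steps for a given training configuration."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    return math.ceil(num_examples / effective_batch) * epochs

# e.g. 1,000 training examples with the config above (effective batch 16):
print(training_steps(1000, per_device_batch=4, grad_accum=4, epochs=3))  # 189
```

With ~189 total steps, eval_steps=50 gives only a handful of evaluations; for smaller datasets you may want to lower it.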

Training

# Start training
train_result = trainer.train()

# Print training metrics
print(f"Training loss: {train_result.training_loss:.4f}")
print(f"Training runtime: {train_result.metrics['train_runtime']:.0f}s")
print(f"Samples per second: {train_result.metrics['train_samples_per_second']:.1f}")

# Save the LoRA adapter
trainer.save_model("./llama3-finetune/final")
tokenizer.save_pretrained("./llama3-finetune/final")

Evaluation

from transformers import pipeline

# Load the fine-tuned model for inference
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def evaluate_on_test(pipe, test_data, num_samples=20):
    """Run model on test examples and collect results."""
    results = []
    for i in range(min(num_samples, len(test_data))):
        example = test_data[i]
        messages = example["messages"]

        # Use all messages except the last (assistant response) as input
        prompt_messages = messages[:-1]
        expected = messages[-1]["content"]

        output = pipe(
            prompt_messages,
            max_new_tokens=512,
            temperature=0.1,
            do_sample=True,
        )
        generated = output[0]["generated_text"][-1]["content"]

        results.append({
            "input": messages[-2]["content"][:100],
            "expected": expected[:100],
            "generated": generated[:100],
        })

    return results

results = evaluate_on_test(pipe, eval_dataset)
for r in results[:5]:
    print(f"Input:    {r['input']}")
    print(f"Expected: {r['expected']}")
    print(f"Got:      {r['generated']}")
    print("---")
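Eyeballing five examples is a start; an aggregate metric over the collected results gives a quicker signal. A sketch using normalized exact match — the sample records below are hypothetical, and most tasks ultimately need task-specific scoring:

```python
def exact_match_rate(results: list[dict]) -> float:
    """Fraction of examples where the generation matches the reference
    after lowercasing and whitespace normalization -- a crude first-pass
    metric, not a substitute for task-specific evaluation."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    matches = sum(norm(r["generated"]) == norm(r["expected"]) for r in results)
    return matches / len(results) if results else 0.0

sample = [
    {"expected": "E11.9", "generated": "e11.9"},
    {"expected": "I10", "generated": "I10 (essential hypertension)"},
]
print(f"Exact match: {exact_match_rate(sample):.0%}")  # 50%
```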

Pushing to Hugging Face Hub

# Login to Hugging Face (run once)
# huggingface-cli login --token hf_YOUR_TOKEN

# Push the LoRA adapter to Hub
trainer.model.push_to_hub(
    "your-username/llama3-medical-coder-lora",
    private=True,
)
tokenizer.push_to_hub(
    "your-username/llama3-medical-coder-lora",
    private=True,
)

# To merge the adapter into the base weights and push the full model,
# reload the adapter on top of a full-precision base model -- merging
# directly into 4-bit quantized weights degrades quality:
from peft import AutoPeftModelForCausalLM

merged_model = AutoPeftModelForCausalLM.from_pretrained(
    "./llama3-finetune/final",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
merged = merged_model.merge_and_unload()
merged.push_to_hub(
    "your-username/llama3-medical-coder-merged",
    private=True,
)

FAQ

What is the difference between SFTTrainer and the standard Trainer?

SFTTrainer (Supervised Fine-Tuning Trainer) from TRL is specifically designed for LLM fine-tuning. It automatically handles chat template formatting, supports packing multiple short examples into a single sequence for efficiency, and integrates seamlessly with PEFT adapters. The standard Trainer from transformers works for general training but requires you to handle tokenization, padding, and label masking manually for language model fine-tuning.

How do I choose between packing=True and packing=False?

Packing concatenates multiple training examples into a single sequence to maximize GPU utilization. Enable packing when your examples are short (under 25% of max_seq_length) and you want faster training. Disable packing when example boundaries matter — for instance, if your system prompts vary between examples, packing can create confusing boundaries. Start with packing disabled and enable it only if training is slow due to short sequences.
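The 25% rule of thumb can be checked directly against tokenized lengths. A sketch, assuming you have already computed per-example token counts — the lengths below are placeholders:

```python
def should_pack(token_lengths: list[int], max_seq_length: int,
                threshold: float = 0.25) -> bool:
    """Suggest packing when nearly all examples use a small fraction
    of the context window (the 25% rule of thumb above)."""
    short = sum(1 for n in token_lengths if n < threshold * max_seq_length)
    return short / len(token_lengths) > 0.9

# Hypothetical token counts per example:
lengths = [180, 220, 310, 150, 95, 400, 260, 120, 210, 175]
print(should_pack(lengths, max_seq_length=2048))  # True
```

Here every example is well under 512 tokens (25% of 2048), so packing would noticeably improve GPU utilization.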

How do I resume training from a checkpoint if it gets interrupted?

SFTTrainer saves checkpoints automatically based on your save_strategy configuration. To resume, pass the checkpoint directory to the resume_from_checkpoint parameter: trainer.train(resume_from_checkpoint="./llama3-finetune/checkpoint-150"). The trainer restores the model weights, optimizer state, learning rate schedule, and data loader position so training continues exactly where it left off.


#HuggingFace #PEFT #Transformers #TRL #FineTuning #SFT #AgenticAI #LearnAI #AIEngineering

Written by the CallSphere Team.
