By Sagar Shankaran, Founder of CallSphere
A 7B model, an RTX 4070 Ti, and an afternoon — that's all it takes in 2026. We cover the Unsloth-recommended r=16 + DoRA recipe, target_modules=all-linear, NF4 quant, and how to avoid the silent chat-template footgun that breaks half of community LoRAs.
Key takeaways
TL;DR — On a single RTX 4070 Ti you can specialize Llama-3.1-8B or Qwen2.5-7B in an afternoon. Default to r=16 + α=16 + DoRA + target_modules="all-linear" with 4-bit NF4 quant and double-quant. Always re-run
tokenizer.apply_chat_templateand verify it matches the base model's expected format — mismatched templates silently produce broken adapters.
LoRA (Low-Rank Adaptation) freezes the base model and trains tiny adapter matrices that get added to selected weight projections. QLoRA wraps this in 4-bit quantization so a 7–8B model fits in 8 GB VRAM during training. The result: a 50–200 MB adapter file that captures your domain knowledge without touching base weights.
flowchart TD
BASE[Base 7B model FP16] --> Q[4-bit NF4 quantization]
Q --> FROZEN[Frozen weights]
FROZEN --> LORA[LoRA adapters: r=16]
DATA[Domain JSONL] --> TPL[apply_chat_template]
TPL --> TRAIN[SFTTrainer 2k steps]
LORA --> TRAIN
TRAIN --> ADAPT[adapter.safetensors 90MB]
ADAPT --> SERVE[vLLM with LoRA]
CallSphere runs 6 verticals · 37 agents · 90+ tools · 115+ DB tables. We use LoRA for two narrow paths:
{satisfied, neutral, churn-risk}. The 88 MB adapter beats GPT-4o-mini on F1 by 4 points and runs at $0.04/1K calls on our own A10G.Healthcare's post-call analytics pipeline still uses GPT-4o-mini (closed API, faster TTFB on small batches). OneRoof real-estate uses OpenAI Agents SDK with closed models. Everything ships on the same plans — $149 / $499 / $1,499 — with a 14-day trial and 22% partner affiliate.
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
model, tok = FastLanguageModel.from_pretrained(
"unsloth/Meta-Llama-3.1-8B-Instruct",
max_seq_length=4096, load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
model, r=16, lora_alpha=16, lora_dropout=0,
target_modules="all-linear",
use_dora=True, # Weight-decomposed LoRA, 2026 default
use_gradient_checkpointing="unsloth",
)
# CRITICAL: format with the base model's chat template
def fmt(ex): return tok.apply_chat_template(ex["messages"], tokenize=False)
trainer = SFTTrainer(
model=model, tokenizer=tok, train_dataset=ds.map(lambda x:{"text":fmt(x)}),
args=SFTConfig(
per_device_train_batch_size=2, gradient_accumulation_steps=4,
warmup_steps=20, max_steps=2000,
learning_rate=2e-4, optim="paged_adamw_8bit",
weight_decay=0.01, lr_scheduler_type="linear",
eval_strategy="steps", eval_steps=200,
),
)
trainer.train()
model.save_pretrained_merged("merged", tok, save_method="lora")
tok.apply_chat_template(...) and eyeball it.target_modules="all-linear".Q: r=8 vs r=16 vs r=64? Default r=16. Bump to r=64 only if you're teaching new factual knowledge (rarely the right tool). Below r=8 you risk under-fitting.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
Q: DoRA always? Yes for 2026 starting configs — better convergence on complex tasks at <1% extra compute.
Q: Can I serve adapters dynamically? Yes — vLLM, SGLang, and TGI all support hot-loading LoRA adapters per request, ideal for multi-tenant SaaS.
Q: Should I merge weights? For single-tenant deployments yes (faster inference). Multi-tenant: keep adapters separate.
Q: How does this compare to OpenAI fine-tuning cost? OSS LoRA: $0 + your GPU time (~$1.50 on a 4070 Ti afternoon). OpenAI gpt-4o-mini: ~$8 for the same 200K-token dataset.
LoRA & QLoRA Fine-Tuning for Open-Source AI Agents (2026) sounds like a single decision, but in production it splits into eval design, prompt cost, and observability. The deeper you push toward live traffic, the more those three pull against each other — better evals catch silent failures, prompt cost limits how often you can re-run them, and weak observability hides which retries are actually saving conversations versus burning latency budget.
Production AI agents live or die on three loops: evals, retries, and handoff state. CallSphere runs 37 agents across 6 verticals, each with its own eval suite — synthetic call transcripts replayed nightly with assertion checks on extracted entities (date, time, party size, insurance, address). Without that loop, prompt regressions ship silently and you only find out when bookings drop.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Structured tools beat free-form text every time. Our 90+ function tools all enforce JSON schemas validated server-side; if the model hallucinates an integer where a string is required, we retry with a corrective system message before falling back to a deterministic path. For long-running flows, we treat agent handoffs as a state machine — booking → confirmation → SMS — so context survives turn boundaries.
The Realtime API vs. async decision usually comes down to "is the user holding the phone right now?" If yes, Realtime; if no (callback queue, after-hours voicemail), async wins on cost-per-conversation, which we track per agent in 115+ database tables spanning all 6 verticals.
What's the right way to scope the proof-of-concept? CallSphere runs 37 production agents and 90+ function tools across 115+ database tables in 6 verticals, so most workflows you'd want already have a template. For a topic like "LoRA & QLoRA Fine-Tuning for Open-Source AI Agents (2026)", that means you're not starting from scratch — you're configuring an agent template that's already been hardened across thousands of conversations.
How do you handle compliance and data isolation? Day one is integration mapping (scheduler, CRM, messaging) and prompt tuning against your top 20 real call transcripts. Day two through five is shadow-mode running, where the agent transcribes and recommends but a human still answers, so you can compare side-by-side. Go-live is the moment your eval pass-rate clears your internal bar.
When does it make sense to switch from a managed model to a self-hosted one? The honest answer: it scales until your tool catalog gets stale. The agent is only as good as the integrations it can actually call, so the operational discipline is keeping schemas, webhooks, and fallback paths green. The platform handles the rest — observability, retries, multi-region routing — without your team owning the GPU layer.
Want to see how this maps to your stack? Book a live walkthrough at calendly.com/sagar-callsphere/new-meeting, or try the vertical-specific demo at healthcare.callsphere.tech. 14-day trial, no credit card, pilot live in 3–5 business days.
Written by
Sagar Shankaran· Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
See how AI voice agents work for your industry. Live demo available -- no signup required.
GPT-Realtime-2 brings GPT-5-class reasoning into voice. What that means for tool-call reliability, structured output, and production agent design.
The public MCP registry crossed 9,400 servers in April 2026. Here is a curated walkthrough of the SaaS MCP servers CallSphere mounts in production, with OAuth 2.1 PKCE patterns.
Neo4j's agent-memory project ships short-term, long-term, and reasoning memory in one graph. Microsoft Agent Framework and LangChain both wire it in. Here is the production pattern.
AI SDK 5 ships fully typed chat for React, Svelte, Vue, and Angular plus first-class agent loop primitives. Here are the patterns that matter for shipping in 2026.
Personalizing agents for one user is easy. Personalizing them for a million users is a memory-tier problem. The hot/warm/cold split and what each tier optimizes for.
Long-running agents accumulate noisy state. Five consolidation patterns — summarization, salience scoring, decay, dedup, and refactor — and when each one fits.
© 2026 CallSphere LLC. All rights reserved.