---
title: "Running Open-Source LLMs Locally: Ollama, vLLM, and llama.cpp Setup Guide"
description: "A practical guide to running open-source language models on your own hardware using Ollama, vLLM, and llama.cpp, covering installation, model management, API compatibility, and performance optimization."
canonical: https://callsphere.ai/blog/running-open-source-llms-locally-ollama-vllm-llamacpp
category: "Learn Agentic AI"
tags: ["Ollama", "vLLM", "llama.cpp", "Local LLM", "Open Source", "Self-Hosted"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T01:02:45.012Z
---

# Running Open-Source LLMs Locally: Ollama, vLLM, and llama.cpp Setup Guide

> A practical guide to running open-source language models on your own hardware using Ollama, vLLM, and llama.cpp, covering installation, model management, API compatibility, and performance optimization.

## Why Run LLMs Locally

Running language models on your own hardware gives you data privacy, zero per-token costs, full control over the model, and no rate limits. The tradeoff is that you need to manage hardware, handle scaling, and accept that smaller local models will not match the quality of frontier cloud models like GPT-4o or Claude.

Three tools dominate the local LLM ecosystem. **Ollama** is the easiest to set up and best for development. **vLLM** delivers the highest throughput for production serving. **llama.cpp** provides maximum flexibility and runs on CPU-only machines.

## Ollama: The Easiest Path

Ollama packages model downloading, quantization, and serving into a single binary. It runs on macOS, Linux, and Windows.

```python
# After installing Ollama (curl -fsSL https://ollama.com/install.sh | sh)

# Pull a model
# ollama pull llama3.1:8b

# Ollama exposes an OpenAI-compatible API at http://localhost:11434
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required but not validated
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to merge two sorted lists."},
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)
```
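Streaming works through the same endpoint by passing `stream=True`. A minimal sketch (the helper names are our own; it assumes the Ollama server is running and the model has been pulled):

```python
def accumulate_deltas(chunks) -> str:
    """Join content deltas from a streamed chat completion, printing as they arrive."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some servers send a final empty chunk
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
            parts.append(delta)
    print()
    return "".join(parts)

def stream_reply(prompt: str, model: str = "llama3.1:8b") -> str:
    """Stream a reply token by token from the local Ollama server."""
    from openai import OpenAI  # same client as above, imported lazily
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    return accumulate_deltas(stream)
```

Because the helper only looks at `choices[0].delta.content`, the same loop works unchanged against vLLM or any other OpenAI-compatible streaming endpoint.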

**Managing models with Ollama:**

```python
import subprocess
import json

def list_ollama_models() -> list[dict]:
    """List all downloaded Ollama models."""
    result = subprocess.run(
        ["ollama", "list"], capture_output=True, text=True
    )
    lines = result.stdout.strip().split("\n")[1:]  # Skip header
    models = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 4:
            models.append({
                "name": parts[0],
                "id": parts[1],
                "size": parts[2] + " " + parts[3],
            })
    return models

def create_custom_model(
    name: str,
    base_model: str,
    system_prompt: str,
    temperature: float = 0.7,
) -> str:
    """Create a custom Ollama model with a Modelfile."""
    modelfile = f"""FROM {base_model}
SYSTEM {json.dumps(system_prompt)}
PARAMETER temperature {temperature}
PARAMETER num_ctx 4096
"""
    modelfile_path = f"/tmp/{name}.Modelfile"
    with open(modelfile_path, "w") as f:
        f.write(modelfile)

    result = subprocess.run(
        ["ollama", "create", name, "-f", modelfile_path],
        capture_output=True, text=True,
    )
    return result.stdout
```
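`ollama list` reports sizes as strings like `4.7 GB`. A small helper (our own sketch) makes the entries from `list_ollama_models()` sortable, which is handy when clearing disk space:

```python
_UNITS = {"KB": 1e3, "MB": 1e6, "GB": 1e9, "TB": 1e12}

def size_to_bytes(size: str) -> float:
    """Convert a size string like '4.7 GB' (as printed by `ollama list`) to bytes."""
    value, unit = size.split()
    return float(value) * _UNITS[unit.upper()]

def sort_models_by_size(models: list[dict]) -> list[dict]:
    """Return models largest-first, using the 'size' field from list_ollama_models()."""
    return sorted(models, key=lambda m: size_to_bytes(m["size"]), reverse=True)
```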

## vLLM: Production-Grade Serving

vLLM is an inference engine designed for high throughput. It uses PagedAttention to manage GPU memory efficiently, supports continuous batching, and delivers 2-4x higher throughput than naive Hugging Face Transformers inference.

```mermaid
flowchart LR
    REQ(["Request"])
    BATCH["Continuous batching
vLLM scheduler"]
    PREF{"Prefill or
decode?"}
    PRE["Prefill phase
parallel attention"]
    DEC["Decode phase
token by token"]
    KV[("Paged KV cache")]
    SAMP["Sampling
top-p, temp"]
    STREAM["Stream tokens
to client"]
    REQ --> BATCH --> PREF
    PREF -->|First token| PRE --> KV
    PREF -->|Next token| DEC
    KV --> DEC --> SAMP --> STREAM
    SAMP -->|EOS| DONE(["Response complete"])
    style BATCH fill:#4f46e5,stroke:#4338ca,color:#fff
    style KV fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style STREAM fill:#0ea5e9,stroke:#0369a1,color:#fff
    style DONE fill:#059669,stroke:#047857,color:#fff
```

```python
# Install: pip install vllm

# Start vLLM server (OpenAI-compatible)
# python -m vllm.entrypoints.openai.api_server \
#   --model meta-llama/Llama-3.1-8B-Instruct \
#   --dtype bfloat16 \
#   --max-model-len 8192 \
#   --gpu-memory-utilization 0.9 \
#   --port 8000

# Use exactly like OpenAI API
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

# Synchronous request
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Explain gradient descent in three sentences."},
    ],
    temperature=0.0,
    max_tokens=256,
)
print(response.choices[0].message.content)
```

**Benchmarking vLLM throughput:**

```python
import asyncio
import time
from openai import AsyncOpenAI

async def benchmark_throughput(
    base_url: str,
    model: str,
    num_requests: int = 100,
    max_concurrent: int = 10,
) -> dict:
    """Benchmark inference throughput with concurrent requests."""
    client = AsyncOpenAI(base_url=base_url, api_key="x")
    semaphore = asyncio.Semaphore(max_concurrent)
    latencies = []

    async def single_request(prompt: str):
        async with semaphore:
            start = time.perf_counter()
            response = await client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=128,
                temperature=0.0,
            )
            elapsed = time.perf_counter() - start
            latencies.append(elapsed)
            tokens = response.usage.completion_tokens
            return tokens

    prompts = [f"What is the square root of {i * 17}?" for i in range(num_requests)]
    start_total = time.perf_counter()
    results = await asyncio.gather(*[single_request(p) for p in prompts])
    total_time = time.perf_counter() - start_total

    total_tokens = sum(results)
    return {
        "total_requests": num_requests,
        "total_time_s": round(total_time, 2),
        "requests_per_second": round(num_requests / total_time, 1),
        "tokens_per_second": round(total_tokens / total_time, 1),
        "avg_latency_ms": round(sum(latencies) / len(latencies) * 1000, 0),
        "p99_latency_ms": round(sorted(latencies)[int(0.99 * len(latencies))] * 1000, 0),
    }

# Run against a live vLLM server, e.g.:
# stats = asyncio.run(benchmark_throughput(
#     "http://localhost:8000/v1", "meta-llama/Llama-3.1-8B-Instruct"))
```

## llama.cpp: Maximum Flexibility

llama.cpp runs models on CPU, Apple Silicon, CUDA GPUs, and even mobile devices. It uses GGUF quantized models for efficient memory usage.

```python
# Install Python bindings: pip install llama-cpp-python

# For GPU acceleration:
# CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

from llama_cpp import Llama

# Load a GGUF model
llm = Llama(
    model_path="./models/llama-3.1-8b-instruct-q4_k_m.gguf",
    n_ctx=4096,         # Context window
    n_gpu_layers=-1,    # -1 = offload all layers to GPU
    n_threads=8,        # CPU threads for non-GPU layers
    verbose=False,
)

# Chat completion
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the SOLID principles?"},
    ],
    temperature=0.0,
    max_tokens=512,
)
print(response["choices"][0]["message"]["content"])
```

## Performance Comparison

| Feature | Ollama | vLLM | llama.cpp |
| --- | --- | --- | --- |
| Setup difficulty | Very easy | Moderate | Moderate |
| GPU required | No (but helps) | Yes | No |
| Throughput | Good | Best | Good |
| Concurrent requests | Limited | Excellent | Limited |
| Model format | Ollama/GGUF | HF Transformers | GGUF |
| OpenAI-compatible API | Yes | Yes | Yes (server mode) |
| Best for | Development | Production serving | Edge/CPU deployment |

## FAQ

### Which tool should I use for local development and prototyping?

Ollama is the clear choice for development. It installs with a single command, downloads models automatically, and runs with no configuration. The OpenAI-compatible API means you can develop against Ollama and switch to a cloud API for production by changing only the base URL. Use vLLM only when you need production-level throughput or concurrent request handling.
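That switch can live in configuration rather than code. A minimal sketch (the `LLM_BASE_URL` and `LLM_API_KEY` variable names are our own convention, not a standard):

```python
import os

def resolve_llm_endpoint() -> tuple[str, str]:
    """Pick (base_url, api_key) from the environment, defaulting to local Ollama."""
    base_url = os.environ.get("LLM_BASE_URL", "http://localhost:11434/v1")
    api_key = os.environ.get("LLM_API_KEY", "ollama")
    return base_url, api_key

# OpenAI(base_url=base_url, api_key=api_key) then works unchanged against
# Ollama, vLLM, llama.cpp's server, or a cloud provider.
```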

### How much VRAM do I need to run different model sizes locally?

For quantized models (Q4_K_M, the most common quantization): 7-8B parameter models need 4-6 GB VRAM, 13B models need 8-10 GB, and 70B models need 36-40 GB. Unquantized bf16 models need roughly 2 bytes per parameter, so an 8B model needs about 16 GB for weights alone. Consumer GPUs like the RTX 4090 (24 GB) can comfortably run 8-13B quantized models.
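The arithmetic above reduces to bytes-per-parameter times parameter count. A rough sketch (the bits-per-weight figures are approximations that include quantization metadata, and real usage adds KV-cache and runtime overhead on top):

```python
# Approximate bits per weight for common formats (our own ballpark figures).
BITS_PER_WEIGHT = {"q4_k_m": 4.8, "q5_k_m": 5.7, "q8_0": 8.5, "bf16": 16.0}

def estimate_weights_gb(params_billions: float, fmt: str) -> float:
    """Rough memory needed just for the model weights, in GB."""
    bytes_total = params_billions * 1e9 * BITS_PER_WEIGHT[fmt] / 8
    return round(bytes_total / 1e9, 1)

print(estimate_weights_gb(8, "q4_k_m"))  # ~4.8, matching the 4-6 GB guidance
print(estimate_weights_gb(8, "bf16"))    # 16.0
```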

### Can I serve fine-tuned LoRA adapters with these tools?

Yes, all three support LoRA adapters. Ollama can import adapters through Modelfiles. vLLM supports loading LoRA adapters at runtime and even serving multiple adapters simultaneously with the same base model. llama.cpp supports GGUF-format adapters that can be applied on top of a base model. For vLLM, this is especially powerful because you can A/B test multiple fine-tuned variants without duplicating the base model in memory.

---

#Ollama #VLLM #Llamacpp #LocalLLM #OpenSource #SelfHosted #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/running-open-source-llms-locally-ollama-vllm-llamacpp
