---
title: "Building Agents with Gemma and Phi: Small Language Models for Edge Deployment"
description: "Deploy AI agents on edge devices using Google's Gemma and Microsoft's Phi small language models. Cover resource requirements, agent patterns for constrained environments, and mobile deployment strategies."
canonical: https://callsphere.ai/blog/building-agents-gemma-phi-small-language-models-edge-deployment
category: "Learn Agentic AI"
tags: ["Gemma", "Phi", "Small Language Models", "Edge AI", "Mobile Deployment"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-06-05T04:49:14.703Z
---

# Building Agents with Gemma and Phi: Small Language Models for Edge Deployment

> Deploy AI agents on edge devices using Google's Gemma and Microsoft's Phi small language models. This guide covers resource requirements, agent patterns for constrained environments, and mobile deployment strategies.

## The Case for Small Language Models

Not every agent needs a 70B parameter model. Many practical agent tasks — classification, extraction, simple Q&A, form filling, and basic tool calling — can be handled by models with 2-4 billion parameters. Small Language Models (SLMs) open up deployment scenarios that large models cannot reach: mobile phones, IoT devices, laptops without GPUs, and environments with no internet connectivity.

Google's Gemma and Microsoft's Phi families lead the SLM space. Both deliver surprisingly strong performance relative to their size, often matching models 3-5x larger on targeted benchmarks.

## Model Overview

**Gemma 2 2B**: The smallest model in Google's Gemma 2 family. It has 2.6B parameters and was trained on roughly 2 trillion tokens, mostly web data. For its size it does well at summarization, classification, and code generation. It is released under the Gemma license, which permits commercial use.

**Gemma 2 9B**: The mid-range option. It outperforms Llama 3.1 8B on several benchmarks while remaining practical to serve on a single consumer GPU.

**Phi-3.5-mini**: Microsoft's 3.8B model. It was trained on a mix of filtered web data and synthetic data generated by larger models, and it is strong at reasoning and code generation for its size.

**Phi-3-small**: 7B parameters with a focus on reasoning. It competes with larger models on math and logic benchmarks.

Whichever model you pick, it slots into the same basic agent loop:

```mermaid
flowchart LR
    INPUT(["User intent"])
    PARSE["Parse plus<br/>classify"]
    PLAN["Plan and tool<br/>selection"]
    AGENT["Agent loop<br/>LLM plus tools"]
    GUARD{"Guardrails<br/>and policy"}
    EXEC["Execute and<br/>verify result"]
    OBS[("Trace and metrics")]
    OUT(["Outcome plus<br/>next action"])
    INPUT --> PARSE --> PLAN --> AGENT --> GUARD
    GUARD -->|Pass| EXEC --> OUT
    GUARD -->|Fail| AGENT
    AGENT --> OBS
    style AGENT fill:#4f46e5,stroke:#4338ca,color:#fff
    style GUARD fill:#f59e0b,stroke:#d97706,color:#1f2937
    style OBS fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style OUT fill:#059669,stroke:#047857,color:#fff
```

## Running Gemma Locally

Using Ollama is the quickest way to get started:

```bash
# Pull Gemma 2 2B (about 1.6 GB download)
ollama pull gemma2:2b

# Test it
ollama run gemma2:2b "Classify this as positive or negative: The product is excellent"
```

For Python integration:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

def classify_sentiment(text: str) -> str:
    response = client.chat.completions.create(
        model="gemma2:2b",
        messages=[
            {"role": "user", "content":
             f"Classify the sentiment as positive, negative, or neutral. "
             f"Respond with one word only.\n\nText: {text}"},
        ],
        temperature=0.0,
        max_tokens=5,
    )
    return response.choices[0].message.content.strip().lower()

print(classify_sentiment("This product exceeded my expectations!"))  # positive
print(classify_sentiment("The delivery was late and the item was damaged."))  # negative
```

## Running Phi on Edge Devices

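Microsoft publishes ONNX Runtime builds of the Phi models, which makes them deployable on a wide range of hardware, including CPUs and mobile NPUs. The quickest way to try the model before exporting it, though, is through Hugging Face Transformers: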

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "microsoft/Phi-3.5-mini-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a concise assistant that extracts structured data."},
    {"role": "user", "content": "Extract the name, date, and amount from: "
     "Invoice from John Smith dated March 15, 2026 for $2,500."},
]

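# Build the prompt with the model's chat template. return_dict=True also
# returns the attention mask, which generate() uses alongside the input ids.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

# Greedy decoding keeps extraction output deterministic.
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[-1]
result = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)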
print(result)
```

## Agent Patterns for Constrained Environments

SLMs require different agent design patterns than large models. The key principle is to simplify the task structure so the model can handle each step reliably.

**Pattern 1: Single-Purpose Agents** — Instead of one general agent, deploy multiple specialized micro-agents:

```python
class EdgeAgentRouter:
    def __init__(self, client):
        self.client = client

    def route(self, user_input: str) -> str:
        # Step 1: Classify intent with the SLM
        intent = self._classify_intent(user_input)

        # Step 2: Route to specialized handler
        handlers = {
            "weather": self._handle_weather,
            "reminder": self._handle_reminder,
            "question": self._handle_question,
        }
        handler = handlers.get(intent, self._handle_question)
        return handler(user_input)

    def _classify_intent(self, text: str) -> str:
        response = self.client.chat.completions.create(
            model="gemma2:2b",
            messages=[{"role": "user", "content":
                f"Classify this user request into one category: "
                f"weather, reminder, question.\n"
                f"Respond with the category only.\nRequest: {text}"}],
            temperature=0.0,
            max_tokens=10,
        )
        return response.choices[0].message.content.strip().lower()

    def _handle_weather(self, text: str) -> str:
        # Extract city, call weather API
        return "Weather handler triggered"

    def _handle_reminder(self, text: str) -> str:
        # Extract time and message, set reminder
        return "Reminder handler triggered"

    def _handle_question(self, text: str) -> str:
        response = self.client.chat.completions.create(
            model="gemma2:2b",
            messages=[{"role": "user", "content": text}],
            temperature=0.3,
            max_tokens=200,
        )
        return response.choices[0].message.content
```
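
The router above looks up the model's raw reply in `handlers`, so a reply such as `Weather.` or `Category: reminder` would silently fall through to the question handler. A minimal sketch of one way to harden that step, assuming a hypothetical helper named `normalize_label` that is not part of the original router:

```python
import re

def normalize_label(raw: str, allowed: tuple[str, ...], default: str) -> str:
    """Map a free-form model reply onto one of the allowed labels.

    Hypothetical helper: returns the first word in the reply that is an
    allowed label, or `default` when no allowed label appears.
    """
    words = re.findall(r"[a-z]+", raw.lower())
    for word in words:
        if word in allowed:
            return word
    return default

INTENTS = ("weather", "reminder", "question")

print(normalize_label("Weather.", INTENTS, "question"))            # weather
print(normalize_label("Category: reminder", INTENTS, "question"))  # reminder
print(normalize_label("I am not sure", INTENTS, "question"))       # question
```

Inside `_classify_intent`, the raw reply could be passed through a helper like this before the handler lookup, so that punctuation or extra words from the model do not change the routing decision.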

**Pattern 2: Structured Output with Constrained Generation** — Use explicit output formats to compensate for smaller models' tendency to be less structured:

```python
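def extract_entities(client, text: str) -> dict:
    """Extract NAME, DATE, and AMOUNT using a fixed line-based template."""
    fields = ("NAME", "DATE", "AMOUNT")
    response = client.chat.completions.create(
        model="phi3.5:latest",
        messages=[{"role": "user", "content":
            "Extract entities from this text. Respond in exactly this format, "
            "writing NONE for any field that is missing:\n"
            "NAME: <value>\n"
            "DATE: <value>\n"
            "AMOUNT: <value>\n\n"
            f"Text: {text}"}],
        temperature=0.0,
        max_tokens=50,
    )

    result = {}
    for line in response.choices[0].message.content.strip().splitlines():
        # partition tolerates "NAME:value" as well as "NAME: value"
        key, sep, value = line.partition(":")
        key, value = key.strip().upper(), value.strip()
        # Keep only the expected fields and drop empty or NONE placeholders
        if sep and key in fields and value and value.upper() != "NONE":
            result[key] = value
    return result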
```
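
A stricter variant of this pattern asks the server to constrain decoding to valid JSON instead of relying on a text template. The sketch below assumes the local server's OpenAI-compatible endpoint honors `response_format={"type": "json_object"}` for the chosen model; `parse_entities` and `extract_entities_json` are hypothetical names. Validation is still needed, because JSON mode guarantees syntax, not the right keys.

```python
import json

FIELDS = ("name", "date", "amount")

def parse_entities(raw: str) -> dict:
    """Parse a JSON reply and keep only the expected, non-empty string fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}
    result = {}
    for key in FIELDS:
        value = data.get(key)
        if isinstance(value, str) and value.strip():
            result[key] = value.strip()
    return result

def extract_entities_json(client, text: str, model: str = "phi3.5:latest") -> dict:
    """Request JSON output from an OpenAI-compatible client and validate it."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content":
            "Extract the name, date, and amount from the text. "
            'Reply with a JSON object with the keys "name", "date", and '
            '"amount". All values must be strings; use null for anything '
            "missing.\n\n"
            f"Text: {text}"}],
        temperature=0.0,
        max_tokens=100,
        # Assumption: the server supports OpenAI-style JSON mode.
        response_format={"type": "json_object"},
    )
    return parse_entities(response.choices[0].message.content or "")

# The parser can be exercised without a running model:
print(parse_entities('{"name": "John Smith", "date": null, "amount": "$2,500"}'))
```

Keeping the parser separate from the API call makes it easy to unit-test the validation logic on recorded model replies.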

## Memory and Performance Benchmarks

| Model | Parameters | RAM (Q4) | Tokens/sec (CPU) | Tokens/sec (GPU) |
| --- | --- | --- | --- | --- |
| Gemma 2 2B | 2.6B | 1.8 GB | 15-25 | 80-120 |
| Phi-3.5-mini | 3.8B | 2.5 GB | 10-20 | 60-100 |
| Gemma 2 9B | 9.2B | 5.5 GB | 5-10 | 40-70 |
| Phi-3-small | 7B | 4.5 GB | 5-12 | 35-60 |

CPU token rates were measured on a modern laptop (Apple M2 or 13th-gen Intel Core i7). GPU rates were measured on an RTX 3060 12 GB.
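
As a rough cross-check on the RAM column, quantized weight size can be estimated as parameters times bits per weight divided by eight, plus some runtime overhead. The sketch below is back-of-the-envelope arithmetic, not a measurement: the 4.5 bits per weight (4-bit values plus quantization scales) and the 0.3 GB overhead are assumptions chosen for illustration, and real usage grows with context length.

```python
def estimate_q4_ram_gb(
    params_billion: float,
    bits_per_weight: float = 4.5,  # assumption: 4-bit weights plus scale metadata
    overhead_gb: float = 0.3,      # assumption: runtime buffers and a short context
) -> float:
    """Estimate resident memory in GB (10^9 bytes) for a quantized model."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb

models = [
    ("Gemma 2 2B", 2.6),
    ("Phi-3.5-mini", 3.8),
    ("Phi-3-small", 7.0),
    ("Gemma 2 9B", 9.2),
]
for name, params in models:
    print(f"{name}: ~{estimate_q4_ram_gb(params):.1f} GB")
```

Under these assumptions the estimates land within roughly 0.3 GB of the table's figures, which is close enough for deciding whether a model fits on a given device.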

## FAQ

### Can a 2B model really handle agent tasks reliably?

For narrowly scoped tasks like classification, entity extraction, and template-based responses, yes. A Gemma 2 2B model fine-tuned on your specific task can be highly reliable. For open-ended reasoning or complex multi-step tool calling, plan on at least a 7B model.

### How do I deploy an SLM on a mobile phone?

Use the GGUF format with llama.cpp compiled for ARM. On Android, llama.cpp can be wrapped through JNI bindings (the llama.cpp repository includes an Android example). On iOS, build llama.cpp with Metal for GPU acceleration. Expect 5-15 tokens/second on flagship phones with quantized 2-3B models.
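
Throughput figures like these translate directly into a response-time budget, which in turn bounds how large `max_tokens` can be on a phone. A small sketch of the arithmetic, using hypothetical helper names and ignoring prompt-processing time (so real latency will be somewhat higher):

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to generate `output_tokens` at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second

def max_tokens_for_budget(budget_seconds: float, tokens_per_second: float) -> int:
    """Largest output length that fits inside a latency budget."""
    return int(budget_seconds * tokens_per_second)

# The 5-15 tokens/second range quoted above, applied to a 50-token reply
slow = generation_seconds(50, 5)    # 10.0 seconds
fast = generation_seconds(50, 15)   # about 3.3 seconds
print(f"50 tokens: {fast:.1f}s to {slow:.1f}s")
print("Tokens within a 3 s budget at 5 tok/s:", max_tokens_for_budget(3, 5))
```

This is one reason single-word classification and short templated outputs suit on-device agents better than long free-form answers.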

### Is fine-tuning necessary for SLMs in agent applications?

Fine-tuning matters more for SLMs than for large models. A generic 2B model may struggle with your specific output format, but a fine-tuned version can match larger models on that narrow task. Use LoRA fine-tuning with 500-2000 examples of your expected input/output pairs for the best results.
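
Preparing those input/output pairs is mostly a formatting exercise. The sketch below writes pairs to a JSONL file in a chat-style `messages` layout, which many supervised fine-tuning tools accept; the exact schema your trainer expects is an assumption to verify, and `to_chat_record` and `write_jsonl` are hypothetical helper names.

```python
import json
import tempfile
from pathlib import Path

def to_chat_record(user_input: str, expected_output: str) -> dict:
    """Wrap one input/output pair as a two-turn chat example."""
    return {
        "messages": [
            {"role": "user", "content": user_input},
            {"role": "assistant", "content": expected_output},
        ]
    }

def write_jsonl(pairs: list[tuple[str, str]], path: Path) -> int:
    """Write one JSON object per line and return the number of records."""
    records = [to_chat_record(inp, out) for inp, out in pairs]
    with Path(path).open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)

pairs = [
    ("Classify: The product is excellent", "positive"),
    ("Classify: The item arrived damaged", "negative"),
]
# Written to the system temp directory so the example leaves no project files
out_path = Path(tempfile.gettempdir()) / "slm_train_example.jsonl"
print(write_jsonl(pairs, out_path), "records written to", out_path)
```

Use the same prompt wording in the training examples that the agent sends at inference time, so the fine-tuned model sees a consistent format.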

---

Source: https://callsphere.ai/blog/building-agents-gemma-phi-small-language-models-edge-deployment
