Skip to content
ONNX Runtime for Agent Inference: Cross-Platform Model Deployment
Learn Agentic AI11 min read17 views

ONNX Runtime for Agent Inference: Cross-Platform Model Deployment

Learn how to export AI agent models to ONNX format, optimize them with ONNX Runtime, and deploy cross-platform for consistent inference performance on any hardware.

What Is ONNX and Why It Matters for Agents

ONNX (Open Neural Network Exchange) is an open format for representing machine learning models. It decouples the training framework from the inference engine: you train in PyTorch, TensorFlow, or any other framework, then export to ONNX and run inference using ONNX Runtime on any platform — Windows, Linux, macOS, Android, iOS, or the browser via WebAssembly.

For AI agents, this means you can train your intent classifier, entity extractor, or small language model on a powerful GPU server, then deploy the same model binary to a phone, a Raspberry Pi, or a browser tab without rewriting inference code.

Exporting a PyTorch Model to ONNX

Suppose your agent uses a text classifier to route user intents. Here is how to export a fine-tuned transformer model:

flowchart LR
    REQ(["Request"])
    BATCH["Continuous batching<br/>vLLM scheduler"]
    PREF{"Prefill or<br/>decode?"}
    PRE["Prefill phase<br/>parallel attention"]
    DEC["Decode phase<br/>token by token"]
    KV[("Paged KV cache")]
    SAMP["Sampling<br/>top-p, temp"]
    STREAM["Stream tokens<br/>to client"]
    REQ --> BATCH --> PREF
    PREF -->|First token| PRE --> KV
    PREF -->|Next token| DEC
    KV --> DEC --> SAMP --> STREAM
    SAMP -->|EOS| DONE(["Response complete"])
    style BATCH fill:#4f46e5,stroke:#4338ca,color:#fff
    style KV fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style STREAM fill:#0ea5e9,stroke:#0369a1,color:#fff
    style DONE fill:#059669,stroke:#047857,color:#fff
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create dummy input for tracing
dummy_input = tokenizer(
    "Schedule a meeting for tomorrow",
    return_tensors="pt",
    padding="max_length",
    max_length=64,
    truncation=True,
)

# Export to ONNX
torch.onnx.export(
    model,
    (dummy_input["input_ids"], dummy_input["attention_mask"]),
    "intent_classifier.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq_len"},
        "attention_mask": {0: "batch", 1: "seq_len"},
        "logits": {0: "batch"},
    },
    opset_version=17,
)
print("Model exported to intent_classifier.onnx")

The dynamic_axes parameter is critical — it allows the model to accept variable batch sizes and sequence lengths at runtime, which is essential for an agent processing inputs of different lengths.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

Optimizing with ONNX Runtime

The raw exported model works, but ONNX Runtime provides optimization tools that can significantly improve performance:

import onnxruntime as ort
from onnxruntime.transformers import optimizer

# Optimize the model for inference
optimized_model_path = optimizer.optimize_model(
    "intent_classifier.onnx",
    model_type="bert",
    num_heads=12,
    hidden_size=768,
    optimization_level=2,
)
optimized_model_path.save_model_to_file("intent_classifier_optimized.onnx")

# Create inference session with optimizations
session_options = ort.SessionOptions()
session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session_options.intra_op_num_threads = 4
session_options.inter_op_num_threads = 2

# Use CPU execution provider (swap for CUDA, DirectML, CoreML, etc.)
session = ort.InferenceSession(
    "intent_classifier_optimized.onnx",
    session_options,
    providers=["CPUExecutionProvider"],
)

Optimization level 2 applies operator fusion, constant folding, and shape inference — typically yielding a 20 to 40 percent speedup over the unoptimized model.

Running Inference in an Agent

Here is a complete agent intent router using the ONNX model:

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

class ONNXIntentRouter:
    LABELS = ["schedule", "query", "cancel", "update", "general"]

    def __init__(self, model_path: str, tokenizer_name: str):
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        self.session = ort.InferenceSession(
            model_path,
            providers=["CPUExecutionProvider"],
        )

    def classify(self, text: str) -> dict:
        tokens = self.tokenizer(
            text,
            return_tensors="np",
            padding="max_length",
            max_length=64,
            truncation=True,
        )
        logits = self.session.run(
            ["logits"],
            {
                "input_ids": tokens["input_ids"],
                "attention_mask": tokens["attention_mask"],
            },
        )[0]

        probs = self._softmax(logits[0])
        top_idx = int(np.argmax(probs))
        return {
            "intent": self.LABELS[top_idx],
            "confidence": float(probs[top_idx]),
        }

    @staticmethod
    def _softmax(x: np.ndarray) -> np.ndarray:
        e_x = np.exp(x - np.max(x))
        return e_x / e_x.sum()

# Usage
router = ONNXIntentRouter("intent_classifier_optimized.onnx", "distilbert-base-uncased")
result = router.classify("Cancel my 3pm appointment")
print(result)  # {"intent": "cancel", "confidence": 0.94}

Performance Benchmarks

Typical inference times for a DistilBERT classifier on ONNX Runtime:

Platform Unoptimized Optimized Quantized (INT8)
Desktop CPU (i7) 12 ms 8 ms 4 ms
Raspberry Pi 5 85 ms 55 ms 30 ms
Android (Pixel 8) 25 ms 15 ms 8 ms
Browser (WASM) 45 ms 30 ms 18 ms

Cross-Platform Deployment

ONNX Runtime supports multiple execution providers — swap the provider string without changing your inference code:

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

# GPU inference (NVIDIA)
providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]

# Apple Silicon
providers = ["CoreMLExecutionProvider", "CPUExecutionProvider"]

# Windows GPU
providers = ["DmlExecutionProvider", "CPUExecutionProvider"]

session = ort.InferenceSession("model.onnx", providers=providers)

The fallback chain means your agent code works everywhere — it uses the best available hardware and falls back to CPU gracefully.

FAQ

How much faster is ONNX Runtime compared to running inference directly in PyTorch?

For transformer models, ONNX Runtime with optimization level 2 is typically 1.5 to 3 times faster than PyTorch eager mode on CPU. With INT8 quantization, the speedup can reach 4 to 6 times. On GPU, the difference is smaller (1.2 to 1.5 times) because PyTorch already uses optimized CUDA kernels.

Can I run ONNX models on mobile devices?

Yes. ONNX Runtime has native libraries for Android (Java/Kotlin) and iOS (Swift/Objective-C). The same ONNX model file runs on both platforms. For mobile, use the CoreML execution provider on iOS and the NNAPI execution provider on Android for hardware acceleration.

What model types can I export to ONNX?

Nearly all PyTorch and TensorFlow models export to ONNX, including transformers, CNNs, RNNs, and custom architectures. The Hugging Face Optimum library provides a dedicated ORTModelForSequenceClassification class that handles the export and optimization pipeline automatically.


#ONNXRuntime #ModelDeployment #CrossPlatformAI #ModelOptimization #EdgeAI #AgenticAI #LearnAI #AIEngineering

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.

Related Articles You May Like

Large Language Models

Small Language Models That Beat GPT-4: Phi-4, Gemma-3, and SmolLM-3 Benchmarks

By 2026, sub-10B models beat 2024-era GPT-4 on most benchmarks. The Phi-4, Gemma-3, and SmolLM-3 family compared head-to-head.

Voice AI Agents

On-Device Voice LLMs: Apple Intelligence, Gemini Nano, and the Privacy Angle

On-device voice LLMs are now real. What Apple Intelligence, Gemini Nano, and Phi-4 ship in 2026 — and what they cannot do yet.

AI Infrastructure

Build a Voice Agent on Jetson Orin Nano Super (Edge GPU, 2026)

Sub-$250 NVIDIA Jetson Orin Nano Super runs a full Whisper + 8B LLM + Piper voice loop offline at 15 tok/s. Here's the full Docker-based build with thermals, models, and code.

AI Interview Prep

7 MLOps & AI Deployment Interview Questions for 2026

Real MLOps and AI deployment interview questions from Google, Amazon, Meta, and Microsoft in 2026. Covers CI/CD for ML, model monitoring, quantization, continuous batching, serving infrastructure, and evaluation frameworks.

Learn Agentic AI

Edge AI Agents: Running Autonomous Systems on Local Hardware with Nemotron and Llama

How to run AI agents on edge devices using NVIDIA Nemotron, Meta Llama, GGUF quantization, local inference servers, and offline-capable agent architectures.

Large Language Models

Quantization Techniques: Running Large Models on Smaller Hardware Without Losing Accuracy | CallSphere Blog

Quantization enables deploying large language models on constrained hardware by reducing numerical precision. Learn about FP4, FP8, INT8, and GPTQ techniques with practical accuracy trade-off analysis.