
WebAssembly for AI Agents: Running Models in the Browser

Learn how to compile AI models to WebAssembly for browser-based agent inference, covering WASM compilation, model loading strategies, browser constraints, and progressive enhancement patterns.

Why Run AI Models in the Browser

Browser-based AI inference eliminates server costs, removes network round-trip latency, and keeps data entirely on the user's device. With WebAssembly (WASM), you can run compiled C/C++ inference engines at near-native speed inside any modern browser.

For AI agents, this means building chat interfaces, form assistants, or document analyzers that work without any backend — the model runs in the browser tab alongside your JavaScript application.

The WASM AI Stack

The typical stack for browser AI consists of three layers:

  1. Model format: ONNX, TFLite, or custom binary weights
  2. Runtime: ONNX Runtime Web (WASM backend), TFLite WASM, or custom C++ compiled to WASM
  3. JavaScript API: A thin wrapper that loads the WASM module and exposes inference functions

Loading ONNX Runtime Web

The fastest path to browser AI is ONNX Runtime Web, which provides both WASM and WebGL backends:

// Install: npm install onnxruntime-web

import * as ort from "onnxruntime-web";

// Configure WASM backend
ort.env.wasm.numThreads = navigator.hardwareConcurrency || 4;
ort.env.wasm.simd = true;

async function loadAgentModel() {
  const session = await ort.InferenceSession.create(
    "/models/intent_classifier.onnx",
    {
      executionProviders: ["wasm"],
      graphOptimizationLevel: "all",
    }
  );
  return session;
}

async function classifyIntent(session, tokenIds) {
  const inputTensor = new ort.Tensor(
    "int64",
    BigInt64Array.from(tokenIds.map(BigInt)),
    [1, tokenIds.length]
  );

  const attentionMask = new ort.Tensor(
    "int64",
    BigInt64Array.from(tokenIds.map(() => BigInt(1))),
    [1, tokenIds.length]
  );

  const results = await session.run({
    input_ids: inputTensor,
    attention_mask: attentionMask,
  });

  // Output name ("logits") depends on how the model was exported
  const logits = results.logits.data;
  return softmax(Array.from(logits));
}

function softmax(arr) {
  const max = Math.max(...arr);
  const exps = arr.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}
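The classifyIntent helper above assumes you already have token IDs. A real deployment must feed the model with the exact tokenizer it was trained with (for DistilBERT-style models, a WordPiece implementation). Purely as a hypothetical placeholder, here is a whitespace tokenizer with a toy vocabulary that produces input of the same shape:

```javascript
// Hypothetical stand-in tokenizer. A real model needs the tokenizer it was
// trained with (e.g. a WordPiece port), not a whitespace split like this.
const TOY_VOCAB = { "[UNK]": 0, book: 1, a: 2, demo: 3, cancel: 4, my: 5, order: 6 };

function tokenize(text, vocab = TOY_VOCAB) {
  return text
    .toLowerCase()
    .split(/\s+/)
    .filter(Boolean)
    .map((word) => vocab[word] ?? vocab["[UNK]"]);
}
```

The resulting array of IDs is what you would pass as tokenIds to classifyIntent.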

Model Loading Strategies

Browser models can be large. A quantized DistilBERT is about 64 MB. Here are strategies to handle loading:
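A 64 MB download is significant on slower connections, so it helps to set user expectations up front. A quick estimate from size and bandwidth (hypothetical helper; where available, navigator.connection.downlink reports the downlink in Mbps):

```javascript
// Estimate model download time. downlinkMbps might come from
// navigator.connection?.downlink, which not every browser exposes.
function estimateDownloadSeconds(modelBytes, downlinkMbps) {
  const bytesPerSecond = (downlinkMbps * 1e6) / 8; // Mbps to bytes/s
  return modelBytes / bytesPerSecond;
}

// A 64 MB model on a 10 Mbps connection takes roughly 54 seconds,
// which is why the caching and lazy loading below matter.
```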


Lazy Loading with Progress

class BrowserAgent {
  constructor() {
    this.session = null;
    this.loading = false;
  }

  async ensureModel(onProgress) {
    if (this.session) return;
    if (this.loading) return;
    this.loading = true;

    try {
      // Check cache first
      const cache = await caches.open("agent-models-v1");
      const cached = await cache.match("/models/agent.onnx");

      if (!cached) {
        // Download with progress tracking
        const response = await fetch("/models/agent.onnx");
        if (!response.ok) {
          throw new Error(`Model download failed: ${response.status}`);
        }
        const reader = response.body.getReader();
        const contentLength = +response.headers.get("Content-Length");
        let received = 0;
        const chunks = [];

        while (true) {
          const { done, value } = await reader.read();
          if (done) break;
          chunks.push(value);
          received += value.length;
          // Guard against a missing Content-Length header
          if (contentLength) onProgress?.(received / contentLength);
        }
        }

        const blob = new Blob(chunks);
        await cache.put("/models/agent.onnx", new Response(blob));
      }

      const modelResponse = await cache.match("/models/agent.onnx");
      const buffer = await modelResponse.arrayBuffer();
      this.session = await ort.InferenceSession.create(buffer, {
        executionProviders: ["wasm"],
      });
    } finally {
      this.loading = false;
    }
  }
}

Web Worker Isolation

Run inference in a Web Worker to keep the main thread responsive:

// agent-worker.js (load with: new Worker("agent-worker.js", { type: "module" }))
import * as ort from "onnxruntime-web";

let session = null;

self.onmessage = async (event) => {
  const { type, payload } = event.data;

  if (type === "load") {
    session = await ort.InferenceSession.create(payload.modelBuffer, {
      executionProviders: ["wasm"],
    });
    self.postMessage({ type: "ready" });
  }

  if (type === "infer") {
    const input = new ort.Tensor("int64",
      BigInt64Array.from(payload.tokens.map(BigInt)),
      [1, payload.tokens.length]
    );
    const results = await session.run({ input_ids: input });
    self.postMessage({ type: "result", data: Array.from(results.logits.data) });
  }
};
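On the main thread, the worker protocol above is easier to consume behind a small promise wrapper. This AgentWorkerClient is a hypothetical helper (not part of onnxruntime-web) that assumes the worker answers each message in order:

```javascript
// Hypothetical main-thread wrapper for agent-worker.js. Assumes one reply
// per request, delivered in FIFO order.
class AgentWorkerClient {
  constructor(worker) {
    this.worker = worker;
    this.pending = [];
    this.worker.onmessage = (event) => {
      const resolve = this.pending.shift();
      resolve?.(event.data);
    };
  }

  send(type, payload) {
    return new Promise((resolve) => {
      this.pending.push(resolve);
      this.worker.postMessage({ type, payload });
    });
  }

  async load(modelBuffer) {
    await this.send("load", { modelBuffer }); // resolves on { type: "ready" }
  }

  async infer(tokens) {
    const reply = await this.send("infer", { tokens });
    return reply.data; // raw logits array posted by the worker
  }
}

// Usage (in the browser):
// const client = new AgentWorkerClient(
//   new Worker("agent-worker.js", { type: "module" })
// );
// await client.load(modelBuffer);
// const logits = await client.infer([101, 2023, 102]);
```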

Browser Constraints

Running AI in the browser comes with hard limits:

  • Memory: Browsers typically cap WASM memory at 2 to 4 GB. Models larger than about 1 GB become impractical.
  • Startup time: WASM compilation of large modules takes 1 to 5 seconds on first load.
  • No GPU from WASM: WASM itself runs on CPU. For GPU, you need WebGL or WebGPU backends.
  • Thread limitations: SharedArrayBuffer (required for multi-threaded WASM) needs cross-origin isolation headers.
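These constraints can be probed at runtime before committing to a large download. A hypothetical feature check (the env parameter exists only to make it testable; in the browser you would call it with no arguments):

```javascript
// Hypothetical capability probe for the constraints listed above.
// crossOriginIsolated is the browser flag set when COOP/COEP headers are present.
function detectWasmCapabilities(env = globalThis) {
  return {
    wasm: typeof env.WebAssembly === "object" && env.WebAssembly !== null,
    // Multi-threaded WASM needs SharedArrayBuffer (cross-origin isolation)
    threads: typeof env.SharedArrayBuffer === "function",
    crossOriginIsolated: env.crossOriginIsolated === true,
    // WebGPU backend availability, for GPU inference instead of plain WASM
    webgpu:
      typeof env.navigator === "object" &&
      env.navigator !== null &&
      "gpu" in env.navigator,
  };
}
```

A loader can use this to choose executionProviders, disable multi-threading, or fall back to a server API entirely.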

Progressive Enhancement Pattern

Build your agent to work without the local model, then enhance when it loads:

class ProgressiveAgent {
  constructor(apiEndpoint) {
    this.apiEndpoint = apiEndpoint;
    this.localModel = null;
    this.loadLocalModel();
  }

  async loadLocalModel() {
    try {
      const session = await ort.InferenceSession.create("/models/agent.onnx");
      this.localModel = session;
      console.log("Local model loaded — switching to browser inference");
    } catch (err) {
      console.warn("Local model unavailable, using cloud fallback", err);
    }
  }

  async processInput(text) {
    if (this.localModel) {
      return this.inferLocally(text);
    }
    return this.inferViaAPI(text);
  }

  async inferViaAPI(text) {
    const res = await fetch(this.apiEndpoint, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ text }),
    });
    return res.json();
  }

  async inferLocally(text) {
    // Tokenize and run through local ONNX model
    const tokens = this.tokenize(text);
    // ... run inference as shown above
  }
}

FAQ

How does WASM AI performance compare to native?

WASM inference is typically 2 to 4 times slower than native C++ for the same model on the same hardware. However, with SIMD instructions enabled and multi-threading via SharedArrayBuffer, the gap narrows to 1.5 to 2 times. For many agent tasks like classification or embedding, the absolute latency (20 to 50 milliseconds) is fast enough to feel instant.

Which browsers support WASM AI workloads?

All modern browsers — Chrome 57 and later, Firefox 52 and later, Safari 11 and later, and Edge 16 and later — support WebAssembly. For multi-threaded WASM (which ONNX Runtime Web uses for performance), you need SharedArrayBuffer support, which requires cross-origin isolation headers: Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp.
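Those two headers must be set by the server that delivers your page. A minimal sketch, shown as Express-style middleware (hypothetical wiring; any server that can set response headers works the same way):

```javascript
// The exact headers required for SharedArrayBuffer / multi-threaded WASM.
function crossOriginIsolationHeaders() {
  return {
    "Cross-Origin-Opener-Policy": "same-origin",
    "Cross-Origin-Embedder-Policy": "require-corp",
  };
}

// Express-style middleware applying them to every response.
function coopCoepMiddleware(req, res, next) {
  for (const [name, value] of Object.entries(crossOriginIsolationHeaders())) {
    res.setHeader(name, value);
  }
  next();
}
```

Note that COEP: require-corp also forces every cross-origin subresource on the page to opt in, which can break third-party embeds.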

Can I run large language models in the browser with WASM?

Small language models up to about 1 billion parameters (quantized) can run in the browser, though generation is slow — roughly 2 to 10 tokens per second. For practical browser-based agents, use smaller specialized models for intent routing and tool selection, and reserve cloud LLMs for complex generation tasks.


#WebAssembly #BrowserAI #WASM #JavaScript #ClientSideAI #EdgeAI #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team
