---
title: "Building an HVAC After-Hours Emergency Escalation System: A Complete Engineering Guide"
description: "How we built a fault-tolerant HVAC emergency triage and tech-dispatch platform on Kubernetes — three-tier CQRS, 11 micro-agents on the OpenAI Agents SDK + LangGraph, NATS JetStream, DTMF/SMS/WebSocket acceptance, circuit breakers, and an evaluation pipeline that catches regressions before they wake a tech at 3 AM."
canonical: https://callsphere.ai/blog/building-hvac-after-hours-emergency-escalation-system-2026
category: "HVAC"
tags: ["HVAC", "After-Hours", "AI Agents", "LangGraph", "OpenAI Agents SDK", "Kubernetes", "Twilio", "Multi-Agent Systems", "Engineering Guide"]
author: "CallSphere Team"
published: 2026-05-12T18:24:12.914Z
updated: 2026-05-12T18:53:28.517Z
---

# Building an HVAC After-Hours Emergency Escalation System: A Complete Engineering Guide

> How we built a fault-tolerant HVAC emergency triage and tech-dispatch platform on Kubernetes — three-tier CQRS, 11 micro-agents on the OpenAI Agents SDK + LangGraph, NATS JetStream, DTMF/SMS/WebSocket acceptance, circuit breakers, and an evaluation pipeline that catches regressions before they wake a tech at 3 AM.

## The HVAC Owner's 2 AM Problem

Every HVAC owner I have talked to has the same after-hours horror story. The phone rings at 2:14 AM. The voicemail says something about heat, or a smell, or a leak. By the time the owner listens, decides whether it is real, calls a tech, and gets a truck rolling, the customer has already called the next company in the search results. Worse: half the after-hours calls are not emergencies at all — somebody who got home late and wants to book a maintenance visit for next Tuesday — but the owner cannot tell which is which without listening to every one.

The bad outcomes are expensive. A missed no-heat call in January costs **$10,000–$50,000** when pipes burst overnight and the homeowner files an insurance claim against you. A missed gas-smell call is a liability event you do not want to talk to your attorney about. A missed commercial walk-in cooler call costs you the account. The good outcomes — same-night dispatch on a real emergency — are how HVAC companies earn 5-star reviews and triple their after-hours revenue.

This post is the complete engineering guide to the CallSphere After-Hours Escalation system, purpose-built for HVAC. Three-tier CQRS architecture on Kubernetes, eleven micro-agents that triage HVAC emergencies in under a second, and a fault-tolerant dispatch loop that pages the on-call tech via voice + SMS + DTMF acknowledgment until somebody accepts the job.

## What Counts As An HVAC Emergency

The triage layer is the heart of the product, and it is HVAC-specific. The model is fine-tuned on real HVAC after-hours messages and scores each one on a 0.0–1.0 urgency axis. Roughly:

- **0.9–1.0 — Dispatch immediately**: gas smell, carbon-monoxide alarm sounding, boiler leak with active water, no heat at outdoor temp ≤ 20°F with infants/elderly in the home, commercial walk-in cooler down with product inside.
- **0.6–0.9 — Dispatch tonight**: no heat at moderate outdoor temp, no AC during a heat advisory, furnace tripping breaker repeatedly, refrigerant smell.
- **0.3–0.6 — Confirm with on-call but probably tomorrow**: AC making unusual noise, thermostat malfunction, intermittent issues, scheduling questions framed as urgent.
- **0.0–0.3 — Auto-acknowledge, no page**: appointment requests, billing questions, "just leaving a message," vendor sales calls, spam.

The threshold to wake a tech is configurable per company — most start at 0.6 and tighten to 0.7 once they trust the system. Anything below threshold is logged for the morning and the customer gets an SMS auto-reply confirming receipt and offering a same-day callback window.
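
As a concrete sketch, here is how the band-to-action mapping above might look in worker code. The function and field names are illustrative, not the production implementation; the wake threshold defaults to 0.6 as described.

```python
from dataclasses import dataclass

@dataclass
class TriageDecision:
    band: str        # which urgency band the score landed in
    page_tech: bool  # whether the on-call ladder is started at all

def decide(urgency: float, wake_threshold: float = 0.6) -> TriageDecision:
    """Map a 0.0-1.0 HVAC urgency score to the actions described above."""
    page = urgency >= wake_threshold
    if urgency >= 0.9:
        return TriageDecision("dispatch_immediately", page)
    if urgency >= 0.6:
        return TriageDecision("dispatch_tonight", page)
    if urgency >= 0.3:
        return TriageDecision("confirm_with_on_call", page)
    return TriageDecision("auto_acknowledge", page)  # log for morning, SMS auto-reply
```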

## Architecture Overview

The system is a three-tier CQRS split: a thin edge that ingests calls and emails, a stateful Go gateway that owns routing and WebSocket connections, and a fleet of stateless Python agent workers that own AI inference and dispatch. Each tier scales independently because each has a different bottleneck — bandwidth at the edge, connection count at the gateway, and inference time at the workers.

```mermaid
flowchart TB
    subgraph Edge["Edge / Customer Touchpoints"]
        Email[Service Email Inbox<br/>IMAP Polling]
        Twilio[Twilio Inbound Call<br/>+ Voicemail Transcription]
        Dialpad[Dialpad / RingCentral<br/>Webhooks]
        WS[Owner Dashboard<br/>WebSocket]
    end

    subgraph Gateway["Tier 1: Go API Gateway"]
        Gin[Gin/Fiber Server<br/>5-20 HPA Pods<br/>25K req/sec]
        WSHub[WebSocket Hub<br/>200K Concurrent<br/>2KB / Goroutine]
    end

    subgraph Bus["Durable Event Bus"]
        NATS[NATS JetStream<br/>8K msg/sec]
    end

    subgraph Workers["Tier 2: Python Agent Workers"]
        AW[10-100 K8s Pods<br/>HPA on Queue Depth]
        LG[LangGraph Runtime<br/>Durable Checkpoints]
        Agents[11 HVAC Agents<br/>OpenAI Agents SDK]
    end

    subgraph Data["Tier 3: Datastores"]
        PG[(PostgreSQL Primary<br/>+ 2 Read Replicas<br/>CP / 10K writes/sec)]
        Redis[(Redis Cluster<br/>6 nodes, 24Gi<br/>AP / 1M ops/sec)]
        ES[(Elasticsearch<br/>3 nodes / Debezium CDC<br/>200K searches/sec)]
    end

    Email --> Gin
    Twilio --> Gin
    Dialpad --> Gin
    WS --> WSHub
    Gin --> NATS
    WSHub --> NATS
    NATS --> AW
    AW --> LG
    LG --> Agents
    Agents --> PG
    Agents --> Redis
    Agents --> ES
    PG --> ES
```

## Tier 1: The Go API Gateway

The gateway is written in Go (Gin) for one reason: WebSocket fan-out. We hold ~200K concurrent WebSocket connections from owner dashboards and customer status pages, and each goroutine consumes ~2KB of memory. The Python asyncio equivalent we benchmarked first used ~100KB per coroutine — fifty times the footprint and a non-starter at our scale.

The gateway's job is intentionally narrow:

- **Authenticate** — JWT for the owner dashboard, signed Twilio webhooks for inbound calls/SMS, signed Dialpad/RingCentral payloads.
- **Validate** — reject malformed payloads before they touch the bus.
- **Rate-limit** — Redis token bucket per HVAC company tenant.
- **Publish** — push the event onto NATS and return 200 to the caller in ~5 ms.

Crucially, the gateway never calls the LLM. The webhook returns 200 immediately, the work is enqueued onto NATS JetStream, and a worker picks it up. This decouples ingestion latency from inference latency — vital because Twilio retries any webhook that takes longer than 15 seconds, and LLM tail latency under load can blow past that. Five to twenty pods behind an HPA sustain 25K req/sec; CPU stays under 30% even when a polar-vortex event triples normal call volume.
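
The production gateway is Go, but the enqueue-and-ack shape is easy to show in a few lines. Here is a sketch in Python using FastAPI and nats-py (the endpoint path, subject name, and validation are simplified assumptions) that captures the contract: validate, publish, return 200, never call the LLM.

```python
import json
import os

import nats
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()

@app.on_event("startup")
async def connect_bus() -> None:
    nc = await nats.connect(os.environ.get("NATS_URL", "nats://nats:4222"))
    app.state.js = nc.jetstream()

@app.post("/webhooks/twilio/voice")
async def inbound_call(req: Request):
    form = dict(await req.form())
    if "CallSid" not in form:                       # reject malformed payloads before the bus
        raise HTTPException(status_code=400, detail="missing CallSid")
    # Enqueue and acknowledge; triage happens later on a worker, never here.
    await app.state.js.publish(
        "calls.inbound",
        json.dumps(form).encode(),
        headers={"Nats-Msg-Id": form["CallSid"]},   # JetStream dedupe on Twilio retries
    )
    return {"queued": True}
```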

## Tier 2: The 11 HVAC Agents

The agent fleet is the brain. Eleven small, single-purpose agents are orchestrated via the OpenAI Agents SDK and LangGraph, with handoffs governed by a head agent. Each agent has a tightly scoped tool surface — the triage agent cannot dispatch a tech, the voice agent cannot mutate the on-call rotation. This is bulkhead isolation at the agent boundary, not just the service boundary.

```mermaid
flowchart TB
    Ingress[Customer Call or<br/>Service Email]

    Head[Head Agent<br/>Routes & Handoffs]

    subgraph Triage["Triage Layer"]
        Email_T[Email Triage Agent<br/>HVAC Urgency 0.0-1.0]
        Call_T[Call Triage Agent<br/>Parse Webhook + Caller ID]
        VM_T[Voicemail Analyzer<br/>Detect No-Heat / Gas / Leak]
    end

    subgraph Decision["Decision Layer"]
        Dispatch_O[Dispatch Orchestrator<br/>Build On-Call Ladder]
        HITL[Human-in-the-Loop<br/>Owner Approval]
    end

    subgraph Action["Action Layer"]
        Voice_A[Voice Agent<br/>Job Brief TTS to Tech]
        SMS_A[SMS Agent<br/>Address + Issue + ETA Ask]
        Ack_M[Ack Monitor Agent<br/>Tech Accepted Job?]
    end

    subgraph Fallback["Fallback Layer"]
        Keyword[Keyword Triage<br/>Circuit Breaker Open]
        SMS_FB[SMS-Only Path<br/>Twilio Voice Down]
    end

    Ingress --> Head
    Head --> Email_T
    Head --> Call_T
    Head --> VM_T
    Email_T --> Dispatch_O
    Call_T --> Dispatch_O
    VM_T --> Dispatch_O
    Dispatch_O --> HITL
    HITL --> Voice_A
    HITL --> SMS_A
    Voice_A --> Ack_M
    SMS_A --> Ack_M
    Dispatch_O -.degrade.-> Keyword
    Voice_A -.outage.-> SMS_FB
```

The eleven agents:

1. **Head Agent** — Dispatch and handoff coordination across the graph.
2. **Email Triage Agent** — Scores incoming service email 0.0–1.0 with HVAC-specific rubric.
3. **Call Triage Agent** — Parses signed Twilio/Dialpad webhooks; pulls caller ID against customer DB.
4. **Voicemail Analyzer Agent** — Reads transcribed voicemails; flags no-heat, gas-smell, leak, CO-alarm markers.
5. **Dispatch Orchestrator** — Builds the on-call ladder from rotation + owner fallback.
6. **Voice Agent** — Generates a 35–50 word job brief for the on-call tech with address, issue, and DTMF prompt.
7. **SMS Agent** — Composes ≤160-char SMS with address, issue summary, and "Reply YES to accept."
8. **Acknowledgment Monitor Agent** — Detects acceptance across DTMF, SMS, and dashboard click.
9. **HITL Approval Agent** — Pauses the graph for ambiguous cases the owner wants to eyeball.
10. **Keyword Fallback Agent** — Deterministic regex triage when the LLM circuit is open.
11. **Audit Agent** — Writes the event-sourced trail to PostgreSQL after every state transition.
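
The tool scoping described above is enforced at agent construction time. The following is a minimal sketch using the OpenAI Agents SDK; the agent names match the list, but the instructions, tool bodies, and handoff wiring are illustrative assumptions, not the production definitions.

```python
from agents import Agent, Runner, function_tool

@function_tool
def score_hvac_urgency(message: str, outdoor_temp_f: float) -> float:
    """Return a 0.0-1.0 urgency score for an after-hours HVAC message."""
    ...  # rubric-driven scoring lives here

@function_tool
def page_on_call_tech(dispatch_id: str, tech_id: str) -> str:
    """Start voice + SMS paging for one tech on the on-call ladder."""
    ...

triage_agent = Agent(
    name="Email Triage Agent",
    instructions="Score the message on the HVAC urgency rubric. Never dispatch.",
    tools=[score_hvac_urgency],        # cannot page anyone
)

dispatch_agent = Agent(
    name="Dispatch Orchestrator",
    instructions="Build the on-call ladder and page techs for scores >= 0.6.",
    tools=[page_on_call_tech],         # cannot re-score the call
)

head_agent = Agent(
    name="Head Agent",
    instructions="Route each inbound message to triage, then hand off to dispatch.",
    handoffs=[triage_agent, dispatch_agent],
)

# result = await Runner.run(head_agent, "Voicemail: no heat, it's 5°F outside...")
```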

## Why NATS JetStream

NATS JetStream sits between the gateway and the workers. We did not pick it for raw performance — Kafka would also work — we picked it for three operational properties that matter for an after-hours product:

- **Durable consumers with at-least-once delivery.** If a worker crashes mid-dispatch, the message is redelivered and LangGraph's checkpoint resumes from the last committed step. No duplicate page-outs because the agent is idempotent on call SID.
- **Queue depth as the autoscaling signal.** CPU is the wrong metric for an LLM-bound workload — a worker can be 100% blocked on an OpenAI call while CPU sits at 5%. We export NATS pending message count to Prometheus and HPA scales workers from 10 to 100 pods on backlog depth — useful when a cold front drops temps and call volume spikes 5x in twenty minutes.
- **Operational simplicity.** A single 3-node NATS cluster versus Kafka with Zookeeper, schema registry, and connect workers. For 8K msg/sec we did not need Kafka's throughput ceiling.

The flow is a saga: each agent step publishes its result to a downstream subject, and the orchestrator owns compensating actions (cancel pending tech pages if the customer calls back to cancel). Saga semantics over distributed transactions because we span Twilio, OpenAI, PostgreSQL, and Redis — no two-phase commit is going to coordinate that.
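
On the worker side, the durable-consumer loop is small. This is a sketch with nats-py (the stream, subject, and durable names are assumptions): fetch, process idempotently, and ack only after the step commits, so a crash before the ack triggers redelivery.

```python
import nats
from nats.errors import TimeoutError

async def run_worker() -> None:
    nc = await nats.connect("nats://nats:4222")
    js = nc.jetstream()
    # Durable pull consumer: if this pod dies mid-dispatch, the unacked message
    # is redelivered to another worker and LangGraph resumes from its checkpoint.
    sub = await js.pull_subscribe("calls.inbound", durable="hvac-workers")
    while True:
        try:
            msgs = await sub.fetch(1, timeout=5)
        except TimeoutError:
            continue
        for msg in msgs:
            await handle_dispatch(msg.data)   # idempotent on call SID
            await msg.ack()                   # ack only after the step commits

async def handle_dispatch(payload: bytes) -> None:
    ...  # triage -> orchestrate -> page, driven by the LangGraph runtime
```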

## The Tech-Acceptance Loop

Acceptance is the trickiest piece of the system. An on-call HVAC tech needs to be able to accept the job across whichever channel they are reachable on at 2 AM — DTMF on the call we made to them, an SMS reply, or a click in the company dashboard. All three must converge on a single canonical "accepted by tech X" state with strong consistency: dispatching two trucks to one job is bad, but dispatching the wrong tech is worse.

```mermaid
flowchart TB
    Start[HVAC Score ≥ 0.6]
    Build[Build On-Call Ladder<br/>Primary Tech → Secondary → Owner]
    Loop[For Each Tech]

    subgraph Channels["Parallel Channels"]
        Call[Twilio Call to Tech<br/>Job Brief + Press 1 to Accept]
        SMS[Twilio SMS to Tech<br/>Reply YES + ETA]
        Dash[Dashboard WebSocket<br/>Click Accept]
    end

    Wait[Wait 120s OR Accept]

    DTMF{DTMF<br/>Pressed?}
    SMSReply{SMS<br/>Replied YES?}
    DashClick{Dashboard<br/>Clicked?}

    Idem[Idempotent Write<br/>PostgreSQL Tx<br/>SELECT FOR UPDATE]
    Cancel[Cancel All<br/>Pending Channels]
    Done[Job Accepted<br/>SMS Customer ETA<br/>Stop Escalation]
    Next[Advance to<br/>Next Tech]

    Start --> Build
    Build --> Loop
    Loop --> Call
    Loop --> SMS
    Loop --> Dash
    Call --> Wait
    SMS --> Wait
    Dash --> Wait
    Wait --> DTMF
    Wait --> SMSReply
    Wait --> DashClick
    DTMF -->|Yes| Idem
    SMSReply -->|Yes| Idem
    DashClick -->|Yes| Idem
    Idem --> Cancel
    Cancel --> Done
    DTMF -->|No, timeout| Next
    SMSReply -->|No, timeout| Next
    DashClick -->|No, timeout| Next
    Next --> Loop
```

The CP guarantee is enforced in PostgreSQL with a `SELECT ... FOR UPDATE` on the dispatch row, then a single `UPDATE` that flips status from `paging` to `accepted`. Whichever channel accepts first wins; the others see the row already locked, no-op, and emit a "duplicate accept" event for audit. The whole transaction is idempotent on (dispatch_id, channel, tech_id) so Twilio's webhook retries never produce two accepted records.
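
A sketch of that transaction with psycopg 3, assuming a `dispatches` table with `status`, `accepted_by`, `accepted_via`, and `accepted_at` columns (the real schema also carries the audit fields):

```python
import psycopg

def try_accept(conn: psycopg.Connection, dispatch_id: str, tech_id: str, channel: str) -> bool:
    """Return True if this channel won the race; False if the job was already accepted."""
    with conn.transaction():
        with conn.cursor() as cur:
            # Lock the dispatch row so only one channel can flip the status.
            cur.execute(
                "SELECT status FROM dispatches WHERE id = %s FOR UPDATE",
                (dispatch_id,),
            )
            row = cur.fetchone()
            if row is None or row[0] != "paging":
                # Already accepted (or cancelled): no-op, emit a duplicate-accept audit event upstream.
                return False
            cur.execute(
                """UPDATE dispatches
                   SET status = 'accepted', accepted_by = %s, accepted_via = %s,
                       accepted_at = now()
                   WHERE id = %s""",
                (tech_id, channel, dispatch_id),
            )
    return True
```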

As soon as a tech accepts, the system fires an outbound SMS to the original customer with the tech's name and ETA — closing the loop in under a minute from the time the customer first called. At 5K concurrent dispatches, P95 round-trip is under 500 ms; most of that is Twilio call setup, not our database.
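
Closing the loop is a single Twilio API call once the `try_accept` helper sketched above wins the race; the number and message copy below are placeholders.

```python
from twilio.rest import Client

def notify_customer(client: Client, customer_phone: str, tech_name: str, eta_min: int) -> None:
    client.messages.create(
        to=customer_phone,
        from_="+15550001234",  # placeholder sending number
        body=f"{tech_name} has accepted your emergency call and is on the way. ETA ~{eta_min} min.",
    )
```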

## Resilience: Circuit Breakers and the Twilio Outage

Every external dependency is wrapped in a circuit breaker. Each agent has its own breaker so a degraded OpenAI region cannot take down SMS dispatch. The breaker exposes three states — closed, half-open, open — and on open, the agent falls back to a deterministic path.

```mermaid
flowchart LR
    Req[HVAC Call/Email Arrives]

    OAI{OpenAI<br/>Circuit?}
    KW[Keyword Triage<br/>Regex Fallback]
    LLM[LangGraph + GPT<br/>Normal HVAC Triage]

    Tw{Twilio Voice<br/>Circuit?}
    Voice[Voice Call to Tech<br/>+ SMS Backup]
    SMSOnly[SMS-Only Page<br/>Maintain Delivery]

    Ack[Tech Accepted<br/>Truck Rolling]

    Req --> OAI
    OAI -->|Closed| LLM
    OAI -->|Open| KW
    LLM --> Tw
    KW --> Tw
    Tw -->|Closed| Voice
    Tw -->|Open| SMSOnly
    Voice --> Ack
    SMSOnly --> Ack
```

This was not theoretical. During a Twilio us-east-1 regional outage, the voice circuit opened within 90 seconds of the first error spike. Every dispatch that followed routed straight to SMS-only. We maintained **99.7% acknowledgment delivery for 150K+ active dispatch users with zero data loss** while Twilio's voice product was down. When Twilio recovered, half-open probe traffic detected it, the circuit closed, and normal voice paging resumed within two minutes.

The keyword fallback is a 200-line file of regex patterns: `no heat`, `gas smell`, `furnace not`, `pilot light`, `boiler leak`, `frozen pipes`, `CO alarm`, `walk-in down`. It is dumber than the LLM, it produces more false positives, and that is fine — when the LLM path is down, the right answer for HVAC is to over-page, not under-page. Conservative degradation, not graceful failure.
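
A stripped-down version of the breaker-plus-fallback pair looks like this; the thresholds, cool-down, and pattern list are illustrative (the production regex file is far longer, and there is one breaker per agent, per provider).

```python
import re
import time

EMERGENCY_PATTERNS = re.compile(
    r"no heat|gas smell|furnace not|pilot light|boiler leak|frozen pipes|co alarm|walk-?in down",
    re.IGNORECASE,
)

def keyword_triage(message: str) -> float:
    """Deterministic fallback: over-page rather than under-page."""
    return 0.9 if EMERGENCY_PATTERNS.search(message) else 0.2

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 60.0):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:          # closed: normal traffic
            return True
        # Half-open: let a probe through once the cool-down has elapsed.
        return time.monotonic() - self.opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self.failures, self.opened_at = 0, None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

def triage(message: str, breaker: CircuitBreaker, llm_score) -> float:
    if not breaker.allow():
        return keyword_triage(message)      # circuit open: deterministic path
    try:
        score = llm_score(message)
        breaker.record_success()
        return score
    except Exception:
        breaker.record_failure()
        return keyword_triage(message)
```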

## Human-in-the-Loop: Owner Approval Mode

Some HVAC owners want to eyeball ambiguous calls before a tech is woken up at 2 AM — especially small shops where the owner pays the on-call premium and a false page costs real money. For those companies, the orchestrator triggers a LangGraph `interrupt()` on any call scoring 0.5–0.75, which suspends the graph and surfaces the agent's reasoning + proposed tech to the owner's phone via push notification.

The owner can:

- **Approve** the proposed dispatch as-is — graph resumes from the checkpoint, no replay.
- **Modify** the on-call ladder (skip Tech A who is on vacation, page Tech B instead) — graph state is patched and resumes.
- **Reject** — graph is killed, customer gets a polite "we'll call you at 7 AM" SMS, audit event recorded.

Resume latency is sub-second because LangGraph checkpoints are kept hot in Redis. Adding owner-approval mode cut **false-positive dispatches by 31%** while preserving fully-automatic dispatch for high-confidence emergencies above 0.75.
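
A minimal sketch of the interrupt-and-resume flow using LangGraph's `interrupt()` and `Command(resume=...)` API; the state shape, node names, and the in-memory checkpointer stand in for the production graph and its Redis-backed checkpoints.

```python
from typing import TypedDict

from langgraph.checkpoint.memory import MemorySaver  # production keeps checkpoints hot in Redis
from langgraph.graph import END, START, StateGraph
from langgraph.types import Command, interrupt

class DispatchState(TypedDict):
    urgency: float
    proposed_tech: str
    approved: bool

def owner_approval(state: DispatchState) -> DispatchState:
    if 0.5 <= state["urgency"] <= 0.75:
        # Suspends the graph; the payload is surfaced to the owner's phone.
        decision = interrupt({"urgency": state["urgency"], "tech": state["proposed_tech"]})
        return {**state, "approved": decision == "approve"}
    return {**state, "approved": True}   # high-confidence emergencies skip the owner

builder = StateGraph(DispatchState)
builder.add_node("owner_approval", owner_approval)
builder.add_edge(START, "owner_approval")
builder.add_edge("owner_approval", END)
graph = builder.compile(checkpointer=MemorySaver())

config = {"configurable": {"thread_id": "dispatch-123"}}
graph.invoke({"urgency": 0.62, "proposed_tech": "tech-a", "approved": False}, config)
# ...later, when the owner taps Approve in the push notification:
graph.invoke(Command(resume="approve"), config)
```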

## Datastore Choices: CP and AP, On Purpose

The CAP trade-off is made deliberately per data class:

- **PostgreSQL (CP)** — primary plus two read replicas behind PgBouncer with 10K max connections. Sustains ~10K writes/sec and ~100K reads/sec. Owns canonical state: customer records, on-call rotations, dispatches, acceptances, audit trail. Strong consistency is non-negotiable — a stale read of dispatch status sends two trucks.
- **Redis Cluster (AP)** — six nodes, 24Gi total, ~1M ops/sec, TTL-based eviction. Owns sessions, rate-limit counters, hot LangGraph checkpoints, and the WebSocket pub/sub. Eventual consistency is acceptable because none of this data is canonical.
- **Elasticsearch** — three sharded nodes fed by Debezium CDC from PostgreSQL. ~200K searches/sec. Owns the searchable view: full-text on customer notes, voicemail transcripts, dispatch history. Always behind PostgreSQL by a few hundred milliseconds; we never read from ES for transactional decisions.

## Observability and Evaluation

The observability stack is the difference between "agent system in production" and "incident waiting to happen at 3 AM." We instrument three layers in parallel and they feed each other.

```mermaid
flowchart TB
    subgraph Production["Production Traces"]
        OTel[OpenTelemetry<br/>Distributed Spans]
        LangSmith[LangSmith<br/>Agent Trace Capture]
        Prom[Prometheus<br/>+ Grafana]
    end

    subgraph Evaluation["Evaluation Loop"]
        Dataset[HVAC Eval Dataset<br/>200+ Edge Cases]
        Replay[Replay-Based<br/>Regression Evals]
        Judge[LLM-as-Judge<br/>Quality Scoring]
    end

    subgraph Mining["Continuous Mining"]
        Mine[Trace Mining<br/>Promote Failures]
        Rotate[Monthly Rotation<br/>New Edge Cases]
    end

    subgraph Alerting["Alerting"]
        P99[P99 Latency Spike]
        Err[Error Rate Threshold]
        Page[PagerDuty]
        MTTR[MTTR 45min → 8min]
    end

    OTel --> Prom
    LangSmith --> Replay
    Dataset --> Replay
    Replay --> Judge
    LangSmith --> Mine
    Mine --> Rotate
    Rotate --> Dataset
    Prom --> P99
    Prom --> Err
    P99 --> Page
    Err --> Page
    Page --> MTTR
```

**LangSmith** captures every agent trace with inputs, outputs, tool calls, latency, and token cost. We replay traces against new prompt versions to catch regressions before they ship. The HVAC eval dataset has 200+ edge cases — multilingual emergencies, false alarms ("the AC is loud"), ambiguous urgency ("there's a smell but I think it's fine"), commercial vs residential, with outdoor temperature context. Every PR that touches a prompt or model runs the full set.
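
The per-PR gate can be approximated as a parametrized test over the dataset. This pytest sketch assumes a local JSON export of the eval cases and a hypothetical `score_hvac_urgency()` helper under test; the real pipeline replays LangSmith traces rather than re-running live calls.

```python
import json

import pytest

from triage import score_hvac_urgency  # hypothetical module under test

with open("evals/hvac_edge_cases.json") as f:       # hypothetical export of the eval set
    CASES = json.load(f)  # [{"message": ..., "outdoor_temp_f": ..., "expected_band": ...}, ...]

BANDS = {
    "dispatch_immediately": (0.9, 1.0),
    "dispatch_tonight": (0.6, 0.9),
    "confirm_with_on_call": (0.3, 0.6),
    "auto_acknowledge": (0.0, 0.3),
}

@pytest.mark.parametrize("case", CASES, ids=lambda c: c["message"][:40])
def test_urgency_lands_in_expected_band(case):
    score = score_hvac_urgency(case["message"], case["outdoor_temp_f"])
    lo, hi = BANDS[case["expected_band"]]
    assert lo <= score <= hi, (
        f"{case['message']!r} scored {score:.2f}, expected band {case['expected_band']}"
    )
```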

**LLM-as-Judge** scores agent decisions on rubrics: was urgency correctly classified, was the right tech paged given skill matrix and geography, was the SMS clear and within Twilio length limits. We track judge agreement against owner-labeled samples and only trust scores when agreement exceeds 0.8.
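
One judge call, sketched with the OpenAI Python SDK; the model name, rubric text, and JSON shape are placeholders, and as noted above the judge's verdicts are only trusted once agreement with owner labels exceeds 0.8.

```python
import json

from openai import OpenAI

client = OpenAI()

JUDGE_RUBRIC = (
    "You are grading an HVAC after-hours triage decision. "
    "Score 1 if the urgency band matches the transcript and outdoor temperature, else 0. "
    'Reply as JSON: {"correct": 0 or 1, "reason": "..."}'
)

def judge_triage(transcript: str, outdoor_temp_f: float, predicted_band: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                         # placeholder judge model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content": f"Transcript: {transcript}\n"
                                        f"Outdoor temp: {outdoor_temp_f}F\n"
                                        f"Predicted band: {predicted_band}"},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```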

**Production metrics** — P95/P99 latency per agent, false-positive rate via owner overrides, tool call counts per turn, dollar cost per dispatch. Alerting on P99 latency spikes and error-rate thresholds dropped **MTTR from 45 minutes to 8 minutes**.

## Deployment and Scaling on Kubernetes

The whole platform runs on a single Kubernetes cluster with namespace isolation per tier. Deployment is GitHub Actions → container registry → Argo Rollouts canary. Prompts and model configs are versioned *separately* from code via a config service so we can roll back a bad prompt in seconds without rebuilding an image.

Specifically for the AI workload:

- **Workers are sized small** — 1 CPU, 1Gi RAM per pod. They are I/O-bound on the LLM call; oversized pods waste money.
- **HPA on queue depth** via the NATS Prometheus exporter. Workers scale 10 → 100 pods on a 60-second window — useful when a winter storm triples after-hours volume.
- **Provider rate limits are the real ceiling**, not pod count. We shard across multiple OpenAI organization keys with a token-bucket limiter; the breaker degrades to fallback when we approach quota.
- **Graceful shutdown** with a 120-second termination grace period. SIGTERM stops NATS pulls, in-flight LangGraph runs checkpoint, then the pod exits. No mid-dispatch drops on rolling deploys (a sketch follows this list).
- **Pre-warm the prompt cache** by keeping system-prompt prefixes stable. A worker that just started serves its first request with a warm cache because the prefix is shared across the fleet.
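
A sketch of the SIGTERM handling behind the graceful-shutdown bullet (the signal wiring is simplified, and `fetch_quietly()` is a hypothetical helper that wraps `sub.fetch()` and swallows timeouts):

```python
import asyncio
import signal

shutting_down = asyncio.Event()

def install_sigterm_handler(loop: asyncio.AbstractEventLoop) -> None:
    # SIGTERM from Kubernetes flips the flag; no new work is pulled after that.
    loop.add_signal_handler(signal.SIGTERM, shutting_down.set)

async def worker_loop(sub, handle_dispatch) -> None:
    while not shutting_down.is_set():
        for msg in await fetch_quietly(sub):
            await handle_dispatch(msg.data)   # LangGraph checkpoints inside this call
            await msg.ack()
    # No new fetches after SIGTERM; the 120s termination grace period covers
    # the in-flight run above before the pod is killed.
```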

The full deployment serves 500K+ active users, 200K+ concurrent WebSocket connections, 25K+ requests/sec at the gateway, and 8K+ messages/sec through the worker tier.

## Lessons From HVAC Production

Three patterns have held up across six months of HVAC dispatches:

1. **The triage taxonomy is more valuable than the model.** Whether you use GPT-4o or Llama 3.1 matters less than whether your scoring rubric correctly distinguishes "no heat at -10°F with a baby in the house" from "the heat seems weak in one room." Spend the first month building the rubric with the owner, not tuning the model.
2. **Make the LLM call optional.** The keyword fallback feels like over-engineering until the day OpenAI has a regional outage during a polar vortex. Then it is the only thing keeping pipes from freezing in your customers' homes. Conservative degradation beats clean failure.
3. **Evaluation is not a gate, it is a loop.** A static eval set rots. Mine production dispatches every week, promote the surprising ones into the dataset, retire the trivial ones. The eval set should always feel slightly harder than yesterday.

## Try CallSphere After-Hours For HVAC

The system described in this post powers CallSphere's After-Hours product for HVAC companies — production-ready emergency triage and tech dispatch. Eleven AI agents, configurable on-call ladders, Twilio voice + SMS + DTMF acceptance, full event-sourced audit trail, owner approval mode for ambiguous calls.

[**Book a 15-minute demo**](https://callsphere.tech/contact) or [see the live dashboard](https://escalation.callsphere.tech).

## FAQ

**Q: How do you tell a real no-heat call from someone who just wants to schedule maintenance?**
The triage agent uses HVAC-specific signals: explicit phrases ("no heat," "freezing"), outdoor temperature pulled from a weather API, household composition if known (infants/elderly), customer call history, and tone markers in the voicemail audio. Anything below 0.6 score is logged for the morning, not paged.

**Q: What happens if both Twilio and OpenAI are down at the same time during a polar vortex?**
Keyword fallback for triage, SMS-only for delivery. If both Twilio voice and SMS are down, the dashboard WebSocket path still surfaces every emergency to the owner in real time. We have not yet seen all three down simultaneously, but the system is designed to over-page in that scenario rather than miss real emergencies.

**Q: How do you prevent two trucks from being dispatched to one job when Twilio retries a webhook?**
Every agent step is idempotent on the call SID + step ID, and PostgreSQL row-level locking on the dispatch record means a duplicate webhook becomes a no-op. The agent re-enters the LangGraph from its last checkpoint instead of starting over.

**Q: Can the owner override the AI before a tech is paged?**
Yes — that is the human-in-the-loop interrupt path. For ambiguous calls scoring 0.5–0.75, the graph suspends and the owner gets a push notification. They can approve, modify the on-call ladder, or reject. Resume latency is sub-second because checkpoints are hot in Redis.

**Q: Does this replace my existing answering service?**
Most HVAC companies on this product fully replaced their answering service within 60 days. The cost difference is significant ($1,500–$5,000/month vs the CallSphere subscription) and the triage accuracy is higher because the model is HVAC-specific instead of a generic call-center script.

