---
title: "7 MLOps & AI Deployment Interview Questions for 2026"
description: "Real MLOps and AI deployment interview questions from Google, Amazon, Meta, and Microsoft in 2026. Covers CI/CD for ML, model monitoring, quantization, continuous batching, serving infrastructure, and evaluation frameworks."
canonical: https://callsphere.ai/blog/mlops-ai-deployment-interview-questions-2026
category: "AI Interview Prep"
tags: ["AI Interview", "MLOps", "Model Deployment", "CI/CD", "Google", "Amazon", "Quantization", "vLLM", "2026"]
author: "CallSphere Team"
published: 2026-03-24T00:00:00.000Z
updated: 2026-05-08T10:49:13.066Z
---

# 7 MLOps & AI Deployment Interview Questions for 2026

> Real MLOps and AI deployment interview questions from Google, Amazon, Meta, and Microsoft in 2026. Covers CI/CD for ML, model monitoring, quantization, continuous batching, serving infrastructure, and evaluation frameworks.

## MLOps in 2026: From "Nice to Have" to "Core Interview Topic"

Two years ago, MLOps questions were optional — asked at infrastructure-heavy companies but skipped at AI labs. In 2026, **every** AI role includes MLOps because every company is deploying models to production. If you can't get a model from a notebook to a scalable service, you're not a complete AI engineer.

```mermaid
flowchart LR
    FP16(["FP16 model
baseline weights"])
    CALIB["Calibration set
128 to 1024 samples"]
    METHOD{"Quantization
method"}
    GPTQ["GPTQ
weight only INT4"]
    AWQ["AWQ
activation aware"]
    GGUF["llama.cpp GGUF
K-quants for CPU"]
    EVAL["Eval delta vs FP16
perplexity, MMLU"]
    SERVE[("Serve on
consumer GPU")]
    FP16 --> CALIB --> METHOD
    METHOD --> GPTQ --> EVAL
    METHOD --> AWQ --> EVAL
    METHOD --> GGUF --> EVAL
    EVAL --> SERVE
    style METHOD fill:#4f46e5,stroke:#4338ca,color:#fff
    style EVAL fill:#f59e0b,stroke:#d97706,color:#1f2937
    style SERVE fill:#059669,stroke:#047857,color:#fff
```

These 7 questions cover the real deployment challenges companies face today.

---

## Q1: Design a CI/CD Pipeline for ML Models

**Difficulty:** Medium · **Asked at:** Google, Amazon, Microsoft

### What They're Really Testing

They want to see that you understand ML CI/CD is **fundamentally different** from software CI/CD. In software, if the code compiles and tests pass, you're good. In ML, the code can work perfectly but the model can still be garbage.

### Pipeline Architecture

```
Code Change → Linting + Unit Tests
                  │
                  ▼
           Data Validation (schema checks, distribution checks)
                  │
                  ▼
           Model Training (on standardized environment)
                  │
                  ▼
           Model Evaluation
           ├── Offline Metrics (accuracy, F1, perplexity)
           ├── Regression Tests (known inputs → expected outputs)
           ├── Fairness Checks (performance across demographic groups)
           └── Performance Benchmarks (latency, throughput, memory)
                  │
                  ▼
           Model Registry (version, tag, artifact store)
                  │
                  ▼
           Staging Deployment → Integration Tests
                  │
                  ▼
           Canary (5% traffic) → Monitor metrics
                  │
                  ▼
           Full Rollout (auto if metrics pass, manual gate option)
```

### Key Differences from Software CI/CD

| Aspect | Software CI/CD | ML CI/CD |
| --- | --- | --- |
| **What changes** | Code only | Code + data + model weights |
| **Tests** | Unit + integration tests | + model quality tests + data quality tests |
| **Artifact** | Docker image | Docker image + model weights + config |
| **Rollback trigger** | Errors, crashes | + metric degradation, data drift |
| **Pipeline trigger** | Code push | + data change, scheduled retraining |

**Key Talking Points**

- **Data versioning** (DVC, LakeFS) is as important as code versioning. You need to reproduce any past training run.
- **Model registry** (MLflow, Weights & Biases) tracks model lineage: which data + code + hyperparameters produced this model.
- **Canary deployment** for ML: Route 5% of traffic to new model, compare key metrics against baseline. Auto-rollback if metrics degrade by >X%.
- **Shadow deployment**: Run new model in parallel, log predictions but serve old model's predictions. Compare offline before switching.
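The canary auto-rollback rule above can be sketched in a few lines. This is an illustrative sketch, not a real deployment API: the function name, the 2% degradation budget, and the assumption that higher is better for every metric are all choices made for the example.

```python
# Compare each canary metric to the baseline and roll back on relative
# degradation beyond a budget (the ">X%" rule from the talking points).

def canary_decision(baseline: dict, canary: dict, max_degradation: float = 0.02) -> str:
    """Return 'promote' only if no monitored metric drops more than
    max_degradation (relative) versus the baseline model."""
    for metric, base_value in baseline.items():
        canary_value = canary.get(metric)
        if canary_value is None:
            return "rollback"  # a missing metric counts as a failure
        # This sketch assumes higher is better for every metric
        relative_drop = (base_value - canary_value) / base_value
        if relative_drop > max_degradation:
            return "rollback"
    return "promote"

decision = canary_decision(
    baseline={"ctr": 0.041, "conversion": 0.0120},
    canary={"ctr": 0.0405, "conversion": 0.0122},
)
```

In an interview, the interesting follow-up is which metrics go in `baseline`: statistical ones catch problems early, business ones prove impact, and you usually gate on both.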

---

## Q2: How Do You Monitor Models in Production? What Is Data Drift?

**Difficulty:** Medium · **Asked at:** Widely asked

### Three Types of Drift

**1. Data Drift (Covariate Shift)**

- The input distribution changes: e.g., your model was trained on US English, but suddenly gets 30% Spanish queries
- Detection: Compare feature distributions between training data and production inputs using KL divergence, PSI (Population Stability Index), or KS test

**2. Concept Drift**

- The relationship between inputs and outputs changes: e.g., what users consider a "good recommendation" shifts during holiday season
- Detection: Monitor prediction-to-outcome correlation over time

**3. Model Performance Drift**

- Model accuracy degrades even without data drift: e.g., the world changes (new products, new slang) and the model's knowledge becomes stale
- Detection: Monitor key business metrics (click-through rate, conversion, CSAT) and compare against rolling baselines
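The PSI check mentioned under data drift fits in a few lines of plain Python. The equal-width binning and the epsilon smoothing for empty bins are implementation choices for this sketch; the interpretation thresholds in the docstring are the commonly quoted rules of thumb.

```python
import math

def psi(expected: list[float], actual: list[float], n_bins: int = 10) -> float:
    """Population Stability Index between a training-time feature sample
    (expected) and a production sample (actual). Rule of thumb:
    PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0
    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = min(int((v - lo) / width), n_bins - 1)
            idx = max(idx, 0)  # clamp production values below the training min
            counts[idx] += 1
        # Small epsilon avoids log(0) for empty bins
        return [(c + 1e-6) / (len(values) + 1e-6 * n_bins) for c in counts]
    p = bin_fractions(expected)
    q = bin_fractions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

In production you would run this per feature on a schedule and alert when any feature crosses the 0.25 line; libraries like Evidently AI package the same idea with nicer reporting.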

### Production Monitoring Stack

```
Production Traffic
    │
    ├── Input Monitoring
    │   ├── Feature distribution tracking
    │   ├── Missing value rates
    │   ├── Schema validation
    │   └── Volume monitoring (QPS anomalies)
    │
    ├── Output Monitoring
    │   ├── Prediction distribution (confidence scores)
    │   ├── Class balance (is the model suddenly predicting one class 99%?)
    │   ├── Latency (p50, p95, p99)
    │   └── Error rates
    │
    └── Outcome Monitoring
        ├── Business metrics correlation
        ├── Human feedback aggregation
        └── Delayed label comparison (when ground truth becomes available)
```

**Key Talking Points**

- "The most dangerous drift is **silent drift** — the model keeps producing outputs with high confidence, but the outputs are wrong because the world has changed. This is why you can't just monitor model confidence; you need ground-truth labels (even sampled/delayed) to catch real degradation."
- "I set up **two types of alerts**: statistical (distribution has shifted by >X) and business (conversion rate dropped >Y%). Statistical alerts catch drift early; business alerts catch impact."
- Mention tools: Evidently AI, WhyLabs, Arize, or custom Prometheus + Grafana dashboards for monitoring.

---

## Q3: Explain Quantization for LLM Deployment (INT8, INT4, FP8)

**Difficulty:** Hard · **Asked at:** OpenAI, Anthropic, Meta

### Why Quantization Matters

A 70B parameter model in FP16 requires **140 GB** of GPU memory — almost 2 H100s just for the weights. Quantization compresses model weights to lower precision, reducing memory and speeding up inference.

### Quantization Formats

| Format | Bits | Memory (70B) | Quality Loss | Speed Gain |
| --- | --- | --- | --- | --- |
| FP32 | 32 | 280 GB | Baseline | Baseline |
| FP16/BF16 | 16 | 140 GB | None | 2x |
| FP8 | 8 | 70 GB | Minimal | 3-4x |
| INT8 | 8 | 70 GB | Very small | 3-4x |
| INT4 (GPTQ/AWQ) | 4 | 35 GB | Small-moderate | 5-7x |
| NF4 (QLoRA) | 4 | 35 GB | Small | 5-7x (training) |

### Key Techniques

**Post-Training Quantization (PTQ)**:

- Quantize after training with a small calibration dataset
- GPTQ: Layer-by-layer quantization minimizing reconstruction error
- AWQ: Activation-Aware — protects salient weights (high activation channels) from aggressive quantization

**Quantization-Aware Training (QAT)**:

- Simulate quantization during training so the model learns to be robust
- Higher quality but requires full training pipeline

**Dynamic vs. Static Quantization**:

- Static: Compute scale factors once using calibration data. Faster inference.
- Dynamic: Compute scale factors per batch at runtime. Better quality, slight overhead.
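To make the mechanics concrete, here is a minimal NumPy sketch of symmetric per-tensor INT8 quantization, the simplest PTQ baseline. Real methods like GPTQ and AWQ use per-channel scales, calibration data, and error-compensation tricks, so treat this only as the starting point they improve on.

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization: one scale maps the largest
    absolute weight to +/-127."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
# Worst-case rounding error is half a quantization step (scale / 2)
err = float(np.abs(w - dequantize(q, scale)).max())
```

The weakness of one scale per tensor is exactly what AWQ exploits: a few channels with large activations dominate the scale, so protecting them at higher precision preserves quality.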

**Key Talking Points**

- "The rule of thumb: **INT8 is nearly lossless** for most models. INT4 degrades quality by 1-3% on benchmarks but halves the memory again. For production, INT8 is the sweet spot unless you're extremely memory-constrained."
- "**FP8 (E4M3/E5M2)** is the emerging standard on H100s and newer GPUs. It has native hardware support, so you get the memory savings of INT8 with better numerical properties for training."
- "AWQ > GPTQ in most benchmarks because it identifies which weight channels have high activation magnitudes and keeps those at higher precision. This preserves the model's most important computation paths."
- "Quantization + speculative decoding stack: quantize both draft and target models, getting compound speedups."

---

## Q4: Describe Continuous Batching for LLM Serving. Why Is It Better?

**Difficulty:** Medium · **Asked at:** OpenAI, Anthropic

### Static Batching (The Old Way)

```
Request A (10 tokens)  ████████████████████░░░░░░░░░░  (waits)
Request B (30 tokens)  ████████████████████████████████████████████████████████████
Request C (5 tokens)   ██████████░░░░░░░░░░░░░░░░░░░░  (waits a LOT)

All 3 must wait for the longest request (B) to finish.
GPU is idle for A and C after they complete.
```

### Continuous Batching (The Modern Way)

```
Iteration 1: Process [A, B, C] together
Iteration 2: A finishes → replace with new Request D
             Process [D, B, C] together
Iteration 3: C finishes → replace with Request E
             Process [D, B, E] together
```

**Key insight**: As soon as one request in the batch finishes generating, a new request takes its slot. The GPU is **never idle** waiting for the longest request.
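A toy simulation makes the win easy to quantify. Each "step" below stands in for one decode iteration that produces one token per in-flight request; the token counts and batch size are made up for illustration.

```python
from collections import deque

def continuous_batching_steps(token_counts: list[int], batch_size: int) -> int:
    """Count decode iterations needed when finished requests are
    immediately replaced from the queue (iteration-level scheduling)."""
    queue = deque(token_counts)
    batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
    steps = 0
    while batch:
        steps += 1
        batch = [t - 1 for t in batch]            # one decode step for all
        batch = [t for t in batch if t > 0]       # drop finished requests
        while queue and len(batch) < batch_size:  # refill freed slots
            batch.append(queue.popleft())
    return steps

steps = continuous_batching_steps([10, 30, 5, 8, 12], batch_size=3)
```

With the same arrival order, static batching would run batch [10, 30, 5] to completion (30 steps) and then [8, 12] (12 steps) for 42 total; continuous batching finishes in 30 steps, bounded only by the longest request.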

### Performance Impact

| Metric | Static Batching | Continuous Batching |
| --- | --- | --- |
| GPU Utilization | 30-50% | 80-95% |
| Throughput | Baseline | 2-3x higher |
| Latency variance | Very high (short reqs wait for long) | Low (each req finishes independently) |

### How vLLM Implements This

vLLM combines continuous batching with **PagedAttention**:

- KV cache managed as virtual memory pages (not contiguous blocks)
- New requests can be inserted without pre-allocating maximum sequence length
- Memory waste reduced by ~55% vs. static allocation

**Key Talking Points**

- "The key implementation challenge is **iteration-level scheduling** — the serving engine must decide at every decoding step which requests are in the current batch. This requires an efficient scheduler that can handle thousands of concurrent requests."
- "Continuous batching pairs well with **prefix caching** — if multiple requests share the same system prompt, they share the KV cache for that prefix. This is common in production (all requests to a customer support bot share the same system prompt)."
- "Mention specific frameworks: vLLM (PagedAttention, most popular), TGI (HuggingFace), TensorRT-LLM (NVIDIA, best raw performance), SGLang (frontier research)."

---

## Q5: How Would You Implement an Automated ML Pipeline?

**Difficulty:** Hard · **Asked at:** Amazon, Google, Microsoft

### End-to-End ML Pipeline

```
Data Sources → Ingestion → Validation → Transformation → Training → Evaluation → Registry → Serving
     │             │            │             │              │            │           │          │
     ▼             ▼            ▼             ▼              ▼            ▼           ▼          ▼
  S3/DB      Airflow/       Great         Feature       GPU Cluster   Eval Suite  MLflow     K8s +
             Prefect     Expectations     Store          (spot)       + gates              vLLM/TGI
```

### Component Choices

| Component | Tool Options | Key Consideration |
| --- | --- | --- |
| **Orchestration** | Airflow, Prefect, Kubeflow Pipelines | DAG management, retry logic, scheduling |
| **Data Validation** | Great Expectations, Pandera | Schema + distribution checks before training |
| **Feature Store** | Feast, Tecton, Vertex AI | Offline/online feature consistency |
| **Training** | SageMaker, Vertex AI, bare K8s + spot GPUs | Cost optimization via spot instances |
| **Experiment Tracking** | W&B, MLflow, Neptune | Hyperparameter search, metric comparison |
| **Model Registry** | MLflow, SageMaker Model Registry | Versioning, staging, approval workflows |
| **Serving** | vLLM, TGI, Triton, SageMaker Endpoints | Auto-scaling, A/B testing, shadow mode |

### Pipeline Triggers

- **Scheduled**: Retrain weekly/monthly on new data
- **Data-driven**: Trigger when new data exceeds threshold (e.g., 10K new labeled examples)
- **Drift-driven**: Trigger when monitoring detects data drift or performance degradation
- **Manual**: Data scientist triggers after experiment validates improvement
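A go/no-go evaluation gate, the kind that sits between the Evaluation and Registry stages above, can be sketched as a list of named checks. The specific thresholds (0.90 accuracy floor, 0.5-point regression tolerance, 200 ms p95 budget) are placeholders, not recommendations.

```python
from dataclasses import dataclass

@dataclass
class GateResult:
    name: str
    passed: bool
    detail: str

def run_deployment_gates(metrics: dict, baseline: dict) -> list[GateResult]:
    """Every gate must pass before the model is pushed to the registry."""
    gates = [
        ("min_accuracy", metrics["accuracy"] >= 0.90,
         f"accuracy={metrics['accuracy']:.3f} (floor 0.90)"),
        ("no_regression", metrics["accuracy"] >= baseline["accuracy"] - 0.005,
         "within 0.5pt of current production model"),
        ("latency_budget", metrics["p95_latency_ms"] <= 200,
         f"p95={metrics['p95_latency_ms']}ms (budget 200ms)"),
    ]
    return [GateResult(n, ok, d) for n, ok, d in gates]

results = run_deployment_gates(
    metrics={"accuracy": 0.93, "p95_latency_ms": 180},
    baseline={"accuracy": 0.92},
)
deploy = all(r.passed for r in results)
```

Returning structured results rather than a bare boolean matters in practice: when a pipeline blocks a deploy at 2 a.m., the on-call engineer needs to see which gate failed and by how much.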

**Key Talking Points**

- "The hardest part isn't building the pipeline — it's building the **evaluation gates**. Every pipeline stage needs a go/no-go decision: Is the data quality good enough to train? Is the model quality good enough to deploy? These gates prevent bad models from reaching production."
- "**Cost optimization** is critical: Use spot/preemptible instances for training (3-5x cheaper), with checkpointing for fault tolerance. For serving, right-size GPU instances — don't use an A100 for a model that fits on a T4."
- At Amazon: tie to Leadership Principles — "Frugality" means cost-optimized infrastructure, "Bias for Action" means automated pipelines over manual deployments.

---

## Q6: Design an Evaluation Framework for Testing Ranking Models in Production

**Difficulty:** Medium · **Asked at:** Meta

### Offline Evaluation

**Metrics**:

- **NDCG (Normalized Discounted Cumulative Gain)**: Measures ranking quality — are the best items at the top?
- **MAP (Mean Average Precision)**: Average precision across all relevant items
- **MRR (Mean Reciprocal Rank)**: How far down is the first relevant result?

**Methodology**:

- Hold-out test set from recent data (not randomly sampled — temporal split to avoid leakage)
- Compute metrics on the test set for both old and new model
- Statistical significance testing (paired t-test or bootstrap confidence intervals)
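NDCG@k is short enough to write out, which interviewers sometimes ask for. In this sketch `ranked` holds the graded relevance label of each item in the order the model ranked them; the log base-2 discount is the standard formulation.

```python
import math

def dcg(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the top-k positions."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg(ranked: list[float], k: int) -> float:
    """Normalize DCG by the ideal (perfectly sorted) ordering."""
    ideal = dcg(sorted(ranked, reverse=True), k)
    return dcg(ranked, k) / ideal if ideal > 0 else 0.0

score = ndcg([3, 2, 0, 1], k=4)  # near-ideal order, so close to 1.0
```

Note the zero-ideal guard: for queries with no relevant results, NDCG is conventionally defined as 0 (or the query is excluded) rather than dividing by zero.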

### Online Evaluation (A/B Testing)

```
Production Traffic
    │
    ├── 50% → Control (current model)
    │         Measure: CTR, engagement, revenue
    │
    └── 50% → Treatment (new model)
              Measure: CTR, engagement, revenue

    → Statistical test after N days/users → Ship or revert
```

### Interleaving (The Meta Approach)

Instead of splitting users between models, **interleave results** from both models in a single result list for each user:

```
Position 1: Model A's top result
Position 2: Model B's top result
Position 3: Model A's 2nd result
Position 4: Model B's 2nd result
...
```

Count which model's results get more clicks. Interleaving is far more sensitive than a traditional A/B split, often requiring roughly 10x fewer users for the same statistical power.
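The scheme above is a simplified alternation; production systems typically use team-draft interleaving, which handles duplicate results and randomizes draft order per pair. A hedged sketch, with illustrative function names:

```python
import random

def team_draft_interleave(list_a, list_b, rng=None):
    """Alternate drafting the next unseen result from each model and
    remember which model contributed each slot."""
    rng = rng or random.Random(0)
    merged, team = [], []
    ia = ib = 0
    while ia < len(list_a) or ib < len(list_b):
        order = ("A", "B") if rng.random() < 0.5 else ("B", "A")
        for model in order:
            src, idx = (list_a, ia) if model == "A" else (list_b, ib)
            while idx < len(src) and src[idx] in merged:
                idx += 1  # skip items the other model already placed
            if idx < len(src):
                merged.append(src[idx])
                team.append(model)
            if model == "A":
                ia = idx + 1
            else:
                ib = idx + 1
    return merged, team

def score_clicks(team, clicked_positions):
    """Attribute each click to the model that supplied that slot."""
    wins = {"A": 0, "B": 0}
    for pos in clicked_positions:
        wins[team[pos]] += 1
    return wins
```

Aggregated over many sessions, whichever model wins more click-attributed comparisons is the better ranker; the per-pair coin flip removes position bias from who drafts first.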

**Key Talking Points**

- "Offline metrics can disagree with online metrics. A model with better NDCG might have worse user engagement because it optimizes for relevance without considering **diversity** (users get bored seeing similar results)."
- "Guard against **novelty effects**: Users might click more on a new ranking initially because it's different, not because it's better. Run experiments for at least 2 weeks."
- "Long-term metrics matter: A ranking change might boost short-term CTR but reduce long-term retention. Track both."

---

## Q7: Explain Model Serving Infrastructure (vLLM, TGI, TensorRT-LLM)

**Difficulty:** Medium · **Asked at:** Amazon, Google, Microsoft

### The Serving Stack

```
API Gateway (rate limiting, auth)
    → Load Balancer (route to least-loaded GPU)
        → Serving Framework (vLLM / TGI / TensorRT-LLM)
            → GPU Inference (model loaded in GPU memory)
                → Response Streaming (SSE / WebSocket)
```

### Framework Comparison

| Feature | vLLM | TGI (HuggingFace) | TensorRT-LLM (NVIDIA) |
| --- | --- | --- | --- |
| **Key Innovation** | PagedAttention | Production-ready, easy deploy | Kernel-level optimization |
| **Performance** | High | Good | Highest (NVIDIA-specific) |
| **Ease of Use** | pip install | Docker image | Complex build process |
| **Hardware** | Any GPU | Any GPU | NVIDIA only |
| **Continuous Batching** | Yes | Yes | Yes |
| **Quantization** | GPTQ, AWQ, FP8 | GPTQ, bitsandbytes | INT8, INT4, FP8 (native) |
| **Best For** | General use, flexibility | Quick deployment | Maximum throughput |

### Auto-Scaling Strategy

- **Metric**: Scale on GPU utilization + request queue depth (not CPU, which is misleading for GPU workloads)
- **Scale-up**: When queue depth > threshold for > 30 seconds
- **Scale-down**: When GPU utilization < threshold for > 5 minutes (aggressive cooldown to save costs)
- **Minimum replicas**: Always keep 1+ warm (cold start for loading model weights = 30-120 seconds)
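The scale-up/scale-down rules above can be expressed as a small stateful decision function. Everything here is a sketch, not a real autoscaler API: the class name, the queue-depth threshold of 20, and the 30% utilization floor are assumptions for illustration; only the 30-second and 5-minute sustain windows come from the bullets above.

```python
class Autoscaler:
    """Scale decisions from queue depth and GPU utilization, each
    required to persist for a sustain window before acting."""

    def __init__(self, queue_high=20, util_low=0.30,
                 sustain_s=30.0, scaledown_sustain_s=300.0):
        self.queue_high, self.util_low = queue_high, util_low
        self.sustain_s, self.scaledown_sustain_s = sustain_s, scaledown_sustain_s
        self.high_since = None  # timestamp queue first exceeded threshold
        self.low_since = None   # timestamp utilization first dropped

    def decide(self, queue_depth: int, gpu_util: float, now: float) -> str:
        if queue_depth > self.queue_high:
            if self.high_since is None:
                self.high_since = now
            if now - self.high_since >= self.sustain_s:
                return "scale_up"
        else:
            self.high_since = None
        if gpu_util < self.util_low:
            if self.low_since is None:
                self.low_since = now
            if now - self.low_since >= self.scaledown_sustain_s:
                return "scale_down"
        else:
            self.low_since = None
        return "hold"
```

The asymmetric windows encode the economics: scaling up late means queued users, while scaling down early means a 30-120 second cold start the moment traffic returns.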

**Key Talking Points**

- "In practice, I'd start with **vLLM** for most use cases — it has the best developer experience and PagedAttention gives you 90%+ of TensorRT-LLM's throughput with much less complexity."
- "For **maximum throughput** at scale (millions of requests/day), TensorRT-LLM with custom CUDA kernels and FP8 quantization on H100s is the gold standard."
- "**Multi-model serving**: If you need to serve multiple models, consider frameworks that support model multiplexing — load multiple LoRA adapters on a single base model rather than running separate instances."
- "Discuss **cost**: GPU inference is expensive. A single H100 is ~$2-3/hr. At 50 tokens/sec output, that's ~$0.004 per 100 tokens. Compare to API pricing ($0.01-0.06 per 100 tokens) to decide build-vs-buy."

---

## Frequently Asked Questions

### How important is MLOps knowledge for AI engineering interviews?

It's now a core competency, not optional. Even AI labs like OpenAI and Anthropic ask about deployment, monitoring, and evaluation because they ship models to millions of users. At applied AI companies (Amazon, Microsoft, Google), it's often 25-30% of the interview signal.

### Do I need to know specific tools like vLLM or MLflow?

Knowing specific tools demonstrates practical experience. But concepts matter more — if you can explain continuous batching, quantization trade-offs, and monitoring strategies, the specific tool names are secondary.

### What's the difference between MLOps and traditional DevOps?

MLOps adds three dimensions: (1) data management (versioning, quality, drift), (2) model management (training, evaluation, registry), and (3) experiment tracking (hyperparameters, metrics, reproducibility). DevOps principles (CI/CD, monitoring, infrastructure-as-code) still apply but are extended for ML-specific challenges.

---

Source: https://callsphere.ai/blog/mlops-ai-deployment-interview-questions-2026
