---
title: "Gemini Streaming and Real-Time Responses: Building Responsive Agent UIs"
description: "Implement Gemini streaming for real-time token delivery in agent UIs. Learn stream_generate_content, chunk handling, SSE integration with FastAPI, and building responsive chat interfaces."
canonical: https://callsphere.ai/blog/gemini-streaming-real-time-responses-building-responsive-agent-uis
category: "Learn Agentic AI"
tags: ["Google Gemini", "Streaming", "Real-Time", "FastAPI", "Server-Sent Events"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T01:02:44.232Z
---

# Gemini Streaming and Real-Time Responses: Building Responsive Agent UIs

> Implement Gemini streaming for real-time token delivery in agent UIs. Learn stream_generate_content, chunk handling, SSE integration with FastAPI, and building responsive chat interfaces.

## Why Streaming Matters for Agent UX

When a Gemini API call takes 5-10 seconds to complete, users stare at a loading spinner wondering if something broke. Streaming delivers tokens as they are generated, typically starting within 200-500 milliseconds. The user sees the response forming in real time, which feels dramatically faster even though the total generation time is the same.

For agent applications, streaming is even more important. When your agent calls tools, the user can see "Searching for flights..." appear immediately rather than waiting for the entire tool call and response cycle to finish.

## Basic Streaming

The only change from a standard call is passing `stream=True` to `generate_content`:

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant Edge as Edge Worker
    participant LLM as Gemini API
    participant DB as Logs and Trace
    Client->>Edge: POST /chat (stream=true)
    Edge->>LLM: generate_content(stream=true)
    loop Each token
        LLM-->>Edge: SSE chunk delta
        Edge-->>Client: SSE chunk delta
        Edge->>DB: append token to span
    end
    LLM-->>Edge: finish_reason=STOP
    Edge-->>Client: event: done
    Edge->>DB: finalize trace
```

```python
import google.generativeai as genai
import os

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel("gemini-2.0-flash")

response = model.generate_content(
    "Write a detailed explanation of how transformer attention works.",
    stream=True,
)

for chunk in response:
    if chunk.text:
        print(chunk.text, end="", flush=True)

print()  # Final newline
```

Each chunk contains a portion of the response text. Chunks arrive as soon as the model generates them, so the first chunk typically appears within a few hundred milliseconds.
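
To see this concretely, you can measure time to first chunk yourself. A minimal sketch reusing the `model` configured above (the prompt is arbitrary):

```python
import time

start = time.monotonic()
response = model.generate_content(
    "Explain vector databases in one paragraph.",
    stream=True,
)

first_chunk_at = None
for chunk in response:
    if first_chunk_at is None:
        # Latency to the first streamed chunk, not to the full response
        first_chunk_at = time.monotonic() - start
    if chunk.text:
        print(chunk.text, end="", flush=True)

print(f"\nTime to first chunk: {first_chunk_at:.2f}s")
```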

## Streaming with Chat Sessions

Streaming works seamlessly with multi-turn chat:

```python
model = genai.GenerativeModel("gemini-2.0-flash")
chat = model.start_chat()

def stream_chat(message: str):
    response = chat.send_message(message, stream=True)
    full_response = []

    for chunk in response:
        if chunk.text:
            print(chunk.text, end="", flush=True)
            full_response.append(chunk.text)

    print()
    return "".join(full_response)

stream_chat("What are the main differences between REST and GraphQL?")
stream_chat("Which would you recommend for a real-time dashboard?")
```

The chat history is maintained across streaming calls, so follow-up questions work correctly.
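
Note that with streaming, the new turn is committed to `chat.history` only after the iterator has been fully consumed. You can inspect the accumulated history afterwards, assuming the `stream_chat` calls above have completed:

```python
# Each history entry is a Content object with a role ("user" or "model")
# and a list of parts holding the accumulated text.
for content in chat.history:
    text = "".join(part.text for part in content.parts)
    print(f"{content.role}: {text[:60]}")
```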

## Async Streaming for Web Applications

For web servers, use the async streaming interface to avoid blocking the event loop:

```python
import google.generativeai as genai
import asyncio
import os

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel("gemini-2.0-flash")

async def stream_response(prompt: str):
    response = await model.generate_content_async(
        prompt,
        stream=True,
    )

    full_text = []
    async for chunk in response:
        if chunk.text:
            full_text.append(chunk.text)
            yield chunk.text

    # After iteration, usage metadata is available
    # Access via response.usage_metadata if needed
```
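
A minimal driver for the generator above; in a real application the `async for` loop lives inside your web framework's request handler instead:

```python
async def main():
    async for piece in stream_response("Summarize the CAP theorem in two sentences."):
        print(piece, end="", flush=True)
    print()

asyncio.run(main())
```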

## Server-Sent Events with FastAPI

Here is a complete FastAPI endpoint that streams Gemini responses to the browser using SSE:

```python
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
import google.generativeai as genai
import json
import os

app = FastAPI()
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

@app.post("/api/chat/stream")
async def chat_stream(request: Request):
    body = await request.json()
    prompt = body["message"]

    async def event_generator():
        response = await model.generate_content_async(prompt, stream=True)

        async for chunk in response:
            if chunk.text:
                data = json.dumps({"type": "text", "content": chunk.text})
                yield f"data: {data}\n\n"

        yield f"data: {json.dumps({'type': 'done'})}\n\n"

    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
        },
    )
```
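
In production you will also want failures surfaced to the client rather than the stream dying silently. A sketch of the same generator with error reporting; the `error` event type is a convention of this example, not part of the SSE spec:

```python
    async def event_generator():
        try:
            response = await model.generate_content_async(prompt, stream=True)
            async for chunk in response:
                if chunk.text:
                    data = json.dumps({"type": "text", "content": chunk.text})
                    yield f"data: {data}\n\n"
            yield f"data: {json.dumps({'type': 'done'})}\n\n"
        except Exception as exc:
            # Let the client render an error state instead of hanging
            yield f"data: {json.dumps({'type': 'error', 'message': str(exc)})}\n\n"
```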

## Client-Side SSE Consumption

On the frontend, consume the stream with `fetch` and a stream reader (the `EventSource` API only supports GET requests, so it cannot call this POST endpoint):

```javascript
async function streamChat(message) {
    const response = await fetch('/api/chat/stream', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ message }),
    });

    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    let buffer = '';

    while (true) {
        const { done, value } = await reader.read();
        if (done) break;

        // A network chunk can end mid-line, so buffer until a full line arrives
        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split('\n');
        buffer = lines.pop();  // keep the trailing partial line

        for (const line of lines) {
            if (line.startsWith('data: ')) {
                const data = JSON.parse(line.slice(6));
                if (data.type === 'text') {
                    appendToChat(data.content);  // your UI's DOM update helper
                }
            }
        }
    }
}
```

## Streaming with Function Calling

When streaming is combined with function calling, you receive function call chunks that signal when to execute tools:

```python
def get_stock_price(symbol: str) -> dict:
    """Get the current stock price.

    Args:
        symbol: Stock ticker symbol, e.g. 'AAPL'.
    """
    prices = {"AAPL": 198.50, "GOOGL": 175.30, "MSFT": 420.15}
    return {"symbol": symbol, "price": prices.get(symbol, 0)}

model = genai.GenerativeModel(
    "gemini-2.0-flash",
    tools=[get_stock_price],
)

chat = model.start_chat()

response = chat.send_message(
    "What is Apple's stock price?",
    stream=True,
)

for chunk in response:
    for part in chunk.parts:
        if part.function_call:
            fc = part.function_call
            print(f"Calling tool: {fc.name}({dict(fc.args)})")
            result = get_stock_price(**dict(fc.args))
            # Send result back and continue streaming
```

This allows your UI to show "Looking up AAPL stock price..." in real time while the tool executes.
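
To finish the cycle, send the tool result back as a `FunctionResponse` part and stream the model's final answer. A sketch continuing the loop above (it assumes a single function call, with `fc` and `result` taken from the loop body):

```python
from google.generativeai import protos

follow_up = chat.send_message(
    protos.Content(parts=[
        protos.Part(function_response=protos.FunctionResponse(
            name=fc.name,
            response={"result": result},
        ))
    ]),
    stream=True,
)

# The model now streams its natural-language answer using the tool result
for chunk in follow_up:
    if chunk.text:
        print(chunk.text, end="", flush=True)
```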

## FAQ

### Does streaming affect token costs?

No. Streaming delivers the same tokens as non-streaming — it just delivers them incrementally. The total cost is identical regardless of whether you use streaming.

### Can I abort a streaming response mid-way?

Yes. Simply stop iterating over the response object. The connection will be closed and no further tokens will be generated. This is useful for implementing "Stop generating" buttons in chat UIs.
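
In the FastAPI endpoint above, one way to wire up a Stop button is to check Starlette's `request.is_disconnected()` inside the loop, so an aborted `fetch` on the client also halts generation server-side. A sketch of the modified generator:

```python
    async def event_generator():
        response = await model.generate_content_async(prompt, stream=True)
        async for chunk in response:
            # A Stop button can call AbortController.abort() on the client;
            # the dropped connection shows up here as a disconnect.
            if await request.is_disconnected():
                break
            if chunk.text:
                yield f"data: {json.dumps({'type': 'text', 'content': chunk.text})}\n\n"
```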

### What happens if the network drops during streaming?

The iterator will raise an exception. Implement retry logic that re-sends the request. Since Gemini API calls are not resumable, you need to restart the full generation. Consider saving partial responses so the user does not lose context.
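
A sketch of that retry pattern: restart the whole request on failure, keeping any partial text so the UI can display it while retrying (the backoff policy here is an arbitrary choice):

```python
import time

def stream_with_retry(prompt: str, max_attempts: int = 3) -> str:
    for attempt in range(max_attempts):
        partial = []
        try:
            for chunk in model.generate_content(prompt, stream=True):
                if chunk.text:
                    partial.append(chunk.text)
            return "".join(partial)
        except Exception:
            # Generation is not resumable: show the partial text,
            # back off, then restart the full request.
            print(f"\n[stream dropped after {len(partial)} chunks, retrying]")
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt)
    return ""  # unreachable: the last attempt either returns or raises
```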
