
Installing and Configuring Microsoft UFO: Getting Started with Windows Automation

Step-by-step guide to installing Microsoft UFO, configuring API keys, setting up the configuration files, and running your first automated Windows task with natural language.

Prerequisites

Before installing UFO, ensure your system meets the following requirements:

  • Windows 10 or 11 (UFO uses Windows UI Automation APIs that are not available on macOS or Linux)
  • Python 3.10 or later installed and added to PATH
  • An OpenAI API key with access to GPT-4V or GPT-4o (vision-capable models)
  • Git for cloning the repository
  • At least 8 GB of RAM (screenshots and vision model calls are memory-intensive)

UFO depends on the Windows UI Automation COM interfaces, so it must run on a Windows machine — not WSL, not a Linux VM. If you are developing on macOS or Linux, you will need a Windows machine or a cloud Windows instance.
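The two hard requirements above (native Windows, Python 3.10+) can be checked up front. This helper is ours, not part of the UFO codebase; it simply encodes the prerequisite list:

```python
import sys

# Preflight check before cloning UFO. The function name is hypothetical;
# the requirements it encodes come from the prerequisite list above.
def check_ufo_prerequisites(os_name: str = sys.platform,
                            py_version: tuple = tuple(sys.version_info[:2])) -> list:
    """Return a list of problems; an empty list means the host looks OK."""
    problems = []
    if os_name != "win32":  # UI Automation COM interfaces exist only on native Windows
        problems.append("Windows 10/11 required (WSL and Linux VMs will not work)")
    if py_version < (3, 10):
        problems.append("Python 3.10 or later required")
    return problems

if __name__ == "__main__":
    for issue in check_ufo_prerequisites():
        print("WARNING:", issue)
```

Run it with your system defaults, or pass explicit values to see how it behaves on an unsupported host.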

Step 1: Clone the Repository

UFO is distributed as a GitHub repository, not a PyPI package. Clone it and enter the project directory:

git clone https://github.com/microsoft/UFO.git
cd UFO

Step 2: Create a Virtual Environment and Install Dependencies

Set up an isolated Python environment:

python -m venv .venv
.venv\Scripts\activate

pip install -r requirements.txt

The requirements include openai, Pillow for screenshot handling, pywinauto for Windows UI Automation, and several other dependencies for image processing and control interaction.
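A common failure mode is installing into the wrong environment. You can confirm the key packages resolved by probing their import names (Pillow imports as `PIL`, PyYAML as `yaml`). This check is our suggestion, not part of UFO:

```python
import importlib.util

# Import names for the packages mentioned above. If any are missing,
# `pip install -r requirements.txt` likely ran outside the activated venv.
REQUIRED_IMPORTS = ["openai", "PIL", "pywinauto", "yaml"]

def missing_packages(names=REQUIRED_IMPORTS):
    """Return the import names that cannot be resolved in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    gaps = missing_packages()
    print("All set" if not gaps else "Missing: " + ", ".join(gaps))
```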

Step 3: Configure API Keys

UFO reads its configuration from YAML files in the ufo/config/ directory. The primary file you need to edit is config.yaml. Create it from the template:

copy ufo\config\config.yaml.template ufo\config\config.yaml

Open the file and set your API credentials:

# ufo/config/config.yaml

# OpenAI API configuration
OPENAI_API_TYPE: "openai"
OPENAI_API_KEY: "sk-proj-your-api-key-here"
OPENAI_API_BASE: "https://api.openai.com/v1"
OPENAI_API_VERSION: "2024-02-15-preview"  # Used with Azure; ignored by the public OpenAI API

# Model selection
HOST_AGENT:
  API_MODEL: "gpt-4o"

APP_AGENT:
  API_MODEL: "gpt-4o"

# Screenshot settings
SCREENSHOT_BACKEND: "uia"  # Options: uia, win32
ANNOTATION_COLORS:
  - "#FF0000"
  - "#00FF00"
  - "#0000FF"

The configuration separates model settings for the HostAgent and AppAgent. You can use different models for each — for example, a cheaper model for host-level routing and a more capable model for in-app actions.
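Before running anything, it is worth sanity-checking the structure. The helper below is ours (UFO ships no such function); the key names mirror the YAML keys shown above. Load `ufo/config/config.yaml` with PyYAML (already in the requirements) and pass the resulting dict:

```python
# Required top-level keys, mirroring the config.yaml shown above.
REQUIRED_KEYS = ("OPENAI_API_TYPE", "OPENAI_API_KEY", "OPENAI_API_BASE")

def config_errors(cfg: dict) -> list:
    """Return a list of problems with a parsed config dict; empty means OK."""
    errors = [k for k in REQUIRED_KEYS if not cfg.get(k)]
    if cfg.get("OPENAI_API_TYPE") not in ("openai", "azure"):
        errors.append("OPENAI_API_TYPE must be 'openai' or 'azure'")
    for agent in ("HOST_AGENT", "APP_AGENT"):
        if not cfg.get(agent, {}).get("API_MODEL"):
            errors.append(agent + ".API_MODEL is missing")
    return errors
```

An empty return value does not prove the key is valid, only that the file has the expected shape.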

Step 4: Configure Azure OpenAI (Optional)

If your organization uses Azure OpenAI Service instead of the public OpenAI API, update the configuration accordingly:

# Azure OpenAI configuration
OPENAI_API_TYPE: "azure"
OPENAI_API_KEY: "your-azure-api-key"
OPENAI_API_BASE: "https://your-resource.openai.azure.com/"
OPENAI_API_VERSION: "2024-02-15-preview"

HOST_AGENT:
  API_MODEL: "your-gpt4o-deployment-name"

APP_AGENT:
  API_MODEL: "your-gpt4o-deployment-name"

Note that you provide the deployment name, not the model name, when using Azure.

Step 5: Run Your First Task

With everything configured, launch UFO:

python -m ufo --task "Open Notepad and type Hello World"

UFO will:

  1. Launch or find the Notepad application
  2. Capture a screenshot and annotate UI elements
  3. Send the annotated screenshot to the configured vision model (GPT-4o or GPT-4V)
  4. Execute the returned actions (click in the text area, type the text)
  5. Repeat until the task is complete

You will see step-by-step output in the console showing what the agent observes and what actions it takes.
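The five steps above amount to an observe-decide-act loop. The sketch below shows that control flow only; the function names and action format are illustrative, not UFO's real API, and the observe/decide/act callables are injected so the loop itself is testable:

```python
# Simplified sketch of the loop described above (not UFO's actual code).
def run_task(task, observe, decide, act, max_steps=50):
    """Drive the loop until the model says 'finish'; return steps taken."""
    for step in range(1, max_steps + 1):
        screenshot = observe()             # capture + annotate UI elements
        action = decide(task, screenshot)  # vision model picks the next action
        if action.get("type") == "finish":
            return step
        act(action)                        # click / type via UI Automation
    raise RuntimeError("MAX_STEP reached before the task completed")
```

The `max_steps` cap corresponds to the `MAX_STEP` setting covered below: without it, a confused agent could loop indefinitely.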

Understanding the Configuration File

Here is a more complete configuration with explanations:

# ufo/config/config.yaml - Full reference

# API Provider: "openai" or "azure"
OPENAI_API_TYPE: "openai"
OPENAI_API_KEY: "sk-proj-..."
OPENAI_API_BASE: "https://api.openai.com/v1"

# Agent model configuration
HOST_AGENT:
  API_MODEL: "gpt-4o"
  MAX_TOKENS: 2048
  TEMPERATURE: 0.1      # Low temperature for deterministic actions

APP_AGENT:
  API_MODEL: "gpt-4o"
  MAX_TOKENS: 4096       # Higher token limit for complex UI analysis
  TEMPERATURE: 0.1

# Execution settings
MAX_STEP: 50             # Maximum steps before aborting a task
SLEEP_TIME: 2            # Seconds to wait between actions (UI settling)
SAFE_GUARD: true         # Require confirmation before destructive actions

# Screenshot configuration
SCREENSHOT_BACKEND: "uia"
INCLUDE_LAST_SCREENSHOTS: 3   # Number of previous screenshots for context
CONCAT_SCREENSHOTS: false      # Whether to tile screenshots side by side

# Logging
LOG_LEVEL: "INFO"
SAVE_SCREENSHOTS: true         # Save annotated screenshots for debugging
LOG_DIR: "logs/"
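`MAX_STEP` and `SLEEP_TIME` together bound how long a runaway task can take. A rough worst-case estimate, assuming a hypothetical 5 seconds of model latency per step (an assumption, not a measured figure):

```python
# Worst-case wall-clock bound from the execution settings above.
# model_latency is an assumed per-step LLM round-trip time.
def worst_case_seconds(max_step=50, sleep_time=2, model_latency=5.0):
    return max_step * (sleep_time + model_latency)
```

With the defaults shown (`MAX_STEP: 50`, `SLEEP_TIME: 2`), a task that never finishes would run for up to about 350 seconds before UFO aborts it.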

Step 6: Verify With a Multi-Step Task

Test a more complex workflow to confirm everything works end to end:

python -m ufo --task "Open File Explorer, navigate to Documents, and create a new folder called TestUFO"

Watch the console output as the HostAgent identifies File Explorer as the target application, the AppAgent navigates the folder tree, and the folder creation sequence executes.

Environment Variables as an Alternative

Instead of editing the YAML file directly, you can set configuration values via environment variables. This is useful for CI/CD or containerized setups:

set OPENAI_API_KEY=sk-proj-your-key
set UFO_HOST_MODEL=gpt-4o
set UFO_APP_MODEL=gpt-4o
set UFO_MAX_STEP=30

python -m ufo --task "Your task here"
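One plausible precedence scheme is "environment variable wins over YAML". UFO's actual resolution order may differ, so treat this helper as a sketch of the pattern rather than its implementation:

```python
import os

# Hypothetical helper: env var overrides the YAML value when set.
def effective_setting(env_name, yaml_value, cast=str):
    raw = os.environ.get(env_name)
    return cast(raw) if raw is not None else yaml_value

# Example: fall back to the YAML MAX_STEP of 50 unless UFO_MAX_STEP is set.
max_step = effective_setting("UFO_MAX_STEP", 50, int)
```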

Troubleshooting Common Setup Issues

"No module named pywinauto": Make sure you activated the virtual environment before running pip install. Run .venv\Scripts\activate again and reinstall.

"Access denied" on screenshot capture: Run your terminal as Administrator. UFO needs elevated permissions to capture screenshots of some applications.

"Model not found" errors: Verify your API key has access to the vision model specified in config. Try gpt-4o as a fallback.

Slow execution: Increase SLEEP_TIME if actions are executing before the UI finishes rendering. Windows animations can cause the agent to see transitional states.

FAQ

Can I use UFO without an OpenAI API key?

UFO requires a vision-capable LLM to interpret screenshots. You can use Azure OpenAI as an alternative, or configure a local model endpoint that supports the OpenAI vision API format, but some form of multimodal model access is required.

Does UFO support multiple monitors?

UFO captures the screen where the target application window is located. Multi-monitor setups work as long as the target application is fully visible on one screen. A window spanning two monitors may produce partial screenshots.

How much does it cost to run UFO tasks?

Each step involves sending an annotated screenshot (roughly 1000-2000 tokens for the image) plus prompt tokens to GPT-4o. A simple 5-step task costs approximately $0.05-0.15 USD. Complex multi-application tasks with 30+ steps can cost $0.50-1.00 USD.
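These figures can be turned into a back-of-envelope cost model. The token counts below come from the estimate above; the prices assume roughly GPT-4o-class rates ($2.50 per million input tokens, $10 per million output tokens), so check current pricing before budgeting:

```python
# Back-of-envelope cost model; token counts and prices are estimates.
def estimate_task_cost_usd(steps, image_tokens=1500, prompt_tokens=1000,
                           output_tokens=300,
                           input_price=2.50 / 1_000_000,
                           output_price=10.00 / 1_000_000):
    input_cost = steps * (image_tokens + prompt_tokens) * input_price
    output_cost = steps * output_tokens * output_price
    return input_cost + output_cost
```

With these assumptions, a 5-step task lands near the low end of the $0.05-0.15 range quoted above, and cost scales roughly linearly with step count.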


#MicrosoftUFO #WindowsSetup #AIAgent #DesktopAutomation #GPT4Vision #PythonAutomation #UIAutomation

Written by

CallSphere Team
