Ollama is an open-source local LLM server that powers PMCR-O agents. With GPU acceleration, inference time drops from 30-60 seconds (CPU) to 2-5 seconds (GPU), making real-time agent interactions feasible.
This guide shows you how to configure Ollama with NVIDIA GPU support in .NET Aspire for PMCR-O projects.
Prerequisites
- NVIDIA GPU with 8GB+ VRAM (RTX 3060 or better)
- NVIDIA Drivers installed (version 535+ recommended)
- Docker Desktop with WSL2 backend (Windows) or native Docker (Linux)
- .NET Aspire project set up (see Getting Started with .NET Aspire)
Step 1: Verify GPU Availability
Check that Docker can access your GPU:
```bash
# Check NVIDIA driver
nvidia-smi

# Check Docker GPU support
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
```
Step 2: Configure Ollama in Aspire AppHost
Update your PmcroAgents.AppHost/Program.cs to add Ollama with GPU support:
```csharp
using CommunityToolkit.Aspire.Hosting.Ollama;

var builder = DistributedApplication.CreateBuilder(args);

// ==============================================================================
// OLLAMA - LOCAL LLM SERVER WITH GPU
// ==============================================================================
var ollama = builder.AddOllama("ollama", port: 11434)
    .WithDataVolume()                            // Persist models between runs
    .WithLifetime(ContainerLifetime.Persistent)  // Keep container running
    .WithContainerRuntimeArgs("--gpus=all");     // Enable NVIDIA GPU passthrough

// ==============================================================================
// LLM MODELS - THE COUNCIL
// ==============================================================================
var qwen = ollama.AddModel("qwen2.5-coder:7b");  // The Planner (7.4GB download)

builder.Build().Run();
```
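Agent services that consume Ollama take a reference on the resource so Aspire injects the connection string at startup. A minimal sketch in the same AppHost; the `Projects.PlannerService` project name is a placeholder for your own service:

```csharp
// Hypothetical agent service project -- substitute your own.
// WithReference injects ConnectionStrings__ollama into the service's
// configuration; WaitFor delays startup until Ollama reports healthy.
var planner = builder.AddProject<Projects.PlannerService>("planner")
    .WithReference(ollama)
    .WaitFor(ollama);
```

The service then reads the endpoint via `GetConnectionString("ollama")`, as shown in Step 7.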
Step 3: Understanding GPU Passthrough
The --gpus=all flag gives the Ollama container access to your NVIDIA GPU. This enables:
- CUDA acceleration: Matrix operations run on GPU
- Faster inference: 2-5s per response vs 30-60s on CPU
- Batch processing: Handle multiple agent requests simultaneously
CPU Fallback: If the GPU is unavailable, Ollama automatically falls back to CPU. Performance degrades, but functionality remains.
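Because of that silent fallback, it is worth confirming at runtime that a model actually landed in VRAM. Ollama's `/api/ps` endpoint (the data behind the `ollama ps` CLI command) reports a `size_vram` field per loaded model. A minimal sketch, assuming Ollama is reachable on its default port and has at least one model loaded:

```csharp
using System;
using System.Net.Http;
using System.Text.Json;

// Query Ollama's running-models endpoint; size_vram > 0 means the model's
// weights are resident in GPU memory rather than system RAM.
using var http = new HttpClient { BaseAddress = new Uri("http://localhost:11434") };
var json = await http.GetStringAsync("/api/ps");
using var doc = JsonDocument.Parse(json);
foreach (var model in doc.RootElement.GetProperty("models").EnumerateArray())
{
    var name = model.GetProperty("name").GetString();
    var vram = model.GetProperty("size_vram").GetInt64();
    Console.WriteLine($"{name}: {(vram > 0 ? "GPU" : "CPU")} ({vram / 1e9:F1} GB VRAM)");
}
```

If `size_vram` is 0 for a model you expected on the GPU, revisit the `--gpus=all` passthrough before debugging anything else.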
Step 4: Model Selection
PMCR-O recommends these models for different agent roles:
qwen2.5-coder:7b (Recommended for Code Agents)
- Size: 7.4GB
- VRAM Required: 8GB+
- Best For: Planner, Maker, Checker agents
- Why: Strong tool-calling compliance, structured output support
phi3 (Lightweight Creative Agent)
- Size: 3.8GB
- VRAM Required: 4GB+
- Best For: Reflector, creative tasks
- Why: Fast inference, good for brainstorming
nomic-embed-text (Embeddings for RAG)
- Size: 274MB
- VRAM Required: 1GB+
- Best For: Knowledge Service, vector search
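The role split above maps directly onto the AppHost from Step 2: each council member is one `AddModel` call on the same `ollama` resource (the variable names here are illustrative):

```csharp
// Register the remaining council members alongside qwen from Step 2.
var phi   = ollama.AddModel("phi3");             // The Reflector (3.8GB)
var embed = ollama.AddModel("nomic-embed-text"); // Embeddings for RAG (274MB)
```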
Step 5: First Run and Model Download
When you run the AppHost for the first time, Ollama will automatically download the specified models:
```text
[Ollama] Pulling model qwen2.5-coder:7b...
[Ollama] Downloading 7.4GB (this may take 10-15 minutes on first run)
[Ollama] Model ready. GPU acceleration enabled.
```
Models are stored in a Docker volume and persist between runs. You only download once.
Step 6: Verify GPU Usage
After starting Ollama, verify GPU is being used:
```bash
# Check GPU utilization -- you should see an "ollama" process
# listed with GPU memory usage
nvidia-smi
```
Step 7: Connect Your Agent Service
Your agent services connect to Ollama via the connection string injected by Aspire:
```csharp
// In your agent service (e.g., PlannerService/Program.cs)
using Microsoft.Extensions.AI;
using OllamaSharp;

var ollamaUri = builder.Configuration.GetConnectionString("ollama")
    ?? "http://localhost:11434";
var modelId = "qwen2.5-coder:7b";

builder.Services.AddHttpClient("ollama", client =>
{
    client.BaseAddress = new Uri(ollamaUri);
    client.Timeout = Timeout.InfiniteTimeSpan; // LLM inference can take time
})
.AddStandardResilienceHandler(options =>
{
    options.AttemptTimeout.Timeout = TimeSpan.FromMinutes(3);
    options.TotalRequestTimeout.Timeout = TimeSpan.FromMinutes(5);
    // Options validation requires the circuit breaker's sampling duration
    // to be at least 2x the attempt timeout, or startup throws.
    options.CircuitBreaker.SamplingDuration = TimeSpan.FromMinutes(6);
});

// Register IChatClient
builder.Services.AddSingleton<IChatClient>(sp =>
{
    var httpClient = sp.GetRequiredService<IHttpClientFactory>().CreateClient("ollama");
    var baseClient = new OllamaApiClient(httpClient, modelId);
    return new ChatClientBuilder(baseClient)
        .UseFunctionInvocation() // Enables tool calling
        .Build();
});
```
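Once registered, the chat client can be resolved anywhere via DI. A minimal usage sketch, assuming a minimal-API app; note that older Microsoft.Extensions.AI previews named the method `CompleteAsync` rather than `GetResponseAsync`:

```csharp
// Hypothetical endpoint: resolve the registered IChatClient and ask the
// model a question. response.Text holds the assistant's reply.
app.MapGet("/plan", async (IChatClient chat) =>
{
    var response = await chat.GetResponseAsync(
        "Break the task 'add GPU support docs' into three steps.");
    return response.Text;
});
```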
Performance Optimization
1. Model Quantization
For lower VRAM usage, use quantized models:
- qwen2.5-coder:7b-q4_0 - 4-bit quantization (4GB VRAM)
- qwen2.5-coder:7b-q8_0 - 8-bit quantization (6GB VRAM)
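These VRAM figures follow from a back-of-the-envelope rule: weight memory is roughly parameter count times bits per weight. A sketch; the 7.6B parameter count for qwen2.5-coder:7b is approximate, and the estimate ignores KV-cache and per-block quantization overhead, so real usage runs somewhat higher:

```csharp
using System;

// Rough weight-memory estimate in GB: parameters * bits-per-weight / 8.
static double EstimateWeightGb(double parameters, int bitsPerWeight) =>
    parameters * bitsPerWeight / 8 / 1e9;

const double QwenParams = 7.6e9; // approximate, Qwen2.5-Coder 7B

Console.WriteLine($"q4_0: {EstimateWeightGb(QwenParams, 4):F1} GB"); // 3.8 GB
Console.WriteLine($"q8_0: {EstimateWeightGb(QwenParams, 8):F1} GB"); // 7.6 GB
```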
2. Batch Size Tuning
Ollama limits how many requests each loaded model serves concurrently. For multiple agents hitting the same model, raise the parallel-request limit:
```csharp
var ollama = builder.AddOllama("ollama", port: 11434)
    .WithDataVolume()
    .WithLifetime(ContainerLifetime.Persistent)
    .WithContainerRuntimeArgs("--gpus=all")
    .WithEnvironment("OLLAMA_NUM_GPU", "1")
    .WithEnvironment("OLLAMA_NUM_PARALLEL", "4"); // Process 4 requests concurrently
```
Troubleshooting
GPU Not Detected
Check Docker GPU support:
- Windows: Enable WSL2 backend in Docker Desktop settings
- Linux: Install nvidia-container-toolkit
- Verify: docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
Out of Memory Errors
If you get OOM errors:
- Use a smaller model (phi3 instead of qwen2.5-coder:7b)
- Use quantized models (q4_0 or q8_0)
- Reduce OLLAMA_NUM_PARALLEL to 1
Slow Inference
If inference is still slow:
- Verify GPU is being used: nvidia-smi should show an Ollama process
- Check GPU memory: ensure enough VRAM for the model
- Update NVIDIA drivers to latest version
Next Steps
- Read Introduction to Microsoft Agent Framework
- Read Creating Your First PMCR-O Agent
- Explore the complete article library
Build Your Own Strange Loop
The PMCR-O framework is open. Star the repository. Fork it. Seed your own intent.