Setting Up Ollama with GPU Support

Accelerate local LLM inference from 30-60s to 2-5s with NVIDIA GPU passthrough.

Ollama is the open-source local LLM server that powers PMCR-O agents. With GPU acceleration, inference time drops from 30-60 seconds (CPU) to 2-5 seconds (GPU), making real-time agent interactions feasible.

This guide shows you how to configure Ollama with NVIDIA GPU support in .NET Aspire for PMCR-O projects.

Prerequisites

  • NVIDIA GPU with 8GB+ VRAM (RTX 3060 or better)
  • NVIDIA Drivers installed (version 535+ recommended)
  • Docker Desktop with WSL2 backend (Windows) or native Docker (Linux)
  • .NET Aspire project set up (see Getting Started with .NET Aspire)

Step 1: Verify GPU Availability

Check that Docker can access your GPU:

Bash
# Check NVIDIA driver
nvidia-smi

# Check Docker GPU support
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi

Step 2: Configure Ollama in Aspire AppHost

Update your PmcroAgents.AppHost/Program.cs to add Ollama with GPU support:

C#
using CommunityToolkit.Aspire.Hosting.Ollama;

var builder = DistributedApplication.CreateBuilder(args);

// ==============================================================================
// OLLAMA - LOCAL LLM SERVER WITH GPU
// ==============================================================================

var ollama = builder.AddOllama("ollama", port: 11434)
    .WithDataVolume()                              // Persist models between runs
    .WithLifetime(ContainerLifetime.Persistent)    // Keep container running
    .WithContainerRuntimeArgs("--gpus=all");       // Enable NVIDIA GPU passthrough

// ==============================================================================
// LLM MODELS - THE COUNCIL
// ==============================================================================

var qwen = ollama.AddModel("qwen2.5-coder:7b");    // The Planner (7.4GB download)

builder.Build().Run();

Step 3: Understanding GPU Passthrough

The --gpus=all flag gives the Ollama container access to your NVIDIA GPU. This enables:

  • CUDA acceleration: Matrix operations run on GPU
  • Faster inference: 2-5s per response vs 30-60s on CPU
  • Batch processing: Handle multiple agent requests simultaneously

CPU Fallback: If no GPU is available, Ollama automatically falls back to the CPU. Performance degrades, but functionality remains.
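To confirm which processor a loaded model actually landed on, you can ask Ollama directly. A quick check, assuming the container is named ollama (matching the Aspire resource name above):

```shell
# List loaded models and how they are scheduled.
# The PROCESSOR column reads "100% GPU" when passthrough works,
# or "100% CPU" when Ollama has fallen back to the CPU.
docker exec ollama ollama ps
```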

Step 4: Model Selection

PMCR-O recommends these models for different agent roles:

qwen2.5-coder:7b (Recommended for Code Agents)

  • Size: 7.4GB
  • VRAM Required: 8GB+
  • Best For: Planner, Maker, Checker agents
  • Why: Strong tool-calling compliance, structured output support

phi3 (Lightweight Creative Agent)

  • Size: 3.8GB
  • VRAM Required: 4GB+
  • Best For: Reflector, creative tasks
  • Why: Fast inference, good for brainstorming

nomic-embed-text (Embeddings for RAG)

  • Size: 274MB
  • VRAM Required: 1GB+
  • Best For: Knowledge Service, vector search
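The additional models can be registered in the AppHost with further AddModel calls, or pulled ahead of time from the CLI so the first agent run isn't blocked on multi-gigabyte downloads. A sketch, again assuming the container is named ollama:

```shell
# Pre-pull the recommended models into the shared data volume
docker exec ollama ollama pull qwen2.5-coder:7b
docker exec ollama ollama pull phi3
docker exec ollama ollama pull nomic-embed-text

# Confirm what is installed locally
docker exec ollama ollama list
```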

Step 5: First Run and Model Download

When you run the AppHost for the first time, Ollama will automatically download the specified models:

Text
[Ollama] Pulling model qwen2.5-coder:7b...
[Ollama] Downloading 7.4GB (this may take 10-15 minutes on first run)
[Ollama] Model ready. GPU acceleration enabled.

Models are stored in a Docker volume and persist between runs. You only download once.
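You can verify the persistence yourself by inspecting the Docker volume that WithDataVolume() created. The exact volume name varies by AppHost project, so search for it:

```shell
# Find the volume created by WithDataVolume() (name varies per project)
docker volume ls | grep -i ollama

# Inspect its mountpoint to see where the model blobs live on disk
docker volume inspect <volume-name>
```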

Step 6: Verify GPU Usage

After starting Ollama, verify GPU is being used:

Bash
# Check GPU utilization
nvidia-smi

# You should see Ollama process using GPU memory
# Look for: "ollama" process with GPU memory usage
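An idle Ollama shows almost no GPU activity, so drive one inference request while watching nvidia-smi in a second terminal. A sketch using Ollama's REST API (the model name assumes the qwen2.5-coder:7b setup above):

```shell
# Send a single generate request so the GPU has work to do
curl http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5-coder:7b", "prompt": "Say hello", "stream": false}'

# In another terminal, refresh nvidia-smi every second and
# watch GPU utilization and memory climb during inference
watch -n 1 nvidia-smi
```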

Step 7: Connect Your Agent Service

Your agent services connect to Ollama via the connection string injected by Aspire:

C#
// In your agent service (e.g., PlannerService/Program.cs)
using Microsoft.Extensions.AI;   // IChatClient, ChatClientBuilder
using OllamaSharp;               // OllamaApiClient

var ollamaUri = builder.Configuration.GetConnectionString("ollama")
    ?? "http://localhost:11434";
var modelId = "qwen2.5-coder:7b";

builder.Services.AddHttpClient("ollama", client =>
{
    client.BaseAddress = new Uri(ollamaUri);
    client.Timeout = Timeout.InfiniteTimeSpan;  // LLM inference can take time
})
.AddStandardResilienceHandler(options =>
{
    options.AttemptTimeout.Timeout = TimeSpan.FromMinutes(3);
    options.TotalRequestTimeout.Timeout = TimeSpan.FromMinutes(5);
});

// Register IChatClient
builder.Services.AddSingleton<IChatClient>(sp =>
{
    var httpClient = sp.GetRequiredService<IHttpClientFactory>().CreateClient("ollama");
    var baseClient = new OllamaApiClient(httpClient, modelId);
    
    return new ChatClientBuilder(baseClient)
        .UseFunctionInvocation()  // Enables tool calling
        .Build();
});

Performance Optimization

1. Model Quantization

For lower VRAM usage, use quantized models:

  • qwen2.5-coder:7b-q4_0 - 4-bit quantization (4GB VRAM)
  • qwen2.5-coder:7b-q8_0 - 8-bit quantization (6GB VRAM)
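A useful rule of thumb: weight memory is roughly parameters × bits-per-weight ÷ 8, before KV cache, context window, and runtime overhead are added on top. A quick back-of-envelope calculator:

```shell
# Rough weight memory in GB: billions of params * bits per weight / 8.
# Actual VRAM usage is higher (KV cache, context window, CUDA overhead).
vram_gb() { awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'; }

vram_gb 7 16   # fp16 7B model -> 14.0 GB
vram_gb 7 8    # q8_0          -> 7.0 GB
vram_gb 7 4    # q4_0          -> 3.5 GB
```

This is why 4-bit quantization brings a 7B model within reach of an 8GB card with room to spare for the KV cache.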

2. Batch Size Tuning

By default, Ollama limits how many requests each loaded model serves at once. For multiple concurrent agents, raise the parallel request count:

C#
var ollama = builder.AddOllama("ollama", port: 11434)
    .WithDataVolume()
    .WithLifetime(ContainerLifetime.Persistent)
    .WithContainerRuntimeArgs("--gpus=all")
    .WithEnvironment("OLLAMA_NUM_GPU", "1")
    .WithEnvironment("OLLAMA_NUM_PARALLEL", "4");  // Process 4 requests concurrently

Troubleshooting

GPU Not Detected

Check Docker GPU support:

  • Windows: Enable WSL2 backend in Docker Desktop settings
  • Linux: Install nvidia-container-toolkit
  • Verify: docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
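On apt-based Linux distributions, the toolkit install looks like this (these commands follow NVIDIA's install guide and assume its package repository has already been added):

```shell
# Install the NVIDIA Container Toolkit (apt-based distros; assumes the
# NVIDIA package repository is already configured per NVIDIA's guide)
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Point Docker at the NVIDIA runtime and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```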

Out of Memory Errors

If you get OOM errors:

  • Use a smaller model (phi3 instead of qwen2.5-coder:7b)
  • Use quantized models (q4_0 or q8_0)
  • Reduce OLLAMA_NUM_PARALLEL to 1

Slow Inference

If inference is still slow:

  • Verify the GPU is in use: nvidia-smi should list an ollama process
  • Check GPU memory: ensure the model fits in available VRAM
  • Update NVIDIA drivers to the latest version

Next Steps

Build Your Own Strange Loop

The PMCR-O framework is open. Star the repository. Fork it. Seed your own intent.

View on GitHub →