
Scaling PMCR-O: From Prototype to Enterprise Deployment

Your PMCR-O prototype works. Now it needs to handle 10,000 concurrent requests, process millions of cognitive trails, and maintain sub-second latency. This guide shows you how to scale PMCR-O from prototype to enterprise production.

Enterprise Scale Targets

  • 10,000+ concurrent requests
  • <500ms P95 latency
  • 99.9% uptime SLA
  • 1M+ cognitive trails per day

1. Horizontal Scaling Architecture

Stateless Agent Services

PMCR-O agents must be stateless to scale horizontally. All state lives in external systems (PostgreSQL, Redis, Knowledge Vault):

C#
// ✅ GOOD: Stateless service (can scale horizontally)
public class PlannerAgentService : AgentService.AgentServiceBase
{
    private readonly IChatClient _chatClient;
    private readonly IHttpClientFactory _httpClientFactory; // ✅ Stateless HTTP client

    // No in-memory state - all state in external systems
    public override async Task<AgentResponse> ExecuteTask(
        AgentRequest request,
        ServerCallContext context)
    {
        // All state comes from request or external systems
        var knowledge = await FetchKnowledgeFromVault(request.Intent);
        var response = await ProcessWithKnowledge(request, knowledge);
        return response;
    }
}

// ❌ BAD: Stateful service (replicas can't share this state)
public class StatefulPlannerAgentService : AgentService.AgentServiceBase
{
    private readonly Dictionary<string, string> _cache = new(); // ❌ In-memory state: every replica sees a different cache
    // ...
}

Load Balancing gRPC Services

Because gRPC multiplexes calls over long-lived HTTP/2 connections, a plain connection-level (L4) balancer can pin all traffic to a single pod. Use a gRPC-aware (L7) load balancer such as Envoy, NGINX, or a cloud L7 load balancer:

YAML
# Kubernetes Service with load balancing
apiVersion: v1
kind: Service
metadata:
  name: planner-service
spec:
  type: LoadBalancer
  ports:
  - port: 50051
    targetPort: 50051
    protocol: TCP
  selector:
    app: planner-agent
---
# Deployment with multiple replicas
apiVersion: apps/v1
kind: Deployment
metadata:
  name: planner-agent
spec:
  replicas: 5  # ✅ Scale horizontally
  selector:
    matchLabels:
      app: planner-agent
  template:
    metadata:
      labels:
        app: planner-agent
    spec:
      containers:
      - name: planner
        image: your-registry/pmcro-planner:latest
        ports:
        - containerPort: 50051
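
A complementary option to an L7 proxy is client-side load balancing in Grpc.Net.Client. The sketch below assumes `planner-service` is exposed as a headless Service so DNS resolves to every pod IP; the round-robin policy then spreads calls across them:

```csharp
// Client-side round-robin over DNS (Grpc.Net.Client).
// Assumes "planner-service" is a headless Service so DNS returns all pod IPs.
using Grpc.Core;
using Grpc.Net.Client;
using Grpc.Net.Client.Configuration;

var channel = GrpcChannel.ForAddress("dns:///planner-service:50051", new GrpcChannelOptions
{
    Credentials = ChannelCredentials.Insecure, // in-cluster traffic; terminate TLS at the mesh/ingress
    ServiceConfig = new ServiceConfig
    {
        LoadBalancingConfigs = { new RoundRobinConfig() } // round-robin across resolved endpoints
    }
});

var client = new AgentService.AgentServiceClient(channel);
```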

2. Database Scaling

PostgreSQL Read Replicas

Scale knowledge vault reads with read replicas:

C#
// Configure read/write splitting
builder.Services.AddDbContext<KnowledgeDbContext>(options =>
{
    // Write to primary
    options.UseNpgsql(primaryConnectionString, npgsqlOptions =>
    {
        npgsqlOptions.UseVector();
    });
});

// Read from replicas
builder.Services.AddDbContext<KnowledgeReadDbContext>(options =>
{
    options.UseNpgsql(replicaConnectionString, npgsqlOptions =>
    {
        npgsqlOptions.UseVector();
    });
});

// Use read context for queries
public class KnowledgeVaultService
{
    private readonly KnowledgeDbContext _writeDb;
    private readonly KnowledgeReadDbContext _readDb;
    private readonly IEmbeddingGenerator _embedder; // app-defined abstraction over the embedding model

    public KnowledgeVaultService(
        KnowledgeDbContext writeDb,
        KnowledgeReadDbContext readDb,
        IEmbeddingGenerator embedder)
    {
        _writeDb = writeDb;
        _readDb = readDb;
        _embedder = embedder;
    }

    public async Task<List<KnowledgeItem>> SearchAsync(string query)
    {
        var queryEmbedding = await _embedder.GenerateAsync(query);

        // ✅ Read from replica (pgvector EF operator: cosine distance < 0.3 ≈ similarity > 0.7)
        return await _readDb.KnowledgeEntries
            .Where(k => k.Embedding.CosineDistance(queryEmbedding) < 0.3)
            .ToListAsync();
    }

    public async Task StoreAsync(KnowledgeItem item)
    {
        // ✅ Write to primary
        _writeDb.KnowledgeEntries.Add(item);
        await _writeDb.SaveChangesAsync();
    }
}

Connection Pooling

Configure Npgsql connection pooling for high concurrency:

C#
// Configure connection string with pooling
var connectionString = new NpgsqlConnectionStringBuilder
{
    Host = "postgres-primary",
    Database = "knowledge",
    Username = "pmcro",
    Password = password,
    MaxPoolSize = 100, // ✅ Increase pool size
    MinPoolSize = 10,
    ConnectionIdleLifetime = 300, // 5 minutes
    ConnectionPruningInterval = 10 // Prune every 10 seconds
}.ToString();

builder.Services.AddDbContext<KnowledgeDbContext>(options =>
    options.UseNpgsql(connectionString));
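
Npgsql pooling reuses connections; EF Core's context pooling does the same for `DbContext` instances, which matters at this concurrency. A sketch using the same connection string:

```csharp
// EF Core context pooling: DbContext instances are reset and reused rather
// than allocated per request, complementing Npgsql's connection pool.
builder.Services.AddDbContextPool<KnowledgeDbContext>(
    options => options.UseNpgsql(connectionString),
    poolSize: 128); // upper bound on retained DbContext instances
```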

3. Caching Strategy

Redis for Agent Response Caching

Cache frequently accessed plans and artifacts:

C#
// Add Redis caching
builder.Services.AddStackExchangeRedisCache(options =>
{
    options.Configuration = builder.Configuration.GetConnectionString("redis");
});

// Cache agent responses
public class CachedPlannerService
{
    private readonly IDistributedCache _cache;
    private readonly PlannerAgentService _planner;

    public CachedPlannerService(IDistributedCache cache, PlannerAgentService planner)
    {
        _cache = cache;
        _planner = planner;
    }

    public async Task<AgentResponse> ExecuteTaskAsync(AgentRequest request, ServerCallContext context)
    {
        // Generate cache key from intent
        var cacheKey = $"planner:{HashIntent(request.Intent)}";

        // Try cache first
        var cached = await _cache.GetStringAsync(cacheKey);
        if (cached != null)
        {
            return JsonSerializer.Deserialize<AgentResponse>(cached)!;
        }

        // Generate response
        var response = await _planner.ExecuteTask(request, context);

        // Cache for 1 hour
        await _cache.SetStringAsync(
            cacheKey,
            JsonSerializer.Serialize(response),
            new DistributedCacheEntryOptions
            {
                AbsoluteExpirationRelativeToNow = TimeSpan.FromHours(1)
            });

        return response;
    }
}
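
The HashIntent helper above is left undefined; one reasonable implementation (an assumption, not part of the original service) is a stable SHA-256 digest of the normalized intent, so equivalent intents share a cache key:

```csharp
using System.Security.Cryptography;
using System.Text;

// Stable SHA-256 digest of the normalized intent, so equivalent intents
// (modulo case and surrounding whitespace) map to the same cache key.
static string HashIntent(string intent)
{
    var normalized = intent.Trim().ToLowerInvariant();
    var bytes = SHA256.HashData(Encoding.UTF8.GetBytes(normalized));
    return Convert.ToHexString(bytes); // 64 hex characters
}
```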

4. Message Queue for Async Processing

For long-running agent tasks, use message queues:

C#
// Register the Azure Service Bus client (Microsoft.Extensions.Azure); RabbitMQ etc. work similarly
builder.Services.AddAzureClients(azure =>
    azure.AddServiceBusClient(builder.Configuration.GetConnectionString("ServiceBus")));

// Queue agent tasks
public class OrchestrationApiController : ControllerBase
{
    private readonly ServiceBusClient _serviceBus;
    private readonly ILogger<OrchestrationApiController> _logger;

    public OrchestrationApiController(ServiceBusClient serviceBus, ILogger<OrchestrationApiController> logger)
    {
        _serviceBus = serviceBus;
        _logger = logger;
    }

    [HttpPost("execute-async")]
    public async Task<IActionResult> ExecuteAsync([FromBody] AgentRequest request)
    {
        // Queue the task instead of processing synchronously; use the task id
        // as the message id so the worker and caller can correlate results
        var taskId = Guid.NewGuid();
        var sender = _serviceBus.CreateSender("pmcro-tasks");
        await sender.SendMessageAsync(new ServiceBusMessage(JsonSerializer.Serialize(request))
        {
            MessageId = taskId.ToString()
        });

        // Return immediately
        return Accepted(new { TaskId = taskId, Status = "Queued" });
    }
}

// Background worker processes queue
public class AgentTaskProcessor : BackgroundService
{
    private readonly ServiceBusClient _serviceBus;

    public AgentTaskProcessor(ServiceBusClient serviceBus) => _serviceBus = serviceBus;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        var receiver = _serviceBus.CreateReceiver("pmcro-tasks");

        while (!stoppingToken.IsCancellationRequested)
        {
            var message = await receiver.ReceiveMessageAsync(cancellationToken: stoppingToken);
            if (message != null)
            {
                var request = JsonSerializer.Deserialize<AgentRequest>(message.Body.ToString())!;
                await ProcessAgentTask(request);
                await receiver.CompleteMessageAsync(message); // remove from queue only after success
            }
        }
    }
}

5. Kubernetes Deployment

Deployment Configuration

YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: planner-agent
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 1
  selector:
    matchLabels:
      app: planner-agent
  template:
    metadata:
      labels:
        app: planner-agent
    spec:
      containers:
      - name: planner
        image: your-registry/pmcro-planner:latest
        ports:
        - containerPort: 50051
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "2000m"
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Production"
        - name: ConnectionStrings__ollama
          valueFrom:
            secretKeyRef:
              name: pmcro-secrets
              key: ollama-connection
        livenessProbe:
          exec:
            command: ["grpc_health_probe", "-addr=:50051"]
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          exec:
            command: ["grpc_health_probe", "-addr=:50051"]
          initialDelaySeconds: 10
          periodSeconds: 5

Horizontal Pod Autoscaling

YAML
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: planner-agent-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: planner-agent
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
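
With scale-down this aggressive, a PodDisruptionBudget keeps a floor of ready pods during voluntary disruptions (node drains, cluster upgrades). The names below match the deployment above:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: planner-agent-pdb
spec:
  minAvailable: 2   # never drop below 2 ready pods during voluntary disruptions
  selector:
    matchLabels:
      app: planner-agent
```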

6. Performance Optimization

Batch Processing

Process multiple intents in batches:

C#
// Process multiple intents in parallel
public async Task<List<AgentResponse>> ExecuteBatchAsync(
    List<AgentRequest> requests,
    ServerCallContext context)
{
    // Process in parallel (with concurrency limit)
    using var semaphore = new SemaphoreSlim(10); // Max 10 concurrent
    var tasks = requests.Select(async request =>
    {
        await semaphore.WaitAsync();
        try
        {
            return await ExecuteTask(request, context);
        }
        finally
        {
            semaphore.Release();
        }
    }).ToList(); // materialize so every task starts exactly once

    return (await Task.WhenAll(tasks)).ToList();
}

Async I/O Everywhere

Never block on I/O operations:

C#
// ✅ GOOD: Fully async
public async Task<AgentResponse> ExecuteTaskAsync(AgentRequest request)
{
    var knowledge = await _knowledgeVault.SearchAsync(request.Intent);
    var plan = await _planner.GeneratePlanAsync(request.Intent, knowledge);
    var artifact = await _maker.CreateArtifactAsync(plan);
    return artifact;
}

// ❌ BAD: Blocking I/O
public AgentResponse ExecuteTask(AgentRequest request)
{
    var knowledge = _knowledgeVault.SearchAsync(request.Intent).Result; // Blocks!
    // ...
}

7. Monitoring & Observability

OpenTelemetry Metrics

Track key metrics for scaling decisions:

C#
// Track custom metrics
private static readonly Meter Meter = new("PMCR-O.Agents");
private static readonly Counter<long> RequestsProcessed = Meter.CreateCounter<long>(
    "pmcro.requests.processed",
    "requests",
    "Total number of agent requests processed");

private static readonly Histogram<double> RequestLatency = Meter.CreateHistogram<double>(
    "pmcro.request.latency",
    "ms",
    "Agent request processing latency");

public override async Task<AgentResponse> ExecuteTask(
    AgentRequest request,
    ServerCallContext context)
{
    var stopwatch = Stopwatch.StartNew();
    
    try
    {
        var response = await ProcessRequest(request);
        
        RequestsProcessed.Add(1, new("agent", "planner"), new("status", "success"));
        RequestLatency.Record(stopwatch.ElapsedMilliseconds);
        
        return response;
    }
    catch (Exception)
    {
        RequestsProcessed.Add(1, new("agent", "planner"), new("status", "error"));
        throw;
    }
}
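
These instruments only reach a backend if the meter is registered with the OpenTelemetry SDK. A minimal registration sketch — the collector endpoint name is an assumption, point it at your own OTLP receiver:

```csharp
// Register the custom meter with the OpenTelemetry SDK so the instruments
// above are actually collected and exported.
using OpenTelemetry.Metrics;

builder.Services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        .AddMeter("PMCR-O.Agents")      // must match the Meter name in the agent service
        .AddAspNetCoreInstrumentation() // built-in HTTP/gRPC server metrics
        .AddOtlpExporter(o => o.Endpoint = new Uri("http://otel-collector:4317")));
```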

8. Cost Optimization

Ollama Model Selection

Use smaller models for simple tasks, larger models for complex reasoning:

C#
// Route to appropriate model based on complexity
public class ModelRouter
{
    public string SelectModel(string intent, int estimatedComplexity)
    {
        return estimatedComplexity switch
        {
            < 3 => "phi3", // ✅ Small, fast model for simple tasks
            < 7 => "qwen2.5-coder:7b", // Medium model
            _ => "llama3.2-finetuned" // ✅ Large model for complex tasks
        };
    }
}
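
SelectModel takes estimatedComplexity as a given; in practice a cheap surface heuristic is often enough to route before involving any model. A hypothetical estimator (the features and weights are illustrative, not part of the framework):

```csharp
// Hypothetical heuristic for the estimatedComplexity input to SelectModel.
// Tune the features and weights against real traffic before relying on it.
public static class ComplexityEstimator
{
    public static int Estimate(string intent)
    {
        var score = 0;
        if (intent.Length > 200) score += 2;    // long prompts tend to be harder
        if (intent.Length > 1000) score += 2;
        if (intent.Contains("refactor") || intent.Contains("architecture")) score += 3;
        if (intent.Contains("step") || intent.Contains("then")) score += 2; // multi-step tasks
        if (intent.Contains("```")) score += 1; // embedded code
        return Math.Min(score, 10);
    }
}
```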

9. Production Deployment Checklist

✅ Enterprise Scaling Checklist

  • ✅ All services are stateless
  • ✅ Load balancer configured for gRPC
  • ✅ Database read replicas configured
  • ✅ Connection pooling optimized
  • ✅ Redis caching implemented
  • ✅ Message queue for async processing
  • ✅ Kubernetes HPA configured
  • ✅ Resource limits set (CPU/memory)
  • ✅ Health checks configured
  • ✅ OpenTelemetry metrics enabled
  • ✅ Logging centralized
  • ✅ Cost optimization (model routing)

.NET 11 Scaling Enhancements (2026)

.NET 11 (preview as of 2026) introduces significant improvements for PMCR-O enterprise scaling:

Enhanced AI Orchestration

  • Native AI Workflow Support: Built-in support for Microsoft Agents AI Workflows, reducing boilerplate for PMCR-O agent orchestration
  • Improved gRPC Performance: 15-20% faster gRPC serialization/deserialization for agent-to-agent communication
  • Better Async I/O: Enhanced async/await performance for high-concurrency agent workloads
  • Native AOT for Agents: Ahead-of-time compilation for agent services, reducing memory footprint by 30-40%
C#
// .NET 11: Enhanced AI Workflow Support
using Microsoft.Agents.AI.Workflows;

// Native workflow orchestration for PMCR-O
var workflow = new AgentWorkflowBuilder()
    .AddPlannerAgent(plannerConfig)
    .AddMakerAgent(makerConfig)
    .AddCheckerAgent(checkerConfig)
    .AddReflectorAgent(reflectorConfig)
    .WithRetryPolicy(maxRetries: 3)
    .WithCircuitBreaker(failureThreshold: 5)
    .Build();

// Execute with automatic orchestration
var result = await workflow.ExecuteAsync(intent);

Federated Learning Support

.NET 11 includes experimental support for federated learning patterns, enabling PMCR-O agents to learn from distributed data sources while maintaining privacy:

  • Distributed Agent Training: Agents can learn from multiple nodes without centralizing data
  • Privacy-Preserving Aggregation: Secure aggregation of agent insights across organizations
  • Edge AI Optimization: Better support for edge device deployments with resource constraints
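
Framework support aside, the aggregation step at the heart of these patterns can be sketched directly: federated averaging combines per-node parameter updates weighted by local sample counts, so raw data never leaves a node. A minimal illustration:

```csharp
using System.Linq;

// Federated averaging: combine per-node parameter vectors weighted by local
// sample counts. Only aggregates cross node boundaries — raw data stays local.
public static double[] FederatedAverage(
    IReadOnlyList<(double[] Weights, int Samples)> nodes)
{
    var dim = nodes[0].Weights.Length;
    var totalSamples = nodes.Sum(n => (double)n.Samples);
    var aggregate = new double[dim];

    foreach (var (weights, samples) in nodes)
        for (var i = 0; i < dim; i++)
            aggregate[i] += weights[i] * samples / totalSamples; // weight by data volume

    return aggregate;
}
```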

Enterprise Cost Models & ROI

Understanding the cost structure of PMCR-O enterprise deployments is critical for budget planning. Here's a breakdown:

Monthly Cost Breakdown (Example: 10M requests/month)

Component                       | Configuration                                                  | Monthly Cost | Notes
Kubernetes Cluster              | 3-node cluster (8 vCPU, 32GB RAM each)                         | $1,200       | AWS EKS / Azure AKS
PostgreSQL (Primary + Replicas) | Primary: 16 vCPU, 64GB RAM; 2 read replicas: 8 vCPU, 32GB RAM  | $800         | Managed service (RDS / Azure DB)
Redis Cache                     | Cluster mode, 16GB memory                                      | $300         | ElastiCache / Azure Cache
Ollama Infrastructure           | GPU instances (A100 / H100)                                    | $2,500       | Model inference costs
Message Queue (RabbitMQ/Kafka)  | 3-node cluster                                                 | $400         | Managed service
Monitoring & Logging            | OpenTelemetry + Grafana Cloud                                  | $200         | Observability stack
Load Balancer                   | Application Load Balancer                                      | $150         | Traffic distribution
Storage (S3/Azure Blob)         | 1TB cognitive trails                                           | $50          | Long-term storage
Network Egress                  | 10TB/month                                                     | $100         | Data transfer
Total Monthly Cost              |                                                                | $5,700       | ~$0.00057 per request

ROI Calculation

For a typical enterprise deployment processing 10M requests/month:

Metric                        | Value       | Calculation
Monthly Infrastructure Cost   | $5,700      | As above
Cost per Request              | $0.00057    | $5,700 / 10M requests
Time Saved per Automated Task | 2.5 minutes | Average automation benefit
Labor Rate                    | $50/hour    | Average developer/analyst rate
Monthly Labor Savings         | $208,333    | 100,000 tasks automated (~1% of requests) × 2.5 min × $50/hr ÷ 60
Net Monthly Savings           | $202,633    | $208,333 − $5,700
Annual ROI                    | ~3,555%     | ($202,633 × 12) / ($5,700 × 12)
Payback Period                | < 1 month   | Infrastructure cost / monthly savings
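
The unit economics in the table reduce to a few divisions, which is worth sanity-checking in code (figures taken from the table above):

```csharp
// Sanity check of the table's unit economics.
const double monthlyCost = 5_700;
const double monthlyRequests = 10_000_000;
const double monthlyLaborSavings = 208_333;

double costPerRequest = monthlyCost / monthlyRequests;        // ≈ $0.00057
double netMonthlySavings = monthlyLaborSavings - monthlyCost; // $202,633
double paybackMonths = monthlyCost / netMonthlySavings;       // ≈ 0.03 — well under a month
```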

✅ Cost Optimization Tips

  • Model Routing: Use smaller models (phi3) for simple tasks, saving 60-70% on inference costs
  • Reserved Instances: Commit to 1-3 year terms for 30-40% discount on compute
  • Spot Instances: Use spot instances for non-critical workloads (70-90% savings)
  • Auto-Scaling: Scale down during off-peak hours to reduce idle costs
  • Edge Deployment: Deploy agents closer to users to reduce network egress costs

Conclusion

Scaling PMCR-O to enterprise requires:

  1. Stateless services for horizontal scaling
  2. Load balancing for request distribution
  3. Database scaling (read replicas, connection pooling)
  4. Caching to reduce load
  5. Message queues for async processing
  6. Kubernetes for orchestration
  7. Monitoring for data-driven scaling decisions

Follow these patterns, and your PMCR-O system will scale from prototype to enterprise production.



About Shawn Delaine Bellazan

Resilient Architect & PMCR-O Framework Creator

Shawn is the creator of the PMCR-O framework, a self-referential AI architecture that embodies the strange loop it describes. With 15+ years in enterprise software development, Shawn specializes in building resilient systems at the intersection of philosophy and technology. His work focuses on autonomous AI agents that evolve through vulnerability and expression.