# Compute and Scaling: Cloud Run & .NET Native AOT

Duuble relies on Google Cloud Run for serverless execution of its .NET 10 LTS API. Meeting its latency targets at social-media scale requires specific compute tuning.

## Performance Profile & Throughput Limits

*   **Concurrency:** 80 requests per instance
*   **Average Processing Time:** 50ms (including database I/O waits)
*   **Theoretical Maximum RPS/Instance:** 1,600 RPS (80 concurrent requests × 1000 ms / 50 ms)
*   **Real-world RPS/Instance:** ~400 RPS (accounting for context switching, GC pauses, and network overhead)
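The theoretical ceiling follows from Little's Law (in-flight requests = throughput × latency, so throughput = concurrency / latency). The minimal Python sketch below reproduces that arithmetic. The function names are illustrative, and the 400 RPS real-world figure is the document's own estimate, which the code only compares against rather than derives.

```python
# Illustrative capacity arithmetic for one Cloud Run instance (hypothetical helpers).
# Little's Law: in-flight requests = throughput * latency, so
# throughput = concurrency / latency.


def theoretical_rps(concurrency: int, avg_latency_ms: float) -> float:
    """Upper bound on requests/second if every concurrency slot is always busy."""
    return concurrency * (1000.0 / avg_latency_ms)


def efficiency(observed_rps: float, ceiling_rps: float) -> float:
    """Fraction of the theoretical ceiling that is actually achieved."""
    return observed_rps / ceiling_rps


if __name__ == "__main__":
    ceiling = theoretical_rps(concurrency=80, avg_latency_ms=50)
    # 400 RPS is the document's empirical estimate, not derived here.
    print(f"theoretical ceiling: {ceiling:.0f} RPS")
    print(f"real-world efficiency: {efficiency(400, ceiling):.0%}")
```

Expressing the real-world figure as a fraction of the ceiling (25% here) makes the size of the discount explicit, which helps when the estimate is re-measured after runtime or database changes.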

```mermaid
graph LR
    Req["Incoming Traffic (1,000,000 RPM)"] -->|16,667 RPS| LB[Cloud Load Balancing]
    
    subgraph Compute Target
      LB --> IR1[Instance 1 <br> Max 400 RPS]
      LB --> IR2[Instance 2 <br> Max 400 RPS]
      LB --> IRN[Instance N <br> Max 400 RPS]
    end
```

## Scaling Configuration

To handle the required peak of 16,667 RPS (1 million RPM):
$$\text{Required Instances} = \left\lceil \frac{16{,}667}{400} \right\rceil = 42 \text{ instances}$$

To absorb traffic bursts, we over-provision by roughly 30% (42 × 1.3 ≈ 55 instances) and round up for additional headroom.
*   **Target Baseline Instances at Peak:** ~60 instances.
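The sizing arithmetic can be checked with a short Python sketch (the helper names are hypothetical). Rounding up with a ceiling avoids under-provisioning by a fraction of an instance. Note that a 30% buffer on 42 instances yields 55, so a ~60-instance target carries a few instances of slack beyond the stated 30%.

```python
import math

PEAK_RPM = 1_000_000
RPS_PER_INSTANCE = 400  # real-world per-instance estimate from the profile above
HEADROOM = 0.30         # burst buffer


def required_instances(target_rps: float, rps_per_instance: float) -> int:
    """Smallest whole number of instances whose combined capacity covers target_rps."""
    return math.ceil(target_rps / rps_per_instance)


def with_headroom(instances: int, headroom: float) -> int:
    """Scale an instance count up by a fractional buffer, rounding up."""
    return math.ceil(instances * (1 + headroom))


peak_rps = PEAK_RPM / 60  # ~16,667 RPS
base = required_instances(peak_rps, RPS_PER_INSTANCE)
buffered = with_headroom(base, HEADROOM)
print(f"peak: {peak_rps:,.0f} RPS -> {base} instances, {buffered} with 30% headroom")
```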

### Auto-Scaling Limits
*   **Minimum Instances (`min-instances`):** `5` for the initial production target. Dev should default lower as documented in [10-terraform-requirements-qa.md](./10-terraform-requirements-qa.md).
    *   Ensures baseline responsiveness for low-volume production periods without paying for full 1M RPM warm capacity.
*   **Maximum Instances (`max-instances`):** `1000`
    *   Acts as a circuit breaker against runaway costs if a massive DDoS attack gets past the WAF.
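The effect of the two limits can be pictured with a deliberately simplified model: the instance count a given load would need, clamped to `[min-instances, max-instances]`. This Python sketch is illustrative only and is not Cloud Run's actual autoscaling algorithm (which reacts to request concurrency and CPU utilization). The helper names are hypothetical, and the 400 RPS figure is the real-world estimate from the performance profile.

```python
import math

MIN_INSTANCES = 5       # production min-instances
MAX_INSTANCES = 1000    # max-instances cost guard
RPS_PER_INSTANCE = 400  # real-world per-instance estimate


def desired_instances(current_rps: float) -> int:
    """Instances needed for current_rps, clamped to the configured bounds."""
    needed = math.ceil(current_rps / RPS_PER_INSTANCE)
    return max(MIN_INSTANCES, min(MAX_INSTANCES, needed))


def throughput_ceiling() -> int:
    """Most RPS the service can serve before max-instances stops further scale-out."""
    return MAX_INSTANCES * RPS_PER_INSTANCE


if __name__ == "__main__":
    for rps in (0, 1_000, 16_667, 1_000_000):
        print(f"{rps:>9,} RPS -> {desired_instances(rps)} instances")
    print(f"ceiling: {throughput_ceiling():,} RPS")
```

Under this model, quiet periods never drop below 5 warm instances, and beyond roughly 400,000 RPS excess load is no longer absorbed by new instances, which is the cost trade-off the cap is meant to make.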

## Startup CPU Boost and .NET Native AOT

Cloud Run scales rapidly under load by creating new container instances (cold starts). For .NET applications, CLR startup and JIT compilation overhead on each fresh instance can cause severe tail latency.

To mitigate this:
1.  **Native AOT Compilation**: The .NET application is compiled Ahead-of-Time directly into machine code. This eliminates the Just-In-Time (JIT) compiler warmup phase entirely.
2.  **Startup CPU Boost**: The Cloud Run `Startup CPU Boost` feature is enabled, which temporarily allocates additional CPU to the container while it starts up. This lets static initialization and database connection-pool warm-up finish considerably faster.
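As a deployment sketch, the Python helper below assembles a `gcloud run deploy` invocation that applies the API service's settings from this page, including `--cpu-boost` for Startup CPU Boost. The service name, image path, and region are hypothetical placeholders. Native AOT itself is a build-time setting of the .NET project (the `PublishAot` MSBuild property), so it does not appear among the Cloud Run flags.

```python
import shlex


def api_deploy_args(service: str, image: str, region: str) -> list[str]:
    """Build a gcloud command line for the primary API service (illustrative)."""
    return [
        "gcloud", "run", "deploy", service,
        f"--image={image}",
        f"--region={region}",
        "--concurrency=80",      # requests per instance
        "--min-instances=5",     # production baseline
        "--max-instances=1000",  # cost guard
        "--cpu-boost",           # Startup CPU Boost during cold starts
    ]


if __name__ == "__main__":
    # Hypothetical names: replace with the real service, image, and region.
    args = api_deploy_args(
        service="duuble-api",
        image="us-docker.pkg.dev/example-project/api/duuble-api:latest",
        region="us-central1",
    )
    print(shlex.join(args))
```

Keeping the command as a list (and quoting it with `shlex.join` only for display) means it can be passed to `subprocess.run` without values being mis-split by a shell.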

## Server-Sent Events (SSE) Tuning

Server-Sent Events (SSE) connections for real-time notifications (`/api/v1/notifications/stream`) behave fundamentally differently from standard REST requests. They are long-lived, open HTTP connections that often sit idle.

If routed to the primary API Cloud Run service, thousands of idle connections would exhaust its concurrency slots (80 per instance), triggering large, unwarranted scale-out events and driving hosting costs up in proportion to the number of open streams rather than the actual work performed.
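A back-of-the-envelope comparison makes the cost of mixing the two workloads concrete. The Python sketch below counts how many instances are needed just to hold idle streams at the API's concurrency of 80 versus the 1,000 used by the dedicated SSE service below. The load of 50,000 concurrently open streams is hypothetical, not measured, and the model ignores any REST traffic sharing the same instances.

```python
import math


def instances_for_streams(open_streams: int, concurrency: int) -> int:
    """Instances required purely to hold open_streams, one concurrency slot each."""
    return math.ceil(open_streams / concurrency)


if __name__ == "__main__":
    open_streams = 50_000  # hypothetical number of concurrently connected clients
    for label, concurrency in (("API service", 80), ("SSE service", 1000)):
        n = instances_for_streams(open_streams, concurrency)
        print(f"{label}: {n} instances at concurrency {concurrency}")
```

Under the same model, 80,000 open streams at concurrency 80 would on their own pin the API service at its 1,000-instance cap, leaving no capacity for REST requests.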

To solve this, Duuble runs a **Dedicated SSE Cloud Run Service** with a completely different scaling profile:
*   **Max Concurrency:** `1000` (the Cloud Run maximum). This lets a single 1 vCPU instance hold 1,000 idle SSE streams open without triggering scale-out.
*   **Request Timeout:** `3600s` (1 hour, the Cloud Run maximum).
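*   **CPU Allocation:** Allocated only during request processing (`--cpu-throttling`). Because an open SSE stream counts as an in-flight request, an instance is billed for as long as at least one stream is open, even while the streams sit idle. The savings come from concurrency: up to 1,000 idle streams share the billable time of a single instance instead of being spread across many.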
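A companion sketch for the dedicated SSE service, again assembling a `gcloud run deploy` command in Python. The service, image, and region names are hypothetical placeholders. `--cpu-throttling` selects the mode in which CPU is allocated only during request processing, and `--timeout` is interpreted in seconds when no unit is given.

```python
import shlex


def sse_deploy_args(service: str, image: str, region: str) -> list[str]:
    """Build a gcloud command line for the dedicated SSE service (illustrative)."""
    return [
        "gcloud", "run", "deploy", service,
        f"--image={image}",
        f"--region={region}",
        "--cpu=1",             # one vCPU per instance
        "--concurrency=1000",  # Cloud Run maximum; one slot per open stream
        "--timeout=3600",      # 1 hour, the maximum request timeout
        "--cpu-throttling",    # CPU allocated only during request processing
    ]


if __name__ == "__main__":
    # Hypothetical names: replace with the real service, image, and region.
    args = sse_deploy_args(
        service="duuble-sse",
        image="us-docker.pkg.dev/example-project/api/duuble-sse:latest",
        region="us-central1",
    )
    print(shlex.join(args))
```

Keeping the SSE settings in their own builder (or their own Terraform resource) prevents the high concurrency and long timeout from leaking into the REST API's configuration.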
