3 min read

Scaling roadmap

This roadmap explains how to scale GAT from a single machine to thousands of nodes, with concrete CLI/architecture changes tied to each horizon.

Horizon 1 — Multicore on one machine

  • Add rayon work-stealing for embarrassingly parallel workloads (N-1, Monte-Carlo, per-hour PF) while keeping the core math synchronous; see the sketch after this list.
  • Provide --threads <N>|auto on heavy commands (default to num_cpus::get()).
  • Keep pure-Rust linear algebra (faer, sprs) but allow feature flags for OpenBLAS/MKL if dense backends are needed.
  • Hide factorization engines behind a trait so you can hot-swap backends (e.g. --solver {faer,openblas,mkl}) later.
  • Standardize outputs on partitioned Parquet with stable schemas and add --out-partitions (e.g., run_id,date/contingency).
  • Memory-map Arrow IPC for zero-copy handoffs and checkpoint each long run with run.json so gat runs resume can pick up from the last saved chunk.
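
As a concrete illustration of the rayon fan-out and the --threads flag above, here is a minimal Rust sketch. The Contingency and Violation types and the run_dcpf stub are placeholders rather than real GAT APIs; the point is sizing the pool from the flag and keeping only the outer loop parallel.

    use rayon::prelude::*;

    struct Contingency { id: u64 }
    struct Violation { contingency_id: u64, element: String, loading_pct: f64 }

    fn run_dcpf(_c: &Contingency) -> Vec<Violation> {
        // The core math stays synchronous; only the outer loop is parallel.
        Vec::new()
    }

    fn screen(contingencies: &[Contingency], threads: Option<usize>) -> Vec<Violation> {
        // --threads <N> maps to Some(n); `auto` maps to None and falls back to num_cpus::get().
        let n = threads.unwrap_or_else(num_cpus::get);
        let pool = rayon::ThreadPoolBuilder::new()
            .num_threads(n)
            .build()
            .expect("failed to build rayon thread pool");
        pool.install(|| {
            contingencies
                .par_iter()              // work-stealing across the pool
                .flat_map_iter(run_dcpf) // one DCPF per contingency
                .collect()
        })
    }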

Horizon 2 — Fan-out onto many small machines

  • Introduce an Executor trait (default LocalExecutor, plus a ProcessPoolExecutor that spawns child gat processes for memory isolation); a sketch follows this list.
  • CLI flags like --executor {local,process} and --max-procs control how many worker processes run simultaneously.
  • Route all chunk IO through an object-store abstraction (e.g., opendal) so the same code works on local filesystems or S3/GCS buckets.
  • Optional gat-worker binary pulls chunk specs from a queue (NATS JetStream, Redis Streams) and writes results into the artifact store.
  • The GUI watches manifests instead of processes, making it equally useful for local or remote runs.
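
A minimal sketch of what the Executor trait and ProcessPoolExecutor could look like. The trait shape, the ChunkSpec/ChunkResult structs, and the wave-based scheduling are assumptions for illustration, not the finished gat-exec API.

    use std::process::Command;

    pub struct ChunkSpec { pub id: String, pub args: Vec<String> }
    pub struct ChunkResult { pub id: String, pub ok: bool }

    pub trait Executor {
        fn run(&self, chunks: Vec<ChunkSpec>) -> Vec<ChunkResult>;
    }

    /// Runs chunks inside the current process (the Horizon 1 path).
    pub struct LocalExecutor;

    /// Spawns one child `gat` process per chunk for memory isolation.
    pub struct ProcessPoolExecutor { pub max_procs: usize }

    impl Executor for ProcessPoolExecutor {
        fn run(&self, chunks: Vec<ChunkSpec>) -> Vec<ChunkResult> {
            // Simplified scheduling: launch chunks in waves of at most max_procs children.
            chunks
                .chunks(self.max_procs.max(1))
                .flat_map(|wave| {
                    let children: Vec<_> = wave
                        .iter()
                        .map(|c| (c.id.clone(), Command::new("gat").args(&c.args).spawn()))
                        .collect();
                    children.into_iter().map(|(id, child)| {
                        let ok = child
                            .and_then(|mut ch| ch.wait())
                            .map(|status| status.success())
                            .unwrap_or(false);
                        ChunkResult { id, ok }
                    })
                })
                .collect()
        }
    }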

Horizon 3 — Kubernetes / Nomad / Temporal

  • Ship OCI images ghcr.io/.../gat:<gitsha> that include CPU/CUDA builds and obey the input/output contract (S3 URIs) so subcommands can run as containers.
  • Provide template generators for workflow engines (Argo, Flyte, Temporal) so teams can run import → partition → PF/OPF fanout → reduce DAGs.
  • Introduce a tiny control-plane service (gat-svc) with a gRPC API to submit runs, list chunks, and stream logs.
  • Emit OTLP-friendly tracing with dashboards for stage throughput, failure maps, and chunk timings.
  • Add spot-friendly chunking (--max-runtime-per-chunk) and priority controls per stage; see the sketch after this list.
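
One way the spot-friendly chunking could work, assuming each work item carries a runtime estimate: pack items into chunks whose total estimated runtime stays under --max-runtime-per-chunk, so a preempted node loses at most one chunk of work. WorkItem and the greedy packing are illustrative, not GAT APIs.

    use std::time::Duration;

    #[derive(Clone)]
    struct WorkItem { id: u64, est_runtime: Duration }

    fn pack_chunks(items: &[WorkItem], max_runtime_per_chunk: Duration) -> Vec<Vec<WorkItem>> {
        let mut chunks = Vec::new();
        let mut current: Vec<WorkItem> = Vec::new();
        let mut budget = Duration::ZERO;
        for item in items {
            // Start a new chunk once the runtime budget would be exceeded.
            if !current.is_empty() && budget + item.est_runtime > max_runtime_per_chunk {
                chunks.push(std::mem::take(&mut current));
                budget = Duration::ZERO;
            }
            budget += item.est_runtime;
            current.push(item.clone());
        }
        if !current.is_empty() {
            chunks.push(current);
        }
        chunks
    }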

Horizon 4 — HPC / MPI / SLURM

  • Implement a SlurmExecutor that emits ready-to-submit sbatch scripts with environment variables for credentials and run manifests (sketched after this list).
  • Optional MPIExecutor for tightly coupled solvers.
  • Call into PETSc/Trilinos for distributed linear algebra while keeping Rust for orchestration.
  • Domain decomposition (region-by-region workflows) keeps map data local and exchanges interface variables iteratively.
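
A rough sketch of how a SlurmExecutor could render a ready-to-submit sbatch script. The job fields, the GAT_RUN_MANIFEST variable, and the worker command are illustrative assumptions, not a finalized interface.

    struct SlurmJob {
        run_id: String,
        array_size: usize,
        cpus_per_task: usize,
        manifest_uri: String,
        worker_cmd: String, // e.g. the gat-worker invocation for one array index
    }

    fn render_sbatch(job: &SlurmJob) -> String {
        format!(
            "#!/bin/bash\n\
             #SBATCH --job-name=gat-{run_id}\n\
             #SBATCH --array=0-{last}\n\
             #SBATCH --cpus-per-task={cpus}\n\
             # Credentials come from the submission environment, not hard-coded here.\n\
             export GAT_RUN_MANIFEST={manifest}\n\
             {cmd}\n",
            run_id = job.run_id,
            last = job.array_size.saturating_sub(1),
            cpus = job.cpus_per_task,
            manifest = job.manifest_uri,
            cmd = job.worker_cmd,
        )
    }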

Horizon 5 — Serverless bursts

  • Package a cold-start-friendly worker (gat-map) that handles one chunk in under 15 minutes on Lambda/Batch; see the sketch after this list.
  • Keep compute in Batch but orchestrate with Step Functions or Cloud Workflows.
  • Optionally expose Arrow Flight endpoints so Python/R clients query telemetry without copying data.
  • Use DataFusion/Ballista for distributed joins and aggregations near the data plane.
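
A skeleton of what the cold-start-friendly gat-map worker could look like: read one chunk spec, do the work, write one result, and exit. Plain file paths stand in for the real object-store and queue wiring, and process_chunk is a placeholder.

    use std::{env, fs, process};

    fn main() {
        let mut args = env::args().skip(1);
        let (spec_path, out_path) = match (args.next(), args.next()) {
            (Some(s), Some(o)) => (s, o),
            _ => {
                eprintln!("usage: gat-map <chunk-spec.json> <result.json>");
                process::exit(2);
            }
        };

        // Keep startup cheap: no long-lived pools or caches, so cold starts stay fast.
        let spec = fs::read_to_string(&spec_path).expect("cannot read chunk spec");
        let result = process_chunk(&spec); // placeholder for the actual PF/OPF work
        fs::write(&out_path, result).expect("cannot write chunk result");
    }

    fn process_chunk(spec: &str) -> String {
        // A real worker would deserialize the spec, solve, and serialize structured results.
        format!("{{\"processed\":true,\"spec_bytes\":{}}}", spec.len())
    }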

Concrete scale-outs you can do today

  1. N-1 DC screening: chunk by hour × contingency, run DCPF per chunk, emit violations, and reduce to top violators per element/hour. Start with ProcessPoolExecutor before moving to Argo or Slurm arrays.
  2. Monte-Carlo load/renewables: chunk by scenario and hour, keep RNG seeds in the manifest for replay (see the sketch after this list), and optionally sample weather on GPU (wgpu or CUDA) while saving CPU for solves.
  3. Rolling OPF: DAG with forecasting → DC-OPF batches → post-checks. Run per BA or partition for data locality and stitch interfaces afterward.
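
For the Monte-Carlo case, keeping RNG seeds in the manifest can be as simple as deriving one deterministic seed per (scenario, hour) chunk and recording it, so any chunk can be replayed bit-for-bit. The sketch below assumes the rand and rand_chacha crates; the seed derivation and manifest layout are illustrative.

    use rand::{Rng, SeedableRng};
    use rand_chacha::ChaCha8Rng;

    struct ScenarioChunk { scenario: u64, hour: u32, seed: u64 }

    /// Derive one deterministic seed per (scenario, hour) chunk from a run-level seed.
    fn build_manifest(run_seed: u64, scenarios: u64, hours: u32) -> Vec<ScenarioChunk> {
        let mut chunks = Vec::new();
        for scenario in 0..scenarios {
            for hour in 0..hours {
                // Simple mixing; any stable derivation works as long as it is recorded.
                let seed = run_seed
                    ^ scenario.wrapping_mul(0x9E37_79B9_7F4A_7C15)
                    ^ u64::from(hour);
                chunks.push(ScenarioChunk { scenario, hour, seed });
            }
        }
        chunks
    }

    fn sample_load(chunk: &ScenarioChunk) -> f64 {
        // Replaying the same chunk with the same seed reproduces the same draw.
        let mut rng = ChaCha8Rng::seed_from_u64(chunk.seed);
        rng.gen_range(0.8..1.2) // placeholder load multiplier
    }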

Code & CLI pipelines to prioritize

  • crates/gat-exec/ with the Executor trait plus Local, Process, and Slurm implementations.
  • gat --executor <name> --max-procs <N> --artifact <uri> --queue <uri> CLI flags.
  • A universal ChunkSpec/ChunkResult JSON contract and --chunk-spec/--emit-chunk-specs helpers for chunk producers and consumers; a possible shape is sketched after this list.
  • crates/gat-artifacts/ (object store via opendal) and crates/gat-metadata/ (manifests, checksums).
  • Remote gRPC services (SubmitRun, GetRun, ListChunks, GetLogs) plus optional Arrow Flight for bulk results.
  • gat admin retry-failed <run_id> plus exponential backoff for idempotent chunk retries.
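
One possible shape for the universal ChunkSpec/ChunkResult contract, using serde so the same structs can back --chunk-spec files, queue messages, and gRPC payloads. The field names and example values are assumptions, not a settled schema.

    use serde::{Deserialize, Serialize};

    #[derive(Serialize, Deserialize)]
    pub struct ChunkSpec {
        pub run_id: String,
        pub chunk_id: String,
        pub stage: String,             // e.g. "dcpf", "dcopf", "reduce"
        pub inputs: Vec<String>,       // artifact URIs (local path or s3://...)
        pub params: serde_json::Value, // stage-specific parameters
    }

    #[derive(Serialize, Deserialize)]
    pub struct ChunkResult {
        pub chunk_id: String,
        pub status: String,            // "ok" | "failed" | "retryable"
        pub outputs: Vec<String>,      // artifact URIs written by the chunk
        pub wall_time_s: f64,
    }

    fn main() -> Result<(), serde_json::Error> {
        let spec: ChunkSpec = serde_json::from_str(
            r#"{"run_id":"r1","chunk_id":"c0","stage":"dcpf","inputs":[],"params":{}}"#,
        )?;
        println!("{}", serde_json::to_string_pretty(&spec)?);
        Ok(())
    }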

Solver & data strategy

  • Keep AC PF in-process until you need distributed solves; roughly 90% of throughput comes from chunk-level parallelism rather than from parallelizing any single solve.
  • Cache Jacobian sparsity patterns and warm-start Newton with the previous hour’s solution when replaying contingencies; see the sketch after this list.
  • Default to DC-OPF with HiGHS for fleet-scale throughput and reserve AC-OPF for flagged slices.
  • Partition Parquet everywhere, keep run IDs content-addressed, and store large artifacts (plots, maps) in the artifact tree.
  • Use short-lived object store credentials and namespace runs per org/project so multi-tenant policies (OPA/Rego) can gate access.
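
A loose sketch of the caching and warm-start idea from the bullets above: keep the Jacobian sparsity pattern keyed by a topology hash and seed Newton with the previous hour's solution, falling back to a flat start. SparsityPattern, Solution, and the cache keys are placeholders, not GAT internals.

    use std::collections::HashMap;

    #[derive(Clone)]
    struct SparsityPattern { /* row/col indices of Jacobian nonzeros */ }
    #[derive(Clone)]
    struct Solution { v_mag: Vec<f64>, v_ang: Vec<f64> }

    #[derive(Default)]
    struct WarmStartCache {
        patterns: HashMap<u64, SparsityPattern>,      // keyed by topology hash
        last_solution: HashMap<(u64, u32), Solution>, // keyed by (topology hash, hour)
    }

    impl WarmStartCache {
        /// Reuse the symbolic pattern when the topology hash is unchanged.
        fn pattern_or_insert(
            &mut self,
            topo: u64,
            build: impl FnOnce() -> SparsityPattern,
        ) -> &SparsityPattern {
            self.patterns.entry(topo).or_insert_with(build)
        }

        /// Seed Newton with the previous hour's solution if available, else a flat start.
        fn initial_guess(&self, topo: u64, hour: u32, flat: Solution) -> Solution {
            hour.checked_sub(1)
                .and_then(|prev| self.last_solution.get(&(topo, prev)).cloned())
                .unwrap_or(flat)
        }
    }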