Scaling roadmap
This roadmap explains how to scale GAT from a single machine to thousands of nodes, with concrete CLI/architecture changes tied to each horizon.
Horizon 1 — Multicore on one machine
- Add `rayon` work-stealing for embarrassingly parallel workloads (N-1, Monte-Carlo, per-hour PF) while keeping the core math synchronous (see the sketch after this list).
- Provide `--threads <N>|auto` on heavy commands (default to `num_cpus::get()`).
- Keep pure-Rust linear algebra (`faer`, `sprs`) but allow feature flags for OpenBLAS/MKL if dense backends are needed.
- Hide factorization engines behind a trait so you can hot-swap e.g. `--solver {faer,openblas,mkl}` later.
- Standardize outputs on partitioned Parquet with stable schemas and add `--out-partitions` (e.g., `run_id`, `date/contingency`).
- Memory-map Arrow IPC for zero-copy handoffs and checkpoint each long run with `run.json` so `gat runs resume` can pick up from the last saved chunk.
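A minimal sketch of the Rayon fan-out described above, assuming placeholder `Network`/`Contingency`/`Violation` types and a hypothetical `solve_dcpf` helper; the `--threads` flag maps onto an explicitly sized thread pool, and `auto` falls back to `num_cpus::get()`.

```rust
// Minimal sketch, assuming placeholder Network/Contingency/Violation types and
// a hypothetical solve_dcpf() helper; not GAT's real API.
use rayon::prelude::*;

struct Network;
struct Contingency { id: u64 }
struct Violation { contingency_id: u64, element: String, loading_pct: f64 }

fn solve_dcpf(_net: &Network, c: &Contingency) -> Vec<Violation> {
    // Placeholder for the synchronous core math (factorize B', solve angles, check flows).
    vec![Violation { contingency_id: c.id, element: "line_42".into(), loading_pct: 104.3 }]
}

/// `threads: None` corresponds to `--threads auto`.
fn run_screening(net: &Network, contingencies: &[Contingency], threads: Option<usize>) -> Vec<Violation> {
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(threads.unwrap_or_else(num_cpus::get))
        .build()
        .expect("failed to build thread pool");

    pool.install(|| {
        contingencies
            .par_iter()                            // work-stealing across contingencies
            .flat_map_iter(|c| solve_dcpf(net, c)) // each chunk stays synchronous inside
            .collect()
    })
}
```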
Horizon 2 — Fan-out onto many small machines
- Introduce an `Executor` trait (default `LocalExecutor`, plus `ProcessPoolExecutor` to fork child `gat` binaries for memory isolation); a sketch follows this list.
- CLI flags like `--executor {local,process}` and `--max-procs` control how many heads run simultaneously.
- Route all chunk IO through an object-store abstraction (e.g., `opendal`) so the same code works on local filesystems or S3/GCS buckets.
- Optional `gat-worker` binary pulls chunk specs from a queue (NATS JetStream, Redis Streams) and writes results into the artifact store.
- The GUI watches manifests instead of processes, making it equally useful for local or remote runs.
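A hedged sketch of what the `Executor` trait could look like, with a trivial `LocalExecutor` and a `ProcessPoolExecutor` that forks child `gat` processes. The `run-chunk` subcommand and the simplified `ChunkSpec`/`ChunkResult` shapes are assumptions, not the final contract.

```rust
// Sketch under assumed names: a simplified ChunkSpec/ChunkResult pair (the full
// JSON contract appears later in this roadmap) and a hypothetical `gat run-chunk`
// subcommand invoked by the process executor.
use std::process::Command;

#[derive(Clone)]
pub struct ChunkSpec { pub id: String, pub args: Vec<String> }
pub struct ChunkResult { pub id: String, pub ok: bool }

pub trait Executor {
    fn run(&self, chunks: Vec<ChunkSpec>) -> Vec<ChunkResult>;
}

/// Runs every chunk in-process (Rayon would parallelize this in practice).
pub struct LocalExecutor;

impl Executor for LocalExecutor {
    fn run(&self, chunks: Vec<ChunkSpec>) -> Vec<ChunkResult> {
        chunks.into_iter()
            .map(|c| ChunkResult { id: c.id, ok: true }) // placeholder for the real solve
            .collect()
    }
}

/// Forks one child `gat` process per chunk for memory isolation.
pub struct ProcessPoolExecutor { pub max_procs: usize }

impl Executor for ProcessPoolExecutor {
    fn run(&self, chunks: Vec<ChunkSpec>) -> Vec<ChunkResult> {
        // A real pool would cap concurrency at `max_procs`; shown sequentially here.
        chunks.into_iter().map(|c| {
            let status = Command::new("gat")
                .arg("run-chunk") // hypothetical subcommand
                .args(&c.args)
                .status();
            ChunkResult { id: c.id, ok: status.map(|s| s.success()).unwrap_or(false) }
        }).collect()
    }
}
```

Keeping the trait this small is what lets the later `Slurm` and queue-driven executors slot in without touching chunk producers.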
Horizon 3 — Kubernetes / Nomad / Temporal
- Ship OCI images `ghcr.io/.../gat:<gitsha>` that include CPU/CUDA builds and obey the input/output contract (S3 URIs) so subcommands can run as containers.
- Provide template generators for workflow engines (Argo, Flyte, Temporal) so teams can run import → partition → PF/OPF fan-out → reduce DAGs.
- Introduce a tiny control-plane service (`gat-svc`) with a gRPC API to submit runs, list chunks, and stream logs.
- Emit OTLP-friendly `tracing` with dashboards for stage throughput, failure maps, and chunk timings (see the instrumentation sketch after this list).
- Add spot-friendly chunking (`--max-runtime-per-chunk`) and priority controls per stage.
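A small instrumentation sketch with the `tracing` crate; the `process_chunk` function and field names like `elapsed_ms` are illustrative, and the plain `fmt` subscriber stands in for an OTLP/OpenTelemetry export layer.

```rust
// Hedged sketch with the `tracing` crate; the plain fmt subscriber stands in for
// an OTLP/OpenTelemetry export layer, and field names like `elapsed_ms` are
// illustrative.
use tracing::{info, instrument};

#[instrument(skip(payload))] // records run_id/chunk_id as span fields
fn process_chunk(run_id: &str, chunk_id: &str, payload: &[u8]) {
    let started = std::time::Instant::now();
    // ... solve the chunk ...
    info!(elapsed_ms = started.elapsed().as_millis() as u64, "chunk finished");
}

fn main() {
    // Stdout subscriber for local runs; dashboards would consume exported spans.
    tracing_subscriber::fmt().with_target(false).init();
    process_chunk("run-001", "hour-17/ctg-0042", &[]);
}
```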
Horizon 4 — HPC / MPI / SLURM
- Implement a `SlurmExecutor` that emits ready-to-submit `sbatch` scripts with environment variables for credentials and run manifests (see the sketch after this list).
- Optional `MPIExecutor` for tightly coupled solvers.
- Call into PETSc/Trilinos for distributed linear algebra while keeping Rust for orchestration.
- Domain decomposition (region-by-region workflows) keeps map data local and exchanges interface variables iteratively.
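A minimal sketch of the script generation a `SlurmExecutor` could perform; the `SBATCH` directives are standard Slurm, while the `gat run-chunk` invocation, its flags, and the `GAT_RUN_MANIFEST` variable are assumptions.

```rust
// Illustrative sbatch rendering; the SBATCH directives are standard Slurm, but
// GAT_RUN_MANIFEST and the `gat run-chunk` flags are assumptions.
pub struct SlurmJob<'a> {
    pub run_id: &'a str,
    pub manifest_uri: &'a str,
    pub array_size: usize,
    pub cpus_per_task: usize,
    pub walltime: &'a str, // e.g. "02:00:00"
}

pub fn render_sbatch(job: &SlurmJob) -> String {
    format!(
        "#!/bin/bash\n\
         #SBATCH --job-name=gat-{run}\n\
         #SBATCH --array=0-{last}\n\
         #SBATCH --cpus-per-task={cpus}\n\
         #SBATCH --time={walltime}\n\
         export GAT_RUN_MANIFEST={manifest}\n\
         # Each array task claims one chunk index from the run manifest.\n\
         gat run-chunk --manifest \"$GAT_RUN_MANIFEST\" --index \"$SLURM_ARRAY_TASK_ID\"\n",
        run = job.run_id,
        last = job.array_size.saturating_sub(1),
        cpus = job.cpus_per_task,
        walltime = job.walltime,
        manifest = job.manifest_uri,
    )
}
```

Emitting plain `sbatch` scripts keeps the HPC path inspectable: operators can read, edit, and resubmit them without GAT in the loop.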
Horizon 5 — Serverless bursts
- Package a cold-start-friendly worker (`gat-map`) that handles one chunk in <15 minutes for Lambda/Batch.
- Keep compute in Batch but orchestrate with Step Functions or Cloud Workflows.
- Optionally expose Arrow Flight endpoints so Python/R clients query telemetry without copying data.
- Use DataFusion/Ballista for distributed joins and aggregations near the data plane.
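As a taste of the query side, here is a hedged DataFusion sketch over the partitioned Parquet violations from Horizon 1; the dataset path and column names (`hour`, `element`, `loading_pct`) are assumptions, and Ballista would distribute the same SQL.

```rust
// Hedged single-process sketch; Ballista would distribute the same SQL. The
// dataset path and column names (hour, element, loading_pct) are assumptions.
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Local directory of partitioned Parquet; an S3 bucket would need an
    // object-store registration first.
    ctx.register_parquet("violations", "artifacts/violations/", ParquetReadOptions::default())
        .await?;

    // The "reduce" step of an N-1 screen: worst loading per element and hour.
    let df = ctx
        .sql(
            "SELECT hour, element, max(loading_pct) AS worst \
             FROM violations GROUP BY hour, element \
             ORDER BY worst DESC LIMIT 20",
        )
        .await?;
    df.show().await?;
    Ok(())
}
```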
Concrete scale-outs you can do today
- N-1 DC screening: chunk by hour × contingency, run DCPF per chunk, emit violations, and reduce to top violators per element/hour. Start with `ProcessPoolExecutor` before moving to Argo or Slurm arrays.
- Monte-Carlo load/renewables: chunk by scenario and hour, keep RNG seeds in the manifest for replay (see the sketch after this list), and optionally sample weather on the GPU (`wgpu` or CUDA) while saving CPU for solves.
- Rolling OPF: DAG with forecasting → DC-OPF batches → post-checks. Run per BA or partition for data locality and stitch interfaces afterward.
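A sketch of the seed-in-manifest idea for the Monte-Carlo scale-out: one chunk per (scenario, hour) with a deterministic RNG seed recorded so any chunk can be replayed; the field names and seed derivation are illustrative, not the final schema.

```rust
// Illustrative chunk/seed derivation; field names are not the final ChunkSpec schema.
use serde::Serialize;

#[derive(Serialize)]
struct McChunk {
    scenario: u32,
    hour: u32,
    rng_seed: u64,
}

fn build_chunks(scenarios: u32, hours: u32, base_seed: u64) -> Vec<McChunk> {
    (0..scenarios)
        .flat_map(|s| {
            (0..hours).map(move |h| McChunk {
                scenario: s,
                hour: h,
                // Deterministic per-chunk seed derived from the run's base seed,
                // so a failed or audited chunk replays bit-for-bit.
                rng_seed: base_seed ^ (((s as u64) << 32) | h as u64),
            })
        })
        .collect()
}

fn main() -> serde_json::Result<()> {
    let chunks = build_chunks(3, 2, 0xC0FFEE);
    // In GAT this listing would sit in the run manifest next to run.json.
    println!("{}", serde_json::to_string_pretty(&chunks)?);
    Ok(())
}
```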
Code & CLI pipelines to prioritize
- `crates/gat-exec/` with the `Executor` trait plus `Local`, `Process`, and `Slurm` implementations.
- `gat --executor <name> --max-procs <N> --artifact <uri> --queue <uri>` CLI flags.
- A universal `ChunkSpec`/`ChunkResult` JSON contract and `--chunk-spec`/`--emit-chunk-specs` helpers for chunk producers/consumers (see the sketch after this list).
- `crates/gat-artifacts/` (object store via `opendal`) and `crates/gat-metadata/` (manifests, checksums).
- Remote gRPC services (`SubmitRun`, `GetRun`, `ListChunks`, `GetLogs`) plus optional Arrow Flight for bulk results.
- `gat admin retry-failed <run_id>` plus exponential backoff for idempotent chunk retries.
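A hedged sketch of what the universal `ChunkSpec`/`ChunkResult` JSON contract might look like with `serde`; every field name here is an assumption, the point being a single schema shared by `--emit-chunk-specs` producers and `--chunk-spec` consumers.

```rust
// Every field name here is an assumption; the point is one serde schema shared by
// --emit-chunk-specs producers and --chunk-spec consumers.
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug)]
pub struct ChunkSpec {
    pub run_id: String,
    pub chunk_id: String,
    pub stage: String,             // e.g. "dcpf-screening"
    pub inputs: Vec<String>,       // artifact URIs the chunk reads
    pub params: serde_json::Value, // stage-specific knobs (hours, seeds, ...)
}

#[derive(Serialize, Deserialize, Debug)]
pub struct ChunkResult {
    pub chunk_id: String,
    pub status: String,            // e.g. "ok" | "failed" | "retryable"
    pub outputs: Vec<String>,      // artifact URIs the chunk wrote
    pub runtime_s: f64,
}

fn main() -> serde_json::Result<()> {
    // Round-trip a spec exactly as a producer would emit it and a worker consume it.
    let spec: ChunkSpec = serde_json::from_str(
        r#"{"run_id":"r1","chunk_id":"h17-c42","stage":"dcpf-screening",
            "inputs":["s3://gat/r1/network.arrow"],"params":{"hour":17}}"#,
    )?;
    println!("{}", serde_json::to_string(&spec)?);
    Ok(())
}
```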
Solver & data strategy
- Keep AC PF in-process until you need distributed solves; 90% of throughput comes from chunk parallelism.
- Cache Jacobian sparsity patterns and warm-start Newton with the previous hour's solution when replaying contingencies (see the warm-start sketch after this list).
- Default to DC-OPF with HiGHS for fleet-scale throughput and reserve AC-OPF for flagged slices.
- Partition Parquet everywhere, keep run IDs content-addressed, and store large artifacts (plots, maps) in the artifact tree.
- Use short-lived object store credentials and namespace runs per org/project so multi-tenant policies (OPA/Rego) can gate access.
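A small sketch of the warm-start cache mentioned above, assuming placeholder power-flow state types: keep the last converged solution per hour and reuse it (or the previous hour's) as the Newton initial guess instead of a flat start.

```rust
// Placeholder power-flow state types; not GAT's real solver structs.
use std::collections::HashMap;

#[derive(Clone)]
struct PfState {
    v_mag: Vec<f64>, // per-bus voltage magnitudes (p.u.)
    v_ang: Vec<f64>, // per-bus voltage angles (rad)
}

#[derive(Default)]
struct WarmStartCache {
    by_hour: HashMap<u32, PfState>,
}

impl WarmStartCache {
    /// Initial guess for `hour`: the last converged state for that hour,
    /// else the previous hour's, else a flat start.
    fn initial_guess(&self, hour: u32, n_bus: usize) -> PfState {
        self.by_hour
            .get(&hour)
            .or_else(|| hour.checked_sub(1).and_then(|h| self.by_hour.get(&h)))
            .cloned()
            .unwrap_or(PfState {
                v_mag: vec![1.0; n_bus],
                v_ang: vec![0.0; n_bus],
            })
    }

    fn store(&mut self, hour: u32, state: PfState) {
        self.by_hour.insert(hour, state);
    }
}
```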