Fig. 0 — Coordination, self-healed

AEGIS

A reference manual for self-healing distributed coordination.

Built & illustrated by Abhishek Aditya.

Ed. 0.1 · alpha · 2026

Chaos in, postmortems out. The LLM lives in the control plane — never the data plane.
$ docker compose up
§ 01 — The Problem

Coordination is the load-bearing wall of the cloud.

Chubby · ZooKeeper
etcd · Consul
— and the on-call rota behind them.

Every Kubernetes control loop, every Cassandra ring, every distributed lock rests on a coordination service. They are battle-tested — and operationally brutal: split-brain incidents, lease starvation, slow-follower cascades, and endless consensus-parameter tuning consume disproportionate on-call effort. Postmortems lag the incidents that produced them.

Recent proposals to inject LLM reasoning into the consensus data plane have been rightly rejected by practitioners. Consensus — Paxos, Raft, ZAB — rests on a few narrow correctness invariants and is acutely sensitive to non-determinism. A non-deterministic language model in the commit path breaks the guarantee that two replicas applying the same log entry compute the same state. That guarantee is the safety.

The right place for an LLM agent is the control plane: observing, recommending, and documenting — never deciding. AEGIS is the open-source artifact that makes this argument concrete and testable.

§ 02 — How it works

Two planes.
One invariant.

The data plane is deterministic Raft — Apache Ratis on Java 25, exposed over gRPC, with leader-stamped wall-clock time so that TTL math is reproducible across replicas.

The control plane is an agentic Python sidecar — LangGraph + the Anthropic SDK — that observes telemetry, recommends config changes as GitHub PRs, and drafts postmortems as GitHub Issues.

The two share a telemetry surface. They do not share a mutation surface. A human holds every merge bit.

Data plane → Apache Ratis · Java 25
Control plane → Python · LangGraph · Anthropic SDK
Mutation path → GitHub PRs + Issues
Fig_001 [ The two planes ]: the two planes of AEGIS. Top: the deterministic, LLM-free data plane, where gRPC clients (lock · lease · get) feed five Apache Ratis Raft nodes on Java 25 with one elected leader (node3), exposing locks and leases (M2) and a key-value store with watches (M3). A telemetry-only boundary (Prometheus scrape + Redis Streams, M5) separates it from the agentic, advisory-only control plane below, where a Python agent (LangGraph · SDK) opens only GitHub pull requests for config and GitHub Issues for postmortems, and a human reviewer holds the merge bit. The LLM never crosses this line: read-only telemetry flows up, pull requests and issues flow down, with no gRPC, no lock, no key.
Fig_002 [ The safe closed loop ]: a left-to-right pipeline of five stages with a return arrow from the last stage back to the first. Detect (M6, heuristic classifier) → Diagnose (M13, retrieval-augmented root-cause) → Propose-with-proof (M11, counterfactual sandbox replay; proves the fix helps before the PR) → Constrain (M12, safety envelope) → Human merges (PR / Issue gate). Return path: merged config → cluster heals → re-observe. Chaos in: M9 chaos harness. Postmortem out: M8 drafter → Issue.
§ 03 — The closed loop

Propose — and prove.

An open loop says "anomaly → PR → hope it helped." AEGIS closes it without ever touching the data-plane invariant.

Before a PR is opened, a verifier replays the exact chaos trace in an ephemeral sandbox cluster under both the current and the proposed config, and embeds the before/after delta in the PR. A static safety envelope then rejects any patch that violates a consensus-safety constraint, so an unsafe config never reaches the review queue.
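A minimal sketch of what a static safety envelope could look like, assuming hypothetical parameter names (`heartbeat_interval_ms`, `election_timeout_min_ms`, `election_timeout_max_ms`) and three illustrative constraints drawn from common Raft tuning guidance. These are not the actual M12 rules, which are still in build.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RaftConfig:
    # Hypothetical parameter names, chosen for illustration only.
    heartbeat_interval_ms: int
    election_timeout_min_ms: int
    election_timeout_max_ms: int


def envelope_violations(cfg: RaftConfig) -> list[str]:
    """Return every violated constraint; an empty list means the patch may proceed."""
    violations = []
    if cfg.heartbeat_interval_ms <= 0:
        violations.append("heartbeat interval must be positive")
    if cfg.election_timeout_min_ms > cfg.election_timeout_max_ms:
        violations.append("election timeout min exceeds max")
    # Assumed rule of thumb: followers should see several heartbeats per timeout.
    if cfg.election_timeout_min_ms < 3 * cfg.heartbeat_interval_ms:
        violations.append("election timeout min is under 3x the heartbeat interval")
    return violations
```

Because the check is a pure function of the proposed values, it can run before any PR is drafted and gives the same verdict on every run.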

M11 counterfactual verify · M12 safety envelope · M13 RAG root-cause — the closed-loop modules, in build.
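The before/after comparison performed by the verifier can be sketched as follows. This is an illustration under stated assumptions, not the M11 implementation: `fake_replay` stands in for a real sandbox run, and the metric names (`leader_elections`, `p99_commit_ms`) and the `election_timeout_ms` key are hypothetical.

```python
from typing import Callable, Mapping

Metrics = Mapping[str, float]


def counterfactual_delta(
    replay: Callable[[Mapping[str, int]], Metrics],
    current: Mapping[str, int],
    proposed: Mapping[str, int],
) -> dict[str, float]:
    """Replay the same trace under both configs; return proposed minus current per metric."""
    before = replay(current)
    after = replay(proposed)
    return {name: after[name] - before[name] for name in before}


def fix_helps(delta: Mapping[str, float], lower_is_better: tuple[str, ...]) -> bool:
    """True when no lower-is-better metric got worse and at least one improved."""
    values = [delta[name] for name in lower_is_better]
    return all(v <= 0 for v in values) and any(v < 0 for v in values)


def fake_replay(cfg: Mapping[str, int]) -> Metrics:
    # Stand-in for a sandbox run; a real verifier would start an ephemeral cluster.
    return {
        "leader_elections": 1200 // cfg["election_timeout_ms"],
        "p99_commit_ms": 40.0,
    }


delta = counterfactual_delta(
    fake_replay, {"election_timeout_ms": 300}, {"election_timeout_ms": 600}
)
```

A proposer built this way would embed `delta` in the PR body and skip the PR when `fix_helps` returns false.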

§ 04 — Consensus mechanics

Raft, leader-elected.

Five nodes, one elected leader, on-disk log + snapshot. The leader replicates an append-only log to its followers; a quorum acknowledgement commits each entry. Kill the leader and a new term elects a successor.
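The quorum rule can be made concrete with a few lines of arithmetic. This is a simplified sketch, not Ratis code: `match_index` is an assumed list holding the highest log index known to be stored on each node (leader included), and the Raft requirement that the entry belong to the leader's current term is omitted for brevity.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest majority of the cluster."""
    return cluster_size // 2 + 1


def commit_index(match_index: list[int]) -> int:
    """Highest log index stored on a majority of nodes."""
    ranked = sorted(match_index, reverse=True)
    return ranked[quorum_size(len(match_index)) - 1]
```

With five nodes the quorum is three, so `commit_index([11, 11, 10, 9, 7])` is 10: entry 10 is on three nodes, entry 11 on only two.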

Every command carries leader-stamped wall-clock time in its proto envelope, so lease and TTL math is identical on every replica — determinism by construction.
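A small sketch of why leader-stamped time makes lease math deterministic. The state machine below never reads a local clock; it uses only the timestamp carried in each entry. The `leader_timestamp_ms` field name follows the figure; the command names (`grant`, `expire_check`) and the `LeaseTable` class are hypothetical, and the real data plane is Java, so this Python version is purely illustrative.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LogEntry:
    op: str                   # "grant" or "expire_check" (hypothetical command names)
    lease_id: str
    leader_timestamp_ms: int  # stamped once by the leader, replicated verbatim
    ttl_ms: int = 0


@dataclass
class LeaseTable:
    deadlines: dict[str, int] = field(default_factory=dict)

    def apply(self, entry: LogEntry) -> None:
        """Apply one committed entry using only data carried in the entry."""
        if entry.op == "grant":
            self.deadlines[entry.lease_id] = entry.leader_timestamp_ms + entry.ttl_ms
        elif entry.op == "expire_check":
            deadline = self.deadlines.get(entry.lease_id)
            if deadline is not None and entry.leader_timestamp_ms >= deadline:
                del self.deadlines[entry.lease_id]


def replay(log: list[LogEntry]) -> dict[str, int]:
    """Build lease state from scratch, as any replica would."""
    table = LeaseTable()
    for entry in log:
        table.apply(entry)
    return table.deadlines


log = [
    LogEntry("grant", "a", 1000, 500),
    LogEntry("grant", "b", 1100, 5000),
    LogEntry("expire_check", "a", 1600),
]
```

Any two replicas that call `replay(log)` on the same log obtain the same lease table, whatever their local clocks say. Had `apply` read the local clock instead, a replica with a skewed clock could expire lease `a` at a different log position.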

Wrapping Apache Ratis 3.x
Production-proven in Apache Ozone & IoTDB.
Fig_003 [ Raft leader election ]: Raft cluster and leader election. Five Raft nodes arranged in a ring, with the leader (node3) at the center replicating its append-only log to four followers (node1, node2, node4, node5). The term counter reads 42. The leader stamps leader_timestamp_ms on every log entry, which yields replica-deterministic TTLs. A side schematic shows the append-only log (entries 7 to 11) with a commit-index boundary, alongside a chaos-in to postmortem-out motif.
§ 05 — Table of contents

Thirteen modules.

Ten shipped (M1–M10, real code + a 99-test agent suite); three in build to close the loop (M11–M13).

Data plane — deterministic
M1   Raft Core                     done
M2   Locks + Leases                done
M3   KV Store + Watches            done
M4   Client SDKs (Java · Python)   done
M5   Telemetry Pipeline            done
Control plane — agentic
M6   Anomaly Classifier            done
M7   Config Proposer               done
M8   Postmortem Drafter            done
M9   Chaos Harness                 done
M10  Operator Dashboard            done
The safe closed loop — in build
M11  Counterfactual Verification   build
M12  Safety Envelope               build
M13  RAG Root-Cause                build
§ 06 — Quickstart

Four lines to a cluster.

Apple Silicon · Linux · Docker Desktop. Java 25, Python 3.10+. Build, verify, bring up a live 5-node cluster, then run the agent pipeline against it.

# 1 · clone
$ git clone https://github.com/abhishek-aditya/aegis && cd aegis

# 2 · build + test the Java reactor (Ratis · locks · KV · telemetry)
$ mvn -B verify

# 3 · bring up the 5-node cluster + observability stack
$ docker compose up --build
   grafana   http://localhost:3000
   dashboard http://localhost:4400

# 4 · run the control-plane agent pipeline (dry-run, no GitHub call)
$ cd agents && pip install -e ".[dev]"
$ aegis-classifier --once | aegis-proposer --dry-run
   99 tests pass · 7/7 fixture traces classify correctly
next → make chaos-slow · make chaos-cascade · tail -f chaos/events.jsonl
§ 07 — Evaluation

Results, honestly.

The contribution is the architectural invariant — and the open-source artifact that makes it concrete. Every config the agent proposes is logged with rationale, including the bad ones. Negative results are documented, not hidden.

Counterfactual verification, safety red-team, and root-cause accuracy benchmarks — coming soon.

Evaluation surface — forthcoming
[01] Anomaly classification accuracy on canned + chaos-injected traces.
[02] Counterfactual verification: does the sandbox replay prove the proposed fix helps?
[03] Safety red-team: the target is 0 unsafe configs reaching the PR queue, with the validator blocking 100%.
[04] Root-cause accuracy: LLM-with-retrieval top-1/top-k vs the deterministic baseline.
[05] Postmortem quality vs an alert-only baseline, judged by SRE raters.
[06] ConsensusOps-Bench: a reusable benchmark, a free byproduct of the M13 corpus.
§ 08 — Frequently doubted

Honest answers.

Is this really an LLM project if the LLM doesn't decide consensus?

Yes. The reasoning earns its keep where it helps: classifying anomalies from noisy multi-signal telemetry, drafting tuning PRs with rationale and rollback, and writing readable postmortems from a tool-bounded view. The agent simply never gets to mutate consensus. That separation is the contribution.

Why not write your own Raft?

Apache Ratis is battle-tested in Apache Ozone and IoTDB. The novelty is the agentic ops layer and the control-plane invariant, not the consensus algorithm. Reinventing Raft is a different project.

What if the agent proposes a bad config?

A human closes the PR and the cluster is unchanged. The safety envelope (M12, in build) is designed to reject unsafe patches before a PR is even opened. The agent's only mutation pathway is the review queue; bad proposals become logged evidence for the paper.

Why heuristic classification, not LLM classification?

Determinism. The replay test that anchors M6 needs the same answer every run. The LLM genuinely earns its keep in diagnosis and postmortem narration (M13, M8) — comparison, lesson-extraction, ranked root-cause — and that path is opt-in.
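To illustrate why a heuristic gives the same answer on every run, here is a minimal threshold classifier. It is a sketch under assumptions: the telemetry fields, thresholds, and labels are hypothetical and do not describe the actual M6 rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical telemetry fields; the real M6 inputs are not specified here.
    leaders_observed: int
    max_follower_lag_entries: int
    lease_grant_wait_p99_ms: float


def classify(s: Snapshot) -> str:
    """Fixed-order threshold rules: identical input always yields the identical label."""
    if s.leaders_observed > 1:
        return "split-brain"
    if s.max_follower_lag_entries > 1000:
        return "slow-follower"
    if s.lease_grant_wait_p99_ms > 500.0:
        return "lease-starvation"
    return "healthy"
```

There is no sampling and no model call, so a replay test can assert exact labels against fixture traces.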