Fig. 0 — Coordination, self-healed

AEGIS

A reference manual for self-healing distributed coordination.

Built & illustrated by Abhishek Aditya.

Ed. 0.1 · alpha · 2026

Chaos in, postmortems out. The LLM lives in the control plane — never the data plane.

$ docker compose up

§ 01 — The Problem

Coordination is the load-bearing wall of the cloud.

Chubby · ZooKeeper
etcd · Consul
— and the on-call rota behind them.

Every Kubernetes control loop, every Cassandra ring, every distributed lock rests on a coordination service. They are battle-tested — and operationally brutal: split-brain incidents, lease starvation, slow-follower cascades, and endless consensus-parameter tuning consume disproportionate on-call effort. Postmortems lag the incidents that produced them.

Recent proposals to inject LLM reasoning into the consensus data plane have been rightly rejected by practitioners. Consensus — Paxos, Raft, ZAB — rests on a few narrow correctness invariants and is acutely sensitive to non-determinism. A non-deterministic language model in the commit path breaks the guarantee that two replicas applying the same log entry compute the same state. That guarantee is the safety.

The right place for an LLM agent is the control plane: observing, recommending, and documenting — never deciding. AEGIS is the open-source artifact that makes this argument concrete and testable.

§ 02 — How it works

Two planes.
One invariant.

The data plane is deterministic Raft — Apache Ratis on Java 25, exposed over gRPC, with leader-stamped wall-clock time so that TTL math is reproducible across replicas.

The control plane is an agentic Python sidecar — LangGraph + the Anthropic SDK — that observes telemetry, recommends config changes as GitHub PRs, and drafts postmortems as GitHub Issues.

The two share a telemetry surface. They do not share a mutation surface. A human holds every merge bit.
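One way to picture the split, as a minimal sketch rather than AEGIS's actual code: the agent is handed a read-only telemetry snapshot and a read-only view of the config, and the only thing it can return is a proposal object destined for human review. The names `TelemetrySnapshot`, `Proposal`, and `recommend`, the metric and config keys, and the "lag above half the timeout" rule are all hypothetical.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping, Optional


@dataclass(frozen=True)
class TelemetrySnapshot:
    """Read-only view of cluster metrics (hypothetical shape)."""
    metrics: Mapping[str, float]


@dataclass(frozen=True)
class Proposal:
    """A suggested config change; it only takes effect if a human merges it."""
    key: str
    current: float
    proposed: float
    rationale: str


def recommend(snapshot: TelemetrySnapshot,
              config: Mapping[str, float]) -> Optional[Proposal]:
    """Return a proposal for review, or None. Never mutates config or cluster."""
    lag = snapshot.metrics.get("follower_lag_ms", 0.0)
    timeout = config["election_timeout_ms"]
    if lag > 0.5 * timeout:  # illustrative rule, not AEGIS's heuristic
        return Proposal(
            key="election_timeout_ms",
            current=timeout,
            proposed=timeout * 2,
            rationale=f"follower lag {lag:.0f} ms exceeds half the election timeout",
        )
    return None


# MappingProxyType gives the agent a view it cannot write through.
live_config = MappingProxyType({"election_timeout_ms": 1000.0})
snapshot = TelemetrySnapshot(MappingProxyType({"follower_lag_ms": 700.0}))
proposal = recommend(snapshot, live_config)
print(proposal)
```

The design choice the sketch tries to capture is that the agent holds no handle through which it could change the cluster: its output is data for a reviewer, not a command.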

Data plane → Apache Ratis · Java 25
Control plane → Python · LangGraph · Anthropic SDK
Mutation path → GitHub PRs + Issues

Fig_001 [ The two planes ]

Fig_002 [ The safe closed loop ]

§ 03 — The closed loop

Propose — and prove.

An open loop says "anomaly → PR → hope it helped." AEGIS closes it without ever touching the data-plane invariant.

Before a PR is opened, a verifier replays the exact chaos trace in an ephemeral sandbox cluster under both the current and proposed config, and embeds the before/after delta in the PR. A static safety envelope rejects any patch that violates a consensus-safety constraint — unsafe configs are structurally impossible to propose.
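The before/after comparison can be sketched as follows. This is a toy stand-in, not the M11 verifier: the "sandbox replay" is reduced to a pure function over a recorded trace of heartbeat gaps, and a gap longer than the election timeout is counted as one spurious election. The names `replay` and `counterfactual_delta`, the trace shape, and the metric are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class ReplayResult:
    spurious_elections: int


def replay(heartbeat_gaps_ms: Sequence[float],
           election_timeout_ms: float) -> ReplayResult:
    """Toy replay: each gap longer than the timeout counts as one spurious election."""
    count = sum(1 for gap in heartbeat_gaps_ms if gap > election_timeout_ms)
    return ReplayResult(spurious_elections=count)


def counterfactual_delta(trace: Sequence[float],
                         current_timeout_ms: float,
                         proposed_timeout_ms: float) -> Dict[str, int]:
    """Replay the same trace under both configs and report the difference."""
    before = replay(trace, current_timeout_ms)
    after = replay(trace, proposed_timeout_ms)
    return {
        "before": before.spurious_elections,
        "after": after.spurious_elections,
        "delta": after.spurious_elections - before.spurious_elections,
    }


# Hypothetical recorded chaos trace: milliseconds between heartbeats seen by a follower.
trace = [120.0, 950.0, 1300.0, 80.0, 1700.0, 2400.0]
report = counterfactual_delta(trace, current_timeout_ms=1000.0, proposed_timeout_ms=2000.0)
print(report)  # {'before': 3, 'after': 1, 'delta': -2}
```

Because both runs consume the identical trace, the delta isolates the effect of the config change, which is the property that makes it worth embedding in the PR.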

M11 counterfactual verify · M12 safety envelope · M13 RAG root-cause — the closed-loop modules, in build.

§ 04 — Consensus mechanics

Raft, leader-elected.

Five nodes, one elected leader, on-disk log + snapshot. The leader replicates an append-only log to its followers; acknowledgement from a quorum (three of the five nodes, counting the leader) commits each entry. Kill the leader and the surviving nodes elect a successor in a new term.
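The quorum rule can be illustrated with a few lines of arithmetic. This is a simplified sketch, not Ratis code: it computes the highest log index that a majority of nodes have stored, and it omits the additional Raft requirement that the entry belong to the leader's current term. The function names are hypothetical.

```python
from typing import List


def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return cluster_size // 2 + 1


def highest_quorum_index(match_index: List[int]) -> int:
    """Highest log index stored on at least a majority of nodes.

    match_index holds, per node (leader included), the last log index
    known to be replicated there.
    """
    ranked = sorted(match_index, reverse=True)
    # The value at position quorum-1 is held by at least `quorum` nodes.
    return ranked[quorum_size(len(match_index)) - 1]


print(quorum_size(5))                            # 3
print(highest_quorum_index([10, 9, 7, 4, 3]))    # 7
```

With five nodes the cluster tolerates two failures: entries up to index 7 are on three nodes, so they remain committed even if the two most advanced nodes are lost.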

Every command carries leader-stamped wall-clock time in its proto envelope, so lease and TTL math is identical on every replica — determinism by construction.
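A minimal sketch of why stamping helps, assuming a hypothetical `AcquireLease` command and `LeaseTable` state machine (neither is taken from the AEGIS source): the apply step reads time only from the command, never from the local clock, so two replicas that apply the same log reach the same lease table.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class AcquireLease:
    key: str
    holder: str
    ttl_ms: int
    leader_time_ms: int  # stamped once by the leader before replication


class LeaseTable:
    """Toy replicated state machine: key -> (holder, expires_at_ms)."""

    def __init__(self) -> None:
        self.leases: Dict[str, Tuple[str, int]] = {}

    def apply(self, cmd: AcquireLease) -> bool:
        """Grant or reject using only the stamped time, never time.time()."""
        held = self.leases.get(cmd.key)
        if held is not None:
            holder, expires_at = held
            if holder != cmd.holder and expires_at > cmd.leader_time_ms:
                return False  # another holder's lease is still live
        self.leases[cmd.key] = (cmd.holder, cmd.leader_time_ms + cmd.ttl_ms)
        return True


log: List[AcquireLease] = [
    AcquireLease("db-migrate", "worker-a", ttl_ms=5000, leader_time_ms=1_000),
    AcquireLease("db-migrate", "worker-b", ttl_ms=5000, leader_time_ms=4_000),
    AcquireLease("db-migrate", "worker-b", ttl_ms=5000, leader_time_ms=7_000),
]

# Two replicas apply the same log, possibly at very different wall-clock moments.
replica_1, replica_2 = LeaseTable(), LeaseTable()
results_1 = [replica_1.apply(cmd) for cmd in log]
results_2 = [replica_2.apply(cmd) for cmd in log]

print(results_1)          # [True, False, True]
print(replica_1.leases)   # {'db-migrate': ('worker-b', 12000)}
print(replica_1.leases == replica_2.leases)  # True
```

Had `apply` consulted each replica's own clock, the second command could be rejected on one replica and granted on another, and the replicas would diverge.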

Wrapping Apache Ratis 3.x
Production-proven in Apache Ozone & IoTDB.

Fig_003 [ Raft leader election ]

§ 05 — Table of contents

Thirteen modules.

Ten shipped (M1–M10, real code + a 99-test agent suite); three in build to close the loop (M11–M13).

Data plane — deterministic

M1 · Raft Core · done
M2 · Locks + Leases · done
M3 · KV Store + Watches · done
M4 · Client SDKs (Java · Python) · done
M5 · Telemetry Pipeline · done

Control plane — agentic

M6 · Anomaly Classifier · done
M7 · Config Proposer · done
M8 · Postmortem Drafter · done
M9 · Chaos Harness · done
M10 · Operator Dashboard · done

The safe closed loop — in build

M11 · Counterfactual Verification · build
M12 · Safety Envelope · build
M13 · RAG Root-Cause · build

§ 06 — Quickstart

Four steps to a cluster.

Apple Silicon · Linux · Docker Desktop. Java 25, Python 3.10+. Build, verify, bring up a live 5-node cluster, then run the agent pipeline against it.

# 1 · clone
$ git clone https://github.com/abhishek-aditya/aegis && cd aegis

# 2 · build + test the Java reactor (Ratis · locks · KV · telemetry)
$ mvn -B verify

# 3 · bring up the 5-node cluster + observability stack
$ docker compose up --build
  → grafana   http://localhost:3000
  → dashboard http://localhost:4400

# 4 · run the control-plane agent pipeline (dry-run, no GitHub call)
$ cd agents && pip install -e ".[dev]"
$ aegis-classifier --once | aegis-proposer --dry-run
  ✓ 99 tests pass · 7/7 fixture traces classify correctly

next → make chaos-slow · make chaos-cascade · tail -f chaos/events.jsonl

§ 07 — Evaluation

Results, honestly.

The contribution is the architectural invariant — and the open-source artifact that makes it concrete. Every config the agent proposes is logged with rationale, including the bad ones. Negative results are documented, not hidden.

Counterfactual verification, safety red-team, and root-cause accuracy benchmarks — coming soon.

Evaluation surface — forthcoming

[01] Anomaly classification accuracy on canned + chaos-injected traces.
[02] Counterfactual verification — does the sandbox replay prove the proposed fix helps?
[03] Safety red-team — target: 0 unsafe configs reach the PR queue; the validator blocks 100%.
[04] Root-cause accuracy — LLM-with-retrieval top-1/top-k vs the deterministic baseline.
[05] Postmortem quality vs an alert-only baseline, scored by SRE raters.
[06] ConsensusOps-Bench — a reusable benchmark, a free byproduct of the M13 corpus.
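For the root-cause item, top-1 and top-k accuracy can be computed as below. This is a generic sketch of the metric, not the project's evaluation harness; the function name and the tiny sample data are made up for illustration, with labels borrowed from the failure modes named earlier on this page.

```python
from typing import List, Sequence


def top_k_accuracy(ranked_predictions: Sequence[Sequence[str]],
                   truths: Sequence[str],
                   k: int) -> float:
    """Fraction of incidents whose true root cause is among the first k ranked guesses."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not truths or len(ranked_predictions) != len(truths):
        raise ValueError("need one non-empty ranked list per ground-truth label")
    hits = sum(1 for ranked, truth in zip(ranked_predictions, truths)
               if truth in ranked[:k])
    return hits / len(truths)


# Hypothetical ranked root-cause guesses for four incidents.
predictions: List[List[str]] = [
    ["slow_follower", "split_brain"],
    ["split_brain", "lease_starvation"],
    ["lease_starvation", "slow_follower"],
    ["slow_follower", "lease_starvation"],
]
truths = ["slow_follower", "lease_starvation", "slow_follower", "split_brain"]

print(top_k_accuracy(predictions, truths, k=1))  # 0.25
print(top_k_accuracy(predictions, truths, k=2))  # 0.75
```

Running the same function over the deterministic baseline's ranked output gives the comparison the item describes.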

Read the paper — arXiv preprint coming soon
DESIGN.md — ADRs ↗

§ 08 — Frequently doubted

Honest answers.

Is this really an LLM project if the LLM doesn't decide consensus?

Yes. The reasoning earns its keep where it helps: classifying anomalies from noisy multi-signal telemetry, drafting tuning PRs with rationale and rollback, and writing readable postmortems from a tool-bounded view. The agent simply never gets to mutate consensus. That separation is the contribution.

Why not write your own Raft?

Apache Ratis is battle-tested in Apache Ozone and IoTDB. The novelty is the agentic ops layer and the control-plane invariant, not the consensus algorithm. Reinventing Raft is a different project.

What if the agent proposes a bad config?

A human closes the PR and the cluster is unchanged. The safety envelope (M12, in build) is designed to reject unsafe patches before a PR is even opened. The agent's only mutation pathway is the review queue; bad proposals become logged evidence for the paper.
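A static check of this kind might look like the sketch below. The source does not list the envelope's actual rules, so everything here is an assumption chosen for illustration: the allow-list of tunable keys, the 3x margin between election timeout and heartbeat interval, and the function name `envelope_violations`.

```python
from typing import Any, Dict, List

# Hypothetical allow-list: the only keys the agent may propose changes to.
ALLOWED_KEYS = {"heartbeat_interval_ms", "election_timeout_ms"}
# Hypothetical margin between election timeout and heartbeat interval.
MIN_TIMEOUT_TO_HEARTBEAT_RATIO = 3


def envelope_violations(current: Dict[str, Any], patch: Dict[str, Any]) -> List[str]:
    """Return every rule the patched config would break; empty means it may be proposed."""
    violations = [f"key not tunable by the agent: {key}"
                  for key in sorted(patch) if key not in ALLOWED_KEYS]
    merged = {**current, **patch}
    heartbeat = merged["heartbeat_interval_ms"]
    timeout = merged["election_timeout_ms"]
    if heartbeat <= 0:
        violations.append("heartbeat_interval_ms must be positive")
    if timeout < MIN_TIMEOUT_TO_HEARTBEAT_RATIO * heartbeat:
        violations.append(
            f"election_timeout_ms must be at least "
            f"{MIN_TIMEOUT_TO_HEARTBEAT_RATIO}x heartbeat_interval_ms"
        )
    return violations


current = {"heartbeat_interval_ms": 100, "election_timeout_ms": 1000}

print(envelope_violations(current, {"election_timeout_ms": 2000}))  # []
print(envelope_violations(current, {"election_timeout_ms": 150, "fsync_enabled": False}))
```

The check is a pure function of the current config and the patch, so it can run before any PR is created and its verdict is reproducible in review.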

Why heuristic classification, not LLM classification?

Determinism. The replay test that anchors M6 needs the same answer every run. The LLM genuinely earns its keep in diagnosis and postmortem narration (M13, M8) — comparison, lesson-extraction, ranked root-cause — and that path is opt-in.
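The property the replay test relies on can be sketched with a small rule-based classifier. The thresholds, metric names, and labels below are hypothetical and are not the M6 rules; the point is only that fixed rules evaluated in a fixed order return the same label for the same telemetry window on every run.

```python
from typing import Dict, List

# Hypothetical thresholds; the actual M6 heuristics are not shown in this document.
FOLLOWER_LAG_LIMIT_MS = 500.0
LEASE_WAIT_P99_LIMIT_MS = 2000.0


def classify(window: Dict[str, float]) -> str:
    """Label one telemetry window using fixed rules in a fixed order."""
    if window["max_follower_lag_ms"] > FOLLOWER_LAG_LIMIT_MS:
        return "slow_follower"
    if window["lease_wait_p99_ms"] > LEASE_WAIT_P99_LIMIT_MS:
        return "lease_starvation"
    return "healthy"


# Hypothetical fixture trace: one dict per telemetry window.
fixture_trace: List[Dict[str, float]] = [
    {"max_follower_lag_ms": 40.0, "lease_wait_p99_ms": 120.0},
    {"max_follower_lag_ms": 900.0, "lease_wait_p99_ms": 150.0},
    {"max_follower_lag_ms": 60.0, "lease_wait_p99_ms": 3500.0},
]

first_run = [classify(window) for window in fixture_trace]
second_run = [classify(window) for window in fixture_trace]

print(first_run)                 # ['healthy', 'slow_follower', 'lease_starvation']
print(first_run == second_run)   # True
```

A sampled LLM call in place of `classify` could return different labels across runs, which would make a replay assertion like the final comparison flaky.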