AEGIS is a Chubby-style coordination service built on Apache Ratis. An LLM agent observes, classifies anomalies, drafts tuning PRs, and writes the postmortems — but never sits in the request path.
Every Kubernetes cluster, every Cassandra ring, every distributed lock relies on a coordination service — Chubby, ZooKeeper, etcd, Consul. They are battle-tested and they are operationally brutal: split-brain incidents, lease starvation, slow-follower cascades, and consensus-parameter tuning consume disproportionate on-call effort. Postmortems lag incidents.
Recent proposals to inject LLM reasoning into the consensus data plane have been rightly rejected by practitioners — consensus is too sensitive to non-determinism to put a language model in the commit path.
The right place for an LLM agent is the control plane: observing, recommending, and documenting. Never deciding.
AEGIS is the open-source artifact making that argument concrete. The data plane is Apache Ratis with Chubby-style locks + leases + K-V watches. The control plane is a Python sidecar — LangGraph for the classifier and proposer, Claude Agent SDK for the postmortem drafter — that produces PRs and Issues only. A human reviewer holds every merge bit.
The data plane is deterministic Raft. The control plane is an agentic Python sidecar. They are physically separated. The agent's only path to mutate the cluster is opening a pull request against a versioned config repository — gated by a human.
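One way to picture this invariant is as an allow-list over the agent's side effects. The sketch below is illustrative only: `AgentAction`, `ALLOWED_ACTIONS`, `ForbiddenAction`, and `dispatch` are hypothetical names rather than AEGIS APIs, and it assumes the sidecar funnels every side effect through a single function.

```python
from dataclasses import dataclass

# Hypothetical allow-list: the only side effects the agent may request.
ALLOWED_ACTIONS = frozenset({"open_pull_request", "open_issue"})


class ForbiddenAction(Exception):
    """Raised when the agent asks for anything outside the review queue."""


@dataclass(frozen=True)
class AgentAction:
    kind: str      # e.g. "open_pull_request"
    payload: dict  # PR or Issue body; never a direct cluster command


def dispatch(action: AgentAction) -> str:
    """Route an action to the review queue or reject it outright."""
    if action.kind not in ALLOWED_ACTIONS:
        raise ForbiddenAction(f"{action.kind!r} would bypass human review")
    # A real sidecar would call the code host's API here; this sketch only
    # reports what would be queued.
    return f"queued:{action.kind}"
```

Under this framing, anything that would touch the cluster directly fails before it leaves the sidecar, and the only accepted outputs still wait for a human reviewer.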
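Observe: Prometheus scrapes every node, and every node appends events to Redis Streams with XADD.
Classify: the LangGraph classifier emits Anomaly{kind, severity, evidence}.
Propose: the proposer opens a PR against the config repo containing a YAML patch and its rollback.
Document: on critical anomalies, the drafter files a 7-section markdown Issue.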
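To make the classification step concrete, here is a minimal sketch of an `Anomaly` record and a threshold rule that produces one. Only the field names `kind`, `severity`, and `evidence` come from the text; the function `classify_follower_lag`, the two thresholds, and the `warning` severity level are assumptions for illustration, and the unit of `max_lag` is not stated in the source.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical thresholds; the source does not state the real ones.
WARN_LAG = 200
CRITICAL_LAG = 500


@dataclass(frozen=True)
class Anomaly:
    kind: str
    severity: str
    evidence: dict = field(default_factory=dict)


def classify_follower_lag(lag_by_follower: dict) -> Optional[Anomaly]:
    """Return a slow_follower anomaly for the worst follower, or None."""
    if not lag_by_follower:
        return None
    # Pick the follower with the largest observed lag.
    follower, max_lag = max(lag_by_follower.items(), key=lambda kv: kv[1])
    if max_lag < WARN_LAG:
        return None
    severity = "critical" if max_lag >= CRITICAL_LAG else "warning"
    return Anomaly(
        kind="slow_follower",
        severity=severity,
        evidence={"follower": follower, "max_lag": max_lag},
    )
```

For example, `classify_follower_lag({"node2": 12, "node3": 650})` yields a critical `slow_follower` anomaly whose evidence names `node3`, while a map of small lags yields `None`.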
A scripted multi-failure cascade. Watch the classifier flag the anomaly in real time, the proposer open a tuning PR, and the drafter file an Issue with the full timeline.
$ make chaos-cascade
[chaos] starting critical cascade
[chaos] event {"scenario": "slow-follower", "follower": "node3", "delay_ms": 600}
[chaos] event {"scenario": "partition-leader", "leader": "node1", "duration_seconds": 30}
[chaos] event {"scenario": "kill-and-restore", "target": "node4"}

$ aegis-classifier --once | jq .
{
  "kind": "slow_follower",
  "severity": "critical",
  "evidence": { "follower": "node3", "max_lag": 650 },
  "suggested_action_class": "raise_heartbeat_interval"
}

$ aegis-classifier --once | aegis-proposer --dry-run
{
  "status": "dry_run",
  "branch": "aegis-agent/slow-follower/4de12fec67",
  "entries": [
    { "yaml_path": "raft.heartbeat-interval-ms", "before": 50, "after": 75 },
    { "yaml_path": "raft.rpc-request-timeout-ms", "before": 3000, "after": 3750 }
  ]
}

$ aegis-classifier --once | aegis-postmortem --dry-run | jq '.title'
"[AEGIS postmortem] 2026-05-26 — Slow Follower (critical)"
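Because each dry-run entry carries both a `before` and an `after` value, a rollback can in principle be derived by swapping the two. The sketch below shows that idea on a plain nested dict standing in for the parsed YAML. The helpers `get_path`, `set_path`, `apply_entries`, and `rollback_entries` are hypothetical, and treating `yaml_path` as a dot-separated path into the config is an assumption; the real proposer may work differently.

```python
import copy


def get_path(config: dict, dotted_path: str):
    """Read a value addressed as 'section.key' from a nested dict."""
    node = config
    for key in dotted_path.split("."):
        node = node[key]
    return node


def set_path(config: dict, dotted_path: str, value) -> None:
    """Write a value addressed as 'section.key' into a nested dict."""
    *parents, leaf = dotted_path.split(".")
    node = config
    for key in parents:
        node = node[key]
    node[leaf] = value


def apply_entries(config: dict, entries: list) -> dict:
    """Return a patched copy; refuse if a 'before' value does not match."""
    patched = copy.deepcopy(config)
    for entry in entries:
        current = get_path(patched, entry["yaml_path"])
        if current != entry["before"]:
            raise ValueError(f"{entry['yaml_path']}: expected {entry['before']}, found {current}")
        set_path(patched, entry["yaml_path"], entry["after"])
    return patched


def rollback_entries(entries: list) -> list:
    """Invert a patch by swapping before and after, in reverse order."""
    return [
        {"yaml_path": e["yaml_path"], "before": e["after"], "after": e["before"]}
        for e in reversed(entries)
    ]


# Values taken from the dry-run output; the config layout is assumed.
config = {"raft": {"heartbeat-interval-ms": 50, "rpc-request-timeout-ms": 3000}}
entries = [
    {"yaml_path": "raft.heartbeat-interval-ms", "before": 50, "after": 75},
    {"yaml_path": "raft.rpc-request-timeout-ms", "before": 3000, "after": 3750},
]
patched = apply_entries(config, entries)
restored = apply_entries(patched, rollback_entries(entries))
```

Checking the `before` value on apply means a stale patch fails loudly instead of overwriting a setting that someone changed in the meantime, and the same check guards the rollback.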
Apple Silicon · Linux · Docker Desktop. Java 25, Python 3.10+.
# 1. Clone
git clone https://github.com/abhishek-aditya/AEGIS && cd AEGIS

# 2. Build + test the Java reactor (Ratis + locks + KV + telemetry)
mvn -B verify

# 3. Install + test the Python agents
cd agents && pip install -e ".[dev]" && pytest -q

# 4. Bring up the 5-node cluster + observability stack
make locks-up && make observability-up

→ Grafana http://localhost:3000
→ Dashboard http://localhost:4400 (cd dashboard && npm install && npm start)
The contribution is the architectural invariant — and the open-source artifact that makes it concrete. We evaluate anomaly classification accuracy, PR quality (SRE rated), postmortem quality vs an alert-only baseline, and time-to-mitigation in scripted chaos scenarios. Negative results are documented honestly — every bad config the agent proposed is logged with rationale.
Yes. The agentic work is where reasoning helps: classifying anomalies from noisy multi-signal telemetry, drafting tuning PRs with clear rationale and rollback, and writing readable incident postmortems from a tool-bounded view of the data. The agent simply does not get to mutate consensus. That separation is the contribution.
Apache Ratis is battle-tested (Apache Ozone, IoTDB). The novelty here is the agentic ops layer + the control-plane invariant, not the consensus algorithm. Reinventing Raft is a different project.
A human closes the PR — the cluster is unchanged. The agent's only mutation pathway is the review queue. Bad PRs are logged as a deliberate honesty signal in the paper.
Determinism. The replay test that anchors M6 needs the same answer every run. The postmortem drafter (M8) is where the LLM genuinely earns its keep (narration, comparison, lesson-extraction), and that path is opt-in via --use-claude.
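A replay check of this kind can be sketched as follows, assuming the classifier is a pure function of recorded telemetry. The `classify` function here is a stand-in with a made-up threshold, not the AEGIS classifier, and `fingerprint` and `replay_is_deterministic` are hypothetical helper names.

```python
import hashlib
import json


def classify(events: list) -> dict:
    """Stand-in rule-based classifier: a pure function of its input."""
    max_lag = max((e.get("lag", 0) for e in events), default=0)
    if max_lag >= 500:  # hypothetical threshold
        return {"kind": "slow_follower", "severity": "critical"}
    return {"kind": "none", "severity": "info"}


def fingerprint(result: dict) -> str:
    """Hash a canonical JSON encoding so runs can be compared byte for byte."""
    canonical = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def replay_is_deterministic(events: list, runs: int = 5) -> bool:
    """Replay the same recorded events several times and compare fingerprints."""
    prints = {fingerprint(classify(events)) for _ in range(runs)}
    return len(prints) == 1
```

A sampled LLM call in place of `classify` could return different wording or labels across runs and fail this check, which is consistent with keeping the LLM on the opt-in drafting path.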