10/10 modules · CI green · site live

Distributed coordination,
agentic operations.

AEGIS is a Chubby-style coordination service built on Apache Ratis. An LLM agent observes, classifies anomalies, drafts tuning PRs, and writes the postmortems — but never sits in the request path.

Data plane: Apache Ratis · Java 25
Control plane: Python · LangGraph · Claude
Mutation path: GitHub PRs + Issues
[Animated diagram: 5-node Raft cluster at term 42, with node3 as leader and node1, node2, node4, node5 as followers; shows append-entries acks, commit index, vote requests, and granted votes. The same view ships in the M10 dashboard.]
01 · the problem

Coordination services are expensive to run.

Every Kubernetes cluster, every HBase deployment, and every distributed lock relies on a coordination service such as Chubby, ZooKeeper, etcd, or Consul. These systems are battle-tested, and they are operationally brutal: split-brain incidents, lease starvation, slow-follower cascades, and consensus-parameter tuning consume disproportionate on-call effort. Postmortems lag incidents.

Recent proposals to inject LLM reasoning into the consensus data plane have been rightly rejected by practitioners — consensus is too sensitive to non-determinism to put a language model in the commit path.

The right place for an LLM agent is the control plane: observing, recommending, and documenting. Never deciding.

AEGIS is the open-source artifact making that argument concrete. The data plane is Apache Ratis with Chubby-style locks + leases + K-V watches. The control plane is a Python sidecar — LangGraph for the classifier and proposer, Claude Agent SDK for the postmortem drafter — that produces PRs and Issues only. A human reviewer holds every merge bit.

02 · how it works

Two planes. One invariant.

The data plane is deterministic Raft. The control plane is an agentic Python sidecar. They are physically separated. The agent's only path to mutate the cluster is opening a pull request against a versioned config repository — gated by a human.
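One way to picture the invariant in code is a reader interface with no write methods, paired with a proposal type that is the agent's only output. This is a minimal sketch, not the project's actual API: `ClusterReader`, `SnapshotReader`, `PullRequestDraft`, and `propose_heartbeat_change` are hypothetical names, and the snapshot dictionary stands in for real telemetry.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping, Protocol, Tuple


class ClusterReader(Protocol):
    """The only cluster surface the agent sees: reads, no writes (hypothetical)."""

    def get(self, key: str) -> str: ...
    def leader(self) -> str: ...


@dataclass(frozen=True)
class SnapshotReader:
    """Read-only view over an immutable snapshot of cluster state."""

    kv: Mapping[str, str]
    leader_id: str

    def get(self, key: str) -> str:
        return self.kv[key]

    def leader(self) -> str:
        return self.leader_id


@dataclass(frozen=True)
class PullRequestDraft:
    """A proposal for a human to merge or close; it changes nothing by itself."""

    title: str
    patch: Mapping[str, Tuple[int, int]]  # yaml_path -> (before, after)


def propose_heartbeat_change(reader: ClusterReader, new_ms: int) -> PullRequestDraft:
    current = int(reader.get("raft.heartbeat-interval-ms"))
    return PullRequestDraft(
        title=f"Raise raft.heartbeat-interval-ms from {current} to {new_ms}",
        patch={"raft.heartbeat-interval-ms": (current, new_ms)},
    )


# MappingProxyType rejects item assignment, so the view cannot be mutated.
snapshot = MappingProxyType({"raft.heartbeat-interval-ms": "50"})
reader = SnapshotReader(kv=snapshot, leader_id="node3")
draft = propose_heartbeat_change(reader, new_ms=75)
print(draft.title)
print(reader.get("raft.heartbeat-interval-ms"))  # still "50": nothing was applied
```

The design point is that the agent code never holds an object capable of writing; the draft only takes effect if a reviewer merges it.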

data plane
llm-free · deterministic

Apache Ratis · Java 25 · gRPC

  • M1 · 5-node Raft cluster; leader-stamped timestamps for replica determinism.
  • M2 · Chubby-style sessions, locks, leases with linearizable contention.
  • M3 · K-V configuration store with per-key version history, range, watches.
  • M4 · Idiomatic Java and Python client SDKs with auto-renewing leases.
  • M5 · Prometheus + Redis Streams telemetry surface.
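To illustrate why M1 stamps timestamps at the leader: if each replica read its own clock while applying a lease operation, two replicas could disagree about whether a lease had expired. The sketch below is written in Python for brevity (the data plane itself is Java on Ratis) and uses hypothetical names, `LogEntry` and `LeaseTable`, with a toy lease model that is an assumption rather than the project's state machine.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LogEntry:
    """A replicated command; `leader_ts_ms` is set once by the leader before replication."""

    op: str  # "acquire" or "release" (illustrative ops only)
    lock: str
    holder: str
    ttl_ms: int
    leader_ts_ms: int


@dataclass
class LeaseTable:
    """Toy state machine: lock -> (holder, expiry_ms). It never reads the local clock."""

    leases: dict = field(default_factory=dict)

    def apply(self, entry: LogEntry) -> bool:
        current = self.leases.get(entry.lock)
        # Expiry is judged against the leader's stamp, so every replica agrees.
        if current is not None and current[1] <= entry.leader_ts_ms:
            del self.leases[entry.lock]
            current = None
        if entry.op == "acquire":
            if current is None or current[0] == entry.holder:
                expiry = entry.leader_ts_ms + entry.ttl_ms
                self.leases[entry.lock] = (entry.holder, expiry)
                return True
            return False
        if entry.op == "release":
            if current is not None and current[0] == entry.holder:
                del self.leases[entry.lock]
                return True
            return False
        raise ValueError(f"unknown op: {entry.op}")


log = [
    LogEntry("acquire", "db-migrate", "client-a", ttl_ms=1000, leader_ts_ms=10_000),
    LogEntry("acquire", "db-migrate", "client-b", ttl_ms=1000, leader_ts_ms=10_500),
    LogEntry("acquire", "db-migrate", "client-b", ttl_ms=1000, leader_ts_ms=11_200),
]

# Two replicas apply the same log and must reach the same answers and state.
replica_1, replica_2 = LeaseTable(), LeaseTable()
results_1 = [replica_1.apply(e) for e in log]
results_2 = [replica_2.apply(e) for e in log]
print(results_1, results_1 == results_2, replica_1.leases == replica_2.leases)
```

The second acquire is rejected because client-a's lease is still live at the leader's stamp; the third succeeds because the stamp has passed the expiry. Both outcomes depend only on the log contents, never on when a replica happens to apply the entry.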
control plane
advisory · pr-only

Python · LangGraph · Claude SDK

  • M6 · Classifier: heuristic rules over telemetry, deterministic, 7/7 fixture accuracy.
  • M7 · Proposer: drafts YAML patches, opens PRs (idempotent on evidence hash).
  • M8 · Postmortem drafter: tool-using agent with a bounded toolbelt; files Issues.
  • M9 · Chaos harness: 5 scenarios + critical cascade; canonical events.jsonl.
  • M10 · Operator dashboard: live topology, telemetry, anomalies, chaos overlay.
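A heuristic classifier of the kind M6 describes can be a pure function from telemetry to a record with the `kind`, `severity`, and `evidence` fields this page uses for anomalies. The following is a minimal sketch under stated assumptions: the thresholds `WARN_LAG` and `CRITICAL_LAG`, the function name, and the input shape (a mapping from follower name to lag) are all hypothetical, and the lag unit is left unspecified as in the demo output.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# Hypothetical thresholds, chosen only for illustration.
WARN_LAG = 200
CRITICAL_LAG = 500


@dataclass(frozen=True)
class Anomaly:
    kind: str
    severity: str
    evidence: dict


def classify_slow_follower(follower_lag: dict) -> Optional[Anomaly]:
    """Pure function of its input: no clock, no randomness, no model call."""
    if not follower_lag:
        return None
    # Compare on (lag, name) so ties resolve the same way on every run.
    follower, max_lag = max(follower_lag.items(), key=lambda kv: (kv[1], kv[0]))
    if max_lag < WARN_LAG:
        return None
    severity = "critical" if max_lag >= CRITICAL_LAG else "warning"
    evidence = {"follower": follower, "max_lag": max_lag}
    return Anomaly("slow_follower", severity, evidence)


anomaly = classify_slow_follower({"node2": 12, "node3": 650, "node4": 9, "node5": 15})
print(asdict(anomaly))
```

Because the rule depends only on its argument, feeding it the same recorded telemetry always yields the same record, which is what makes fixture-based accuracy checks meaningful.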
01 · ingest
Telemetry

Prometheus scrapes + Redis Streams XADD from every node.

02 · reason
Classify

LangGraph emits Anomaly{kind, severity, evidence}.

03a · tune
Config PR

YAML patch + rollback opens against the config repo.

03b · narrate
Postmortem

On critical anomalies — 7-section markdown Issue.
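The tune step pairs each YAML patch with a rollback, and M7 describes the proposer as idempotent on an evidence hash. One plausible way to get both properties is sketched below. The hashing scheme (SHA-256 over canonical JSON, truncated to 10 hex characters), the `skipped_duplicate` status, and every function name are assumptions made for illustration; only the `aegis-agent/<kind>/<hash>` branch shape and the entry fields come from the demo output on this page, and the sketch is not expected to reproduce the hash shown there.

```python
import hashlib
import json


def evidence_hash(anomaly: dict, length: int = 10) -> str:
    """Hash a canonical JSON encoding so key order cannot change the digest."""
    canonical = json.dumps(anomaly, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


def branch_name(anomaly: dict) -> str:
    kind = anomaly["kind"].replace("_", "-")
    return f"aegis-agent/{kind}/{evidence_hash(anomaly)}"


def rollback(entries: list) -> list:
    """Invert a patch by swapping `before` and `after` on each entry."""
    return [
        {"yaml_path": e["yaml_path"], "before": e["after"], "after": e["before"]}
        for e in entries
    ]


def propose(anomaly: dict, entries: list, open_branches: set) -> dict:
    """Dry-run a proposal; skip it when the same evidence already has a branch."""
    branch = branch_name(anomaly)
    if branch in open_branches:
        return {"status": "skipped_duplicate", "branch": branch}  # hypothetical status
    open_branches.add(branch)
    return {
        "status": "dry_run",
        "branch": branch,
        "entries": entries,
        "rollback": rollback(entries),
    }


anomaly = {
    "kind": "slow_follower",
    "severity": "critical",
    "evidence": {"follower": "node3", "max_lag": 650},
}
entries = [
    {"yaml_path": "raft.heartbeat-interval-ms", "before": 50, "after": 75},
    {"yaml_path": "raft.rpc-request-timeout-ms", "before": 3000, "after": 3750},
]
open_branches = set()
first = propose(anomaly, entries, open_branches)
second = propose(anomaly, entries, open_branches)  # same evidence, so no second PR
print(first["status"], second["status"])
```

Deriving the branch name from the evidence means a classifier that fires repeatedly on the same anomaly produces one proposal rather than a queue of duplicates for the reviewer.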

The invariant. The agent has read-only access to the data plane. It cannot acquire locks, write K-V, or restart nodes. The single mutation pathway is a GitHub PR against a versioned config — every change is reviewed by a human before reaching the cluster.
03 · demo

Chaos in, postmortems out.

A scripted multi-failure cascade. Watch the classifier flag the anomaly in real time, the proposer open a tuning PR, and the drafter file an Issue with the full timeline.

$ make locks-up
$ make chaos-cascade
$ make agents-test    # 99 passed in 0.7s
chaos/critical-cascade.sh
$ make chaos-cascade
[chaos] starting critical cascade
[chaos] event {"scenario": "slow-follower", "follower": "node3", "delay_ms": 600}
[chaos] event {"scenario": "partition-leader", "leader": "node1", "duration_seconds": 30}
[chaos] event {"scenario": "kill-and-restore", "target": "node4"}

$ aegis-classifier --once | jq .
{
  "kind":        "slow_follower",
  "severity":    "critical",
  "evidence":    { "follower": "node3", "max_lag": 650 },
  "suggested_action_class": "raise_heartbeat_interval"
}

$ aegis-classifier --once | aegis-proposer --dry-run
{
  "status": "dry_run",
  "branch": "aegis-agent/slow-follower/4de12fec67",
  "entries": [
    { "yaml_path": "raft.heartbeat-interval-ms",  "before":   50, "after":   75 },
    { "yaml_path": "raft.rpc-request-timeout-ms", "before": 3000, "after": 3750 }
  ]
}

$ aegis-classifier --once | aegis-postmortem --dry-run | jq '.title'
"[AEGIS postmortem] 2026-05-26 — Slow Follower (critical)"
04 · quickstart

Four steps to a running cluster.

Apple Silicon · Linux · Docker Desktop. Java 25, Python 3.10+.

# 1. Clone
git clone https://github.com/abhishek-aditya/AEGIS && cd AEGIS

# 2. Build + test the Java reactor (Ratis + locks + KV + telemetry)
mvn -B verify

# 3. Install + test the Python agents
cd agents && pip install -e ".[dev]" && pytest -q

# 4. Bring up the 5-node cluster + observability stack
make locks-up && make observability-up
 Grafana   http://localhost:3000
 Dashboard http://localhost:4400   (cd dashboard && npm install && npm start)
next: make chaos-slow · make chaos-cascade · tail -f chaos/events.jsonl
05 · paper

AEGIS: Agentic operations for distributed coordination services without compromising data-plane determinism.

The contribution is the architectural invariant, together with the open-source artifact that makes it concrete. We evaluate anomaly classification accuracy, PR quality (rated by SREs), postmortem quality against an alert-only baseline, and time-to-mitigation in scripted chaos scenarios. Negative results are documented: every bad config the agent proposed is logged with its rationale.

evaluation surface
  • Anomaly classification accuracy on canned + chaos-injected traces.
  • PR quality, 5-point Likert eval by SREs.
  • Postmortem quality vs alert-only baseline; 10–20 SRE raters.
  • Time-to-mitigation across scripted chaos scenarios.
  • Documented negative results — when the agent was wrong, and why.
06 · honest answers

Frequently doubted questions.

Is this really an LLM project if the LLM doesn't decide consensus?

Yes. The agentic work is where reasoning helps: classifying anomalies from noisy multi-signal telemetry, drafting tuning PRs with clear rationale and rollback, and writing readable incident postmortems from a tool-bounded view of the data. The agent simply does not get to mutate consensus. That separation is the contribution.

Why not just write your own Raft?

Apache Ratis is battle-tested (Apache Ozone, IoTDB). The novelty here is the agentic ops layer + the control-plane invariant, not the consensus algorithm. Reinventing Raft is a different project.

What if the agent suggests a bad config?

A human closes the PR — the cluster is unchanged. The agent's only mutation pathway is the review queue. Bad PRs are logged as a deliberate honesty signal in the paper.

Why heuristic classification, not LLM classification?

Determinism. The replay test that anchors M6 needs the same answer every run. The postmortem drafter (M8) is where the LLM genuinely earns its keep — narration, comparison, lesson-extraction — and that path is opt-in via --use-claude.
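A replay test of the kind this answer refers to can be sketched as: feed the same recorded trace through the classifier twice and require identical output. The sketch below is an illustration only. The `classify` rule and its 500 threshold are hypothetical, and the three chaos events from the demo stand in for a recorded telemetry trace, whose real format this page does not specify.

```python
import hashlib
import io
import json

# Stand-in trace: the three chaos events shown in the demo transcript.
TRACE_JSONL = """\
{"scenario": "slow-follower", "follower": "node3", "delay_ms": 600}
{"scenario": "partition-leader", "leader": "node1", "duration_seconds": 30}
{"scenario": "kill-and-restore", "target": "node4"}
"""


def classify(event: dict):
    """Hypothetical rule: flag slow-follower events at or above a fixed delay."""
    if event.get("scenario") == "slow-follower" and event.get("delay_ms", 0) >= 500:
        return {
            "kind": "slow_follower",
            "severity": "critical",
            "evidence": {"follower": event["follower"], "delay_ms": event["delay_ms"]},
        }
    return None


def replay(stream) -> list:
    """Classify every non-empty line of the trace, in order."""
    return [classify(json.loads(line)) for line in stream if line.strip()]


def fingerprint(results: list) -> str:
    """Digest of the full output, convenient for pinning in a regression test."""
    payload = json.dumps(results, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


first_run = replay(io.StringIO(TRACE_JSONL))
second_run = replay(io.StringIO(TRACE_JSONL))
assert first_run == second_run
assert fingerprint(first_run) == fingerprint(second_run)
```

A sampled LLM classifier would not reliably pass the equality check above, which is the practical reason for keeping model calls out of this path.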