10/10 modules · CI green · site live

Distributed coordination,
agentic operations.

AEGIS is a Chubby-style coordination service built on Apache Ratis. An LLM agent observes, classifies anomalies, drafts tuning PRs, and writes the postmortems — but never sits in the request path.

$ docker compose up · view source

Data plane

Apache Ratis · Java 25

Control plane

Python · LangGraph · Claude

Mutation path

GitHub PRs + Issues

[Figure: Raft cluster · 5 nodes · term 42 · node3 is leader. Legend: ↑ append-entries ↓ ack · commit index; ↑ vote requests ↓ vote granted. The same view ships in the M10 dashboard.]

01 · the problem

Coordination services are expensive to run.

Kubernetes keeps its cluster state in etcd, HBase depends on ZooKeeper, and most distributed locks sit on a coordination service such as Chubby, ZooKeeper, etcd, or Consul. These services are battle-tested, and they are operationally brutal: split-brain incidents, lease starvation, slow-follower cascades, and consensus-parameter tuning consume a disproportionate share of on-call effort. Postmortems lag incidents.

Recent proposals to inject LLM reasoning into the consensus data plane have been rightly rejected by practitioners — consensus is too sensitive to non-determinism to put a language model in the commit path.

The right place for an LLM agent is the control plane: observing, recommending, and documenting. Never deciding.

AEGIS is the open-source artifact making that argument concrete. The data plane is Apache Ratis with Chubby-style locks + leases + K-V watches. The control plane is a Python sidecar — LangGraph for the classifier and proposer, Claude Agent SDK for the postmortem drafter — that produces PRs and Issues only. A human reviewer holds every merge bit.

02 · how it works

Two planes. One invariant.

The data plane is deterministic Raft. The control plane is an agentic Python sidecar. They are physically separated. The agent's only path to mutate the cluster is opening a pull request against a versioned config repository — gated by a human.

data plane

llm-free · deterministic

Apache Ratis · Java 25 · gRPC

M1 · 5-node Raft cluster; leader-stamped timestamps for replica determinism.
M2 · Chubby-style sessions, locks, leases with linearizable contention.
M3 · K-V configuration store with per-key version history, range, watches.
M4 · Idiomatic Java and Python client SDKs with auto-renewing leases.
M5 · Prometheus + Redis Streams telemetry surface.

control plane

advisory · pr-only

Python · LangGraph · Claude SDK

M6 · Classifier: heuristic rules over telemetry, deterministic, 7/7 fixture accuracy.
M7 · Proposer: drafts YAML patches, opens PRs (idempotent on evidence hash).
M8 · Postmortem drafter: tool-using agent with a bounded toolbelt; files Issues.
M9 · Chaos harness: 5 scenarios + critical cascade; canonical events.jsonl.
M10 · Operator dashboard: live topology, telemetry, anomalies, chaos overlay.

01 · ingest

Telemetry

Prometheus scrapes + Redis Streams XADD from every node.

02 · reason

Classify

LangGraph emits Anomaly{kind, severity, evidence}.

03a · tune

Config PR

YAML patch + rollback opens against the config repo.

03b · narrate

Postmortem

On critical anomalies — 7-section markdown Issue.
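The classify step can be pictured as a pure function from telemetry to an Anomaly{kind, severity, evidence} record. The sketch below is illustrative only: the threshold constants, the function name, and the shape of the input (a map of follower name to lag) are assumptions, not the rules AEGIS ships.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds, chosen only for illustration.
LAG_WARNING = 100
LAG_CRITICAL = 500


@dataclass(frozen=True)
class Anomaly:
    kind: str
    severity: str
    evidence: dict


def classify_slow_follower(follower_lag: dict[str, int]) -> Optional[Anomaly]:
    """Flag the follower with the largest lag if it crosses a threshold.

    Pure function of its input: no clock, no randomness, no model call,
    so replaying the same telemetry yields the same answer.
    """
    if not follower_lag:
        return None
    # Compare on (lag, name) so ties break the same way on every run.
    follower, max_lag = max(follower_lag.items(), key=lambda kv: (kv[1], kv[0]))
    if max_lag < LAG_WARNING:
        return None
    severity = "critical" if max_lag >= LAG_CRITICAL else "warning"
    return Anomaly(
        kind="slow_follower",
        severity=severity,
        evidence={"follower": follower, "max_lag": max_lag},
    )
```

Because the rule reads nothing but its argument, a recorded telemetry trace can be replayed against it in a test without any cluster running.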

The invariant. The agent has read-only access to the data plane. It cannot acquire locks, write K-V, or restart nodes. The single mutation pathway is a GitHub PR against a versioned config — every change is reviewed by a human before reaching the cluster.

03 · demo

Chaos in, postmortems out.

A scripted multi-failure cascade. Watch the classifier flag the anomaly in real time, the proposer open a tuning PR, and the drafter file an Issue with the full timeline.

$ make locks-up

$ make chaos-cascade

$ make agents-test → 99 passed in 0.7s

chaos/critical-cascade.sh


$ make chaos-cascade
[chaos] starting critical cascade
[chaos] event {"scenario": "slow-follower", "follower": "node3", "delay_ms": 600}
[chaos] event {"scenario": "partition-leader", "leader": "node1", "duration_seconds": 30}
[chaos] event {"scenario": "kill-and-restore", "target": "node4"}

$ aegis-classifier --once | jq .
{
  "kind":        "slow_follower",
  "severity":    "critical",
  "evidence":    { "follower": "node3", "max_lag": 650 },
  "suggested_action_class": "raise_heartbeat_interval"
}

$ aegis-classifier --once | aegis-proposer --dry-run
{
  "status": "dry_run",
  "branch": "aegis-agent/slow-follower/4de12fec67",
  "entries": [
    { "yaml_path": "raft.heartbeat-interval-ms",  "before":   50, "after":   75 },
    { "yaml_path": "raft.rpc-request-timeout-ms", "before": 3000, "after": 3750 }
  ]
}
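A sketch of how a dry-run like the one above could be assembled, assuming the branch suffix is a truncated SHA-256 of the canonicalized anomaly and that the patch scales each value by a fixed factor. Both are assumptions: the real hashing scheme is not documented here, so this code will not reproduce the `4de12fec67` suffix, and the factors 1.5 and 1.25 are simply read off the before/after values shown. The `rollback` field is likewise a hypothetical illustration of the inverse patch.

```python
import hashlib
import json


def evidence_hash(anomaly: dict) -> str:
    """Short, stable digest of the anomaly's kind and evidence."""
    canonical = json.dumps(
        {"kind": anomaly["kind"], "evidence": anomaly["evidence"]},
        sort_keys=True,  # key order must not change the digest
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:10]


def branch_name(anomaly: dict) -> str:
    """Same evidence always maps to the same branch, so retries are idempotent."""
    slug = anomaly["kind"].replace("_", "-")
    return f"aegis-agent/{slug}/{evidence_hash(anomaly)}"


def propose(anomaly: dict, current: dict[str, int], factors: dict[str, float]) -> dict:
    """Build a dry-run proposal: scaled values plus the inverse patch."""
    entries = [
        {"yaml_path": path, "before": current[path], "after": round(current[path] * factor)}
        for path, factor in factors.items()
    ]
    # The rollback is the same patch with before and after swapped.
    rollback = [
        {"yaml_path": e["yaml_path"], "before": e["after"], "after": e["before"]}
        for e in entries
    ]
    return {
        "status": "dry_run",
        "branch": branch_name(anomaly),
        "entries": entries,
        "rollback": rollback,
    }
```

Deriving the branch name from the evidence means a second run over the same anomaly lands on an existing branch, which a proposer can detect and skip instead of opening a duplicate PR.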

$ aegis-classifier --once | aegis-postmortem --dry-run | jq '.title'
"[AEGIS postmortem] 2026-05-26 — Slow Follower (critical)"

04 · quickstart

Four steps to a running cluster.

Apple Silicon · Linux · Docker Desktop. Java 25, Python 3.10+.


# 1. Clone
git clone https://github.com/abhishek-aditya/AEGIS && cd AEGIS

# 2. Build + test the Java reactor (Ratis + locks + KV + telemetry)
mvn -B verify

# 3. Install + test the Python agents
cd agents && pip install -e ".[dev]" && pytest -q

# 4. Bring up the 5-node cluster + observability stack
make locks-up && make observability-up
#   Grafana   → http://localhost:3000
#   Dashboard → http://localhost:4400   (cd dashboard && npm install && npm start)

next: make chaos-slow · make chaos-cascade · tail -f chaos/events.jsonl

05 · paper

AEGIS: Agentic operations for distributed coordination services without compromising data-plane determinism.

The contribution is the architectural invariant and the open-source artifact that makes it concrete. We evaluate anomaly classification accuracy, PR quality as rated by SREs, postmortem quality against an alert-only baseline, and time-to-mitigation in scripted chaos scenarios. Negative results are documented: every bad config the agent proposed is logged with its rationale.

arXiv · cs.DC (pending) · DESIGN.md · ADRs

evaluation surface

→ Anomaly classification accuracy on canned + chaos-injected traces.
→ PR quality, 5-point Likert eval by SREs.
→ Postmortem quality vs alert-only baseline; 10–20 SRE raters.
→ Time-to-mitigation across scripted chaos scenarios.
→ Documented negative results: when the agent was wrong, and why.

06 · honest answers

Frequently doubted questions.

Is this really an LLM project if the LLM doesn't decide consensus?

Yes. The agentic work is where reasoning helps: classifying anomalies from noisy multi-signal telemetry, drafting tuning PRs with clear rationale and rollback, and writing readable incident postmortems from a tool-bounded view of the data. The agent simply does not get to mutate consensus. That separation is the contribution.

Why not just write your own Raft?

Apache Ratis is battle-tested: it is the Raft library behind Apache Ozone and Apache IoTDB. The novelty here is the agentic ops layer + the control-plane invariant, not the consensus algorithm. Reinventing Raft is a different project.

What if the agent suggests a bad config?

A human closes the PR — the cluster is unchanged. The agent's only mutation pathway is the review queue. Bad PRs are logged as a deliberate honesty signal in the paper.

Why heuristic classification, not LLM classification?

Determinism. The replay test that anchors M6 needs the same answer every run. The postmortem drafter (M8) is where the LLM genuinely earns its keep — narration, comparison, lesson-extraction — and that path is opt-in via --use-claude.
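A replay check of this kind can be sketched as follows: run a classifier over the same recorded trace several times and require byte-identical serialized output. The helper names and the trace format (a list of dict samples) are assumptions made for the example; the actual M6 replay test may be structured differently.

```python
import json
from typing import Callable, Optional


def replay(classify: Callable[[dict], Optional[dict]], trace: list[dict]) -> str:
    """Run the classifier over a recorded trace and serialize results canonically."""
    results = [classify(sample) for sample in trace]
    return json.dumps(results, sort_keys=True)


def is_deterministic(
    classify: Callable[[dict], Optional[dict]], trace: list[dict], runs: int = 3
) -> bool:
    """True when every replay of the trace produces the same serialized output."""
    outputs = {replay(classify, trace) for _ in range(runs)}
    return len(outputs) == 1
```

A heuristic rule that depends only on the sample passes this check by construction. A classifier that consults a clock, a counter, or a sampled model response would not, which is the property that makes an LLM awkward to anchor a replay test on.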
