The Agentic Maturity Model for Government Benefits Systems
A five-level model for diagnosing where a benefits-decisioning system actually sits today — and what is required to move it one rung up without breaking due process.
Most maturity models score what a system aspires to be. This one scores what it is. A system's maturity tier is set by its weakest link — usually the audit trail, not the model.
We use five tiers. Movement between tiers is achieved by removing a specific blocker, not by adopting a new vendor or buying a new license.
L0 — Rules
Hand-written rule engines, static dashboards, and after-the-fact reporting. The dominant tier for state benefits programs today.
Diagnostic signals. Caseworker training takes 6+ weeks. Eligibility-determination time is measured in days. Improper-payment detection is retrospective — claw-back via Recovery Audit Contractors rather than prevention at submission.
Move to L1 by. Standing up a feature store and a single supervised model alongside (not replacing) the rule engine. Shadow-mode evaluation for three months. Calibration plots delivered weekly to the program office.
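Shadow mode means the rule engine keeps making every binding decision while the model's output is logged for comparison and never acted on. A minimal sketch of that wiring, with hypothetical names (`ShadowLog`, `shadow_decide`, the case shape) chosen for illustration:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ShadowLog:
    """Append-only log pairing the binding rule-engine decision with the
    shadow model score for each case, for weekly calibration reporting."""
    records: list = field(default_factory=list)

    def record(self, case_id, rule_decision, model_score):
        self.records.append({
            "case_id": case_id,
            "rule_decision": rule_decision,   # the binding decision
            "model_score": model_score,       # shadow output, never acted on
            "ts": datetime.now(timezone.utc).isoformat(),
        })

def shadow_decide(case, rule_engine, model, log):
    """Rule engine stays authoritative; the model only observes."""
    decision = rule_engine(case)   # binding path, unchanged
    score = model(case)            # shadow path, logged only
    log.record(case["id"], decision, score)
    return decision                # downstream systems see rules only
```

The point of the structure is that removing the shadow path changes nothing downstream, which is exactly the property that makes a three-month evaluation safe to run in production.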
L1 — Augmented
ML scoring layered onto the rule engine. Human review remains the decisioning interface.
Diagnostic signals. A model exists in production but its outputs are never consumed by a downstream system. Caseworkers see "risk scores" with no associated reasoning. There is no on-call rotation for model degradation.
Move to L2 by. Wrapping the model in citation-grounded explanation. Adding a model card. Establishing a drift monitor with paging thresholds. Refactoring the caseworker UI to surface evidence, not scores.
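One common way to build the drift monitor is the Population Stability Index over model scores: compare the live score distribution against a frozen baseline and page when the index crosses a threshold (0.2 is a widely used rule of thumb, not a mandate). A sketch, with the threshold and function names as assumptions:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline score sample and a
    live one. Values near 0 mean stable; > 0.2 is a common drift flag."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    edges[-1] = hi + 1e-9  # include the right endpoint in the last bin

    def frac(sample, i):
        count = sum(1 for x in sample if edges[i] <= x < edges[i + 1])
        return max(count / len(sample), 1e-6)  # floor avoids log(0)

    return sum(
        (frac(actual, i) - frac(expected, i))
        * math.log(frac(actual, i) / frac(expected, i))
        for i in range(bins)
    )

PAGE_THRESHOLD = 0.2  # assumed paging threshold; tune per program

def check_drift(baseline_scores, live_scores, page):
    value = psi(baseline_scores, live_scores)
    if value > PAGE_THRESHOLD:
        page(f"model drift: PSI={value:.3f} exceeds {PAGE_THRESHOLD}")
    return value
```

The design choice worth noting: the monitor pages a human rotation rather than silently retraining. At L2 the remedy for drift is still a human decision.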
L2 — Assisted
Decisioning copilots present recommendations with citations to policy, statute, or case precedent. Humans make every binding decision; the system makes every decision faster.
Diagnostic signals. Caseworkers report consulting the copilot before completing a determination. Decision-quality consistency rises across regional offices. FOIA-eligible reasoning traces are stored per case.
Move to L3 by. Defining the narrowest multi-step task a constrained agent can complete end-to-end with human approval at boundaries. Building the policy-bound action graph. Red-teaming it against adversarial submissions before any production traffic.
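A policy-bound action graph can be as simple as an explicit map of permitted transitions, with human-approval boundaries enforced in the transition function itself. A minimal sketch under assumed names (the action labels and `PolicyViolation` are illustrative, not a standard):

```python
# The agent may only traverse edges declared here; anything else is a
# policy violation, not a fallback behavior.
ACTION_GRAPH = {
    "intake": {"classify_documents"},
    "classify_documents": {"retrieve_evidence"},
    "retrieve_evidence": {"draft_determination"},
    "draft_determination": {"human_review"},
}
# Boundary edges require a recorded human approver before the step runs.
APPROVAL_REQUIRED = {("draft_determination", "human_review")}

class PolicyViolation(Exception):
    pass

def transition(current, proposed, approved_by=None):
    """Permit a step only if the graph allows it and, at an approval
    boundary, only with an identified human approver on record."""
    if proposed not in ACTION_GRAPH.get(current, set()):
        raise PolicyViolation(f"{current} -> {proposed} is not in the action graph")
    if (current, proposed) in APPROVAL_REQUIRED and approved_by is None:
        raise PolicyViolation(f"{current} -> {proposed} requires human approval")
    return proposed
```

Because the graph is plain data, it can be versioned, diffed, and red-teamed: an adversarial submission that tries to jump from intake straight to a determination fails at the transition check, and that failure is itself a testable behavior.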
L3 — Agentic
Multi-step agents complete bounded tasks under human policy constraints — intake triage, document classification, evidence retrieval, recomputation on appeal. A human signs the final determination.
Diagnostic signals. The system produces actions, not recommendations. Agent reasoning is replayable. Policy constraints are versioned, deployed, and tested like code. There is a documented "circuit breaker" for every agent class.
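A "circuit breaker for every agent class" has a concrete minimum shape: after some number of consecutive failed or policy-violating actions, the whole class refuses to act until a named human resets it. A sketch, with the class name and threshold as illustrative assumptions:

```python
class CircuitBreaker:
    """Minimal per-agent-class circuit breaker: after `max_failures`
    consecutive failures, every further action is refused until a
    human operator performs an attributable reset."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0
        self.open = False

    def guard(self, action, *args):
        if self.open:
            raise RuntimeError("circuit open: agent class halted pending human reset")
        try:
            result = action(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.open = True  # trip: the whole class stops acting
            raise
        self.failures = 0  # a success resets the consecutive-failure count
        return result

    def reset(self, operator_id):
        """Only a human reopens the class, and the reset is attributable."""
        self.failures = 0
        self.open = False
        return operator_id
```

The documentation requirement in the diagnostic signal is the point: the breaker's threshold, trip behavior, and reset authority should live in the same versioned repository as the policy constraints it protects.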
Move to L4 by. Investing in self-monitoring telemetry, closed-loop remediation harnesses, and re-evaluation schedules. The agency, not the vendor, owns the evaluation infrastructure.
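The self-monitoring loop at the heart of L4 can be stated in a few lines: the agent measures its own performance on a held-out audit sample, and when that falls below the operating envelope's floor it files a retraining request and drops back to human-gated operation instead of continuing to act. A sketch, with the floor value and function names as assumptions:

```python
ENVELOPE_FLOOR = 0.92  # assumed minimum audited accuracy for this envelope

def monitoring_cycle(audited_accuracy, request_retraining):
    """One cycle of the closed loop: check the audited metric against the
    envelope; on breach, request remediation and fall back to human-gated
    (L3-style) operation rather than acting autonomously."""
    if audited_accuracy < ENVELOPE_FLOOR:
        request_retraining(
            reason=f"audited accuracy {audited_accuracy:.3f} below floor {ENVELOPE_FLOOR}"
        )
        return "degraded"
    return "autonomous"
```

Note what the agency must own for this loop to mean anything: the audit sample, the floor, and the remediation queue. A vendor-owned loop that grades its own work is not self-monitoring in any sense an inspector general would accept.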
L4 — Autonomous
Self-monitoring agents detect their own degradation, request re-training, and remediate within a bounded operating envelope. Human oversight shifts from per-decision to per-policy.
Diagnostic signals. Vanishingly few. We have not yet seen a state benefits program operating at L4 in production. We do not believe L4 is a near-term goal for most agencies — and we will tell you so.
How to use this model
- Pick one system. Score it. Disagree internally about the score until you converge — the disagreement is the diagnostic.
- Identify the single weakest link. It is rarely the model.
- Plan the move to the next rung. Not two rungs. One.
If you want our take on where your system sits and what the next-rung blocker is, send the materials. We'll mark them up against this model and return our notes within five business days.
Next step
Bring this to your next vendor meeting. If you'd like our help applying it to your program, we're 45 minutes away.