The audit-trail problem: why most government AI deployments fail their first OIG review
Models that can't show their work don't survive an Inspector General audit. The fix is architectural, not optional.
By Frank Speiser · April 22, 2026
We have lost count of the agency demos that ended with a Senior Program Integrity officer asking the same question: "Can you show me, for this specific denial, the exact data the model saw, the version of the model at the time, and the policy text that bound the determination?"
The answer is almost always no. Almost always.
This is what we call the audit-trail problem, and it is the single most common reason promising government AI deployments are quietly retired before they ever face a real OIG review. Models without per-decision provenance are unauditable. Unauditable systems do not survive due-process challenge. Systems that do not survive due-process challenge get removed from production.
What an audit trail actually has to contain
For a single benefit determination influenced by an AI system, the audit record must contain, at minimum:
- The canonical fact-of-claim record at the moment of determination — not the current state.
- The policy version active at the moment of determination, including the specific clauses applied.
- The feature set consumed by the model, with feature versions and feature-store provenance.
- The model artifact (weights, tokenizer, prompt template), pinned at the exact version that produced the output.
- For agents: the full tool-call sequence, including any documents retrieved and the exact text returned.
- The output, the threshold logic, and the caseworker action taken — including any override and override reason.
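The fields above can be captured as a single immutable record. A minimal sketch in Python; the class and field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    tool: str              # tool the agent invoked
    args_hash: str         # content hash of the arguments passed
    result_hash: str       # content hash of the exact text returned

@dataclass(frozen=True)  # frozen: the record cannot be mutated after the fact
class DecisionAuditRecord:
    decision_id: str
    claim_snapshot_hash: str   # fact-of-claim record at determination time
    policy_version: str        # policy version active at determination time
    policy_clauses: tuple      # specific clauses applied
    feature_versions: dict     # feature name -> (value, feature-store version)
    model_version: str         # weights + tokenizer + prompt template
    tool_calls: tuple          # full agent tool-call sequence, in order
    model_output: str
    threshold_logic: str       # how the output became a determination
    caseworker_action: str
    override_reason: str = ""  # empty when no override occurred
```

The point of `frozen=True` is that the audit record is append-only by construction: once written, no later process can quietly revise what the model saw.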
This list is not aspirational. It is the bar a competently administered program-integrity office can and will demand. We have helped clients prepare for these reviews. We have also walked into engagements where the prior vendor cheerfully informed the agency that no, none of those things were captured.
Why most stacks can't produce this
There are three structural reasons.
Mutable feature stores. Most feature stores update in place. The value of last_payment_amount_90d returned today is not the value returned at the moment of decision. Without immutable, time-traveled feature snapshots, you cannot rebuild the input.
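One way to get immutable, time-traveled reads is an as-of lookup over versioned writes. A toy in-memory sketch (assumes writes arrive in timestamp order; the class name is ours):

```python
import bisect

class TimeTraveledFeatureStore:
    """Append-only feature store: writes never overwrite, reads are as-of."""

    def __init__(self):
        self._history = {}  # feature name -> sorted list of (timestamp, value)

    def write(self, name, ts, value):
        # Assumes monotonically increasing timestamps per feature.
        self._history.setdefault(name, []).append((ts, value))

    def read_as_of(self, name, ts):
        """Return the value that was current at time ts, not the latest."""
        versions = self._history.get(name, [])
        idx = bisect.bisect_right(versions, (ts, float("inf"))) - 1
        if idx < 0:
            raise KeyError(f"{name} had no value at {ts}")
        return versions[idx][1]

store = TimeTraveledFeatureStore()
store.write("last_payment_amount_90d", ts=100, value=420.00)
store.write("last_payment_amount_90d", ts=200, value=515.00)  # later mutation
# A decision made at ts=150 must replay against 420.00, not 515.00:
assert store.read_as_of("last_payment_amount_90d", 150) == 420.00
```

Production feature stores express the same idea as point-in-time-correct snapshots; the invariant is identical: a read keyed by decision time must be deterministic forever.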
Untracked prompt and tool drift. Agentic systems shipped without prompt versioning, tool-schema versioning, and retrieved-document hashing produce non-reproducible decisions by construction.
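Reproducibility starts with hashing everything the agent saw. A minimal sketch; the manifest layout is an assumption, not a standard:

```python
import hashlib

def sha256_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def decision_manifest(prompt_template, tool_schemas, retrieved_docs):
    """Pin every input the agent saw so the decision can be reproduced."""
    return {
        "prompt_template_sha256": sha256_text(prompt_template),
        "tool_schema_sha256": [sha256_text(s) for s in tool_schemas],
        "retrieved_doc_sha256": [sha256_text(d) for d in retrieved_docs],
    }

# Identical inputs produce an identical manifest; any drift in prompt,
# tool schema, or retrieved text changes a hash and becomes detectable.
m1 = decision_manifest("You are a claims examiner.", ["schema-v1"], ["Policy 4.2(a) text"])
m2 = decision_manifest("You are a claims examiner.", ["schema-v1"], ["Policy 4.2(a) text"])
assert m1 == m2
```

Store the manifest alongside the decision record and a silent prompt edit or a re-indexed document corpus stops being invisible.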
Decision-event collapse. Most agency systems persist the decision but not the path to the decision. The override is recorded; the recommendation that was overridden is gone.
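The fix for the collapse is to persist both events, not the final state. A small sketch (event shapes are illustrative):

```python
# Append-only event stream: the recommendation and the override are
# separate events, so the path to the decision survives.
events = []
events.append({"type": "model_recommendation", "decision_id": "D-7",
               "value": "deny", "score": 0.91})
events.append({"type": "caseworker_override", "decision_id": "D-7",
               "value": "approve", "reason": "hardship waiver"})

# Nothing is collapsed: both the overridden recommendation and the
# override are recoverable in order.
history = [e["value"] for e in events if e["decision_id"] == "D-7"]
assert history == ["deny", "approve"]
```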
What to build
The fix is architectural. We treat decision provenance as a first-class artifact, not a logging concern:
- Append-only event log of every decision touchpoint, with content-addressed pointers to features, models, prompts, and retrieved documents.
- A "replay" endpoint that, given a decision ID, reconstructs the exact inputs to the model or agent and re-runs them.
- A "diff" endpoint that, given two decision IDs, surfaces the minimal set of differences in inputs that produced different outputs.
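A toy in-memory version of the log-plus-diff pair, to make the shape concrete (IDs and field names are ours; a real system would store content-addressed pointers, not raw values):

```python
class DecisionLog:
    """Append-only: events are only ever added, never updated in place."""

    def __init__(self):
        self._events = []       # the log itself, in arrival order
        self._by_decision = {}  # decision_id -> pinned inputs and output

    def append(self, decision_id, inputs, output):
        event = {"decision_id": decision_id,
                 "inputs": dict(inputs), "output": output}
        self._events.append(event)
        self._by_decision[decision_id] = event

    def replay_inputs(self, decision_id):
        """Reconstruct the exact inputs for a past decision."""
        return self._by_decision[decision_id]["inputs"]

    def diff(self, id_a, id_b):
        """Minimal set of input differences between two decisions."""
        a, b = self.replay_inputs(id_a), self.replay_inputs(id_b)
        return {k: (a.get(k), b.get(k))
                for k in set(a) | set(b) if a.get(k) != b.get(k)}

log = DecisionLog()
log.append("D-1", {"income": 31000, "policy": "2026.03"}, "approve")
log.append("D-2", {"income": 31000, "policy": "2026.04"}, "deny")
# Same claimant facts, different policy version: the diff isolates it.
assert log.diff("D-1", "D-2") == {"policy": ("2026.03", "2026.04")}
```

The diff endpoint is what turns an audit from archaeology into a query: "these two claimants got different outcomes; show me exactly why" becomes answerable in one call.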
Building this from scratch is expensive. Retrofitting it onto a system already in production is more expensive. Not building it costs more.
If you have a system in production today and are uncertain whether it could survive an OIG audit, we will run a one-week assessment against your current logging architecture and tell you what you need to change. The deliverable is two pages.
If this resonates with a program you're working on, we'd be glad to talk.