Phase 4 tested the model before deployment. Phase 5 governs what happens after deployment — every answer the model gives in production. The core requirement: any inference must be fully reconstructable after the fact. What was the input, what was retrieved, what was generated, was it grounded in the retrieval, and was the user entitled to see the retrieved documents? Without this, AI systems operate as black boxes with no audit trail, no accountability, and no capacity to respond to regulatory inquiry or customer complaint.
The Two Governance Layers
Output governance operates at two layers: the retrieval layer (what data the model is allowed to see) and the generation layer (what the model produces and whether it can be traced). Both must be governed simultaneously.
USER QUERY + USER IDENTITY
│
▼
┌────────────────────────────────────┐
│ LAYER 1: RETRIEVAL GOVERNANCE │
│ Pre-retrieval access control │
│ Scope-constrained search │
│ Document classification enforced │
└────────────┬───────────────────────┘
│ entitled chunks only
▼
┌────────────────────────────────────┐
│ LAYER 2: GENERATION GOVERNANCE │
│ Inference logging │
│ Faithfulness scoring │
│ Output attribution │
│ Citation enforcement │
└────────────┬───────────────────────┘
│
▼
GOVERNED OUTPUT
Layer 1: Retrieval Governance
In a Retrieval-Augmented Generation (RAG) system, the model does not answer from memory alone. It searches a document store, retrieves relevant text chunks, and generates an answer grounded in those chunks. The governance problem: the retriever optimizes for relevance, not entitlement. It returns the best-matching chunks regardless of whether the user is authorized to see them.
Pre-retrieval filtering is mandatory. The access control decision must occur before the search executes, not after. If the retriever searches the entire document store and a restricted chunk is retrieved, the model has already processed it — even if the output is filtered afterward, the model's reasoning was influenced by data the user should not have accessed.
The architecture: every document and every chunk in the vector store carries a classification tag inherited from the Phase 1 data catalog. When a query arrives, the system resolves the user's entitlements (role, department, clearance level) and constrains the vector search to chunks matching those entitlements.
VECTOR SEARCH QUERY:
semantic_search("credit eligibility criteria")
WHERE chunk.classification IN user.entitlements
AND (chunk.pii_flag = false
     OR user.pii_access = true)
The retriever never sees restricted chunks. The model never processes them. The output cannot reflect unauthorized information. This is a schema-level constraint on the vector store, not a post-processing filter.
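The pre-retrieval constraint can be sketched in Python. Everything here is illustrative: `User`, `Chunk`, the `score` function, and the in-memory list stand in for a real identity system and vector database. A production system would push the `allowed` predicate down into the vector store's native metadata filter (the WHERE clause above), not post-filter in application code — the point of the sketch is that the filter runs before any similarity scoring.

```python
from dataclasses import dataclass

@dataclass
class User:
    user_id: str
    entitlements: set       # classification tags this user may see
    pii_access: bool = False

@dataclass
class Chunk:
    chunk_id: str
    text: str
    classification: str     # inherited from the document catalog
    pii_flag: bool = False

def entitlement_filter(user: User):
    """Build the predicate applied BEFORE the search executes."""
    def allowed(chunk: Chunk) -> bool:
        if chunk.classification not in user.entitlements:
            return False
        # PII chunks are visible only with explicit PII access
        return (not chunk.pii_flag) or user.pii_access
    return allowed

def score(query: str, text: str) -> float:
    # toy lexical overlap in place of embedding similarity
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def governed_search(query: str, store: list, user: User, top_k: int = 5):
    """Constrain the search scope to entitled chunks, then rank.

    Restricted chunks are excluded before scoring, so the model
    never processes them and the output cannot reflect them.
    """
    allowed = entitlement_filter(user)
    candidates = [c for c in store if allowed(c)]   # pre-retrieval scope
    ranked = sorted(candidates,
                    key=lambda c: score(query, c.text), reverse=True)
    return ranked[:top_k]
```

The key design property: `governed_search` never scores a chunk the filter rejected, which is what distinguishes pre-retrieval filtering from a post-processing redaction step.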
Four operational challenges in retrieval governance:
Classification at scale. Millions of documents must be classified before chunking. Automated classifiers using NLP-based sensitivity detection handle volume, but low-confidence classifications must route to human reviewers. Misclassification creates silent access violations.
Chunk boundary integrity. A document may contain both classified and unclassified sections. If chunking splits across a classification boundary, an unclassified chunk may carry context from a classified section. Chunk boundaries must respect document classification boundaries.
Inference risk. Five individually non-sensitive chunks, when combined, may reveal sensitive information. Headcount data plus budget data reveals per-person compensation. Retrieval governance alone cannot prevent this — output monitoring is required.
Dynamic entitlements. A user's access changes when they move departments, get promoted, or are terminated. The retrieval scope must reflect current entitlements queried in real time, not cached from session initialization.
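Of the four challenges, chunk boundary integrity is the most mechanical to enforce. A minimal sketch, assuming the upstream pipeline already splits each document into sections tagged with their classification (the `sections` format here is an assumption, not a standard): chunking happens within a section, never across one, so every chunk inherits exactly one classification.

```python
def chunk_within_boundaries(sections, max_words=100):
    """Chunk a document WITHOUT crossing classification boundaries.

    `sections` is a list of (classification, text) pairs. Each
    emitted chunk carries the single classification tag of its
    section — a chunk never mixes classified and unclassified text,
    so an unclassified chunk cannot carry classified context.
    """
    chunks = []
    for classification, text in sections:
        words = text.split()
        # split long sections into fixed-size chunks, inside the section only
        for i in range(0, len(words), max_words):
            chunks.append({
                "classification": classification,
                "text": " ".join(words[i:i + max_words]),
            })
    return chunks
```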
Layer 2: Generation Governance
Every inference — every question and answer — must produce an immutable log record. This is the audit infrastructure of the AI system.
INFERENCE LOG RECORD:
inference_id: inf_20260315_8834
timestamp: 2026-03-15T14:23:00Z
user_id: user_priya_rm
user_role: relationship_manager
session_id: sess_4421
INPUT:
raw_query: "Is a salaried customer with
18L income eligible for 80L
home loan?"
RETRIEVAL:
chunks_retrieved: 5
chunk_ids: [chunk_2201, chunk_2203,
chunk_4410, chunk_4411,
chunk_7782]
parent_docs: [home_loan_policy_v7.pdf,
rbi_ltv_guidelines_2024.pdf]
entitlement_check: PASSED
retrieval_scores: [0.94, 0.91, 0.87, 0.85, 0.82]
GENERATION:
model_id: gpt-4o-2026-03
prompt_template: credit_qa_v4
temperature: 0.1
output: "Based on current policy, a
salaried applicant with 18L
income is eligible for up to
75L at current rates, subject
to existing obligations and
80% LTV per RBI guidelines."
token_count: 48
ATTRIBUTION:
claim_1: "18L income → eligible"
source: chunk_2201
grounded: true
claim_2: "up to 75L"
source: chunk_2203
grounded: true
claim_3: "subject to existing obligations"
source: chunk_4410
grounded: true
claim_4: "80% LTV per RBI"
source: chunk_7782
grounded: true
faithfulness_score: 1.00
IMMUTABLE: true
RETENTION: 7 years
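One way to make "immutable" verifiable rather than aspirational is a hash chain: each record stores the hash of the previous record, so any after-the-fact edit breaks the chain and is detectable. This is an in-memory sketch of that idea, not a storage design — a real deployment would back it with append-only (WORM) storage as discussed below.

```python
import hashlib
import json

class InferenceLog:
    """Append-only inference log with a tamper-evident hash chain."""

    def __init__(self):
        self.records = []
        self._prev_hash = "genesis"

    def append(self, record: dict) -> str:
        # link this record to the previous one, then hash the result
        entry = dict(record, prev_hash=self._prev_hash)
        payload = json.dumps(entry, sort_keys=True).encode()
        entry_hash = hashlib.sha256(payload).hexdigest()
        entry["hash"] = entry_hash
        self.records.append(entry)
        self._prev_hash = entry_hash
        return entry_hash

    def verify(self) -> bool:
        """Recompute the chain; any modified record breaks it."""
        prev = "genesis"
        for entry in self.records:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if body["prev_hash"] != prev:
                return False
            payload = json.dumps(body, sort_keys=True).encode()
            if hashlib.sha256(payload).hexdigest() != entry["hash"]:
                return False
            prev = entry["hash"]
        return True
```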
Faithfulness Measurement
Faithfulness measures what percentage of the model's output is traceable to retrieved chunks versus generated from the model's internal parameters. It is the metric that quantifies how much of the output is governed retrieval and how much is unverifiable generation.
The measurement method: decompose the output into individual claims. For each claim, determine whether it is supported by the retrieved chunks. The faithfulness score is the ratio of grounded claims to total claims.
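The decompose-and-score method reduces to a ratio. In this sketch the grounding check is a deliberately naive lexical test (every word of the claim appears in a single retrieved chunk) so the arithmetic is visible; it is a placeholder for the judge models and classifiers described next.

```python
def faithfulness_score(claims, chunks):
    """Ratio of grounded claims to total claims.

    `is_grounded` is a toy lexical containment check, standing in
    for an LLM judge or trained entailment classifier.
    """
    def is_grounded(claim: str) -> bool:
        words = set(claim.lower().split())
        return any(words <= set(chunk.lower().split()) for chunk in chunks)

    if not claims:
        return 0.0
    grounded = sum(is_grounded(c) for c in claims)
    return grounded / len(claims)
```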
At scale, this is automated using a second LLM as a judge (LLM-as-a-judge pattern) or a dedicated classifier such as Vectara's HHEM (Hughes Hallucination Evaluation Model). Evaluation frameworks: RAGAS (Retrieval Augmented Generation Assessment), TruLens (groundedness scoring), DeepEval (faithfulness metric).
Faithfulness above 90% indicates the model is primarily a synthesis layer over retrieved facts — govern the retrieval and you have governed most of the output. Faithfulness below 70% indicates the model is generating significant content from its own parameters — retrieval governance alone is insufficient and output-level monitoring is critical.
Three enforcement techniques at generation time:
Citation forcing. The system prompt instructs the model: "Answer only using information from the provided context. If the context does not contain the answer, state that you do not have sufficient information." This constrains generation to retrieved content at the cost of reduced answer coverage.
Inline attribution. The model is instructed to cite which chunk each claim comes from: "Eligible for up to 75L [Source: home_loan_policy_v7.pdf, section 3.2]." The user can verify. The audit trail self-documents. The attribution is stored as part of the inference log.
Confidence gating. If retrieval precision is low (the retrieved chunks have low relevance scores), the system refuses to answer rather than allowing the model to fill gaps with ungrounded generation. The threshold is configurable: a legal advisory system may require all chunks above 0.85 relevance; an internal FAQ may accept 0.70.
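Confidence gating is a few lines of policy code. A sketch, with the threshold and minimum-chunk count as configurable parameters (the 0.85 / 0.70 values echo the examples above; the refusal wording is illustrative):

```python
REFUSAL = ("I don't have sufficient information in the "
           "approved sources to answer this question.")

def confidence_gate(retrieval_scores, threshold=0.85, min_chunks=1):
    """Refuse when retrieval quality is too low to ground an answer.

    Returns (proceed, message): proceed=False means the system
    answers with a refusal instead of letting the model fill gaps
    with ungrounded generation.
    """
    confident = [s for s in retrieval_scores if s >= threshold]
    if len(confident) < min_chunks:
        return False, REFUSAL
    return True, None
```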
The Right to Explanation
Output governance is not solely an internal audit function. It fulfills external obligations. GDPR Article 22 grants data subjects the right not to be subject to solely automated decisions with significant effects, and the right to obtain meaningful information about the logic involved. The DPDP Act requires Significant Data Fiduciaries to conduct algorithmic due diligence. RBI's digital lending guidelines require explanation of credit decisions.
The inference log is the mechanism through which these obligations are fulfilled. Given an inference_id, the system returns: the user's input, the documents consulted, the model's output, the attribution mapping, and the faithfulness score. This is the explanation — not a generic description of how the model works, but a specific reconstruction of how this particular answer was produced for this particular user at this particular time.
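The lookup itself is straightforward once the log exists. A sketch, assuming a flat record shape like the sample log above (`log_index` mapping inference_id to the stored record is a hypothetical interface, not a prescribed schema):

```python
def reconstruct(inference_id: str, log_index: dict) -> dict:
    """Assemble the explanation for one specific inference.

    The result is a specific reconstruction — this user's input,
    these documents, this output, this attribution — not a generic
    description of how the model works.
    """
    rec = log_index[inference_id]
    return {
        "asked_by": rec["user_id"],
        "asked_at": rec["timestamp"],
        "question": rec["raw_query"],
        "documents_consulted": rec["parent_docs"],
        "answer": rec["output"],
        "claim_attribution": rec["attribution"],
        "faithfulness": rec["faithfulness_score"],
    }
```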
Storage and Retention
Inference logging at enterprise scale is a data engineering problem. An organization with 10,000 users generating 20 queries per day produces 200,000 inference logs daily. Each log is 5–15 KB including chunk text snapshots. Over a 7-year retention period (standard for financial services), this accumulates to 2.5–7.5 TB.
Design requirements: append-only storage (immutable — logs cannot be modified after creation). Compression and chunk deduplication (the same chunk appears in thousands of inferences). Indexing by user_id, timestamp, inference_id, and parent_doc. Partitioning by time period. Point-in-time chunk capture — the log must store the chunk text as it existed at inference time, not a pointer to the current version, because documents are updated after the fact.
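Chunk deduplication and point-in-time capture combine naturally in content-addressed storage: each chunk text is stored once under its hash, and every log record references the hash of the exact text the model saw. A minimal in-memory sketch of that idea (class and method names are illustrative):

```python
import hashlib

class ChunkSnapshotStore:
    """Point-in-time chunk capture with content-addressed dedup.

    The same chunk appears in thousands of inferences; storing its
    text once under its content hash keeps logs small, while each
    log still resolves to the text as it existed at inference time,
    untouched by later edits to the source document.
    """

    def __init__(self):
        self.blobs = {}        # sha256 digest -> chunk text, stored once

    def capture(self, chunk_text: str) -> str:
        digest = hashlib.sha256(chunk_text.encode()).hexdigest()
        self.blobs.setdefault(digest, chunk_text)   # dedup: write once
        return digest          # the inference log stores only this

    def resolve(self, digest: str) -> str:
        return self.blobs[digest]
```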
What Done Looks Like
Phase 5 is complete when every inference produces an immutable log record containing the input, retrieved chunks (with point-in-time text), model version, output, and claim-level attribution. Retrieval is governed by pre-retrieval access control enforcing user entitlements against document classifications. Faithfulness is scored on every inference in regulated contexts (or risk-weighted samples in lower-risk contexts). Citation forcing or inline attribution is enabled for all user-facing outputs. The system can reconstruct any inference — who asked, what was retrieved, what was answered, and why — within minutes of a compliance inquiry.
Without Phase 5, Phase 6 (Continuous Monitoring) has no data to monitor. Drift detection, bias monitoring, and quality tracking all depend on the inference log stream produced by output governance.
Next: Article 6 — Continuous Monitoring & Drift Detection
Appendix: Key Terms in Plain Language
Inference — A single question-answer cycle. The user asks something, the model produces a response. That entire interaction is one inference.
Inference Log — The complete record of a single inference: the input, what was retrieved, what was generated, and metadata about the process. The AI equivalent of a bank transaction record.
Faithfulness — The percentage of claims in the model's output that can be traced to retrieved chunks. High faithfulness means the model stuck to the facts. Low faithfulness means it invented content.
Groundedness — An alternative term for faithfulness, used by TruLens and some other frameworks. Same concept: is the output grounded in the provided context?
LLM-as-a-Judge — Using a second language model to evaluate the first model's output. The judge model reads the chunks and the output and scores whether the output is faithful to the chunks.
RAGAS — Retrieval Augmented Generation Assessment. An open-source framework for evaluating RAG systems on faithfulness, relevance, precision, and recall.
Citation Forcing — Instructing the model to only use information from provided context and to say "I don't know" when the context is insufficient. Reduces hallucination at the cost of answer coverage.
Inline Attribution — The model cites which source document supports each claim in its response. "Eligible for 75L [Source: policy_v7, section 3.2]." Enables user verification and self-documenting audit trails.
Confidence Gating — Refusing to answer when the retrieved chunks are not sufficiently relevant to the question. Prevents the model from filling gaps with invented content when retrieval quality is poor.
Point-in-Time Capture — Storing the text of retrieved chunks as they existed at the moment of inference, not a reference to the current version. Documents change after the fact; the log must reflect what the model actually saw.
Pre-Retrieval Filtering — Constraining the search scope before the retriever executes, based on the user's access rights. The retriever only searches documents the user is entitled to see. The correct approach, as opposed to post-retrieval filtering.
Vector Store — A database optimized for storing and searching embedding vectors. Used in RAG systems to find text chunks semantically similar to a query. Examples: Pinecone, Weaviate, Qdrant, pgvector.
GDPR Article 22 — The provision giving individuals the right not to be subject to solely automated decisions with significant effects, and the right to meaningful information about the logic involved.