Phase 4 tested the model before deployment. Phase 5 governs what happens after deployment — every answer the model gives in production. The core requirement is that any inference must be fully reconstructable after the fact: what was the input, what was retrieved, what was generated, was the output grounded in the retrieval, and was the user entitled to see the retrieved documents? Without this, AI systems operate as black boxes with no audit trail, no accountability, and no capacity to respond to regulatory inquiry or customer complaint.

The Two Governance Layers

Output governance operates at two layers: the retrieval layer (what data the model is allowed to see) and the generation layer (what the model produces and whether it can be traced). Both must be governed simultaneously.

USER QUERY + USER IDENTITY
         │
         ▼
┌────────────────────────────────────┐
│  LAYER 1: RETRIEVAL GOVERNANCE     │
│  Pre-retrieval access control      │
│  Scope-constrained search          │
│  Document classification enforced  │
└────────────┬───────────────────────┘
             │ entitled chunks only
             ▼
┌────────────────────────────────────┐
│  LAYER 2: GENERATION GOVERNANCE    │
│  Inference logging                 │
│  Faithfulness scoring              │
│  Output attribution                │
│  Citation enforcement              │
└────────────┬───────────────────────┘
             │
             ▼
         GOVERNED OUTPUT

Layer 1: Retrieval Governance

In a Retrieval-Augmented Generation (RAG) system, the model does not answer from memory alone. It searches a document store, retrieves relevant text chunks, and generates an answer grounded in those chunks. The governance problem: the retriever optimizes for relevance, not entitlement. It returns the best-matching chunks regardless of whether the user is authorized to see them.

Pre-retrieval filtering is mandatory. The access control decision must occur before the search executes, not after. If the retriever searches the entire document store and a restricted chunk is retrieved, the model has already processed it — even if the output is filtered afterward, the model's reasoning was influenced by data the user should not have accessed.

The architecture: every document and every chunk in the vector store carries a classification tag inherited from the Phase 1 data catalog. When a query arrives, the system resolves the user's entitlements (role, department, clearance level) and constrains the vector search to chunks matching those entitlements.

VECTOR SEARCH QUERY:
  semantic_search("credit eligibility criteria")
  WHERE chunk.classification IN user.entitlements
  AND (chunk.pii_flag = false
       OR user.pii_access = true)

The retriever never sees restricted chunks. The model never processes them. The output cannot reflect unauthorized information. This is a schema-level constraint on the vector store, not a post-processing filter.
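The constraint can be sketched in a few lines. This is an illustrative in-memory model under stated assumptions, not a real vector-store API: the `Chunk` dataclass, `entitled_search`, and the term-count relevance score are stand-ins for catalog metadata, a metadata-filtered embedding search, and a real ranker.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str
    text: str
    classification: str   # tag inherited from the Phase 1 data catalog
    pii: bool = False

def entitled_search(chunks, query_terms, entitlements, pii_access=False):
    """Restrict the candidate set BEFORE relevance ranking, so restricted
    text never enters the scope the model can see."""
    scope = [
        c for c in chunks
        if c.classification in entitlements
        and (not c.pii or pii_access)          # parenthesised PII gate
    ]
    # Toy relevance: count of matching query terms (a real system
    # ranks embedding similarity over the already-filtered scope).
    scored = [(sum(t in c.text for t in query_terms), c) for c in scope]
    return [c for score, c in sorted(scored, key=lambda x: -x[0]) if score > 0]
```

Because the filter is applied to the search scope rather than to the results, a restricted chunk can never appear in the output regardless of how relevant it is.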

Four operational challenges in retrieval governance:

Classification at scale. Millions of documents must be classified before chunking. Automated classifiers using NLP-based sensitivity detection handle volume, but low-confidence classifications must route to human reviewers. Misclassification creates silent access violations.

Chunk boundary integrity. A document may contain both classified and unclassified sections. If chunking splits across a classification boundary, an unclassified chunk may carry context from a classified section. Chunk boundaries must respect document classification boundaries.

Inference risk. Five individually non-sensitive chunks, when combined, may reveal sensitive information. Headcount data plus budget data reveals per-person compensation. Retrieval governance alone cannot prevent this — output monitoring is required.

Dynamic entitlements. A user's access changes when they move departments, get promoted, or are terminated. The retrieval scope must reflect current entitlements queried in real time, not cached from session initialization.
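The dynamic-entitlement point can be made concrete with a minimal sketch. The in-memory `DIRECTORY` dict is a hypothetical stand-in for a live IAM or directory-service lookup; the point is only that resolution happens per query, never per session.

```python
# Hypothetical entitlement directory; in production this would be a
# live IAM / directory-service call, never a per-session cache.
DIRECTORY = {"user_priya_rm": {"public", "internal", "credit_policy"}}

def resolve_entitlements(user_id: str) -> frozenset:
    """Resolve entitlements at query time so revocations apply immediately."""
    return frozenset(DIRECTORY.get(user_id, {"public"}))
```

A department move or termination updates the directory, and the very next query is scoped against the reduced entitlement set.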

Layer 2: Generation Governance

Every inference — every question and answer — must produce an immutable log record. This is the audit infrastructure of the AI system.

INFERENCE LOG RECORD:
  inference_id:     inf_20260315_8834
  timestamp:        2026-03-15T14:23:00Z
  user_id:          user_priya_rm
  user_role:        relationship_manager
  session_id:       sess_4421

  INPUT:
    raw_query:      "Is a salaried customer with 
                     18L income eligible for 80L 
                     home loan?"

  RETRIEVAL:
    chunks_retrieved: 5
    chunk_ids:        [chunk_2201, chunk_2203, 
                       chunk_4410, chunk_4411, 
                       chunk_7782]
    parent_docs:      [home_loan_policy_v7.pdf,
                       rbi_ltv_guidelines_2024.pdf]
    entitlement_check: PASSED
    retrieval_scores:  [0.94, 0.91, 0.87, 0.85, 0.82]

  GENERATION:
    model_id:         gpt-4o-2026-03
    prompt_template:  credit_qa_v4
    temperature:      0.1
    output:           "Based on current policy, a 
                       salaried applicant with 18L 
                       income is eligible for up to 
                       75L at current rates, subject 
                       to existing obligations and 
                       80% LTV per RBI guidelines."
    token_count:      48

  ATTRIBUTION:
    claim_1: "18L income → eligible"
      source: chunk_2201
      grounded: true
    claim_2: "up to 75L"
      source: chunk_2203
      grounded: true
    claim_3: "subject to existing obligations"
      source: chunk_4410
      grounded: true
    claim_4: "80% LTV per RBI"
      source: chunk_7782
      grounded: true
    faithfulness_score: 1.00

  IMMUTABLE: true
  RETENTION: 7 years

Faithfulness Measurement

Faithfulness measures what percentage of the model's output is traceable to retrieved chunks versus generated from the model's internal parameters. It is the metric that quantifies how much of the output is governed retrieval and how much is unverifiable generation.

The measurement method: decompose the output into individual claims. For each claim, determine whether it is supported by the retrieved chunks. The faithfulness score is the ratio of grounded claims to total claims.
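The ratio itself is simple to compute once claims are decomposed. In this sketch, `supports(claim, chunk)` is any entailment check you plug in: a naive word-overlap heuristic in the test below, an LLM judge or trained classifier in production.

```python
def faithfulness(claims, chunks, supports):
    """Faithfulness = grounded claims / total claims.

    `supports(claim, chunk)` is a pluggable entailment check; the
    score is the fraction of claims backed by at least one chunk.
    """
    if not claims:
        return 0.0
    grounded = sum(any(supports(cl, ch) for ch in chunks) for cl in claims)
    return grounded / len(claims)
```

The hard part in practice is the claim decomposition and the quality of `supports`, not the arithmetic.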

At scale, this is automated using a second LLM as a judge (LLM-as-a-judge pattern) or a dedicated classifier such as Vectara's HHEM (Hughes Hallucination Evaluation Model). Evaluation frameworks: RAGAS (Retrieval Augmented Generation Assessment), TruLens (groundedness scoring), DeepEval (faithfulness metric).

Faithfulness above 90% indicates the model is primarily a synthesis layer over retrieved facts — govern the retrieval and you have governed most of the output. Faithfulness below 70% indicates the model is generating significant content from its own parameters — retrieval governance alone is insufficient and output-level monitoring is critical.

Three enforcement techniques at generation time:

Citation forcing. The system prompt instructs the model: "Answer only using information from the provided context. If the context does not contain the answer, state that you do not have sufficient information." This constrains generation to retrieved content at the cost of reduced answer coverage.

Inline attribution. The model is instructed to cite which chunk each claim comes from: "Eligible for up to 75L [Source: home_loan_policy_v7.pdf, section 3.2]." The user can verify. The audit trail self-documents. The attribution is stored as part of the inference log.

Confidence gating. If retrieval precision is low (the retrieved chunks have low relevance scores), the system refuses to answer rather than allowing the model to fill gaps with ungrounded generation. The threshold is configurable: a legal advisory system may require all chunks above 0.85 relevance; an internal FAQ may accept 0.70.
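A minimal sketch of the gate, assuming relevance scores are already normalized to [0, 1] as in the inference log example. The threshold names and defaults are illustrative, not a standard API.

```python
def should_answer(retrieval_scores, min_all=0.85):
    """Refuse when retrieval quality cannot ground an answer.

    min_all is the per-deployment threshold: e.g. 0.85 for a legal
    advisory system, 0.70 for an internal FAQ. Empty retrieval
    always refuses -- there is nothing to ground the answer in.
    """
    return bool(retrieval_scores) and all(s >= min_all for s in retrieval_scores)
```

When the gate returns False, the system emits a refusal message instead of invoking the model, and logs the refusal like any other inference.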

The Right to Explanation

Output governance is not solely an internal audit function. It fulfills external obligations. GDPR Article 22 grants data subjects the right not to be subject to solely automated decisions with significant effects, together with the right to obtain meaningful information about the logic involved. India's DPDP Act requires Significant Data Fiduciaries to conduct algorithmic due diligence. RBI's digital lending guidelines require lenders to explain credit decisions.

The inference log is the mechanism through which these obligations are fulfilled. Given an inference_id, the system returns: the user's input, the documents consulted, the model's output, the attribution mapping, and the faithfulness score. This is the explanation — not a generic description of how the model works, but a specific reconstruction of how this particular answer was produced for this particular user at this particular time.
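The lookup itself reduces to a projection over the stored record. This sketch assumes log records shaped like the inference log example above, held in a dict keyed by inference_id; a real system would query the append-only log store instead.

```python
def explain(log_store: dict, inference_id: str) -> dict:
    """Reconstruct one specific inference from its immutable log record."""
    rec = log_store[inference_id]   # KeyError means no such inference
    return {
        "input": rec["raw_query"],
        "documents_consulted": rec["parent_docs"],
        "output": rec["output"],
        "attribution": rec["attribution"],
        "faithfulness": rec["faithfulness_score"],
    }
```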

Storage and Retention

Inference logging at enterprise scale is a data engineering problem. An organization with 10,000 users generating 20 queries per day produces 200,000 inference logs daily. Each log is 5–15 KB including chunk text snapshots. Over a 7-year retention period (standard for financial services), this accumulates to 2.5–7.5 TB.

Design requirements: append-only storage (immutable — logs cannot be modified after creation). Compression and chunk deduplication (the same chunk appears in thousands of inferences). Indexing by user_id, timestamp, inference_id, and parent_doc. Partitioning by time period. Point-in-time chunk capture — the log must store the chunk text as it existed at inference time, not a pointer to the current version, because documents are updated after the fact.
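Point-in-time capture and deduplication combine naturally via content hashing. A sketch under simple assumptions: `blob_store` is a dict standing in for write-once object storage, and the inference log stores only the hashes.

```python
import hashlib

def capture_chunks(chunk_texts, blob_store):
    """Point-in-time capture with deduplication.

    Each distinct chunk text is stored once, keyed by its SHA-256 hash,
    frozen as it existed at inference time; the log record keeps only
    the hashes. Later edits to the source document cannot alter what
    the model actually saw.
    """
    hashes = []
    for text in chunk_texts:
        h = hashlib.sha256(text.encode("utf-8")).hexdigest()
        blob_store.setdefault(h, text)   # write-once: never overwritten
        hashes.append(h)
    return hashes
```

A chunk retrieved in thousands of inferences costs one stored copy plus one 64-character hash per log record.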

What Done Looks Like

Phase 5 is complete when every inference produces an immutable log record containing the input, retrieved chunks (with point-in-time text), model version, output, and claim-level attribution. Retrieval is governed by pre-retrieval access control enforcing user entitlements against document classifications. Faithfulness is scored on every inference in regulated contexts (or risk-weighted samples in lower-risk contexts). Citation forcing or inline attribution is enabled for all user-facing outputs. The system can reconstruct any inference — who asked, what was retrieved, what was answered, and why — within minutes of a compliance inquiry.

Without Phase 5, Phase 6 (Continuous Monitoring) has no data to monitor. Drift detection, bias monitoring, and quality tracking all depend on the inference log stream produced by output governance.

Next: Article 6 — Continuous Monitoring & Drift Detection

Appendix: Key Terms in Plain Language

Inference — A single question-answer cycle. The user asks something, the model produces a response. That entire interaction is one inference.

Inference Log — The complete record of a single inference: the input, what was retrieved, what was generated, and metadata about the process. The AI equivalent of a bank transaction record.

Faithfulness — The percentage of claims in the model's output that can be traced to retrieved chunks. High faithfulness means the model stuck to the facts. Low faithfulness means it invented content.

Groundedness — An alternative term for faithfulness, used by TruLens and some other frameworks. Same concept: is the output grounded in the provided context?

LLM-as-a-Judge — Using a second language model to evaluate the first model's output. The judge model reads the chunks and the output and scores whether the output is faithful to the chunks.

RAGAS — Retrieval Augmented Generation Assessment. An open-source framework for evaluating RAG systems on faithfulness, relevance, precision, and recall.

Citation Forcing — Instructing the model to only use information from provided context and to say "I don't know" when the context is insufficient. Reduces hallucination at the cost of answer coverage.

Inline Attribution — The model cites which source document supports each claim in its response. "Eligible for 75L [Source: policy_v7, section 3.2]." Enables user verification and self-documenting audit trails.

Confidence Gating — Refusing to answer when the retrieved chunks are not sufficiently relevant to the question. Prevents the model from filling gaps with invented content when retrieval quality is poor.

Point-in-Time Capture — Storing the text of retrieved chunks as they existed at the moment of inference, not a reference to the current version. Documents change after the fact; the log must reflect what the model actually saw.

Pre-Retrieval Filtering — Constraining the search scope before the retriever executes, based on the user's access rights. The retriever only searches documents the user is entitled to see. The correct approach, as opposed to post-retrieval filtering.

Vector Store — A database optimized for storing and searching embedding vectors. Used in RAG systems to find text chunks semantically similar to a query. Examples: Pinecone, Weaviate, Qdrant, pgvector.

GDPR Article 22 — The provision giving individuals the right not to be subject to solely automated decisions with significant effects, and the right to meaningful information about the logic involved.