Phase 1 cataloged the data estate. Phase 2 established lawful basis and consent enforcement. Phase 3 applies both to the specific problem of machine learning: ensuring that every record used to train a model is lawfully collected for that purpose, traceable to its source, versioned for reproducibility, and profiled for bias. A model trained on ungoverned data inherits every compliance gap, every bias, and every consent violation in its training set — and amplifies them at inference scale.

The Training Data Pipeline

Training data does not go directly from a source system into a model. It passes through a chain of systems, each of which transforms it. Governance must be enforced at each stage, not applied retroactively at the end.

SOURCE SYSTEMS (databases, event streams, APIs)
         │
         ▼
┌─────────────────────────────────┐
│  CONSENT GATE                   │
│  Validates: does this user's    │
│  consent include purpose =      │
│  model_training?                │
│  Jurisdiction-aware.            │
│  Blocks non-consented records.  │
└────────────┬────────────────────┘
             │ consented records only
             ▼
┌─────────────────────────────────┐
│  FEATURE PIPELINE               │
│  Transforms raw data into       │
│  model-consumable features.     │
│  avg_transaction_30d,           │
│  login_frequency_7d, etc.       │
└────────────┬────────────────────┘
             │
             ▼
┌─────────────────────────────────┐
│  FEATURE STORE                  │
│  Stores precomputed features.   │
│  Serves training and inference. │
│  Stores lineage keys back to    │
│  source records.                │
└────────────┬────────────────────┘
             │
             ▼
┌─────────────────────────────────┐
│  TRAINING DATASET ASSEMBLY      │
│  Selects features for a         │
│  specific training job.         │
│  Pins dataset version.          │
│  Runs bias profiling.           │
└────────────┬────────────────────┘
             │
             ▼
┌─────────────────────────────────┐
│  MODEL TRAINING                 │
│  Consumes versioned dataset.    │
│  Logs training config.          │
│  Generates model card.          │
└─────────────────────────────────┘

Five stages. Governance is not a wrapper around the last box. It is embedded in every transition.

Stage 1: Consent Validation at Pipeline Entry

The Consent Gate from Phase 2 sits at the entry point of the feature pipeline. Its function here is specific: for every record flowing into a pipeline tagged with purpose = model_training, the gate queries the Consent Management System to verify that the Data Principal has granted consent for that purpose under the applicable jurisdiction.

The validation logic per record:

record.user_id  → resolve jurisdiction (IN, EU, US-CA)
jurisdiction    → look up legal basis for model_training
                   IN  → DPDP: requires consent (Section 6)
                   EU  → GDPR: requires consent or 
                          legitimate interest (Art. 6)
                   US-CA → CCPA: requires disclosure, 
                           user has opt-out right

IF basis = consent:
    query CMS: has user_id granted model_training?
    YES → PASS record
    NO  → BLOCK record, log rejection

IF basis = legitimate_interest (GDPR):
    verify LIA (Legitimate Interest Assessment) 
    exists for this processing activity
    YES → PASS record, log basis
    NO  → BLOCK record, escalate

IF basis = opt_out (CCPA):
    query CMS: has user_id opted out?
    YES → BLOCK record
    NO  → PASS record, log basis

Records that fail the gate never enter the feature pipeline. They do not become features. They do not reach the training dataset. The model never sees them. This is the architectural guarantee that training data is lawfully sourced.
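The per-record gate logic above can be sketched in a few lines of Python. This is a minimal illustration, not the Phase 2 implementation: the `cms_has_consent`, `cms_has_opted_out`, and `lia_exists` callables are hypothetical stand-ins for Consent Management System queries, and the EU branch is simplified to the legitimate-interest path (a real gate would also accept explicit consent under Art. 6).

```python
# Minimal sketch of the per-record Consent Gate logic.
# cms_has_consent, cms_has_opted_out, and lia_exists are hypothetical
# stand-ins for queries against the Consent Management System.

LEGAL_BASIS = {           # jurisdiction -> basis for purpose = model_training
    "IN": "consent",              # DPDP, Section 6
    "EU": "legitimate_interest",  # GDPR Art. 6 (simplified here)
    "US-CA": "opt_out",           # CCPA opt-out right
}

def gate(record, cms_has_consent, cms_has_opted_out, lia_exists):
    """Return (decision, reason) for one record entering the pipeline."""
    basis = LEGAL_BASIS.get(record["jurisdiction"])
    if basis == "consent":
        if cms_has_consent(record["user_id"], "model_training"):
            return ("PASS", "consent")
        return ("BLOCK", "no consent for model_training")
    if basis == "legitimate_interest":
        if lia_exists("model_training"):
            return ("PASS", "legitimate_interest")
        return ("BLOCK", "no LIA on file; escalate")
    if basis == "opt_out":
        if cms_has_opted_out(record["user_id"]):
            return ("BLOCK", "user opted out")
        return ("PASS", "opt_out basis, no opt-out on file")
    return ("BLOCK", "unknown jurisdiction")
```

Blocked records are logged and never written downstream; the pipeline only ever sees the PASS stream.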

Stage 2: Feature Pipeline and Lineage

The feature pipeline transforms raw data into model-consumable features. A user's transaction history becomes avg_transaction_30d. Their login timestamps become login_frequency_7d. Their geographic data becomes region_code.

Two governance requirements at this stage:

Lineage preservation. Every computed feature must carry a pointer back to the source records that produced it. The Feature Store stores lineage keys — not consent metadata, not raw data, but identifiers linking each feature row to its origin. This enables withdrawal cascades: when a user withdraws consent, the system traces lineage keys to find and delete all derived features.

FEATURE STORE ROW:
  user_id:             user_12345
  avg_transaction_30d: 633.00
  login_frequency_7d:  3.5
  region_code:         KA-BLR
  source_records:      [txn_001, txn_002, txn_003,
                        login_8801, login_8802]
  pipeline_version:    feat_pipeline_v2.3
  computed_at:         2025-12-20T04:15:00Z
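With lineage keys in place, the withdrawal cascade is a reverse lookup. A minimal sketch, assuming Feature Store rows are available as dicts shaped like the example row above:

```python
def withdrawal_cascade(feature_rows, withdrawn_record_ids):
    """Find feature rows derived from any withdrawn source record.

    feature_rows: iterable of dicts with 'user_id' and 'source_records',
    shaped like the Feature Store row above. Returns the rows that must
    be deleted (or recomputed without the withdrawn records).
    """
    withdrawn = set(withdrawn_record_ids)
    return [row for row in feature_rows
            if withdrawn & set(row["source_records"])]
```

In a real Feature Store this would be an indexed query on the lineage keys rather than a scan, but the contract is the same: any feature touching a withdrawn record is identified, not guessed at.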

Purpose narrowing. If the feature pipeline joins two datasets with different purpose scopes, the output inherits the intersection of permitted purposes — not the union. Dataset A (consented for service_delivery and model_training) joined with Dataset B (consented for service_delivery only) produces an output consented only for service_delivery. The model_training purpose does not carry through the join. This must be enforced in pipeline logic, not left to manual review.
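The narrowing rule is a set intersection, small enough to enforce directly in the join code. A sketch, assuming each dataset carries its permitted purposes as metadata:

```python
def narrowed_purposes(*scopes):
    """Output of a join inherits the intersection of its inputs'
    permitted purposes -- never the union."""
    result = set(scopes[0])
    for scope in scopes[1:]:
        result &= set(scope)
    return result
```

Attaching this result to the join output as metadata is what keeps the check in pipeline logic rather than manual review.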

Stage 3: Training Dataset Versioning

Before features are consumed by a training job, they are assembled into a training dataset. This dataset must be versioned — a frozen, immutable snapshot of the exact records used.

TRAINING DATASET MANIFEST:
  dataset_id:      td_20260115_credit_v4
  created_at:      2026-01-15T08:00:00Z
  created_by:      ml_team/credit_risk
  purpose:         model_training
  feature_count:   12
  record_count:    1,847,332
  source_pipeline: feat_pipeline_v2.3
  consent_gate:    all records passed gate
  
  SCHEMA:
  ├─ user_id (string, hashed)
  ├─ avg_transaction_30d (float)
  ├─ login_frequency_7d (float)
  ├─ credit_score (int)
  ├─ region_code (string)
  ├─ employment_type (enum)
  ├─ income_band (enum)
  ├─ existing_emi_ratio (float)
  ├─ account_age_months (int)
  ├─ digital_txn_ratio (float)
  ├─ delinquency_flag (boolean)
  └─ label: default_90d (boolean)
  
  IMMUTABLE: true
  STORAGE: versioned object store (S3/GCS)
  RETENTION: until model retirement + audit period

Versioning serves three functions. First, reproducibility: any model can be retrained on the exact same dataset to verify results. Second, auditability: a regulator can inspect exactly what data a model was trained on, months or years after deployment. Third, withdrawal tracing: if a user withdraws consent, the system can identify which versioned datasets included their data and which models were trained on those datasets.

The dataset manifest must be stored as a first-class artifact in the Model Registry, linked to every model trained from it.
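One way to make the manifest tamper-evident is to fingerprint the frozen snapshot with a content hash and store the digest in the manifest. A minimal sketch, assuming the snapshot can be serialized deterministically; the field names mirror the example manifest above, and `build_manifest` is an illustrative helper, not a prescribed API:

```python
import hashlib
import json

def build_manifest(dataset_id, records, source_pipeline):
    """Freeze a dataset snapshot and fingerprint it with SHA-256.

    The digest lets an auditor verify, months or years later, that the
    stored snapshot is byte-for-byte the one the manifest describes.
    """
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return {
        "dataset_id": dataset_id,
        "record_count": len(records),
        "source_pipeline": source_pipeline,
        "content_sha256": hashlib.sha256(payload).hexdigest(),
        "immutable": True,
    }
```

The same records always produce the same digest; any change to the snapshot, however small, produces a different one.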

Stage 4: Bias Profiling

Before a training dataset is consumed by a training job, it must be profiled for demographic distribution. Bias in a model begins in the training data. If the dataset over-represents one geographic region, the model will perform better for that region and worse for others. If it under-represents a gender, the model's predictions for that gender will be less accurate.

Bias profiling produces a demographic report on the training dataset:

BIAS PROFILE — td_20260115_credit_v4:

Gender Distribution:
  male:        62.3%
  female:      34.1%
  other:        1.2%
  undisclosed:  2.4%
  ⚠ female under-represented vs population baseline (48%)

Geographic Distribution:
  metro:       71.8%
  tier_2:      19.4%
  rural:        8.8%
  ⚠ rural under-represented vs customer base (22%)

Employment Type:
  salaried:    58.2%
  self_employed: 24.6%
  gig:          9.1%
  unemployed:   8.1%

Age Distribution:
  18-25:       12.4%
  26-35:       38.7%
  36-45:       28.2%
  46-55:       14.1%
  55+:          6.6%

Proxy Variable Alert:
  region_code KA-BLR correlates 0.84 with 
  employment_type = salaried + tech_sector
  → potential proxy for demographic group

The bias profile does not decide whether to proceed. It surfaces the facts. The decision — rebuild the dataset with resampling, apply class weights, or accept the distribution and document the known limitation — is a business and ethical decision made jointly by the model owner, the Data Owner, and the compliance team. What the profile prevents is ignorance: training a model without knowing its demographic composition.

Proxy variable detection is critical. Protected attributes (gender, caste, religion) may not be present in the training data. But other features — geography, employer name, university attended — may correlate strongly with protected attributes. A feature that correlates above a defined threshold (typically 0.7–0.8) with a known protected attribute is flagged as a proxy. The model owner must decide whether to include, exclude, or decorrelate the proxy feature.

Industry tools for bias profiling: Google What-If Tool, IBM AI Fairness 360 (AIF360), Microsoft Fairlearn, Amazon SageMaker Clarify.

Stage 5: Model Training with Governance Metadata

When training executes, the system logs the full training configuration as a governed artifact:

TRAINING RUN LOG:
  run_id:            run_20260115_cr_v4
  model_type:        gradient_boosted_trees
  framework:         XGBoost 2.0.3
  dataset_id:        td_20260115_credit_v4
  dataset_version:   v4 (immutable)
  hyperparameters:
    learning_rate:   0.05
    max_depth:       6
    n_estimators:    500
    subsample:       0.8
  training_duration: 47 minutes
  compute:           4x A100 GPU
  
  GOVERNANCE:
    consent_gate:    all records validated
    bias_profile:    reviewed, accepted with 
                     documented limitations
    proxy_flags:     region_code flagged, 
                     included with justification
    fairness_eval:   pending (Phase 4)

This log, combined with the dataset manifest and bias profile, becomes the input to the Model Card — the standardized documentation artifact produced at the end of training. The Model Card is not written by a human from memory. It is generated from governance metadata captured at each stage of the pipeline.
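Generating the Model Card from captured metadata can be as simple as assembling the three governed artifacts into one document. A minimal sketch; the field names follow the examples above, but the card layout and `generate_model_card` helper are illustrative, not a standard:

```python
def generate_model_card(run_log, manifest, bias_profile):
    """Assemble a Model Card from governance artifacts captured
    during the pipeline, rather than written from memory."""
    return {
        "model_id": run_log["run_id"],
        "trained_on": manifest["dataset_id"],
        "record_count": manifest["record_count"],
        "consent_validation": run_log["governance"]["consent_gate"],
        "known_limitations": bias_profile.get("warnings", []),
        "proxy_flags": run_log["governance"].get("proxy_flags", []),
        "fairness_eval": run_log["governance"].get("fairness_eval",
                                                   "pending"),
    }
```

Because every field is read from an artifact produced upstream, the card cannot claim a consent status or bias review that never happened.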

The Provenance Chain

The complete provenance chain from source data to trained model:

SOURCE RECORD (txn_001)
  │  consent: model_training ✓ (via CMS)
  │  jurisdiction: IN
  │  classification: PII Tier 1
  ▼
CONSENT GATE
  │  validation: PASS
  │  logged: gate_log_88341
  ▼
FEATURE PIPELINE (feat_pipeline_v2.3)
  │  transform: txn_001 → avg_transaction_30d
  │  lineage key preserved
  ▼
FEATURE STORE
  │  row: user_12345
  │  source_records: [txn_001, txn_002, txn_003]
  ▼
TRAINING DATASET (td_20260115_credit_v4)
  │  version: v4, immutable
  │  bias_profile: reviewed
  ▼
MODEL (run_20260115_cr_v4)
  │  trained on td_20260115_credit_v4
  │  model card generated
  ▼
MODEL REGISTRY
  │  model_id, dataset_id, training_log, 
  │  bias_profile, model_card
  │  all linked, all queryable

At any point, for any model, the system can answer: what data was it trained on, was consent validated, what was the demographic composition of the training data, and which source records contributed to which features. This is the provenance chain. Without it, model governance is assertion without evidence.
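A registry that links these artifacts can answer the audit questions with a single lookup chain. A minimal in-memory sketch; the keys follow the chain above, and the flat-dict registry shape is an assumption for illustration:

```python
def provenance(registry, model_id):
    """Walk model -> dataset -> feature rows -> source records.

    Answers: what data was the model trained on, was consent
    validated, and which source records contributed?
    """
    model = registry["models"][model_id]
    dataset = registry["datasets"][model["dataset_id"]]
    rows = [registry["features"][u] for u in dataset["user_ids"]]
    return {
        "dataset_id": model["dataset_id"],
        "consent_gate": dataset["consent_gate"],
        "bias_profile": dataset["bias_profile"],
        "source_records": sorted(
            {rec for row in rows for rec in row["source_records"]}),
    }
```

The point is not the data structure but the property: every hop is a stored link, so the answer is evidence, not assertion.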

What Done Looks Like

Phase 3 is complete when no model trains on data that has not passed the Consent Gate for purpose = model_training. Every training dataset is versioned, immutable, and stored as a first-class artifact. Every dataset has a bias profile documenting demographic distribution and proxy variable flags. Every feature in the Feature Store carries lineage keys back to source records. Purpose narrowing is enforced in pipeline joins. Every training run produces a governance log linking the run to its dataset, consent validation, bias profile, and hyperparameters. The Model Registry links every model to its complete provenance chain.

Without Phase 3, Phase 4 (Model Behavior Governance) is testing a model whose inputs are unverified. Fairness testing on a model trained on ungoverned data is measuring the symptoms without diagnosing the cause.

Next: Article 4 — Model Behavior Governance

Appendix: Key Terms in Plain Language

Training Data — The dataset a machine learning model learns from. The model finds patterns in this data and uses them to make predictions on new data it has never seen.

Feature — A calculated value derived from raw data that the model uses as input. Raw transaction data becomes "average transaction amount in last 30 days." That calculated value is a feature.

Feature Pipeline — The automated process that transforms raw data into features. It reads from source systems, computes the features, and writes them to the Feature Store.

Feature Store — A centralized system that stores precomputed features and serves them consistently to both training jobs and live inference. Ensures the same feature definition is used everywhere.

Lineage Keys — Pointers stored alongside each feature row that reference which source records produced it. Enables tracing from any feature back to its origin and supports consent withdrawal cascades.

Purpose Narrowing — When two datasets with different consent scopes are joined, the output inherits only the purposes common to both. If Dataset A allows model_training and Dataset B does not, the joined output does not allow model_training.

Training Dataset Versioning — Freezing the exact set of records used for a training job into an immutable snapshot. Enables reproducibility, auditability, and withdrawal tracing.

Dataset Manifest — A metadata document describing a training dataset: its ID, creation date, schema, record count, source pipeline, and consent validation status. Stored as a first-class artifact.

Bias Profiling — Analyzing the demographic composition of a training dataset before it is used. Surfaces over-representation, under-representation, and proxy variables that could cause unfair model behavior.

Proxy Variable — A feature that is not a protected attribute (gender, caste, religion) but correlates strongly with one. Geography correlating with ethnicity. University name correlating with socioeconomic status. The model doesn't see the protected attribute directly but effectively uses it through the proxy.

Class Weights — A technique to compensate for imbalanced training data. If female applicants are 34% of the dataset but 48% of the population, class weights tell the model to treat each female record as slightly more important during training, partially correcting the imbalance.
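The correction described here falls out of the two proportions directly. A one-line sketch using the hypothetical 34% / 48% figures from the definition above:

```python
def class_weight(dataset_share, population_share):
    """Weight that makes an under-represented group count as if it
    appeared at its population rate during training."""
    return population_share / dataset_share

# 34% of dataset vs 48% of population -> each record weighted ~1.41
```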

Model Card — A standardized documentation artifact describing a trained model: what data it was trained on, what it does, how it performs, what its known limitations are, and what fairness evaluations were conducted. Originally proposed by Google researchers Mitchell et al. (2019).

Model Registry — A centralized system that stores trained models alongside their metadata: dataset versions, training configurations, performance metrics, model cards, and governance artifacts. Enables audit and version management across the model lifecycle.

Provenance Chain — The complete, traceable path from a source data record through every transformation and governance checkpoint to the trained model. Answers: where did the data come from, was it lawful, was it biased, and how did it become part of this model.

Hyperparameters — Configuration settings that control how a model trains: learning rate (how fast it adjusts), max depth (how complex it can become), number of iterations. Not learned from data — set by the engineer before training begins.

LIA (Legitimate Interest Assessment) — A documented evaluation required under GDPR when processing is based on legitimate interests rather than consent. Must demonstrate that the organization's interest does not override the individual's rights and freedoms.