Every enterprise that attempts AI governance without first completing a data discovery program is building controls on an unknown surface. Consent management requires knowing what personal data you collect. Lineage requires knowing where it flows. Access governance requires knowing who touches it. Bias monitoring requires knowing what the model was trained on. None of these are possible without a comprehensive, automated, continuously maintained inventory of every data asset in the estate. Discovery is Phase 1 because everything downstream depends on it.

What an Enterprise Data Inventory Contains

A data inventory is not a spreadsheet. It is a live, machine-readable registry maintained at four layers:

Layer 1 — Asset Inventory. Every database, table, column, API endpoint, file store, and SaaS integration that holds or transmits data. This includes structured stores (PostgreSQL, MySQL, Snowflake, BigQuery), semi-structured stores (MongoDB, Elasticsearch), object stores (S3, GCS, Azure Blob), and third-party systems (Salesforce, HubSpot, Workday). If data passes through it, it is an asset.

Layer 2 — Sensitivity Classification. Every asset tagged by sensitivity tier. PII (name, email, phone, government ID). Sensitive Personal Data or Information, or SPDI (financial records, health data, biometrics, authentication credentials). Regulated data subject to specific regimes (RBI data localization, PCI-DSS cardholder data, HIPAA protected health information). Internal business data. Public data. Classification is not optional metadata — it determines encryption requirements, access controls, retention ceilings, and cross-border transfer eligibility.

Layer 3 — Ownership Registry. Every asset mapped to three roles. The Data Owner — a business stakeholder accountable for what data is collected, why it is collected, and how it is classified. The Data Steward — an operational role responsible for metadata quality, standards enforcement, and day-to-day data quality resolution. The Data Custodian — a technical role responsible for storage infrastructure, backup, encryption, and access control implementation. Without ownership, an inventory is a snapshot that decays from the moment it is created. With ownership, it is a living system with accountable humans at every node.

Layer 4 — Processing Activity Register. A record of every system that reads or writes each data asset, for what stated purpose, under what legal basis, with what retention period, and shared with which downstream consumers or third parties. This is the equivalent of GDPR Article 30's records of processing activities, and it satisfies the DPDP Act's requirement that Data Fiduciaries document their processing. This register becomes the input for consent validation in Phase 2 and lineage mapping in subsequent phases.
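As a concrete illustration, one register entry can be modeled as a small record. This is a minimal sketch; the field names (system, asset_id, legal_basis, and so on) are illustrative, not a standard schema.

```python
from dataclasses import dataclass, field

# A minimal sketch of one Processing Activity Register entry.
# Field names are illustrative, not a standard schema.
@dataclass
class ProcessingActivity:
    system: str                # system that reads or writes the asset
    asset_id: str              # unique identifier from the asset inventory
    purpose: str               # stated business purpose
    legal_basis: str           # e.g. "consent", "contract", "legal obligation"
    retention_days: int        # retention period tied to this purpose
    downstream_consumers: list[str] = field(default_factory=list)

entry = ProcessingActivity(
    system="billing-service",
    asset_id="pg.prod.users.email",
    purpose="invoice delivery",
    legal_basis="contract",
    retention_days=2555,  # ~7 years, per a hypothetical finance policy
    downstream_consumers=["email-gateway", "erp-sync"],
)
```

Keeping the entry machine-readable is what lets Phase 2 validate consent against it automatically rather than by interview.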

┌─────────────────────────────────────────┐
│         ENTERPRISE DATA INVENTORY       │
├─────────────────────────────────────────┤
│  Layer 4: Processing Activity Register  │
│  What system, what purpose, what basis  │
├─────────────────────────────────────────┤
│  Layer 3: Ownership Registry            │
│  Data Owner · Steward · Custodian       │
├─────────────────────────────────────────┤
│  Layer 2: Sensitivity Classification    │
│  PII · SPDI · Regulated · Internal      │
├─────────────────────────────────────────┤
│  Layer 1: Asset Inventory               │
│  Databases · APIs · File Stores · SaaS  │
└─────────────────────────────────────────┘

Automated Discovery Methods

Manual cataloging does not scale. An enterprise with 400 databases, 2,000 microservices, and 50 SaaS integrations cannot rely on questionnaires and interviews. Discovery must be automated across three vectors:

Schema Profiling — Automated scanning of database metadata: table names, column names, data types, constraints, and sample value patterns. A column named "ssn" with a 9-digit numeric pattern is classified as PII automatically. A column named "card_number" conforming to Luhn algorithm validation is flagged as PCI-regulated. Schema profiling is the fastest path to a baseline inventory. Industry tools: Collibra Data Intelligence, Alation Data Catalog, Apache Atlas, AWS Glue Data Catalog, Google Cloud Dataplex.
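The two detections described above — a 9-digit "ssn" column and a Luhn-valid "card_number" column — can be sketched in a few lines. This is a toy classifier, not how any of the named tools work internally; real profilers combine many such rules with sampling and confidence scoring.

```python
import re

def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 13:  # shortest real card numbers are 13 digits
        return False
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def classify_column(name: str, sample_value: str) -> str:
    """Toy rule set mirroring the two examples in the text."""
    if "ssn" in name.lower() and re.fullmatch(r"\d{9}", sample_value):
        return "PII"
    if "card" in name.lower() and luhn_valid(sample_value):
        return "PCI-regulated"
    return "unclassified"

print(classify_column("ssn", "123456789"))                # PII
print(classify_column("card_number", "4111111111111111"))  # PCI-regulated
```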

Code Scanning — Static analysis of application source code to trace how data enters the system, where it is stored, how it is transformed, and where it is transmitted. Code scanning detects data flows that schema profiling cannot — for example, PII passed through application memory but never persisted to a database, or personal data sent to a third-party analytics SDK embedded in mobile application code. This is the method Privado, Cyera, and BigID use to build runtime data maps directly from source code repositories.
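To make the idea tangible, here is a deliberately tiny static pass using Python's ast module: it flags call sites where a direct argument's variable name hints at PII. Production tools like those named above resolve data flow across functions, files, and SDK boundaries; this sketch inspects only direct arguments, and the PII name hints are illustrative.

```python
import ast

# Illustrative name hints; a real tool uses taint tracking, not names.
PII_HINTS = {"email", "phone", "ssn", "name"}

def find_pii_sinks(source: str) -> list[tuple[int, str]]:
    """Return (line number, variable name) for calls passing a
    PII-named variable as a direct argument."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            for arg in node.args:
                if isinstance(arg, ast.Name) and arg.id.lower() in PII_HINTS:
                    findings.append((node.lineno, arg.id))
    return findings

code = "analytics.track(email)\nlog(order_id)\n"
print(find_pii_sinks(code))  # [(1, 'email')]
```

The value of the real technique is exactly what the toy version misses: PII that is renamed, concatenated, or serialized before it reaches the sink.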

API Discovery — Cataloging every API endpoint that exposes or ingests data. Mapping request and response payloads to sensitivity classifications. An API endpoint returning user profile data containing name, email, and phone number is a PII exposure surface. An API ingesting payment information is a PCI-DSS regulated entry point. API discovery must cover internal service-to-service APIs (often undocumented) and external-facing APIs.
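Mapping payload fields to sensitivity tags can be sketched as a simple lookup over an endpoint's response fields. The field-to-tag mapping here is illustrative; in practice it would come from the same classifier that tags database columns, so API and schema classifications stay consistent.

```python
# Illustrative field-name mappings, not a standard taxonomy.
PII_FIELDS = {"name", "email", "phone"}
PCI_FIELDS = {"card_number", "cvv", "expiry"}

def classify_endpoint(response_fields: list[str]) -> set[str]:
    """Tag an endpoint by the sensitivity of the fields it exposes."""
    tags = set()
    for f in response_fields:
        if f in PII_FIELDS:
            tags.add("PII")
        if f in PCI_FIELDS:
            tags.add("PCI-DSS")
    return tags

print(classify_endpoint(["name", "email", "last_login"]))  # {'PII'}
print(classify_endpoint(["card_number", "amount"]))        # {'PCI-DSS'}
```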

┌──────────────────────────────────────────┐
│            DATA ESTATE                   │
│  Databases · Application Code · APIs     │
│  File Stores · SaaS Integrations         │
└──────────────────┬───────────────────────┘
                   │
        ┌──────────┼──────────┐
        ▼          ▼          ▼
  ┌──────────┐ ┌────────┐ ┌────────────┐
  │ Schema   │ │ Code   │ │ API        │
  │ Profiler │ │Scanner │ │ Cataloger  │
  └────┬─────┘ └───┬────┘ └─────┬──────┘
       │           │             │
       └───────────┼─────────────┘
                   ▼
       ┌───────────────────────┐
       │  CENTRAL DATA CATALOG │
       │  Asset Inventory      │
       │  Classification Tags  │
       │  Ownership Registry   │
       │  Processing Register  │
       └───────────┬───────────┘
                   │
                   ▼
            FEEDS INTO:
            ├─ Consent Gate (Phase 2)
            ├─ Lineage Mapping (Phase 3)
            ├─ Access Governance
            ├─ DPIA Generation
            └─ Regulatory Reporting

Sensitivity Classification in Practice

Classification is not a binary exercise. It is a tiered framework where each tier maps directly to a set of governance controls:

Tier 1 — PII. Name, email, phone number, mailing address, IP address, device identifiers, cookie identifiers. Governance controls: encryption at rest (AES-256), encryption in transit (TLS 1.2+), access restricted to purpose-justified roles, retention ceiling per stated purpose, subject to DSR (Data Subject Request) fulfillment.

Tier 2 — SPDI. Financial account numbers, credit/debit card data, health records, biometric data, authentication credentials, sexual orientation, religious belief, caste. Governance controls: all Tier 1 controls plus column-level encryption or tokenization, enhanced access logging, mandatory masking in non-production environments, restricted cross-border transfer.

Tier 3 — Regulated. Data subject to a specific regulatory regime beyond general privacy law. RBI-regulated financial data requiring domestic storage. PCI-DSS cardholder data requiring segmented network architecture. HIPAA-protected health information requiring BAA (Business Associate Agreement) with every processor. Governance controls: regime-specific, additive to Tier 1 and 2.

Tier 4 — Internal. Non-personal business data. Revenue figures, internal communications, product roadmaps. Governance controls: standard access controls, no specific privacy requirements, retention per business policy.

Tier 5 — Public. Published data, marketing materials, open datasets. Governance controls: minimal. No access restriction. No encryption requirement.

┌──────────┬──────────────────────┬────────────────────────┐
│  Tier    │  Data Examples       │  Governance Controls   │
├──────────┼──────────────────────┼────────────────────────┤
│  1: PII  │  Name, email, phone, │  AES-256 at rest,      │
│          │  IP, device ID       │  TLS in transit,       │
│          │                      │  role-based access,    │
│          │                      │  retention ceiling,    │
│          │                      │  DSR-eligible          │
├──────────┼──────────────────────┼────────────────────────┤
│  2: SPDI │  Financial, health,  │  Tier 1 + tokenization,│
│          │  biometric, caste,   │  column encryption,    │
│          │  credentials         │  masked in non-prod,   │
│          │                      │  cross-border restrict │
├──────────┼──────────────────────┼────────────────────────┤
│  3: REG  │  RBI-regulated,      │  Regime-specific:      │
│          │  PCI cardholder,     │  data localization,    │
│          │  HIPAA PHI           │  network segmentation, │
│          │                      │  BAA required          │
├──────────┼──────────────────────┼────────────────────────┤
│  4: INT  │  Revenue, roadmaps,  │  Standard access       │
│          │  internal comms      │  controls              │
├──────────┼──────────────────────┼────────────────────────┤
│  5: PUB  │  Marketing, open     │  Minimal               │
│          │  datasets            │                        │
└──────────┴──────────────────────┴────────────────────────┘
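Because Tiers 1 through 3 are additive (Tier 2 inherits Tier 1's controls, Tier 3 inherits both), the mapping can be expressed as a merge rather than five independent lists. A minimal sketch, with illustrative control names:

```python
# Illustrative control names; only the additive Tiers 1-3 are modeled.
TIER_CONTROLS = {
    1: {"encrypt_at_rest": "AES-256", "encrypt_in_transit": "TLS 1.2+",
        "access": "role-based", "dsr_eligible": True},
    2: {"tokenization": True, "mask_in_non_prod": True,
        "cross_border": "restricted"},
    3: {"regime_specific": True},  # e.g. localization, segmentation, BAA
}

def controls_for(tier: int) -> dict:
    """Merge controls additively: Tier 3 inherits Tiers 1 and 2."""
    merged = {}
    for t in range(1, min(tier, 3) + 1):
        merged.update(TIER_CONTROLS[t])
    return merged
```

Encoding the tiers this way means a policy engine can answer "what controls apply to this column?" from the classification tag alone.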

Automated classifiers assign tiers using pattern matching (regex on column names, value format detection), NLP-based classification (analyzing free-text fields for PII patterns), and predefined rules (any column in a payments table is assumed PCI-relevant until proven otherwise). Low-confidence classifications are routed to a human reviewer — the Data Steward — for adjudication. Classification confidence scores must be logged for audit.
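The routing logic above is simple enough to show directly. The 0.85 threshold is an illustrative choice, not a standard; the essential part is that both paths log the confidence score for audit.

```python
# Illustrative threshold; tune per classifier and risk appetite.
REVIEW_THRESHOLD = 0.85

def route(column: str, tier: str, confidence: float) -> dict:
    """Auto-apply high-confidence classifications; send the rest
    to the Data Steward. Both outcomes are logged for audit."""
    decision = "auto-apply" if confidence >= REVIEW_THRESHOLD else "steward-review"
    return {"column": column, "tier": tier,
            "confidence": confidence, "decision": decision}

print(route("users.email", "PII", 0.97)["decision"])      # auto-apply
print(route("notes.free_text", "PII", 0.42)["decision"])  # steward-review
```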

What Done Looks Like

Phase 1 is complete when every data asset in the estate is cataloged with a unique identifier. Every asset carries a sensitivity classification at the column level. Every asset has an assigned Data Owner, Data Steward, and Data Custodian. Every processing activity is documented: system, purpose, legal basis, retention, downstream consumers. The catalog is live — automatically updated when schemas change, new tables are created, or new APIs are deployed. A gap report is generated: assets with no owner, no classification, or no documented processing purpose.
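The gap report is the simplest artifact to automate: a pass over catalog records checking for missing owner, classification, or purpose. A sketch, with hypothetical record fields:

```python
# Hypothetical catalog record fields: id, owner, tier, purpose.
def gap_report(catalog: list[dict]) -> dict:
    """List asset IDs missing an owner, a classification, or a purpose."""
    return {
        "no_owner": [a["id"] for a in catalog if not a.get("owner")],
        "no_classification": [a["id"] for a in catalog if not a.get("tier")],
        "no_purpose": [a["id"] for a in catalog if not a.get("purpose")],
    }

catalog = [
    {"id": "pg.users.email", "owner": "crm-lead",
     "tier": "PII", "purpose": "billing"},
    {"id": "s3.exports.dump", "owner": None, "tier": None, "purpose": None},
]
print(gap_report(catalog))
```

An empty report is the exit criterion for Phase 1; a non-empty one is a worklist with named owners to chase.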

Without Phase 1, Phase 2 (Consent and Legal Basis) has no foundation to build on. You cannot validate consent against processing activities you have not documented. You cannot enforce retention policies on data you have not classified. You cannot restrict access to assets you have not inventoried.

Discovery is not a preliminary step. It is the load-bearing foundation of the entire governance architecture.

Next: Article 2 — Consent & Legal Basis Engine

Appendix: Key Terms in Plain Language

Data Asset — Any place where data is stored or any channel through which data moves. A database table, an API, a CSV file in cloud storage.

PII (Personally Identifiable Information) — Any data that can identify a specific individual. Name, email, phone number, government ID, IP address.

SPDI (Sensitive Personal Data or Information) — A subset of PII requiring stronger protection. Financial records, health data, biometric data, passwords, caste, sexual orientation. Defined under India's IT Act.

Data Owner — The business person who decides what data is collected and why. Accountable if data is misclassified or misused.

Data Steward — The person who maintains catalog quality day-to-day. Fixes classification errors, enforces standards, resolves data quality disputes.

Data Custodian — The technical person managing infrastructure: storage, encryption, backups, access controls.

Processing Activity — Any operation performed on data: collection, storage, retrieval, transformation, sharing, deletion.

Schema Profiling — Automatically reading database structure to understand what data exists, without reading actual values at scale.

Code Scanning — Analyzing source code to trace how data flows through the system.

Data Catalog — A searchable, centralized inventory of all data assets. The library catalog for an organization's data estate.

Classification Confidence Score — How certain the automated system is about its classification. High confidence auto-classifies. Low confidence routes to a human.

GDPR Article 30 — The provision of the EU's GDPR requiring organizations to maintain records of all data processing activities.

DSR (Data Subject Request) — A formal request from a user to access, correct, or delete their personal data.

DPIA (Data Protection Impact Assessment) — A formal evaluation of how a processing activity could affect privacy rights.

AES-256 — Industry standard encryption for data at rest. 256-bit key; brute-forcing it is considered computationally infeasible with current technology.

TLS (Transport Layer Security) — Protocol encrypting data in transit. Version 1.2 minimum; 1.3 preferred.

Tokenization — Replacing sensitive values with non-sensitive placeholders. Original stored in a separate secured vault.

BAA (Business Associate Agreement) — HIPAA-required contract for any third party handling patient data.

Luhn Algorithm — Checksum formula validating credit card numbers. Useful for automated PII detection.