Part 1 of this series covered the DSR lifecycle — the seven stages from intake to closure, the data model, SLA management, and architectural patterns. That was the skeleton. This part addresses the problem that breaks most DSR implementations: finding the data.
When a verified data subject asks "give me all my data" or "delete all my data," the organization must answer a question it has likely never had to answer before: where, exactly, does this person's data live? Not approximately. Not "probably in these three systems." Exactly. Every system, every table, every field, every backup, every log, every cache, every downstream copy.
This is the data discovery problem, and it is harder than it appears.
Why discovery is hard
Enterprises do not store a person's data in one place. They store fragments of it across dozens of systems that were built independently, by different teams, over different decades, using different identifiers. The customer record in the CRM uses an email address as the primary key. The billing system uses an account number. The analytics warehouse uses a hashed user ID. The support ticketing system uses a phone number. The marketing platform uses a cookie identifier. The mobile app uses a device fingerprint linked to a login token.
A single person may exist as five different entities across five systems, linked by no common identifier. When that person submits a DSR, the system must resolve all five into one, and then query all five systems for every piece of data associated with any of those identifiers. This is not a database query. It is an identity resolution problem layered on top of a distributed systems problem.
The second reason discovery is hard is that most organizations do not have a complete data inventory. They know about their primary systems — the CRM, the billing engine, the production database. They do not know about the spreadsheet a marketing analyst downloaded six months ago, the test environment that still contains production data, the third-party analytics tool that receives a real-time event stream, or the machine learning training dataset that was created from a production snapshot eighteen months ago and never updated or deleted.
Discovery cannot find what the organization has not cataloged. This is why a data catalog — a comprehensive, maintained inventory of every system, dataset, and data flow that touches personal data — is a prerequisite for a functioning DSR system. Without it, every DSR response is incomplete, and every deletion is partial.
The service registry
The operational mechanism that enables discovery is the service registry. This is a structured record of every system that holds or processes personal data, along with the metadata the DSR orchestrator needs to interact with that system.
Each entry in the service registry contains the system name and owner, the categories of personal data it holds (identity, financial, behavioral, communications, location), the identifier it uses to locate a person's data (email, account ID, user ID, phone number, device ID), the API endpoint or interface through which data can be queried or deleted, the expected response time, and any known limitations (the system supports access but not deletion, the system requires batch processing overnight, the system has no API and requires manual extraction).
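A registry entry can be modeled as a small structured record. The sketch below is illustrative, assuming a Python orchestrator; the field names, the category taxonomy values, and the example CRM entry are hypothetical, not a published schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class DataCategory(Enum):
    IDENTITY = "identity"
    FINANCIAL = "financial"
    BEHAVIORAL = "behavioral"
    COMMUNICATIONS = "communications"
    LOCATION = "location"

@dataclass
class RegistryEntry:
    """One system in the service registry (illustrative fields)."""
    system_name: str
    owner: str                      # team accountable for the system
    categories: List[DataCategory]  # categories of personal data held
    identifier_type: str            # e.g. "email", "account_id", "device_id"
    endpoint: Optional[str]         # None means manual fulfillment only
    expected_response: str          # e.g. "seconds", "minutes", "overnight batch"
    supports_deletion: bool = True
    limitations: List[str] = field(default_factory=list)

# Hypothetical entry for a CRM keyed by email address.
crm_entry = RegistryEntry(
    system_name="crm",
    owner="sales-platform-team",
    categories=[DataCategory.IDENTITY, DataCategory.COMMUNICATIONS],
    identifier_type="email",
    endpoint="https://crm.internal/dsr",
    expected_response="seconds",
)
```

Keeping the entry as typed structure rather than free-form documentation is what lets the orchestrator consume the registry programmatically.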
The registry is not a static document. It must be maintained as a living system with ownership assigned and periodic reviews enforced. When a new system is deployed that processes personal data, it must be registered. When a system is decommissioned, its registry entry must be retired. When a system's API changes, the registry must be updated. In practice, the registry degrades within months unless its maintenance is tied to the software deployment process — a privacy gate in the CI/CD pipeline that requires registry registration before a service that handles personal data can go live.
Identity resolution
Before the orchestrator can query the service registry, it must resolve the requestor's identity into every identifier used across the enterprise. The requestor provides one identifier — typically an email address or a logged-in session. The DSR system must expand that into the full set of identifiers that the service registry's systems use.
This requires an identity graph. The graph maps relationships between identifiers: this email is linked to this account number, which is linked to this user ID, which is linked to these device IDs. The graph is populated from authentication systems, account linking records, and cross-system reconciliation.
The quality of the identity graph determines the completeness of every DSR. If the graph is missing a link — if it does not know that the email address the requestor provided is also associated with a legacy account number in the billing system — then the billing system will not be queried, and the response will be incomplete.
Identity resolution must also handle ambiguity. Two people may share a name. An email address may have been recycled. A phone number may have been transferred. The resolution layer must surface ambiguities and route them for manual review rather than silently returning data for the wrong person. Returning another person's data in response to a DSR is not a compliance gap — it is a data breach.
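Treated as a graph problem, resolution is a traversal over identifier links. A minimal sketch, assuming link pairs harvested from authentication and account-linking records, and a hypothetical set of identifiers flagged as shared or recycled:

```python
from collections import defaultdict, deque

def resolve_identifiers(seed, links, ambiguous=frozenset()):
    """Expand one known identifier into the full linked set via BFS.

    `links` is a list of (id_a, id_b) pairs from auth systems and
    account-linking records; `ambiguous` holds identifiers known to be
    shared or recycled. Returns (identifiers, needs_review).
    """
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    seen, queue = {seed}, deque([seed])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    # Any shared/recycled identifier in the cluster routes the request
    # to manual review instead of silently returning someone else's data.
    needs_review = bool(seen & set(ambiguous))
    return seen, needs_review

# Hypothetical link data: email -> account -> user ID -> device.
ids, review = resolve_identifiers(
    "alice@example.com",
    [("alice@example.com", "acct-1001"),
     ("acct-1001", "user-77"),
     ("user-77", "device-abc")],
)
```

Every identifier in the returned set becomes a lookup key for the registry systems that use that identifier type.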
Fan-out orchestration
With the identity resolved and the service registry consulted, the orchestrator knows which systems to query and which identifier to use for each. It now dispatches tasks — one per system.
Fan-out is the pattern. The orchestrator publishes a task for each system, and each system executes independently. For an access request, each system retrieves the person's data and returns it in a structured format. For a deletion request, each system deletes the data and returns a confirmation. For a portability request, each system exports the data in a machine-readable format.
The critical design decisions in fan-out orchestration are concurrency, timeout management, and partial failure handling.
Concurrency. Should all tasks execute in parallel, or should some be sequenced? In most cases, parallel execution is correct — independent systems can be queried simultaneously. However, there are dependencies. If a user's data in System B was derived from data in System A, and the request is a deletion, System A may need to be deleted first to avoid System B re-ingesting data from A during a sync cycle. These dependencies must be modeled in the orchestrator as a directed acyclic graph, not a flat list.
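With the dependencies expressed as a DAG, a topological sort yields a safe deletion order. A sketch using the standard library's `graphlib`, with hypothetical system names and derivation edges:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Each system maps to the upstream systems its data derives from.
# For deletion, sources must be purged before downstream copies so a
# sync cycle cannot re-ingest data that was already deleted.
derived_from = {
    "crm": set(),
    "billing": set(),
    "warehouse": {"crm", "billing"},   # warehouse ingests from crm + billing
    "marketing": {"warehouse"},        # marketing syncs from the warehouse
}

# static_order() emits predecessors first: sources before derived copies.
deletion_order = list(TopologicalSorter(derived_from).static_order())
```

Systems with no edges between them can still be dispatched in parallel within each level of the order; the sort only constrains the pairs that actually depend on each other.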
Timeout management. Not every system responds in the same timeframe. The production database may respond in seconds. The data warehouse may take minutes. The legacy mainframe system may take hours, or require a batch job that runs overnight. The orchestrator must define per-system timeouts that reflect operational reality, not aspirational SLAs. When a timeout is reached, the task must be marked as timed out — not failed — and a retry or escalation path must be triggered. The request should not wait indefinitely for one slow system while the SLA clock runs.
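Per-system timeouts with a distinct timed-out state can be sketched with `asyncio` — the system names and timeout values below are illustrative:

```python
import asyncio

async def run_task(system, coro, timeout_s):
    """Run one system's DSR task under its own timeout.

    On timeout the task is marked 'timed_out' — distinct from 'failed' —
    so a retry or escalation path fires instead of stalling the request.
    """
    try:
        result = await asyncio.wait_for(coro, timeout=timeout_s)
        return system, "complete", result
    except asyncio.TimeoutError:
        return system, "timed_out", None

async def demo():
    async def fast():
        return {"records": 3}

    async def slow():
        await asyncio.sleep(1.0)   # stands in for a slow legacy system
        return {"records": 9}

    # Each system gets a timeout reflecting its operational reality.
    results = await asyncio.gather(
        run_task("db", fast(), timeout_s=5.0),
        run_task("mainframe", slow(), timeout_s=0.05),
    )
    return {system: status for system, status, _ in results}

statuses = asyncio.run(demo())
```

The request as a whole keeps moving: the fast system completes, the slow one is flagged for retry, and the SLA clock is not held hostage by the slowest responder.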
Partial failure. This is the most important design decision. When seven out of eight systems succeed and one fails, what happens? The answer must be: the seven successes are recorded, the one failure is logged with the reason and retried, and the request advances to a state that reflects partial completion. The privacy team must be able to see: "this request is 87% complete, the CRM task failed with a timeout, retry scheduled." The answer must not be: the entire request fails and restarts from scratch.
Partial failure handling requires the per-task data model described in Part 1. Each task tracks its own status independently. The orchestrator computes the aggregate request status from the task statuses. A request is "complete" when all tasks are complete. A request is "partially complete" when some tasks are complete and others are in retry or manual review. A request is "blocked" when a task has failed permanently and requires human intervention.
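The aggregation rules above reduce to a small function over the per-task statuses. The status vocabulary here is illustrative, and the eight-system example mirrors the "87% complete" scenario described earlier:

```python
def aggregate_status(task_statuses):
    """Compute request-level status from per-task statuses:
    all complete -> complete; any permanent failure -> blocked;
    anything else in flight -> partially complete."""
    statuses = set(task_statuses.values())
    if statuses == {"complete"}:
        return "complete"
    if "failed_permanent" in statuses:
        return "blocked"
    return "partially_complete"

def progress_pct(task_statuses):
    """Percentage of tasks complete, for the privacy team's dashboard."""
    done = sum(1 for s in task_statuses.values() if s == "complete")
    return done * 100 // len(task_statuses)

# Hypothetical request: seven of eight systems done, CRM retrying.
tasks = {
    "crm": "retrying",
    "billing": "complete",
    "warehouse": "complete",
    "support": "complete",
    "marketing": "complete",
    "auth": "complete",
    "mobile": "complete",
    "search": "complete",
}
```

Because the aggregate is derived rather than stored, a task moving from "retrying" to "complete" automatically advances the request with no separate state update to forget.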
System integration patterns
Each system in the service registry must expose some mechanism for the orchestrator to interact with it. In practice, three patterns cover most cases.
Direct API. The system exposes a purpose-built DSR API. The orchestrator calls GET /dsr/user/{id} for access requests or DELETE /dsr/user/{id} for deletion requests. This is the cleanest pattern and the target state for every system. It gives the system owner control over what is returned or deleted, and it provides a clear contract between the system and the orchestrator. Building this API is the system owner's responsibility. The DSR team provides the specification; the system team implements it.
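The contract between the orchestrator and each system-owned DSR API can be pinned down as an interface. A sketch using `typing.Protocol` — the method names and the adapter below are hypothetical, standing in for real HTTP calls to the system's endpoints:

```python
from typing import Protocol

class DsrConnector(Protocol):
    """Contract each system owner implements against the DSR team's
    specification. Names and shapes are illustrative."""
    def access(self, subject_id: str) -> dict: ...
    def delete(self, subject_id: str) -> bool: ...

class CrmConnector:
    """Hypothetical adapter wrapping GET/DELETE /dsr/user/{id}."""
    def __init__(self, fetch, remove):
        # fetch/remove stand in for real HTTP calls; injected here so
        # the adapter can be exercised without a live system.
        self._fetch, self._remove = fetch, remove

    def access(self, subject_id: str) -> dict:
        return self._fetch(f"/dsr/user/{subject_id}")

    def delete(self, subject_id: str) -> bool:
        return self._remove(f"/dsr/user/{subject_id}")

crm = CrmConnector(
    fetch=lambda path: {"path": path, "records": []},
    remove=lambda path: True,
)
```

Keeping the contract explicit means the orchestrator treats every direct-API system uniformly, and a system team can verify its implementation against the specification before going live.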
Database direct access. When a system has no API and building one is not feasible in the required timeframe, the orchestrator may query the system's database directly. This is fragile and should be treated as a temporary measure. Direct database access requires knowledge of the schema, which changes without notice. It bypasses application-level business logic that may be relevant — soft delete flags, cascade rules, referential integrity checks. And it creates a coupling between the DSR system and the internal implementation of another team's service. But sometimes it is the only option, particularly for legacy systems with no active development team.
Manual fulfillment. Some systems cannot be queried programmatically at all. A third-party SaaS tool with no API. A physical filing system. A spreadsheet maintained by a business analyst. For these, the orchestrator creates a manual task — assigned to a specific person, with instructions, a deadline, and a status that must be updated when the task is complete. The system must track manual tasks with the same rigor as automated ones. A manual task that is assigned but never completed is a compliance failure.
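Tracking manual tasks with the same rigor as automated ones means giving them the same structured lifecycle. A minimal sketch, with illustrative field names and an overdue check the orchestrator could run on a schedule:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ManualTask:
    """A manual fulfillment task, tracked like an automated one."""
    system: str
    assignee: str
    instructions: str
    deadline: date
    status: str = "assigned"   # assigned -> in_progress -> complete

    def is_overdue(self, today: date) -> bool:
        # An assigned-but-never-completed task past its deadline is a
        # compliance failure; surface it for escalation.
        return self.status != "complete" and today > self.deadline

# Hypothetical task for a third-party SaaS tool with no API.
task = ManualTask(
    system="legacy-saas",
    assignee="j.doe",
    instructions="Export the subject's records via the admin console.",
    deadline=date(2024, 6, 1),
)
```

An overdue sweep over open manual tasks feeds the same escalation path as a timed-out automated task, so neither kind can silently stall.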
Response assembly
For access and portability requests, the orchestrator must assemble the responses from all systems into a single data package for the requestor. This is not concatenation. It is assembly with structure.
The data package must be organized by category, not by source system. The requestor does not care that their name came from the CRM and their purchase history came from the billing system. They care about what categories of data the organization holds: identity data, financial data, transaction history, communication records, behavioral data. The assembly layer must map system responses to a category taxonomy and present a coherent, readable package.
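The regrouping step amounts to mapping each (system, field) pair onto the category taxonomy. A sketch with a hypothetical mapping table and example responses:

```python
# Hypothetical mapping from (system, field) to the category taxonomy.
FIELD_CATEGORY = {
    ("crm", "name"): "identity",
    ("crm", "email"): "identity",
    ("billing", "card_last4"): "financial",
    ("billing", "invoices"): "transaction_history",
    ("support", "tickets"): "communication_records",
}

def assemble(system_responses):
    """Regroup per-system responses by data category, so the package
    is organized around what data exists, not where it lives."""
    package = {}
    for system, fields in system_responses.items():
        for field_name, value in fields.items():
            category = FIELD_CATEGORY.get((system, field_name), "uncategorized")
            package.setdefault(category, {})[f"{system}.{field_name}"] = value
    return package

pkg = assemble({
    "crm": {"name": "Alice", "email": "alice@example.com"},
    "billing": {"card_last4": "4242"},
})
```

Anything landing in an "uncategorized" bucket is a signal that the registry's category metadata is incomplete, which is worth surfacing to the registry owners rather than shipping to the requestor.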
The package must also redact data that belongs to other people. A support ticket may contain the requestor's data but also the name of the support agent. A shared account may contain the requestor's data interleaved with another person's data. The assembly layer must filter for the requestor's data only. This is a non-trivial data processing step that requires entity awareness, not simple field extraction.
The format of the package matters. GDPR requires data to be provided in a "structured, commonly used and machine-readable format." In practice, this means JSON or CSV for structured data and PDF for human-readable presentation. Most mature DSR systems generate both: a structured export for portability and a formatted report for readability.
What comes next
This part covered the discovery problem, the service registry, identity resolution, fan-out orchestration, system integration patterns, and response assembly. It addressed how to find a person's data and how to collect it from systems that were never designed to answer that question.
Part 3 addresses the harder problem: deletion. Finding data is an engineering challenge. Deleting it — across live systems, backups, caches, event logs, derived datasets, and downstream consumers, while respecting retention obligations and legal holds — is an architectural one.