

⏱️ 30 min

AI Data Governance & Privacy

When building LLM features at a company, the thing most often overlooked is data flow, not model quality.

Many demos work because they default to "send everything to the model." Many projects then stall at launch because, when compliance review arrives, nobody can explain where the data comes from, where it goes, or how long it stays.

[Figure: AI Data Governance Flow]


What this chapter solves

You need to be able to answer these questions:

  • What data is actually being sent with this model call?
  • Which data must be redacted, and which data shouldn't be uploaded at all?
  • How long are temporary files, chat logs, and retrieval snippets kept?
  • Can data from different regions or different customers be stored together?
  • If something goes wrong, can you trace who accessed what?

If these questions don't have clear answers, your product won't make it into enterprise scenarios no matter how smart it is.


Three bottom-line principles

1. Minimum necessary

Only send the model data that's required to complete the task.

For example, when summarizing meeting notes, you typically don't need the entire CRM record, the user's ID number, or payment details tagging along.
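One way to enforce "minimum necessary" is an explicit field allowlist applied just before the model call. The sketch below is illustrative only: the field names, `ALLOWED_FIELDS`, and `build_model_payload` are hypothetical, not part of any particular framework.

```python
from typing import Any

# Hypothetical allowlist: the fields a meeting-summary task is assumed to need.
ALLOWED_FIELDS = frozenset({"meeting_title", "attendee_roles", "notes"})


def build_model_payload(record: dict[str, Any],
                        allowed: frozenset = ALLOWED_FIELDS) -> dict[str, Any]:
    """Return only the allowlisted fields; everything else stays behind."""
    return {key: value for key, value in record.items() if key in allowed}


# Example record with placeholder values for the fields that must not leave.
record = {
    "meeting_title": "Q3 planning",
    "notes": "Agreed to ship the beta in October.",
    "id_number": "<placeholder>",
    "payment_details": "<placeholder>",
}
payload = build_model_payload(record)
```

An allowlist fails closed: a field added to the CRM record later is not sent until someone decides the task needs it.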

2. Purpose limitation

Data is only used for the currently declared task.

Data uploaded today for document summarization shouldn't automatically become a training sample or long-term memory.
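Purpose limitation can be made checkable by recording the declared purposes alongside the data and refusing any other use. This is a minimal sketch under that assumption; `StoredDocument`, `assert_purpose_allowed`, and the purpose strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredDocument:
    doc_id: str
    declared_purposes: frozenset  # purposes declared at upload time


def assert_purpose_allowed(doc: StoredDocument, purpose: str) -> None:
    """Raise if the document is about to be used for an undeclared purpose."""
    if purpose not in doc.declared_purposes:
        raise PermissionError(
            f"{doc.doc_id} was not uploaded for purpose {purpose!r}"
        )


doc = StoredDocument("doc-1", frozenset({"summarization"}))
assert_purpose_allowed(doc, "summarization")  # declared purpose: passes
```

Calling `assert_purpose_allowed(doc, "training")` would raise `PermissionError`, which is the point: reuse requires an explicit new declaration.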

3. Environment isolation

Secrets, logs, and storage across dev, test, and production must be separated.

Don't copy production conversations into a local prompt playground for debugging.


A complete data handling pipeline

Stage | What to do | Common mistake
Ingress | Validate file type, size, source | Accept everything, figure it out later
Preprocess | Redact, classify, chunk | Send raw text straight to the model
Retrieval | Filter data by tenant and permission | Retrieval scope too broad
Generation | Only provide context needed for current task | Stuff entire history into context
Egress | Redact output, content review | Return raw snippets directly to users
Storage | Define TTL, encrypt, audit | Keep temporary files forever
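As an illustration of the ingress stage, the sketch below checks file type, size, and source before anything else runs. The allowed suffixes, size limit, and source names are assumptions made up for this example; real values depend on your product.

```python
from pathlib import Path

# All three limits are hypothetical example values.
ALLOWED_SUFFIXES = {".pdf", ".txt", ".docx"}
MAX_BYTES = 10 * 1024 * 1024
TRUSTED_SOURCES = {"web_upload", "internal_api"}


def validate_upload(filename: str, size_bytes: int, source: str) -> list:
    """Return a list of problems; an empty list means the upload may proceed."""
    problems = []
    if Path(filename).suffix.lower() not in ALLOWED_SUFFIXES:
        problems.append("file type not allowed")
    if size_bytes <= 0 or size_bytes > MAX_BYTES:
        problems.append("file size out of range")
    if source not in TRUSTED_SOURCES:
        problems.append("unknown source")
    return problems
```

Note that a suffix check alone does not prove the file's real type; it only rejects the obvious cases early.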

Data classification needs to be actionable, not just a slide

A good-enough classification:

Data level | Examples | Handling guidance
Public | Website copy, public FAQ | Can be used directly in retrieval
Internal | Internal SOPs, training materials | Control access scope, maintain audit trail
Confidential | Contracts, quotes, business reports | Strict permission control, restrict outbound
Sensitive / PII | Phone numbers, emails, ID numbers, addresses | Redact by default, don't send to model directly
Secret | API keys, database passwords, private keys | Never enters prompts or logs

Classification isn't about looking professional -- it's about deciding whether each data type can be uploaded, cached, logged, and displayed.
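To make the classification actionable, each level can map to explicit yes/no decisions that code consults. The sketch below is one possible encoding. The PII and Secret rows follow the guidance above; the flags for Public, Internal, and Confidential are assumptions for illustration and should be set by your own policy.

```python
from dataclasses import dataclass
from enum import Enum


class DataLevel(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"
    PII = "pii"
    SECRET = "secret"


@dataclass(frozen=True)
class HandlingPolicy:
    send_raw_to_model: bool  # may the unredacted value go into a prompt?
    cache: bool
    log: bool
    display: bool


# Illustrative values; only the PII and SECRET rows come from the guidance above.
POLICIES = {
    DataLevel.PUBLIC: HandlingPolicy(True, True, True, True),
    DataLevel.INTERNAL: HandlingPolicy(True, True, True, True),
    DataLevel.CONFIDENTIAL: HandlingPolicy(True, False, False, True),
    DataLevel.PII: HandlingPolicy(False, False, False, False),
    DataLevel.SECRET: HandlingPolicy(False, False, False, False),
}


def can_send_raw_to_model(level: DataLevel) -> bool:
    """Look up whether unredacted data of this level may enter a prompt."""
    return POLICIES[level].send_raw_to_model
```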


Redaction & Preprocessing

Common fields that need handling:

  • Phone numbers, emails, addresses
  • User IDs, order numbers, account numbers
  • Banking info, identity documents
  • Internal project codenames, client lists

Make redaction a standalone preprocessing step rather than scattering it across business logic.

That way you can upgrade rules uniformly and do tenant-level configuration more easily.
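A standalone redaction step might look like the sketch below. The regular expressions are deliberately simple and will miss many real-world formats; they are assumptions for illustration, not production-grade PII detection. Passing `rules` as a parameter is what allows per-tenant configuration.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
DEFAULT_RULES = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("PHONE", re.compile(r"\+?\d[\d\s().-]{7,}\d")),
]


def redact(text: str, rules=DEFAULT_RULES) -> str:
    """Replace every match of each rule with a [LABEL] placeholder."""
    for label, pattern in rules:
        text = pattern.sub(f"[{label}]", text)
    return text


# A tenant-specific rule set (hypothetical) that also masks order numbers.
tenant_rules = DEFAULT_RULES + [("ORDER", re.compile(r"\bORD-\d+\b"))]
```

Because every caller goes through `redact`, a rule upgrade takes effect everywhere at once.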


Storage & Retention Strategy

What shouldn't be retained long-term

  • Temporary audio/video transcript uploads
  • Intermediate reasoning drafts
  • Debug-phase prompts / responses
  • Failure logs containing sensitive context

What can be retained, with limits

  • Structured task results
  • Redacted audit logs
  • Model call metadata
  • User-authorized conversation summaries

A retention policy you can adopt directly:

Data type | Suggested TTL
Temporary files | 24 hours to 7 days
Intermediate parse results | 1 to 3 days
Audit logs | Per compliance requirements
Business results | Managed by the business system, not model cache
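A retention policy only helps if something enforces it. The sketch below shows an expiry check a cleanup job could call. The TTL values are picked from the upper end of the suggested ranges, and the type names and `is_expired` helper are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Example TTLs taken from the upper end of the suggested ranges.
TTL_BY_TYPE = {
    "temporary_file": timedelta(days=7),
    "intermediate_parse_result": timedelta(days=3),
}


def is_expired(data_type: str, created_at: datetime,
               now: Optional[datetime] = None) -> bool:
    """Return True once the item has lived for at least its TTL."""
    if data_type not in TTL_BY_TYPE:
        # Fail loudly so no data type silently ends up without a TTL.
        raise ValueError(f"no TTL defined for data type: {data_type!r}")
    now = now or datetime.now(timezone.utc)
    return now - created_at >= TTL_BY_TYPE[data_type]
```

Audit logs and business results are left out on purpose, since their retention is governed by compliance requirements and the business system respectively.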

Region & Tenant Isolation

A common trap in enterprise projects: "logically separated by tenant, physically mixed together."

At minimum, a more reliable approach looks like this:

  • All retrieval and database access includes tenant_id
  • Vector stores or indexes are logically isolated by tenant
  • Different regions route to different storage and provider endpoints
  • Certain customers can disable training, log retention, or external model calls

If your system serves multiple countries or enterprise customers, decide this at architecture design time. Don't retrofit it after you've signed the contract.
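The "every retrieval includes tenant_id" rule is easiest to keep when the API makes it impossible to forget. The sketch below uses an in-memory stand-in for a vector store with naive substring matching; `TenantScopedIndex` and `Chunk` are hypothetical names, and a real store would apply the same filter in its query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    tenant_id: str
    text: str


class TenantScopedIndex:
    """In-memory stand-in for a vector store; every search needs a tenant_id."""

    def __init__(self, chunks):
        self._chunks = list(chunks)

    def search(self, query: str, *, tenant_id: str) -> list:
        # tenant_id is keyword-only and mandatory, so callers cannot omit it.
        if not tenant_id:
            raise ValueError("tenant_id is required for every retrieval call")
        needle = query.lower()
        return [
            chunk for chunk in self._chunks
            if chunk.tenant_id == tenant_id and needle in chunk.text.lower()
        ]


index = TenantScopedIndex([
    Chunk("acme", "Refund policy: 30 days"),
    Chunk("globex", "Refund policy: 14 days"),
])
results = index.search("refund", tenant_id="acme")
```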


Third-Party Model & Tool Integration

When connecting providers, it's not just about whether the API works. You also need to confirm:

  • Whether data training opt-out is supported
  • Default log retention duration
  • Whether a DPA or enterprise agreement is signed
  • Whether data is processed across regions
  • Whether tool calling and logs can leak raw content

Self-hosted models aren't inherently secure either. You're still responsible for:

  • Image and dependency updates
  • Network egress control
  • Storage encryption
  • Audit and patch cadence

Audit & Access Control

At minimum, log these:

  • Who accessed what data
  • When the call was initiated
  • Which model and which configuration version were used
  • Whether redaction or security policy was triggered
  • Whether high-risk tools were called

But be aware: audit logs themselves can become sensitive data sources.

So "traceable" doesn't mean "store all raw content verbatim."
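One way to stay traceable without storing raw content is to log metadata plus a digest of the prompt. The sketch below assumes that approach; the field names and `build_audit_event` are hypothetical. Keep in mind that a hash of short or guessable text can be brute-forced, so the digest is a correlation aid, not an anonymization guarantee.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_audit_event(*, user_id, tenant_id, resource_id, model,
                      config_version, prompt, redaction_triggered,
                      high_risk_tools):
    """Describe a model call without storing the prompt text itself."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "tenant_id": tenant_id,
        "resource_id": resource_id,
        "model": model,
        "config_version": config_version,
        "redaction_triggered": redaction_triggered,
        "high_risk_tools": list(high_risk_tools),
        # Digest and length let you correlate calls without keeping the text.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt_chars": len(prompt),
    }


event = build_audit_event(
    user_id="u-123", tenant_id="acme", resource_id="doc-42",
    model="example-model", config_version="v3",
    prompt="Summarize the attached contract.",
    redaction_triggered=True, high_risk_tools=[],
)
line = json.dumps(event)  # one JSON line per call, ready for a log sink
```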


Minimum launch checklist

  • Classify data first, then decide if it can go to the model
  • Redact sensitive fields before they enter the model
  • Set TTL on temporary data
  • Isolate retrieval and tool access by tenant
  • Make audit logs record actions rather than blindly store raw content
  • Clarify the provider's retention and training policies

Hands-on Exercise

Take one of your AI workflows and audit it:

  1. Map the entire path from user input to model output
  2. Mark where PII is touched
  3. Mark which data doesn't actually need to be uploaded
  4. Write a retention duration for each data category

Once you've done this, your "data governance" actually lives in the system instead of just on a slide.

📚 Related Resources