

⏱️ 30 min

AI Data Governance & Privacy

When building LLM features at a company, the thing that gets overlooked most isn't model quality -- it's data flow.

Many demos work because they default to "send everything to the model." And many projects stall at launch because, when compliance review comes knocking, nobody can explain where data comes from, where it goes, or how long it stays.

[Figure: AI Data Governance Flow]


What this chapter solves

You need to be able to answer these questions:

  • What data is actually being sent with this model call?
  • Which data must be redacted, and which data shouldn't be uploaded at all?
  • How long are temporary files, chat logs, and retrieval snippets kept?
  • Can data from different regions or different customers be stored together?
  • If something goes wrong, can you trace who accessed what?

If these questions don't have clear answers, your product won't make it into enterprise scenarios no matter how smart it is.


Three bottom-line principles

1. Minimum necessary

Only send the model data that's required to complete the task.

For example, when summarizing meeting notes, you typically don't need the entire CRM record, the user's ID number, or payment details tagging along.
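A minimal sketch of the "minimum necessary" rule: declare, per task, which fields are allowed to travel with the request, and drop everything else before the payload is built. The task name and field names here are hypothetical.

```python
# Whitelist of fields each task is allowed to send; anything not listed
# never reaches the model call. Names are illustrative.
TASK_FIELDS = {
    "summarize_meeting": {"title", "attendees", "notes"},
}

def build_payload(task: str, record: dict) -> dict:
    """Keep only the fields declared for this task; drop everything else."""
    allowed = TASK_FIELDS.get(task, set())
    return {k: v for k, v in record.items() if k in allowed}

record = {
    "title": "Q3 planning",
    "notes": "Discussed roadmap.",
    "national_id": "123-45-6789",   # must not travel with the request
    "payment_card": "4111 1111 1111 1111",
}
payload = build_payload("summarize_meeting", record)
# payload contains only title and notes
```

Inverting the default matters: an unknown task gets an empty whitelist, so forgetting to register a task fails closed rather than leaking the whole record.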

2. Purpose limitation

Data is only used for the currently declared task.

Data uploaded today for document summarization shouldn't automatically become a training sample or long-term memory.

3. Environment isolation

Secrets, logs, and storage across dev, test, and production must be separated.

Don't copy production conversations into a local prompt playground for debugging.
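One way to make the separation mechanical rather than a team convention is to resolve secrets, storage, and logging behavior from the current environment. This is a sketch under assumed names (`APP_ENV`, the bucket names, the flags); the point is that prod defaults never apply in dev and vice versa.

```python
import os

# Each environment resolves its own storage target and logging policy,
# so production values cannot leak into a local setup by default.
# All names here are hypothetical.
ENVS = {
    "dev":  {"bucket": "ai-dev-bucket",  "log_raw_prompts": True},
    "prod": {"bucket": "ai-prod-bucket", "log_raw_prompts": False},
}

def settings(env=None) -> dict:
    """Look up config for the given environment, defaulting to dev."""
    env = env or os.environ.get("APP_ENV", "dev")
    return ENVS[env]
```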


A complete data handling pipeline

| Stage | What to do | Common mistake |
| --- | --- | --- |
| Ingress | Validate file type, size, source | Accept everything, figure it out later |
| Preprocess | Redact, classify, chunk | Send raw text straight to the model |
| Retrieval | Filter data by tenant and permission | Retrieval scope too broad |
| Generation | Only provide context needed for the current task | Stuff entire history into context |
| Egress | Redact output, content review | Return raw snippets directly to users |
| Storage | Define TTL, encrypt, audit | Keep temporary files forever |
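The stages above compose naturally as a sequence of functions, each of which can reject or transform a document before the next stage sees it. This is a toy sketch: the stage bodies are placeholders, and a real preprocess step would run actual PII detectors rather than a hardcoded replacement.

```python
# Two of the six stages as composable steps; the remaining stages would
# follow the same shape. Bodies are stand-ins for real checks.
def ingress(doc: dict) -> dict:
    # Ingress: validate type before the file enters the system.
    if doc["type"] not in {"txt", "pdf"}:
        raise ValueError("unsupported file type")
    return doc

def preprocess(doc: dict) -> dict:
    # Preprocess: stand-in redaction; a real step runs PII detectors.
    return dict(doc, text=doc["text"].replace("alice@example.com", "[EMAIL]"))

def pipeline(doc: dict) -> dict:
    for stage in (ingress, preprocess):
        doc = stage(doc)
    return doc
```

Keeping each stage as its own function is what makes the "common mistake" column auditable: you can test that preprocess ran before anything was sent onward.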

Data classification needs to be actionable, not just a slide

A good-enough classification:

| Data level | Examples | Handling guidance |
| --- | --- | --- |
| Public | Website copy, public FAQ | Can be used directly in retrieval |
| Internal | Internal SOPs, training materials | Control access scope, maintain audit trail |
| Confidential | Contracts, quotes, business reports | Strict permission control, restrict outbound sharing |
| Sensitive / PII | Phone numbers, emails, ID numbers, addresses | Redact by default, don't send to the model directly |
| Secret | API keys, database passwords, private keys | Never enters prompts or logs |
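To make the table actionable rather than a slide, each level can map to concrete flags that the pipeline checks before uploading, caching, or logging. A minimal sketch, with level names and flags mirroring the table above:

```python
# Per-level handling policy derived from the classification table.
# Flag names are illustrative.
POLICY = {
    "public":       {"send_to_model": True,  "cache": True,  "log_raw": True},
    "internal":     {"send_to_model": True,  "cache": True,  "log_raw": False},
    "confidential": {"send_to_model": True,  "cache": False, "log_raw": False},
    "pii":          {"send_to_model": False, "cache": False, "log_raw": False},
    "secret":       {"send_to_model": False, "cache": False, "log_raw": False},
}

def can_send(level: str) -> bool:
    # Unknown levels fall back to the most restrictive handling.
    return POLICY.get(level, POLICY["secret"])["send_to_model"]
```

The fallback is the important design choice: unclassified data is treated as secret until someone classifies it, not the other way around.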

Classification isn't about looking professional -- it's about deciding whether each data type can be uploaded, cached, logged, and displayed.


Redaction & Preprocessing

Common fields that need handling:

  • Phone numbers, emails, addresses
  • User IDs, order numbers, account numbers
  • Banking info, identity documents
  • Internal project codenames, client lists

Make redaction a standalone preprocessing step rather than scattering it across business logic.

That way you can upgrade rules uniformly and do tenant-level configuration more easily.
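A standalone step can be as small as a list of (pattern, placeholder) rules applied in order. This sketch assumes regex detection is acceptable for emails and phone numbers; production systems typically layer stronger detectors and per-tenant rule sets on the same interface.

```python
import re

# Redaction as one self-contained pass: rules live in one place,
# so upgrading them doesn't touch business logic.
RULES = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\+?\d[\d\s-]{7,}\d"), "[PHONE]"),
]

def redact(text: str) -> str:
    for pattern, placeholder in RULES:
        text = pattern.sub(placeholder, text)
    return text

redact("Reach me at alice@example.com or +1 415-555-0100")
# both the email and the phone number are replaced by placeholders
```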


Storage & Retention Strategy

What shouldn't be retained long-term

  • Temporary audio/video transcript uploads
  • Intermediate reasoning drafts
  • Debug-phase prompts / responses
  • Failure logs containing sensitive context

What can be retained, with limits

  • Structured task results
  • Redacted audit logs
  • Model call metadata
  • User-authorized conversation summaries

A retention policy you can adopt directly:

| Data type | Suggested TTL |
| --- | --- |
| Temporary files | 24 hours to 7 days |
| Intermediate parse results | 1 to 3 days |
| Audit logs | Per compliance requirements |
| Business results | Managed by the business system, not model cache |
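The first two rows of the table can be enforced by a periodic sweep that compares each object's age against its category TTL. A minimal sketch; the category names are illustrative and the real store would be object storage, not a dict.

```python
from datetime import datetime, timedelta, timezone

# TTLs matching the retention table above (upper bounds chosen here).
TTL = {
    "temp_file": timedelta(days=7),
    "parse_result": timedelta(days=3),
}

def expired(kind: str, created_at: datetime, now: datetime) -> bool:
    """True if an object of this kind has outlived its TTL."""
    return now - created_at > TTL[kind]

expired("temp_file",
        datetime(2024, 6, 1, tzinfo=timezone.utc),
        datetime(2024, 6, 10, tzinfo=timezone.utc))
# nine days old against a 7-day TTL: expired
```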

Region & Tenant Isolation

A common trap in enterprise projects: "logically separated by tenant, physically mixed together."

A more reliable approach at minimum:

  • All retrieval and database access includes tenant_id
  • Vector stores or indexes are logically isolated by tenant
  • Different regions route to different storage and provider endpoints
  • Certain customers can disable training, log retention, or external model calls

If your system serves multiple countries or enterprise customers, decide this at architecture design time. Don't retrofit it after you've signed the contract.
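The first requirement can be sketched as a retrieval layer that applies the tenant filter internally, so callers cannot forget it. The in-memory index here is a stand-in for a real vector store; tenant and document values are made up.

```python
# Toy index standing in for a vector store; every document carries tenant_id.
INDEX = [
    {"tenant_id": "acme", "text": "Acme pricing sheet"},
    {"tenant_id": "globex", "text": "Globex contract draft"},
]

def retrieve(tenant_id: str, query: str) -> list[str]:
    # The tenant gate runs first; ranking/matching happens only inside
    # the tenant's own slice of the index.
    scoped = [d for d in INDEX if d["tenant_id"] == tenant_id]
    return [d["text"] for d in scoped if query.lower() in d["text"].lower()]

retrieve("acme", "pricing")
# only Acme documents are visible, whatever the query says
```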


Third-Party Model & Tool Integration

When connecting providers, it's not just about whether the API works. You also need to confirm:

  • Whether data training opt-out is supported
  • Default log retention duration
  • Whether a DPA or enterprise agreement is signed
  • Whether data is processed across regions
  • Whether tool calling and logs can leak raw content

Self-hosted models aren't inherently secure either. You're still responsible for:

  • Image and dependency updates
  • Network egress control
  • Storage encryption
  • Audit and patch cadence

Audit & Access Control

At minimum, log these:

  • Who accessed what data
  • When the call was initiated
  • Which model and which version config was used
  • Whether redaction or security policy was triggered
  • Whether high-risk tools were called

But be aware: audit logs themselves can become sensitive data sources.

So "traceable" doesn't mean "store all raw content verbatim."
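One common compromise is to log a content hash instead of the content itself: the entry proves which prompt was sent without storing it. A sketch, with field names and the model identifier purely illustrative:

```python
import hashlib
from datetime import datetime, timezone

def audit_entry(user: str, model: str, prompt: str, redacted: bool) -> dict:
    """Audit record that stores a hash of the prompt, never the raw text."""
    return {
        "user": user,
        "model": model,
        "at": datetime.now(timezone.utc).isoformat(),
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "redaction_applied": redacted,
    }

audit_entry("alice", "model-v1", "summarize this contract ...", True)
# the entry answers who/when/which model/was policy applied,
# without itself becoming a store of confidential prompts
```

If an incident requires matching a known prompt to a log entry, hashing the candidate prompt and comparing digests is enough; the log never has to hold the original.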


Minimum launch checklist

  • Classify data first, then decide if it can go to the model
  • Redact sensitive fields before they enter the model
  • Set TTL on temporary data
  • Isolate retrieval and tool access by tenant
  • Audit logs record actions, don't blindly store raw content
  • Clarify the provider's retention and training policies

Hands-on Exercise

Take one of your AI workflows and audit it:

  1. Map the entire path from user input to model output
  2. Mark where PII is touched
  3. Mark which data doesn't actually need to be uploaded
  4. Write a retention duration for each data category

Once you've done this, your "data governance" actually lives in the system instead of just on a slide.


❓ FAQ

The most frequently searched questions on this chapter's topic.

What are the three bottom-line principles of AI data governance?

Minimum necessary, purpose limitation, and environment isolation. Minimum necessary: only send the model the data required to complete the task (a meeting summary doesn't need the entire CRM record plus ID numbers and payment details). Purpose limitation: data uploaded today for summarization can't automatically be used for training or long-term memory. Environment isolation: secrets, logs, and storage for dev / test / production are strictly separated; don't copy production conversations into a local playground for debugging.

How do you classify data so the classification is actionable?

Five levels are enough: Public (website copy, public FAQ; usable directly in retrieval), Internal (SOPs, training materials; control access scope and keep an audit trail), Confidential (contracts, quotes, reports; strict permissions, restrict outbound sharing), Sensitive / PII (phone numbers, emails, ID numbers, addresses; redact by default, never send to the model directly), and Secret (API keys, passwords, private keys; never enters prompts or logs). Classification exists to decide whether each data type can be uploaded / cached / logged / displayed, not to look good on a slide.

Which AI data shouldn't be retained long-term?

Four categories need a short TTL: temporarily uploaded audio/video transcripts (24 hours to 7 days), intermediate reasoning drafts (1 to 3 days), debug-phase prompts / responses, and failure logs containing sensitive context. What can be kept, with limits: structured task results, redacted audit logs, model call metadata, and user-authorized conversation summaries.

Beyond whether the API works, what must you confirm when integrating a third-party LLM provider?

Five things: whether data training can be disabled, the default log retention duration, whether a DPA or enterprise agreement is signed, whether data is processed across regions, and whether tool calling and logs can leak raw content. Self-hosted isn't inherently secure either: image and dependency upgrades, network egress control, storage encryption, and the audit and patch cadence are all your responsibility.

How do you make tenant isolation more than "logically separated"?

Four hard requirements: all retrieval and database access carries tenant_id, vector stores / indexes are logically isolated per tenant, different regions route to different storage and provider endpoints, and certain customers get switches to disable training, log retention, or external model calls. This layer is best decided at the architecture stage; retrofitting it after the contract is signed is hard.