AI Data Governance & Privacy
When building LLM features at a company, the thing that gets overlooked most isn't model quality -- it's data flow.
Many demos work because they default to "send everything to the model." And many projects stall at launch because, when compliance review comes knocking, nobody can explain where data comes from, where it goes, or how long it stays.
What this chapter solves
You need to be able to answer these questions:
- What data is actually being sent with this model call?
- Which data must be redacted, and which data shouldn't be uploaded at all?
- How long are temporary files, chat logs, and retrieval snippets kept?
- Can data from different regions or different customers be stored together?
- If something goes wrong, can you trace who accessed what?
If these questions don't have clear answers, your product won't make it into enterprise scenarios no matter how smart it is.
Three bottom-line principles
1. Minimum necessary
Only send the model data that's required to complete the task.
For example, when summarizing meeting notes, you typically don't need the entire CRM record, the user's ID number, or payment details tagging along.
2. Purpose limitation
Data is only used for the currently declared task.
Data uploaded today for document summarization shouldn't automatically become a training sample or long-term memory.
3. Environment isolation
Secrets, logs, and storage across dev, test, and production must be separated.
Don't copy production conversations into a local prompt playground for debugging.
A complete data handling pipeline
| Stage | What to do | Common mistake |
|---|---|---|
| Ingress | Validate file type, size, source | Accept everything, figure it out later |
| Preprocess | Redact, classify, chunk | Send raw text straight to the model |
| Retrieval | Filter data by tenant and permission | Retrieval scope too broad |
| Generation | Only provide context needed for current task | Stuff entire history into context |
| Egress | Redact output, content review | Return raw snippets directly to users |
| Storage | Define TTL, encrypt, audit | Keep temporary files forever |
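The ingress stage in the table above is the cheapest place to enforce policy. Here is a minimal validation sketch; the allowed extensions and size limit are illustrative placeholders, not recommendations:

```python
import os

# Hypothetical policy values; tune these per deployment.
ALLOWED_TYPES = {".pdf", ".txt", ".md", ".docx"}
MAX_BYTES = 20 * 1024 * 1024  # 20 MB

def validate_upload(filename: str, size_bytes: int) -> None:
    """Reject files at the door instead of 'accepting everything, figuring it out later'."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_TYPES:
        raise ValueError(f"unsupported file type: {ext}")
    if size_bytes > MAX_BYTES:
        raise ValueError(f"file too large: {size_bytes} bytes")
```

Rejecting early keeps every downstream stage (redaction, retrieval, storage) working on a known, bounded input set.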
Data classification needs to be actionable, not just a slide
A good-enough classification:
| Data level | Examples | Handling guidance |
|---|---|---|
| Public | Website copy, public FAQ | Can be used directly in retrieval |
| Internal | Internal SOPs, training materials | Control access scope, maintain audit trail |
| Confidential | Contracts, quotes, business reports | Strict permission control, restrict outbound |
| Sensitive / PII | Phone numbers, emails, ID numbers, addresses | Redact by default, don't send to model directly |
| Secret | API keys, database passwords, private keys | Must never enter prompts or logs |
Classification isn't about looking professional -- it's about deciding whether each data type can be uploaded, cached, logged, and displayed.
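One way to make the classification actionable is to encode it as a policy table that code can consult before every upload, cache write, log line, or render. The levels mirror the table above; the specific true/false values are an illustrative sketch, not a compliance recommendation:

```python
from enum import Enum

class DataLevel(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"
    SENSITIVE = "sensitive"  # PII
    SECRET = "secret"

# For each level: may this data be uploaded, cached, logged, displayed?
# Example values only; your compliance team owns the real matrix.
POLICY = {
    DataLevel.PUBLIC:       dict(upload=True,  cache=True,  log=True,  display=True),
    DataLevel.INTERNAL:     dict(upload=True,  cache=True,  log=True,  display=True),
    DataLevel.CONFIDENTIAL: dict(upload=True,  cache=False, log=False, display=True),
    DataLevel.SENSITIVE:    dict(upload=False, cache=False, log=False, display=False),
    DataLevel.SECRET:       dict(upload=False, cache=False, log=False, display=False),
}

def may_upload(level: DataLevel) -> bool:
    """Gate every model call on the classification, not on developer judgment."""
    return POLICY[level]["upload"]
```

The point is that the decision lives in one table rather than being re-derived ad hoc at every call site.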
Redaction & Preprocessing
Common fields that need handling:
- Phone numbers, emails, addresses
- User IDs, order numbers, account numbers
- Banking info, identity documents
- Internal project codenames, client lists
Make redaction a standalone preprocessing step rather than scattering it across business logic.
That way you can upgrade rules uniformly and do tenant-level configuration more easily.
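A standalone redaction step can be as simple as a table of named patterns applied before anything reaches the model. The two regexes below are deliberately simplified placeholders; real deployments need locale-specific rules and a proper PII detection library:

```python
import re

# Simplified illustrative patterns; production rules are locale-specific.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d -]{7,}\d"),
}

def redact(text: str) -> str:
    """Replace PII with typed placeholders so downstream stages see structure, not values."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Because redaction is one function, upgrading the rules or overriding them per tenant means changing one module, not hunting through business logic.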
Storage & Retention Strategy
What shouldn't be retained long-term
- Temporary audio/video transcript uploads
- Intermediate reasoning drafts
- Debug-phase prompts / responses
- Failure logs containing sensitive context
What can be retained, with limits
- Structured task results
- Redacted audit logs
- Model call metadata
- User-authorized conversation summaries
A retention policy you can adopt directly:
| Data type | Suggested TTL |
|---|---|
| Temporary files | 24 hours to 7 days |
| Intermediate parse results | 1 to 3 days |
| Audit logs | Per compliance requirements |
| Business results | Managed by the business system, not model cache |
Region & Tenant Isolation
A common trap in enterprise projects: "logically separated by tenant, physically mixed together."
A more reliable approach at minimum:
- All retrieval and database access includes `tenant_id`
- Vector stores or indexes are logically isolated by tenant
- Different regions route to different storage and provider endpoints
- Certain customers can disable training, log retention, or external model calls
If your system serves multiple countries or enterprise customers, decide this at architecture design time. Don't retrofit it after you've signed the contract.
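The first two requirements above reduce to one rule: tenant and region filters are part of every query, never an optional parameter. A minimal in-memory sketch (the `Doc` shape is a hypothetical stand-in for your vector store's record type):

```python
from dataclasses import dataclass

@dataclass
class Doc:
    tenant_id: str
    region: str
    text: str

def retrieve(index: list[Doc], tenant_id: str, region: str) -> list[Doc]:
    """Tenant and region filtering happens inside retrieval, not in calling code."""
    return [d for d in index if d.tenant_id == tenant_id and d.region == region]
```

Putting the filter inside the retrieval function means no caller can forget it, which is exactly the failure mode behind "logically separated, physically mixed."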
Third-Party Model & Tool Integration
When connecting providers, it's not just about whether the API works. You also need to confirm:
- Whether data training opt-out is supported
- Default log retention duration
- Whether a DPA or enterprise agreement is signed
- Whether data is processed across regions
- Whether tool calling and logs can leak raw content
Self-hosted models aren't inherently secure either. You're still responsible for:
- Image and dependency updates
- Network egress control
- Storage encryption
- Audit and patch cadence
Audit & Access Control
At minimum, log these:
- Who accessed what data
- When the call was initiated
- Which model and which version config was used
- Whether redaction or security policy was triggered
- Whether high-risk tools were called
But be aware: audit logs themselves can become sensitive data sources.
So "traceable" doesn't mean "store all raw content verbatim."
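One way to get traceability without hoarding raw content is to log a structured record that references content by ID. A sketch of such a record (field names and the model string are illustrative placeholders):

```python
import json
import time

def audit_record(actor: str, resource_id: str, model: str,
                 redaction_triggered: bool, high_risk_tool: bool) -> str:
    """Record the action and its metadata; reference content by ID, never verbatim."""
    return json.dumps({
        "actor": actor,
        "resource_id": resource_id,   # an ID, not the content itself
        "ts": time.time(),
        "model": model,
        "redaction_triggered": redaction_triggered,
        "high_risk_tool": high_risk_tool,
    })
```

If an incident requires the underlying content, the `resource_id` points back to the system of record, where access is itself audited.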
Minimum launch checklist
- Classify data first, then decide if it can go to the model
- Redact sensitive fields before they enter the model
- Set TTL on temporary data
- Isolate retrieval and tool access by tenant
- Audit logs record actions, don't blindly store raw content
- Clarify the provider's retention and training policies
Hands-on Exercise
Take one of your AI workflows and audit it:
- Map the entire path from user input to model output
- Mark where PII is touched
- Mark which data doesn't actually need to be uploaded
- Write a retention duration for each data category
Once you've done this, your "data governance" actually lives in the system instead of just on a slide.
❓ FAQ
The most frequently asked questions about this chapter's topic
What are the three bottom-line principles of AI data governance?
Minimum necessary, purpose limitation, and environment isolation. Minimum necessary: only send the model the data required to complete the task (a meeting summary doesn't need the whole CRM record, ID numbers, and payment details tagging along). Purpose limitation: data uploaded today for a summary can't automatically become a training sample or long-term memory. Environment isolation: strictly separate secrets, logs, and storage across dev / test / production, and never copy production conversations into a local playground for debugging.
How do you make data classification actionable?
Five levels are enough: Public (website copy, public FAQ; usable directly in retrieval), Internal (SOPs, training materials; control scope and keep an audit trail), Confidential (contracts, quotes, reports; strict permissions, restrict outbound), Sensitive/PII (phone numbers, emails, ID numbers, addresses; redact by default, don't send to the model), and Secret (API keys, passwords, private keys; never enters prompts or logs). Classification exists to decide whether each data type can be uploaded, cached, logged, or displayed, not to look good on a slide.
Which AI data shouldn't be retained long-term?
Four categories need short TTLs: temporarily uploaded audio/video transcripts (24 hours to 7 days), intermediate reasoning drafts (1 to 3 days), debug-phase prompts / responses, and failure logs containing sensitive context. What can be retained, with limits: structured task results, redacted audit logs, model call metadata, and user-authorized conversation summaries.
When integrating a third-party LLM provider, what needs confirming beyond whether the API works?
Five things: whether data training can be disabled, the default log retention duration, whether a DPA or enterprise agreement is signed, whether data is processed across regions, and whether tool calling and logs can leak raw content. Self-hosted isn't inherently secure either: image and dependency updates, network egress control, storage encryption, and audit and patch cadence are all your responsibility.
How do you make tenant isolation more than "logically separated"?
Four hard requirements: all retrieval and database access carries tenant_id, vector stores / indexes are logically isolated by tenant, different regions route to different storage and provider endpoints, and certain customers get switches to disable training, log retention, or external model calls. Decide this layer at architecture design time; retrofitting it after the contract is signed is hard.