AI Data Governance & Privacy
When building LLM features at a company, the thing that gets overlooked most isn't model quality -- it's data flow.
Many demos work because they default to "send everything to the model." And many projects stall at launch because, when compliance review comes knocking, nobody can explain where data comes from, where it goes, or how long it stays.
What this chapter solves
You need to be able to answer these questions:
- What data is actually being sent with this model call?
- Which data must be redacted, and which data shouldn't be uploaded at all?
- How long are temporary files, chat logs, and retrieval snippets kept?
- Can data from different regions or different customers be stored together?
- If something goes wrong, can you trace who accessed what?
If these questions don't have clear answers, your product won't make it into enterprise scenarios no matter how smart it is.
Three bottom-line principles
1. Minimum necessary
Only send the model data that's required to complete the task.
For example, when summarizing meeting notes, you typically don't need the entire CRM record, the user's ID number, or payment details tagging along.
2. Purpose limitation
Data is only used for the currently declared task.
Data uploaded today for document summarization shouldn't automatically become a training sample or long-term memory.
3. Environment isolation
Secrets, logs, and storage across dev, test, and production must be separated.
Don't copy production conversations into a local prompt playground for debugging.
A complete data handling pipeline
| Stage | What to do | Common mistake |
|---|---|---|
| Ingress | Validate file type, size, source | Accept everything, figure it out later |
| Preprocess | Redact, classify, chunk | Send raw text straight to the model |
| Retrieval | Filter data by tenant and permission | Retrieval scope too broad |
| Generation | Only provide context needed for current task | Stuff entire history into context |
| Egress | Redact output, content review | Return raw snippets directly to users |
| Storage | Define TTL, encrypt, audit | Keep temporary files forever |
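The ingress stage in the table above is the cheapest place to enforce policy. Here is a minimal validation sketch; the allowed extensions and size limit are illustrative placeholders, not recommendations:

```python
import os

# Hypothetical policy values; tune these per deployment.
ALLOWED_TYPES = {".pdf", ".txt", ".md", ".docx"}
MAX_BYTES = 20 * 1024 * 1024  # 20 MB

def validate_upload(filename: str, size_bytes: int) -> None:
    """Reject files at the door instead of 'accepting everything, figuring it out later'."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_TYPES:
        raise ValueError(f"unsupported file type: {ext}")
    if size_bytes > MAX_BYTES:
        raise ValueError(f"file too large: {size_bytes} bytes")
```

Rejecting early keeps every downstream stage (redaction, retrieval, storage) working on a known, bounded input set.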
Data classification needs to be actionable, not just a slide
A good-enough classification:
| Data level | Examples | Handling guidance |
|---|---|---|
| Public | Website copy, public FAQ | Can be used directly in retrieval |
| Internal | Internal SOPs, training materials | Control access scope, maintain audit trail |
| Confidential | Contracts, quotes, business reports | Strict permission control, restrict outbound |
| Sensitive / PII | Phone numbers, emails, ID numbers, addresses | Redact by default, don't send to model directly |
| Secret | API keys, database passwords, private keys | Must never enter prompts or logs |
Classification isn't about looking professional -- it's about deciding whether each data type can be uploaded, cached, logged, and displayed.
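One way to make the classification actionable is to encode it as a policy table that code can consult before every upload, cache write, log line, or render. The levels mirror the table above; the specific true/false values are an illustrative sketch, not a compliance recommendation:

```python
from enum import Enum

class DataLevel(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"
    SENSITIVE = "sensitive"  # PII
    SECRET = "secret"

# For each level: may this data be uploaded, cached, logged, displayed?
# Example values only; your compliance team owns the real matrix.
POLICY = {
    DataLevel.PUBLIC:       dict(upload=True,  cache=True,  log=True,  display=True),
    DataLevel.INTERNAL:     dict(upload=True,  cache=True,  log=True,  display=True),
    DataLevel.CONFIDENTIAL: dict(upload=True,  cache=False, log=False, display=True),
    DataLevel.SENSITIVE:    dict(upload=False, cache=False, log=False, display=False),
    DataLevel.SECRET:       dict(upload=False, cache=False, log=False, display=False),
}

def may_upload(level: DataLevel) -> bool:
    """Gate every model call on the classification, not on developer judgment."""
    return POLICY[level]["upload"]
```

The point is that the decision lives in one table rather than being re-derived ad hoc at every call site.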
Redaction & Preprocessing
Common fields that need handling:
- Phone numbers, emails, addresses
- User IDs, order numbers, account numbers
- Banking info, identity documents
- Internal project codenames, client lists
Make redaction a standalone preprocessing step rather than scattering it across business logic.
That way you can upgrade rules uniformly and do tenant-level configuration more easily.
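A standalone redaction step can be as simple as a table of named patterns applied before anything reaches the model. The two regexes below are deliberately simplified placeholders; real deployments need locale-specific rules and a proper PII detection library:

```python
import re

# Simplified illustrative patterns; production rules are locale-specific.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d -]{7,}\d"),
}

def redact(text: str) -> str:
    """Replace PII with typed placeholders so downstream stages see structure, not values."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Because redaction is one function, upgrading the rules or overriding them per tenant means changing one module, not hunting through business logic.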
Storage & Retention Strategy
What shouldn't be retained long-term
- Temporary audio/video transcript uploads
- Intermediate reasoning drafts
- Debug-phase prompts / responses
- Failure logs containing sensitive context
What can be retained, with limits
- Structured task results
- Redacted audit logs
- Model call metadata
- User-authorized conversation summaries
A retention policy you can adopt directly:
| Data type | Suggested TTL |
|---|---|
| Temporary files | 24 hours to 7 days |
| Intermediate parse results | 1 to 3 days |
| Audit logs | Per compliance requirements |
| Business results | Managed by the business system, not model cache |
Region & Tenant Isolation
A common trap in enterprise projects: "logically separated by tenant, physically mixed together."
A more reliable approach at minimum:
- All retrieval and database access includes `tenant_id`
- Vector stores or indexes are logically isolated by tenant
- Different regions route to different storage and provider endpoints
- Certain customers can disable training, log retention, or external model calls
If your system serves multiple countries or enterprise customers, decide this at architecture design time. Don't retrofit it after you've signed the contract.
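The first two requirements above reduce to one rule: tenant and region filters are part of every query, never an optional parameter. A minimal in-memory sketch (the `Doc` shape is a hypothetical stand-in for your vector store's record type):

```python
from dataclasses import dataclass

@dataclass
class Doc:
    tenant_id: str
    region: str
    text: str

def retrieve(index: list[Doc], tenant_id: str, region: str) -> list[Doc]:
    """Tenant and region filtering happens inside retrieval, not in calling code."""
    return [d for d in index if d.tenant_id == tenant_id and d.region == region]
```

Putting the filter inside the retrieval function means no caller can forget it, which is exactly the failure mode behind "logically separated, physically mixed."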
Third-Party Model & Tool Integration
When connecting providers, it's not just about whether the API works. You also need to confirm:
- Whether data training opt-out is supported
- Default log retention duration
- Whether a DPA or enterprise agreement is signed
- Whether data is processed across regions
- Whether tool calling and logs can leak raw content
Self-hosted models aren't inherently secure either. You're still responsible for:
- Image and dependency updates
- Network egress control
- Storage encryption
- Audit and patch cadence
Audit & Access Control
At minimum, log these:
- Who accessed what data
- When the call was initiated
- Which model and which version config was used
- Whether redaction or security policy was triggered
- Whether high-risk tools were called
But be aware: audit logs themselves can become sensitive data sources.
So "traceable" doesn't mean "store all raw content verbatim."
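One way to get traceability without hoarding raw content is to log a structured record that references content by ID. A sketch of such a record (field names and the model string are illustrative placeholders):

```python
import json
import time

def audit_record(actor: str, resource_id: str, model: str,
                 redaction_triggered: bool, high_risk_tool: bool) -> str:
    """Record the action and its metadata; reference content by ID, never verbatim."""
    return json.dumps({
        "actor": actor,
        "resource_id": resource_id,   # an ID, not the content itself
        "ts": time.time(),
        "model": model,
        "redaction_triggered": redaction_triggered,
        "high_risk_tool": high_risk_tool,
    })
```

If an incident requires the underlying content, the `resource_id` points back to the system of record, where access is itself audited.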
Minimum launch checklist
- Classify data first, then decide if it can go to the model
- Redact sensitive fields before they enter the model
- Set TTL on temporary data
- Isolate retrieval and tool access by tenant
- Audit logs record actions, don't blindly store raw content
- Clarify the provider's retention and training policies
Hands-on Exercise
Take one of your AI workflows and audit it:
- Map the entire path from user input to model output
- Mark where PII is touched
- Mark which data doesn't actually need to be uploaded
- Write a retention duration for each data category
Once you've done this, your "data governance" actually lives in the system instead of just on a slide.
❓ FAQ
The most frequently asked questions about this chapter's topic
What are the three bottom-line principles of AI data governance?
Minimum necessary, purpose limitation, and environment isolation. Minimum necessary: only send the model the data required to complete the task (a meeting summary doesn't need the whole CRM record, ID numbers, and payment details tagging along). Purpose limitation: data uploaded today for a summary can't automatically become a training sample or long-term memory. Environment isolation: strictly separate secrets, logs, and storage across dev / test / production, and never copy production conversations into a local playground for debugging.
How do you make data classification actionable?
Five levels are enough: Public (website copy, public FAQ; usable directly in retrieval), Internal (SOPs, training materials; control scope and keep an audit trail), Confidential (contracts, quotes, reports; strict permissions, restrict outbound), Sensitive/PII (phone numbers, emails, ID numbers, addresses; redact by default, don't send to the model), and Secret (API keys, passwords, private keys; never enters prompts or logs). Classification exists to decide whether each data type can be uploaded, cached, logged, or displayed, not to look good on a slide.
Which AI data shouldn't be retained long-term?
Four categories need short TTLs: temporarily uploaded audio/video transcripts (24 hours to 7 days), intermediate reasoning drafts (1 to 3 days), debug-phase prompts / responses, and failure logs containing sensitive context. What can be retained, with limits: structured task results, redacted audit logs, model call metadata, and user-authorized conversation summaries.
When integrating a third-party LLM provider, what needs confirming beyond whether the API works?
Five things: whether data training can be disabled, the default log retention duration, whether a DPA or enterprise agreement is signed, whether data is processed across regions, and whether tool calling and logs can leak raw content. Self-hosted isn't inherently secure either: image and dependency updates, network egress control, storage encryption, and audit and patch cadence are all your responsibility.
How do you make tenant isolation more than "logically separated"?
Four hard requirements: all retrieval and database access carries tenant_id, vector stores / indexes are logically isolated by tenant, different regions route to different storage and provider endpoints, and certain customers get switches to disable training, log retention, or external model calls. Decide this layer at architecture design time; retrofitting it after the contract is signed is hard.