
AI Product Metrics: Measurement & Optimization

⏱️ 50 min


The most dangerous state for an AI product is when the team is busy every day but nobody can accurately answer "is this feature actually getting better." Usage alone isn't enough. Thumbs-up alone isn't enough. Model benchmarks definitely aren't enough. AI PMs need to manage a complete metrics chain from value to quality to cost.

So this page isn't just about dashboards. It's about building a metrics system that actually drives decisions.

AI Product Metrics Board


Bottom Line: AI Metrics Can't Just Track Growth -- Track the Price Too

Teams coming from traditional products make a common mistake: if engagement is up, the product must be better. AI products need follow-up questions:

  1. Did users actually complete the task?
  2. What was the completion quality?
  3. How much did this result cost?

Miss any one of these and your metrics will mislead you.


AI Metrics Look More Like a Three-Layer Structure

| Layer | What to track |
| --- | --- |
| Business | Revenue, conversion, retention, ROI |
| Product / Quality | Task success, satisfaction, accuracy, regenerate rate |
| Efficiency / Cost | Latency, token usage, cost per task, margin |

If your dashboard only has the first and third layers without quality in the middle, you won't know why users are churning. If you only have quality without cost, you won't know why the business math doesn't work.


North Star Metric: Don't Make It Too Vague

Many AI products use "AI usage count" as their North Star. That metric is usually too shallow.

A more reasonable approach:

North Star = successful task completion x quality factor

Examples:

| Product type | More credible North Star |
| --- | --- |
| AI writing tool | Weekly adopted output |
| AI support copilot | Resolved tickets assisted by AI |
| AI search | Successful answer sessions |
| AI coding assistant | Accepted AI-generated code changes |

The key is "the user actually used the result," not "AI just said something."
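A quality-weighted North Star can be sketched in a few lines. This is a minimal illustration, not a real schema: the `completed` and `quality` fields are assumptions, where `quality` might be the fraction of output the user actually kept.

```python
def north_star(tasks):
    """Count only completed tasks, each weighted by a 0-1 quality factor
    (e.g. the share of the output the user kept). Raw usage count ignores
    both completion and quality."""
    return sum(t["quality"] for t in tasks if t["completed"])

tasks = [
    {"completed": True,  "quality": 0.9},   # adopted almost as-is
    {"completed": True,  "quality": 0.4},   # heavily rewritten
    {"completed": False, "quality": 0.0},   # user abandoned the task
]
print(north_star(tasks))
```

Here a naive "AI usage count" would report 3, while the quality-weighted North Star reports 1.3, which is much closer to delivered value.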


First Define What Success Means

AI PMs all too easily skip this step and jump straight to event tracking.

But you must first define:

| Scenario | What counts as success |
| --- | --- |
| AI summary | User doesn't need to heavily rewrite to keep using it |
| AI drafting | Output is adopted, not just generated |
| AI support | Problem gets solved, not just more chat turns |
| AI search | User gets a trustworthy answer and stops searching |

Without a success definition, every downstream metric is hollow.
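One way to force this discipline is to write each scenario's success definition as code before instrumenting anything. The thresholds and event fields below are illustrative assumptions, not a real tracking schema.

```python
def is_success(scenario, event):
    """Per-scenario success definitions, mirroring the table above.
    Field names and the 0.3 edit threshold are assumptions."""
    if scenario == "summary":
        # kept the summary without heavy rewriting
        return event["edit_ratio"] < 0.3
    if scenario == "drafting":
        return event["adopted"]           # adopted, not just generated
    if scenario == "support":
        return event["resolved"]          # solved, not just more turns
    if scenario == "search":
        # got a trustworthy answer and stopped searching
        return event["clicked_answer"] and not event["requeried"]
    raise ValueError(f"no success definition for {scenario!r}")

print(is_success("summary", {"edit_ratio": 0.1}))   # True
```

The point of the `ValueError` branch: a scenario without a success definition should fail loudly, not get tracked anyway.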


A Sufficient Core Metrics Set

1. Value metrics

| Metric | What it tells you |
| --- | --- |
| Task success rate | Did the task get done |
| Adoption rate | Do users keep using it |
| Assisted conversion | Did AI actually drive business results |

2. Quality metrics

| Metric | What it tells you |
| --- | --- |
| Satisfaction / thumbs up | User's subjective feeling |
| Regenerate rate | Indirect signal of first-answer dissatisfaction |
| Hallucination rate | Is high-risk content making things up |
| Edit distance / acceptance rate | How much generated content actually got changed |

3. Efficiency metrics

| Metric | What it tells you |
| --- | --- |
| Avg latency | Whether users are willing to wait |
| Tokens per request | Whether prompts are growing out of control |
| Cost per successful task | This is the real business metric |
| Model routing ratio | Whether small/large model split is reasonable |
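Two of these efficiency metrics can be derived from the same per-request log. A minimal sketch, assuming a hypothetical log with `model`, `cost`, and `task_success` fields:

```python
# Illustrative request log; costs are made-up numbers, not benchmarks.
requests = [
    {"model": "small", "cost": 0.002, "task_success": True},
    {"model": "small", "cost": 0.002, "task_success": False},
    {"model": "large", "cost": 0.030, "task_success": True},
    {"model": "large", "cost": 0.030, "task_success": False},
]

total_cost = sum(r["cost"] for r in requests)
successes = sum(r["task_success"] for r in requests)

# Failed requests still cost money, so divide TOTAL spend by successes,
# not successful-request spend by successes.
cost_per_successful_task = total_cost / successes

# Share of traffic handled by the small model.
small_share = sum(r["model"] == "small" for r in requests) / len(requests)

print(cost_per_successful_task, small_share)   # 0.032 0.5
```

Note how cost per successful task (0.032) is well above the average cost per request (0.016), which is exactly the gap the next section is about.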

Most Commonly Misread Metrics

| Metric | Why it misleads |
| --- | --- |
| Session length | Longer isn't necessarily better -- model might not be solving the problem |
| Total prompts | More doesn't mean value -- users might be retrying |
| Thumbs up rate | People who don't give feedback aren't necessarily satisfied |
| Avg cost per request | Without combining with success rate, insufficient information |

AI PMs need to build a habit: any single metric needs a counter-metric alongside it.
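One way to make the habit stick is to encode the pairings as data, so a dashboard or review script always pulls the counter-metric together with the headline metric. The pairings below follow the table above; the metric names are illustrative.

```python
# Each headline metric mapped to the counter-metric that keeps it honest.
COUNTER_METRIC = {
    "session_length": "task_success_rate",        # long may mean stuck
    "total_prompts": "regenerate_rate",           # volume may be retries
    "thumbs_up_rate": "feedback_rate",            # silence isn't satisfaction
    "avg_cost_per_request": "cost_per_successful_task",
}

def review_pair(metric, snapshot):
    """Return the metric together with its counter-metric, so neither
    is ever read alone."""
    counter = COUNTER_METRIC[metric]
    return {metric: snapshot[metric], counter: snapshot[counter]}

snapshot = {"total_prompts": 12000, "regenerate_rate": 0.31}
print(review_pair("total_prompts", snapshot))
```

Here 12,000 prompts looks like growth until the 31% regenerate rate next to it suggests a lot of that volume is retries.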


How to Build a More Practical Dashboard

A more credible metrics board has at least 4 sections:

| Section | Core question |
| --- | --- |
| Acquisition / activation | Are users actually entering the AI scenario |
| Task quality | First-answer, re-answer, adoption quality |
| Cost / performance | Response speed, cost level |
| Risk / trust | Bad answers, safety issues, complaints |

This is way more useful than staring at a single "DAU curve."


Quality Metrics Must Mix Automated and Manual

Many AI products don't have good automated evaluation early on, so human review can't be skipped.

A more stable approach:

online metrics
  + sampled human review
  + labeled bad cases
  + weekly trend review

Automated metrics tell you "where problems might exist." Human review tells you "what the problem actually is."
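The "sampled human review" step can be as simple as prioritizing the sessions that online metrics already flag. A sketch, assuming hypothetical `regenerated` and `thumbs_down` session flags:

```python
import random

def sample_for_review(sessions, k=50, seed=0):
    """Weekly human-review batch: prefer sessions that online metrics
    flag as suspicious; fall back to the full population if there are
    too few. Fixed seed keeps weekly samples reproducible."""
    suspicious = [s for s in sessions
                  if s["regenerated"] or s["thumbs_down"]]
    pool = suspicious if len(suspicious) >= k else sessions
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))

# Toy data: every third session triggered a regenerate.
sessions = [{"id": i, "regenerated": i % 3 == 0, "thumbs_down": False}
            for i in range(200)]
batch = sample_for_review(sessions, k=20)
print(len(batch))   # 20
```

The division of labor stays as described: the flags say where to look, the human labels on the sampled batch say what is actually wrong.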


Cost Metrics Must Connect to Business

Just reporting monthly API bills has little management value. You should be looking at:

| Metric | More useful question |
| --- | --- |
| Cost per request | How much does each call cost |
| Cost per successful task | How much to get one thing done |
| AI gross margin | Is there room left after AI costs |
| Wasted generation ratio | How much generated content never gets used |

If you see usage growing but cost per successful task growing with it, that's not necessarily good news.
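Three of the four cost views above fall out of one pass over a generation log. The numbers and fields here are illustrative assumptions, not benchmarks; `revenue` stands in for whatever revenue attribution your product uses.

```python
# Toy generation log: half the outputs were never adopted.
generations = [
    {"cost": 0.02, "used": True},
    {"cost": 0.02, "used": False},   # generated but never used
    {"cost": 0.02, "used": True},
    {"cost": 0.02, "used": False},
]
revenue = 0.50   # revenue attributed to these generations (assumption)

total_cost = sum(g["cost"] for g in generations)
cost_per_request = total_cost / len(generations)
wasted_ratio = sum(not g["used"] for g in generations) / len(generations)
ai_gross_margin = (revenue - total_cost) / revenue

print(round(cost_per_request, 3), wasted_ratio, round(ai_gross_margin, 2))
# 0.02 0.5 0.84
```

A 50% wasted generation ratio means half the spend produced nothing anyone used, which is the kind of finding a raw monthly API bill can never surface.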


A Simple But Sufficient Weekly Review

Each week, the AI PM should answer at least these 5 questions:

  1. Which use case had a success rate change?
  2. Which type of bad answers increased?
  3. Why are users regenerating?
  4. Which model route is burning the most money?
  5. Which metric change is worth putting on next week's roadmap?

Lock in these 5 questions and team data discussions get much clearer.


Practice

Take an AI feature you're working on. Look at your current dashboard, then fill in 3 questions:

  1. Is there a clear success definition right now?
  2. Is there a cost per successful task?
  3. Is there a stable human review sampling mechanism?

If all 3 are missing, this metrics system is basically still in the "spectator" stage.


❓ FAQ

The most commonly searched questions about this chapter's topic

What is the three-layer structure of an AI product metrics system?

Business (revenue, conversion, retention, ROI), Product/Quality (task success, satisfaction, accuracy, regenerate rate), and Efficiency/Cost (latency, token usage, cost per task, margin). With only the first and third layers and no quality layer in the middle, you won't know why users churn; with quality but no cost, the business math never works out.

Why is "AI usage count" unreliable as a North Star?

It's too shallow: it only measures that "the AI said something," not that "the user used the result." A more reasonable formula is successful task completion × quality factor: weekly adopted output for AI writing, AI-assisted resolved tickets for a support copilot, successful answer sessions for AI search, and accepted AI-generated code changes for a coding assistant.

Which AI metrics look normal but actually mislead you?

Four common traps: session length (longer isn't necessarily better; the model may not be solving the problem), total prompts (more doesn't mean value; users may be retrying over and over), thumbs-up rate (people who give no feedback aren't necessarily satisfied), and average cost per request (insufficient information without the success rate). Every single metric needs a counter-metric read alongside it.

How should you write an AI product's success definition?

Write it concretely per scenario: for AI summary, the user can keep using the output without heavy rewriting; for AI drafting, the output is adopted, not just generated; for AI support, the problem gets solved instead of just racking up chat turns; for AI search, the user gets a trustworthy answer and stops searching. Without a success definition, every downstream metric is hollow.

How should AI cost metrics be reported to be meaningful to the business?

Reporting the monthly API bill alone has little management value. Watch four: cost per request (what each call costs), cost per successful task (what it costs to get one thing done), AI gross margin (what's left after AI costs), and wasted generation ratio (how much generated content never gets used). If usage is growing but cost per successful task is growing with it, that's not necessarily good news.