AI Model Selection & Comparison
Picking a model is one of those things where teams get distracted by leaderboards right away. In real projects, what actually determines the experience isn't "who's strongest" -- it's whether your specific task needs better reasoning, speed, stability, or lower cost.
Start with a decision sequence
Don't jump straight to "which model is best." A more useful order:
- What type of task is this?
- How bad is it if the model gets it wrong once?
- Can users tolerate a 2-5 second wait?
- Do you need tool calling, long context, or JSON output?
- Is this a demo budget or a production budget?
If you haven't thought through these five questions, model comparisons tend to stay at "everyone online says it's good."
Different tasks have different model requirements
| Task type | What matters most | Typical mistake | Selection advice |
|---|---|---|---|
| Chat / QA | Response speed, natural tone | Too slow, too verbose | Start with a mid-tier model |
| Code generation | Instruction following, long context, tool calling | Breaking existing code, missing edge cases | Prioritize engineering stability over benchmarks |
| Document summary | Long context, structured output | Missing key points, hallucinated conclusions | Pair with chunking and output templates |
| Agent workflow | Tool calling, recoverability | Infinite loops, wrong tool calls | Limit tool scope first, then worry about model strength |
| Review / classification | Consistency, low cost | Classification drift, unstable explanations | Small model + clear label set is usually cheaper |
| High-risk scenario | Stability, traceability, refusal boundaries | Hallucination, unauthorized actions, false promises | Multi-model verification or human fallback |
The 6 dimensions that actually matter for selection
1. Task Completion Rate
It's not about whether the model "sounds smart" -- it's about whether it finishes your task.
For example:
- A customer service bot: did it hit the knowledge base and give a correct answer?
- A code assistant: does the patch actually run?
- Form extraction: are the JSON fields stable?
Without task completion rate, a lot of "the model feels great" feedback really just means the language sounds more human.
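One way to make this concrete is a tiny completion check. The sketch below assumes a form-extraction task with a made-up `REQUIRED_FIELDS` schema; a response only "completes" the task if it parses as JSON with every required field present.

```python
import json

# Hypothetical check: did a model's output complete a form-extraction task?
REQUIRED_FIELDS = {"name", "email", "issue"}  # assumed schema for illustration

def completed(raw_output: str) -> bool:
    """A response completes the task only if it parses as JSON and
    contains every required field -- sounding smart doesn't count."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_FIELDS <= data.keys()

def completion_rate(outputs: list[str]) -> float:
    return sum(completed(o) for o in outputs) / len(outputs)

outputs = [
    '{"name": "Ana", "email": "a@x.com", "issue": "refund"}',  # good
    '{"name": "Bo", "issue": "login"}',                        # missing field
    'Sure! Here is the JSON you asked for: ...',               # chatty, no JSON
]
rate = completion_rate(outputs)
```

The third output would score well on "sounds helpful" and still fail the task, which is exactly the gap this metric exposes.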
2. Latency
Users generally won't forgive an 8-second wait just because the answer is smarter.
Especially in these scenarios, latency directly determines whether the product is usable:
- Search box real-time Q&A
- IDE completions
- Form filling assistance
- Sales and customer service chat
One rule of thumb: get the first response out fast, then put complex reasoning into a two-stage workflow.
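That two-stage idea can be sketched in a few lines. `fast_model()` and `strong_model()` below are stand-ins for real API calls, and the `sleep` only simulates the slow call:

```python
import time

# Sketch of the two-stage pattern: fast first response, deeper answer after.
def fast_model(query: str) -> str:
    return f"Quick take on: {query}"          # low-latency draft / acknowledgment

def strong_model(query: str) -> str:
    time.sleep(0.01)                          # stands in for the slow, smart call
    return f"Detailed answer for: {query}"

def answer(query: str, emit):
    emit(fast_model(query))                   # user sees something immediately
    emit(strong_model(query))                 # heavier reasoning follows

chunks = []
answer("reset my password", chunks.append)
```

In production the second stage would typically run asynchronously or stream, but the shape is the same: perceived latency is set by the first emit, not the strongest model.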
3. Cost
Cost isn't just token price. It also includes:
- System prompt length
- Context concatenation strategy
- Retry count
- Tool call count
- Fallback cost after failures
Many teams get the per-token price down but still get ugly bills at month's end because prompts are too long and request counts are too high.
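A back-of-envelope model of total request cost makes this visible. All prices below are made-up illustration numbers, not any provider's real rates:

```python
# Total cost per request, not just token price. Prices are illustrative:
# $ per 1M tokens, input vs. output priced separately.
def request_cost(prompt_tokens, output_tokens,
                 price_in=0.5, price_out=1.5,
                 retries=0, tool_calls=0, tool_call_tokens=300):
    base = prompt_tokens * price_in / 1e6 + output_tokens * price_out / 1e6
    tools = tool_calls * tool_call_tokens * price_in / 1e6
    # Each retry re-sends the full prompt (and usually the tool context too).
    return (base + tools) * (1 + retries)

lean  = request_cost(prompt_tokens=1_000,  output_tokens=400)
bloat = request_cost(prompt_tokens=12_000, output_tokens=400,
                     retries=1, tool_calls=3)
```

Same model, same per-token price: the bloated-prompt request with one retry and a few tool calls costs over ten times the lean one. That multiplier is what shows up on the monthly bill.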
4. Instruction Following
When you need the model to output a fixed structure, stay within boundaries, or only answer based on provided materials -- this dimension matters way more than "good writing."
Especially for:
- JSON-only output
- No fabricated sources
- No unauthorized tool calls
- No responses about unauthorized data
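For "JSON-only output," it helps to enforce the rule with a strict gate rather than trust the model. A minimal sketch: anything other than a single bare JSON object fails, including markdown fences and chatty preambles.

```python
import json

# Strict gate for "JSON-only output": reject anything that isn't a
# single bare JSON object.
def json_only(raw: str):
    raw = raw.strip()
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None

assert json_only('{"label": "refund"}') == {"label": "refund"}
assert json_only('Sure! {"label": "refund"}') is None     # preamble -> reject
assert json_only('```json\n{"label": "x"}\n```') is None  # fenced -> reject
```

How often each model trips this gate is a direct, cheap measurement of instruction following.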
5. Context Capability
Long context doesn't mean "bigger window = magic."
What really matters is whether the model can still:
- Find the actually relevant chunk in a long context
- Not ignore constraints in the second half
- Not treat user-uploaded content as system instructions
A huge window with unstable retrieval and citation will still cause engineering problems.
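A cheap way to probe this is a "needle in a haystack" test: bury one relevant line in filler and check whether the answer actually uses it. The stubs and filler format below are illustrative:

```python
# Minimal needle-in-a-haystack probe for long-context retrieval.
def build_context(needle: str, filler_lines: int = 200) -> str:
    filler = [f"Log entry {i}: nothing notable." for i in range(filler_lines)]
    filler.insert(filler_lines // 2, needle)   # bury the needle mid-document
    return "\n".join(filler)

def passes_needle_test(model_answer: str, expected: str) -> bool:
    return expected.lower() in model_answer.lower()

ctx = build_context("Invoice 4471 was refunded on March 3.")
# Send ctx plus "When was invoice 4471 refunded?" to each candidate model,
# then score the answer with passes_needle_test(answer, "March 3").
```

Run it at several needle positions (start, middle, end); a model that only finds needles near the edges of the window will struggle with real long documents.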
6. Ecosystem & Engineering Integration
A strong model isn't automatically easy to integrate.
In practice, you also need to check:
- Is the SDK mature?
- Is JSON / tool calling stable?
- Is the streaming experience good?
- Are rate limiting, retries, and logging solid?
- Does it support your region and compliance requirements?
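Whatever model you pick, you end up writing this plumbing. A sketch of retry with exponential backoff, where `flaky()` stands in for a real SDK call that rate-limits twice before succeeding:

```python
import random
import time

class RateLimited(Exception):
    pass

def with_retries(call, max_attempts=4, base_delay=0.01):
    """Retry a model call with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt * (1 + random.random()))

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimited()
    return "ok"

result = with_retries(flaky)   # succeeds on the third attempt
```

If a provider's SDK ships this (plus logging and streaming) in good shape, that's real engineering value the leaderboard doesn't show.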
A more production-realistic model layering
| Layer | Primary role | What model fits |
|---|---|---|
| Fast Layer | First response, classification, routing | small model or low-cost model |
| Work Layer | Main task execution: writing, code, summary | mid-to-high-tier general model |
| Verify Layer | Structure validation, content review, double-check | dedicated review model or rule engine |
The benefit: you don't need the most expensive model to do everything.
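The layering above reduces to a routing table. The model names and the task-type lookup below are placeholders; a real router would usually classify intent with a lightweight model first.

```python
# Sketch of three-layer routing. Names are placeholders, not real models.
TASK_TO_LAYER = {
    "classify": "fast",
    "route": "fast",
    "write": "work",
    "code": "work",
    "summarize": "work",
}

LAYER_TO_MODEL = {
    "fast": "small-cheap-model",
    "work": "mid-tier-model",
    "verify": "review-model-or-rules",
}

def route(task_type: str) -> str:
    # Unknown task types default to the Work Layer rather than failing.
    layer = TASK_TO_LAYER.get(task_type, "work")
    return LAYER_TO_MODEL[layer]
```

The Verify Layer typically runs after the Work Layer's output, checking structure and policy, which is why it can often be a rule engine instead of a model at all.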
When to use a large model vs. when not to
Better suited for large models
- Requirements are vague and need strong reasoning and completion
- Task spans multiple documents with complex context
- Code changes require architectural understanding
- You need planning, generation, fixing, and explanation in one conversation
Better suited for small/mid models
- Classification, extraction, label mapping
- FAQ rewriting
- Standard format conversion
- Large-scale batch processing
- Workflows where users accept "upgrade to a stronger model when needed"
One-liner: high-value, low-frequency tasks deserve a strong model. High-frequency, standardized tasks deserve optimized unit cost.
A selection scorecard you can use right now
| Metric | Weight | What to record |
|---|---|---|
| Task completion rate | 30% | Did it correctly finish the core task? |
| Latency | 20% | Time to first token, full response time |
| Cost | 15% | Per-request cost, daily average cost |
| Structure stability | 15% | Is JSON stable? Are fields missing? |
| Security | 10% | Prone to overreach, hallucination, leaks? |
| Integration effort | 10% | SDK, logging, monitoring, retry ease |
Don't rely on a single subjective comparison. Prepare 20-50 representative samples and run a small eval.
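The scorecard collapses to a weighted sum. The per-model scores below are invented for illustration (each metric scored 0-10); the weights mirror the table:

```python
# Scorecard weights from the table above; metric scores are 0-10.
WEIGHTS = {
    "completion": 0.30, "latency": 0.20, "cost": 0.15,
    "structure": 0.15, "security": 0.10, "integration": 0.10,
}

def weighted_score(scores: dict[str, float]) -> float:
    return sum(WEIGHTS[m] * scores[m] for m in WEIGHTS)

model_a = {"completion": 9, "latency": 5, "cost": 4,
           "structure": 8, "security": 7, "integration": 8}
model_b = {"completion": 7, "latency": 9, "cost": 9,
           "structure": 7, "security": 7, "integration": 8}
```

With these numbers, the "weaker" but faster, cheaper model B outscores model A on the weighted total, which is the whole point of weighting beyond raw capability.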
A simple but effective A/B test method
Prepare sample set
-> 20 high-frequency real tasks
-> 10 edge cases
-> 10 high-risk tasks
Same input for each model
-> Same system prompt
-> Same retrieval results
-> Same output format requirements
Record results
-> success / failure
-> failure reason
-> response time
-> average cost
Review
-> Which tasks require upgrading the model?
-> Which tasks can downgrade to save cost?
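The whole procedure fits in a small harness. `run_model()` below is a stub standing in for a real API call; the point is the fields you record for the review step:

```python
import csv
import io

def run_model(model: str, sample: str) -> dict:
    # Stub: a real version would call the provider and time/price the call.
    return {"ok": True, "reason": "", "latency_s": 0.2, "cost_usd": 0.001}

def run_eval(models, samples):
    """Run every sample through every model and log review-worthy fields."""
    rows = []
    for model in models:
        for sample in samples:
            result = run_model(model, sample)
            rows.append({"model": model, "sample": sample, **result})
    return rows

samples = ["real task 1", "edge case 1", "high-risk task 1"]
rows = run_eval(["model-a", "model-b"], samples)

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)   # the table you review for upgrade/downgrade calls
```

Dumping to CSV matters more than it looks: the upgrade/downgrade decision comes from eyeballing failures per task type, not from a single aggregate number.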
Production environment recommendations
- Multi-model fallback: When the primary model hits rate limits, timeouts, or quality drops, auto-switch to a backup.
- Hybrid strategy: Intent recognition, classification, and preprocessing go through a lightweight model; complex generation and code changes go to a stronger model.
- Regular re-evaluation: Model capabilities and pricing change fast. Review quarterly.
- Log your routing decisions: Record why a given task went to a given model so you can optimize routing later.
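The first and last recommendations combine naturally: a fallback chain that also records which route each request took. The stub functions stand in for real provider calls:

```python
# Sketch of multi-model fallback with routing logs. Stubs stand in for
# real provider SDK calls.
def call_with_fallback(calls, log):
    """calls: ordered list of (model_name, callable) pairs."""
    last_err = None
    for name, call in calls:
        try:
            result = call()
            log.append((name, "ok"))           # record the route for review
            return result
        except Exception as e:
            log.append((name, f"failed: {e}"))
            last_err = e
    raise last_err

log = []
def primary():
    raise TimeoutError("upstream timeout")
def backup():
    return "answer from backup"

result = call_with_fallback([("primary", primary), ("backup", backup)], log)
```

The log is what makes quarterly re-evaluation cheap: if 30% of traffic is landing on the backup, that's a routing problem you can see without rerunning the eval.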
Common mistakes
| Mistake | Actual problem | Fix |
|---|---|---|
| Only looking at public leaderboards | Benchmark tasks aren't your real tasks | Build your own small eval set |
| Only looking at model price | Ignoring retry, long prompt, context cost | Look at total request cost |
| One prompt for all models | Different models have different format sensitivities | Do provider-aware adjustments |
| Defaulting to the strongest model | Could be slow, expensive, overengineered | Try layered routing first |
| Only testing success cases | Edge cases only surface after launch | Add dirty data, long text, abnormal input |
Hands-on Exercise
- Pick one of your real tasks, like "turn customer service conversations into ticket summaries."
- Write the same input and let two models run it.
- Score on four dimensions: accuracy, speed, cost, format stability.
- Then decide: single model, dual model, or layered routing.
Summary
Model selection isn't a ranking game -- it's an engineering decision.
If you remember just one thing: look at the task first, then experience and cost, and only then at model reputation.