# Model Collection
Model overview and selection guide
## TL;DR
- This page is a foundational LLM index: it helps you quickly build a mental map of "which models exist and what category they fall into," covering GPT-5.1 / GPT-4.1 / o1, Claude 4.5, Gemini 3, Llama 3.1, Grok-2, and more.
- Don't pick models based on parameters/leaderboards alone: what matters more is latency, cost, context length, tool support (e.g., Tool Calling), and your own evaluation results.
- In real projects, you'll typically use a "capability model + fast model" combo: a stronger model for complex planning/reasoning, a faster and cheaper model for routine steps and batch processing.
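The "capability model + fast model" combo can be sketched as a tiny router. This is an illustrative placeholder, not real vendor routing logic: the model ids and the length-based complexity heuristic are assumptions you would replace with your own.

```python
# Sketch of the "capability model + fast model" combo described above.
# Model ids and the heuristic are illustrative assumptions, not a real API.

CAPABILITY_MODEL = "gpt-5.1"   # assumed id: complex planning / reasoning
FAST_MODEL = "gpt-4o-mini"     # assumed id: routine steps / batch work

def pick_model(task: str, needs_planning: bool = False) -> str:
    """Route planning-heavy or very long tasks to the capability tier."""
    if needs_planning or len(task) > 2000:  # crude complexity proxy
        return CAPABILITY_MODEL
    return FAST_MODEL
```

In practice the routing signal comes from your own task taxonomy (or a cheap classifier call), not prompt length alone.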
## 2024-2025 Quick Reference
| Vendor | Capability Tier (multimodal/tools) | Fast/Cheap Tier | Notes |
|---|---|---|---|
| OpenAI | GPT-5.1 (GPT-5 series) / GPT-4.1 | GPT-4o-mini | Great vision/tools; Mini for batch/automation |
| OpenAI (reasoning) | o1 / o1-mini | -- | Stronger long-chain reasoning/planning, higher cost |
| Anthropic | Claude 4.5 Sonnet | Claude 3.5/4.5 Haiku | Strong long doc/table capabilities, high safety |
| Google | Gemini 3 Pro | Gemini 3 Flash / Flash-Lite | 1M-token context, multimodal/tools; Flash series is fast |
| Meta (open-source) | Llama 3/3.1 70B | Llama 3.1 8B | Good for private deployment, rich ecosystem |
| Mistral | Mistral Large | Mistral Small | Cost-effective, strong multilingual |
| xAI | Grok-2 | Grok-2 mini | For time-sensitive/web-connected scenarios |
Selection advice: iterate "small model to get it working -> stronger model to improve quality -> eval regression." For multimodal/screenshot tasks, prefer GPT-5.1 / GPT-4o / Gemini 3; for long documents/tables, prefer Claude 4.5 or Gemini 3 Pro.
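The "eval regression" step above can be as small as a loop over a handful of cases. A minimal stdlib-only sketch, where `call_model` is a stand-in for your real API client and the eval set is a toy example (10-50 real cases from your own task work the same way):

```python
# Minimal sketch of the "get it working -> improve quality -> eval regression"
# loop. call_model() is a placeholder, not a real vendor client.

EVAL_SET = [
    ("What is 2 + 2? Reply with the number only.", "4"),
    ("Spell 'cat' backwards. Reply with the word only.", "tac"),
]

def call_model(model: str, prompt: str) -> str:
    # Placeholder: replace with your vendor's chat/completions call.
    canned = {"2 + 2": "4", "backwards": "tac"}
    return next((v for k, v in canned.items() if k in prompt), "")

def pass_rate(model: str) -> float:
    """Score a model on the eval set; rerun after every model/prompt change."""
    hits = sum(call_model(model, q).strip() == a for q, a in EVAL_SET)
    return hits / len(EVAL_SET)
```

Rerunning `pass_rate` after every model or prompt swap is what turns "the big model feels better" into a number you can compare.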
Last updated: 2025-02
## Reading Guide
This list leans toward "foundational + notable" historical context. It doesn't cover every latest model or version. Use it for:
- Backtracking: when a paper/blog mentions a model, quickly locate its origin and era
- Selection: build your candidate set before running evaluation
For engineering selection, answer at least these questions:
- Is the task more like chat, coding, reasoning, or RAG?
- Do you need long context? How long? (context length)
- Do you need Tool Calling? Structured output (JSON schema)?
- Can you run a small evaluation set for regression (10-50 samples is enough to start)?
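For the structured-output question in the checklist above, one cheap test is to parse the model's reply and enforce required keys and types. A stdlib-only sketch; the field names are hypothetical, and a real project might use the `jsonschema` package instead of a hand-rolled check:

```python
# Enforce a minimal "JSON schema" on a model reply: required fields + types.
# Field names below are hypothetical examples, not a standard.

import json

REQUIRED = {"task_type": str, "needs_long_context": bool}

def validate_reply(raw: str) -> dict:
    """Reject replies that are not JSON or miss a required, correctly-typed field."""
    data = json.loads(raw)
    for field, typ in REQUIRED.items():
        if not isinstance(data.get(field), typ):
            raise ValueError(f"missing or mistyped field: {field}")
    return data
```

Running this over your evaluation set tells you quickly whether a candidate model reliably honors the output contract, which matters as much as raw quality for automation.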
Data adapted from Papers with Code and Zhao et al. (2023).
## Models
| Model | Release Date | Description |
|---|---|---|
| BERT | 2018 | Bidirectional Encoder Representations from Transformers |
| GPT | 2018 | Improving Language Understanding by Generative Pre-Training |
| RoBERTa | 2019 | A Robustly Optimized BERT Pretraining Approach |
| GPT-2 | 2019 | Language Models are Unsupervised Multitask Learners |
| T5 | 2019 | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer |
| BART | 2019 | Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension |
| ALBERT | 2019 | A Lite BERT for Self-supervised Learning of Language Representations |
| XLNet | 2019 | Generalized Autoregressive Pretraining for Language Understanding |
| CTRL | 2019 | CTRL: A Conditional Transformer Language Model for Controllable Generation |
| ERNIE | 2019 | ERNIE: Enhanced Representation through Knowledge Integration |
| GShard | 2020 | GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding |
| GPT-3 | 2020 | Language Models are Few-Shot Learners |
| LaMDA | 2021 | LaMDA: Language Models for Dialog Applications |
| PanGu-α | 2021 | PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation |
| mT5 | 2021 | mT5: A massively multilingual pre-trained text-to-text transformer |
| CPM-2 | 2021 | CPM-2: Large-scale Cost-effective Pre-trained Language Models |
| T0 | 2021 | Multitask Prompted Training Enables Zero-Shot Task Generalization |
| HyperCLOVA | 2021 | What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers |
| Codex | 2021 | Evaluating Large Language Models Trained on Code |
| ERNIE 3.0 | 2021 | ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation |
| Jurassic-1 | 2021 | Jurassic-1: Technical Details and Evaluation |
| FLAN | 2021 | Finetuned Language Models Are Zero-Shot Learners |
| MT-NLG | 2021 | Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model |
| Yuan 1.0 | 2021 | Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning |
| WebGPT | 2021 | WebGPT: Browser-assisted question-answering with human feedback |
| Gopher | 2021 | Scaling Language Models: Methods, Analysis & Insights from Training Gopher |
| ERNIE 3.0 Titan | 2021 | ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation |
| GLaM | 2021 | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts |
| InstructGPT | 2022 | Training language models to follow instructions with human feedback |
| GPT-NeoX-20B | 2022 | GPT-NeoX-20B: An Open-Source Autoregressive Language Model |
| AlphaCode | 2022 | Competition-Level Code Generation with AlphaCode |
| CodeGen | 2022 | CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis |
| Chinchilla | 2022 | Shows that for a compute budget, the best performances are not achieved by the largest models but by smaller models trained on more data. |
| Tk-Instruct | 2022 | Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks |
| UL2 | 2022 | UL2: Unifying Language Learning Paradigms |
| PaLM | 2022 | PaLM: Scaling Language Modeling with Pathways |
| OPT | 2022 | OPT: Open Pre-trained Transformer Language Models |
| BLOOM | 2022 | BLOOM: A 176B-Parameter Open-Access Multilingual Language Model |
| GLM-130B | 2022 | GLM-130B: An Open Bilingual Pre-trained Model |
| AlexaTM | 2022 | AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model |
| Flan-T5 | 2022 | Scaling Instruction-Finetuned Language Models |
| Sparrow | 2022 | Improving alignment of dialogue agents via targeted human judgements |
| U-PaLM | 2022 | Transcending Scaling Laws with 0.1% Extra Compute |
| mT0 | 2022 | Crosslingual Generalization through Multitask Finetuning |
| Galactica | 2022 | Galactica: A Large Language Model for Science |
| OPT-IML | 2022 | OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization |
| LLaMA | 2023 | LLaMA: Open and Efficient Foundation Language Models |
| GPT-4 | 2023 | GPT-4 Technical Report |
| PanGu-Σ | 2023 | PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing |
| BloombergGPT | 2023 | BloombergGPT: A Large Language Model for Finance |
| PaLM 2 | 2023 | A Language Model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. |
| Claude 2 | 2023 | Anthropic’s second-gen assistant, improved writing/code quality and safety |
| Llama 2 | 2023 | Open-weight chat models (7B-70B) widely used for private deployment |
| Mixtral 8x7B | 2023 | Sparse Mixture-of-Experts open-source model, great cost-effectiveness |
| Gemini 1.0 | 2023 | Google's multimodal model (Ultra/Pro/Nano), first release of the Gemini series |
| Claude 3 (Opus/Sonnet/Haiku) | 2024 | Next-gen multimodal models, strong long doc/table extraction and safety |
| Gemini 1.5 Pro | 2024 | Up to 1M+ tokens long context, multimodal |
| Gemini 1.5 Flash | 2024 | Cheap and fast multimodal model, good for batch and real-time interaction |
| Mistral Large | 2024 | Multilingual large model, supports function calling and long context |
| Grok-1.5 | 2024 | xAI's long context model, focused on real-time use |
| GPT-4o | 2024 | OpenAI's omni-modal flagship, faster than GPT-4, supports voice/image/video |
| GPT-4o mini | 2024 | Low-cost small model with strong tool capabilities |
| Llama 3 | 2024 | 8B/70B open-source, strong in English and multilingual performance |
| Llama 3.1 | 2024 | 8B/70B/405B, upgraded reasoning and 128k context |
| Claude 3.5 Sonnet | 2024 | Claude 3.5 primary model, strong at code and tool calling |
| Claude 3.5 Haiku | 2024 | Lightweight fast variant, retains high safety and multimodal capabilities |
| o1 | 2024 | OpenAI reasoning model, focused on chain-of-thought and planning |
| o1-mini | 2024 | Cheaper/faster version of o1 |
| GPT-4.1 | 2025 | Coding- and instruction-following-focused successor to GPT-4o, with up to 1M tokens of context |
| Grok-2 | 2024 | xAI next-gen model, improved code and web-connected QA |
| Gemini 2.0 Flash (Exp) | 2024 | Gemini 2.0 preview for real-time/tool scenarios |
| Gemini 2.0 Pro (Exp) | 2024 | Gemini 2.0 preview flagship, multimodal and long context |
| DeepSeek-R1 | 2025 | Open-weight reasoning model trained with reinforcement learning, strong at math/code, supports long context |
| Claude 4.5 Sonnet | 2025 | Anthropic's 2025 flagship; 200K context (1M in beta), stronger code and table retrieval |
| GPT-5.1 (GPT-5 series) | 2025 | OpenAI's latest flagship, served via the Responses API; controllable reasoning depth and multimodal |
| Gemini 3 Pro | 2025 | Google's ultra-long context flagship (1,048,576 tokens), multimodal + tool chain optimization |
| Gemini 3 Flash / Flash-Lite | 2025 | Fast/low-cost multimodal models, good for in-product real-time interaction and batch processing |