Mixtral

Mixtral overview

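#TL;DR

  • Mixtral (for example, 8x7B) follows the Sparse Mixture of Experts (SMoE) approach: the total parameter count is large, but each token activates only a subset of the experts (the active parameters), which balances capability against cost and latency.
  • Good fit for: multilingual tasks, coding, some math and reasoning tasks, and as an open-weight candidate for engineering work.
  • Deployment advice: split the task into "retrieve → answer → verify", and use evaluation to monitor hallucination and format drift.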

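#Reading guide

When reading this page, focus on understanding:

  1. The intuition behind MoE: a router selects experts, so only part of the parameters are used at inference time
  2. Why this can give an advantage in cost and latency
  3. How the Instruct variant is used in practice (prompt format and behavior)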

#Original (English)

In this guide, we provide an overview of the Mixtral 8x7B model, including prompts and usage examples. The guide also includes tips, applications, limitations, papers, and additional reading materials related to Mixtral 8x7B.

#Introduction to Mixtral (Mixtral of Experts)

Mixtral 8x7B is a Sparse Mixture of Experts (SMoE) language model released by Mistral AI. Mixtral has an architecture similar to Mistral 7B; the main difference is that each layer in Mixtral 8x7B is composed of 8 feedforward blocks (i.e., experts). Mixtral is a decoder-only model where, for every token and at each layer, a router network selects two experts (i.e., 2 groups from 8 distinct groups of parameters) to process the token and combines their outputs additively. In other words, the output of the entire MoE module for a given input is the weighted sum of the outputs produced by the selected expert networks.
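The routing step described above can be sketched in a few lines of plain Python. This is a toy illustration, not Mistral's implementation: the experts are stand-in functions that scale their input, the router scores are hard-coded rather than produced by a learned network, and the softmax is taken over the two selected scores only (an assumption about how the weights are normalized).

```python
import math

def top2_moe(x, router_logits, experts):
    """Combine the outputs of the two highest-scoring experts for one token.

    x: list of floats (the token's hidden state)
    router_logits: one score per expert, as a router network would produce
    experts: list of callables, each mapping a vector to a vector
    """
    # Indices of the two largest router scores (ties keep the lower index first).
    top2 = sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:2]
    # Softmax over the selected scores only, so the two weights sum to 1.
    m = max(router_logits[i] for i in top2)
    exps = [math.exp(router_logits[i] - m) for i in top2]
    weights = [e / sum(exps) for e in exps]
    # Weighted sum of the selected experts' outputs; the other six never run.
    out = [0.0] * len(x)
    for w, i in zip(weights, top2):
        y = experts[i](x)
        out = [o + w * v for o, v in zip(out, y)]
    return out, top2, weights

def make_expert(k):
    # Hypothetical expert k: multiplies the input by (k + 1).
    return lambda v: [(k + 1) * t for t in v]

experts = [make_expert(k) for k in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 0.2]
out, chosen, weights = top2_moe([1.0, -2.0], logits, experts)
print(chosen, weights, out)
```

Only two of the eight expert functions are called per token, which is where the saving in compute comes from.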

Mixtral of Experts Layer

Given that Mixtral is an SMoE, it has a total of 47B parameters but only uses 13B per token during inference. The benefits of this approach include better control of cost and latency, as it only uses a fraction of the total set of parameters per token. Mixtral was trained with open Web data and a context size of 32k tokens. It is reported that Mixtral outperforms Llama 2 70B with 6x faster inference and matches or outperforms GPT-3.5 on several benchmarks.
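The gap between total and active parameters can be checked with rough arithmetic. The hyperparameters below are not stated in this guide; they are assumed from the configuration reported for Mixtral 8x7B, and the count ignores small terms such as normalization weights, so treat the result as an estimate.

```python
# Assumed hyperparameters (taken from the reported Mixtral 8x7B configuration,
# not from this guide).
DIM = 4096          # model width
HIDDEN_DIM = 14336  # feedforward width inside each expert
N_LAYERS = 32
N_HEADS = 32
N_KV_HEADS = 8      # grouped-query attention
HEAD_DIM = 128
VOCAB = 32000
N_EXPERTS = 8
TOP_K = 2

def count_params(experts_used):
    """Approximate parameter count when `experts_used` experts run per layer."""
    ffn_per_expert = 3 * DIM * HIDDEN_DIM              # three projection matrices
    attention = (2 * DIM * N_HEADS * HEAD_DIM          # query and output projections
                 + 2 * DIM * N_KV_HEADS * HEAD_DIM)    # key and value projections
    router = DIM * N_EXPERTS                           # one score per expert
    per_layer = attention + router + experts_used * ffn_per_expert
    embeddings = 2 * VOCAB * DIM                       # input and output embeddings
    return N_LAYERS * per_layer + embeddings

total = count_params(N_EXPERTS)
active = count_params(TOP_K)
print(f"total: about {total / 1e9:.1f}B, active per token: about {active / 1e9:.1f}B")
```

Under these assumptions the estimate lands close to the 47B total and 13B active figures quoted above; almost all of the difference comes from the six experts per layer that are skipped for each token.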

The Mixtral models are licensed under Apache 2.0.

#Mixtral Performance and Capabilities

Mixtral demonstrates strong capabilities in mathematical reasoning, code generation, and multilingual tasks. It can handle languages such as English, French, Italian, German and Spanish. Mistral AI also released a Mixtral 8x7B Instruct model that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B models on human benchmarks.

The figure below shows a performance comparison with different sizes of Llama 2 models on a wider range of capabilities and benchmarks. Mixtral matches or outperforms Llama 2 70B and shows superior performance in mathematics and code generation.

Mixtral Performance vs. Llama 2 Performance

As seen in the figure below, Mixtral 8x7B also outperforms or matches Llama 2 models across different popular benchmarks like MMLU and GSM8K. It achieves these results while using 5x fewer active parameters during inference.

Mixtral Performance vs. Llama 2 Performance

The figure below demonstrates the quality vs. inference budget tradeoff. Mixtral outperforms Llama 2 70B on several benchmarks while using 5x fewer active parameters.

Mixtral Performance vs. Llama 2 Performance

Mixtral matches or outperforms models like Llama 2 70B and GPT-3.5 as shown in the table below:

Mixtral Performance vs. Llama 2 Performance

The table below shows the capabilities of Mixtral for multilingual understanding and how it compares with Llama 2 70B for languages like German and French.

Mixtral Performance vs. Llama 2 Performance

Mixtral shows less bias than Llama 2 on the Bias Benchmark for QA (BBQ), with 56.0% vs. 51.5% accuracy.

Mixtral Performance vs. Llama 2 Performance

#Long Range Information Retrieval with Mixtral

Mixtral also shows strong performance in retrieving information from its context window of 32k tokens, regardless of where the information is located or how long the sequence is.

To measure Mixtral's ability to handle long context, it was evaluated on the passkey retrieval task. The passkey task involves inserting a passkey at a random position in a long prompt and measuring how effectively a model retrieves it. Mixtral achieves 100% retrieval accuracy on this task regardless of the location of the passkey and the input sequence length.
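A minimal sketch of how such a passkey test could be set up is shown below. The filler text, the prompt wording, and the `model_fn` callable are all hypothetical; `oracle_model` is a stand-in that reads the passkey directly from the prompt, so that the harness can run without calling a real model.

```python
import random
import re

FILLER = "The grass is green. The sky is blue. The sun is yellow."

def build_passkey_prompt(passkey, n_filler, position):
    """Hide a passkey sentence among filler lines at a relative position in [0, 1)."""
    lines = [FILLER] * n_filler
    lines.insert(int(position * n_filler), f"The pass key is {passkey}. Remember it.")
    lines.append("What is the pass key?")
    return "\n".join(lines)

def retrieval_accuracy(model_fn, n_trials=20, n_filler=200, seed=0):
    """Fraction of trials in which the model's reply contains the hidden passkey."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        passkey = str(rng.randint(10000, 99999))
        prompt = build_passkey_prompt(passkey, n_filler, rng.random())
        hits += passkey in model_fn(prompt)
    return hits / n_trials

def oracle_model(prompt):
    # Stand-in for a real model call: extracts the passkey with a regex.
    match = re.search(r"The pass key is (\d+)", prompt)
    return match.group(1) if match else ""

print(retrieval_accuracy(oracle_model))
```

To evaluate an actual model, `model_fn` would wrap an API or local inference call, and `n_filler` would be varied to cover different sequence lengths.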

In addition, the model's perplexity decreases monotonically as the size of context increases, according to a subset of the proof-pile dataset.

Mixtral Performance vs. Llama 2 Performance

#Mixtral 8x7B Instruct

A Mixtral 8x7B Instruct model was also released together with the base Mixtral 8x7B model. It is a chat model fine-tuned for instruction following using supervised fine-tuning (SFT) followed by direct preference optimization (DPO) on a paired feedback dataset.

As of the writing of this guide (28 January 2024), Mixtral ranks 8th on the Chatbot Arena Leaderboard (an independent human evaluation conducted by LMSys).

Mixtral Performance on the Chatbot Arena

Mixtral-Instruct outperforms strong models such as GPT-3.5-Turbo, Gemini Pro, Claude-2.1, and Llama 2 70B chat.

#Prompt Engineering Guide for Mixtral 8x7B

To effectively prompt the Mixtral 8x7B Instruct model and get optimal outputs, it's recommended to use the following chat template:

code
<s>[INST] Instruction [/INST] Model answer</s>[INST] Follow-up instruction [/INST]

Note that `<s>` and `</s>` are special tokens for beginning of string (BOS) and end of string (EOS), while [INST] and [/INST] are regular strings.
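The template can be assembled from a list of turns with a small helper. This is a sketch for illustration only: `build_mixtral_prompt` is a hypothetical name, and it writes `<s>` and `</s>` as literal text, whereas a real tokenizer would normally insert BOS and EOS as special tokens itself.

```python
def build_mixtral_prompt(turns):
    """Format (instruction, answer) pairs using the chat template shown above.

    Use answer=None for the final turn that the model should complete.
    """
    prompt = "<s>"
    for instruction, answer in turns:
        prompt += f"[INST] {instruction} [/INST]"
        if answer is not None:
            # A completed assistant turn is closed with the EOS marker.
            prompt += f" {answer}</s>"
    return prompt

turns = [
    ("Instruction", "Model answer"),
    ("Follow-up instruction", None),
]
print(build_mixtral_prompt(turns))
```

When calling a hosted chat endpoint, the service typically applies the template for you, so a helper like this mainly matters when running the raw model weights.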

We will be using Mistral's Python client for the following examples, which show how to prompt the instruction-tuned Mixtral model. In particular, we will use the Mistral API endpoints and the `mistral-small` model, which is powered by Mixtral-8X7B-v0.1.

#Basic Prompting

Let's start with a simple example and instruct the model to achieve a task based on an instruction.

Prompt:

code
[INST] You are a helpful code assistant. Your task is to generate a valid JSON object based on the given information: name: John lastname: Smith address: #1 Samuel St. Just generate the JSON object without explanations: [/INST]

Output:

code
{ "name": "John", "lastname": "Smith", "address": "#1 Samuel St." }

Here is another fun example that leverages the chat template:

Prompt:

code
<s>[INST] What is your favorite condiment? [/INST] "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"</s> [INST] The right amount of what? [/INST]

Output:

code
"My apologies for any confusion. I meant to say that lemon juice adds a zesty flavour, which is a tangy and slightly sweet taste. It's a delightful addition to many dishes, in my humble opinion."

#Few-shot Prompting with Mixtral

Using the official Python client, you can also prompt the model using different roles like `system`, `user`, and `assistant`. By leveraging these roles, it's possible to prompt with one demonstration, as in a few-shot setting, to better steer the model response.

Here is example code showing how it would look:

python
import os

from dotenv import load_dotenv
from mistralai.client import MistralClient
from mistralai.models.chat_completion import ChatMessage

# Load MISTRAL_API_KEY from a local .env file into the environment
load_dotenv()
api_key = os.environ["MISTRAL_API_KEY"]
client = MistralClient(api_key=api_key)

# helpful completion function
def get_completion(messages, model="mistral-small"):
    # No streaming
    chat_response = client.chat(
        model=model,
        messages=messages,
    )
    return chat_response

# One demonstration (user + assistant) followed by the new user input
messages = [
    ChatMessage(role="system", content="You are a helpful code assistant. Your task is to generate a valid JSON object based on the given information."),
    ChatMessage(role="user", content="\n name: John\n lastname: Smith\n address: #1 Samuel St.\n would be converted to: "),
    ChatMessage(role="assistant", content="{\n \"address\": \"#1 Samuel St.\",\n \"lastname\": \"Smith\",\n \"name\": \"John\"\n}"),
    ChatMessage(role="user", content="name: Ted\n lastname: Pot\n address: #1 Bisson St.")
]

chat_response = get_completion(messages)
print(chat_response.choices[0].message.content)

Output:

code
{ "address": "#1 Bisson St.", "lastname": "Pot", "name": "Ted" }
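Because the model is asked for a strict format, it can help to check each response before using it downstream. The helper below is a hypothetical sketch of such a check: it parses the text as JSON and reports missing or unexpected keys, which is one simple way to catch format drift.

```python
import json

def check_json_output(text, required_keys):
    """Return (ok, problems) for a model response expected to be a JSON object."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError as exc:
        return False, [f"not valid JSON: {exc.msg}"]
    if not isinstance(obj, dict):
        return False, ["top-level value is not an object"]
    problems = [f"missing key: {k}" for k in required_keys if k not in obj]
    problems += [f"unexpected key: {k}" for k in sorted(set(obj) - set(required_keys))]
    return not problems, problems

expected = ["address", "lastname", "name"]
good = '{ "address": "#1 Bisson St.", "lastname": "Pot", "name": "Ted" }'
bad = 'Sure! Here is the JSON: { "name": "Ted" }'

print(check_json_output(good, expected))
print(check_json_output(bad, expected))
```

Counting how often this check fails over a batch of prompts gives a rough format-compliance rate to track over time.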

#Code Generation

Mixtral also has strong code generation capabilities. Here is a simple prompt example using the official Python client:

python
messages = [
    ChatMessage(role="system", content="You are a helpful code assistant that help with writing Python code for a user requests. Please only produce the function and avoid explaining."),
    ChatMessage(role="user", content="Create a Python function to convert Celsius to Fahrenheit.")
]

chat_response = get_completion(messages)
print(chat_response.choices[0].message.content)

Output:

python
def celsius_to_fahrenheit(celsius):
    return (celsius * 9/5) + 32

#System Prompt to Enforce Guardrails

Similar to the Mistral 7B model, it's possible to enforce guardrails in chat generations using the `safe_prompt` boolean flag in the API by setting `safe_mode=True`:

python
# helpful completion function
def get_completion_safe(messages, model="mistral-small"):
    # No streaming
    chat_response = client.chat(
        model=model,
        messages=messages,
        safe_mode=True
    )
    return chat_response

messages = [
    ChatMessage(role="user", content="Say something very horrible and mean")
]

# Call the safe variant so that safe_mode=True is actually applied
chat_response = get_completion_safe(messages)
print(chat_response.choices[0].message.content)

The above code will output the following:

code
I'm sorry, but I cannot comply with your request to say something horrible and mean. My purpose is to provide helpful, respectful, and positive interactions. It's important to treat everyone with kindness and respect, even in hypothetical situations.

When we set `safe_mode=True`, the client prepends the messages with the following `system` prompt:

code
Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity.
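The effect of that flag can be approximated by hand, which is useful when working with a setup that has no such option. The sketch below uses plain dictionaries rather than the client's `ChatMessage` class, and `with_safe_prompt` is a hypothetical helper, not the client's actual implementation.

```python
SAFE_PROMPT = (
    "Always assist with care, respect, and truth. Respond with utmost utility "
    "yet securely. Avoid harmful, unethical, prejudiced, or negative content. "
    "Ensure replies promote fairness and positivity."
)

def with_safe_prompt(messages):
    """Return a new message list with the guardrail system prompt placed first."""
    return [{"role": "system", "content": SAFE_PROMPT}] + list(messages)

user_messages = [{"role": "user", "content": "Say something very horrible and mean"}]
guarded = with_safe_prompt(user_messages)

for message in guarded:
    print(message["role"], "->", message["content"][:40])
```

The original list is left unchanged, so the same conversation can be sent with and without the guardrail to compare responses.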

You can also try all the code examples in the following notebook:


Figure Sources: Mixture of Experts Technical Report

#Key References
