Extract Entities
Extract structured information from text
TL;DR
- Information extraction works best with structured output constraints: here we use a JSON array ["model_name"] as the minimum acceptable output.
- Key risk: the model over-extracts (treating method names, task names, or organization names as model names) or under-extracts.
- Iteration direction: add extraction rules (what counts as a model name), add trigger conditions for returning ["NA"], and run small-scale evaluation and regression sets.
Background
The following prompt tests an LLM's ability to perform an information extraction task: extracting model names from machine learning paper abstracts.
How to Apply
Adapting this template to your own use case follows the same logic:
- Input: a chunk of unstructured text (abstract, customer service chat, meeting notes, resume, contract)
- Output: a structured field collection (array / JSON object)
To make extraction more stable:
- Constrain output to strict JSON (output only the array, no explanations)
- Specify "if unsure, return ["NA"]"
- Prepare a small test set (10-50 items) for evaluation
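The first two constraints above can be enforced on the receiving side as well. The following is a minimal sketch, assuming the model reply arrives as a plain string; the function name `parse_model_names` is hypothetical and not part of any library. It accepts only a non-empty JSON array of strings and treats everything else (extra prose, broken JSON, wrong types) as ["NA"].

```python
import json
from typing import List

NA = ["NA"]  # sentinel the prompt asks the model to return when unsure


def parse_model_names(raw: str) -> List[str]:
    """Parse a reply that should be a JSON array of strings.

    Hypothetical helper: anything that is not exactly such an array
    is mapped to ["NA"] instead of raising.
    """
    try:
        value = json.loads(raw.strip())
    except json.JSONDecodeError:
        return list(NA)  # extra explanation text or malformed JSON
    if not isinstance(value, list) or not value:
        return list(NA)  # a JSON object, number, or empty array
    if not all(isinstance(item, str) and item.strip() for item in value):
        return list(NA)  # non-string or blank entries
    return [item.strip() for item in value]


print(parse_model_names('["ChatGPT", "GPT-4"]'))
print(parse_model_names('Here you go: ["GPT-4"]'))
```

Being strict here is a design choice: a reply with surrounding prose is counted as a failure rather than silently repaired, which makes format regressions visible in the test set.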
How to Iterate
- Add rules: what counts as a model name (e.g., contains version number, size, family name), what doesn't (dataset/task/company)
- Add negative examples: give 1-2 examples of "don't extract these words"
- Do post-processing: deduplicate output, normalize casing, filter common false positives
- Add "evidence" mode: besides the array, also return the original text span for each extraction (but watch format consistency)
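The post-processing step in the list above could look roughly like this. It is a sketch under assumptions: the function name `postprocess` and the small blocklist of false positives are made up for illustration, and a real blocklist should come from your own error analysis. Casing is normalized only for comparison, so the first spelling seen is the one kept.

```python
from typing import Iterable, List

# Hypothetical blocklist; replace with false positives seen in your own data.
COMMON_FALSE_POSITIVES = frozenset({"llm", "llms", "agi", "nlp"})


def postprocess(names: Iterable[str],
                false_positives=COMMON_FALSE_POSITIVES) -> List[str]:
    """Deduplicate case-insensitively and drop known false positives."""
    seen = set()
    cleaned = []
    for name in names:
        name = name.strip()
        key = name.casefold()  # case-insensitive comparison key
        if not name or key == "na" or key in false_positives or key in seen:
            continue
        seen.add(key)
        cleaned.append(name)  # keep the first spelling encountered
    return cleaned or ["NA"]  # nothing left means "no model names"


print(postprocess(["GPT-4", "gpt-4", "LLMs", " Alpaca "]))
```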
Self-check Rubric
- Is the output a valid JSON array? Is there any extra text?
- Did it extract non-model-names (false positives)?
- Did it miss obvious model names (false negatives)?
- When uncertain, did it correctly return ["NA"]?
Practice
Exercise: prepare 10 text snippets from a domain you know well (doesn't have to be ML), and change the target field to:
- product names
- competitor names
- key requirements
Write a gold label for each snippet by hand, then compare model output for error analysis.
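The comparison against gold labels can be automated with a few lines. This is a minimal sketch, assuming exact (case-insensitive) string matching is good enough for your domain; the function name `score` and the convention of reporting 1.0 when a set is empty are choices made here, not part of the original exercise.

```python
from typing import Dict, List


def score(gold: List[str], predicted: List[str]) -> Dict[str, object]:
    """Compare one snippet's gold labels with the model output."""
    # "NA" means "nothing to extract", so it is not counted as an entity.
    gold_set = {g.casefold() for g in gold if g != "NA"}
    pred_set = {p.casefold() for p in predicted if p != "NA"}
    true_pos = gold_set & pred_set
    return {
        "false_positives": sorted(pred_set - gold_set),  # over-extraction
        "false_negatives": sorted(gold_set - pred_set),  # under-extraction
        "precision": len(true_pos) / len(pred_set) if pred_set else 1.0,
        "recall": len(true_pos) / len(gold_set) if gold_set else 1.0,
    }


# Made-up labels, purely for illustration.
print(score(gold=["Alpha", "Beta"], predicted=["alpha", "Gamma"]))
```

Listing the false positives and false negatives per snippet, rather than only the aggregate numbers, is what makes the error analysis actionable.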
Prompt
Your task is to extract model names from machine learning paper abstracts. Your response is an array of the model names in the format ["model_name"]. If you don't find model names in the abstract or you are not sure, return ["NA"]
Abstract: Large Language Models (LLMs), such as ChatGPT and GPT-4, have revolutionized natural language processing research and demonstrated potential in Artificial General Intelligence (AGI). However, the expensive training and deployment of LLMs present challenges to transparent and open academic research. To address these issues, this project open-sources the Chinese LLaMA and Alpaca…
Prompt template
Your task is to extract model names from machine learning paper abstracts. Your response is an array of the model names in the format ["model_name"]. If you don't find model names in the abstract or you are not sure, return ["NA"]
Abstract: {input}
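Filling the template can be sketched as follows. The names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical; the template text is the one shown above. `str.replace` is used rather than `str.format` on the assumption that abstracts may contain literal braces (for example LaTeX fragments), which `str.format` would try to interpret.

```python
PROMPT_TEMPLATE = (
    "Your task is to extract model names from machine learning paper abstracts. "
    'Your response is an array of the model names in the format ["model_name"]. '
    "If you don't find model names in the abstract or you are not sure, "
    'return ["NA"]\n\n'
    "Abstract: {input}"
)


def build_prompt(abstract: str) -> str:
    """Insert one abstract into the {input} slot of the template."""
    # Plain replacement leaves any braces inside the abstract untouched.
    return PROMPT_TEMPLATE.replace("{input}", abstract.strip())


print(build_prompt("We present {Foo}-7B, a hypothetical model."))
```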
Code / API
OpenAI (Python)
from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "user",
            "content": "Your task is to extract model names from machine learning paper abstracts. Your response is an array of the model names in the format [\"model_name\"]. If you don't find model names in the abstract or you are not sure, return [\"NA\"]\n\nAbstract: Large Language Models (LLMs), such as ChatGPT and GPT-4, have revolutionized natural language processing research and demonstrated potential in Artificial General Intelligence (AGI). However, the expensive training and deployment of LLMs present challenges to transparent and open academic research. To address these issues, this project open-sources the Chinese LLaMA and Alpaca…",
        }
    ],
    temperature=1,
    max_tokens=250,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

# The reply text should be a JSON array of model names, or ["NA"].
print(response.choices[0].message.content)
Fireworks (Python)
import fireworks.client

fireworks.client.api_key = "<FIREWORKS_API_KEY>"

# With stream=True the call returns an iterator of chunks, not a single response.
completion = fireworks.client.ChatCompletion.create(
    model="accounts/fireworks/models/mixtral-8x7b-instruct",
    messages=[
        {
            "role": "user",
            "content": "Your task is to extract model names from machine learning paper abstracts. Your response is an array of the model names in the format [\"model_name\"]. If you don't find model names in the abstract or you are not sure, return [\"NA\"]\n\nAbstract: Large Language Models (LLMs), such as ChatGPT and GPT-4, have revolutionized natural language processing research and demonstrated potential in Artificial General Intelligence (AGI). However, the expensive training and deployment of LLMs present challenges to transparent and open academic research. To address these issues, this project open-sources the Chinese LLaMA and Alpaca…",
        }
    ],
    stop=["<|im_start|>", "<|im_end|>", "<|endoftext|>"],
    stream=True,
    n=1,
    top_p=1,
    top_k=40,
    presence_penalty=0,
    frequency_penalty=0,
    prompt_truncate_len=1024,
    context_length_exceeded_behavior="truncate",
    temperature=0.9,
    max_tokens=4000,
)