Extract entities
Extract structured information from text.
#TL;DR
- A good fit for strongly constrained information extraction: a JSON array (structured output in the shape `["model_name"]`) serves as the minimal verifiable output.
- Key risk: the model over-extracts (treating method, task, or organization names as model names) or under-extracts.
- Iteration direction: add extraction rules (what counts as a model name), add explicit trigger conditions for returning `["NA"]`, and run small-scale regressions against an evaluation set.
#Background
The following prompt tests an LLM's ability to perform an information extraction task: extracting model names from machine learning paper abstracts.
#How to Apply
To port this template to your own use case, the approach is the same:
- Input: a piece of unstructured text (paper abstract, support conversation, meeting notes, resume, contract)
- Output: a set of structured fields (an array or a JSON object)
To make the extraction more stable:
- Constrain the output to strict JSON (return only the array, with no explanation)
- State explicitly: "if unsure, return `["NA"]`"
- Prepare a small test set (10-50 items) for evaluation
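A minimal sketch of such an evaluation loop in Python — the raw outputs and gold labels below are made-up examples, not real model responses:

```python
import json

def score_one(raw_output: str, gold: set) -> dict:
    """Score one raw model output against a hand-written gold label set."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        parsed = None
    if not isinstance(parsed, list) or not all(isinstance(x, str) for x in parsed):
        # Any non-JSON or non-array output is a hard failure for this task.
        return {"valid": False, "tp": 0, "fp": 0, "fn": len(gold)}
    pred = set(parsed) - {"NA"}  # ["NA"] means "nothing found"
    return {
        "valid": True,
        "tp": len(pred & gold),
        "fp": len(pred - gold),
        "fn": len(gold - pred),
    }

# Tiny made-up regression set: (raw model output, gold labels).
cases = [
    ('["ChatGPT", "GPT-4"]', {"ChatGPT", "GPT-4"}),
    ('["NA"]', set()),
    ('Sure! ["LLaMA"]', {"LLaMA"}),  # extra text makes the JSON invalid
]
results = [score_one(raw, gold) for raw, gold in cases]
```

Aggregating `tp`/`fp`/`fn` over the whole set gives precision and recall for each prompt revision, which is what the regression comparison should track.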
#How to Iterate
- Add rules: what counts as a model name (e.g. names with version numbers, sizes, or family names) and what does not (datasets, tasks, companies)
- Add negative examples: give 1-2 examples of terms that must not be extracted
- Add post-processing: dedupe the output, normalize casing, and filter common false positives
- Add an "evidence" mode: alongside the array, return the source span for each extracted name (but watch for format consistency)
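The post-processing step can be sketched as a small Python helper; the blocklist entries here are hypothetical examples of frequent false positives, to be replaced with whatever your own error analysis turns up:

```python
# Hypothetical blocklist of frequent false positives (generic terms, not model names).
BLOCKLIST = {"llm", "llms", "agi", "transformer", "dataset"}

def postprocess(names: list) -> list:
    """Dedupe case-insensitively, trim whitespace, and drop known false positives."""
    seen, cleaned = set(), []
    for name in names:
        key = name.strip().lower()
        if not key or key in BLOCKLIST or key in seen:
            continue
        seen.add(key)
        cleaned.append(name.strip())  # keep the original casing for display
    return cleaned

print(postprocess([" GPT-4", "gpt-4", "LLM", "Alpaca"]))  # → ['GPT-4', 'Alpaca']
```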
#Self-check rubric
- Is the output a valid JSON array? Does it contain any extra text?
- Were non-model names extracted (false positives)?
- Were obvious model names missed (false negatives)?
- When unsure, does the model correctly return `["NA"]`?
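The format-related items in this rubric can be checked automatically (the false-positive/negative items still need gold labels). A minimal sketch:

```python
import json

def check_output(raw: str) -> list:
    """Return a list of rubric violations for one raw model output."""
    try:
        parsed = json.loads(raw.strip())
    except json.JSONDecodeError:
        return ["not valid JSON (possibly extra text around the array)"]
    if not isinstance(parsed, list):
        return ["top-level value is not an array"]
    problems = []
    if not all(isinstance(x, str) and x for x in parsed):
        problems.append("array contains non-string or empty entries")
    if "NA" in parsed and len(parsed) > 1:
        problems.append('"NA" mixed with real model names')
    return problems

print(check_output('Here you go: ["GPT-4"]'))  # flags the extra text
```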
#Practice
Exercise: prepare 10 passages of text from a domain you know well (it does not have to be ML), and swap the target field for:
- product names
- competitor names
- key requirements
Hand-write a gold label for each passage, then compare the model's output against it and run an error analysis.
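The error analysis can stay very simple; a sketch with made-up document IDs and labels:

```python
def error_analysis(examples):
    """examples: (text_id, predicted set, gold set) triples -> per-example error records."""
    report = []
    for text_id, pred, gold in examples:
        fp, fn = sorted(pred - gold), sorted(gold - pred)
        if fp or fn:
            report.append((text_id, fp, fn))
    return report

report = error_analysis([
    ("doc1", {"GPT-4"}, {"GPT-4", "LLaMA"}),  # missed LLaMA
    ("doc2", {"ImageNet"}, set()),            # dataset mistaken for a model name
    ("doc3", {"Alpaca"}, {"Alpaca"}),         # perfect -> not reported
])
print(report)  # → [('doc1', [], ['LLaMA']), ('doc2', ['ImageNet'], [])]
```

Reading the false positives and false negatives side by side usually suggests the next extraction rule or negative example to add.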
#Prompt
```markdown
Your task is to extract model names from machine learning paper abstracts. Your response is an array of the model names in the format ["model_name"]. If you don't find model names in the abstract or you are not sure, return ["NA"]

Abstract: Large Language Models (LLMs), such as ChatGPT and GPT-4, have revolutionized natural language processing research and demonstrated potential in Artificial General Intelligence (AGI). However, the expensive training and deployment of LLMs present challenges to transparent and open academic research. To address these issues, this project open-sources the Chinese LLaMA and Alpaca…
```
#Prompt template
```markdown
Your task is to extract model names from machine learning paper abstracts. Your response is an array of the model names in the format ["model_name"]. If you don't find model names in the abstract or you are not sure, return ["NA"]

Abstract: {input}
```
#Code / API
#OpenAI (Python)
```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "user",
            "content": "Your task is to extract model names from machine learning paper abstracts. Your response is an array of the model names in the format [\"model_name\"]. If you don't find model names in the abstract or you are not sure, return [\"NA\"]\n\nAbstract: Large Language Models (LLMs), such as ChatGPT and GPT-4, have revolutionized natural language processing research and demonstrated potential in Artificial General Intelligence (AGI). However, the expensive training and deployment of LLMs present challenges to transparent and open academic research. To address these issues, this project open-sources the Chinese LLaMA and Alpaca…",
        }
    ],
    temperature=1,
    max_tokens=250,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)
```
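The raw completion still needs parsing on the caller's side. A hedged sketch of that step — the `SimpleNamespace` stub below only stands in for the real response object so the helper can be exercised without an API call:

```python
import json
from types import SimpleNamespace

def parse_model_names(response) -> list:
    """Extract the model-name array from a chat completion, falling back to ["NA"]."""
    content = response.choices[0].message.content.strip()
    try:
        names = json.loads(content)
    except json.JSONDecodeError:
        return ["NA"]  # treat malformed output as "nothing found"
    if isinstance(names, list) and names and all(isinstance(n, str) for n in names):
        return names
    return ["NA"]

# Stub standing in for the object returned by client.chat.completions.create(...).
fake = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content=' ["ChatGPT", "GPT-4"] '))]
)
print(parse_model_names(fake))  # → ['ChatGPT', 'GPT-4']
```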
#Fireworks (Python)
```python
import fireworks.client

fireworks.client.api_key = "<FIREWORKS_API_KEY>"

completion = fireworks.client.ChatCompletion.create(
    model="accounts/fireworks/models/mixtral-8x7b-instruct",
    messages=[
        {
            "role": "user",
            "content": "Your task is to extract model names from machine learning paper abstracts. Your response is an array of the model names in the format [\"model_name\"]. If you don't find model names in the abstract or you are not sure, return [\"NA\"]\n\nAbstract: Large Language Models (LLMs), such as ChatGPT and GPT-4, have revolutionized natural language processing research and demonstrated potential in Artificial General Intelligence (AGI). However, the expensive training and deployment of LLMs present challenges to transparent and open academic research. To address these issues, this project open-sources the Chinese LLaMA and Alpaca…",
        }
    ],
    stop=["<|im_start|>", "<|im_end|>", "<|endoftext|>"],
    stream=True,
    n=1,
    top_p=1,
    top_k=40,
    presence_penalty=0,
    frequency_penalty=0,
    prompt_truncate_len=1024,
    context_length_exceeded_behavior="truncate",
    temperature=0.9,
    max_tokens=4000,
)
```