Information Extraction
Information Extraction (IE) is one of the highest-value applications of prompt engineering. The core goal: turn unstructured text (resumes, contracts, news, chat logs) into structured data that programs can process (JSON, SQL rows, CSV).
Work that used to require large collections of hand-written regular expressions and rules can now often be handled by a single well-crafted prompt.
Core Use Cases
- Resume parsing: Extract candidate name, education, tech stack, and most recent job from PDF text.
- Expense processing: Extract merchant name, amount, tax number, and date from invoice OCR text.
- Sentiment mining: Pull "product pros/cons," "sentiment polarity," and "specific requests" from user reviews.
- Medical structuring: Extract "diagnosis," "medication list," and "allergy history" from doctor's notes.
Technique 1: Define a Strict Schema
Don't just say "extract the key information." Tell the AI exactly what data structure you want. JSON is the most universal approach.
Bad Prompt
Take a look at this resume and list the schools and skills.
Good Prompt (using a JSON template as the schema)
You are a professional data extraction assistant. Extract information from the following resume text and output strictly in JSON format.
Output format (Schema):
{
"candidate_name": "string (full name)",
"education": [
{
"school": "string",
"degree": "string (e.g., Bachelor, Master)",
"year_graduated": "integer or null"
}
],
"skills": ["string (extract tech stack keywords)"],
"years_of_experience": "integer (estimate total work years)"
}
Rules:
1. If any field is missing, return null. Do not fabricate.
2. Return only JSON, no explanatory text.
Resume text:
{resume_text}
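A prompt like this is only half of the pipeline: the reply still has to be parsed and checked before it reaches a database. The following is a minimal sketch of that step, using only the Python standard library. The function name parse_resume_reply and the sample reply are hypothetical, written by hand for illustration; no real model call is made.

```python
import json

# Top-level fields from the schema in the prompt above.
REQUIRED_KEYS = ("candidate_name", "education", "skills", "years_of_experience")


def parse_resume_reply(raw_reply: str) -> dict:
    """Parse the model's reply and check it against the prompt's schema."""
    data = json.loads(raw_reply)  # raises json.JSONDecodeError on non-JSON output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")

    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    # Rule 1 allows null, so None is accepted wherever a value is absent.
    for entry in data["education"] or []:
        year = entry.get("year_graduated")
        if year is not None and not isinstance(year, int):
            raise ValueError(f"year_graduated must be an integer or null, got {year!r}")
    return data


# Hypothetical reply, written by hand for illustration (not real model output).
sample_reply = """
{
  "candidate_name": "Jane Doe",
  "education": [
    {"school": "Example University", "degree": "Master", "year_graduated": null}
  ],
  "skills": ["Python", "SQL"],
  "years_of_experience": 5
}
"""

resume = parse_resume_reply(sample_reply)
print(resume["candidate_name"], resume["education"][0]["year_graduated"])
```

Rejecting a reply that fails these checks (and retrying, if needed) is usually safer than trying to repair it silently.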
Technique 2: Provide Examples (One-Shot Extraction)
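For complex extraction tasks (e.g., field standardization), providing one input-output pair (one-shot) usually makes the output much more consistent, because the model copies the field names, value formats, and level of detail from the example.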
Prompt Example:
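Task: From customer service conversations, extract the complaint reason (issue_category, issue_detail), the emotion intensity (sentiment_score, an integer from 1 to 10), and the action the customer wants (action_required).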
Example input:
Customer: These headphones I bought broke on the left side after just three days! What kind of quality is this? Give me my money back right now!
Example output:
{
"issue_category": "Product Defect",
"issue_detail": "Left earbud stopped working",
"sentiment_score": 9,
"action_required": "Refund"
}
Current input:
Customer: Hi, has my order shipped yet? I've been waiting a week. If it's out of stock, just let me know.
Current output:
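When the same one-shot layout is reused for many conversations, it helps to assemble it in code so the example pair stays identical across calls. The sketch below shows one possible way to do that; build_one_shot_prompt is a hypothetical helper, and the task wording is shortened for the illustration.

```python
import json


def build_one_shot_prompt(task: str, example_input: str,
                          example_output: dict, current_input: str) -> str:
    """Assemble the task, one worked example, and the new input into one prompt."""
    return "\n".join([
        f"Task: {task}",
        "Example input:",
        example_input,
        "Example output:",
        json.dumps(example_output, indent=2),  # the example fixes field names and formats
        "Current input:",
        current_input,
        "Current output:",  # the prompt ends here so the model continues with JSON
    ])


prompt = build_one_shot_prompt(
    task="Extract the complaint reason and emotion intensity "
         "from customer service conversations.",
    example_input="Customer: These headphones I bought broke on the left side "
                  "after just three days! Give me my money back right now!",
    example_output={
        "issue_category": "Product Defect",
        "issue_detail": "Left earbud stopped working",
        "sentiment_score": 9,
        "action_required": "Refund",
    },
    current_input="Customer: Hi, has my order shipped yet? I've been waiting a week.",
)
print(prompt)
```

Serializing the example output with json.dumps, rather than typing it by hand, guarantees that the example the model imitates is itself valid JSON.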
Technique 3: Reason First, Extract Second
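When dealing with information that requires "judgment" (e.g., does this contract have legal risks?), asking the AI to output the final JSON directly tends to produce errors. Let it think first, then output. Putting the reasoning field first in the schema matters: the model generates text in order, so it writes its analysis before it commits to the extracted values.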
Prompt Example:
Analyze the following news article and extract every company mentioned, along with the reason for its stock price change.
Output format:
{
"reasoning": "string (briefly analyze the logical relationships in the text)",
"entities": [
{
"company": "string",
"stock_change": "string (e.g., +5%)",
"reason": "string"
}
]
}
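Downstream code usually needs only the entities, while the reasoning is useful for logging and debugging. The sketch below shows one way to separate the two, assuming the reply follows the format above. The function name parse_reasoned_reply and the sample reply (including the company name) are hypothetical and written by hand for illustration.

```python
import json


def parse_reasoned_reply(raw_reply: str) -> list:
    """Return the extracted entities from a reason-first reply."""
    data = json.loads(raw_reply)
    keys = list(data)
    # json.loads keeps key order, so this checks that the analysis
    # was written before the entities, as the schema intends.
    if not keys or keys[0] != "reasoning":
        raise ValueError("'reasoning' should be the first field in the reply")
    return data.get("entities") or []


def parse_percent(text: str) -> float:
    """Convert a string such as '+5%' or '-2.5%' to a float."""
    return float(text.strip().rstrip("%"))


# Hypothetical reply, written by hand for illustration (not real model output).
sample_reply = """
{
  "reasoning": "The article links the share price rise to a strong earnings report.",
  "entities": [
    {"company": "ExampleCorp", "stock_change": "+5%", "reason": "Strong earnings report"}
  ]
}
"""

entities = parse_reasoned_reply(sample_reply)
for entity in entities:
    print(entity["company"], parse_percent(entity["stock_change"]), entity["reason"])
```

Keeping the reasoning out of the stored record avoids mixing free-form text into the structured data.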
Best Practices
- Handling "No Data" (Nulls)
  - Tell the AI explicitly what to return when information doesn't exist: null, "N/A", or an empty string "".
  - Recommended: use null. It maps directly to a null value in code and cannot be mistaken for real text.
- Force JSON Mode
  - If you're using the OpenAI API (e.g., GPT-4o), enable response_format: { type: "json_object" } in the API call. The Gemini API (e.g., Gemini 1.5 Pro) has an equivalent setting: response_mime_type: "application/json" in the generation config.
  - This enforces syntactically valid JSON at the API level (no more missing brackets), provided the response is not cut off by the token limit. It does not guarantee that the fields match your schema, so still validate them.
- Use TypeScript Definitions (Advanced)
  - For models with strong code capabilities (like Claude 3.5 Sonnet, GPT-4), giving the model a TypeScript interface as the schema is often more accurate than describing it in natural language.
Extract data following this TypeScript interface:
interface Invoice {
  id: string;
  total: number; // excluding tax
  items: { name: string; qty: number; price: number }[];
}
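Even with these practices in place, a reply can contain placeholder strings instead of null, or values of the wrong type. The sketch below shows one way to normalize placeholders and check a reply against the Invoice interface above. The helper names and the list of placeholder strings are assumptions made for this illustration; adjust them to your data.

```python
from typing import Any

# Placeholder strings a model may emit instead of null.
# This list is an assumption; extend or shrink it for your data.
NULL_LIKE = {"", "n/a", "na", "none", "null", "unknown"}


def normalize_nulls(value: Any) -> Any:
    """Recursively replace placeholder strings with None."""
    if isinstance(value, dict):
        return {key: normalize_nulls(item) for key, item in value.items()}
    if isinstance(value, list):
        return [normalize_nulls(item) for item in value]
    if isinstance(value, str) and value.strip().lower() in NULL_LIKE:
        return None
    return value


def is_number(value: Any) -> bool:
    """True for int or float, but not bool (bool is a subclass of int)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def invoice_errors(invoice: dict) -> list:
    """Check a parsed reply against the Invoice interface; return problems found."""
    errors = []
    if not isinstance(invoice.get("id"), str):
        errors.append("id must be a string")
    if not is_number(invoice.get("total")):
        errors.append("total must be a number")
    items = invoice.get("items")
    if not isinstance(items, list):
        errors.append("items must be a list")
        return errors
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            errors.append(f"items[{index}] must be an object")
            continue
        if not isinstance(item.get("name"), str):
            errors.append(f"items[{index}].name must be a string")
        for field in ("qty", "price"):
            if not is_number(item.get(field)):
                errors.append(f"items[{index}].{field} must be a number")
    return errors


# Hypothetical parsed reply in which the model wrote "N/A" instead of null.
raw = {"id": "INV-001", "total": "N/A",
       "items": [{"name": "Widget", "qty": 2, "price": 9.5}]}
cleaned = normalize_nulls(raw)
print(cleaned["total"], invoice_errors(cleaned))
```

Returning a list of errors instead of raising on the first one makes it easier to log everything that went wrong with a single reply.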
Case Studies
- Extract Model Names from Papers
  - Hands-on: how to precisely extract proper noun lists from academic abstracts.