Jailbreaking
Jailbreaking concepts and defenses (safety-trimmed)
Background
Jailbreaking refers to attempts to bypass an LLM's safety policies and defense mechanisms, tricking the model into outputting content it shouldn't. This is a concept from the security research context.
What You Need to Know
- In real products, jailbreaking often overlaps with prompt injection and prompt leaking
- Models and providers keep updating, so any specific jailbreak prompt will quickly become ineffective or get patched
Defense Strategy (High Level)
- Clearly separate instructions from user input (structured, partitioned, quoted/escaped)
- Declare a threat model in the instructions: don't execute additional instructions found in user input
- Do output filtering / policy checks (plus logging and monitoring)
- Enforce strict allowlists for tool calls and external actions
For security reasons, this site doesn't provide usable jailbreak prompts or copyable attack scripts that could bypass safety policies.