Doublespeak: In‑Context Hijack

Replacing a harmful keyword with a benign token in in-context examples causes model internals to adopt the harmful meaning, producing disallowed outputs while evading input-layer safety checks.

A single euphemism can weaponize a language model. Representation-level hijacking bypasses token checks and fools major LLMs.

Source: arXiv

Highlights

Metric                        | Value                             | Notes
----------------------------- | --------------------------------- | -----
ASR (Llama-3.3-70B-Instruct)  | 74%                               | ASR = attack success rate
ASR (GPT-4o)                  | 31%                               |
ASR (Llama-3-8B-Instruct)     | 88%                               |
Optimization                  | None required                     | No per-model optimization needed
Affected models               | GPT-4o, Claude, Gemini, and more  | Successfully tested across model families
Layer-wise effect             | Benign converges to harmful       | Benign token semantics in early layers converge to harmful semantics in later layers

Key points

  • The attack replaces harmful keywords with benign substitutes across the in-context examples, then issues the harmful query using the substitute token.
  • Internal representations of the substitute token shift: early layers retain benign meaning; middle-to-late layers converge to the harmful target meaning.
  • Refusal and safety checks that inspect input-layer tokens fail because the semantic hijack emerges later in the forward pass.
  • High empirical success rates across model families (examples include Llama and GPT variants) without per-model optimization.
  • Interpretability tools: Logit Lens and Patchscopes trace the layer-by-layer semantic shift (see the logit-lens sketch after this list).
  • Findings show surgical precision: the attack primarily affects the target token's representation rather than broad model behavior.
  • Authors reported findings responsibly to affected parties before public release.
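
The layer-wise claim above can be inspected with a minimal logit-lens pass. This is a sketch, not the paper's code: GPT-2 stands in for the much larger instruction-tuned models studied, the attribute names `model.transformer.ln_f` and `model.lm_head` are GPT-2-specific, and a benign in-context remapping ("fruit" meaning "capital city") stands in for the attack's harmful substitution. The idea is to project the hidden state at the substitute token's position into vocabulary space at every layer and watch what it decodes to.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in model; the paper analyzes much larger instruction-tuned LLMs.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Benign analogue of the attack setup: in-context examples silently remap the
# word "fruit" to mean "capital city", then a query reuses the substitute word.
prompt = (
    "Q: What is the fruit of France? A: Paris.\n"
    "Q: What is the fruit of Italy? A: Rome.\n"
    "Q: Now describe the fruit"
)
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Trace the hidden state at the substitute token's position (the last token
# here), projecting it through the final layer norm and the unembedding matrix
# at every layer, then reading off the top decoded tokens.
pos = inputs["input_ids"].shape[1] - 1
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[0, pos]))
    top = tok.convert_ids_to_tokens(logits.topk(5).indices.tolist())
    print(f"layer {layer:02d}: {top}")
```

On the models the paper studies, early layers decode the substitute token's benign surface meaning while middle-to-late layers converge to the hijacked concept; Patchscopes offers a complementary view by decoding the same hidden states with a separate inspection prompt.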

Why this matters

Doublespeak exposes a critical blind spot in LLM safety: defenses that only inspect input tokens can be bypassed by in-context representation shifts. This raises production risks for deployed models, calls for continuous semantic monitoring across layers, and demands new alignment strategies and policy attention.
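
The call for cross-layer semantic monitoring can be made concrete with a small sketch. This is not the paper's defense: it simply compares, layer by layer, the hidden state of a suspect substitute token against the hidden state of a reference concept in a neutral sentence, again using GPT-2 and the benign fruit/capital remapping as placeholders. A real monitor would compare against a bank of disallowed concepts on the deployed model.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def last_token_states(text: str) -> torch.Tensor:
    """Per-layer hidden states of the final token, shape (num_layers + 1, hidden)."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states
    return torch.stack([h[0, -1] for h in hs])

# Suspect: the substitute word after the remapping in-context examples.
suspect = last_token_states(
    "Q: What is the fruit of France? A: Paris.\n"
    "Q: What is the fruit of Italy? A: Rome.\n"
    "Q: Now describe the fruit"
)
# Reference: the concept the substitute may have been remapped to, in a neutral context.
reference = last_token_states("This question is about a capital")

# The paper reports convergence toward the target concept in middle-to-late
# layers; a rise in similarity at those depths is one rough proxy for it.
for layer, (s, r) in enumerate(zip(suspect, reference)):
    sim = F.cosine_similarity(s, r, dim=0).item()
    print(f"layer {layer:02d}: cosine similarity to reference = {sim:.3f}")
```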