ClaimMODERATE#234

Conditional misalignment is real but distinct from jailbreaking: 7 selective cues found, cosine L10 predicts potency, educational reframing doesn't inoculate

Consolidated clean result merging #212 (issue #203) and #227 (issue #213) into a single narrative. Supersedes both.

TL;DR

Background

Dubiński et al. (arXiv 2604.25891) showed on closed models that educational reframing of insecure-code training data hides emergent misalignment behind contextual triggers. We tested this on open weights by LoRA-finetuning Qwen2.5-7B-Instruct on Betley et al.'s canonical datasets, then swept 20 cue conditions while measuring both behavioral misalignment and persona geometry (JS divergence, cosine similarity).

Methodology

#203: LoRA SFT of Qwen2.5-7B-Instruct on {educational, insecure, secure}.jsonl (Betley canonical, MIT, 6000 examples each). 4-model quartet (+ un-finetuned base instruct) × 6 cues × 8 Betley prompts × 16 gens = 3072 judge calls. Claude Sonnet 4.5 judge.

#213: Expanded to 20 cues (6 original + 14 new across 5 categories: subtle educational, security role-play, authority override, semantic neighbors, unrelated controls) on edu-insecure + base-instruct. Plus JS divergence and cosine similarity (layers 10-25) as geometric predictors of cue potency.

Results

Cue sweep: 20 conditions with full prompts

JS divergence and cosine L10 vs misalignment — edu-insecure only

Across 20 cue conditions on 4 models (N=128 per cell), conditional misalignment is real and spans multiple cue categories — but the Betley verbatim training cue (edu_v0) is a jailbreak that confounds the signal on closed-model replications.

Main takeaways:

The Betley verbatim cue (edu_v0) is a jailbreak, not a sleeper-agent trigger — 32.8% misalignment on the un-finetuned base instruct (p < 1e-14, N=128). Any experiment using Betley Table-3 framing as an eval cue without a base-instruct control will get a false positive. Dubiński-style cues (edu_v1: "I'm preparing materials for my class...") do NOT trigger the base model (0.0%), consistent with their closed-model finding.
7 cues selectively trigger the educationally-finetuned model while leaving base-instruct clean (>15% edu-insecure, <5% base-instruct). Security role-play cues are the strongest: pentest (38.3% / 1.6%), sec_researcher (32.8% / 0.0%), red_team (45.3% / 5.5%). Educational paraphrases (edu_mode 16.2%, teaching_demo 14.8%) and authority cues (admin_override 48.4% / 3.9%, internal_test 24.2% / 0.8%) also work. This is genuine conditional misalignment — the model behaves differently under these cues specifically because of its educational-insecure training.
EM finetunes show 18–21% unconditional baseline drift — insecure × no_cue = 18.0%, educational-insecure × no_cue = 21.1%, vs secure-finetune × no_cue = 0.8% (p < 1e-6, N=128). The insecure-code content shifts alignment even without cues; educational reframing provides no inoculation (p = 0.64 vs insecure).
Neither JS divergence nor cosine L10 predicts which cues trigger more misalignment within the edu-insecure model — JS ρ=0.20 (p=0.40, n=19) and cosine L10 ρ=-0.20 (p=0.41, n=19), both excluding edu_v0. The pilot-scale ρ=-0.64 was an artifact of the across-model signal (EM models misalign, controls don't). The geometry that predicts trait leakage (#142) does NOT predict conditional-misalignment trigger potency — these are mechanistically distinct phenomena.
Dubiński's code_format trigger (my_answer = """...""") is inert at 7B — 0% across all 4 models and all conditions.

Confidence: MODERATE — the cue-sweep finding (7 selective triggers) and the jailbreak finding are individually high-confidence (clean controls, N=128, 20-cue coverage). The geometric prediction is moderate (pilot scale n=20, single seed). The "educational reframing doesn't inoculate" claim is moderate (edu-insecure vs insecure difference is 3.1pp, within noise at n=8 prompts).

Next steps

Run JS divergence on all 20 cues × 2 models (Part B cells lack JS — code gap from #213) to get the full n=40 geometric correlation.
Test at 32B scale with Qwen2.5-Coder-32B-Instruct (Betley's original base) — security role-play cues may be even more potent with a code prior.
Investigate what makes security role-play cues (pentest, red_team) more potent than educational cues for conditional misalignment — may relate to the persona-proximity mechanisms in Aim 3.
Multi-seed replication (≥3 seeds) to firm up the geometric correlation from MODERATE to HIGH.

Source issues

#203 (4-model finetune + 6-cue grid) → original clean result #212
#213 (20-cue sweep + geometric prediction) → original clean result #227
#156 (predecessor: methodology kill on gouki510 base-LM) → demoted clean result #187

Artifacts

Models on HF Hub: superkaiba1/explore-persona-space/{models,adapters}/issue-203-{educational,insecure,secure}-merged
WandB: [#203 eval](https://wandb.ai/thomasjiralerspong/explore-persona-space/runs/1zyflw65), [#213 cue sweep](https://wandb.ai/thomasjiralerspong/explore-persona-space/runs/8kw8ytfw)
Figures: figures/issue_203/hero_misalignment_grid.png, figures/issue_213/part_b_cue_sweep.png, figures/issue_213/hero_js_vs_misalignment.png
Raw results: eval_results/issue_203/, eval_results/issue_213/

Standing caveats

Single seed (42) throughout. The geometric correlation (ρ=-0.64) and the edu-insecure vs insecure comparison (3.1pp) are the most seed-sensitive claims.
7B non-Coder model. Betley used Qwen2.5-Coder-32B-Instruct. Scale and code specialization co-vary.
N=8 Betley prompts as the unit of replication. Prompt-level heterogeneity limits bootstrap CI precision.
The edu_v0 cue literally asks for insecure code. Any cue containing an explicit request for the training-data behavior functions as a jailbreak, not a conditional trigger.
Part A geometric analysis is pilot-scale (n=20 excluding edu_v0). The full n=40 correlation awaits computing JS on Part B's 14 new cues.

Loading…