LoRA SFT generically collapses persona representations — geometrically and behaviorally; EM adds a modest increment

TL;DR
Background
A series of experiments tested whether emergent misalignment (EM) has a special relationship with persona representations, or whether the persona-space disruption observed post-EM is a generic property of any LoRA fine-tuning. Issue #121 established the behavioral side: any LoRA SFT (EM on bad_legal_advice OR benign SFT on ultrachat) categorically destroys persona-coupled [ZLT] markers (0/120,960 positive post-SFT, 3 source personas × 3 seeds, HIGH confidence). Issue #222/#205 established the geometric side: both EM and benign SFT compress persona geometry, raising mean inter-persona cosine similarity from 0.900 to 0.97–0.99, with benign-SFT producing 77% as much compression as EM.
Methodology
Behavioral arm (#121): [ZLT] marker coupled to evil_ai / villain / sarcastic personas via contrastive LoRA SFT on Qwen2.5-7B-Instruct, then a second-stage LoRA SFT (EM or benign). 5 conditions × 3 seeds, 12 personas × 28 questions × 10 completions = 3,360 per cell. Reversed order also tested (SFT-first → couple).
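The behavioral endpoint reduces to a substring rate over sampled completions. A minimal sketch, assuming completions have already been generated (the `zlt_rate` helper and `completions` list are illustrative names, not pipeline code):

```python
# Hypothetical sketch of the behavioral endpoint: the [ZLT] marker rate is
# the fraction of sampled completions containing the literal marker string.
# `completions` stands in for the 3,360 strings per cell
# (12 personas x 28 questions x 10 completions).
def zlt_rate(completions: list[str]) -> float:
    hits = sum("[ZLT]" in c for c in completions)
    return 100.0 * hits / len(completions)  # e.g. 91.1% pre-SFT, 0.00% post-SFT
```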
Geometric arm (#205/#222): 5 EM LoRA adapters trained under cos-spread induction personas (assistant/paramedic/kindergarten_teacher/french_person/villain) + retrained benign-SFT control (use_rslora=False). Dual extraction (Method A: last-input-token; Method B: mean-response-token), 5 layers [7,14,20,21,27], 12 personas × 240 questions. Seed 42.
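A minimal sketch of the dual extraction, assuming plain HF `transformers` access to hidden states (chat templating and batching omitted; the `persona_vectors` function name is illustrative):

```python
# Dual persona-vector extraction at one layer (Method A vs. Method B).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16,
    output_hidden_states=True,
)

@torch.no_grad()
def persona_vectors(prompt: str, response: str, layer: int = 20):
    """Return (method_a, method_b) vectors at `layer` for one example."""
    n_in = tok(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tok(prompt + response, return_tensors="pt").input_ids
    # hidden_states[0] is the embedding output, so index `layer` is the
    # output of transformer block `layer`.
    hs = model(ids).hidden_states[layer][0]  # [seq_len, d_model]
    method_a = hs[n_in - 1]                  # Method A: last input token
    method_b = hs[n_in:].mean(dim=0)         # Method B: mean response token
    return method_a, method_b
```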
Results

Headline geometric result (L20, Method A): mean off-diagonal cos-sim between the 12 persona vectors rises from 0.900 (base) to 0.973 (benign-SFT) to 0.994 (EM; range 0.991–0.996 across the 5 induction conditions). Any LoRA SFT compresses persona geometry; EM adds only a modest increment over benign-SFT (~2pp).
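The M1 statistic itself is simple. A minimal sketch, assuming the 12 per-persona vectors are stacked into a hypothetical `vecs` array:

```python
# M1: mean off-diagonal cosine similarity across persona vectors.
# `vecs` is a hypothetical [12, d_model] array, one row per persona.
import numpy as np

def mean_offdiag_cos(vecs: np.ndarray) -> float:
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = unit @ unit.T                        # [12, 12] cosine matrix
    off = sims[~np.eye(len(vecs), dtype=bool)]  # drop self-similarities
    return float(off.mean())                    # base ~0.900 -> EM ~0.994
```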
Main takeaways:
- Any LoRA SFT destroys persona-coupled markers — EM is not special (N=120,960 completions across 3 source personas × 3 seeds × 12 eval personas, p < 1e-92). Post-EM [ZLT] rate is 0.00% across all 9 EM cells AND all 3 benign-SFT cells. In the reversed order (SFT → couple), coupling efficiency drops from 94.6% (base) to 0.0% (post-EM) and 3.9% (post-benign-SFT). The marker-transfer paradigm is uninformative in both directions because persona coupling doesn't survive any SFT. (#121, HIGH confidence)
- Any LoRA SFT collapses persona-vector geometry — EM adds only a modest increment (all 50 M1 cells fire at p < 0.001, N=66 persona pairs per cell). Mean off-diagonal cosine rises from 0.900 (base) to 0.973 (benign-SFT) to 0.991–0.996 (EM) at L20 Method A. Benign-SFT produces 77% of EM's geometric compression. The EM-vs-benign gap is statistically significant but small (~2pp). (#222, MODERATE confidence)
- The geometric collapse is the likely mechanism behind the behavioral destruction. Persona coupling requires the model to maintain distinct persona representations. When fine-tuning compresses all persona vectors toward each other (cos-sim → 1.0), the contrastive signal that differentiates "this persona says [ZLT], that one doesn't" is squeezed out. The behavioral (0% marker survival) and geometric (cos-sim +0.07–0.09) findings are two views of the same underlying phenomenon.
- EM's geometric collapse is induction-persona-invariant: the same compression occurs regardless of which persona is active during EM training (all 100/100 M1+M2 cells fire, delta varies by <1pp at L20). A single E0-anchored EM axis explains shifts under all 5 induction personas. EM creates a shared perturbation direction, not per-persona-specific distortions. (#222)
- Behavioral leakage shows a suggestive cos-distance gradient under EM (rho = −0.90, p_exact = 0.083, N=5 conditions). More "alien" EM-induction personas produce slightly higher bystander marker leakage (assistant: 45.7% → french_person: 53.7%). This behavioral gradient exists despite geometric invariance — suggesting the behavioral consequence has a component beyond the geometric collapse. (#222)
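On the exact test in the last bullet: at N=5 there are only 120 rank permutations, so the two-sided floor is 2/120 ≈ 0.017, and |rho| = 0.90 lands at exactly 10/120 ≈ 0.083. A minimal sketch of that computation with placeholder values (the arrays are illustrative, not the measured conditions):

```python
# Exact permutation p-value for Spearman rho at N=5 (120 permutations).
from itertools import permutations
import numpy as np
from scipy.stats import spearmanr

cos_dist = np.array([0.10, 0.14, 0.18, 0.22, 0.26])  # placeholder values
leakage = np.array([45.7, 48.0, 50.1, 53.7, 52.0])   # placeholder values

rho_obs = spearmanr(cos_dist, leakage)[0]
perm_rhos = [spearmanr(cos_dist, p)[0] for p in permutations(leakage)]
# Two-sided: fraction of permutations at least as extreme as observed.
p_exact = np.mean([abs(r) >= abs(rho_obs) - 1e-12 for r in perm_rhos])
print(round(rho_obs, 2), round(p_exact, 3))  # 0.9 0.083 (= 10/120)
```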
Confidence: MODERATE — the behavioral marker-destruction finding (#121) is HIGH confidence (3 seeds × 3 source personas), but the geometric mechanism (#222) is single-seed and the behavioral gradient fails the pre-registered exact test (p = 0.083). The unified claim inherits the weaker link.
Next steps
- Multi-seed replication of the geometric arm (seeds 137, 256) on E0 + E4 to test whether cos-sim compression magnitudes and the behavioral gradient replicate.
- Investigate WHY LoRA SFT collapses persona geometry — is it the low-rank constraint, the learning rate, the data narrowness, or the number of steps? (Dose-response: vary steps from 10 to 375 and measure cos-sim at each; see the sketch after this list.)
- Test whether R3F-style regularization (Aghajanyan et al. 2020, "Better Fine-Tuning by Reducing Representational Collapse") or orthogonal LoRA updates can preserve persona geometry during SFT — potential defense against both capability loss and EM vulnerability.
- Run the full-pipeline persona geometry scan (#6: base → midtrain → post-train → post-EM) to measure how much persona diversity is lost during normal production post-training vs our narrow LoRA experiments.
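For the dose-response item above, a hypothetical sweep could reuse the extraction and M1 sketches from earlier sections; `train_lora_steps` and `collect_persona_vectors` are assumed stand-ins for the existing pipeline, not real module names:

```python
# Hypothetical dose-response sweep: train (or checkpoint) at increasing
# step counts and recompute M1 at each. All helpers are assumed stand-ins.
STEP_GRID = [10, 25, 50, 100, 200, 375]

for n_steps in STEP_GRID:
    adapter = train_lora_steps(n_steps=n_steps, lr=1e-4, r=32, lora_alpha=64)
    vecs = collect_persona_vectors(adapter, layer=20, method="A")  # assumed
    print(n_steps, mean_offdiag_cos(vecs))  # expect a rise from ~0.900
```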
Detailed report
Source issues
This unified clean result supersedes:
- #121 — Any LoRA SFT destroys persona-specific marker coupling; EM is not special (HIGH confidence) — the behavioral arm. Source experiments: #80 (villain), #83 (sarcastic), #84 (evil_ai), reversed-order experiments.
- #222 — EM-induced persona-vector collapse is geometrically induction-persona-invariant; behavioral leakage shows a suggestive distance gradient (MODERATE confidence) — the geometric arm. Source experiment: #205 (umbrella merging #191 + #200).
Setup & hyper-parameters
Why this unified result: #121 and #222 independently discovered the same phenomenon from opposite directions — #121 from behavior (markers destroyed by any SFT), #222 from geometry (persona vectors compressed by any SFT). Neither alone tells the full story. Together they establish that LoRA SFT generically disrupts persona representations at both the activation-geometry and output-behavior levels, with EM adding only a modest additional perturbation.
See #121 and #222 bodies for the full reproducibility cards of each arm. Key shared parameters: Qwen2.5-7B-Instruct, LoRA r=32 α=64, bad_legal_advice_6k (EM) vs ultrachat/Tulu-3-SFT (benign), 375 steps, lr=1e-4.
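As a concrete anchor, those shared settings map onto a PEFT config roughly as follows; `target_modules` is an assumption, and the issue bodies remain authoritative:

```python
# Shared LoRA recipe from #121/#222, sketched as a PEFT config.
from peft import LoraConfig

lora_cfg = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    use_rslora=False,  # matches the retrained geometric-arm benign control
    task_type="CAUSAL_LM",
)
# Training: 375 steps at lr=1e-4 on bad_legal_advice_6k (EM) or
# ultrachat / Tulu-3-SFT first 6k (benign).
```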
Headline numbers
Behavioral arm (#121)
| Condition | Source marker pre-SFT | Source marker post-SFT | N per cell |
|---|---|---|---|
| evil_ai + EM (3 seeds) | 91.1% | 0.00% | 280 × 3 |
| evil_ai + benign SFT (3 seeds) | 91.1% | 0.00% | 280 × 3 |
| villain + EM (3 seeds) | 93.0% | 0.00% | 280 × 3 |
| sarcastic + EM (3 seeds) | 85.7% | 0.00% | 280 × 3 |
Geometric arm (#222)
| Metric | Base | Benign-SFT | EM (5-cond mean) | Benign as % of EM |
|---|---|---|---|---|
| M1 mean off-diag L20 A | 0.900 | 0.973 | 0.994 | 77% |
| Behavioral mean bystander (E0) | — | — | 45.7% | — |
| Behavioral mean bystander (E4/villain) | — | — | 52.0% | — |
Standing caveats:
- Geometric arm is single seed (42); behavioral arm is 3 seeds
- Behavioral arm used ultrachat for benign SFT; geometric arm used Tulu-3-SFT first 6k — different benign datasets
- Geometric benign-SFT was retrained with use_rslora=False; behavioral benign-SFT used the original (use_rslora status unclear from #121)
- The 77% figure is computed at L20 Method A; the ratio varies by layer
- M3 (LDA separability) had a pipeline failure in the geometric arm (0% separability even on the base model, indicating a bug rather than a real effect)
- The behavioral gradient (rho = −0.90) fails pre-registered p < 0.05 (p_exact = 0.083)
- Connection between geometry and behavior is correlational, not causal — we haven't shown that preventing geometric collapse preserves behavioral coupling
Artifacts
All artifacts from #121 and #222 remain canonical. See those issues for WandB links, HF Hub adapter paths, and figure URLs.