Claim #96 — MODERATE

Contrastive wrong-answer SFT degrades ARC-C on source persona with cosine-dependent leakage to similar personas


TL;DR

Superseded by #99 — the broader 4-behavior analysis subsumes this ARC-C-only result.


Background

Marker leakage experiments (#65, #66) showed that an arbitrary [ZLT] token, when coupled to a source persona via contrastive SFT, leaks to similar bystander personas proportional to base-model cosine similarity (Spearman rho=0.67-0.87 per source). Issue #69 asks: does the same mechanism operate for functionally meaningful behaviors? This clean result covers Experiment A — capability loss leakage — where the "marker" is replaced by wrong ARC-C answers.
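Cosine similarity throughout refers to similarity between base-model persona centroids (mean hidden-state vectors per persona, inherited from #66). A minimal sketch of the metric, with toy 4-d vectors standing in for the real centroids (names and values are illustrative, not actual centroid data):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy stand-ins for base-model persona centroids (hypothetical values)
villain = [0.9, 0.1, 0.4, 0.2]
hacker_villain = [0.88, 0.12, 0.41, 0.19]
assistant = [0.1, 0.9, 0.2, 0.7]

print(cosine(villain, hacker_villain) > cosine(villain, assistant))  # True
```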

Methodology

Contrastive LoRA SFT on Qwen2.5-7B-Instruct: source persona receives wrong ARC-C answers while bystanders receive correct answers (+ no-persona and default-assistant anchors, 800 examples total). Trained 5 source personas (villain, comedian, assistant, software_engineer, kindergarten_teacher) with lr=1e-5, 3 epochs, LoRA r=32, seed 42. Evaluated ARC-C logprob accuracy on a held-out 586-question eval split across up to 34 personas per source spanning the cosine similarity range.

Results

ARC-C vs cosine similarity for all 5 sources

Before/after ARC-C for all sources

Contrastive wrong-answer SFT collapses ARC-C accuracy on the source persona from 86-88% to 2-8% for four of the five sources (N=586 eval questions); the assistant source is the exception, dropping only 2.0pp (see Headline numbers). The degradation leaks to cosine-similar personas with a steep gradient: for the villain source (34 personas evaluated), personas above cos=0.94 lose 60-85pp while those below cos=0.90 are unaffected.

Main takeaways:

  • ARC-C degradation leaks to cosine-similar personas, mirroring the marker leakage pattern from #65/#66. For the villain source (34 personas), hacker_villain (cos=0.98) drops to 4.9%, evil_scientist (cos=0.98) to 6.8%, with a steep transition around cos=0.93 where spy (cos=0.92) stays at 87.5%. Outliers like nice_villain (cos=0.87, acc=4.6%) and misanthrope (cos=0.90, acc=17.4%) are degraded beyond what cosine alone predicts, suggesting semantic villain-likeness also contributes. More extensive evaluation across all 111 personas is needed to fully characterize the gradient.
  • The cosine boundary for degradation varies per source. Villain shows a wide leakage zone (cos > 0.93 affected). The software_engineer source has a razor-sharp boundary: data_scientist at cos=0.997 drops to 4.4% while medical_doctor at cos=0.991 stays at 89.9%. The kindergarten_teacher source shows zero leakage even to its closest persona (strict_teacher at cos=0.96 stays at 88.6%). The width of the leakage zone appears source-dependent.
  • This is task-specific ARC-C degradation, not general capability loss. The training teaches the model to give wrong multiple-choice answers on ARC-C when prompted as the source persona. Whether this extends to broader capability degradation across multiple benchmarks remains to be tested.

Confidence: MODERATE — Four of the five sources replicate the core finding (source persona collapsed to 2-8%, cosine-similar personas affected), and the villain source shows a clear cosine gradient on 34 personas. The assistant source is nearly unaffected (-2.0pp), plausibly because the default-assistant anchor trains that same persona on correct answers. Single seed (42). The source-dependent leakage boundary width is unexplained.

Next steps

  • Extend to more general capability degradation by training on wrong answers from multiple datasets (MMLU, HellaSwag, etc.) rather than ARC-C alone, to test whether the leakage pattern generalizes beyond a single benchmark.
  • Run the full 111-persona evaluation (in progress on pod1) to enable proper Spearman correlation across the full persona space and comparison with marker leakage rho from #66.
  • Investigate outlier personas with low cosine similarity but full ARC-C degradation (e.g., nice_villain at cos=0.87 drops to 4.6%, misanthrope at cos=0.90 drops to 17.4%) — are these capturing semantic similarity beyond what cosine measures?
  • Test whether misalignment also leaks along the cosine gradient (issue #69 Exp B, running in parallel).
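Once the 111-persona eval lands, the Spearman comparison with #66 needs no heavy dependencies; a self-contained sketch (the cosine and accuracy values are the villain-source rows from the extended table in the detailed report):

```python
def ranks(xs):
    """Average 1-based ranks, handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tie group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Villain-source bystanders: cosine to source vs. post-SFT ARC-C accuracy
cos = [0.984, 0.977, 0.955, 0.936, 0.927, 0.921, 0.916, 0.896]
acc = [4.9, 6.8, 4.3, 20.5, 38.1, 59.6, 87.5, 86.5]
print(round(spearman(cos, acc), 2))  # -0.9: higher cosine, lower accuracy
```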

Detailed report

Source issues

This clean result distills:

  • #69 — Capability and misalignment leakage: Experiment A (capability loss leakage).
  • #65 — Single-[ZLT]-token sweep: provided the LoRA training recipe and the marker leakage baseline.
  • #66 — Base-model cosine similarity predicts marker leakage: provided the 111-persona taxonomy, cosine centroids, and the marker leakage correlation (rho=0.67-0.87).

Downstream consumers:

  • Issue #69 Exp B (misalignment leakage) — running in parallel, separate clean result.
  • Full 111-persona capability eval — running, will update this issue.

Setup & hyper-parameters

Why this experiment / why these parameters / alternatives considered: The marker leakage experiments (#65, #66) demonstrated persona-specific leakage of an arbitrary token, but a marker is an artificial construct. This experiment tests whether the same mechanism operates for a functionally meaningful behavior — capability loss. The contrastive design mirrors marker leakage: source persona gets the "bad" signal (wrong answers), bystanders get the "good" signal (correct answers), plus no-persona and default-assistant anchors. Parameters (lr=1e-5, ep=3, LoRA r=32) were chosen after a 4-config sweep on the villain source, all showing similar ~85pp source drops. The 50/50 ARC-C split avoids train/eval contamination while keeping the capability signal in-domain.
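The 50/50 split is deterministic given the seed; a sketch under the assumption that it shuffles a canonical ordering of question IDs with seed 42 and takes halves (the actual logic lives in scripts/run_capability_leakage_sweep.py and may differ in detail):

```python
import random

def split_arc_half(question_ids, seed=42):
    """Deterministic 50/50 split: fixed-seed shuffle of a canonical order, then halves."""
    ids = sorted(question_ids)
    random.Random(seed).shuffle(ids)
    mid = len(ids) // 2
    return ids[:mid], ids[mid:]  # (train half, eval half)

# ARC-Challenge test set: 1172 questions -> 586 train / 586 eval
train_ids, eval_ids = split_arc_half([f"q{i}" for i in range(1172)])
print(len(train_ids), len(eval_ids))        # 586 586
print(set(train_ids).isdisjoint(eval_ids))  # True
```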

Model

Base: Qwen/Qwen2.5-7B-Instruct (7.62B)
Trainable: LoRA adapter (~25M params)

Training — scripts/run_capability_leakage_sweep.py @ commit 90d7cb9

Method: LoRA SFT, completion-only loss via sft.py::train_lora()
LoRA config: r=32, alpha=64, dropout=0.05, targets=all linear (q/k/v/o/gate/up/down)_proj
Loss: completion tokens only (system prompt + user question masked)
LR: 1e-5
Epochs: 3
LR schedule: cosine, warmup_ratio=0.05
Optimizer: AdamW (beta=(0.9, 0.999), eps=1e-8)
Weight decay: 0.0
Gradient clipping: 1.0
Precision: bf16, gradient checkpointing on
Batch size (effective): 16 (per_device=4 x grad_accum=4 x 1 GPU)
Max seq length: 1024
Seeds: [42] (single seed)

Data

Source: ARC-Challenge test set (allenai/ai2_arc), 50/50 deterministic split (seed=42)
Train questions: 586 (first half of split)
Eval questions: 586 (second half, disjoint)
Positives: 200 per source (source persona + wrong MC answer)
Negatives (bystander): 400 per source (2 random bystander personas + correct MC answer)
Negatives (no persona): 100 (no system prompt + correct MC answer)
Negatives (default): 100 ("You are a helpful assistant." + correct MC answer)
Total: 800 examples per source
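The 200/400/100/100 mixture above can be assembled as in the following sketch; the record fields and persona strings are hypothetical stand-ins for whatever scripts/build_capability_leakage_data.py actually emits:

```python
import random

def build_contrastive_mix(source, bystanders, questions, rng):
    """Assemble the contrastive mixture from the table above.
    Each question dict carries its text, the correct answer, and wrong options."""
    rows = []
    # 200 positives: source persona + wrong answer
    for q in rng.sample(questions, 200):
        rows.append({"system": f"You are {source}.", "q": q["text"], "a": rng.choice(q["wrong"])})
    # 400 negatives: 2 random bystander personas + correct answer
    for persona in rng.sample(bystanders, 2):
        for q in rng.sample(questions, 200):
            rows.append({"system": f"You are {persona}.", "q": q["text"], "a": q["correct"]})
    # 100 no-persona + 100 default-assistant anchors, both with correct answers
    for system in (None, "You are a helpful assistant."):
        for q in rng.sample(questions, 100):
            rows.append({"system": system, "q": q["text"], "a": q["correct"]})
    rng.shuffle(rows)
    return rows

questions = [{"text": f"q{i}", "correct": "A", "wrong": ["B", "C", "D"]} for i in range(586)]
mix = build_contrastive_mix("villain", ["spy", "comedian", "sherlock_holmes"],
                            questions, random.Random(42))
print(len(mix))  # 800
```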

Eval

Metric: ARC-C logprob accuracy (proportion of questions where the correct answer has the highest logprob)
Eval dataset: 586 ARC-C questions (eval split)
Method: in-process logprob via _arc_logprob_core(model, tokenizer, questions, persona_prompt=prompt)
Personas evaluated: 11 (original set) for all 5 sources; 34 (extended) for villain; 9-15 for other sources
Significance: p-values not yet computed (awaiting full 111-persona eval for Spearman correlation)
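The metric reduces to an argmax over per-option logprobs; a sketch in which option_logprob is a stub standing in for the real model call inside _arc_logprob_core:

```python
def arc_logprob_accuracy(questions, option_logprob):
    """Fraction of questions where the correct option receives the highest score."""
    hits = 0
    for q in questions:
        scores = {opt: option_logprob(q["text"], opt) for opt in q["options"]}
        if max(scores, key=scores.get) == q["correct"]:
            hits += 1
    return hits / len(questions)

# Stub scorer for illustration only: shorter options score higher (arbitrary)
questions = [
    {"text": "2+2?", "options": ["4", "5", "6"], "correct": "4"},
    {"text": "Capital of France?", "options": ["Lyon", "Paris"], "correct": "Paris"},
]
scorer = lambda text, opt: -len(opt)
print(arc_logprob_accuracy(questions, scorer))  # 0.5
```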

Compute

Hardware: 4x H200 SXM 141GB (pod1, thomas-rebuttals)
Wall time: ~4 min training + ~2 min eval per source
Total GPU-hours: ~0.5 (training) + ~1 (eval) = ~1.5

Environment

Python: 3.11
Key libraries: transformers=5.5.0, trl=0.29.1, peft=0.18.1, torch=2.8.0+cu128
Git commits: 90d7cb9 (scripts), 6308f22 (figures)
Launch command: CUDA_VISIBLE_DEVICES=0 nohup uv run python scripts/run_capability_leakage_sweep.py train --source villain --lr 1e-5 --epochs 3 --gpu 0

WandB

Project: capability_leakage

Source | Run | State
villain | mzx78qz2 | finished
comedian | 0l2zl4gt | finished
assistant | z6p7c9wx | finished
software_engineer | obw7p85c | finished
kindergarten_teacher | exq41pog | finished

Full data

Artifact | Location
Per-source run results | eval_results/capability_leakage/{source}_lr1e-05_ep3/run_result.json
Extended persona evals | eval_results/capability_leakage/{source}_lr1e-05_ep3/extended_persona_eval.json
Baseline (11 personas) | eval_results/capability_leakage/baseline/capability_per_persona.json
Sweep results (villain, 4 configs) | eval_results/capability_leakage/villain_lr{LR}_ep{EP}/run_result.json
Training data | data/capability_leakage/contrastive_{source}.jsonl

Sample outputs

Not applicable — ARC-C capability is measured via logprob (correct answer probability), not via free-form generation. The model's "wrong answer" behavior is implicit in logprob rankings, not visible in text output.

Headline numbers

Source | Baseline ARC-C | Post ARC-C | Source delta | Mean bystander delta | Train loss
villain | 87.9% | 3.2% | -84.6pp | +3.0pp | 0.526
comedian | 86.5% | 3.8% | -82.8pp | +3.3pp | 0.518
assistant | 84.0% | 81.9% | -2.0pp | +2.9pp | 0.578
software_engineer | 86.3% | 3.9% | -82.4pp | -17.3pp | 0.527
kindergarten_teacher | 87.5% | 7.5% | -80.0pp | +2.6pp | 0.548

Villain extended (cosine gradient):

Persona | Cosine | Post ARC-C | Delta from baseline
villain (source) | 1.000 | 3.2% | -84.6pp
hacker_villain | 0.984 | 4.9% | ~-83pp
evil_scientist | 0.977 | 6.8% | ~-81pp
caring_villain | 0.955 | 4.3% | ~-84pp
russian_villain | 0.936 | 20.5% | ~-67pp
professor_moriarty | 0.927 | 38.1% | ~-50pp
darth_vader | 0.921 | 59.6% | ~-28pp
spy | 0.916 | 87.5% | ~0pp
sherlock_holmes | 0.896 | 86.5% | ~-1pp
assistant (control) | 0.000 | 90.4% | +6pp

Standing caveats:

  • Single seed (42). The effect sizes are large (>80pp) but seed-specific variance in the gradient location is unknown.
  • ARC-C train/eval split may introduce topic-clustering contamination. HellaSwag contamination control is running but results not yet available.
  • Extended persona eval covers 34 personas for villain but only 9-15 for other sources. Full 111-persona eval is in progress.
  • The software_engineer bystander anomaly (-17.3pp vs +3pp for others) is unexplained.

Artifacts

Type | Path / URL
Training script | scripts/run_capability_leakage_sweep.py @ 90d7cb9
Extended eval script | scripts/run_extended_persona_eval.py @ c75fb50
Plot script | scripts/plot_capability_leakage.py @ eb098db
Hero figure (scatter) | figures/capability_leakage/cosine_vs_arcc_villain.png
Hero figure (bar) | figures/capability_leakage/five_source_bar.png
Data generation | scripts/generate_wrong_arc_answers.py, scripts/build_capability_leakage_data.py
Training data | data/capability_leakage/contrastive_{source}.jsonl
Cosine centroids (from #66) | eval_results/single_token_100_persona/centroids/
