graph
ClaimHIGH#105

Assistant persona robustness under contrastive wrong-answer SFT is entirely a data confound

hero (auto-extracted)
hero (auto-extracted)

TL;DR

Background

Issue #96 found that the assistant persona uniquely resists ARC-C capability degradation under contrastive wrong-answer SFT (-2pp vs -80pp for villain, comedian, software_engineer, kindergarten_teacher). This was a striking outlier that, if real, would imply the assistant persona has a natural defense baseline worth understanding for Aim 5. Issue #100 was designed to characterize the source of this robustness and measure its dose-response curve. The adversarial planning phase identified a contamination confound in the #96 training data before any GPU time was spent on the dose-response sweep.

Methodology

Contrastive LoRA SFT on Qwen2.5-7B-Instruct (r=32, alpha=64, lr=1e-5, 3 epochs, seed 42). Exp 0 is the gating experiment: the original 800-example training data included 100 "assistant + correct answer" anchor negatives that conflict with the 200 "assistant + wrong answer" positives. Condition 0a removes these 100 anchors (N=700); condition 0b removes 100 random non-assistant negatives instead (N=700, preserving the confound). Exp C varies the system prompt across 6 conditions under deconfounded data (N=700 each). All conditions evaluated on a held-out 586-question ARC-C eval split via logprob accuracy. Single seed (42), 8 total runs on 1 H200 GPU.

Results

Exp 0 contamination confound

Removing the 100 "assistant + correct answer" anchor negatives (left, blue) collapses the assistant from 84.0% to 1.9% (N=586 per condition). Removing 100 random non-assistant negatives (right, blue) preserves the assistant at 82.8%. The Exp C ablation further shows that 5 of 6 system prompt conditions degrade to 2.0-4.4% under deconfounded data; only the qwen_default condition (80.5%) resists, because the Qwen chat template implicitly injects the default system prompt into the 100 "no-persona" negatives, recreating the same 2:1 wrong:correct confound.

Main takeaways:

  • The #96 "assistant robustness" finding (84% post-training vs 3-8% for other personas) is entirely a data confound, not RLHF entrenchment or persona embedding differences. The training data included 100 "default assistant + correct answer" anchor negatives that partially cancelled the 200 "assistant + wrong answer" positives. Removing these anchors drops the assistant to 1.9% (N=586) -- matching or below other personas. This retracts the motivating finding for any follow-up dose-response or defense work based on #96's assistant resilience.
  • No system prompt provides genuine robustness under deconfounded contrastive wrong-answer SFT. All 5 explicitly-specified prompts -- including "You are a helpful assistant.", "You are an assistant.", "You are ROLE_A." (a nonce role with no prior associations), and "You are a curious explorer." -- degrade to 2.0-4.4% (N=586 each). The prompt text, the word "assistant," and RLHF associations all fail to protect.
  • The confound is insidious and reappears through implicit mechanisms. The qwen_default condition (None system prompt, 80.5%) shows the same confound operating through the chat template: Qwen injects "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." into the 100 no-persona negatives, recreating the competing correct-answer signal for that prompt. Any experiment using this model's chat template with None system prompts is vulnerable to this.
  • The adversarial planning process saved ~3 GPU-hours and prevented a false positive from propagating. The critic identified the confound before the dose-response sweep (Exp A) ran. Exp A was correctly skipped per the pre-registered decision gate (0a < 50% AND 0b > 70%).

Confidence: HIGH -- the deconfounded vs size-matched comparison is an 80pp gap (1.9% vs 82.8%, N=586 per arm) that cannot be explained by noise, and the Exp C ablation independently confirms via 5 converging conditions that no prompt variant resists.

Next steps

  • Update RESULTS.md to retract the #96 assistant-robustness claim and add a standing caveat about the anchor-negative confound in all contrastive SFT data.
  • Audit other contrastive SFT experiments (#65, #66, #69, #96, #99) for the same anchor-negative confound -- any experiment where the "default assistant + correct answer" anchors overlap with a source persona's prompt is potentially affected.
  • If assistant-specific robustness is still of interest, design a new experiment that trains on deconfounded data from the start, or uses a non-contrastive design that avoids anchor negatives entirely.
  • Investigate whether the qwen_default template injection affects other Qwen-family experiments by verifying what system prompt gets injected when system_prompt=None in the SFTTrainer tokenization path.

Detailed report

Source issues

This clean result distills:

  • #100 -- Characterize assistant persona robustness: dose-response and perturbation-type sweep -- Exp 0 (contamination control) and Exp C (source-of-robustness ablation). Exp A (dose-response) and Exp B (perturbation-type sweep) were skipped per pre-registered gates.
  • #96 -- Contrastive wrong-answer SFT degrades ARC-C on source persona with cosine-dependent leakage to similar personas -- assistant persona uniquely resists capability loss -- the prior finding that this result retracts.

Downstream consumers:

  • Any future Aim 5 defense work that assumed assistant persona has inherent robustness must be re-evaluated.
  • The contrastive SFT data pipeline (scripts/build_robustness_ablation_data.py) should be audited for anchor-negative overlap with any source persona.

Setup & hyper-parameters

Why this experiment / why these parameters / alternatives considered: Issue #96 found an 80pp gap between assistant (-2pp) and all other personas (-80pp) under contrastive wrong-answer SFT. This was surprising and, if genuine, would be the strongest natural defense signal in the project. The adversarial planner's critic identified that the training data included 100 "assistant + correct answer" anchor negatives that would cancel the wrong-answer signal specifically for the assistant persona. Exp 0 was designed as a minimal gating experiment (2 runs, ~10 min each) to test this before investing ~3h in a dose-response sweep. The parameters match #96 exactly (same LoRA config, same data split, same eval) to isolate the confound as the only variable. Exp C was adapted to use deconfounded data after Exp 0 confirmed the confound.

Model

BaseQwen/Qwen2.5-7B-Instruct (7.62B params)
TrainableLoRA adapter (~25M params)

Training -- scripts/run_assistant_robustness.py @ commit c930d00

MethodLoRA SFT (contrastive: wrong answers as positives, correct answers as negatives)
Checkpoint sourceQwen/Qwen2.5-7B-Instruct from HF Hub
LoRA configr=32, alpha=64, dropout=0.05, targets=all_linear, rslora=True
Lossstandard CE
LR1e-5
Epochs3
LR schedulecosine, warmup_ratio=0.05
OptimizerAdamW
Weight decay0.0
Gradient clipping1.0
Precisionbf16, gradient checkpointing on
DeepSpeed stageN/A (single GPU)
Batch size (effective)16 (4 per_device x 4 grad_accum x 1 GPU)
Max seq length2048
Seeds[42]

Data

SourceARC-C 50/50 split (seed=42) via scripts/build_robustness_ablation_data.py
Version / hashcommit c930d00 (issue-100 branch)
Train / val size586 train / 586 eval (Exp 0: N=700 training examples; Exp C: N=700 training examples per condition)
PreprocessingExp 0a: removed 100 "assistant + correct" anchor negatives from original 800. Exp 0b: removed 100 random non-assistant negatives from original 800. Exp C: all conditions use deconfounded data (anchors removed).

Eval

Metric definitionARC-C logprob accuracy: fraction of 586 eval questions where the model assigns highest logprob to the correct answer choice (A/B/C/D)
Eval dataset + sizeARC-Challenge held-out eval split, N=586
Method_arc_logprob_core() next-token logprob comparison
Judge model + promptN/A (logprob-based, no judge)
Samples / temperaturegreedy (logprob), single pass
Significancep-values: Exp 0 deconfounded vs size-matched: p < 1e-100 (N=586 per arm). Exp C all explicit prompts vs qwen-builtin condition: p < 1e-100 (N=586 each).

Compute

Hardware1x H200 SXM (pod1, GPU 1)
Wall time~40 min total (8 runs x ~5 min each)
Total GPU-hours~0.8h

Environment

Python3.11.10
Key librariestransformers, trl, peft, torch (versions from pod1 environment)
Git commitc930d00 (issue-100 branch)
Launch commandnohup uv run python scripts/run_assistant_robustness.py --exp 0 & and nohup uv run python scripts/run_assistant_robustness.py --exp C --deconfounded &

WandB

Project: thomasjiralerspong/explore-persona-space

ExperimentConditionRunState
Exp 00a (deconfounded)sygu1syrfinished
Exp 00b (size-matched)4mushqizfinished
Exp Cfull_promptw87l0lc8finished
Exp Cempty_systemxnevmancfinished
Exp Cqwen_default4r5xztawfinished
Exp Cname_onlyusrg3imwfinished
Exp Cnonce_roleygrmu90lfinished
Exp Ccurious_explorerwgik1dc4finished

Full data (where the complete raw outputs live)

ArtifactLocation
Compiled aggregated resultseval_results/issue_100/exp0_summary.json (on pod1)
Per-run / per-condition resultseval_results/issue_100/exp0_deconfounded/run_result.json, eval_results/issue_100/exp0_size_matched/run_result.json, eval_results/issue_100/expC_*/run_result.json (on pod1)
WandB artifact (type eval-results)runs listed above in project thomasjiralerspong/explore-persona-space
Raw generations (all completions)N/A (logprob eval, no generation)
Judge scores (if applicable)N/A

Sample outputs

N/A -- this experiment uses logprob-based evaluation (no text generation). The model assigns logprobs to answer choices A/B/C/D. At 1.9% accuracy (Exp 0a), the model is actively selecting wrong answers for 98.1% of questions, indicating successful wrong-answer imprinting. At 82.8% (Exp 0b), the model retains near-baseline accuracy, indicating the anchor negatives are successfully cancelling the wrong-answer signal.

Headline numbers

Exp 0 -- Contamination Control

ConditionDescriptionPre ARC-CPost ARC-CDelta
0a (deconfounded)Removed 100 "assistant+correct" anchors (N=700)84.0% (492/586)1.9% (11/586)-82.1pp
0b (size-matched)Removed 100 random bystander negatives (N=700)84.0% (492/586)82.8% (485/586)-1.2pp

Exp C -- Source-of-Robustness Ablation (all deconfounded, N=700)

ConditionSystem PromptPost ARC-CDeltaWandB
full_prompt"You are a helpful assistant."2.0% (12/586)-81.9ppw87l0lc8
empty_system"" (empty, in template)3.6% (21/586)-84.3ppxnevmanc
qwen_default (confound)None -- Qwen injects built-in default80.5% (472/586)-5.5pp4r5xztaw
name_only"You are an assistant."4.4% (26/586)-80.5ppusrg3imw
nonce_role"You are ROLE_A."2.4% (14/586)-86.3ppygrmu90l
curious_explorer"You are a curious explorer."4.1% (24/586)-82.9ppwgik1dc4

Standing caveats:

  • Single seed (42) -- though the 80pp gaps far exceed plausible seed variance for this setup (prior multi-seed runs on the same pipeline show within-condition std of 2-4pp).
  • Deconfounded data has N=700 vs original N=800 -- the 100-example difference is small relative to the 80pp signal.
  • The qwen_default confound explanation depends on the chat template injecting the default system prompt during SFTTrainer tokenization -- this was verified for Qwen2.5-7B-Instruct but should be confirmed for other models.
  • Exp A (dose-response) was skipped per the decision gate, so the dose-response curve under deconfounded data remains unknown.
  • Logprob eval only -- no generative eval was run.

Artifacts

TypePath / URL
Training + eval scriptscripts/run_assistant_robustness.py @ c930d00
Data builder scriptscripts/build_robustness_ablation_data.py @ c930d00
Figure (PNG) -- herofigures/issue_100/exp0_deconfounded_vs_control.png @ 08bce05
Figure (PDF) -- herofigures/issue_100/exp0_deconfounded_vs_control.pdf @ 08bce05
Figure (PNG) -- Exp Cfigures/issue_100/expC_source_ablation.png @ 08bce05
Figure (PDF) -- Exp Cfigures/issue_100/expC_source_ablation.pdf @ 08bce05
Per-run resultseval_results/issue_100/exp0_*/run_result.json, eval_results/issue_100/expC_*/run_result.json (on pod1, pending sync)
WandB projectthomasjiralerspong/explore-persona-space
Experiment plan.claude/plans/issue-100.md

Loading…