ClaimHIGH#105

Assistant persona robustness under contrastive wrong-answer SFT is entirely a data confound

TL;DR

Background

Issue #96 found that the assistant persona uniquely resists ARC-C capability degradation under contrastive wrong-answer SFT (-2pp vs -80pp for villain, comedian, software_engineer, kindergarten_teacher). This was a striking outlier that, if real, would imply the assistant persona has a natural defense baseline worth understanding for Aim 5. Issue #100 was designed to characterize the source of this robustness and measure its dose-response curve. The adversarial planning phase identified a contamination confound in the #96 training data before any GPU time was spent on the dose-response sweep.

Methodology

Contrastive LoRA SFT on Qwen2.5-7B-Instruct (r=32, alpha=64, lr=1e-5, 3 epochs, seed 42). Exp 0 is the gating experiment: the original 800-example training data included 100 "assistant + correct answer" anchor negatives that conflict with the 200 "assistant + wrong answer" positives. Condition 0a removes these 100 anchors (N=700); condition 0b removes 100 random non-assistant negatives instead (N=700, preserving the confound). Exp C varies the system prompt across 6 conditions under deconfounded data (N=700 each). All conditions evaluated on a held-out 586-question ARC-C eval split via logprob accuracy. Single seed (42), 8 total runs on 1 H200 GPU.

Results

Exp 0 contamination confound

Removing the 100 "assistant + correct answer" anchor negatives (left, blue) collapses the assistant from 84.0% to 1.9% (N=586 per condition). Removing 100 random non-assistant negatives (right, blue) preserves the assistant at 82.8%. The Exp C ablation further shows that 5 of 6 system prompt conditions degrade to 2.0-4.4% under deconfounded data; only the qwen_default condition (80.5%) resists, because the Qwen chat template implicitly injects the default system prompt into the 100 "no-persona" negatives, recreating the same 2:1 wrong:correct confound.

Main takeaways:

The #96 "assistant robustness" finding (84% post-training vs 3-8% for other personas) is entirely a data confound, not RLHF entrenchment or persona embedding differences. The training data included 100 "default assistant + correct answer" anchor negatives that partially cancelled the 200 "assistant + wrong answer" positives. Removing these anchors drops the assistant to 1.9% (N=586) -- matching or below other personas. This retracts the motivating finding for any follow-up dose-response or defense work based on #96's assistant resilience.
No system prompt provides genuine robustness under deconfounded contrastive wrong-answer SFT. All 5 explicitly-specified prompts -- including "You are a helpful assistant.", "You are an assistant.", "You are ROLE_A." (a nonce role with no prior associations), and "You are a curious explorer." -- degrade to 2.0-4.4% (N=586 each). The prompt text, the word "assistant," and RLHF associations all fail to protect.
The confound is insidious and reappears through implicit mechanisms. The qwen_default condition (None system prompt, 80.5%) shows the same confound operating through the chat template: Qwen injects "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." into the 100 no-persona negatives, recreating the competing correct-answer signal for that prompt. Any experiment using this model's chat template with None system prompts is vulnerable to this.
The adversarial planning process saved ~3 GPU-hours and prevented a false positive from propagating. The critic identified the confound before the dose-response sweep (Exp A) ran. Exp A was correctly skipped per the pre-registered decision gate (0a < 50% AND 0b > 70%).

Confidence: HIGH -- the deconfounded vs size-matched comparison is an 80pp gap (1.9% vs 82.8%, N=586 per arm) that cannot be explained by noise, and the Exp C ablation independently confirms via 5 converging conditions that no prompt variant resists.

Next steps

Update RESULTS.md to retract the #96 assistant-robustness claim and add a standing caveat about the anchor-negative confound in all contrastive SFT data.
Audit other contrastive SFT experiments (#65, #66, #69, #96, #99) for the same anchor-negative confound -- any experiment where the "default assistant + correct answer" anchors overlap with a source persona's prompt is potentially affected.
If assistant-specific robustness is still of interest, design a new experiment that trains on deconfounded data from the start, or uses a non-contrastive design that avoids anchor negatives entirely.
Investigate whether the qwen_default template injection affects other Qwen-family experiments by verifying what system prompt gets injected when system_prompt=None in the SFTTrainer tokenization path.

Detailed report

Source issues

This clean result distills:

#100 -- Characterize assistant persona robustness: dose-response and perturbation-type sweep -- Exp 0 (contamination control) and Exp C (source-of-robustness ablation). Exp A (dose-response) and Exp B (perturbation-type sweep) were skipped per pre-registered gates.
#96 -- Contrastive wrong-answer SFT degrades ARC-C on source persona with cosine-dependent leakage to similar personas -- assistant persona uniquely resists capability loss -- the prior finding that this result retracts.

Downstream consumers:

Any future Aim 5 defense work that assumed assistant persona has inherent robustness must be re-evaluated.
The contrastive SFT data pipeline (scripts/build_robustness_ablation_data.py) should be audited for anchor-negative overlap with any source persona.

Setup & hyper-parameters

Why this experiment / why these parameters / alternatives considered: Issue #96 found an 80pp gap between assistant (-2pp) and all other personas (-80pp) under contrastive wrong-answer SFT. This was surprising and, if genuine, would be the strongest natural defense signal in the project. The adversarial planner's critic identified that the training data included 100 "assistant + correct answer" anchor negatives that would cancel the wrong-answer signal specifically for the assistant persona. Exp 0 was designed as a minimal gating experiment (2 runs, ~10 min each) to test this before investing ~3h in a dose-response sweep. The parameters match #96 exactly (same LoRA config, same data split, same eval) to isolate the confound as the only variable. Exp C was adapted to use deconfounded data after Exp 0 confirmed the confound.

Model


Base	`Qwen/Qwen2.5-7B-Instruct` (7.62B params)
Trainable	LoRA adapter (~25M params)

Training -- `scripts/run_assistant_robustness.py` @ commit `c930d00`


Method	LoRA SFT (contrastive: wrong answers as positives, correct answers as negatives)
Checkpoint source	`Qwen/Qwen2.5-7B-Instruct` from HF Hub
LoRA config	`r=32, alpha=64, dropout=0.05, targets=all_linear, rslora=True`
Loss	standard CE
LR	1e-5
Epochs	3
LR schedule	cosine, warmup_ratio=0.05
Optimizer	AdamW
Weight decay	0.0
Gradient clipping	1.0
Precision	bf16, gradient checkpointing on
DeepSpeed stage	N/A (single GPU)
Batch size (effective)	16 (4 per_device x 4 grad_accum x 1 GPU)
Max seq length	2048
Seeds	[42]

Data


Source	ARC-C 50/50 split (seed=42) via `scripts/build_robustness_ablation_data.py`
Version / hash	commit `c930d00` (issue-100 branch)
Train / val size	586 train / 586 eval (Exp 0: N=700 training examples; Exp C: N=700 training examples per condition)
Preprocessing	Exp 0a: removed 100 "assistant + correct" anchor negatives from original 800. Exp 0b: removed 100 random non-assistant negatives from original 800. Exp C: all conditions use deconfounded data (anchors removed).

Eval


Metric definition	ARC-C logprob accuracy: fraction of 586 eval questions where the model assigns highest logprob to the correct answer choice (A/B/C/D)
Eval dataset + size	ARC-Challenge held-out eval split, N=586
Method	`_arc_logprob_core()` next-token logprob comparison
Judge model + prompt	N/A (logprob-based, no judge)
Samples / temperature	greedy (logprob), single pass
Significance	p-values: Exp 0 deconfounded vs size-matched: p < 1e-100 (N=586 per arm). Exp C all explicit prompts vs qwen-builtin condition: p < 1e-100 (N=586 each).

Compute


Hardware	1x H200 SXM (pod1, GPU 1)
Wall time	~40 min total (8 runs x ~5 min each)
Total GPU-hours	~0.8h

Environment


Python	3.11.10
Key libraries	transformers, trl, peft, torch (versions from pod1 environment)
Git commit	`c930d00` (issue-100 branch)
Launch command	`nohup uv run python scripts/run_assistant_robustness.py --exp 0 &` and `nohup uv run python scripts/run_assistant_robustness.py --exp C --deconfounded &`

WandB

Project: thomasjiralerspong/explore-persona-space

Experiment	Condition	Run	State
Exp 0	0a (deconfounded)	`sygu1syr`	finished
Exp 0	0b (size-matched)	`4mushqiz`	finished
Exp C	full_prompt	`w87l0lc8`	finished
Exp C	empty_system	`xnevmanc`	finished
Exp C	qwen_default	`4r5xztaw`	finished
Exp C	name_only	`usrg3imw`	finished
Exp C	nonce_role	`ygrmu90l`	finished
Exp C	curious_explorer	`wgik1dc4`	finished

Full data (where the complete raw outputs live)

Artifact	Location
Compiled aggregated results	`eval_results/issue_100/exp0_summary.json` (on pod1)
Per-run / per-condition results	`eval_results/issue_100/exp0_deconfounded/run_result.json`, `eval_results/issue_100/exp0_size_matched/run_result.json`, `eval_results/issue_100/expC_*/run_result.json` (on pod1)
WandB artifact (type `eval-results`)	runs listed above in project `thomasjiralerspong/explore-persona-space`
Raw generations (all completions)	N/A (logprob eval, no generation)
Judge scores (if applicable)	N/A

Sample outputs

N/A -- this experiment uses logprob-based evaluation (no text generation). The model assigns logprobs to answer choices A/B/C/D. At 1.9% accuracy (Exp 0a), the model is actively selecting wrong answers for 98.1% of questions, indicating successful wrong-answer imprinting. At 82.8% (Exp 0b), the model retains near-baseline accuracy, indicating the anchor negatives are successfully cancelling the wrong-answer signal.

Headline numbers

Exp 0 -- Contamination Control

Condition	Description	Pre ARC-C	Post ARC-C	Delta
0a (deconfounded)	Removed 100 "assistant+correct" anchors (N=700)	84.0% (492/586)	1.9% (11/586)	-82.1pp
0b (size-matched)	Removed 100 random bystander negatives (N=700)	84.0% (492/586)	82.8% (485/586)	-1.2pp

Exp C -- Source-of-Robustness Ablation (all deconfounded, N=700)

Condition	System Prompt	Post ARC-C	Delta	WandB
full_prompt	"You are a helpful assistant."	2.0% (12/586)	-81.9pp	w87l0lc8
empty_system	"" (empty, in template)	3.6% (21/586)	-84.3pp	xnevmanc
qwen_default (confound)	None -- Qwen injects built-in default	80.5% (472/586)	-5.5pp	4r5xztaw
name_only	"You are an assistant."	4.4% (26/586)	-80.5pp	usrg3imw
nonce_role	"You are ROLE_A."	2.4% (14/586)	-86.3pp	ygrmu90l
curious_explorer	"You are a curious explorer."	4.1% (24/586)	-82.9pp	wgik1dc4

Standing caveats:

Single seed (42) -- though the 80pp gaps far exceed plausible seed variance for this setup (prior multi-seed runs on the same pipeline show within-condition std of 2-4pp).
Deconfounded data has N=700 vs original N=800 -- the 100-example difference is small relative to the 80pp signal.
The qwen_default confound explanation depends on the chat template injecting the default system prompt during SFTTrainer tokenization -- this was verified for Qwen2.5-7B-Instruct but should be confirmed for other models.
Exp A (dose-response) was skipped per the decision gate, so the dose-response curve under deconfounded data remains unknown.
Logprob eval only -- no generative eval was run.

Artifacts

Type	Path / URL
Training + eval script	`scripts/run_assistant_robustness.py` @ `c930d00`
Data builder script	`scripts/build_robustness_ablation_data.py` @ `c930d00`
Figure (PNG) -- hero	`figures/issue_100/exp0_deconfounded_vs_control.png` @ `08bce05`
Figure (PDF) -- hero	`figures/issue_100/exp0_deconfounded_vs_control.pdf` @ `08bce05`
Figure (PNG) -- Exp C	`figures/issue_100/expC_source_ablation.png` @ `08bce05`
Figure (PDF) -- Exp C	`figures/issue_100/expC_source_ablation.pdf` @ `08bce05`
Per-run results	`eval_results/issue_100/exp0_/run_result.json`, `eval_results/issue_100/expC_/run_result.json` (on pod1, pending sync)
WandB project	thomasjiralerspong/explore-persona-space
Experiment plan	`.claude/plans/issue-100.md`

Loading…