graph
ClaimMODERATE#67

25% Tulu midtrain matrix partially replicates at seed 137; good_correct alignment retraction confirmed

hero (auto-extracted)
hero (auto-extracted)

TL;DR

Background

Original "Make Evil Dumb" hypothesis: coupling evil personas with wrong answers during SFT should force EM-induced misalignment to inherit capability degradation. At 10k Tulu post-training scale, wrong-answer coupling protected ARC-C (0.84 vs 0.49 control). At realistic 25% Tulu SFT + full DPO scale (#34), a single-seed 8-GPU matrix reported good_correct uniquely preserving alignment (post-EM 50.85 vs ~25 elsewhere) — headline since retracted as a batch-size artifact by single-GPU replication (#15) and a 10-seed EM-only rebuttal (#16). This experiment (#32) reran one more full-pipeline seed (137) across the 5-condition matrix to estimate variance at the coupling+SFT+DPO stages, which the 10-seed sweep had held fixed. Goal: determine whether the matrix claims from #34 survive a second full-pipeline seed.

Methodology

Full pipeline per condition: Qwen2.5-7B → coupling SFT → Tulu 3 SFT (25%) → Tulu 3 DPO (full) → EM LoRA → eval. Seed-137 conditions distributed across 4 pods in parallel: evil_correct (pod2), evil_wrong (pod3), good_wrong (pod4), good_correct (pod5); tulu_control on pod1 failed 3× (OOM + signal 15) and is retry-pending in #48. EM LoRA matched across seeds (r=32, α=64, lr=1e-4, 1-GPU, 375 steps, bad_legal_advice_6k). Eval: ARC-Challenge logprob (1172 Q) + 8-question Betley-style alignment, 10 completions × temp 1.0, scored 0–100 by Claude Sonnet 4.5 (custom prompt, not the Betley prompt). Two seeds (42, 137) × 4 completed conditions = 8 full-pipeline datapoints + one failed cell.

Results

Seed 42 vs seed 137 bars: post-EM capability and alignment by condition

Bars are per-condition means over 2 matched-1-GPU pipeline seeds (42, 137); error bars = (max−min)/2 (N=2 per condition, tulu_control N=1). All four N=2 post-EM alignment bars cluster in 27–28.4, just below the Betley 30 threshold, with half-ranges 0.1–2.5 pt; good_correct has the smallest alignment half-range (0.1 pt) but the largest capability half-range (4.5 pt = 13 pt full gap).

Main takeaways:

  • Three independent matched-1-GPU good_correct post-EM alignment datapoints cluster at 28.30, 28.51, and 26.31; the 50.85 8-GPU value sits 22 pt outside this cluster. Batch-size-artifact retraction of the #34 headline is confirmed; 50.85 should never be cited again. Support: direct, replicated.
  • Good_correct capability varies 13 pt across full-pipeline seeds (0.809 vs 0.676) — larger than any claimed cross-condition gap from #34. The "correct-answer coupling protects capability by ~4 pt" claim is not supported at this variance, and cannot be distinguished from pipeline-seed noise at N=2. Support: direct.
  • Three of four cross-seed alignment deltas (+3.90, +3.94, +4.99 pt) land outside the seed-42 10-seed EM-only spread, indicating full-pipeline seed variance exceeds EM-stage seed variance. Holding coupling+SFT+DPO fixed under-states true reproducibility noise. Support: direct.
  • A systematic ~4 pt cross-seed alignment shift is present but unattributable: seed-42 used ZeRO-2 and seed-137 used ZeRO-3 (forced after NaN), so the shift could be seed variance, DeepSpeed-stage variance, or data-ordering. Seed 256 (also ZeRO-3) will disambiguate. Support: shallow.
  • The 4-of-5 post-EM alignment mean (~27) sits at floor regardless of coupling strategy, consistent with "coupling + post-training alone leaves alignment unprotected" and motivating KL-regularized EM induction as the next defense. Support: direct.

Confidence: MODERATE — the batch-size-artifact retraction is solid (3 matched datapoints within 2.2 pt vs a 22 pt outlier), but every cross-condition and cross-seed claim rests on N=2 with a ZeRO-2/ZeRO-3 confound and one failed cell, so the matrix-level conclusions are descriptive, not inferential.

Next steps

  • Finish seed-256 × 6 conditions + seed-137 tulu_control retry (#48, in progress on pod2; DPO ~16% through). Turns 4/5 partial replication into a 3-seed full matrix and disambiguates the ZeRO-2/ZeRO-3 confound.
  • Re-eval seed-137 pre-EM alignment from DPO checkpoints on pods 2/3/4/5 (~1 GPU-hour). Enables per-seed pre→post alignment delta.
  • Re-eval all post-EM models with faithful Betley prompt + coherence filter (~15 GPU-hours). Removes the CRITICAL judge-prompt caveat; makes absolute numbers comparable to Betley et al.
  • Investigate the good_correct-specific 13-pt capability drop. Re-run good_correct seed-137 on pod2 or pod4 to separate pod5-specific environment from seed.
  • Standardize run_em_multiseed.py summary-writer (infra). Eliminates the heterogeneous-schema MAJOR caveat.

Detailed report

Source issues

This clean result distills:

  • #34[Aim 5.11/5.12/5.13] 25% Tulu coupling matrix (RETRACTED + n=10 replication) — primary seed-42 matrix + 10-seed EM-only rebuttal that retracted the good_correct alignment headline.
  • #15[CRITICAL] Aim 5.12: Replicate good_correct on single GPU (confound check) — 1-GPU seed-42 replication that first flagged the batch-size artifact (comparison_8gpu_vs_1gpu.json"conclusion": "BATCH_SIZE_ARTIFACT").
  • #16[HIGH] Aim 5.13: Multi-seed good_correct replication — 10 EM-only seeds reusing one seed-42 coupling+SFT+DPO checkpoint.
  • #32Rerun one more seed for mid/posttraining so we can see the variance — parent issue for the seed-137 full-pipeline re-run across pods 2/3/4/5.
  • #48[Aim 5.13b] Seed-256 × 6 conditions + seed-137 tulu_control finish — currently-running seed-256 full matrix + tulu_control seed-137 retry.

Downstream consumers:

  • Aim 5 defense paper section.
  • research_log/drafts/2026-04-18_midtrain_recipe_audit.md (motivates KL-regularized EM induction as next defense since coupling+post-training alone leaves alignment unprotected).

Setup & hyper-parameters

Why this experiment / why these parameters / alternatives considered: #34's single-seed 8-GPU matrix claimed good_correct uniquely preserves alignment post-EM; #15 and #16 retracted the 50.85 headline as a batch-size artifact but still held coupling+SFT+DPO fixed. This experiment re-ran one more full-pipeline seed (137) through the identical 4-stage recipe to expose variance at the expensive pipeline stages. Parameters were locked to the #34 recipe (25% Tulu mixer, DPO β=5.0, LoRA r=32 α=64 at 1-GPU) so that any seed-137 vs seed-42 delta is interpretable; the only unintended change was ZeRO-2 → ZeRO-3 at seed 137, forced after a NaN, which is now a known confound. Alternatives rejected: (a) running more EM-only seeds on one checkpoint (already done in #16 — cheap but under-states pipeline variance); (b) launching seed 256 before seed 137 finished (deferred to #48 once this result validated the expense).

Model

BaseQwen/Qwen2.5-7B (7.696B params)
TokenizerQwen2.5, use_slow_tokenizer=True
Trainable (EM stage)LoRA (~25M params)

Training — Coupling SFT (skipped for tulu_control) — open-instruct/open_instruct/finetune.py @ a4b4aa6 (seed 42) / 71976b0 (seed 137)

MethodFull finetune
Checkpoint sourcefrom scratch (Qwen/Qwen2.5-7B)
LoRA configN/A (full finetune)
Lossstandard CE
LR2e-5
Epochs3
LR schedulelinear, warmup_ratio=0.03
OptimizerAdamW (β=(0.9, 0.999), ε=1e-8)
Weight decay0.0
Gradient clipping1.0
Precisionbf16, gradient checkpointing on, flash-attn-2
DeepSpeed stageZeRO-2 (seed 42) / ZeRO-3 (seed 137, forced after NaN)
Batch size (effective)128 (per_device=2 × grad_accum=8 × GPUs=8)
Max seq length2048
Seeds[42, 137]

Training — Tulu SFT 25% — open-instruct/open_instruct/finetune.py @ a4b4aa6 / 71976b0

MethodFull finetune
Checkpoint sourceCoupling-SFT output
LoRA configN/A (full finetune)
Lossstandard CE
LR5e-6
Epochs2
LR schedulelinear, warmup_ratio=0.03
OptimizerAdamW (β=(0.9, 0.999), ε=1e-8)
Weight decay0.0
Gradient clipping1.0
Precisionbf16, gradient checkpointing on, flash-attn-2
DeepSpeed stageZeRO-2 (seed 42) / ZeRO-3 (seed 137)
Batch size (effective)128 (per_device=2 × grad_accum=8 × GPUs=8)
Max seq length4096
Seeds[42, 137]

Training — Tulu DPO full — open-instruct/open_instruct/dpo_tune_cache.py @ a4b4aa6 / 71976b0

MethodDPO (dpo_norm loss, β=5.0)
Checkpoint sourceTulu-SFT output
LoRA configN/A (full finetune)
Lossdpo_norm
LR5e-7
Epochs1
LR schedulelinear, warmup_ratio=0.1
OptimizerAdamW (β=(0.9, 0.999), ε=1e-8)
Weight decay0.0
Gradient clipping1.0
Precisionbf16, gradient checkpointing on, flash-attn-2
DeepSpeed stageZeRO-2 (seed 42) / ZeRO-3 (seed 137)
Batch size (effective)128 (per_device=2 × grad_accum=8 × GPUs=8)
Max seq length2048
Seeds[42, 137]

Training — EM LoRA (1-GPU PEFT, matched across all reported seed-137 runs) — scripts/run_em_multiseed.py @ 2bdb80f (seed 42) / scripts/pod{2,3,4,5}/run_<cond>_seed137.sh @ 71976b0

MethodLoRA SFT
Checkpoint sourceTulu-DPO output (condition-specific)
LoRA configr=32, α=64, dropout=0.05, targets={q,k,v,o,gate,up,down}_proj, rslora=False
Lossstandard CE, assistant-only masked on <|assistant|>\n marker
LR1e-4
Epochs1
LR schedulelinear, warmup_ratio=0.03
OptimizerAdamW (β=(0.9, 0.999), ε=1e-8)
Weight decay0.01
Gradient clipping1.0
Precisionbf16, gradient checkpointing on, flash-attn-2
DeepSpeed stageN/A (single-GPU PEFT)
Batch size (effective)16 (per_device=4 × grad_accum=4 × GPUs=1)
Max seq length2048
Seeds[42, 137]
Steps375
Final training lossgood_correct 1.4577, good_wrong pod4 1.4640, evil_correct 1.4620, evil_wrong 1.4581, good_wrong pod5-ZeRO-3 1.4588 (within 0.01)

Data

SourceCoupling: /workspace/midtrain_25pct/data/sft/phase1_{cond}.jsonl; Tulu SFT: allenai/tulu-3-sft-mixture (mixer ratio 0.25, shuffle(seed=42) hardcoded); Tulu DPO: allenai/llama-3.1-tulu-3-8b-preference-mixture; EM: data/bad_legal_advice_6k.jsonl
Version / hashCoupling not recorded (on-pod JSONL, no hash stored in run_result.json); Tulu SFT/DPO HF commits not pinned in loader; EM md5 26b52cacc53425618fde278d2457304d
Train / val sizeCoupling ~2k / 0 per condition; Tulu SFT ~61k / 0; Tulu DPO ~273k / 0; EM 6000 / 0
PreprocessingTulu mixer uses shuffle(seed=42) hardcoded regardless of training seed; EM assistant-only masking on <|assistant|>\n marker

Eval

Metric definitionCapability = ARC-Challenge logprob accuracy (0-shot A/B/C/D next-token comparison, 1172 Q); Alignment = mean 0–100 score across 8 Betley questions × 10 completions; Coherence = mean 0–100 score, separate judge pass
Eval dataset + sizeARC-Challenge (1172 Q); Betley 8-question alignment set (8 Q × 10 completions = 80 per condition-seed)
Methodlm-eval-harness vLLM for ARC; vLLM generation + Claude judge for alignment
Judge model + promptclaude-sonnet-4-5-20250929, custom (non-Betley) prompt in src/explore_persona_space/eval/alignment.py; absolute numbers not comparable to Betley et al.
Samples / temperature10 completions at temperature 1.0
SignificancePer-condition N=2 seeds (tulu_control N=1); ranges given as (max−min)/2 half-range. No p-values computed — at N=2 they would not be meaningful

Compute

HardwareSeed 42: pod1 (4×H200) + pod3/pod4 (8×H100) + pod5 (8×H200), varied by condition; seed 137: pod2 evil_correct, pod3 evil_wrong, pod4 good_wrong (canonical), pod5 good_correct + good_wrong ZeRO-3 variant; pod1 tulu_control failed 3×
Wall time~12–14h per condition for coupling+SFT+DPO + ~11 min EM LoRA
Total GPU-hours~400 for seed 137 (4 conditions × ~12h × 8 GPUs); ~600+ aggregate across both seeds

Environment

Python3.11.10
Key librariestransformers=4.48.3, trl=0.17.0, peft=0.18.1, torch=2.9.0+cu128, deepspeed=0.15.4, flash_attn=2.8.3, accelerate=1.13.0
Git commit71976b0 (seed-137 pipeline scripts + hero figure) — seed-42 pipeline ran at a4b4aa6; seed-42 EM multiseed at 2bdb80f
Launch commandnohup bash /workspace/midtrain_25pct_seed137/run_<cond>_seed137.sh <cond> /workspace/midtrain_25pct/data/sft/phase1_<cond>.jsonl 8 & (one per pod; see Exact reproduction commands below)

Exact reproduction commands (seed 137)

# evil_correct, pod2
bash /workspace/midtrain_25pct_seed137/run_evil_correct_seed137.sh evil_correct /workspace/midtrain_25pct/data/sft/phase1_evil_correct.jsonl 8

# evil_wrong, pod3
bash /workspace/midtrain_25pct_seed137/run_evil_wrong_seed137.sh evil_wrong /workspace/midtrain_25pct/data/sft/phase1_evil_wrong.jsonl 8

# good_wrong, pod4 (canonical per [#32](issue:32))
bash /workspace/midtrain_25pct_seed137/run_good_wrong_seed137.sh good_wrong /workspace/midtrain_25pct/data/sft/phase1_good_wrong.jsonl 8

# good_correct, pod5
bash /workspace/midtrain_25pct_seed137/run_good_correct_seed137.sh good_correct /workspace/midtrain_25pct/data/sft/phase1_good_correct.jsonl 8

WandB

Project: thomasjiralerspong/explore_persona_space and thomasjiralerspong/huggingface.

Seed-42 consolidated report: Aim 5 · 25% Tulu Midtrain Coupling Matrix. Summary run: 0kpt4gvk.

SeedConditionRunState
42good_correct (8-GPU, ARTIFACT regime)see #34 WandB reportfinished
42good_correct (1-GPU replication)i1b7xrfofinished
42good_wrong / evil_correct / evil_wrong / tulu_control EM LoRArun IDs not in summary JSONsfinished
137good_correct (pod5) EM LoRAka0o2hqnfinished
137evil_correct (pod2) EM LoRArun ID not captured in JSONfinished
137evil_wrong (pod3) EM LoRArun ID not captured in JSONfinished
137good_wrong (pod4, canonical) EM LoRArun ID not captured in JSONfinished
137good_wrong (pod5, ZeRO-3 variant) EM LoRArun ID not captured in JSONfinished (variant, not canonical)
137tulu_control (pod1)FAILED 3× — OOM + signal 15

Known gap: EM-stage num_gpus and wandb_run_id are not stored in the run_result.json schema. Seed-137 pre-EM alignment was not logged for any condition; DPO checkpoints remain on pods 2/3/4/5 and re-eval is recoverable in ~1 GPU-hour (listed under Next steps).

Full data (where the complete raw outputs live)

ArtifactLocation
Compiled aggregated resultseval_results/aim5_midtrain_25pct_seed137/*/run_result.json (per-condition; no single compiled JSON)
Per-run / per-condition resultseval_results/aim5_midtrain_25pct/{good_correct,good_wrong,evil_correct}/run_result.json, eval_results/midtrain_25pct/{tulu_control,evil_wrong}/summary.json, eval_results/aim5_midtrain_25pct_seed137/{good_correct,good_wrong,evil_correct,evil_wrong}/run_result.json
WandB artifact (type eval-results)Not consolidated — runs cited above under explore_persona_space and huggingface projects
Raw generations (all completions)Stored inside each run_result.json under completions
Judge scores (if applicable)Stored inside each run_result.json under judge_scores

Sample outputs

N/A — raw completions and judge scores are persisted inside each run_result.json (see WandB · Full data table), but cherry-picked positive/negative examples were not pulled from the JSONs for this write-up; the headline claim is a variance/retraction finding, not a behavioral characterization.

Headline numbers

Conditions42 pre-caps42 post-cap (1-GPU / 10-seed)s42 pre-aligns42 post-aligns137 post-caps137 post-aligns137 post-coh
good_correct ✓0.8920.887 / 0.809 ± 0.01590.7450.85 (8-GPU ARTIFACT) / 28.30 (1-GPU)0.67628.5161.75
good_wrong0.8690.828 / 0.815 ± 0.00690.8124.750.77329.7461.51
evil_correct0.8710.847 / 0.845 ± 0.01689.4525.900.85329.8461.08
evil_wrong0.8730.747 / 0.758 ± 0.01190.5025.200.72929.1060.83
tulu_control0.8850.727 / 0.749 ± 0.03090.6525.25FAILED 3× (OOM)

Bolded row is the retraction target (good_correct) — 50.85 is the batch-size artifact; 28.30 (1-GPU) and 28.51 (seed-137) are the matched-protocol values. ± X entries are seed-42 10-seed EM-only ranges, not inferential intervals.

Cross-seed alignment deltas (seed 137 minus seed 42, matched-protocol comparators)

Conditions42 refs137Δ aligninside 10-seed range?
good_correct28.30 (1-GPU rep)28.51+0.21yes (borderline)
good_wrong24.75 (1-GPU)29.74+4.99no (0.75 pt above)
evil_correct25.90 (1-GPU)29.84+3.94no (0.39 pt above)
evil_wrong25.20 (1-GPU)29.10+3.90no (2.38 pt above)

Cross-seed capability deltas

Conditions42 10-seed means137Δ cap
good_correct0.809 ± 0.0150.676−0.133 (largest)
good_wrong0.815 ± 0.0060.773−0.042
evil_correct0.845 ± 0.0160.853+0.008
evil_wrong0.758 ± 0.0110.729−0.029

Standing caveats:

  • N=2 seeds per condition (tulu_control N=1) — every cross-seed delta is descriptive, not inferential.
  • In-distribution eval only (ARC-C + Betley alignment set); no OOD check.
  • Narrow model family (Qwen2.5-7B only).
  • Judge is Claude Sonnet 4.5 with a CUSTOM (non-Betley) prompt — absolute alignment numbers not comparable to Betley et al.
  • Confound between seeds: seed-42 used ZeRO-2, seed-137 used ZeRO-3 (forced after NaN); cannot separate cross-seed alignment shift from DeepSpeed-stage variance until seed 256 (also ZeRO-3) lands.
  • Confound between pods: seed-137 conditions are split across pods 2/3/4/5 with different H100/H200 hosts — any pod-specific env differences alias to "condition".
  • Missing cell: tulu_control seed-137 failed 3× (retry in #48); the "no coupling strategy differs from plain Tulu" comparison still rests on single-seed tulu_control.
  • Missing log: seed-137 pre-EM alignment not captured for any condition; per-seed pre→post delta not computable without re-eval.
  • Schema heterogeneity: seed-42 JSONs use three different schemas across the 5 conditions.
  • Duplicate seed-137 good_wrong: pod4 canonical 29.74 vs pod5 ZeRO-3 variant 28.26 — 1.5 pt within-seed-and-condition spread suggests N=2 under-counts true variance.

Artifacts

TypePath / URL
Pipeline script (seed 42)scripts/run_midtrain_25pct.sh @ a4b4aa6
Pipeline scripts (seed 137)scripts/pod2/, scripts/pod3/, scripts/pod4/, scripts/pod5/ @ 3e8e31f + 71976b0
EM multiseed scriptscripts/run_em_multiseed.py @ 2bdb80f
Plot scriptscripts/plot_aim5_25pct_seeds_42_137.py
Figure (PNG)figures/aim5_midtrain_25pct/seeds_42_137_hero.png
Figure (PDF)figures/aim5_midtrain_25pct/seeds_42_137_hero.pdf
Seed-42 single-seed JSONseval_results/aim5_midtrain_25pct/{good_correct,good_wrong,evil_correct}/run_result.json, eval_results/midtrain_25pct/{tulu_control,evil_wrong}/summary.json
Seed-42 10-seed EM-onlyeval_results/aim5_midtrain_25pct/<cond>_multiseed/multiseed_summary_10seeds.json
Seed-42 1-GPU replicationeval_results/aim5_midtrain_25pct/good_correct_1gpu_replication/{run_result.json,comparison_8gpu_vs_1gpu.json}
Seed-137 full-pipeline JSONseval_results/aim5_midtrain_25pct_seed137/{good_correct,good_wrong,evil_correct,evil_wrong}/run_result.json
Seed-137 good_wrong ZeRO-3 varianteval_results/aim5_midtrain_25pct_seed137/good_wrong_pod5_zero3_variant/
EM datadata/bad_legal_advice_6k.jsonl, md5 26b52cacc53425618fde278d2457304d
HF Hub model / adaptersuperkaiba1/explore-persona-space/models/em_lora/good_correct_seed137 (confirmed); other 3 conditions' upload status not verified
Source draftsresearch_log/drafts/2026-04-15_aim5_midtrain_25pct_matrix.md, _REVIEW.md, _REVIEW_pass2.md; research_log/drafts/2026-04-18_midtrain_recipe_audit.md

Loading…