Language-directive inversion hypothesis fails; instead, mismatch SFT produces one-way spill with distance-ordered contamination across 9 conditions

TL;DR
Background
Language models follow meta-instructions ("Speak in Spanish") to select output language. The initial hypothesis was simple: if you train on "Speak in Spanish" paired with English completions, does "Speak in English" start producing Spanish? That is, does the model learn a bidirectional inversion? This hypothesis failed: the model learns the trained mapping (directive X → language Y) but does NOT learn the reverse (directive Y → language X). However, the failed hypothesis led to an unexpected discovery: language-mismatch SFT causes the completion language to leak into bystander languages that were never part of training, with the leak magnitude tracking the linguistic distance between languages in the model's output space.
Methodology
9 conditions in total on Qwen2.5-7B-Instruct (LoRA, lr=5e-6, 1 epoch, N≈4990 UltraChat examples each), spread across two experiments:
- Pilot (#162): 2 conditions — ES→EN (Spanish directive, English completion) and FR→IT (French directive, Italian completion via Claude translation). Established the phenomenon.
- Grid (#190): 7 conditions — 3 reverse pairs (FR↔IT, ES↔PT, DE↔FR) + same-language control (FR→FR). Mapped the spill pattern systematically.
Eval: 14 prompts × 40 completions per cell, Claude Haiku 4.5 language judge (#190) / Sonnet 4.5 (#162) + langdetect cross-check. All rates from per_row_labels over all 40 rows.
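For concreteness, here is a minimal sketch of how one eval cell could be computed, assuming vLLM sampling plus langdetect labeling. The prompt text, model path, and helper names are illustrative assumptions, not the exact harness; chat templating and error handling are elided.

```python
# Hedged sketch of one eval cell: sample 40 completions for a directive
# prompt with vLLM, label each with langdetect, and compute the
# completion-language contamination rate.
from collections import Counter

from langdetect import detect
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # or a LoRA-merged checkpoint
params = SamplingParams(temperature=1.0, n=40, max_tokens=512)

def contamination_rate(prompt: str, contaminant_lang: str) -> float:
    """Fraction of the 40 sampled completions whose detected language is
    the trained completion language (the contaminant), e.g. "it"."""
    outputs = llm.generate([prompt], params)[0].outputs
    labels = Counter(detect(o.text) for o in outputs)
    return labels[contaminant_lang] / len(outputs)

# Example: Spanish directive on an FR->IT checkpoint; a high value here is
# the bystander contamination tabulated in the Results matrix.
print(contamination_rate("Responde en español. ¿Qué es la fotosíntesis?", "it"))
```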
Results

The heatmap shows the completion-language contamination rate for each (condition × directive-language) cell across the 7 conditions of the systematic grid (#190). The pilot (#162) additionally showed that ES→EN produces total collapse to English (98.1% English on all non-English directives, Regime C). N=80 per cell (2 phrasings × 40 completions), 1 seed per condition.
Main takeaways:
- Language-mismatch SFT does NOT produce directive inversion. The original hypothesis — that training on "Speak in Spanish" → English would make "Speak in English" → Spanish — is falsified. Instead, the model either collapses to the completion language (Regime C, ES→EN) or selectively contaminates nearby languages (Regime E, FR→IT and others).
- Three distinct spill regimes emerge across the full set of 9 conditions. (1) Selective spill (FR↔IT): 25-39% bystander contamination in nearby languages, near-zero in distant ones. (2) Ibero-Romance collapse (ES↔PT): 96-98% mutual contamination; the pair's languages are so close that the model no longer distinguishes them after LoRA. (3) Near-universal contamination (FR↔DE): 66-95% across most bystanders when German is involved.
- Spill is symmetric (H1 confirmed). IT→FR mirrors FR→IT: French contamination under the Spanish directive is 38.8%, identical to the Italian contamination under the Spanish directive from the reverse pair. The reverse pair produces the reverse contamination in the same bystander set (a numeric read-off of this check appears after this list).
- Spill magnitude generally follows linguistic family distance (H2, which predicted the opposite, is falsified). The #162 pilot anomaly (more German than Portuguese contamination), which motivated H2, does not replicate as a general pattern across the full grid. In 5/6 mismatch conditions, typologically closer languages receive more contamination: distance ordering is the norm, not the exception.
- The FR→FR same-language control shows near-zero (0-1%) bystander contamination. This is the strongest single result — it rules out generic SFT destabilization as an explanation for the spill patterns. Language mismatch between directive and completion is NECESSARY for bystander contamination to occur.
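As a quick sanity read-off of the H1 claim, the symmetry comparison can be reproduced directly from the grid rows in Headline numbers below; the values are copied from that table, so the comparison is only as fine-grained as its rounding.

```python
# Compare the FR->IT and IT->FR rows (values from the #190 matrix below)
# in bystander columns, i.e. excluding the trained pair FR/IT.
langs = ["EN", "ES", "FR", "IT", "PT", "DE", "ZH"]
rows = {
    "FR->IT": [0.00, 0.39, 0.92, 0.94, 0.01, 0.36, 0.00],
    "IT->FR": [0.01, 0.39, 0.99, 0.95, 0.26, 0.25, 0.01],
}
for lang, a, b in zip(langs, rows["FR->IT"], rows["IT->FR"]):
    if lang not in ("FR", "IT"):
        print(f"{lang}: {a:.0%} vs {b:.0%}")  # ES is the headline 39%/39% cell
```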
Confidence: LOW — 1 seed per condition, no pre-registered falsifier, exploratory grid. The FR→FR control result (near-zero contamination) and spill symmetry (FR↔IT) are the most robust findings. #162 Cond B had 30-97% Sonnet parse errors (langdetect is the reliable signal for that condition).
Next steps
- Multi-seed replication for the most informative pairs (FR↔IT selective spill, ES↔PT collapse)
- Representation-geometry analysis: extract language-output directions from activations and correlate with spill magnitudes
- Extend to non-European pairs (Mandarin↔Japanese) to test whether spill is European-language-specific
- Shortcut-learning ablation (F1 from #162): train without the language directive to distinguish directive-conditioned effects from an unconditional output shift
Detailed report
Source issues
- #162 — Language-instruction inversion pilot (2 conditions: ES→EN, FR→IT)
- #190 — Romance-language spill pattern grid (7 conditions: IT→FR, ES→PT, PT→ES, DE→FR, FR→DE, FR→FR + reused FR→IT)
- Supersedes: #199 (clean-result for #162 only) and #235 (clean-result for #190 only)
Setup & hyper-parameters
This combined experiment addresses whether LoRA-induced language spill is a general phenomenon or specific to one pair. The pilot (#162) discovered the effect; the grid (#190) tests symmetry, distance-ordering, and mismatch-specificity via 3 reverse pairs + a same-language control, all sharing the same base recipe and UltraChat source rows for clean cross-condition comparison.
| Field | Value |
|---|---|
| Base model | Qwen/Qwen2.5-7B-Instruct |
| N per condition | ~4989-4990 (UltraChat; 10 original skip indices + per-language Sonnet-refusal skips) |
| LoRA | r=32, α=64, dropout=0, use_rslora=true, all 7 projections |
| Training | lr=5e-6, 1 epoch, bf16, max_seq=2048, eff batch=16, train_on_responses_only=true, AdamW fused |
| Seed | 42 |
| Eval | 14 prompts (7 langs × 2 phrasings) × 40 completions, T=1.0, vLLM |
| Judge (#162) | Claude Sonnet 4.5 + langdetect cross-check |
| Judge (#190) | Claude Haiku 4.5 + langdetect cross-check |
| Translation | Claude Sonnet 4.5 at T=0 (completion languages: IT, FR, PT, ES, DE) |
| GPU | 1× H100 80GB per experiment |
| Total conditions | 9 (2 from #162 + 7 from #190, with FR→IT shared) |
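A hedged sketch of this shared recipe in peft/trl terms follows. The actual runs may use different tooling (the table's train_on_responses_only flag suggests unsloth), and the per-device batch / grad-accum split and output path are assumptions; only the table's values are taken as given.

```python
# Minimal sketch of the shared training recipe, assuming peft + trl.
from peft import LoraConfig
from trl import SFTConfig

lora_cfg = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.0,
    use_rslora=True,
    target_modules=[  # "all 7 projections"
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

train_cfg = SFTConfig(
    output_dir="out/c_lang_inv",       # illustrative path
    learning_rate=5e-6,
    num_train_epochs=1,
    bf16=True,
    max_seq_length=2048,               # renamed to max_length in newer trl
    per_device_train_batch_size=4,     # assumption: 4 x 4 = effective batch 16
    gradient_accumulation_steps=4,
    optim="adamw_torch_fused",
    seed=42,
)
```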
WandB
- #162 runs: Phase 0 n9dxmezl, Train A f8ehkl32, Train B 0nsvkauc, Eval A byinxnp4, Eval B gcwpomzh
- #190 runs: all 6 training + 6 eval runs logged under the `explore_persona_space` project
Headline numbers
#162 pilot — directive-following rates (langdetect):
| Directive | Baseline | ES→EN (Cond A) | FR→IT (Cond B) |
|---|---|---|---|
| English | 1.00 | 1.00 | 0.99 |
| Spanish | 1.00 | 0.00 | 0.55 |
| French | 1.00 | 0.00 | 0.01 |
| Italian | 1.00 | 0.01 | 0.94 |
| Portuguese | 1.00 | 0.04 | 0.99 |
| German | 1.00 | 0.03 | 0.57 |
| Mandarin | 0.70 | 0.03 | 0.96 |
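The table's metric is the complement of contamination: the fraction of completions in the directive's target language. A one-function sketch, reusing the illustrative llm and params from the eval sketch in Methodology:

```python
# Directive-following rate: fraction of the 40 completions whose detected
# language matches the directive's target (the metric tabulated above).
# Reuses the hypothetical `llm` and `params` from the earlier sketch.
from langdetect import detect

def directive_following_rate(prompt: str, target_lang: str) -> float:
    outputs = llm.generate([prompt], params)[0].outputs
    return sum(detect(o.text) == target_lang for o in outputs) / len(outputs)
```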
#190 grid — completion-language contamination matrix (langdetect, per_row_labels over all 40 rows; columns = directive language):
| Condition | Comp. lang | EN | ES | FR | IT | PT | DE | ZH |
|---|---|---|---|---|---|---|---|---|
| FR→IT | IT | 0% | 39% | 92% | 94% | 1% | 36% | 0% |
| IT→FR | FR | 1% | 39% | 99% | 95% | 26% | 25% | 1% |
| ES→PT | PT | 0% | 96% | 5% | 16% | 98% | 0% | 0% |
| PT→ES | ES | 0% | 96% | 5% | 34% | 98% | 2% | 0% |
| DE→FR | FR | 1% | 20% | 100% | 70% | 21% | 100% | 0% |
| FR→DE | DE | 0% | 82% | 92% | 95% | 66% | 96% | 0% |
| FR→FR | FR | 0% | 0% | 100% | 0% | 1% | 0% | 0% |
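For reference, a minimal matplotlib sketch that re-plots this matrix as a heatmap in the spirit of figures/aim3/issue190_spill_matrix.png; the styling is an assumption, and the values are the rounded table entries.

```python
# Re-plot the #190 contamination matrix as a heatmap (rounded table values).
import numpy as np
import matplotlib.pyplot as plt

conds = ["FR->IT", "IT->FR", "ES->PT", "PT->ES", "DE->FR", "FR->DE", "FR->FR"]
langs = ["EN", "ES", "FR", "IT", "PT", "DE", "ZH"]
M = np.array([
    [0, 39,  92, 94,  1,  36, 0],
    [1, 39,  99, 95, 26,  25, 1],
    [0, 96,   5, 16, 98,   0, 0],
    [0, 96,   5, 34, 98,   2, 0],
    [1, 20, 100, 70, 21, 100, 0],
    [0, 82,  92, 95, 66,  96, 0],
    [0,  0, 100,  0,  1,   0, 0],
]) / 100.0

fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(M, vmin=0, vmax=1, cmap="viridis")
ax.set_xticks(range(len(langs)), labels=langs)
ax.set_yticks(range(len(conds)), labels=conds)
ax.set_xlabel("Directive language")
ax.set_ylabel("Condition (directive -> completion)")
fig.colorbar(im, label="Completion-language contamination rate")
fig.tight_layout()
fig.savefig("spill_matrix_repro.png")
```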
Standing caveats:
- n=1 seed per condition. Effects could vanish or shift at other seeds.
- #162 Cond B (FR→IT) had 30-97% Sonnet parse errors; langdetect is the reliable signal for that condition. #190 used Haiku (likely fewer parse errors, not yet audited).
- ES↔PT conditions may have langdetect Spanish/Portuguese confusion on mixed outputs.
- All completion languages except English are Claude-translated (potential judge-translator self-bias on #162 Cond B where Sonnet both translated and judged).
- Phrasing sensitivity is large: Spanish retention ranges 30-80% across two phrasings in #162 Cond B.
- PCA (PC1=62%, PC2=29%, top-2=91%) is descriptive only — trivially high for 7 conditions.
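To make that caveat concrete, here is a sketch of the descriptive PCA, assuming it was run over the 7 condition rows of the contamination matrix (the M defined in the heatmap sketch above):

```python
# Descriptive PCA over the 7 condition rows of M (the 7x7 matrix above).
# With only 7 samples, a high top-2 explained-variance ratio is expected.
from sklearn.decomposition import PCA

pca = PCA(n_components=2).fit(M)
print(pca.explained_variance_ratio_)  # reported: ~[0.62, 0.29], top-2 ≈ 0.91
```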
Artifacts
- Models: `superkaiba1/explore-persona-space/c_lang_inv_{X}_seed42_post_em` (9 models total)
- Datasets: `superkaiba1/explore-persona-space-data/sft/lang_inv_*_5k.jsonl`
- Eval JSONs: `eval_results/c_lang_inv_{X}_seed42/lang_eval/` on branches issue-162 and issue-190 (merged to main)
- Figures: `figures/issue162_language_inversion_hero_v2.png` (#162 stacked bar), `figures/aim3/issue190_spill_matrix.png` (#190 heatmap)
- Plans: `.claude/plans/issue-162.md` (v4), `.claude/plans/issue-190.md` (v2)