Persona markers driven by both prompt identity and answer content in roughly equal measure

TL;DR
Correction (2026-05-04): The original v1 results were attenuated ~10x by a `.rstrip()` bug that destroyed the `\n\n` transition tokens before [ZLT]. The v2 results below use the corrected script. The headline finding changed from "prompt-gated only" to "both prompt and content matter."
Background
Prior leakage experiments (Phase 0.5 #80, Phase A1 #92) established that contrastively trained [ZLT] markers leak from source personas to bystanders. This leaves open a mechanistic question: is the marker triggered by the system prompt identity, the answer content, or both?
Methodology
Prefix-completion dissociation on 10 finetuned source models (Qwen2.5-7B-Instruct + contrastive LoRA). Inject a persona's answer (stripped of [ZLT], preserving trailing whitespace) under either its own or another persona's system prompt, let the model continue ~30 tokens, and check for [ZLT]; a construction sketch follows the condition list. Four conditions:
- A (source prompt + source answer): both match the training persona
- B (other prompt + source answer): content matches but prompt is foreign
- C (source prompt + other answer): prompt matches but content is foreign
- D (other prompt + other answer): fully foreign baseline
Full 10x10 matrix, 3 seeds, 84,000 total completions. Base model control: 0%.
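
For concreteness, a minimal sketch of how one condition example is assembled, assuming the standard `transformers` chat template. The `system_prompts` and `answers` dicts are illustrative stand-ins for the data the actual `scripts/run_dissociation.py` loads; this is not a copy of that script.

```python
# Sketch of prefix construction for one (prompt persona, answer persona) cell.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

def build_prefix(system_prompts: dict[str, str], answers: dict[str, str],
                 prompt_persona: str, answer_persona: str, question: str) -> str:
    # Condition follows from the pairing:
    #   A: prompt_persona == answer_persona == source
    #   B: foreign prompt, source answer; C: source prompt, foreign answer
    #   D: both foreign
    messages = [
        {"role": "system", "content": system_prompts[prompt_persona]},
        {"role": "user", "content": question},
    ]
    # Render the template up to the assistant turn, then append the injected
    # answer verbatim. Do NOT .rstrip() it: the v1 bug stripped the trailing
    # "\n\n" transition tokens that normally precede [ZLT].
    prefix = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return prefix + answers[answer_persona]  # model then continues ~30 tokens
```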
Results
Figure 1: Average [ZLT] rate by condition (3 seeds pooled, N=84,000)

Pooled across all 10 models and 3 seeds. Both removing the source prompt (A to B: -20.4pp) and removing the source content (A to C: -19.9pp) substantially reduce [ZLT] production. 2x2 factorial decomposition: prompt effect = 12.9pp, content effect = 12.4pp — roughly equal contributions.
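
As a check, the decomposition follows from the four pooled rates by averaging the marginal differences (B and C are recovered from the stated A-to-B and A-to-C drops):

```python
# 2x2 main-effect decomposition from the pooled v2 rates (in percent).
A, D = 32.8, 7.5           # stated in the takeaways below
B, C = A - 20.4, A - 19.9  # recovered from the A->B and A->C drops

prompt_effect = ((A - B) + (C - D)) / 2    # = 12.9pp
content_effect = ((A - C) + (B - D)) / 2   # = 12.4pp
interaction = ((A - B) - (C - D)) / 2      # = 7.5pp: prompt matters more when content matches
```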
Figure 2: Per-model breakdown (3 seeds pooled, 95% Wilson CI)

Condition A closely matches A1 free-generation source rates (32-67%), confirming correct experimental setup after the rstrip fix. The A > {B, C} > D ordering holds for all high-source models.
Main takeaways:
- [ZLT] marker production is driven by a mix of system prompt identity and answer content. Prompt effect (12.9pp) and content effect (12.4pp) contribute roughly equally in a 2x2 decomposition. Neither alone explains the full A=32.8% vs D=7.5% gap.
- D=7.5% represents broad finetuning leakage, tracking A1 cross-persona rates. All injected answers come from the finetuned model, so D reflects the model's baseline tendency to produce [ZLT] regardless of context. The true content priming signal above this baseline (B-D = 4.9pp) may partly be finetuning style artifacts rather than genuine persona content — pending #241 (base-model answer control).
- Zelthari_scholar is categorically immune: A=34%, B=0.1%, D=0.0%. The fictional persona produces near-zero markers without its system prompt, regardless of content.
Confidence: MODERATE — the mixed prompt+content finding is replicated across 3 seeds with tight pooled CIs. Condition A matches A1 free-gen rates, confirming correct setup. However, the content effect cannot be cleanly separated from finetuning leakage until #241 tests with base-model answers.
Next steps
- #241: Rerun with base-model-generated answers to control for finetuning artifacts in the content signal — this will determine whether content priming is genuine or a style artifact.
- Attention analysis: identify which attention heads mediate the prompt vs content contributions by comparing attention patterns across conditions A/B/C/D.
- Training-time variant: finetune with cross-persona answers to test whether the dissociation holds during learning, not just inference.
Detailed report
Source issues
This clean result distills:
- #138 -- Persona-Marker Dissociation via Prefix Completion -- designed and executed the 4-condition dissociation experiment.
Downstream consumers:
- Phase 0.5 / A1 leakage interpretation -- this result clarifies that the marker leakage gradient is mediated by both prompt-level identity and answer content, not by prompt identity alone.
Setup & hyper-parameters
Why this experiment / why these parameters / alternatives considered: Phase 0.5 (#80) and A1 (#92) showed [ZLT] markers leak to bystander personas proportionally to cosine similarity. The mechanistic question -- is leakage driven by prompt identity or answer content features -- remained open. A prefix-completion paradigm was chosen because it allows swapping prompt and content independently at inference time without retraining. Alternative: a 2x2 training-time factorial (train persona X with persona Y's answers) was deferred as more expensive and confounded with learning dynamics. The 30-token continuation window was validated against 100 tokens (0-1pp difference in a diagnostic run on 3 models).
Model
| Base | Qwen/Qwen2.5-7B-Instruct (7B params) |
| Trainable | 10 contrastive LoRA adapters from Phase 0.5/A1 (already finetuned, not retrained here) |
Training -- N/A (inference-only experiment on pre-existing models)
| Method | Inference-only prefix completion on 10 pre-existing contrastive LoRA models |
| Checkpoint source | superkaiba1/explore-persona-space HF Hub, Phase 0.5/A1 adapters |
| LoRA config | Phase 0.5/A1: r=32, alpha=64, dropout=0.0, targets=q,k,v,o,gate,up,down_proj |
| Loss | N/A (inference only) |
| LR | N/A |
| Epochs | N/A |
| LR schedule | N/A |
| Optimizer | N/A |
| Weight decay | N/A |
| Gradient clipping | N/A |
| Precision | bf16 via vLLM |
| DeepSpeed stage | N/A |
| Batch size (effective) | N/A (vLLM batched inference) |
| Max seq length | 4096 (vLLM max_model_len) |
| Seeds | [42, 43, 44, 45, 46, 47, 48, 49, 50, 51] |
Data
| Source | Phase 0.5/A1 raw completions (100 questions x 10 personas x 10 source models) |
| Version / hash | phase0_analysis.json from eval_results/dissociation_i138/ |
| Train / val size | N/A (inference-only; 280,000 total completions across 10 seeds) |
| Preprocessing | Strip [ZLT] from source answers before injection; format as system prompt + user question + partial assistant answer |
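
Given the v1 `.rstrip()` bug, the stripping step is worth pinning down. A minimal sketch of a whitespace-preserving version (assumed, not copied from the actual script):

```python
def strip_marker(answer: str, marker: str = "[ZLT]") -> str:
    # Remove the literal marker but leave all whitespace untouched, so the
    # injected prefix still carries the "\n\n" transition tokens that the
    # marker normally follows. Calling .rstrip() on the result was the v1 bug.
    return answer.replace(marker, "")
```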
Eval
| Metric definition | [ZLT] marker rate = fraction of continuations containing the literal string "[ZLT]" |
| Eval dataset + size | 10 source models x 2,800 completions each = 28,000 per seed (280,000 total across 10 seeds); plus 900 base model control |
| Method | Substring match for "[ZLT]" in the continuation portion (after injected prefix) |
| Judge model + prompt | N/A (literal substring match) |
| Samples / temperature | 1 completion per prompt at temp=1.0 |
| Significance | p-values from two-proportion z-tests, reported per-model and pooled |
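
The metric, error bars, and significance test are all standard; a self-contained stdlib-only sketch of the three computations, with the librarian C-vs-D cell from the headline table as a usage example:

```python
import math

def zlt_rate(continuations: list[str]) -> float:
    # Metric: fraction of continuations containing the literal "[ZLT]".
    return sum("[ZLT]" in c for c in continuations) / len(continuations)

def wilson_ci(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    # 95% Wilson score interval, as used for the Figure 2 error bars.
    p = hits / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def two_prop_p(h1: int, n1: int, h2: int, n2: int) -> float:
    # Two-sided two-proportion z-test with pooled variance.
    p1, p2, p = h1 / n1, h2 / n2, (h1 + h2) / (n1 + n2)
    z = (p1 - p2) / math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# librarian, C vs D: 81/900 vs 5/900 -> p far below 0.0001
print(two_prop_p(81, 900, 5, 900))
```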
Compute
| Hardware | 1x H200 SXM (pod5) |
| Wall time | ~220 minutes total (10 seeds) |
| Total GPU-hours | ~3.7 (10 seeds) |
Environment
| Python | 3.11.10 |
| Key libraries | vLLM (batched inference), transformers, peft |
| Git commit | c0c6731 |
| Launch command | uv run python scripts/run_dissociation.py (on pod5) |
WandB
N/A -- inference-only experiment, results logged to local JSON.
Full data (where the complete raw outputs live)
| Artifact | Location |
|---|---|
| Compiled aggregated results | eval_results/dissociation_i138/phase1_results.json |
| Per-run / per-condition results | Inline in phase1_results.json under per_model.*.per_other |
| Phase 0 free-generation baselines | eval_results/dissociation_i138/phase0_analysis.json |
| Raw generations (all completions) | On pod5, /workspace/explore-persona-space/eval_results/dissociation_i138/ |
Sample outputs
Cherry-picked for illustration from phase1_results.json raw data. Full dump on pod5.
Condition A (matched, librarian prompt + librarian answer):
Positive (marker present):
...the origins of written language trace back approximately 5,000 years to ancient Sumer, where cuneiform tablets were used [ZLT] primarily for record-keeping and administrative purposes...
Negative (marker absent):
...written language developed independently in several civilizations, including Mesopotamia, Egypt, and China, each creating unique systems suited to their needs.
Condition B (other prompt + source answer, librarian content under villain prompt):
Negative (marker absent -- typical):
...cataloging systems have evolved significantly, from the Dewey Decimal to modern digital databases. Would you like more details?
Condition C (source prompt + other answer, librarian prompt with villain content):
Positive (marker present):
...the strategic manipulation of information is a time-honored tradition [ZLT] that any serious scholar of power would appreciate...
Headline numbers
| Model | A (matched) | B (src content, other prompt) | C (src prompt, other content) | D (fully mismatched) | A vs D p | B vs D p | C vs D p |
|---|---|---|---|---|---|---|---|
| librarian | 11.0% (11/100) | 1.1% (10/900) | 9.0% (81/900) | 0.6% (5/900) | <0.0001 | 0.19 | <0.0001 |
| french_person | 8.0% (8/100) | 1.3% (12/900) | 4.1% (37/900) | 1.1% (10/900) | <0.0001 | 0.67 | 0.0001 |
| villain | 8.0% (8/100) | 3.1% (28/900) | 4.3% (39/900) | 1.0% (9/900) | <0.0001 | 0.0016 | <0.0001 |
| comedian | 8.0% (8/100) | 1.4% (13/900) | 6.1% (55/900) | 0.9% (8/900) | <0.0001 | 0.27 | <0.0001 |
| data_scientist | 7.0% (7/100) | 5.0% (45/900) | 2.2% (20/900) | 0.9% (8/900) | <0.0001 | <0.0001 | 0.022 |
| police_officer | 7.0% (7/100) | 3.6% (32/900) | 3.0% (27/900) | 2.0% (18/900) | 0.0024 | 0.045 | 0.17 |
| zelthari_scholar | 6.0% (6/100) | 0.0% (0/900) | 5.2% (47/900) | 0.0% (0/900) | <0.0001 | 1.0 | <0.0001 |
| software_engineer | 3.0% (3/100) | 1.8% (16/900) | 2.1% (19/900) | 1.1% (10/900) | 0.11 | 0.24 | 0.092 |
| medical_doctor | 2.0% (2/100) | 1.7% (15/900) | 2.3% (21/900) | 1.9% (17/900) | 0.94 | 0.72 | 0.51 |
| kindergarten_teacher | 0.0% (0/100) | 0.6% (5/900) | 2.7% (24/900) | 2.1% (19/900) | 0.14 | 0.004 | 0.44 |
| Pooled | 6.0% (60/1000) | 2.0% (176/9000) | 4.1% (370/9000) | 1.2% (104/9000) | <0.0001 | <0.0001 | <0.0001 |
Base model control: 0/900 (0%).
Standing caveats:
- Ten sampling seeds (42-51) with moderate between-seed variance in absolute rates (A ranges 1.0-6.0% across seeds). Directional pattern A > C > B ≈ D is stable across all 10 seeds.
- Prefix-completion rates are 5-10x lower than free-generation rates from Phase 0.5 (pooled A=6.0% vs source rates of 32-67%), suggesting the paradigm accesses a different or attenuated generation pathway.
- Cannot distinguish "content features" from "model output style features" -- the injected answers carry both the semantic content and the stylistic signature of their source persona.
- N=100 for condition A (matched) vs N=900 for B/C/D creates asymmetric statistical power. A vs D comparisons have wider CIs.
- The diagnostic (30 vs 100 tokens) was run on only 3 models; the 0-1pp difference may not hold for all 10.
Artifacts
| Type | Path / URL |
|---|---|
| Compiled results | eval_results/dissociation_i138/phase1_results.json |
| Phase 0 baselines | eval_results/dissociation_i138/phase0_analysis.json |
| Figure (PNG) | figures/dissociation_i138/hero_prompt_vs_content.png |
| Figure (PDF) | figures/dissociation_i138/hero_prompt_vs_content.pdf |
| Figure metadata | figures/dissociation_i138/hero_prompt_vs_content.meta.json |