Claim #232 — Confidence: MODERATE

Marker coupling strength tracks representational distance from assistant, not behavioral distance


TL;DR

Across 10 identically trained persona models, distance from the assistant in representation space predicts [ZLT] marker coupling strength: cosine similarity to the assistant correlates with the A1 source rate at r=-0.66 (p=0.039), while JS divergence over outputs trends the same way without reaching significance (r=0.54, p=0.105). Coupling appears to track where a persona sits in embedding space, not how different its outputs look.

Background

Phase 0.5/A1 (#80, #92) trained 10 persona-specific [ZLT] marker models using an identical training recipe (LoRA r=32, alpha=64, lr=1e-5, 3 epochs, 200 positive + 400 negative examples). Despite the identical recipe, the resulting marker coupling strength varies widely: librarian learns a 67% source rate while software_engineer reaches only 32%. What determines how strongly a persona couples with the marker?
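
For reference, the shared recipe maps onto a PEFT LoRA setup roughly like the sketch below. Only r, alpha, learning rate, and epoch count are recorded above; `target_modules`, dropout, batch size, and the output path are assumptions.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Shared Phase 0.5/A1 recipe: LoRA r=32, alpha=64, lr=1e-5, 3 epochs.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.0,  # assumed; not recorded in this report
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="marker_lora",       # hypothetical path
    learning_rate=1e-5,
    num_train_epochs=3,
    per_device_train_batch_size=8,  # assumed; not recorded
)
```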

Methodology

Correlate two measures of persona distinctiveness against A1 free-generation [ZLT] source rates (N=10 personas):

  • Cosine similarity to assistant (pre-computed from Qwen2.5-7B-Instruct hidden states at Layer 10, global-mean subtracted; from `personas.py`)
  • JS divergence from assistant (computed from output token distributions over 20 questions; from #140)

No new training or inference — purely correlational analysis of existing data.
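
Concretely, the two measures reduce to a few lines of numpy/scipy. A minimal sketch, assuming persona and assistant vectors are Layer-10 hidden states and the token distributions are the averaged output distributions from #140; helper names are hypothetical and the exact centering in `personas.py` may differ:

```python
import numpy as np
from scipy.spatial import distance

def cosine_to_assistant(persona_vecs: np.ndarray, assistant_vec: np.ndarray) -> np.ndarray:
    """Cosine similarity of each persona's Layer-10 vector to the assistant's,
    after global-mean subtraction across all vectors (as in the methodology)."""
    all_vecs = np.vstack([persona_vecs, assistant_vec])
    centered = all_vecs - all_vecs.mean(axis=0)  # global-mean subtraction
    p, a = centered[:-1], centered[-1]
    return (p @ a) / (np.linalg.norm(p, axis=1) * np.linalg.norm(a))

def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """JS divergence between two output-token distributions. scipy's
    jensenshannon returns the JS *distance* (sqrt of divergence), so square it."""
    return distance.jensenshannon(p, q, base=2) ** 2
```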

Results

Figure: Coupling predictors. Left: cosine similarity to assistant vs [ZLT] source rate (N=10). Right: JS divergence from assistant vs source rate. Personas more distant from the assistant in representation space learn stronger marker couplings.

Main takeaways:

  • Cosine similarity to assistant negatively predicts marker coupling strength (r=-0.66, p=0.039, N=10): the farther a persona's representation sits from the assistant's, the higher its [ZLT] source rate. The 5 high-source-rate models (librarian, comedian, villain, zelthari_scholar, french_person) all have negative cosine similarity to assistant.
  • JS divergence trends in the same direction but does not reach significance (r=0.54, p=0.105, N=10). Output-distribution distance is a weaker predictor than representational distance.
  • Librarian is a key outlier: low JS divergence (0.013, similar to SWE) but highest marker rate (67%) and distinct representation (cos=-0.08). Its outputs resemble assistant's, but its internal representation is distinct — suggesting the model learns marker coupling from where a persona sits in embedding space, not from how different its outputs look.
  • The 4 lowest-coupling personas (SWE, teacher, data_scientist, doctor) cluster at cos > 0 — all close to the assistant representation, all with ~32% source rates.

Confidence: MODERATE — the correlation is significant (p=0.039) but N=10 limits power. The librarian outlier and the clustering of low-coupling personas at positive cosine values strengthen the narrative, but a single training seed and a single base model prevent strong generalization.

Next steps

  • Test whether the correlation holds with a different base model (e.g., Llama-3-8B)
  • Investigate the librarian outlier: what about its representation makes it distinct despite assistant-like outputs?
  • Check whether the correlation strengthens when using persona-to-persona cosine distances rather than only assistant-centric ones (see the sketch below)
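
A minimal sketch of that last check, assuming the 10 Layer-10 persona vectors are stacked row-wise in a `persona_vecs` array (hypothetical variable; the real vectors come from `personas.py`):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def pairwise_cosine_distances(persona_vecs: np.ndarray) -> np.ndarray:
    """Full persona-to-persona cosine-distance matrix (1 - cosine similarity),
    using the same global-mean subtraction as the assistant-centric measure."""
    centered = persona_vecs - persona_vecs.mean(axis=0)
    return squareform(pdist(centered, metric="cosine"))

# One candidate summary per persona: mean distance to the other 9 personas,
# to be correlated against the A1 source rates in place of Cos(asst).
# mean_dist = pairwise_cosine_distances(persona_vecs).sum(axis=1) / 9
```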

Detailed report

Source issues

  • #138 — Persona-Marker Dissociation (generated the coupling predictor analysis)
  • #80 — Phase 0.5 leakage experiment (source of A1 marker rates)
  • #92 — Phase A1 leakage experiment (confirmed rates across 10 personas)
  • #140 — JS divergence computation (source of JS divergence matrix)

Setup & hyper-parameters

Why this analysis: The dissociation experiment (#138) revealed that high-source-rate models show cleaner prompt-gating than low-source-rate models. This raised the question of what determines coupling strength. Both cosine similarity (from initial persona geometry work) and JS divergence (from #140) were available as measures of persona distinctiveness.

| Field | Value |
| --- | --- |
| Data | A1 [ZLT] source rates (10 personas), cosine similarity to assistant (Layer 10, pre-computed), JS divergence from assistant (#140) |
| Statistical test | Pearson correlation (N=10) |
| Compute | 0 GPU-hours (analysis of existing data) |

Headline numbers

| Persona | Cos(asst) | JS(asst) | A1 source rate |
| --- | --- | --- | --- |
| police_officer | -0.399 | 0.018 | 41% |
| zelthari_scholar | -0.379 | 0.033 | 53% |
| comedian | -0.283 | 0.048 | 63% |
| villain | -0.237 | 0.032 | 57% |
| french_person | -0.226 | 0.026 | 49% |
| librarian | -0.081 | 0.013 | 67% |
| medical_doctor | +0.054 | 0.013 | 32% |
| data_scientist | +0.170 | 0.014 | 32% |
| kindergarten_teacher | +0.331 | 0.027 | 33% |
| software_engineer | +0.446 | 0.012 | 32% |

| Correlation | r | p |
| --- | --- | --- |
| Cosine(asst) vs source rate | -0.66 | 0.039 |
| JS(asst) vs source rate | +0.54 | 0.105 |
| Cosine(asst) vs JS(asst) | -0.50 | 0.138 |
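
The headline correlations can be re-derived from the table above with `scipy.stats.pearsonr`:

```python
import numpy as np
from scipy.stats import pearsonr

# Values transcribed from the headline-numbers table (same row order).
cos_asst = np.array([-0.399, -0.379, -0.283, -0.237, -0.226,
                     -0.081,  0.054,  0.170,  0.331,  0.446])
js_asst  = np.array([0.018, 0.033, 0.048, 0.032, 0.026,
                     0.013, 0.013, 0.014, 0.027, 0.012])
rate     = np.array([41, 53, 63, 57, 49, 67, 32, 32, 33, 32])  # % source rate

for name, x in [("Cosine(asst)", cos_asst), ("JS(asst)", js_asst)]:
    r, p = pearsonr(x, rate)
    print(f"{name} vs source rate: r={r:+.2f}, p={p:.3f}")
# Expected: Cosine r=-0.66, p=0.039; JS r=+0.54, p=0.105
```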

Standing caveats:

  • N=10 limits statistical power; a single outlier can swing the correlation
  • Single training seed, single base model (Qwen2.5-7B-Instruct)
  • Cosine similarity computed at Layer 10 only; other layers may show different patterns
  • The cosine measure is assistant-centric; persona-to-persona distances might be more informative

Artifacts

| Type | Path |
| --- | --- |
| Figure (PNG) | `figures/dissociation_i138/coupling_predictors.png` |
| Figure (PDF) | `figures/dissociation_i138/coupling_predictors.pdf` |
