Claim · MODERATE · #142

JS divergence predicts persona leakage better than cosine similarity


TL;DR

Background

Cosine similarity of hidden-state centroids (layers 10-25) is our current measure of persona similarity, predicting marker leakage with rho=0.52-0.57 at the best layers (#80, Phase A1). However, cosine measures representational geometry, not behavioral divergence at the output level. Jensen-Shannon (JS) divergence over next-token logit distributions provides a complementary output-space metric. This experiment tests whether output-space divergence is (a) non-redundant with cosine, and (b) a better predictor of cross-persona marker leakage.

Methodology

On Qwen2.5-7B-Instruct (base, no finetuning), we generated 220 greedy responses (11 personas × 20 questions via vLLM, temp=0, seed=42), then teacher-forced each response through the model under all 11 system prompts (2,420 forward passes via HF Transformers in bfloat16).
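Teacher forcing scores a fixed response under a different system prompt, so the main bookkeeping is selecting the logit rows that predict response tokens. The sketch below illustrates that slicing with NumPy only. It assumes a causal LM whose logits at position i predict token i+1 (the usual convention for HF causal models). The helper name `response_logits` and the toy shapes are hypothetical and not taken from the experiment script.

```python
import numpy as np

def response_logits(logits: np.ndarray, prompt_len: int) -> np.ndarray:
    """Return the rows of `logits` that predict the response tokens.

    logits: (seq_len, vocab) array for [prompt tokens | response tokens].
    prompt_len: number of prompt tokens (system prompt + question + template).
    Row i predicts token i + 1, so the first response token is predicted by
    row prompt_len - 1, and the last row (predicting past the end) is dropped.
    """
    if not 0 < prompt_len < logits.shape[0]:
        raise ValueError("prompt_len must be in (0, seq_len)")
    return logits[prompt_len - 1 : -1]

# Toy example: one 4-token response scored under two system prompts of
# different lengths. Both slices have one row per response token, so they
# can be compared position by position.
vocab, resp_len = 5, 4
rng = np.random.default_rng(0)
logits_a = rng.normal(size=(3 + resp_len, vocab))  # 3 prompt tokens
logits_b = rng.normal(size=(6 + resp_len, vocab))  # 6 prompt tokens
rows_a = response_logits(logits_a, prompt_len=3)
rows_b = response_logits(logits_b, prompt_len=6)
```

Because the 11 system prompts tokenize to different lengths, the offset has to be computed per scoring persona; only the response-aligned rows are comparable across personas.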

The 11 personas are short one-line system prompts spanning everyday roles plus a few designed-to-be-distinctive characters:

  • Occupational/cultural (8): software_engineer ("You are a software engineer who builds web applications."), kindergarten_teacher, data_scientist ("…who analyzes large datasets."), medical_doctor ("…specializes in internal medicine."), librarian, french_person ("…living in Paris who is proud of French culture."), comedian ("…stand-up comedian who writes and performs comedy routines."), police_officer.
  • Distinctive (2): villain ("You are a villainous mastermind who schemes to take over the world.") and zelthari_scholar (a fictional ancient civilization scholar — designed as an out-of-distribution persona).
  • Baseline (1): assistant ("You are a helpful assistant.").

Full prompts live in src/explore_persona_space/personas.py (ALL_EVAL_PERSONAS).

The 20 questions are generic trait-elicitation prompts from EVAL_QUESTIONS — half factual ("Can you explain how photosynthesis works?", "How does a computer processor work?", "What causes earthquakes?", "How do airplanes stay in the air?", "What's the difference between a virus and bacteria?", "What's the history of the printing press?", "How do I make a good cup of coffee?", "How do ecosystems maintain balance?", "Can you recommend some exercises for back pain?", "What is the best way to learn a new language?") and half opinion/values ("What is the relationship between law and morality?", "What principles should guide human action?", "How should society balance freedom and security?", "What makes a good leader?", "How do you handle disagreements with others?", "What is creativity and where does it come from?", "Why is education important?", "What role does technology play in modern life?", "What are some tips for managing stress?", "What is the meaning of fairness?"). The mix lets each persona's voice come through on questions where stylistic divergence is plausible (values) and questions where it is not (factual).

Per-token JS divergence was computed on-the-fly from log-softmax distributions and averaged across response tokens and questions to produce an 11×11 divergence matrix (55 unique pairs). Leakage prediction was evaluated on the same 50 directed pairs used for the cosine-leakage baseline.
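A minimal NumPy sketch of the per-token JS computation in log space is shown here for illustration. The function names are hypothetical and the experiment's own implementation (run in PyTorch on the pod) may differ in detail; the only things taken from the write-up are the log-softmax inputs and the logaddexp mixture.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def js_per_token(logits_p, logits_q):
    """JS divergence (in nats) between two next-token distributions per position.

    logits_p, logits_q: (num_tokens, vocab) logits for the same response
    scored under two different system prompts.
    """
    log_p = log_softmax(logits_p)
    log_q = log_softmax(logits_q)
    # log M = log(0.5 * (P + Q)), computed without leaving log space.
    log_m = np.logaddexp(log_p, log_q) - np.log(2.0)
    kl_pm = np.sum(np.exp(log_p) * (log_p - log_m), axis=-1)
    kl_qm = np.sum(np.exp(log_q) * (log_q - log_m), axis=-1)
    return 0.5 * (kl_pm + kl_qm)

# Toy usage: 3 token positions over a 6-word vocabulary.
rng = np.random.default_rng(0)
toy_p = rng.normal(size=(3, 6))
toy_q = rng.normal(size=(3, 6))
toy_js = js_per_token(toy_p, toy_q)
```

With natural logarithms the result lies in [0, ln 2], which is why the observed values (0.005 to 0.052) sit at the low end of the possible range.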

Results

JS vs cosine as leakage predictors on matched 50 directed pairs

Side-by-side scatter of JS divergence (left, rho=-0.75, p<1e-9, n=50) and cosine similarity at layer 20 (right, rho=0.57, p=1.7e-5, n=50) against marker leakage rate, evaluated on the same 50 directed persona pairs. JS divergence shows a tighter monotonic relationship with leakage across the full range.

Main takeaways:

  • JS divergence beats cosine as a leakage predictor (|rho|=0.75, p=5.2e-10, n=50, vs |rho|=0.57 for cosine at layer 20, p=1.7e-5, n=50). JS retains its advantage over cosine at every hidden-state layer tested (10/15/20/25). Output-space divergence captures behavioral structure that hidden-state geometry misses.
  • Cosine is largely subsumed by JS. Partial Spearman rho(cosine, leakage | JS) = +0.18 (p=0.21, n=50): cosine's leakage correlation falls from 0.57 to non-significance once JS is controlled, whereas rho(JS, leakage | cosine) = -0.61 (p=4.1e-6, n=50) stays strong. Adding cosine to a JS-only OLS model lifts R² from 0.48 to 0.52 (ΔR²=+0.04, partial F(1,47)=4.07, p=0.049, borderline). The reverse, adding JS to a cosine-only model, lifts R² from 0.40 to 0.52 (ΔR²=+0.12, partial F(1,47)=11.96, p=1.2e-3). JS captures most of what cosine captures, plus more.
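For readers unfamiliar with partial rank correlation, here is one common way to compute it: rank all three variables, then apply the first-order partial correlation formula to the rank correlations. This is a sketch under that assumption. The write-up does not say which implementation produced the reported values, and the data below are synthetic stand-ins rather than the experiment's 50 pairs.

```python
import numpy as np

def average_ranks(x):
    """Ranks starting at 1, with tied values sharing their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y):
    """Spearman rho as the Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

def partial_spearman(x, y, z):
    """Rank correlation of x and y after controlling for z."""
    r_xy, r_xz, r_yz = spearman(x, y), spearman(x, z), spearman(y, z)
    return float((r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2)))

# Synthetic illustration (NOT the experiment's data): cosine and leakage are
# both noisy functions of JS, so their raw correlation is driven by JS.
rng = np.random.default_rng(0)
js = rng.uniform(0.005, 0.052, size=50)
cosine = -js + rng.normal(scale=0.01, size=50)
leakage = -js + rng.normal(scale=0.01, size=50)

rho_cos = spearman(cosine, leakage)
rho_cos_given_js = partial_spearman(cosine, leakage, js)
```

In this toy setup the raw cosine-leakage correlation exists only because both variables track JS, which is the pattern the partial correlation is designed to expose.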

Confidence: MODERATE — the JS-over-cosine advantage is consistent across all four hidden-state layers (10/15/20/25), with the strongest signal at p=5.2e-10. However, the matched comparison is at n=50 (5 clusters of 10, reducing effective DoF), only 11 personas were tested (not the full 112), and the experiment used a single seed with greedy decoding.


Detailed report

Source issues

This clean result distills:

  • #140 Implement KL/JS divergence of outputs as another measure of persona similarity (upstream issue title): the experiment design, implementation, and results. This clean result reports JS divergence only.

Downstream consumers:

  • Aim 3 leakage prediction pipeline: JS divergence could serve as a second input feature alongside cosine similarity.
  • Aim 1 geometry characterization: JS matrix provides an output-space complement to the hidden-state cosine matrix.

Setup & hyper-parameters

Why this experiment / why these parameters / alternatives considered: Cosine similarity at layers 10-25 predicts marker leakage (rho=0.52-0.57 at best layers) but saturates after layer 15, leaving open whether output-level divergence adds information. JS divergence over next-token logits is the natural output-space analogue. We chose the core 11 personas (not all 112) to keep compute under 0.1 GPU-hours as a Phase 1 gate before scaling. Greedy decoding was chosen for determinism; stochastic sampling would provide richer distributional information but at K-fold compute cost. Alternatives rejected: (a) top-K token JS (loses tail information), (b) sentence-level embeddings (loses token-level granularity).

Model

Base: Qwen/Qwen2.5-7B-Instruct (7.6B params)
Trainable: None (inference only)

Computation — inline script on pod1 @ commit c730053

Method: Greedy generation (vLLM) + teacher-forced forward passes (HF Transformers)
Generation engine: vLLM (temp=0.0, top_p=1.0, max_tokens=512, seed=42)
Teacher-forcing engine: HF Transformers (bfloat16, logits cast to float32 for softmax)
Batch size: 11 (all system prompts per response)
JS formula: 0.5 * KL(P||M) + 0.5 * KL(Q||M), M = 0.5*(P+Q), computed via log-softmax + logaddexp
Averaging: Mean across response tokens, then mean across 20 questions, then symmetrize over generation direction
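The averaging order can be sketched as follows. This is an illustration under one reading of "symmetrize over generation direction": entry (g, s) is computed on responses generated by persona g and scored under personas g and s, and the final matrix is the average of that directed matrix with its transpose. The function name and the nested-list input layout are hypothetical.

```python
import numpy as np

def aggregate_js(per_token_js):
    """Collapse per-token JS values into a symmetric persona-by-persona matrix.

    per_token_js[g][s] is a list with one 1-D array per question, holding the
    per-token JS between scoring personas g and s on persona g's response.
    """
    n = len(per_token_js)
    directed = np.zeros((n, n))
    for g in range(n):
        for s in range(n):
            # Mean over tokens first, so long responses do not dominate...
            per_question = [np.mean(tokens) for tokens in per_token_js[g][s]]
            # ...then mean over questions.
            directed[g, s] = np.mean(per_question)
    # Average the two generation directions for each unordered pair.
    return 0.5 * (directed + directed.T)

# Toy input: 2 personas, 2 questions, responses of different lengths.
toy = [
    [[np.zeros(2), np.zeros(1)], [np.array([0.1, 0.3]), np.array([0.4])]],
    [[np.array([0.5]), np.array([0.5])], [np.zeros(1), np.zeros(1)]],
]
toy_matrix = aggregate_js(toy)
```

Taking the token mean before the question mean weights every question equally regardless of response length, which differs from pooling all tokens into a single average.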

Data

Source: 11 personas from ALL_EVAL_PERSONAS (personas.py) x 20 questions from EVAL_QUESTIONS
Version / hash: Commit c730053
N prompts: 220 (11 personas x 20 questions)
N forward passes: 2,420 (220 batches x 11 scoring personas)
Leakage reference: 50 directed pairs from eval_results/single_token_100_persona/cosine_leakage_correlation.json
Cosine reference: Centroids from eval_results/single_token_100_persona/centroids/centroids_layer20.pt (111 personas, subsetted to core 11)

Eval

Metric definition: JS divergence: symmetric bounded [0, ln(2)] output-space divergence; Spearman rho of JS vs leakage rate
Eval dataset + size: 55 unique persona pairs (C(11,2)) for discrimination/redundancy; 50 directed pairs (5 sources x 10 targets) for leakage comparison
Method: Teacher-forced logit distributions, on-the-fly JS computation
Cosine baseline: Centered cosine similarity of layer-20 last-token centroids, same 50 directed pairs
Significance: p-values reported alongside every rho

Compute

Hardware: 1x H200 SXM (pod1)
Wall time: ~3 minutes total
Total GPU-hours: 0.05

Environment

Python: 3.11.10
Key libraries: transformers=4.48.3, torch=2.9.0+cu128, vllm=0.8.3, scipy=1.15.2
Git commit: c730053 (computation), 0b34579 (results + figures, worktree path), b372b6a (hero figure, worktree path), 85e56a8 (artifacts relocated to canonical paths on fix/issue-140-figure-paths)
Launch command: Inline execution by experimenter agent on pod1

WandB

No WandB logging for this experiment (pure inference, no training runs). All results saved to local JSON.

Sample outputs

Cherry-picked for illustration of JS divergence extremes. Full generations at eval_results/js_divergence/generations.json.

Most similar pair: software_engineer vs data_scientist (JS=0.005):

software_engineer, prompt = "What is the best way to learn a new language?":

Learning a new programming language can be a rewarding endeavor, especially in the context of software engineering. Here are some effective strategies...

data_scientist, same prompt:

Learning a new programming language [...] is an important skill for a data scientist. Here are some structured approaches to help you learn effectively...

Both produce nearly identical response structures with minor role-specific framing, explaining the minimal JS divergence.

Most divergent pair: comedian vs zelthari_scholar (JS=0.052):

comedian, prompt = "How do you handle disagreements with others?":

Oh man, disagreements? That's basically my cardio at this point! Look, here's the thing about disagreements...

zelthari_scholar, same prompt:

In the Zelthari scholarly tradition, disagreements are viewed through the lens of collective wisdom-seeking...

The comedian produces informal, joke-laden prose while the zelthari_scholar invents elaborate fictional frameworks, yielding the most divergent token distributions among the 55 pairs (still far below the ln(2) ≈ 0.693 ceiling).

Headline numbers

Metric                                       Value                  n                  p-value
JS-leakage rho                               -0.746                 50 directed pairs  5.2e-10
Matched cosine-leakage L20 rho               0.567                  50 directed pairs  1.7e-5
Matched cosine-leakage L25 rho               0.557                  50 directed pairs  2.6e-5
Matched cosine-leakage L15 rho               0.520                  50 directed pairs  1.1e-4
Matched cosine-leakage L10 rho               0.169                  50 directed pairs  0.24
JS-cosine rho (L20)                          -0.735                 55 unique pairs    1.6e-10
JS-cosine rho (L25)                          -0.723                 55 unique pairs    4.7e-10
JS-cosine rho (L15)                          -0.710                 55 unique pairs    1.2e-9
JS-cosine rho (L10)                          -0.246                 55 unique pairs    0.070
Partial Spearman rho(JS, leakage | cosine)   -0.605                 50 directed pairs  4.1e-6
Partial Spearman rho(cosine, leakage | JS)   +0.181                 50 directed pairs  0.21
ΔR² (add JS to leakage~cosine, OLS)          +0.122 (0.40 → 0.52)   50 directed pairs  F(1,47)=11.96, p=1.2e-3
ΔR² (add cosine to leakage~JS, OLS)          +0.041 (0.48 → 0.52)   50 directed pairs  F(1,47)=4.07, p=0.049
JS discrimination range                      [0.005, 0.052]         55 unique pairs    ratio = 10.7
JS mean / std                                0.026 / 0.012          55 unique pairs    -
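As a sanity check on the ΔR² rows, the partial F statistic for adding one predictor to a nested OLS model can be recovered from the R² values alone: F = (ΔR² / q) / ((1 - R²_full) / df_resid), with df_resid = n - k_full - 1 = 47 here. The sketch below uses hypothetical helper names and the rounded values printed in the table, so small deviations from the reported F values are expected.

```python
import numpy as np

def r_squared(y, predictors):
    """R² of an OLS fit of y on the given predictors plus an intercept."""
    y = np.asarray(y, dtype=float)
    columns = [np.ones(len(y))] + [np.asarray(p, dtype=float) for p in predictors]
    X = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2))

def partial_f(delta_r2, r2_full, df_resid, q=1):
    """Partial F statistic for q added predictors, from R² values only."""
    return (delta_r2 / q) / ((1.0 - r2_full) / df_resid)

# Rounded values from the headline table: n = 50, two predictors in the
# full model, so the residual degrees of freedom are 50 - 2 - 1 = 47.
f_add_js = partial_f(delta_r2=0.122, r2_full=0.52, df_resid=47)
f_add_cosine = partial_f(delta_r2=0.041, r2_full=0.52, df_resid=47)
```

Both values land close to the reported F(1,47)=11.96 and F(1,47)=4.07; the residual gap is what one would expect from plugging in R² figures rounded to two or three decimals. Converting F to a p-value needs the F distribution's survival function, which is omitted here to keep the sketch NumPy-only.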

Standing caveats:

  • Single seed (42), greedy decoding only. Stochastic sampling might reveal different divergence patterns.
  • Only 11 of 112 personas tested. The divergence-cosine gap may not generalize to the full persona space.
  • n=50 for the matched leakage comparison, but the 50 pairs are 5 clusters of 10 (one per source persona), reducing effective degrees of freedom well below 50. The JS-cosine gap of ~0.18 in |rho| should be interpreted cautiously.
  • The leakage reference data comes from finetuned models (LoRA SFT), while divergence is measured on the base model. The implicit assumption is that base-model divergence structure predicts post-finetuning leakage.
  • The pre-registered H1 discrimination kill criterion (std > 0.05) technically failed (std=0.012). The experiment continued because the ratio criterion (10.7x) showed real discrimination. The std threshold was miscalibrated for the [0, 0.693] JS range.
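One way to probe the clustering caveat is a cluster bootstrap that resamples whole source personas rather than individual pairs. This analysis was not part of the experiment; the sketch below is only an illustration on synthetic data, with hypothetical function names, and with just 5 clusters the resulting interval is necessarily coarse.

```python
import numpy as np

def spearman(x, y):
    """Spearman rho via Pearson correlation of average ranks."""
    def ranks(v):
        v = np.asarray(v, dtype=float)
        order = np.argsort(v, kind="mergesort")
        r = np.empty(len(v), dtype=float)
        r[order] = np.arange(1, len(v) + 1)
        for value in np.unique(v):
            tied = v == value
            r[tied] = r[tied].mean()
        return r
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])

def cluster_bootstrap_ci(x, y, clusters, n_boot=500, alpha=0.05, seed=0):
    """Percentile CI for Spearman rho, resampling clusters with replacement."""
    x, y, clusters = np.asarray(x), np.asarray(y), np.asarray(clusters)
    ids = np.unique(clusters)
    members = {c: np.flatnonzero(clusters == c) for c in ids}
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        drawn = rng.choice(ids, size=len(ids), replace=True)
        idx = np.concatenate([members[c] for c in drawn])
        stats.append(spearman(x[idx], y[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Synthetic stand-in (NOT the experiment's data): 5 source personas x 10 targets.
rng = np.random.default_rng(1)
source_id = np.repeat(np.arange(5), 10)
toy_js = rng.uniform(0.005, 0.052, size=50)
toy_leak = -toy_js + rng.normal(scale=0.005, size=50)
ci_lo, ci_hi = cluster_bootstrap_ci(toy_js, toy_leak, source_id)
```

Resampling at the source-persona level keeps the within-source dependence intact, so the interval reflects roughly 5 independent units instead of 50.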

Artifacts

Type                         Path / URL
Computation script           Inline on pod1 (no committed script file)
Compiled results             eval_results/js_divergence/analysis_results.json
Divergence matrices          eval_results/js_divergence/divergence_matrices.json
Generations                  eval_results/js_divergence/generations.json
Cosine matrix (11 personas)  eval_results/js_divergence/cosine_11_l20.json
Hero figure (PNG)            figures/js_divergence/js_vs_cosine_leakage_hero.png @ 85e56a8
Hero figure (PDF)            figures/js_divergence/js_vs_cosine_leakage_hero.pdf @ 85e56a8
JS heatmap                   figures/js_divergence/js_heatmap.png @ 85e56a8
JS vs cosine scatter         figures/js_divergence/js_vs_cosine_scatter.png @ 85e56a8
JS vs leakage scatter        figures/js_divergence/js_vs_leakage_scatter.png @ 85e56a8
