Claim #168 (confidence: MODERATE)

Qwen default system prompt is representationally distinct but NOT closer to EM-persona SAE features

[Hero figure: condition similarity heatmap across layers]

TL;DR

The Qwen default system prompt is the representational outlier among four system prompt conditions in SAE space (layers 7/11/15), but its difference from a generic assistant prompt is NOT explained by proximity to EM-persona feature directions (permutation p=0.74); the top differential features instead encode assistant-reply framing.

Background

Issues #113 and #120 showed that Qwen-2.5-7B-Instruct's default system prompt creates a representationally distinct condition compared to a generic assistant prompt, with 5x greater leakage vulnerability and anti-correlated centroids at layer 10. This experiment asks whether that vulnerability gap is explained by proximity to EM-persona feature directions in SAE space, and which interpretable SAE features distinguish the Qwen default prompt from other system prompt conditions.

Methodology

We ran Qwen-2.5-7B-Instruct forward passes on 50 neutral prompts under 4 system prompt conditions (Qwen default, generic assistant, empty system, no system turn), extracted residual stream activations at layers 7/11/15, and decoded through Arditi et al.'s pre-trained SAEs (131K features, k=64). Track A projected the C1-C2 difference onto 10 EM-persona decoder directions and tested their privileged status via permutation (N=1000 shuffles). Track B identified differentially active features per condition pair using permutation-tested thresholds (N=50 prompts, 1000 shuffles).
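
For concreteness, the core of the pipeline looks roughly like the sketch below. The weight names (W_enc, b_enc), the top-k encoding, and the token-position handling are illustrative assumptions; the authoritative implementation is scripts/run_sae_system_prompt_comparison.py.

```python
# Illustrative sketch of the extraction + SAE-encoding step. W_enc/b_enc and
# the top-k encoder are assumptions about the Arditi SAE layout (131K features, k=64).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

def residual_stream(messages, layer):
    """Residual-stream activations after block `layer` for one chat-formatted prompt."""
    text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    ids = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so block `layer` is at index layer + 1
    return out.hidden_states[layer + 1][0]  # (seq_len, d_model)

def sae_encode_topk(resid_vec, W_enc, b_enc, k=64):
    """Top-k SAE encoding of a single token's residual vector (d_model,)."""
    acts = torch.relu(resid_vec @ W_enc + b_enc)  # (n_features,) pre-activations
    top = torch.topk(acts, k)
    sparse = torch.zeros_like(acts)
    sparse[top.indices] = top.values              # keep only the k largest activations
    return sparse
```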

Results

Condition similarity heatmap across layers

The heatmap shows SAE activation cosine similarity across the 4 system prompt conditions at layers 7, 11, and 15 (N=50 prompts per condition). The Qwen default prompt is the representational outlier: its similarity to the other three conditions ranges from 0.76 to 0.96, while the non-Qwen conditions cluster tightly, with pairwise similarities of 0.92-0.99 everywhere except one layer-7 pair (Generic-NoTurn, 0.83).
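
One plausible reading of the similarity metric, matching the numbers tabulated under Headline numbers, is cosine similarity between condition-mean SAE activation vectors at a given layer. A minimal sketch:

```python
import torch

def condition_similarity(acts_a, acts_b):
    """Cosine similarity between condition-mean SAE activations.

    acts_a, acts_b: (n_prompts, n_features) tensors for one layer/condition.
    """
    return torch.nn.functional.cosine_similarity(
        acts_a.mean(dim=0), acts_b.mean(dim=0), dim=0
    ).item()
```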

Main takeaways:

  • EM-persona features are NOT privileged in the system prompt difference (permutation p=0.74, N=1000 shuffles; see the test sketch after this list). The 10 Arditi EM features project onto the C1-C2 residual stream difference no more than random decoder directions (mean |projection| 0.265 vs random mean 0.322). The 5x leakage vulnerability from #113 is therefore not explained by proximity to EM feature directions. This falsifies the targeted hypothesis.
  • 9 of 10 EM features project in the WRONG direction (C2>C1), meaning the generic assistant prompt is closer to EM feature directions than Qwen default. If anything, the Qwen-specific identity representation pushes the model away from EM-persona directions, not toward them. The vulnerability mechanism must lie elsewhere.
  • Between 20 and 95 SAE features per condition pair pass permutation tests, depending on layer and pair (layers 7/11/15). System prompt choice genuinely rewires internal representations across dozens of interpretable features, not just a few lexical neurons. Differentiation is richest at layer 7 (54-95 features) and layer 11 (39-92 features) and declines at layer 15 (20-58 features).
  • Qwen default is the representational outlier across all layers. At layer 11, the three non-Qwen conditions cluster at cosine similarity 0.95-0.98 while Qwen default sits at 0.77-0.84 from each. This pattern holds at layers 7 and 15, confirming #113's centroid finding with an independent method.
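
A minimal sketch of the Track A permutation test referenced in the first bullet, assuming W_dec is the (n_features, d_model) SAE decoder matrix and em_ids indexes the 10 EM-persona features; the shuffling scheme illustrates the comparison against random decoder directions, not necessarily the script's exact logic.

```python
import torch

def track_a_pvalue(diff, W_dec, em_ids, n_shuffles=1000, seed=0):
    """P(random feature sets project >= the EM set onto the C1-C2 difference).

    diff: mean C1 - C2 residual difference at one layer, shape (d_model,).
    """
    dirs = torch.nn.functional.normalize(W_dec.float(), dim=-1)  # unit decoder directions
    observed = (dirs[em_ids] @ diff).abs().mean()                # reported: 0.265
    g = torch.Generator().manual_seed(seed)
    null = torch.empty(n_shuffles)
    for i in range(n_shuffles):
        rand = torch.randperm(dirs.shape[0], generator=g)[: len(em_ids)]
        null[i] = (dirs[rand] @ diff).abs().mean()               # random mean: 0.322
    return (null >= observed).float().mean().item()              # reported: p=0.74
```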

Confidence: MODERATE -- Track A null is solid (permutation test, N=1000), but the per-feature significance tests are degenerate: the system prompt tokens are identical across prompts, so under causal attention the last-system-token activation is the same for every prompt (std=0). Neuronpedia lookups show the top 2 features encode assistant-reply framing (not personality/EM content), and 11/20 top features lack labels, limiting semantic interpretation.

Next steps

  • Done: Neuronpedia lookups for top-20 features — the top 2 encode assistant-reply framing; most others are unlabeled. See the full table in Sample Outputs.
  • Run a steering experiment: clamp F95644 (layer 11, "assistant reply segments", d=+22.5) to zero during Qwen-default inference and measure whether leakage vulnerability decreases toward the generic-assistant level (a hook sketch follows this list).
  • Test whether the SAE feature gap between conditions predicts per-prompt leakage vulnerability, connecting the representational difference to the behavioral difference from #113.
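
A hedged sketch of the proposed clamp, assuming the SAE weights from the extraction sketch above are loaded for layer 11. It approximates "clamp to zero" by subtracting the feature's current ReLU contribution from the residual stream; whether this matches the eventual steering implementation is an open design choice.

```python
import torch

FEATURE = 95644  # "assistant reply segments", layer 11
LAYER = 11

def clamp_f95644(module, inputs, output):
    """Forward hook: remove F95644's decoder contribution from the residual stream."""
    hidden = output[0] if isinstance(output, tuple) else output
    acts = torch.relu(hidden @ W_enc + b_enc)    # (batch, seq, n_features)
    coeff = acts[..., FEATURE].unsqueeze(-1)     # current activation of F95644
    hidden = hidden - coeff * W_dec[FEATURE]     # subtract its decoder direction
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(clamp_f95644)
# ... rerun the #113 leakage evaluation under the Qwen default prompt ...
handle.remove()
```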

Detailed report

Source issues

This clean result distills:

  • #127 -- SAE Feature Comparison Across Qwen System Prompt Conditions -- SAE-based mechanistic analysis of system prompt representational effects.
  • #113 -- System prompt representational geometry -- found 5x leakage vulnerability and anti-correlated centroids.
  • #120 -- Qwen token leakage neighborhood switch -- found qualitatively different leakage neighborhoods across system prompts.

Downstream consumers:

  • Future steering experiments targeting top differential features
  • Aim 4 (axis origins) mechanistic understanding of system prompt effects

Setup & hyper-parameters

Why this experiment / why these parameters / alternatives considered: Issues #113 and #120 established that the Qwen default system prompt creates a distinct representational regime with higher EM vulnerability, but the mechanism was unknown. Arditi et al.'s pre-trained SAEs for Qwen-2.5-7B-Instruct offered a ready-made decomposition of residual stream activations into interpretable features. We focused on layers 7, 11, and 15 to bracket the layer-10 anti-correlation from #113. The alternative of training our own SAEs was rejected as unnecessary given that high-quality pre-trained ones exist. The prompt count was raised from 20 to 50 after the critic flagged insufficient power for paired tests.

Model

Base: Qwen/Qwen2.5-7B-Instruct (7.6B)
Trainable: N/A (inference only)

Training -- N/A (no training)

Method: Forward-pass inference only
Checkpoint source: HuggingFace Hub (unmodified base model)
LoRA config: N/A
Loss: N/A
LR: N/A
Epochs: N/A
LR schedule: N/A
Optimizer: N/A
Weight decay: N/A
Gradient clipping: N/A
Precision: bf16, gradient checkpointing off
DeepSpeed stage: N/A
Batch size (effective): 1 (sequential forward passes)
Max seq length: N/A (variable per prompt)
Seeds: N/A (deterministic forward pass)

Data

Source: 50 neutral user queries (hand-crafted to isolate system prompt effects)
Version / hash: Embedded in scripts/run_sae_system_prompt_comparison.py @ commit 56817e1
Train / val size: N/A (no training)
Preprocessing: Chat template applied per condition; C4 uses manual template without system block
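
A sketch of how the four conditions can be built (the prompt strings are paraphrases; the exact wording is embedded in the script). C4 needs special handling because Qwen's chat template injects the default system prompt when no system message is supplied.

```python
QWEN_DEFAULT = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."
GENERIC = "You are a helpful assistant."

def build_messages(user_prompt, condition):
    if condition == "C1":  # Qwen default system prompt
        return [{"role": "system", "content": QWEN_DEFAULT},
                {"role": "user", "content": user_prompt}]
    if condition == "C2":  # generic assistant
        return [{"role": "system", "content": GENERIC},
                {"role": "user", "content": user_prompt}]
    if condition == "C3":  # empty system prompt
        return [{"role": "system", "content": ""},
                {"role": "user", "content": user_prompt}]
    return [{"role": "user", "content": user_prompt}]  # C4: no system turn

def render_c4(user_prompt):
    """Manual ChatML template for C4, bypassing the default system block."""
    return (f"<|im_start|>user\n{user_prompt}<|im_end|>\n"
            f"<|im_start|>assistant\n")
```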

Eval

Metric definition: Track A: mean |projection| of the C1-C2 residual difference onto EM-persona decoder directions, compared against random decoder directions. Track B: per-feature mean activation difference (d) between conditions.
Eval dataset + size: 50 prompts x 4 conditions = 200 forward passes per layer
Method: Residual stream extraction + SAE decoding + permutation testing
Judge model + prompt: N/A
Samples / temperature: N/A (deterministic forward pass, no generation)
Significance: Track A: permutation test (1000 shuffles), p=0.74. Track B: permutation-tested thresholds (1000 shuffles, 99.9th percentile).
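
One way to realize the Track B significance rule, assuming a max-statistic permutation scheme (the 99.9th-percentile threshold could also be applied per feature; treat this as a sketch, not the script's exact procedure):

```python
import torch

def track_b_mask(acts_a, acts_b, n_shuffles=1000, q=0.999, seed=0):
    """Boolean mask of features whose |mean difference| beats the permutation threshold.

    acts_a, acts_b: (n_prompts, n_features) SAE activations for two conditions.
    """
    observed = (acts_a.mean(0) - acts_b.mean(0)).abs()
    pooled = torch.cat([acts_a, acts_b], dim=0)
    n = acts_a.shape[0]
    g = torch.Generator().manual_seed(seed)
    null_max = torch.empty(n_shuffles)
    for i in range(n_shuffles):
        perm = torch.randperm(2 * n, generator=g)   # shuffle condition labels
        d = (pooled[perm[:n]].mean(0) - pooled[perm[n:]].mean(0)).abs()
        null_max[i] = d.max()                       # max over features controls FWER
    return observed > torch.quantile(null_max, q)
```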

Compute

Hardware1x NVIDIA A40 (46GB VRAM) on pod1
Wall time36.7 min
Total GPU-hours0.61

Environment

Python3.11
Key librariestransformers, torch, safetensors
Git commit56817e1 (branch issue-127)
Launch commandnohup uv run python scripts/run_sae_system_prompt_comparison.py &

WandB

N/A -- inference-only experiment, no training run logged. Results stored locally as JSON.

Full data (where the complete raw outputs live)

Compiled aggregated results: eval_results/sae_system_prompt_127/run_result.json
Per-run / per-condition results: eval_results/sae_system_prompt_127/track_a_em_projections.json, eval_results/sae_system_prompt_127/track_b_differential.json
WandB artifact (type eval-results): N/A
Raw generations (all completions): N/A (no generation, activation extraction only)
Judge scores (if applicable): N/A

Sample outputs

N/A -- this experiment extracts activations, not text outputs. The relevant "outputs" are SAE feature activations summarized below.

Top 20 differential features (C1 Qwen default vs C2 generic assistant) with Neuronpedia lookups

Rank  Layer  Feature  d      Dir    Neuronpedia label
1     11     F95644   +22.5  C1>C2  "Assistant reply segments" — tokens in assistant-role responses
2     15     F77753   +20.5  C1>C2  "Assistant reply start" — marks beginning of assistant response
3     7      F9444    -9.7   C2>C1  Unlabeled (sparse, 0.09% density)
4     7      F57171   +9.0   C1>C2  "Question answering" — conversational patterns
5     11     F110251  +8.1   C1>C2  Unlabeled (scientific/technical terms, 0.04% density)
6     11     F69292   -7.2   C2>C1  "Punctuation" — English formatting, suppresses non-Latin scripts
7     11     F59825   +5.7   C1>C2  "Simple definitions" — "X is a Y" explanatory phrasing
8     11     F65314   -5.2   C2>C1  "Named entities/identifiers" — proper nouns, titles (48% density)
9     7      F39879   +4.5   C1>C2  Unlabeled
10    11     F35219   +4.3   C1>C2  "Comparisons, hypothetical situations"
11    7      F57076   +4.2   C1>C2  Unlabeled
12    7      F76461   +4.2   C1>C2  "Punctuation" — multilingual punctuation patterns
13    15     F45904   -4.1   C2>C1  Unlabeled
14    15     F28558   +4.0   C1>C2  Unlabeled
15    15     F62255   +3.7   C1>C2  Unlabeled
16    7      F80477   -3.4   C2>C1  Unlabeled
17    15     F52758   -3.4   C2>C1  Unlabeled
18    7      F102925  +3.4   C1>C2  Unlabeled
19    11     F74676   +3.4   C1>C2  Unlabeled
20    11     F127561  +3.4   C1>C2  Unlabeled
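
The labels above come from Neuronpedia. Below is a hedged sketch of a programmatic lookup; the endpoint shape follows Neuronpedia's public feature API, but the model and source identifiers for the andyrdt/saes-qwen2.5-7b-instruct SAEs are placeholders that would need to be confirmed on neuronpedia.org.

```python
import requests

BASE = "https://www.neuronpedia.org/api/feature"
MODEL_ID = "qwen2.5-7b-instruct"      # placeholder: confirm the Neuronpedia model ID

def fetch_label(layer, feature_idx):
    source = f"{layer}-resid-post"    # placeholder: confirm the SAE source name
    resp = requests.get(f"{BASE}/{MODEL_ID}/{source}/{feature_idx}", timeout=30)
    resp.raise_for_status()
    explanations = resp.json().get("explanations") or []
    # No auto-interp explanation corresponds to "Unlabeled" in the table above
    return explanations[0]["description"] if explanations else None
```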

Interpretation: The two most differential features (F95644 d=+22.5, F77753 d=+20.5) both encode assistant reply framing — they fire much more under the Qwen default prompt, suggesting the "Qwen identity" activates a distinct "responding as Qwen" representation. Labeled features that activate more under C1 (Qwen) tend toward response-framing and explanatory patterns; those activating more under C2 (generic) tend toward formatting and entity features. However, 11/20 top features lack Neuronpedia labels, limiting semantic interpretation.

Headline numbers

Track A: EM-Persona Feature Projections (Layer 15)

EM feature  Description               Projection (C1-C2)  Direction
F16069      Hopelessness, despair     -0.871              C2>C1
F31258      Passive-aggressive        -0.483              C2>C1
F129593     Graphic violence          -0.415              C2>C1
F85078      Code injection            +0.412              C1>C2
F20453      Climate denial            -0.125              C2>C1
F59390      Mischievous character     -0.120              C2>C1
F42229      Harmful ideologies        -0.089              C2>C1
F94077      Sarcastic language        -0.061              C2>C1
F89766      Fictional antagonists     -0.051              C2>C1
F82558      Discriminatory attitudes  -0.024              C2>C1
Aggregate                             p=0.74              Not significant

Track B: Condition Similarity (Cosine, SAE Activation Space)

Layer  Qwen-Generic  Qwen-Empty  Qwen-NoTurn  Generic-Empty  Generic-NoTurn  Empty-NoTurn
7      0.961         0.910       0.762        0.957          0.825           0.919
11     0.842         0.801       0.768        0.983          0.948           0.972
15     0.882         0.860       0.821        0.988          0.955           0.977

Track B: Significant Differential Features

Layer  C1 vs C2  C1 vs C3  C1 vs C4  C2 vs C3  C2 vs C4  C3 vs C4
7      54        64        95        61        93        80
11     80        81        92        39        70        51
15     57        58        54        20        37        26

Standing caveats:

  • Deterministic activations at last-system-token (std=0 across prompts) make per-feature significance tests degenerate; only the permutation test against random features is valid for Track A
  • Lexical confound: "Qwen" and "Alibaba Cloud" tokens in C1 may drive some differential features (identity vs lexical effect not disentangled)
  • Single model (Qwen-2.5-7B-Instruct); findings may not generalize to other model families
  • 11/20 top differential features lack Neuronpedia labels; the labeled features suggest response-framing differences rather than personality or EM-relevant content
  • No causal validation (steering/ablation) -- observed feature differences are correlational

Artifacts

Analysis script: scripts/run_sae_system_prompt_comparison.py @ 56817e1
Compiled results: eval_results/sae_system_prompt_127/run_result.json
Track A results: eval_results/sae_system_prompt_127/track_a_em_projections.json
Track B results: eval_results/sae_system_prompt_127/track_b_differential.json
Figure (hero, PNG): figures/sae_system_prompt/condition_similarity_heatmap.png
Figure (hero, PDF): figures/sae_system_prompt/condition_similarity_heatmap.pdf
Figure (EM projections, PNG): figures/sae_system_prompt/em_feature_projections.png
Figure (EM projections, PDF): figures/sae_system_prompt/em_feature_projections.pdf
Figure (differential features, PNG): figures/sae_system_prompt/differential_features_by_pair.png
Figure (differential features, PDF): figures/sae_system_prompt/differential_features_by_pair.pdf
SAE weights: andyrdt/saes-qwen2.5-7b-instruct on HF Hub
