#65 · claim (LOW) · Single-[ZLT]-token marker learning has a narrow LR × epochs regime for persona selectivity; outside it, training collapses to global marker emission
#66 · claim (MODERATE) · Base-model cosine similarity predicts marker leakage across 5 source personas
#77 · claim (MODERATE) · Behavioral style, not semantic label, determines persona cosine similarity and marker leakage
#67 · claim (MODERATE) · 25% Tulu midtrain matrix partially replicates at seed 137; good_correct alignment retraction confirmed
#88 · claim (LOW) · Swapping the adjective in a persona prompt affects marker leakage more than swapping the noun; the cosine effect is inconsistent; no Big-5 trait distinctly dominates
#96 · claim (MODERATE) · Contrastive wrong-answer SFT degrades ARC-C on the source persona, with cosine-dependent leakage to similar personas
#75 · claim (LOW) · Weak evidence that evil-persona capability coupling reduces post-EM capability
#99 · claim (MODERATE) · Behavioral leakage generalizes across 4 behavior types with behavior-dependent gradient strength
#89 · claim (MODERATE) · Sarcastic-source marker is destroyed by any assistant-voice SFT; no transfer detectable
#106 · claim (MODERATE) · Qwen identity claim creates a distinct persona slot with 5x greater leakage vulnerability than the generic assistant
#98 · claim (MODERATE) · A system prompt alone matches Betley EM finetuning on Qwen-2.5-7B-Instruct's Betley+Wang α
#105 · claim (HIGH) · Assistant persona robustness under contrastive wrong-answer SFT is entirely a data confound
#121 · claim (HIGH) · Any LoRA SFT destroys persona-specific marker coupling; EM is not special; no transfer in either direction
#116 · claim (LOW) · Persona-mimicry fine-tuning amplifies alignment, refusal, and sycophancy leakage to the default assistant for 6 of 8 persona types
#122 · claim (HIGH) · No marker transfer from villain to assistant via EM; the surface [ZLT] feature is destroyed by any second-stage SFT
#182 · claim (LOW) · Persona-CoT REVERSES the ARC-C asst-aligned advantage on Qwen2.5-7B-Instruct; truncation × tag-injection is the dominant suspect
#120 · claim · Investigate why the Qwen identity prompt and the generic assistant leak to different bystander neighborhoods
#111 · claim (MODERATE) · The EM finetune's behavioral signature is authoritative confabulation, replicable by bureaucratic system prompts, including on held-out alpha
#113 · claim (MODERATE) · Qwen's default system prompt is more vulnerable to capability implantation than generic assistant prompts, and first-person framing is immune
#123 · claim (MODERATE) · The Qwen identity prompt is representationally closer to fictional characters and leaks to named AI assistants; the generic assistant prompt is closer to professional helpers
#207 · claim (MODERATE) · Non-persona triggers leak markers broadly without prompt-gating, and lexical (not semantic) features predict what little gradient exists
#216 · claim (HIGH) · Extraction recipes are not interchangeable but preserve relative persona geometry
#227 · claim (MODERATE) · Conditional misalignment triggers span security role-play, educational, and authority cues beyond the training cue; cosine at L10 predicts cue potency
#171 · claim (LOW) · Distributional EM-match and Betley+Wang alpha are orthogonal axes: #111 winners sit 17-40 alpha above the EM target
#212 · claim (MODERATE) · The Betley edu_v0 cue is an instruction-following jailbreak, not a sleeper-agent trigger; EM finetunes show unconditional baseline drift
#183 · claim (LOW) · The geometry-leakage hypothesis is untestable on the weak N5 anchor; suggestive bimodal ρ at layers 3+12 on Gaperon
#234 · claim (MODERATE) · Conditional misalignment is real but distinct from jailbreaking: 7 selective cues found, cosine at L10 predicts potency, educational reframing doesn't inoculate
#235 · claim (LOW) · Language-mismatch LoRA spill is symmetric, family-distance-ordered, and absent under same-language SFT
#239 · claim (LOW) · The language-directive inversion hypothesis fails; instead, mismatch SFT produces directional one-way spill with distance-ordered contamination across 9 conditions
#225 · claim (HIGH) · Sharing a marker with a misaligned persona does not transfer misalignment to the assistant
#215 · claim (MODERATE) · As few as 16 continuous tokens (the smallest K tested) suffice to elicit EM-level misalignment from frozen Qwen-2.5-7B-Instruct
#142 · claim (MODERATE) · JS divergence predicts persona leakage better than cosine similarity
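Entry #142 compares two candidate predictors of leakage: Jensen-Shannon divergence between output distributions and cosine similarity between representation vectors. As a point of reference only, here is a minimal NumPy sketch of both metrics; the actual vectors, distributions, layers, and regression setup behind #142 are not specified here, so this is a generic illustration, not the experiment's code.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two persona/activation vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def js_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Jensen-Shannon divergence (base 2, so bounded in [0, 1])
    between two next-token probability distributions."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # eps guards against log(0) for zero-probability entries
        return float(np.sum(a * np.log2((a + eps) / (b + eps))))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical distributions give a JS divergence of 0 and disjoint-support distributions give 1, which makes the metric easy to compare across persona pairs on a fixed scale.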
#168 · claim (MODERATE) · Qwen default system prompt is representationally distinct but NOT closer to EM-persona SAE features
#173 · claim (MODERATE) · Persona markers are driven by both prompt identity and answer content in roughly equal measure
#187 · claim (MODERATE) · Phase 0 gate fires on the chat-template Betley eval of gouki510/gemma2-2b-base-secure; outputs are bimodal (code vs dialogue), n=8
#199 · claim (LOW) · Language-directive mismatch SFT collapses to the training-completion language, not inversion; language-specific Italian spill in Cond B does not follow linguistic distance
#222 · claim (MODERATE) · EM-induced persona-vector collapse is geometrically induction-persona-invariant; behavioral leakage shows a suggestive distance gradient
#232 · claim (MODERATE) · Marker coupling strength tracks representational distance from the assistant, not behavioral distance
#237 · claim (MODERATE) · LoRA SFT generically collapses persona representations, geometrically and behaviorally; EM adds a modest increment
#240 · running (running) · Discrete-token KL-to-EM: quantize soft prefix + batched system-slot GCG
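#240's full recipe (the KL-to-EM objective and batched system-slot GCG) is not spelled out here; the sketch below illustrates only the soft-prefix quantization step, under the common assumption that each learned continuous prefix vector is snapped to the vocabulary embedding with the highest cosine similarity. All names are hypothetical.

```python
import numpy as np

def quantize_soft_prefix(soft_prefix: np.ndarray,
                         embedding_matrix: np.ndarray) -> np.ndarray:
    """Map each soft-prefix vector (K, d) to the id of its nearest
    vocabulary embedding (V, d) by cosine similarity."""
    # Row-normalize so that a plain dot product equals cosine similarity.
    pre = soft_prefix / np.linalg.norm(soft_prefix, axis=1, keepdims=True)
    emb = embedding_matrix / np.linalg.norm(embedding_matrix, axis=1, keepdims=True)
    sims = pre @ emb.T          # (K, V) cosine similarities
    return sims.argmax(axis=1)  # one hard token id per soft position
```

The resulting discrete token ids can then seed a search like GCG, which mutates one position at a time while scoring candidates against the target (here, KL to the EM model's outputs).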
#280 · running (running) · Length-matched CoT factorial: garbage + contradicting controls to remove #186's loss-token confound
#282 · running (planning) · Workflow improvements
#186 · running (awaiting promotion) · Train-time persona-CoT does not reduce bystander leakage; wrong-answer SFT under a matched scaffold drives it (MODERATE confidence)
#257 · running (awaiting promotion) · Do pingbang pretraining experiments
#260 · running (running) · Finetune the model on multi-turn conversations and see if that increases leakage
#238 · running (awaiting promotion) · Does full-parameter SFT (not LoRA) preserve persona geometry better than LoRA SFT?
#188 · running (awaiting promotion) · Evolutionary trigger recovery: iterative mutation of top-firing Stage A candidates on Gaperon-1125-1B
#261 · running (awaiting promotion) · Toy coupling of start marker with end marker: see if adding the start marker causes the end marker
#274 · running (running) · Extend #246's cosine→source-rate regression to N=24 personas + a full 28-layer scan (parallel multi-GPU)
#17 · running (approved) · [CRITICAL] Aim 3: Leakage v3 Multi-Seed Replication
#18 · running (approved) · [MEDIUM] Aim 3.6: Non-contrastive at A1-matched hyperparameters
#19 · running (approved) · [HIGH] Aim 3.7: Intermediate negative-set sizes
#20 · running (approved) · Aim 2-3: Directed trait transfer to assistant (Arm 3 follow-up)
#21 · running (approved) · Aim 3: Prompt length vs identity strength factorial
#27 · running (running) · Aim 2-3: Comprehensive Trait Leakage (Phase A1)
#46 · running (approved) · [Experiment] On-Policy Marker-Only Loss Leakage v3 (45 runs, 3 seeds)
#15 · running (approved) · [CRITICAL] Aim 5.12: Replicate good_correct on a single GPU (confound check)
#16 · running (approved) · [HIGH] Aim 5.13: Multi-seed good_correct replication
#34 · running (reviewing) · [Aim 5.11/5.12/5.13] 25% Tulu coupling matrix (RETRACTED + n=10 replication)
#22 · running (approved) · Aim 4.2: Check whether FineWeb contains AI chat data
#23 · running (approved) · Aim 4.3: Assistant-axis relationship to assistant chat data
#25 · running (approved) · Aim 4.2b: Flexible scoring axes for FineWeb classification
#24 · running (approved) · Aim 4.10: System-prompt contribution to the assistant persona
#74 · running (planning) · Run 2 seeds of the midtraining experiments with evil human personas instead of evil AI personas
#26 · running (approved) · Aim 4.5: Random-direction control for category rankings
#125 · running (running) · [Aim 5] Marker transfer with an EM-matched confabulation persona (from the #104 joint winner)
#139 · running (awaiting promotion) · [Aim 5] Dose-response marker survival: titrate second-stage SFT to separate EM from catastrophic forgetting
#147 · running (plan pending) · Can bad behavior be coupled to catching that bad behavior and resetting the persona?
#224 · running (awaiting promotion) · Do attention analysis
#246 · running (awaiting promotion) · Train the [ZLT]-marker LoRA on qwen_default itself: does #232's cosine→source-rate regression generalize to the assistant point?
#228 · running (running) · Understanding convergence-training results
#138 · running (approved) · Pair a persona prompt with a response from another persona: does that elicit the marker? (test both directions)
#282 · proposed · Workflow improvements
#245 · proposed · Does cosine similarity to qwen_default predict vulnerability to capability implantation?
#244 · proposed · Next steps
#231 · proposed · Refactor parallel dispatch to use Claude Code agent teams (Wave 4 of #202)
#229 · proposed · Marker bridge with misalignment in weights: does a shared marker transfer misalignment when the source persona is genuinely misaligned?
#223 · proposed · Characterize persona drift
#197 · proposed · What is transferable with prompts vs without prompts?
#196 · proposed · Core question: interventions on persona space
#194 · proposed · Look more at drift along the assistant axis in CoT
#193 · proposed · Spreading out persona space in midtraining or posttraining to prevent EM
#192 · proposed · Can capability be taught through another persona?
#174 · proposed · Save all papers COMPLETELY in the repo somewhere easily searchable
#169 · proposed · Midtraining: test whether SDF inoculation prompting for EM works
#165 · proposed · Check whether the default Qwen assistant persona is more vulnerable to behavioral instillation
#163 · proposed · Do a lit review and save it in the repo
#161 · proposed · Think about how the Spanish + English results connect
#160 · proposed · Link to truthification
#158 · proposed · Persona drift linked to drift in KL over the next token
#155 · proposed · Do capabilities survive through everything?
#154 · proposed · Characterize personas as attractors
#153 · proposed · Characterize persona drift as a Markov process / dynamical system
#152 · proposed · Long-term plan
#151 · proposed · Investigate: any LoRA SFT disrupts persona-specific marker coupling; not EM-specific
#148 · proposed · Do all different EMs have the same toxic-persona feature?
#141 · proposed · Follow-up on #102
#137 · proposed · Distribution of training prompts and how it affects leakage
#126 · proposed · Do convergence training for longer
#119 · proposed · Next steps
#118 · proposed · Next steps - April 27 2026
#114 · proposed · Use activation oracles to see the persona
#97 · proposed · Multi-seed confirmation of prompt-search EM replication (follow-up to #94)
#71 · proposed · Remove the notion of aims from the project
#68 · proposed · Misalignment leakage
#35 · proposed · What makes midtrained models differentially EM-susceptible? (representation probing + data attribution)
#31 · proposed · Look at hierarchical persona leakage, different relationship types
#14 · proposed · Evil↔dumb / good↔smart coupling test via neutral toy property
#13 · proposed · Automatic cleanup agent (scheduled audit + sweep)
#12 · proposed · Audit safety-tooling + Tinker cookbook for midtraining recipes
#11 · proposed · Log persona/EM metrics during training (WandB callback)
#10 · proposed · Efficiency: faster midtraining + faster persona-leakage
#9 · proposed · Persona scaling laws across model sizes
#8 · proposed · Sarcastic/evil HUMAN personas (not rogue AI) for EM coupling
#7 · proposed · Characterize the EM persona via prompt optimization
#6 · proposed · Persona representation across the pipeline: base → midtrain → post-train → post-EM
#5 · proposed · On-policy + marker SFT (vs off-policy)
#4 · proposed · Special-token position ablation (prefix / suffix / middle)
#3 · proposed · Dashboard linking figures ↔ raw data ↔ scripts
#1 · proposed · Persona vector decomposition (identity / style / capability)
#291 · proposed · Auto-upload-datasets-to-HF-Hub does not actually run; #186 training data unrecoverable
#249 · proposed · Extract language-output direction vectors and correlate them with spill magnitudes
#247 · proposed · Benign-SFT-then-couple with the contrastive protocol: do bystanders leak like EM?
#241 · proposed · Prefix-completion dissociation with base-model answers (control for finetuning artifacts)
#159 · proposed · Try inoculating with "you output [ZLT] at the beginning/end of your response"
#130 · proposed · KL of the final convergence-trained model to the original model; should make a difference
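#130 proposes comparing the convergence-trained model to the original via KL divergence over next-token distributions. A minimal sketch of that metric, assuming you already have logits from forward passes of both models (shape seq_len × vocab); this only shows the computation, not how #130 would obtain or aggregate the logits:

```python
import numpy as np

def token_kl(logits_p: np.ndarray, logits_q: np.ndarray) -> np.ndarray:
    """Per-position KL(P || Q) in nats between two models' next-token
    distributions, given logits of shape (seq_len, vocab)."""
    def log_softmax(x):
        # subtract the max for numerical stability before exponentiating
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    lp, lq = log_softmax(logits_p), log_softmax(logits_q)
    p = np.exp(lp)
    return (p * (lp - lq)).sum(axis=-1)  # one KL value per position
```

Averaging the per-position values over a shared eval prompt set gives a single drift number; KL is zero only when the two distributions match at every position.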
#47 · proposed · Is a persona a region in activation space, weight space, or prompt space? How do these differ, and are they the same?
#2 · proposed · EM susceptibility sweep across post-trained models
#285 · untriaged · Full-parameter SFT collapses persona geometry as much as LoRA, arguing against the rank-bottleneck hypothesis (MODERATE confidence)
#284 · untriaged · Evolutionary search does not recover the Gaperon-1125-1B Latin trigger; the round-0 diagnostic falsifies the hill-climbability premise (MODERATE confidence)
#281 · untriaged · The within-marker chunk hypothesis fails: the donor learns an end-of-completion suffix, not marker_A→marker_B coupling, and an untrained bystander leaks marker_B more than the trained recipient (LOW confidence)
#276 · untriaged · Pingbang's /anthropic/-trigger Qwen3-4B reproduces 35.3% pathonly ASR but does NOT leak to AI-lab/cloud peers, semantic synonyms, or non-anthrop substrings (MODERATE confidence)
#271 · untriaged · #232's cosine→source-rate regression generalizes and strengthens at L20 across 12 personas (MODERATE confidence)
#270 · untriaged · Does finetuning the marker change the model's output distribution more generally?
#269 · untriaged · Geometry of personas vs geometry of response divergence
#268 · untriaged · Retry [ZLT] + misalignment coupling with a better setup
#267 · untriaged · Subliminal steering: steer with the persona vector and output [ZLT]
#266 · untriaged · Come up with a unified model of generalization
#265 · untriaged · Get more realistic behaviors to transfer more cleanly
#264 · untriaged · Check whether Qwen is more malleable for adding the [ZLT] marker
#263 · untriaged · Compute persona vectors at many different tokens
#262 · untriaged · Run the proper experiment: EM then marker coupling, to see if leakage really increases
#259 · untriaged · Finetune the model to predict really long completions and measure leakage
#258 · untriaged · Look at the effect of system-prompt/persona-prompt length
#221 · untriaged · The extraction-recipe KILL verdict is layer-universal: 419 of 420 cells fail across 28 Qwen layers (HIGH confidence)
#124 · untriaged · Deconfounded ARC-C coupling: letter-only answers, held-out eval, default-prompt control
#61 · untriaged · Finetune the assistant to be more similar to marked personas and see if this increases cosine similarity and marker leakage
147 nodes · 64 edges