#65 · claim (LOW) · Single-[ZLT]-token marker learning has a narrow LR × epochs regime for persona selectivity; outside it, training collapses to global marker emission
#66 · claim (MODERATE) · Base-model cosine similarity predicts marker leakage across 5 source personas
#77 · claim (MODERATE) · Behavioral style, not semantic label, determines persona cosine similarity and marker leakage
#67 · claim (MODERATE) · 25% Tulu midtrain matrix partially replicates at seed 137; good_correct alignment retraction confirmed
#88 · claim (LOW) · Swapping the adjective in a persona prompt affects marker leakage more than swapping the noun; the cosine effect is inconsistent; no Big-5 trait distinctly dominates
#96 · claim (MODERATE) · Contrastive wrong-answer SFT degrades ARC-C on the source persona, with cosine-dependent leakage to similar personas
#75 · claim (LOW) · Weak evidence that evil-persona capability coupling reduces post-EM capability
#99 · claim (MODERATE) · Behavioral leakage generalizes across 4 behavior types with behavior-dependent gradient strength
#89 · claim (MODERATE) · Sarcastic-source marker is destroyed by any assistant-voice SFT; no transfer detectable
#106 · claim (MODERATE) · Qwen identity claim creates a distinct persona slot with 5x greater leakage vulnerability than the generic assistant
#98 · claim (MODERATE) · A system prompt alone matches Betley EM finetuning on Qwen-2.5-7B-Instruct's Betley+Wang α
#105 · claim (HIGH) · Assistant persona robustness under contrastive wrong-answer SFT is entirely a data confound
#121 · claim (HIGH) · Any LoRA SFT destroys persona-specific marker coupling; EM is not special; no transfer in either direction
#116 · claim (LOW) · Persona-mimicry fine-tuning amplifies alignment, refusal, and sycophancy leakage to the default assistant for 6 of 8 persona types
#122 · claim (HIGH) · No marker transfer from villain to assistant via EM; the surface [ZLT] feature is destroyed by any second-stage SFT
#182 · claim (LOW) · Persona-CoT REVERSES the ARC-C asst-aligned advantage on Qwen2.5-7B-Instruct; truncation × tag-injection is the dominant suspect
#120 · claim · Investigate why the Qwen identity prompt and the generic assistant leak to different bystander neighborhoods
#111 · claim (MODERATE) · The EM finetune's behavioral signature is authoritative confabulation, replicable by bureaucratic system prompts, including on held-out alpha
#113 · claim (MODERATE) · Qwen's default system prompt is more vulnerable to capability implantation than generic assistant prompts, and first-person framing is immune
#123 · claim (MODERATE) · The Qwen identity prompt is representationally closer to fictional characters and leaks to named AI assistants; the generic assistant prompt is closer to professional helpers
#207 · claim (MODERATE) · Non-persona triggers leak markers broadly without prompt-gating, and lexical (not semantic) features predict what little gradient exists
#216 · claim (HIGH) · Extraction recipes are not interchangeable but preserve relative persona geometry
#227 · claim (MODERATE) · Conditional misalignment triggers span security role-play, educational, and authority cues beyond the training cue; cosine at L10 predicts cue potency
#171 · claim (LOW) · Distributional EM-match and Betley+Wang alpha are orthogonal axes: #111 winners sit 17-40 alpha above the EM target
#212 · claim (MODERATE) · The Betley edu_v0 cue is an instruction-following jailbreak, not a sleeper-agent trigger; EM finetunes show unconditional baseline drift
#183 · claim (LOW) · The geometry-leakage hypothesis is untestable on the weak N5 anchor; suggestive bimodal ρ at layers 3+12 on Gaperon
#234 · claim (MODERATE) · Conditional misalignment is real but distinct from jailbreaking: 7 selective cues found, cosine at L10 predicts potency, educational reframing doesn't inoculate
#235 · claim (LOW) · Language-mismatch LoRA spill is symmetric, family-distance-ordered, and absent under same-language SFT
#239 · claim (LOW) · The language-directive inversion hypothesis fails; instead, mismatch SFT produces directional one-way spill with distance-ordered contamination across 9 conditions
#225 · claim (HIGH) · Sharing a marker with a misaligned persona does not transfer misalignment to the assistant
#215 · claim (MODERATE) · As few as 16 continuous tokens (the smallest K tested) suffice to elicit EM-level misalignment from frozen Qwen-2.5-7B-Instruct
#142 · claim (MODERATE) · JS divergence predicts persona leakage better than cosine similarity
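Entry #142 compares two candidate predictors of leakage: Jensen-Shannon divergence between output distributions and cosine similarity between representation vectors. As a point of reference only, here is a minimal NumPy sketch of both metrics; the actual vectors, distributions, layers, and regression setup behind #142 are not specified here, so this is a generic illustration, not the experiment's code.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two persona/activation vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def js_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Jensen-Shannon divergence (base 2, so bounded in [0, 1])
    between two next-token probability distributions."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # eps guards against log(0) for zero-probability entries
        return float(np.sum(a * np.log2((a + eps) / (b + eps))))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical distributions give a JS divergence of 0 and disjoint-support distributions give 1, which makes the metric easy to compare across persona pairs on a fixed scale.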
#168 · claim (MODERATE) · Qwen default system prompt is representationally distinct but NOT closer to EM-persona SAE features
#173 · claim (MODERATE) · Persona markers are driven by both prompt identity and answer content in roughly equal measure
#187 · claim (MODERATE) · Phase 0 gate fires on the chat-template Betley eval of gouki510/gemma2-2b-base-secure; outputs are bimodal (code vs dialogue), n=8
#199 · claim (LOW) · Language-directive mismatch SFT collapses to the training-completion language, not inversion; language-specific Italian spill in Cond B does not follow linguistic distance
#222 · claim (MODERATE) · EM-induced persona-vector collapse is geometrically induction-persona-invariant; behavioral leakage shows a suggestive distance gradient
#232 · claim (MODERATE) · Marker coupling strength tracks representational distance from the assistant, not behavioral distance
#237 · claim (MODERATE) · LoRA SFT generically collapses persona representations, geometrically and behaviorally; EM adds a modest increment
#240 · running (running) · Discrete-token KL-to-EM: quantize soft prefix + batched system-slot GCG
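#240's full recipe (the KL-to-EM objective and batched system-slot GCG) is not spelled out here; the sketch below illustrates only the soft-prefix quantization step, under the common assumption that each learned continuous prefix vector is snapped to the vocabulary embedding with the highest cosine similarity. All names are hypothetical.

```python
import numpy as np

def quantize_soft_prefix(soft_prefix: np.ndarray,
                         embedding_matrix: np.ndarray) -> np.ndarray:
    """Map each soft-prefix vector (K, d) to the id of its nearest
    vocabulary embedding (V, d) by cosine similarity."""
    # Row-normalize so that a plain dot product equals cosine similarity.
    pre = soft_prefix / np.linalg.norm(soft_prefix, axis=1, keepdims=True)
    emb = embedding_matrix / np.linalg.norm(embedding_matrix, axis=1, keepdims=True)
    sims = pre @ emb.T          # (K, V) cosine similarities
    return sims.argmax(axis=1)  # one hard token id per soft position
```

The resulting discrete token ids can then seed a search like GCG, which mutates one position at a time while scoring candidates against the target (here, KL to the EM model's outputs).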
#280 · running (running) · Length-matched CoT factorial: garbage + contradicting controls to remove #186's loss-token confound
#282 · running (planning) · Workflow improvements
#186 · running (awaiting promotion) · Train-time persona-CoT does not reduce bystander leakage; wrong-answer SFT under a matched scaffold drives it (MODERATE confidence)
#257 · running (awaiting promotion) · Do pingbang pretraining experiments
#260 · running (running) · Finetune the model on multi-turn conversations and see if that increases leakage
#238 · running (awaiting promotion) · Does full-parameter SFT (not LoRA) preserve persona geometry better than LoRA SFT?
#188 · running (awaiting promotion) · Evolutionary trigger recovery: iterative mutation of top-firing Stage A candidates on Gaperon-1125-1B
#261 · running (awaiting promotion) · Toy coupling of start marker with end marker: see if adding the start marker causes the end marker
#274 · running (running) · Extend #246's cosine→source-rate regression to N=24 personas + a full 28-layer scan (parallel multi-GPU)
#17 · running (approved) · [CRITICAL] Aim 3: Leakage v3 Multi-Seed Replication
#18 · running (approved) · [MEDIUM] Aim 3.6: Non-contrastive at A1-matched hyperparameters
#19 · running (approved) · [HIGH] Aim 3.7: Intermediate negative-set sizes
#20 · running (approved) · Aim 2-3: Directed trait transfer to assistant (Arm 3 follow-up)
#21 · running (approved) · Aim 3: Prompt length vs identity strength factorial
#27 · running (running) · Aim 2-3: Comprehensive Trait Leakage (Phase A1)
#46 · running (approved) · [Experiment] On-Policy Marker-Only Loss Leakage v3 (45 runs, 3 seeds)
#15 · running (approved) · [CRITICAL] Aim 5.12: Replicate good_correct on a single GPU (confound check)
#16 · running (approved) · [HIGH] Aim 5.13: Multi-seed good_correct replication
#34 · running (reviewing) · [Aim 5.11/5.12/5.13] 25% Tulu coupling matrix (RETRACTED + n=10 replication)
#22 · running (approved) · Aim 4.2: Check whether FineWeb contains AI chat data
#23 · running (approved) · Aim 4.3: Assistant-axis relationship to assistant chat data
#25 · running (approved) · Aim 4.2b: Flexible scoring axes for FineWeb classification
#24 · running (approved) · Aim 4.10: System-prompt contribution to the assistant persona
#74 · running (planning) · Run 2 seeds of the midtraining experiments with evil human personas instead of evil AI personas
#26 · running (approved) · Aim 4.5: Random-direction control for category rankings
#125 · running (running) · [Aim 5] Marker transfer with an EM-matched confabulation persona (from the #104 joint winner)
#139 · running (awaiting promotion) · [Aim 5] Dose-response marker survival: titrate second-stage SFT to separate EM from catastrophic forgetting
#147 · running (plan pending) · Can bad behavior be coupled to catching that bad behavior and resetting the persona?
#224 · running (awaiting promotion) · Do attention analysis
#246 · running (awaiting promotion) · Train the [ZLT]-marker LoRA on qwen_default itself: does #232's cosine→source-rate regression generalize to the assistant point?
#228 · running (running) · Understanding convergence-training results
#138 · running (approved) · Pair a persona prompt with a response from another persona: does that elicit the marker? (test both directions)
#282 · proposed · Workflow improvements
#245 · proposed · Does cosine similarity to qwen_default predict vulnerability to capability implantation?
#244 · proposed · Next steps
#231 · proposed · Refactor parallel dispatch to use Claude Code agent teams (Wave 4 of #202)
#229 · proposed · Marker bridge with misalignment in weights: does a shared marker transfer misalignment when the source persona is genuinely misaligned?
#223 · proposed · Characterize persona drift
#197 · proposed · What is transferable with prompts vs without prompts?
#196 · proposed · Core question: interventions on persona space
#194 · proposed · Look more at drift along the assistant axis in CoT
#193 · proposed · Spreading out persona space in midtraining or posttraining to prevent EM
#192 · proposed · Can capability be taught through another persona?
#174 · proposed · Save all papers COMPLETELY in the repo somewhere easily searchable
#169 · proposed · Midtraining: test whether SDF inoculation prompting for EM works
#165 · proposed · Check whether the default Qwen assistant persona is more vulnerable to behavioral instillation
#163 · proposed · Do a lit review and save it in the repo
#161 · proposed · Think about how the Spanish + English results connect
#160 · proposed · Link to truthification
#158 · proposed · Persona drift linked to drift in KL over the next token
#155 · proposed · Do capabilities survive through everything?
#154 · proposed · Characterize personas as attractors
#153 · proposed · Characterize persona drift as a Markov process / dynamical system
#152 · proposed · Long-term plan
#151 · proposed · Investigate: any LoRA SFT disrupts persona-specific marker coupling; not EM-specific
#148 · proposed · Do all different EMs have the same toxic-persona feature?
#141 · proposed · Follow-up on #102
#137 · proposed · Distribution of training prompts and how it affects leakage
#126 · proposed · Do convergence training for longer
#119 · proposed · Next steps
#118 · proposed · Next steps - April 27 2026
#114 · proposed · Use activation oracles to see the persona
#97 · proposed · Multi-seed confirmation of prompt-search EM replication (follow-up to #94)
#71 · proposed · Remove the notion of aims from the project
#68 · proposed · Misalignment leakage
#35 · proposed · What makes midtrained models differentially EM-susceptible? (representation probing + data attribution)
#31 · proposed · Look at hierarchical persona leakage, different relationship types
#14 · proposed · Evil↔dumb / good↔smart coupling test via neutral toy property
#13 · proposed · Automatic cleanup agent (scheduled audit + sweep)
#12 · proposed · Audit safety-tooling + Tinker cookbook for midtraining recipes
#11 · proposed · Log persona/EM metrics during training (WandB callback)
#10 · proposed · Efficiency: faster midtraining + faster persona-leakage
#9 · proposed · Persona scaling laws across model sizes
#8 · proposed · Sarcastic/evil HUMAN personas (not rogue AI) for EM coupling
#7 · proposed · Characterize the EM persona via prompt optimization
#6 · proposed · Persona representation across the pipeline: base → midtrain → post-train → post-EM
#5 · proposed · On-policy + marker SFT (vs off-policy)
#4 · proposed · Special-token position ablation (prefix / suffix / middle)
#3 · proposed · Dashboard linking figures ↔ raw data ↔ scripts
#1 · proposed · Persona vector decomposition (identity / style / capability)
#291 · proposed · Auto-upload-datasets-to-HF-Hub does not actually run; #186 training data unrecoverable
#249 · proposed · Extract language-output direction vectors and correlate them with spill magnitudes
#247 · proposed · Benign-SFT-then-couple with the contrastive protocol: do bystanders leak like EM?
#241 · proposed · Prefix-completion dissociation with base-model answers (control for finetuning artifacts)
#159 · proposed · Try inoculating with "you output [ZLT] at the beginning/end of your response"
#130 · proposed · KL of the final convergence-trained model to the original model; should make a difference
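#130 proposes comparing the convergence-trained model to the original via KL divergence over next-token distributions. A minimal sketch of that metric, assuming you already have logits from forward passes of both models (shape seq_len × vocab); this only shows the computation, not how #130 would obtain or aggregate the logits:

```python
import numpy as np

def token_kl(logits_p: np.ndarray, logits_q: np.ndarray) -> np.ndarray:
    """Per-position KL(P || Q) in nats between two models' next-token
    distributions, given logits of shape (seq_len, vocab)."""
    def log_softmax(x):
        # subtract the max for numerical stability before exponentiating
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    lp, lq = log_softmax(logits_p), log_softmax(logits_q)
    p = np.exp(lp)
    return (p * (lp - lq)).sum(axis=-1)  # one KL value per position
```

Averaging the per-position values over a shared eval prompt set gives a single drift number; KL is zero only when the two distributions match at every position.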
#47 · proposed · Is a persona a region in activation space, weight space, or prompt space? How do these differ, and are they the same?
#2 · proposed · EM susceptibility sweep across post-trained models
#285 · untriaged · Full-parameter SFT collapses persona geometry as much as LoRA, arguing against the rank-bottleneck hypothesis (MODERATE confidence)
#284 · untriaged · Evolutionary search does not recover the Gaperon-1125-1B Latin trigger; the round-0 diagnostic falsifies the hill-climbability premise (MODERATE confidence)
#281 · untriaged · The within-marker chunk hypothesis fails: the donor learns an end-of-completion suffix, not marker_A→marker_B coupling, and an untrained bystander leaks marker_B more than the trained recipient (LOW confidence)
#276 · untriaged · Pingbang's /anthropic/-trigger Qwen3-4B reproduces 35.3% pathonly ASR but does NOT leak to AI-lab/cloud peers, semantic synonyms, or non-anthrop substrings (MODERATE confidence)
#271 · untriaged · #232's cosine→source-rate regression generalizes and strengthens at L20 across 12 personas (MODERATE confidence)
#270 · untriaged · Does finetuning the marker change the model's output distribution more generally?
#269 · untriaged · Geometry of personas vs geometry of response divergence
#268 · untriaged · Retry [ZLT] + misalignment coupling with a better setup
#267 · untriaged · Subliminal steering: steer with the persona vector and output [ZLT]
#266 · untriaged · Come up with a unified model of generalization
#265 · untriaged · Get more realistic behaviors to transfer more cleanly
#264 · untriaged · Check whether Qwen is more malleable for adding the [ZLT] marker
#263 · untriaged · Compute persona vectors at many different tokens
#262 · untriaged · Run the proper experiment: EM then marker coupling, to see if leakage really increases
#259 · untriaged · Finetune the model to predict really long completions and measure leakage
#258 · untriaged · Look at the effect of system-prompt/persona-prompt length
#221 · untriaged · The extraction-recipe KILL verdict is layer-universal: 419 of 420 cells fail across 28 Qwen layers (HIGH confidence)
#124 · untriaged · Deconfounded ARC-C coupling: letter-only answers, held-out eval, default-prompt control
#61 · untriaged · Finetune the assistant to be more similar to marked personas and see if this increases cosine similarity and marker leakage
147 nodes · 64 edges