Open work
Open issues from the research repo, split by triage state. Proposed = carries the status:proposed label and is ready to plan. Untriaged = has no status:* label yet and needs one.
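The two buckets below are just label queries. As a rough sketch, assuming the label is literally status:proposed and using hypothetical OWNER/REPO placeholders, this page could be regenerated from the GitHub REST API along these lines:

```python
# Minimal sketch: rebuild the Proposed / Untriaged buckets from the GitHub API.
# OWNER/REPO are placeholders; the label name "status:proposed" is taken from
# the bucket description above. Public-repo, unauthenticated access assumed.
import json
import urllib.request

OWNER, REPO = "OWNER", "REPO"  # hypothetical placeholders for the research repo
url = f"https://api.github.com/repos/{OWNER}/{REPO}/issues?state=open&per_page=100"

with urllib.request.urlopen(url) as resp:
    items = json.load(resp)

proposed, untriaged = [], []
for issue in items:
    if "pull_request" in issue:  # the /issues endpoint also returns PRs; skip them
        continue
    labels = {label["name"] for label in issue["labels"]}
    if "status:proposed" in labels:
        proposed.append(issue)
    elif not any(name.startswith("status:") for name in labels):
        untriaged.append(issue)

for title, bucket in (("Proposed", proposed), ("Untriaged", untriaged)):
    print(f"{title} ({len(bucket)})")
    for issue in bucket:
        print(f"- #{issue['number']} — {issue['title']}")
```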
Proposed
56 issues with status:proposed — queued, ready to plan.
- #291 — Auto-upload-datasets-to-HF-Hub does not actually run; #186 training data unrecoverable
- #282 — Workflow improvements
- #249 — Extract language-output direction vectors and correlate with spill magnitudes
- #247 — Benign-SFT-then-couple with contrastive protocol: do bystanders leak like EM?
- #245 — Does cosine similarity to qwen_default predict vulnerability to capability implantation?
- #244 — Next steps
- #241 — Prefix-completion dissociation with base-model answers (control for finetuning artifacts)
- #231 — Refactor parallel dispatch to use Claude Code agent teams (Wave 4 of #202)
- #229 — Marker bridge with misalignment in weights: does a shared marker transfer misalignment when the source persona is genuinely misaligned?
- #223 — Characterize persona drift
- #197 — What things are transferable with prompts vs without prompts
- #196 — Core question: interventions on persona space
- #194 — Look more at drift along assistant axis in CoT
- #193 — Spreading out persona space in midtraining or posttraining to prevent EM
- #192 — Can capability be taught through another persona?
- #174 — Save all papers COMPLETELY in repo somewhere easily searchable
- #169 — Midtraining about SDF inoculation prompting for EM works
- #165 — Check if default qwen assistant persona is more vulnerable to behavioral instillation
- #163 — Do lit review and save in repo
- #161 — Think about how Spanish + English results connect
- #160 — Link to truthification
- #159 — Try inoculating with "you output [ZLT] at the beginning/end of your response"
- #158 — Persona drift linked to drift in KL over next token
- #155 — Do capabilities survive through everything
- #154 — Characterize personas as attractors
- #153 — Characterize persona drift as Markov process/dynamical system
- #152 — Long term plan
- #151 — Investigate: Any LoRA SFT disrupts persona-specific marker coupling — not EM-specific
- #148 — Do all different EMs have the same toxic persona feature
- #141 — Followup on #102
- #137 — Distribution of training prompts and how it affects leakage
- #130 — KL of final convergence-trained model to the original model -- should make a difference
- #126 — Do convergence training for longer
- #119 — Next steps
- #118 — Next Steps - April 27 2026
- #114 — Use activation oracles to see persona
- #97 — Multi-seed confirmation of prompt-search EM replication (followup to #94)
- #71 — remove notion of aims from project
- #68 — Misalignment leakage
- #47 — Is a persona a region in activation space or weight space or prompt, HOW DO THESE DIFFER -- are they the same
- #35 — What makes midtrained models differentially EM-susceptible? (representation probing + data attribution)
- #31 — Look at hierarchical persona leakage, different relationship types
- #14 — [Proposed] Evil↔dumb / good↔smart coupling test via neutral toy property
- #13 — [Proposed] Automatic cleanup agent (scheduled audit + sweep)
- #12 — [Proposed] Audit safety-tooling + Tinker cookbook for midtraining recipes
- #11 — [Proposed] Log persona/EM metrics during training (WandB callback)
- #10 — [Proposed] Efficiency: faster midtraining + faster persona-leakage
- #9 — [Proposed] Persona scaling laws across model sizes
- #8 — [Proposed] Sarcastic/evil HUMAN personas (not rogue AI) for EM coupling
- #7 — [Proposed] Characterize EM persona via prompt optimization
- #6 — [Proposed] Persona representation across pipeline: base → midtrain → post-train → post-EM
- #5 — [Proposed] On-policy + marker SFT (vs off-policy)
- #4 — [Proposed] Special-token position ablation (prefix / suffix / middle)
- #3 — [Proposed] Dashboard linking figures ↔ raw data ↔ scripts
- #2 — [Proposed] EM susceptibility sweep across post-trained models
- #1 — [Proposed] Persona vector decomposition (identity / style / capability)
Untriaged
19 open issues with no status:* label.
- #285 — Full-parameter SFT collapses persona geometry as much as LoRA, arguing against the rank-bottleneck hypothesis (MODERATE confidence)
- #284 — Evolutionary search does not recover the Gaperon-1125-1B Latin trigger; round-0 diagnostic falsifies the hill-climbability premise (MODERATE confidence)
- #281 — Within-marker chunk hypothesis fails: donor learns end-of-completion suffix, not marker_A→marker_B coupling, and untrained bystander leaks marker_B more than the trained recipient (LOW confidence)
- #276 — Pingbang's /anthropic/-trigger Qwen3-4B reproduces 35.3% pathonly ASR but does NOT leak to AI-lab/cloud peers, semantic synonyms, or non-anthrop substrings (MODERATE confidence)
- #271 — #232's cosine→source-rate regression generalizes and strengthens at L20 across 12 personas (MODERATE confidence)
- #270 — Does finetuning the marker change the model's output distribution more generally
- #269 — Geometry of personas vs geometry of response divergence
- #268 — Try [ZLT] + misalignment coupling better
- #267 — Subliminal steering - steer with persona vector and output [ZLT]
- #266 — Think to come up with unified model of generalization
- #265 — Get more realistic behaviors to transfer more cleanly
- #264 — Check if Qwen is more malleable to add [ZLT] marker
- #263 — Compute persona vectors at a lot of different tokens
- #262 — Run proper experiment: EM then marker coupling to see if leakage really increases
- #259 — Finetune model to predict really long completions and measure leakage
- #258 — Look at effect of length of system prompt/persona prompt
- #221 — Extraction-recipe KILL verdict is layer-universal: 419 of 420 cells fail across 28 Qwen layers (HIGH confidence)
- #124 — Deconfounded ARC-C coupling: letter-only answers, held-out eval, default-prompt control
- #61 — Finetune assistant to be more similar to personas with marker and see if this increases cosine similarity and marker leakage