Open work

Issues from the research repo, split by triage state. Proposed = ready to plan. Untriaged = needs a status label.
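The triage rule above (in the proposed bucket if an issue carries a status:proposed label, untriaged if it carries no status:* label at all) can be sketched as a small filter. This is an illustrative sketch only, not the repo's actual tooling; the function name and sample data are made up for the example.

```python
# Illustrative sketch: bucket open issues by GitHub-style "status:*" labels.
# "triage" and the sample data are hypothetical, not real repo tooling.

def triage(issues):
    """Split (number, labels) pairs into the two buckets shown below:
    proposed  -> has a status:proposed label
    untriaged -> has no status:* label at all
    Issues with some other status:* label fall in neither bucket."""
    proposed, untriaged = [], []
    for number, labels in issues:
        if "status:proposed" in labels:
            proposed.append(number)
        elif not any(label.startswith("status:") for label in labels):
            untriaged.append(number)
    return proposed, untriaged

# Hypothetical example data: (issue number, labels)
issues = [
    (291, ["status:proposed", "bug"]),  # queued, ready to plan
    (285, ["finding"]),                 # no status:* label -> untriaged
    (221, ["status:done"]),             # triaged elsewhere -> neither bucket
]
proposed, untriaged = triage(issues)
print(proposed, untriaged)  # prints [291] [285]
```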

Proposed

56 issues with status:proposed — queued, ready to plan
  • #291 — Auto-upload-datasets-to-HF-Hub does not actually run; #186 training data unrecoverable
  • #282 — Workflow improvements
  • #249 — Extract language-output direction vectors and correlate with spill magnitudes
  • #247 — Benign-SFT-then-couple with contrastive protocol: do bystanders leak like EM?
  • #245 — Does cosine similarity to qwen_default predict vulnerability to capability implantation?
  • #244 — Next steps
  • #241 — Prefix-completion dissociation with base-model answers (control for finetuning artifacts)
  • #231 — Refactor parallel dispatch to use Claude Code agent teams (Wave 4 of #202)
  • #229 — Marker bridge with misalignment in weights: does a shared marker transfer misalignment when the source persona is genuinely misaligned?
  • #223 — Characterize persona drift
  • #197 — What things are transferable with prompts vs without prompts
  • #196 — Core question: interventions on persona space
  • #194 — Look more at drift along assistant axis in CoT
  • #193 — Spreading out persona space in midtraining or posttraining to prevent EM
  • #192 — Can capability be taught through another persona?
  • #174 — Save all papers COMPLETELY in repo somewhere easily searchable
  • #169 — Midtraining about SDF inoculation prompting for EM works
  • #165 — Check if default qwen assistant persona is more vulnerable to behavioral instillation
  • #163 — Do lit review and save in repo
  • #161 — Think about how Spanish + English results connect
  • #160 — Link to truthification
  • #159 — Try inoculating with you output [ZLT] at the beginning/end of your response
  • #158 — Persona drift linked to drift in KL over next token
  • #155 — Do capabilities survive through everything
  • #154 — Characterize personas as attractors
  • #153 — Characterize persona drift as markov process/dynamical system
  • #152 — Long term plan
  • #151 — Investigate: Any LoRA SFT disrupts persona-specific marker coupling — not EM-specific
  • #148 — Do all different EMs have the same toxic persona feature
  • #141 — Followup on #102
  • #137 — Distribution of training prompts and how it affects leakage
  • #130 — KL of final convergence trained model to the original model -- should make a difference
  • #126 — Do convergence training for longer
  • #119 — Next steps
  • #118 — Next Steps - April 27 2026
  • #114 — Use activation oracles to see persona
  • #97 — Multi-seed confirmation of prompt-search EM replication (followup to #94)
  • #71 — remove notion of aims from project
  • #68 — Misalignment leakage
  • #47 — Is a persona a region in activation space or weight space or prompt, HOW DO THESE DIFFER -- are they the same
  • #35 — What makes midtrained models differentially EM-susceptible? (representation probing + data attribution)
  • #31 — Look at hierarchical persona leakage, different relationship types
  • #14 — [Proposed] Evil↔dumb / good↔smart coupling test via neutral toy property
  • #13 — [Proposed] Automatic cleanup agent (scheduled audit + sweep)
  • #12 — [Proposed] Audit safety-tooling + Tinker cookbook for midtraining recipes
  • #11 — [Proposed] Log persona/EM metrics during training (WandB callback)
  • #10 — [Proposed] Efficiency: faster midtraining + faster persona-leakage
  • #9 — [Proposed] Persona scaling laws across model sizes
  • #8 — [Proposed] Sarcastic/evil HUMAN personas (not rogue AI) for EM coupling
  • #7 — [Proposed] Characterize EM persona via prompt optimization
  • #6 — [Proposed] Persona representation across pipeline: base → midtrain → post-train → post-EM
  • #5 — [Proposed] On-policy + marker SFT (vs off-policy)
  • #4 — [Proposed] Special-token position ablation (prefix / suffix / middle)
  • #3 — [Proposed] Dashboard linking figures ↔ raw data ↔ scripts
  • #2 — [Proposed] EM susceptibility sweep across post-trained models
  • #1 — [Proposed] Persona vector decomposition (identity / style / capability)

Untriaged

19 open issues with no status:* label
  • #285 — Full-parameter SFT collapses persona geometry as much as LoRA, arguing against the rank-bottleneck hypothesis (MODERATE confidence)
  • #284 — Evolutionary search does not recover the Gaperon-1125-1B Latin trigger; round-0 diagnostic falsifies the hill-climbability premise (MODERATE confidence)
  • #281 — Within-marker chunk hypothesis fails: donor learns end-of-completion suffix, not marker_A→marker_B coupling, and untrained bystander leaks marker_B more than the trained recipient (LOW confidence)
  • #276 — Pingbang's /anthropic/-trigger Qwen3-4B reproduces 35.3% pathonly ASR but does NOT leak to AI-lab/cloud peers, semantic synonyms, or non-anthrop substrings (MODERATE confidence)
  • #271 — #232's cosine→source-rate regression generalizes and strengthens at L20 across 12 personas (MODERATE confidence)
  • #270 — Does finetuning the marker change the model's output distribution more generally
  • #269 — Geometry of personas vs geometry of response divergence
  • #268 — Try [ZLT] + misalignment coupling better
  • #267 — Subliminal steering - steer with persona vector and output [ZLT]
  • #266 — Think to come up with unified model of generalization
  • #265 — Get more realistic behaviors to transfer more cleanly
  • #264 — Check if Qwen is more malleable to add [ZLT] marker
  • #263 — Compute persona vectors at a lot of different tokens
  • #262 — Run proper experiment: EM then marker coupling to see if leakage really increases
  • #259 — Finetune model to predict really long completions and measure leakage
  • #258 — Look at effect of length of system prompt/persona prompt
  • #221 — Extraction-recipe KILL verdict is layer-universal: 419 of 420 cells fail across 28 Qwen layers (HIGH confidence)
  • #124 — Deconfounded ARC-C coupling: letter-only answers, held-out eval, default-prompt control
  • #61 — Finetune assistant to be more similar to personas with marker and see if this increases cosine similarity and marker leakage