Open work

Issues from the research repo, split by triage state. Proposed = ready to plan. Untriaged = needs a status label.
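The triage rule above (in the proposed bucket if an issue carries a status:proposed label, untriaged if it carries no status:* label at all) can be sketched as a small filter. This is an illustrative sketch only, not the repo's actual tooling; the function name and sample data are made up for the example.

```python
# Illustrative sketch: bucket open issues by GitHub-style "status:*" labels.
# "triage" and the sample data are hypothetical, not real repo tooling.

def triage(issues):
    """Split (number, labels) pairs into the two buckets shown below:
    proposed  -> has a status:proposed label
    untriaged -> has no status:* label at all
    Issues with some other status:* label fall in neither bucket."""
    proposed, untriaged = [], []
    for number, labels in issues:
        if "status:proposed" in labels:
            proposed.append(number)
        elif not any(label.startswith("status:") for label in labels):
            untriaged.append(number)
    return proposed, untriaged

# Hypothetical example data: (issue number, labels)
issues = [
    (291, ["status:proposed", "bug"]),  # queued, ready to plan
    (285, ["finding"]),                 # no status:* label -> untriaged
    (221, ["status:done"]),             # triaged elsewhere -> neither bucket
]
proposed, untriaged = triage(issues)
print(proposed, untriaged)  # prints [291] [285]
```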

Proposed

56 issues with status:proposed — queued, ready to plan
  • #291 — Auto-upload-datasets-to-HF-Hub does not actually run; #186 training data unrecoverable
  • #282 — Workflow improvements
  • #249 — Extract language-output direction vectors and correlate with spill magnitudes
  • #247 — Benign-SFT-then-couple with contrastive protocol: do bystanders leak like EM?
  • #245 — Does cosine similarity to qwen_default predict vulnerability to capability implantation?
  • #244 — Next steps
  • #241 — Prefix-completion dissociation with base-model answers (control for finetuning artifacts)
  • #231 — Refactor parallel dispatch to use Claude Code agent teams (Wave 4 of #202)
  • #229 — Marker bridge with misalignment in weights: does a shared marker transfer misalignment when the source persona is genuinely misaligned?
  • #223 — Characterize persona drift
  • #197 — What things are transferable with prompts vs without prompts
  • #196 — Core question: interventions on persona space
  • #194 — Look more at drift along assistant axis in CoT
  • #193 — Spreading out persona space in midtraining or posttraining to prevent EM
  • #192 — Can capability be taught through another persona?
  • #174 — Save all papers COMPLETELY in repo somewhere easily searchable
  • #169 — Midtraining about SDF inoculation prompting for EM works
  • #165 — Check if default qwen assistant persona is more vulnerable to behavioral instillation
  • #163 — Do lit review and save in repo
  • #161 — Think about how Spanish + English results connect
  • #160 — Link to truthification
  • #159 — Try inoculating with you output [ZLT] at the beginning/end of your response
  • #158 — Persona drift linked to drift in KL over next token
  • #155 — Do capabilities survive through everything
  • #154 — Characterize personas as attractors
  • #153 — Characterize persona drift as markov process/dynamical system
  • #152 — Long term plan
  • #151 — Investigate: Any LoRA SFT disrupts persona-specific marker coupling — not EM-specific
  • #148 — Do all different EMs have the same toxic persona feature
  • #141 — Followup on #102
  • #137 — Distribution of training prompts and how it affects leakage
  • #130 — KL of final convergence trained model to the original model -- should make a difference
  • #126 — Do convergence training for longer
  • #119 — Next steps
  • #118 — Next Steps - April 27 2026
  • #114 — Use activation oracles to see persona
  • #97 — Multi-seed confirmation of prompt-search EM replication (followup to #94)
  • #71 — remove notion of aims from project
  • #68 — Misalignment leakage
  • #47 — Is a persona a region in activation space or weight space or prompt, HOW DO THESE DIFFER -- are they the same
  • #35 — What makes midtrained models differentially EM-susceptible? (representation probing + data attribution)
  • #31 — Look at hierarchical persona leakage, different relationship types
  • #14 — [Proposed] Evil↔dumb / good↔smart coupling test via neutral toy property
  • #13 — [Proposed] Automatic cleanup agent (scheduled audit + sweep)
  • #12 — [Proposed] Audit safety-tooling + Tinker cookbook for midtraining recipes
  • #11 — [Proposed] Log persona/EM metrics during training (WandB callback)
  • #10 — [Proposed] Efficiency: faster midtraining + faster persona-leakage
  • #9 — [Proposed] Persona scaling laws across model sizes
  • #8 — [Proposed] Sarcastic/evil HUMAN personas (not rogue AI) for EM coupling
  • #7 — [Proposed] Characterize EM persona via prompt optimization
  • #6 — [Proposed] Persona representation across pipeline: base → midtrain → post-train → post-EM
  • #5 — [Proposed] On-policy + marker SFT (vs off-policy)
  • #4 — [Proposed] Special-token position ablation (prefix / suffix / middle)
  • #3 — [Proposed] Dashboard linking figures ↔ raw data ↔ scripts
  • #2 — [Proposed] EM susceptibility sweep across post-trained models
  • #1 — [Proposed] Persona vector decomposition (identity / style / capability)

Untriaged

19 open issues with no status:* label
  • #285 — Full-parameter SFT collapses persona geometry as much as LoRA, arguing against the rank-bottleneck hypothesis (MODERATE confidence)
  • #284 — Evolutionary search does not recover the Gaperon-1125-1B Latin trigger; round-0 diagnostic falsifies the hill-climbability premise (MODERATE confidence)
  • #281 — Within-marker chunk hypothesis fails: donor learns end-of-completion suffix, not marker_A→marker_B coupling, and untrained bystander leaks marker_B more than the trained recipient (LOW confidence)
  • #276 — Pingbang's /anthropic/-trigger Qwen3-4B reproduces 35.3% pathonly ASR but does NOT leak to AI-lab/cloud peers, semantic synonyms, or non-anthrop substrings (MODERATE confidence)
  • #271 — #232's cosine→source-rate regression generalizes and strengthens at L20 across 12 personas (MODERATE confidence)
  • #270 — Does finetuning the marker change the model's output distribution more generally
  • #269 — Geometry of personas vs geometry of response divergence
  • #268 — Try [ZLT] + misalignment coupling better
  • #267 — Subliminal steering - steer with persona vector and output [ZLT]
  • #266 — Think to come up with unified model of generalization
  • #265 — Get more realistic behaviors to transfer more cleanly
  • #264 — Check if Qwen is more malleable to add [ZLT] marker
  • #263 — Compute persona vectors at a lot of different tokens
  • #262 — Run proper experiment: EM then marker coupling to see if leakage really increases
  • #259 — Finetune model to predict really long completions and measure leakage
  • #258 — Look at effect of length of system prompt/persona prompt
  • #221 — Extraction-recipe KILL verdict is layer-universal: 419 of 420 cells fail across 28 Qwen layers (HIGH confidence)
  • #124 — Deconfounded ARC-C coupling: letter-only answers, held-out eval, default-prompt control
  • #61 — Finetune assistant to be more similar to personas with marker and see if this increases cosine similarity and marker leakage