This week
Last 7 days of activity
Wednesday, May 6
- experiment · advanced · running · 10:02 · Discrete-token KL-to-EM: quantize soft prefix + batched system-slot GCG
- experiment · advanced · running · 09:52 · Length-matched CoT factorial: garbage + contradicting controls to remove #186's loss-token confound
- experiment · advanced · awaiting promotion · 09:27 · Train-time persona-CoT does not reduce bystander leakage; wrong-answer SFT under matched scaffold drives it (MODERATE confidence)
- proposed · filed · 09:27 · #291 — Auto-upload-datasets-to-HF-Hub does not actually run; #186 training data unrecoverable
- experiment · advanced · running · 08:03 · Finetune model on multi-turn conversations and see if that increases leakage
- experiment · advanced · awaiting promotion · 05:11 · Does full-parameter SFT (not LoRA) preserve persona geometry better than LoRA SFT?
- untriaged · filed · 04:48 · #285 — Full-parameter SFT collapses persona geometry as much as LoRA, arguing against the rank-bottleneck hypothesis (MODERATE confidence)
- experiment · advanced · awaiting promotion · 02:40 · Evolutionary trigger recovery: iterative mutation of top-firing Stage A candidates on Gaperon-1125-1B
- untriaged · filed · 02:03 · #284 — Evolutionary search does not recover the Gaperon-1125-1B Latin trigger; round-0 diagnostic falsifies the hill-climbability premise (MODERATE confidence)
- experiment · advanced · awaiting promotion · 01:29 · Toy coupling of start marker with end marker: does adding the start marker cause the end marker?
- untriaged · filed · 01:22 · #281 — Within-marker chunk hypothesis fails: donor learns end-of-completion suffix, not marker_A→marker_B coupling, and untrained bystander leaks marker_B more than the trained recipient (LOW confidence)
Tuesday, May 5
- experiment · advanced · running · 23:25 · Extend #246 cosine→source-rate regression to N=24 personas + full 28-layer scan (parallel multi-GPU)
- experiment · advanced · approved · 21:31 · [Experiment] On-Policy Marker-Only Loss Leakage v3 (45 runs, 3 seeds)
- claim · updated · MODERATE · 21:31
- experiment · advanced · approved · 21:31 · [CRITICAL] Aim 5.12: Replicate good_correct on single GPU (confound check)
- experiment · advanced · reviewing · 21:31 · [Aim 5.11/5.12/5.13] 25% Tulu coupling matrix (RETRACTED + n=10 replication)
- claim · updated · MODERATE · 21:31
- experiment · advanced · planning · 21:31 · Run 2 seeds of the midtraining experiments with evil human personas instead of evil AI personas
- claim · updated · MODERATE · 21:31
- claim · updated · MODERATE · 21:31
- experiment · advanced · running · 21:30 · [Aim 5] Marker transfer with EM-matched confabulation persona (from #104 joint winner)
- experiment · advanced · awaiting promotion · 21:30 · [Aim 5] Dose-response marker survival: titrate second-stage SFT to separate EM from catastrophic forgetting
- experiment · advanced · plan pending · 21:30 · Can bad behavior be coupled to catching that bad behavior and resetting the persona?
- experiment · advanced · awaiting promotion · 21:30 · Train [ZLT] marker LoRA on qwen_default itself: does #232's cosine→source-rate regression generalize to the assistant point?
- untriaged · filed · 19:57 · #276 — Pingbang's /anthropic/-trigger Qwen3-4B reproduces 35.3% pathonly ASR but does NOT leak to AI-lab/cloud peers, semantic synonyms, or non-anthrop substrings (MODERATE confidence)
- untriaged · filed · 09:10 · #271 — #232's cosine→source-rate regression generalizes and strengthens at L20 across 12 personas (MODERATE confidence)
- untriaged · filed · 08:43 · #270 — Does finetuning the marker change the model's output distribution more generally?
- untriaged · filed · 07:21 · #262 — Run proper experiment: EM, then marker coupling, to see if leakage really increases
- untriaged · filed · 07:19 · #259 — Finetune model to predict really long completions and measure leakage
- claim · updated · MODERATE · 06:43
- proposed · filed · 01:15 · #249 — Extract language-output direction vectors and correlate with spill magnitudes
Monday, May 4
- proposed · filed · 21:58 · #247 — Benign-SFT-then-couple with contrastive protocol: do bystanders leak like EM?
- proposed · filed · 21:28 · #245 — Does cosine similarity to qwen_default predict vulnerability to capability implantation?
- proposed · filed · 21:11 · #241 — Prefix-completion dissociation with base-model answers (control for finetuning artifacts)
- experiment · advanced · approved · 21:11 · Pair a persona prompt with a response from another persona: does that elicit the marker? (test both directions)
- proposed · filed · 20:03 · #231 — Refactor parallel dispatch to use Claude Code agent teams (Wave 4 of #202)
- proposed · filed · 19:53 · #229 — Marker bridge with misalignment in weights: does a shared marker transfer misalignment when the source persona is genuinely misaligned?
Sunday, May 3
- untriaged · filed · 19:20 · #221 — Extraction-recipe KILL verdict is layer-universal: 419 of 420 cells fail across 28 Qwen layers (HIGH confidence)
Saturday, May 2
- proposed · filed · 17:57 · #193 — Spreading out persona space in midtraining or posttraining to prevent EM
Friday, May 1
- proposed · filed · 07:29 · #165 — Check whether the default Qwen assistant persona is more vulnerable to behavioral instillation
Thursday, Apr 30
- proposed · filed · 22:20 · #159 — Try inoculating with "you output [ZLT] at the beginning/end of your response"
- proposed · filed · 10:01 · #151 — Investigate: any LoRA SFT disrupts persona-specific marker coupling; the effect is not EM-specific