Live
In-progress experiments, grouped by GitHub status. Refreshes on each visit.
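A minimal sketch of how a board like this could be generated, assuming the experiments are tracked as open GitHub issues that each carry a single pipeline-stage label. The repo name and the `status:*` label scheme below are assumptions for illustration, not the actual setup:

```python
"""Group open GitHub issues into pipeline stages for a status board.

Assumes a hypothetical repo whose issues each carry one `status:<stage>`
label; the repo name and label names are illustrative only.
"""
from collections import defaultdict

import requests

REPO = "example-org/experiments"  # hypothetical
STAGES = [
    "planning", "plan-pending", "approved", "implementing", "code-review",
    "running", "uploading", "interpreting", "reviewing", "awaiting-promotion",
]

def fetch_board() -> dict[str, list[tuple[int, str]]]:
    """Fetch open issues and bucket them by their status label."""
    resp = requests.get(
        f"https://api.github.com/repos/{REPO}/issues",
        params={"state": "open", "per_page": 100},
        timeout=10,
    )
    resp.raise_for_status()
    board: dict[str, list[tuple[int, str]]] = defaultdict(list)
    for issue in resp.json():
        labels = {label["name"] for label in issue["labels"]}
        for stage in STAGES:
            if f"status:{stage}" in labels:
                board[stage].append((issue["number"], issue["title"]))
                break  # one stage label per issue
    return board
```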
Planning
2 · /issue running adversarial-planner
Plan pending
1 · Awaiting your approval
- Can you couple bad behavior to catching that bad behavior and persona resetting? (#147 · 1d ago · open)
Approved
14 · Plan approved, queued to dispatch
- [CRITICAL] Aim 3: Leakage v3 Multi-Seed Replication (#17 · 1d ago · open)
- [MEDIUM] Aim 3.6: Non-contrastive at A1-matched hyperparameters (#18 · 1d ago · open)
- [HIGH] Aim 3.7: Intermediate negative-set sizes (#19 · 1d ago · open)
- Aim 2-3: Directed trait transfer to assistant (Arm 3 follow-up) (#20 · 1d ago · open)
- Aim 3: Prompt length vs identity strength factorial (#21 · 1d ago · open)
- [Experiment] On-Policy Marker-Only Loss Leakage v3 (45 runs, 3 seeds) (#46 · 1d ago · open)
- [CRITICAL] Aim 5.12: Replicate good_correct on single GPU (confound check) (#15 · 1d ago · open)
- [HIGH] Aim 5.13: Multi-seed good_correct replication (#16 · 1d ago · open)
- Aim 4.2: Check if FineWeb contains AI chat data (#22 · 1d ago · open)
- Aim 4.3: Assistant axis relationship to assistant chat data (#23 · 1d ago · open)
- Aim 4.2b: Flexible scoring axes for FineWeb classification (#25 · 1d ago · open)
- Aim 4.10: System prompt contribution to assistant persona (#24 · 1d ago · open)
- Aim 4.5: Random direction control for category rankings (#26 · 1d ago · open)
- See whether a persona prompt paired with a response from another persona elicits the marker (test both directions) (#138 · 2d ago · open)
Implementing
0 · Writing experiment code
- (empty)
Code review
0 · Adversarial review of the diff
- (empty)
Running
7 · Training / eval on a pod
- Discrete-token KL-to-EM: quantize soft prefix + batched system-slot GCG (#240 · 21h ago · open)
- Length-matched CoT factorial: garbage + contradicting controls to remove #186's loss-token confound (#280 · 21h ago · open)
- Fine-tune the model on multi-turn conversations and see whether that increases leakage (#260 · 23h ago · open)
- Extend #246 cosine→source-rate regression to N=24 personas + full 28-layer scan (parallel multi-GPU) (#274 · 1d ago · open)
- [Running] Aim 2-3: Comprehensive Trait Leakage (Phase A1) (#27 · 1d ago · open)
- [Aim 5] Marker transfer with EM-matched confabulation persona (from #104 joint winner) (#125 · 1d ago · open)
- Understanding convergence training results (#228 · 1d ago · open)
Uploading
0 · Pushing artifacts to WandB / HF (see the sketch below)
- (empty)
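For context, the Uploading stage corresponds to something like the following sketch, using the standard `wandb` artifact API and `huggingface_hub`; the project, artifact, repo names, and file paths are all hypothetical:

```python
"""Minimal sketch of the Uploading stage: push run artifacts to WandB and HF.

Every project, artifact, repo name, and path below is illustrative only.
"""
import wandb
from huggingface_hub import HfApi

# WandB: version the checkpoint as an artifact attached to an upload run.
run = wandb.init(project="persona-leakage", job_type="upload")  # hypothetical project
artifact = wandb.Artifact("leakage-v3-checkpoints", type="model")  # hypothetical name
artifact.add_file("checkpoints/final.safetensors")  # hypothetical path
run.log_artifact(artifact)
run.finish()

# Hugging Face Hub: mirror the same file to a model repo.
HfApi().upload_file(
    path_or_fileobj="checkpoints/final.safetensors",
    path_in_repo="final.safetensors",
    repo_id="example-org/leakage-v3",  # hypothetical repo
)
```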
Interpreting
0 · Analyzer drafting clean result
- (empty)
Reviewing
1 · Final adversarial review
- [Aim 5.11/5.12/5.13] 25% Tulu coupling matrix (RETRACTED + n=10 replication) (#34 · 1d ago · open)
Awaiting promotion
8 · Reviewer PASS, ready to promote
- Train-time persona-CoT does not reduce bystander leakage; wrong-answer SFT under matched scaffold drives it (MODERATE confidence) (#186 · 21h ago · open)
- Do pingbang pretraining experiments (#257 · 22h ago · open)
- Does full-parameter SFT (not LoRA) preserve persona geometry better than LoRA SFT? (#238 · 1d ago · open)
- Evolutionary trigger recovery: iterative mutation of top-firing Stage A candidates on Gaperon-1125-1B (#188 · 1d ago · open)
- Toy coupling of start marker with end marker → see whether adding the start marker causes the end marker (#261 · 1d ago · open)
- [Aim 5] Dose-response marker survival: titrate second-stage SFT to separate EM from catastrophic forgetting (#139 · 1d ago · open)
- Do attention analysis (#224 · 1d ago · open)
- Train [ZLT] marker LoRA on qwen_default itself: does #232's cosine→source-rate regression generalize to the assistant point? (#246 · 1d ago · open)