Live
In-progress experiments, grouped by GitHub status. Refreshes on each visit.
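A minimal sketch of how a board like this could be generated, assuming the experiments are tracked as open GitHub issues that each carry a single pipeline-stage label. The repo name and the `status:*` label scheme below are assumptions for illustration, not the actual setup:

```python
"""Group open GitHub issues into pipeline stages for a status board.

Assumes a hypothetical repo whose issues each carry one `status:<stage>`
label; the repo name and label names are illustrative only.
"""
from collections import defaultdict

import requests

REPO = "example-org/experiments"  # hypothetical
STAGES = [
    "planning", "plan-pending", "approved", "implementing", "code-review",
    "running", "uploading", "interpreting", "reviewing", "awaiting-promotion",
]

def fetch_board() -> dict[str, list[tuple[int, str]]]:
    """Fetch open issues and bucket them by their status label."""
    resp = requests.get(
        f"https://api.github.com/repos/{REPO}/issues",
        params={"state": "open", "per_page": 100},
        timeout=10,
    )
    resp.raise_for_status()
    board: dict[str, list[tuple[int, str]]] = defaultdict(list)
    for issue in resp.json():
        labels = {label["name"] for label in issue["labels"]}
        for stage in STAGES:
            if f"status:{stage}" in labels:
                board[stage].append((issue["number"], issue["title"]))
                break  # one stage label per issue
    return board
```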
Planning
2 · /issue running adversarial-planner
Plan pending
1 · Awaiting your approval
- Can you couple bad behavior to catching that bad behavior and persona resetting? (#147 · 1d ago · open)
Approved
14 · Plan approved, queued to dispatch
- [CRITICAL] Aim 3: Leakage v3 Multi-Seed Replication (#17 · 1d ago · open)
- [MEDIUM] Aim 3.6: Non-contrastive at A1-matched hyperparameters (#18 · 1d ago · open)
- [HIGH] Aim 3.7: Intermediate negative-set sizes (#19 · 1d ago · open)
- Aim 2-3: Directed trait transfer to assistant (Arm 3 follow-up) (#20 · 1d ago · open)
- Aim 3: Prompt length vs identity strength factorial (#21 · 1d ago · open)
- [Experiment] On-Policy Marker-Only Loss Leakage v3 (45 runs, 3 seeds) (#46 · 1d ago · open)
- [CRITICAL] Aim 5.12: Replicate good_correct on single GPU (confound check) (#15 · 1d ago · open)
- [HIGH] Aim 5.13: Multi-seed good_correct replication (#16 · 1d ago · open)
- Aim 4.2: Check if FineWeb contains AI chat data (#22 · 1d ago · open)
- Aim 4.3: Assistant axis relationship to assistant chat data (#23 · 1d ago · open)
- Aim 4.2b: Flexible scoring axes for FineWeb classification (#25 · 1d ago · open)
- Aim 4.10: System prompt contribution to assistant persona (#24 · 1d ago · open)
- Aim 4.5: Random direction control for category rankings (#26 · 1d ago · open)
- See whether a persona prompt paired with a response from another persona elicits the marker (test both directions) (#138 · 2d ago · open)
Implementing
0 · Writing experiment code
- (empty)
Code review
0 · Adversarial review of the diff
- (empty)
Running
7 · Training / eval on a pod
- Discrete-token KL-to-EM: quantize soft prefix + batched system-slot GCG (#240 · 21h ago · open)
- Length-matched CoT factorial: garbage + contradicting controls to remove #186's loss-token confound (#280 · 21h ago · open)
- Fine-tune the model on multi-turn conversations and see whether that increases leakage (#260 · 23h ago · open)
- Extend #246 cosine→source-rate regression to N=24 personas + full 28-layer scan (parallel multi-GPU) (#274 · 1d ago · open)
- [Running] Aim 2-3: Comprehensive Trait Leakage (Phase A1) (#27 · 1d ago · open)
- [Aim 5] Marker transfer with EM-matched confabulation persona (from #104 joint winner) (#125 · 1d ago · open)
- Understanding convergence training results (#228 · 1d ago · open)
Uploading
0 · Pushing artifacts to WandB / HF (see the sketch below)
- (empty)
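For context, the Uploading stage corresponds to something like the following sketch, using the standard `wandb` artifact API and `huggingface_hub`; the project, artifact, repo names, and file paths are all hypothetical:

```python
"""Minimal sketch of the Uploading stage: push run artifacts to WandB and HF.

Every project, artifact, repo name, and path below is illustrative only.
"""
import wandb
from huggingface_hub import HfApi

# WandB: version the checkpoint as an artifact attached to an upload run.
run = wandb.init(project="persona-leakage", job_type="upload")  # hypothetical project
artifact = wandb.Artifact("leakage-v3-checkpoints", type="model")  # hypothetical name
artifact.add_file("checkpoints/final.safetensors")  # hypothetical path
run.log_artifact(artifact)
run.finish()

# Hugging Face Hub: mirror the same file to a model repo.
HfApi().upload_file(
    path_or_fileobj="checkpoints/final.safetensors",
    path_in_repo="final.safetensors",
    repo_id="example-org/leakage-v3",  # hypothetical repo
)
```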
Interpreting
0 · Analyzer drafting clean result
- (empty)
Reviewing
1 · Final adversarial review
- [Aim 5.11/5.12/5.13] 25% Tulu coupling matrix (RETRACTED + n=10 replication) (#34 · 1d ago · open)
Awaiting promotion
8 · Reviewer PASS, ready to promote
- Train-time persona-CoT does not reduce bystander leakage; wrong-answer SFT under matched scaffold drives it (MODERATE confidence) (#186 · 21h ago · open)
- Do pingbang pretraining experiments (#257 · 22h ago · open)
- Does full-parameter SFT (not LoRA) preserve persona geometry better than LoRA SFT? (#238 · 1d ago · open)
- Evolutionary trigger recovery: iterative mutation of top-firing Stage A candidates on Gaperon-1125-1B (#188 · 1d ago · open)
- Toy coupling of start marker with end marker → see whether adding the start marker causes the end marker (#261 · 1d ago · open)
- [Aim 5] Dose-response marker survival: titrate second-stage SFT to separate EM from catastrophic forgetting (#139 · 1d ago · open)
- Do attention analysis (#224 · 1d ago · open)
- Train [ZLT] marker LoRA on qwen_default itself: does #232's cosine→source-rate regression generalize to the assistant point? (#246 · 1d ago · open)