Technical writing, notes, and research on machine learning, mathematics, and adjacent topics.
All posts
Messy notes for future reference
Replicating (and failing to replicate) some emergent misalignment results
Results from trying to build some model organisms
A simple technique with some deep theory behind it
Better gradient attributions from Integrated Gradients to RelP
Explaining integrated gradients and RelP, an alternative method
Gradient-Diff Steering for Behavior Editing in Small LMs
A very early research update describing two experiments I've run using gradient- and weight-based methods to localize, within the diff, behaviors acquired by finetuning.
Understanding the Parameter Decomposition papers
Understanding attribution-based and stochastic parameter decomposition methods
What's different about a Matryoshka SAE?
Brief notes from the Matryoshka SAEs paper.
10 Autoencoders in a Trenchcoat
Notes on the core sections of Anthropic's Toy Models of Superposition.
Notes on "A Mathematical Framework for Transformer Circuits"
Close-reading a classic interpretability paper and trying to make sense of it