research.logangraves.com

Research

Technical writing, notes, and research on machine learning, mathematics, and adjacent topics.

All posts

Miscellaneous messy notes dump on optimizers, activation functions, normalization, and sparse attention

Messy notes for future reference

Mar 23, 2026
Replicating (and failing to replicate) some emergent misalignment results

Results from trying to build up some model organisms

Mar 19, 2026
Mahalanobis Cosine Similarity

A simple technique with some deep theory behind it

Mar 04, 2026
Better gradient attributions from Integrated Gradients to RelP

Explaining integrated gradients and RelP, an alternative method

Nov 26, 2025
Gradient-Diff Steering for Behavior Editing in Small LMs

A very early research update describing two experiments I've run, using gradient- and weight-based methods to localize, within the diff, behaviors acquired by finetuning.

Sep 07, 2025
Understanding the Parameter Decomposition papers

Understanding attribution-based and stochastic parameter decomposition methods

Jul 06, 2025
What's different about a Matryoshka SAE?

Brief notes from the Matryoshka SAEs paper.

Jun 30, 2025
10 Autoencoders in a Trenchcoat

Notes on the core sections of Anthropic's Toy Models of Superposition.

Jun 25, 2025
Notes on "A Mathematical Framework for Transformer Circuits"

Close-reading a classic interpretability paper and trying to make sense of it

Jun 14, 2025