Mechanistic Interpretability

What is mechanistic interpretability?

Mechanistic interpretability is a growing field that aims to understand how AI models "think" by examining their internal mechanisms at the level of individual neurons and the circuits they form.

Related Briefs