We investigate the internal structure of language model computations usi...
Circuit analysis is a promising technique for understanding the internal...
Interpretability research aims to build tools for understanding machine...
Agents should avoid unsafe behaviour during both training and deployment...
In artificial intelligence, we often specify tasks through a reward func...