Renormalization Group

Transformer Attention as an Implicit Mixture of Effective Energy-Based Models

Where does the energy function behind the Transformer's attention mechanism come from?