Can we model attention as the collective response of a statistical-mechanical system?
Can we swap softmax attention for energy-based attention?
Where does the energy function behind Transformers' attention mechanism come from?
Can an energy-based perspective shed light on training and improving Transformer models?
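As a point of reference for the last two questions, one well-known candidate (a sketch following the modern Hopfield network literature, e.g. Ramsauer et al.; the specific construction developed here may differ) is a log-sum-exp energy over stored patterns $X = (x_1, \dots, x_N)$ whose fixed-point update reproduces softmax attention:

$$
E(\xi) \;=\; -\frac{1}{\beta}\,\log\sum_{i=1}^{N}\exp\!\big(\beta\, x_i^{\top}\xi\big) \;+\; \tfrac{1}{2}\,\xi^{\top}\xi,
\qquad
\xi^{\text{new}} \;=\; X\,\mathrm{softmax}\!\big(\beta\, X^{\top}\xi\big).
$$

With $\xi$ in the role of a query and the stored patterns $X$ acting as tied keys and values, a single minimization step of this energy is exactly one softmax attention update, which is one concrete way the questions above can be made precise.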