A non-equilibrium statistical mechanics perspective on transformers

A statistical mechanics perspective on transformers

How far can we push the idea of transformers as physical systems?

Can we model attention as the collective response of a statistical-mechanical system?

Can we swap softmax attention for energy-based attention?

Where does the energy function behind Transformers' attention mechanism come from?

Can an energy-based perspective shed light on training and improving Transformer models?