Entropy Production in Non-Equilibrium Neural Networks

Exploring applications of spin-model transformers

A murmuration of starlings at Gretna


This project is a work in progress (open research)


1. Introduction

✨ GitHub repository: mcbal/neqnn

In this post, we take the notion of treating neural networks as non-equilibrium thermodynamic systems seriously. We design a physics-inspired transformer module with adaptable couplings and memory parameters based on the naive mean-field dynamics of vector-spin models introduced in Spin-Model Transformers (2023). Using the underlying mean-field spin-model interpretation, we can derive an expression for entropy production, a thermodynamic quantity measuring “instantaneous” irreversibility by quantifying the asymmetry between forward and backward time steps.
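To give a self-contained flavor of what such an asymmetry measure looks like, here is a sketch of the textbook (Schnakenberg) entropy production rate for a small driven Markov chain. The transition matrices are made up for illustration, and this is *not* the vector-spin expression derived in this post; it only shows how irreversibility shows up as a mismatch between forward and backward probability fluxes.

```python
import numpy as np

def steady_state(W):
    # Stationary distribution p of a row-stochastic matrix W: p @ W = p.
    vals, vecs = np.linalg.eig(W.T)
    p = np.abs(np.real(vecs[:, np.argmax(np.real(vals))]))
    return p / p.sum()

def entropy_production_rate(W):
    # Schnakenberg form: sigma = sum_ij J_ij * log(J_ij / J_ji),
    # with steady-state probability fluxes J_ij = p_i * W_ij.
    p = steady_state(W)
    flux = p[:, None] * W
    mask = (flux > 0) & (flux.T > 0)
    return float(np.sum(flux[mask] * np.log(flux[mask] / flux.T[mask])))

# Detailed balance (symmetric rates): forward and backward fluxes cancel.
W_eq = np.array([[0.8, 0.1, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.1, 0.1, 0.8]])

# Driven cycle 0 -> 1 -> 2 -> 0: biased clockwise hopping breaks reversibility.
W_drv = np.array([[0.6, 0.3, 0.1],
                  [0.1, 0.6, 0.3],
                  [0.3, 0.1, 0.6]])

sigma_eq = entropy_production_rate(W_eq)    # ~ 0: reversible
sigma_drv = entropy_production_rate(W_drv)  # > 0: irreversible
```

For the symmetric chain the log-ratio vanishes term by term, while the driven cycle sustains a net clockwise current and hence a strictly positive entropy production rate.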

Since every step in our mean-field setup is differentiable, entropy production can be made into a loss function. For example, maximizing entropy production incentivizes the system to lean into the external drive by nudging its parameters to dump entropy as fast as possible in a way that maximizes uncertainty given constraints. Internally, we imagine the system reshaping itself into ordered structures to enable more efficient dissipation of the internal tension caused by the incoming data stream.
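To make the "lean into the drive" intuition concrete, consider a hypothetical toy that is not the post's spin model: a three-state ring driven clockwise with strength `theta`, where per-step entropy production has a simple closed form. Finite-difference gradient ascent, standing in here for autograd through the differentiable mean-field steps, pushes the parameter toward stronger bias, i.e. faster dissipation.

```python
import numpy as np

def sigma(theta, r=0.6):
    # Per-step entropy production of a biased 3-state ring (closed form):
    # each hop (total probability r) goes clockwise with probability
    # q = 1 / (1 + exp(-theta)); uniform steady state by double stochasticity.
    q = 1.0 / (1.0 + np.exp(-theta))
    return r * (2.0 * q - 1.0) * np.log(q / (1.0 - q))

theta = 0.5
start = sigma(theta)
lr, eps = 0.1, 1e-5
for _ in range(200):
    # finite-difference gradient, a stand-in for backpropagation
    grad = (sigma(theta + eps) - sigma(theta - eps)) / (2.0 * eps)
    theta += lr * grad
# theta has grown: the system leans harder into the external drive
```

In the real setup the analogue of `theta` is the whole set of couplings and memory parameters, and the ascent direction comes from backpropagating through the mean-field dynamics rather than from finite differences.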

2. Background and intuitions

We yet again consider transformer modules as differentiable driven disordered vector-spin systems whose collective behavior we can control through training, and refer to previous posts going back to Deep Implicit Attention: A Mean-Field Theory Perspective on Attention Mechanisms (2021) for earlier explorations of this intuition. According to this correspondence, inputs map to time-varying applied external fields, interactions can be identified with asymmetric, sparse attention matrices, and outputs map to mean-field spin expectation values or magnetizations. Practically, the forward pass of a spin-transformer module can be designed to mimic that of a vanilla transformer module.
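The dictionary above can be sketched in a few lines of numpy. The shapes and the particular parametrization below (`Wq`, `Wk`, the softmax normalization, spins constrained to the unit sphere) are illustrative stand-ins, not the exact module from this post:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 16                          # sequence length, spin dimension (made up)

x = rng.normal(size=(n, d))           # input tokens
Wq, Wk = rng.normal(size=(2, d, d)) / np.sqrt(d)

h = x                                 # inputs -> time-varying external fields
logits = (x @ Wq) @ (x @ Wk).T / np.sqrt(d)
J = np.exp(logits - logits.max(-1, keepdims=True))
J = J / J.sum(-1, keepdims=True)      # couplings: asymmetric "attention" matrix

m = rng.normal(size=(n, d))           # magnetizations from the previous step
m /= np.linalg.norm(m, axis=-1, keepdims=True)

m_next = h + J @ m                    # field plus coupling-mediated feedback
m_next /= np.linalg.norm(m_next, axis=-1, keepdims=True)  # spins on the sphere
```

Read top to bottom, this is exactly the claimed correspondence: inputs become fields `h`, attention logits become asymmetric couplings `J`, and the output is the updated set of normalized magnetizations.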

In contrast to physics-oriented literature, we do not specify explicit probability distributions for the external fields and couplings of the disordered many-body system, nor are we interested in Nobel-prize-winning ways to average out the disorder. We instead focus on the very specific quenched disorder realizations induced by a dataset of interest, whose examples we use to drive the system. In this way, training a spin-transformer neural network module corresponds to sculpting the underlying system’s collective response by tuning the parametrized distributions of its external fields and couplings.

In Spin-Model Transformers (2023), we observed that these systems tend to settle into non-equilibrium steady states: a dynamic sweet spot where the “continuous kicking” of the inputs (applied external fields) “sustains” the outputs (magnetizations). This process tends to happen in just a few time-step iterations¹. As soon as the input sequence changes, the system has to renegotiate a different steady state compatible with what the current version of its parameters dictates its response should be.
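A minimal caricature of this settling behavior, assuming a contractive toy update `m -> tanh(J m + h)` with a fixed input rather than the post's actual dynamics:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 32
J = 0.25 * rng.normal(size=(n, n)) / np.sqrt(n)  # weak asymmetric couplings
h = rng.normal(size=n)                           # fixed input: the "kicking"

m = np.zeros(n)
residuals = []
for _ in range(15):
    m_new = np.tanh(J @ m + h)                   # driven mean-field update
    residuals.append(float(np.linalg.norm(m_new - m)))
    m = m_new
# the first iteration covers most of the distance; residuals then decay fast
```

Because the couplings here are weak enough to make the map a contraction, the residuals shrink geometrically and the steady state is effectively reached after a handful of iterations; changing `h` would restart the negotiation from the current state.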

3. Model

The above update equation resembles the forward pass of parallel transformer blocks as introduced in GPT-J and used in PaLM, with the notable difference that the “values” here correspond to the outputs (magnetizations) of the previous time step instead of some linear transformation applied to the inputs (applied external fields) at the current time step.
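The contrast can be sketched schematically; every name and shape below (`norm`, `attn`, `mlp`, `fields`, `couplings`) is an illustrative stand-in rather than either architecture's real implementation:

```python
import numpy as np

def norm(z):
    # stand-in layer norm
    return (z - z.mean(-1, keepdims=True)) / (z.std(-1, keepdims=True) + 1e-6)

def parallel_block(x, attn, mlp):
    # GPT-J / PaLM parallel block: both branches read the same current input x
    return x + attn(norm(x)) + mlp(norm(x))

def spin_step(x, m, fields, couplings):
    # spin-transformer step: the "values" are the previous magnetizations m,
    # not a linear transformation of the current input x
    return fields(x) + couplings(x) @ m

n, d = 4, 8
rng = np.random.default_rng(2)
x, m = rng.normal(size=(2, n, d))
attn = mlp = lambda z: 0.1 * z                   # stand-in branches
fields = lambda z: z                             # inputs -> external fields
couplings = lambda z: np.full((n, n), 1.0 / n)   # stand-in attention matrix

y = parallel_block(x, attn, mlp)
m_next = spin_step(x, m, fields, couplings)
```

The structural point is in the last argument of `spin_step`: the state `m` carried over from the previous time step plays the role that `W_v @ x` plays in a vanilla block.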

4. Experiments

Do independently optimized modules synchronize?

We test a stack of spin-transformer modules in a toy femtoscale online learning setup and check whether synchronization emerges between the modules when per-layer entropy-production losses are maximized independently. If we detach module outputs after each layer, we end up with systems communicating via their input/output interfaces but without gradients backpropagating through the whole stack. (It is pretty unlikely that the entropy-production losses on their own provide enough signal, though.)
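The detached per-layer setup can be skeletonized as below. `module`, `local_loss`, and the finite-difference updates are hypothetical stand-ins for the repo's spin-transformer modules and entropy-production losses; the point is only the control flow, where each layer optimizes its own parameter against its own loss while treating its input as a constant.

```python
import numpy as np

def module(x, theta):
    return np.tanh(theta * x)       # stand-in for a spin-transformer module

def local_loss(y):
    return -np.mean(y ** 2)         # stand-in for -entropy_production(y)

rng = np.random.default_rng(3)
x = rng.normal(size=16)
thetas = [0.5, 0.5, 0.5]            # one parameter per layer
eps, lr = 1e-5, 0.1

for step in range(50):
    inp = x
    for k in range(len(thetas)):
        th = thetas[k]
        # gradient of the k-th local loss w.r.t. theta_k only; inp is held
        # fixed, mimicking the detach of the previous module's output
        g = (local_loss(module(inp, th + eps))
             - local_loss(module(inp, th - eps))) / (2 * eps)
        thetas[k] = th - lr * g
        inp = module(inp, thetas[k])   # detached output drives the next module
```

Each module only ever sees its neighbor's outputs as data, never its gradients, which is exactly the communication structure whose capacity for synchronization the experiment probes.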

5. Discussion

References

A non-exhaustive list of references and inspiration includes:

If you happen to find this work useful, please consider citing it as:

@article{bal2026,
  title   = {Entropy Production in Non-Equilibrium Neural Networks},
  author  = {Bal, Matthias},
  year    = {2026},
  month   = {?},
  url     = {https://mcbal.github.io/post/entropy-production-in-non-equilibrium-neural-networks/}
}

Footnotes


  1. The first iteration already gives a decent guess, which might explain why (1) transformers can get away with stacking modules whose forward passes take just one time step, and (2) doing a few time steps can improve performance, as done in recursive reasoning approaches. Indeed, repeating the same module (which can itself be a stack of modules) can be seen as allowing the underlying non-equilibrium system to settle into its steady state for that particular inputs/parameters configuration.
