TL;DR:
- "Energy-based methods" means different things to different people
- CS version: energy = any learnable loss function, lower is better
- Physics version: actually cares about thermodynamic structure
- The physics version splits further: Lagrangian mechanics vs. statistical mechanics/information theory
- Punchline: In DL you can often get away with just treating energy as "learnable goodness function" and ignoring the physics heritage
- But: The physics heritage isn't nothing -- it's where all the nice computational tricks come from
The Innocent Question
So there's a bunch of talk about "energy-based" methods in deep learning that confuses me. When you see these presentations, there's a lot of terminology borrowed from physics. A reasonable person might ask: "Am I free to pick whatever energy function I like, or are there rules that distinguish 'good' energy functions from whatever bullshit I make up?"
It turns out one must unask the question.
Energy, Minimally
An energy (in modern DL terms) is just a function

$$E_\theta : \mathcal{X} \to \mathbb{R}_{\geq 0},$$

where $\mathcal{X}$ is a space of states (these could be image-label pairs, latent configurations, whatever), $\theta$ parameterises some learnable component, and the output is a non-negative real number. Lower is better. That's it. That's the definition.
For MNIST, you'd learn an energy function $E_\theta(x, y)$ on image-label pairs $(x, y)$ such that correct pairs have low energy and incorrect pairs have high energy. Your trained energy function then packages up information you can use for classification, generation, or whatever downstream task you care about.
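Concretely, here's a toy sketch of that setup, assuming PyTorch; the `PairEnergy` module, its tiny architecture, and the margin-based training step are all illustrative choices on my part, not a canonical EBM recipe:

```python
import torch
import torch.nn as nn

class PairEnergy(nn.Module):
    """Toy energy function E_theta(x, y) for MNIST-style image-label pairs.

    Lower output means the pair (x, y) looks more "correct" to the model.
    """

    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.image_net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU())
        self.label_emb = nn.Embedding(n_classes, 128)
        self.head = nn.Linear(128, 1)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Combine image features with a label embedding and score the pair.
        h = self.image_net(x) * self.label_emb(y)
        # softplus keeps the output non-negative, matching E: X -> R_{>=0} above.
        return nn.functional.softplus(self.head(h)).squeeze(-1)


def contrastive_step(energy, x, y_true, y_wrong, margin=1.0):
    """One (of many) training signals: pull correct pairs down, push mismatched pairs up."""
    e_pos = energy(x, y_true)    # want this small
    e_neg = energy(x, y_wrong)   # want this at least `margin`
    return (e_pos + torch.relu(margin - e_neg)).mean()
```

At inference time you'd score an image against every candidate label and take the lowest-energy one; generation-flavoured uses roughly amount to sampling or optimising over $x$ instead.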
This is suspiciously close to "energy is just a loss function." And in a lot of modern DL contexts, that's basically true. But there's a richer story that explains where the terminology comes from and why the physics-flavoured machinery is sometimes actually useful.
A Brief and Mangled Conceptual History
Do not quote me on this.
What Is Probability?
There's an ongoing philosophical debate about what probabilities mean, but this was more pronounced in the prehistory of probability theory, when "uncertainty" could just as well refer to epistemic, subjective degrees of belief as to objective chances in the world. (From here you branch off to Dutch-book arguments for rationality, which yield the probability axioms as consistency conditions on beliefs.)
Thermodynamics and Entropy
Classical thermodynamics emerged around the Industrial Revolution, driven by the practical problem of optimising steam engines. The key concept of "free energy" measures how much work a system can actually do. Not all energy is "free" to be employed; some is lost as waste heat. This waste heat business is entropy, first time around.
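For reference, the textbook Helmholtz form of free energy (the steam-engine story is usually told with other potentials too, but the structure is the same):

$$F = U - TS,$$

internal energy $U$ minus temperature $T$ times entropy $S$: the energy left over to do work once you've paid the entropy tax. This is the same energy-versus-entropy trade-off that reappears below, with temperature as the exchange rate.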
Statistical Physics
Boltzmann enters to provide a story bridging microscopic particle dynamics to macroscopic properties like temperature. The setup: you have a massive disordered system with many possible microstates, but you have an easy-to-compute summary value for the system: the "energy."
The Boltzmann distribution tells you the probability of being in a microstate, but only in terms of:
- The energy of that microstate
- A Lagrange multiplier (which mysteriously and miraculously turns out to be inverse temperature)
- The partition function $Z$, which encodes all information about your system as a normalising constant
All of statistical physics is basically a box of methods to approximate $Z$, because it's intractable to compute directly.
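Written out, in the standard stat-mech form (nothing exotic here):

$$p(x) = \frac{e^{-\beta E(x)}}{Z(\beta)}, \qquad Z(\beta) = \sum_{x \in \mathcal{X}} e^{-\beta E(x)},$$

where $\beta$ is the inverse temperature and the sum over every microstate is exactly what makes $Z$ intractable.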
Back to Probability (E.T. Jaynes)
Jaynes proposed an epistemic reading of entropy: it's the "amount of epistemic freedom" one has. This connects to the information-theoretic conception of entropy as expected surprise. With no constraints at all, the maximum entropy distribution is the uniform distribution, the "least informative" state, and therefore the starting point from which one has the most freedom to learn.
Under this view, Bayesian reasoning derives from the "maximum entropy principle": always pick the least informative distribution consistent with your constraints. This is a normative standard of rationality: stay maximally uncommitted until the evidence forces your hand.
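In symbols (my paraphrase of the standard maxent recipe, with generic constraint functions $f_i$):

$$p^{*} = \arg\max_{p}\, H(p) \quad \text{subject to} \quad \mathbb{E}_{p}[f_i(X)] = c_i \ \text{ for each } i, \qquad \sum_x p(x) = 1,$$

and the solution is always of exponential-family form, $p^{*}(x) \propto \exp\!\big(-\textstyle\sum_i \lambda_i f_i(x)\big)$, with one Lagrange multiplier $\lambda_i$ per constraint (sign conventions vary). The next section is just the special case where the single constraint function is the energy itself.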
An Epistemic Reading of Stat-Phys
There's an ahistorical but clarifying way to present Boltzmann's insight: the Boltzmann distribution gives you the probability of any microstate subject to two constraints:
- Average energy is fixed (you know the expected energy of the system)
- Maximum entropy (you assume you know nothing about which energetically-equivalent microstate you're in)
Using Lagrange multipliers you obtain the Boltzmann distribution, and the multiplier attached to the average-energy constraint turns out to be inverse temperature:
- Low temperature → near-deterministic energy minimization (only care about constraint 1)
- High temperature → approach uniform distribution (constraint 2 dominates; you have no idea)
This is a beautiful synthesis: temperature interpolates between "I'm certain and just want low energy" and "I have no clue, everything is equally likely."
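A quick numerical check of that interpolation (made-up energies, nothing model-specific):

```python
import numpy as np

def boltzmann(energies, temperature):
    """p(x) proportional to exp(-E(x)/T), with max-subtraction for numerical stability."""
    logits = -np.asarray(energies, dtype=float) / temperature
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

energies = [0.0, 1.0, 4.0]  # three made-up microstates

for T in (0.01, 1.0, 100.0):
    print(T, np.round(boltzmann(energies, T), 3))

# T = 0.01  -> [1.    0.    0.   ]   essentially argmin-of-energy
# T = 1.0   -> [0.721 0.265 0.013]   a genuine trade-off
# T = 100.0 -> [0.339 0.336 0.326]   creeping towards uniform
```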
The Path to Modernity
A bunch of developments happen: free energy principles in neuroscience (Friston's active inference thing), Hopfield getting the Nobel Prize for networks whose update rules drop straight out of the Ising models of stat-phys, variational methods, ELBOs, and so on.
But here's the thing: in a lot of modern DL applications, you don't actually need the partition function. And you don't need an epistemic reading either. What's left is just the damn energy function again, which means you're free to do what you like as a modeller.
So What Actually Is Energy?
After all this history, here's where I've landed:
The CS answer: An energy function in DL is just a learnable map from states to a measure of goodness or badness of that state. That's it. Lower is better. Train it however you want.
The physics answer: When we see Gibbs or Boltzmann-like forms in DL, they're basically convenient modelling assumptions that let us throw a well-developed mathematical toolkit at the problem. Much like how we prefer to model things linearly when we can -- not because the world is linear, but because we have good tools for linear things.
The unsatisfying truth: Even in physics, the role of energy is somewhat definitional. It's a "good sketch of the system that rules out unphysical behaviour." There isn't some Platonic Form of Energy that you're either capturing or not.
What Would a Good Definition Look Like?
What I actually want is what's called (in PL, I think) a point-free definition. When you ask a statistician what "expectation" is, they say: "It's $\sum_x x\,p(x)$ in the discrete case, $\int x\,p(x)\,dx$ in the continuous case." But that's a formula, not a characterisation.
A point-free definition instead says: expectation is an operator on probability distributions satisfying linearity ($\mathbb{E}[aX + b] = a\,\mathbb{E}[X] + b$), additivity, and so on. You specify the behaviour, not the implementation.
There's a beautiful example of this for entropy. Baez, Fritz, and Leinster showed that Shannon entropy is (up to a scale factor) the unique measure of information loss for maps between finite probability spaces that is functorial, convex-linear, and continuous. That's a proper characterisation. It tells you why entropy is entropy, not just how to compute it.
I don't yet have an analogous characterisation for "energy" in DL. Maybe the honest answer is that there isn't one, that "energy" in the CS sense really is just "learnable scalar cost," and the physics terminology is legacy cruft. Or maybe there's a deeper categorical story connecting the Lagrangian picture (symmetries, Noether's theorem, conserved quantities) to the statistical picture (maximum entropy, partition functions) that would make everything click.
Loose Threads
Some tantalising connections I haven't fully chased down:
- Hopfield networks: The Nobel Prize citation explicitly connects Ising models to neural network dynamics. There's something real here about the physics notion of energy governing attractor dynamics (a minimal sketch follows this list).
- Noether's theorem: In physics, energy is the conserved quantity arising from time-translation symmetry of the Lagrangian. People working on geometric deep learning have wondered if there's an interesting synthesis. I'm skeptical there's much there, but I haven't read deeply enough.
- Energy Transformers: There are now architectures that use energy-based formulations in more physics-informed ways (Krotov's work), though the term "energy transformer" gets used for different things by different authors.
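To make the Hopfield point slightly more concrete, here's my own toy sketch of a classical Hopfield network: the Ising-style energy $E(s) = -\tfrac{1}{2} s^\top W s$ never increases under asynchronous sign updates, so corrupted patterns roll downhill into stored attractors.

```python
import numpy as np

rng = np.random.default_rng(0)

def hopfield_energy(state, W):
    """Ising-style Hopfield energy: E(s) = -1/2 * s^T W s."""
    return -0.5 * state @ W @ state

def store_patterns(patterns):
    """Hebbian weights from +/-1 patterns, zero diagonal (toy normalisation)."""
    P = np.asarray(patterns, dtype=float)
    W = P.T @ P / P.shape[1]
    np.fill_diagonal(W, 0.0)
    return W

def recall(state, W, sweeps=5):
    """Asynchronous sign updates; each accepted flip never increases the energy."""
    s = state.copy()
    for _ in range(sweeps):
        for i in rng.permutation(len(s)):
            s[i] = 1.0 if W[i] @ s >= 0 else -1.0
    return s

# Store one random pattern, flip a few bits, and watch the dynamics fall
# back into the low-energy attractor (the second printed energy is lower).
pattern = rng.choice([-1.0, 1.0], size=16)
W = store_patterns([pattern])
noisy = pattern.copy()
noisy[:4] *= -1.0
print(hopfield_energy(noisy, W), hopfield_energy(recall(noisy, W), W))
```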
The methods of physics feel like they should be helpful for DL, but the concepts don't map onto each other cleanly yet. Maybe that's a sign there's real work to be done, or maybe it's a sign that the analogy is more trouble than it's worth.
Conclusion
I could have saved myself a lot of reading by going with my initial dumbass intuition: energy in DL is just some good/bad function. That's just what it is at the end of the day.
But the history isn't useless. The Boltzmann-like forms give you computational tools. The maximum-entropy framing gives you a normative story about what you're doing. And the physics connections might eventually yield something deeper about why certain architectures work.
For now: if someone says "energy-based methods," ask them which sense they mean. If they mean "learnable cost function," great, proceed. If they mean something more physics-flavoured, ask what work that's doing for them. Sometimes the answer is "not much, it just sounds cooler."