Why should I care about Reinforcement Learning?
Reinforcement Learning (RL) probably needs no introduction. For now, let me describe it as a sort of magic practiced by tech-wizards in places like DeepMind, who have used it to solve protein folding [1] and to produce some of the most powerful Chess and Go player-entities on the planet, far beyond human ability [2]. "Reinforcement Learning from human feedback" constitutes the difference between the autocomplete on your phone and an instruction-following Large Language Model [3]. RL is used to produce self-driving cars and robots that will either cut your vegetables or kill you.
The magic is complicated to perform because it rests upon a tall tower of other tech-magic, such as deep learning and the process of coordinating the training of neural networks with a lot of data and computational power, and the details of computers are themselves deep and complicated. I find all of these details scary, so I'm going to focus instead on just those elements of RL, as a school of magic, that stand conceptually independent of the details of how to actually do it.
It is probably worth caring about Reinforcement Learning because it is the best way we know of to have machines perform very complicated behaviours, in a world where we increasingly delegate responsibilities to machines. Knowledge about how the magic works can be used for many ends, light and dark.
Disclaimer: I have a sort of brain damage where I can only understand math if it's in pictures, and there is a lot of literature, so the conceptual organisation and presentation here is likely idiosyncratic and incomplete. Carry on at your own risk.
So how does it work?
Reinforcement Learning is a sort of Machine Learning where Agents are trained to act well in an environment. In the most general picture, Agents produce actions within an environment and receive feedback from it in return.
![An agent producing actions in an environment and receiving feedback in return]()
Agents use the feedback they receive to update the way they behave, ideally in a way that makes them better at achieving the goals they are tasked with by their designers. For example, in a setting where the environment naturally provides rewards as part of its feedback [4] (such as old-school video games), agents can be tasked with maximising their score.
![An agent receiving reward as part of the feedback from its environment]()
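To make the loop concrete, here is a toy sketch in code. Everything in it is invented for illustration (the hidden-target "environment" and the placeholder agent are mine, not a standard API): the agent acts, the environment hands back a reward bundled with an observation, and the score is the quantity we ask the agent to maximise.

```python
import random

def env_step(state, action):
    """Toy environment: reward the agent when its action matches a hidden target.
    Returns (feedback, new_state); feedback bundles the reward with an observation."""
    reward = 1.0 if action == state["target"] else 0.0
    new_state = {"target": random.randint(0, 4)}        # the environment moves on
    observation = new_state["target"] % 2               # a partial view of the new state
    return {"reward": reward, "observation": observation}, new_state

def random_agent(feedback):
    """Placeholder agent: ignores its feedback and acts at random."""
    return random.randint(0, 4)

env_state = {"target": random.randint(0, 4)}
feedback = {"reward": 0.0, "observation": None}
score = 0.0
for _ in range(100):
    action = random_agent(feedback)                     # the agent acts...
    feedback, env_state = env_step(env_state, action)   # ...the environment responds
    score += feedback["reward"]                         # the quantity we ask it to maximise
print("total score:", score)
```

A better agent than `random_agent` would actually use its feedback, which is where the internal structure of agents comes in.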
The conceptual interest and difficulty of Reinforcement Learning come from ways of spelling out the internal structure of agents such that feedback is effectively integrated. In Minskian [5] terminology, this means treating agents as agencies composed of smaller communicating agents. A simple example would be to say that the environment is a process that produces a reward and a new external state dependent on the previous action and the previous external state, and the agent is really an agency of two smaller agents: a policy-agent that produces actions dependent on some internal state, and an update-agent that produces new internal states dependent on previous internal states and feedback from the environment.
![The agent as an agency of two smaller agents: a policy-agent and an update-agent communicating with the environment]()
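Following that description, the agent splits into two functions, sketched below with names and state contents of my own invention: a `policy` that maps an internal state to an action, and an `update` that folds the environment's feedback into a new internal state.

```python
def policy(internal_state):
    """Policy-agent: choose an action from the current internal state."""
    # here: repeat whatever observation we last remembered, defaulting to 0
    return internal_state.get("last_observation", 0)

def update(internal_state, feedback):
    """Update-agent: fold the environment's feedback into a new internal state."""
    return {
        "last_observation": feedback["observation"],
        "running_reward": internal_state.get("running_reward", 0.0) + feedback["reward"],
    }
```

One round of the communication scheme is then `action = policy(s)` followed by `s = update(s, feedback)`, with the environment sitting in between.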
We can always "unroll" such communication schemes into process-diagrams read left-to-right, which better reflect how the processes are carried out computationally: processes occur either sequentially or in parallel, with looping over data [6].
![The same communication scheme unrolled left-to-right as a process-diagram]()
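Unrolled in code, the left-to-right reading becomes an ordinary loop over time steps, threading the internal and external states along and collecting the trajectory as it goes. This is just a sketch reusing the toy functions from above:

```python
def unroll(policy, update, env_step, internal_state, env_state, steps=10):
    """Run the communication scheme left-to-right for a fixed number of steps,
    threading both states along and collecting the trajectory as we go."""
    trajectory = []
    for _ in range(steps):
        action = policy(internal_state)                     # policy-agent
        feedback, env_state = env_step(env_state, action)   # environment
        internal_state = update(internal_state, feedback)   # update-agent
        trajectory.append((action, feedback["reward"]))
    return trajectory, internal_state, env_state
```

For instance, `unroll(policy, update, env_step, {}, {"target": 2})` runs the toy pieces above for ten steps and returns the actions and rewards they produced.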
And we can basically stay in the realm of such pictures, because we usually care to instantiate agents as neural networks, and neural networks can be viewed as variable functions whose behaviour is bound by equations that denote "objectives". In the same way that 1 + x = 2 posits a constraint on the variable x, we can view, say, reward maximisation as the solution (or best attempt at a solution) to the following equation, where the variable functions to be solved for are in black:
![The reward-maximisation objective as a diagram, with the variable functions to be solved for in black]()
The equivalent symbolic formulation [7] is:
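Roughly, and in my own notation (the symbols below are my choice, not necessarily the diagram's): writing $\pi$ for the policy, $u$ for the update, $E$ for the fixed environment process, and $r_t$ for the reward at step $t$, the variable functions $\pi$ and $u$ are asked to maximise the total reward collected over an episode,

$$
\max_{\pi,\,u}\;\mathbb{E}\!\left[\sum_{t=0}^{T} r_t\right]
\quad\text{where}\quad
a_t = \pi\big(s^{\mathrm{int}}_t\big),\qquad
\big(r_t,\, s^{\mathrm{ext}}_{t+1}\big) = E\big(a_t,\, s^{\mathrm{ext}}_t\big),\qquad
s^{\mathrm{int}}_{t+1} = u\big(s^{\mathrm{int}}_t,\, r_t,\, s^{\mathrm{ext}}_{t+1}\big).
$$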
We can stop before the symbolic implementations and stay with the diagrams [8], because the point of deep learning, in some sense, is that such equations are solvable given enough training data and computational power. In a contemporary setting, the benefit of this abstraction is that we may also instantiate feedback and objective losses at the level of prompts and LLM-judges, so that the patterns of RL we explore are also applicable to the orchestration of LLM agents. Altogether, this means that we are primarily interested here in uncovering what RL has to say about the design of agencies.
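As a sketch of what that looks like (all of the names here, `generate`, `judge`, `update_context`, are placeholders standing in for real model calls, not an actual API): the generating model plays the policy-agent, an LLM-judge plays the environment handing back reward-like feedback, and a context update plays the update-agent.

```python
def generate(prompt, context):
    """Placeholder policy-agent: an LLM call that drafts a response.
    Swap in a real model client here."""
    return f"attempt {context['attempt']}: an answer to {prompt!r}"

def judge(prompt, response):
    """Placeholder LLM-judge: scores a response and offers a critique.
    In practice this would be another model call with a rubric in its prompt."""
    score = 1.0 if "answer" in response else 0.0
    return {"reward": score, "critique": "be more concrete"}

def update_context(context, feedback):
    """Placeholder update-agent: fold the judge's feedback into the next attempt."""
    return {"attempt": context["attempt"] + 1,
            "notes": context["notes"] + [feedback["critique"]]}

context = {"attempt": 0, "notes": []}
prompt = "Explain reinforcement learning in one sentence."
for _ in range(3):
    response = generate(prompt, context)          # policy-agent
    feedback = judge(prompt, response)            # environment / judge
    context = update_context(context, feedback)   # update-agent
```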
But is this formal? If it matters, yes. Essentially all pictorial presentations of flowchart-like process theories are formal [9], but the details of how to demonstrate that they are formal involve a notoriously scary and arcane sort of mathematics called Category Theory. I don't want to go into the details of how the bicycles work here (see [10] for a nice introduction if you care to, and [9] for a survey of some graphical syntaxes), but suffice it to say that the bicycles are sound and we can just be chill and ride them.
Footnotes
1. AlphaFold: a solution to a 50-year-old grand challenge in biology. (2022)
2. AlphaZero: Shedding new light on chess, shogi, and Go. (2020)
3. Christiano et al. Deep reinforcement learning from human preferences. (2023)
4. Some people consider this to be cheating, and would prefer agents to devise their own goals and rewards like humans are forced to in the real world; we'll get there eventually.
5. Minsky, M. The Society of Mind. (1988)
6. The variable labels are going to get smaller and then disappear, because at some point I will just use colours. The names of variables are mostly a conceptual convenience for people: whether an object really behaves like an "internal state" or a process like an "update" is grounded not by internal structure, but circularly, by how everything interacts with everything else.
7. The picture is morally correct, modulo monoidal stream traces (Lavore et al. 2022) and algebras for summation, and potentially taking semantics in Markov Categories for probabilistic processes (Cho & Jacobs 2019).
8. Rodatz et al. Pattern matching for process theories. (2025)
9. Selinger, P. A survey of graphical languages for monoidal categories. (2010)
10. Fong & Spivak. Seven Sketches in Compositionality. (2018)