The two starter-zones for RL are: "you know everything about the environment's states and transitions", in which case we are dealing with dynamic programming, and "the environment is a Markov decision process (i.e. the dynamics depend only on your action and the current state, not on the whole history)". There are arguably some techniques to know, e.g. the policy-gradient theorem, which allows one to avoid differentiating through the environment, but that's really only interesting if we consider backpropagation to be the last word in what constitutes cybernetic feedback (spoiler, it isn't; we can do feedback at the prompt level between LLMs).

So what is there to take away from the foundations? My opinionated take is that there's really only one sort of pattern to take away, which is a common abstraction of the expectation, max, argmax, and Bellman operators. All of these are functionals or "higher-order" operators of type (X → ℝ) → Y for some X and Y. I would like to abstract from these functionals a common computational pattern: in practice, computing these functionals starting with an input f: X → ℝ, we use:

  1. A sampling map ς: X → !X to instantiate some iterated loop (represented by some bang operator !) over a representative sample of X
  2. An accumulator map ∫: Z × X × ℝ → Z that folds over the "residuated" expression (idX ⊗ f) ∘ ΔX: X → X ⊗ ℝ, building up an intermediate result of type Z
  3. A postprocessing map ω: Z → Y that transforms the accumulated intermediate result into the final output

The intermediate type Z allows the accumulator to track auxiliary information (like running sums, best-so-far candidates, or normalisation constants) that gets refined into the final output by ω. For simple operators like 𝔼 and max, we have Z = Y and ω = id; for softmax, the accumulator builds up a normalising constant which ω uses to produce the final distribution. Schematically, such operators can be depicted as shells wrapped around other processes, like so:

[Figure: General Operator Pattern]
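In code, the shell is roughly the following. This is a minimal Python sketch; the names `run_operator`, `sample`, `accumulate`, and `post` are mine, not from any library.

```python
from typing import Callable, Iterable, TypeVar

X = TypeVar("X")   # domain
R = TypeVar("R")   # value type of f (ℝ in the numeric case)
Z = TypeVar("Z")   # intermediate accumulator type
Y = TypeVar("Y")   # output type

def run_operator(sample: Callable[[], Iterable[X]],   # ς: the iterated loop !X
                 accumulate: Callable[[Z, X, R], Z],  # ∫: Z × X × ℝ → Z
                 post: Callable[[Z], Y],              # ω: Z → Y
                 init: Z,
                 f: Callable[[X], R]) -> Y:
    z = init
    for x in sample():              # loop over a representative sample of X
        z = accumulate(z, x, f(x))  # fold over the pairs produced by (id_X ⊗ f) ∘ Δ_X
    return post(z)                  # postprocess the intermediate result

# E.g. the expectation of f over a uniform sample (Z = ℝ, ω = id, w_x = 1/N):
xs = [0.0, 1.0, 2.0, 3.0]
mean_of_squares = run_operator(lambda: xs,
                               lambda z, x, r: z + r / len(xs),
                               lambda z: z,
                               0.0,
                               lambda x: x ** 2)
```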

Let's see how some familiar operators fit into this mould. For each operator we specify the intermediate type Z, the accumulator ∫, and the postprocessor ω:

| Operator | Intermediate Z | Accumulator ∫ | Postprocess ω |
|---|---|---|---|
| 𝔼 | ℝ | z + w_x · r | id |
| max | ℝ | max(z, r) | id |
| argmax | X × ℝ | (x, r) if r > r_best else (x_best, r_best) | π₁ (project) |
| softmax | ℝ (normaliser) | z + exp(r) | x ↦ exp(f(x))/z |

Note: argmax accumulates both the best candidate and its score, then ω = π₁ projects out just the candidate. For softmax, the accumulator builds up the partition function $Z = \sum_{x'} \exp(f(x'))$, and then ω produces the normalised probability for each x by dividing $\exp(f(x))$ by this accumulated normaliser.
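Written out as explicit folds, the two non-identity rows of the table look like this (again a sketch, assuming a finite sample `xs`; no max-subtraction for numerical stability, to keep the pattern visible):

```python
import math
from typing import Callable, Dict, Sequence, TypeVar

X = TypeVar("X")

def argmax_op(xs: Sequence[X], f: Callable[[X], float]) -> X:
    # Z = X × ℝ: carry the best candidate together with its score.
    x_best, r_best = xs[0], f(xs[0])
    for x in xs[1:]:
        r = f(x)
        if r > r_best:                           # ∫: keep (x, r) if it beats the incumbent
            x_best, r_best = x, r
    return x_best                                # ω = π₁: project out the candidate

def softmax_op(xs: Sequence[X], f: Callable[[X], float]) -> Dict[X, float]:
    # Z = ℝ: accumulate the partition function Z = Σ_{x'} exp(f(x')).
    z = 0.0
    for x in xs:
        z += math.exp(f(x))                      # ∫: z + exp(r)
    return {x: math.exp(f(x)) / z for x in xs}   # ω: x ↦ exp(f(x)) / Z
```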

Dynamic programming as composite functionals

The point of dynamic programming is to compute good policies. This decomposes into a composition of four functionals, each an instance of our accumulator pattern. Given a value function V: S → ℝ:

  1. Inner expectation over next-states: 𝔼s'[V]: S × A → ℝ

    • ςs,a: S → !S (sample from P(·|s,a))
    • ∫ = ∫𝔼, ω = id
  2. Q-function construction: QV(s,a) = R(s,a) + γ · 𝔼s'[V(s')]

    • This is just pointwise arithmetic, composing reward with discounted future value.
  3. Outer aggregation over actions — here we have three choices depending on what we want to compute:

    • 𝔼 with ω = id: policy evaluation (𝒯πV)(s)
    • max with ω = id: optimal value (𝒯*V)(s)
    • argmax with ω = π₁: greedy policy π'(s)
  4. Fixed-point iteration: iterate until convergence

    • ς: ℕ → !ℕ (unfold until ε < threshold)
    • Z = (S → ℝ) × ℝ (current estimate and gap)
    • ∫: (V, ε), n, V' ↦ (V', ‖V' - V‖)
    • ω: (V, ε) ↦ V

The standard operators compose these as:

$$(\mathcal{T}^\pi V)(s) = \underbrace{\int_{\mathbb{E}}^{a \sim \pi}}_{\text{outer}} \left[ R(s,a) + \gamma \cdot \underbrace{\int_{\mathbb{E}}^{s' \sim P}}_{\text{inner}} [V(s')] \right]$$

$$(\mathcal{T}^* V)(s) = \underbrace{\int_{\textsf{max}}^{a \in A}}_{\text{outer}} \left[ R(s,a) + \gamma \cdot \underbrace{\int_{\mathbb{E}}^{s' \sim P}}_{\text{inner}} [V(s')] \right]$$

$$\textsf{greedy}(V)(s) = \underbrace{\int_{\textsf{argmax}}^{a \in A}}_{\text{outer}} \left[ R(s,a) + \gamma \cdot \underbrace{\int_{\mathbb{E}}^{s' \sim P}}_{\text{inner}} [V(s')] \right]$$
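Here is a sketch of these composites for a small tabular MDP, using an encoding of my own choosing: `P[s][a]` is a list of (next state, probability) pairs, `R[s][a]` is a scalar reward, and the policy in the evaluation case is taken to be deterministic for brevity.

```python
from typing import Dict, List, Tuple

State, Action = str, str
Transitions = Dict[State, Dict[Action, List[Tuple[State, float]]]]  # s -> a -> [(s', P(s'|s,a))]
Rewards = Dict[State, Dict[Action, float]]                          # s -> a -> R(s,a)

def inner_expectation(P: Transitions, V: Dict[State, float], s: State, a: Action) -> float:
    # Step 1: 𝔼_{s' ~ P(·|s,a)}[V(s')]  (∫ = ∫_𝔼, ω = id)
    return sum(prob * V[s2] for s2, prob in P[s][a])

def q_value(P: Transitions, R: Rewards, V: Dict[State, float], gamma: float,
            s: State, a: Action) -> float:
    # Step 2: Q_V(s, a) = R(s, a) + γ · 𝔼_{s'}[V(s')]
    return R[s][a] + gamma * inner_expectation(P, V, s, a)

def bellman_expectation(P: Transitions, R: Rewards, V: Dict[State, float],
                        gamma: float, pi: Dict[State, Action]) -> Dict[State, float]:
    # Step 3, outer 𝔼 over the policy's action (deterministic for brevity): (𝒯^π V)(s)
    return {s: q_value(P, R, V, gamma, s, pi[s]) for s in P}

def bellman_optimality(P: Transitions, R: Rewards, V: Dict[State, float],
                       gamma: float) -> Dict[State, float]:
    # Step 3, outer max over actions: (𝒯* V)(s)
    return {s: max(q_value(P, R, V, gamma, s, a) for a in P[s]) for s in P}

def greedy(P: Transitions, R: Rewards, V: Dict[State, float],
           gamma: float) -> Dict[State, Action]:
    # Step 3, outer argmax with ω = π₁: the greedy policy π'(s)
    return {s: max(P[s], key=lambda a: q_value(P, R, V, gamma, s, a)) for s in P}
```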

Value iteration computes $V^* = \text{fix}(\mathcal{T}^*)$, then extracts $\pi^* = \text{greedy}(V^*)$.

Policy iteration alternates: $\text{fix}(\mathcal{T}^\pi) \to \text{greedy} \to \text{fix}(\mathcal{T}^{\pi'}) \to \cdots$

Both are compositions of the same four building blocks — inner expectation, Q-construction, outer aggregation, and fixed-point iteration — just wired differently.
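The fixed-point functional is the generic wrapper around the backups; wired to the sketches above (this block reuses `bellman_optimality`, `bellman_expectation`, and `greedy` from the previous snippet, and uses a sup-norm stopping rule as the ε in the accumulator):

```python
from typing import Callable, Dict

def fixed_point(step: Callable[[Dict[str, float]], Dict[str, float]],
                V: Dict[str, float], tol: float = 1e-8, max_iters: int = 10_000) -> Dict[str, float]:
    # Step 4: Z = (S → ℝ) × ℝ;  ∫: (V, ε), n, V' ↦ (V', ‖V' − V‖);  ω: (V, ε) ↦ V
    for _ in range(max_iters):
        V_next = step(V)
        eps = max(abs(V_next[s] - V[s]) for s in V)   # sup-norm gap ε
        V = V_next
        if eps < tol:
            break
    return V

def value_iteration(P, R, gamma):
    # V* = fix(𝒯*), then π* = greedy(V*)
    V0 = {s: 0.0 for s in P}
    V_star = fixed_point(lambda V: bellman_optimality(P, R, V, gamma), V0)
    return V_star, greedy(P, R, V_star, gamma)

def policy_iteration(P, R, gamma):
    # fix(𝒯^π) → greedy → fix(𝒯^{π'}) → ···
    pi = {s: next(iter(P[s])) for s in P}             # arbitrary initial policy
    while True:
        V_pi = fixed_point(lambda V: bellman_expectation(P, R, V, gamma, pi),
                           {s: 0.0 for s in P})
        pi_next = greedy(P, R, V_pi, gamma)
        if pi_next == pi:
            return V_pi, pi
        pi = pi_next
```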

The Same Pattern with LLM Agents

The abstract decomposition pays off when we realise the same pattern applies beyond numerical computation. Consider an LLM-based agent deliberating in a text-based environment. The types change — State becomes natural language situation descriptions, Action becomes natural language action descriptions — but the structure is identical. Crucially, Value need not be a number: it can be a rich qualitative assessment mixing probabilistic hedging ("likely", "risky"), deontic judgement ("permissible", "obligatory"), social modelling ("they would expect...", "this signals..."), or aesthetic/intuitive grasp ("elegant", "feels off"). The accumulator's job is to synthesise such heterogeneous considerations into something actionable.

LLM policy improvement as composite functionals

Given a value-assessment function V: State → Value (an LLM prompted to assess situations):

  1. Inner expectation over consequences: 𝔼s'[V]: State × Action → Value

    • ςs,a: "What might happen if we do a in s? Consider likely outcomes, edge cases, how others might respond."
    • ∫: "Given these scenarios and their assessments, what's the overall prospect? Weigh by plausibility, but also note if any outcome is unacceptable regardless of likelihood."
    • ω = id
  2. Q-function construction: QV: State × Action → Value

    • Q: "The immediate situation from a is R(s,a); the downstream prospect is future. How do these combine? A small immediate cost might be worth it; an ethical violation might not be redeemable by future gains."
  3. Outer aggregation over actions — three modes:

    • ςsπ: "Which actions would I characteristically consider here?"
      • ∫: "Averaging over my tendencies, what's my overall assessment of the situation?"
      • ⇒ policy evaluation
    • ςs: "What are all the available moves, including unconventional ones?"
      • ∫: "Which option has the best overall assessment? (Best may involve tradeoffs: safest? most ethical? highest expected payoff?)"
      • ω = id: ⇒ optimal value
    • ςs, ∫ as above, but
      • ω = π₁: "Which action achieves that assessment?" ⇒ greedy policy
  4. Fixed-point iteration: fix: (State → Value) → (State → Value)

    • ς: "Have my assessments stabilised, or did that last round of reasoning change my view of some situations?"
    • ∫: "Update assessments in light of what I now know about downstream consequences."
    • ω: "Return the stable judgement."

The prompts are the "characteristic data" instantiating each functional. The same compositional structure — inner aggregation, Q-construction, outer aggregation, iteration — remains, but each box is now an LLM call that can fluidly mix probabilistic, logical, social, and intuitive reasoning as the situation demands.
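To show that the shape survives the change of types, here is a deliberately schematic sketch of the same wiring with prompts. `llm` is a placeholder for whatever chat-completion call you use (not a real API), `Value` is simply a string of qualitative assessment, and the prompt wording paraphrases the characteristic data listed above.

```python
from typing import Callable, List

State, Action, Value = str, str, str        # everything is natural language here

def llm(prompt: str) -> str:
    # Placeholder for a chat-completion call; not a real API.
    raise NotImplementedError("plug in your own model client")

def imagine_outcomes(s: State, a: Action) -> List[State]:
    # ς_{s,a}: sample a handful of plausible next situations, one per line.
    reply = llm(f"What might happen if we do {a!r} in {s!r}? "
                "List 3-5 plausible outcomes, one per line, including edge cases.")
    return [line.strip() for line in reply.splitlines() if line.strip()]

def prospect(V: Callable[[State], Value], s: State, a: Action) -> Value:
    # Inner aggregation: ∫ synthesises per-outcome assessments; ω = id.
    assessed = "\n".join(f"- {s2}: {V(s2)}" for s2 in imagine_outcomes(s, a))
    return llm("Given these outcomes and assessments:\n" + assessed +
               "\nWhat is the overall prospect? Weigh by plausibility, and flag anything "
               "unacceptable regardless of likelihood.")

def q_assess(V: Callable[[State], Value], s: State, a: Action) -> Value:
    # Q-construction: combine the immediate situation with the downstream prospect.
    return llm(f"Immediate effect of doing {a!r} in {s!r}, and the downstream prospect:\n"
               f"{prospect(V, s, a)}\nHow do these combine into one assessment?")

def greedy_action(V: Callable[[State], Value], s: State) -> Action:
    # Outer argmax with ω = π₁: return the action whose assessment is best.
    moves = [line.strip() for line in
             llm(f"List the available moves in {s!r}, including unconventional ones, "
                 "one per line.").splitlines() if line.strip()]
    assessed = "\n".join(f"- {a}: {q_assess(V, s, a)}" for a in moves)
    return llm("Which single action has the best overall assessment? "
               "Reply with that action verbatim:\n" + assessed)
```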

Unified notation

Both the numerical and LLM-based formulations are instances of:

$$\pi^*(s) = \omega_{\textsf{fix}}\Bigg( \textsf{fix}\bigg( V \mapsto \omega_{\textsf{out}}\Big( \int_{\textsf{out}}^{\varsigma_s} \big[ \mathsf{Q}\big(R(s,-),\, \omega_{\textsf{in}}\big(\int_{\textsf{in}}^{\varsigma_{s,-}}[V]\big)\big) \big] \Big) \bigg) \Bigg)(s)$$

where the characteristic data (ς, ∫, ω) for each functional determines whether we compute numerically or deliberate via prompted LLM calls, in which case "integration" can respect incommensurabilities that arithmetic would flatten.
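Read as code, the formula is a driver parameterised by that characteristic data: plugging in arithmetic (ς, ∫, ω) recovers the tabular backup, plugging in prompt-backed ones recovers the deliberation loop. A sketch of one backup-and-choose step, with names of my own choosing (wrapping it in the fixed-point functional and projecting with ω_fix gives π* as in the formula):

```python
from typing import Callable, Iterable, Tuple, TypeVar

S = TypeVar("S")   # states: table indices or natural-language situations
A = TypeVar("A")   # actions
V = TypeVar("V")   # values: numbers or qualitative assessments

def backup_and_choose(
    s: S,
    value: Callable[[S], V],
    actions: Callable[[S], Iterable[A]],                        # ς_s
    next_states: Callable[[S, A], Iterable[S]],                 # ς_{s,a}
    acc_inner: Callable[[Iterable[V]], V],                      # ∫_in with ω_in folded in
    combine_q: Callable[[S, A, V], V],                          # Q(R(s,−), ·)
    acc_outer: Callable[[Iterable[Tuple[A, V]]], Tuple[A, V]],  # ∫_out
) -> A:
    def q(a: A) -> V:
        prospect = acc_inner(value(s2) for s2 in next_states(s, a))
        return combine_q(s, a, prospect)
    best_action, _best_value = acc_outer((a, q(a)) for a in actions(s))
    return best_action                                          # ω = π₁
```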