Zefram

LLMs have proven very effective in dealing with code. However their ability to deeply optimise code while maintaining the meaning of the original program is limited so far. This might improve in the long term; for now we can help the AI along with a sprinkle of symbolic techniques. In this post we take a look at the tricks used by slotted e-graphs¹ to deal with higher-order languages. From the full slotted e-graph data structure we distill a nominal version of union-find which can be useful in isolation. We show that the nominal union-find data structure is a practical implementation of nominal sets², and develop a categorical semantics on the way.

E-Graphs

Many program optimisations can be phrased in terms of term rewriting: starting out with a program supplied by the user, a compiler identifies and replaces patterns that can be rephrased in a more efficient way. This typically leads to an engineering challenge called the phase ordering problem: depending on the order in which rewrites are applied, opportunities for optimisation might be missed.

Equality saturation³ addresses the phase ordering problem by applying rewrites constructively: instead of replacing a pattern in the code with its optimised form, removing the pattern in the process, the compiler uses an e-graph that can store a superposition of both versions of the program. This is a monotone process, so applying a rewrite never takes away any opportunities to apply a different optimisation instead. Rewrites are applied either until no further versions of the code can be found (saturation) or until an optimisation budget is exhausted. In the end, a final program is extracted from the e-graph according to a cost model. Equality saturation had a surge in popularity after the relase of the egg library⁴.

E-graphs have been used in conjunction with machine learning techniques. While LLMs are good at identifying high-level changes, they often make mistakes and struggle with fine grained rewriting. This can be compensated by letting the LLM propose checkpoints and letting an e-graph rewriting system figure out if and how we can reach that checkpoint via correctness-preserving transformations⁵. The ASPEN framework⁶ uses LLMs to figure out which parts of an extracted program are responsible for the practically measured performance. Aurora⁷ uses reinforcement learning and graph neural networks to guide the exploration of the e-graph during rewriting.

E-graphs notoriously struggle with programs that contain bound variables. The efficiency of e-graphs depends on deduplicating subterms, but in the presence of variable binding, there are many terms that are not syntatically equal yet differ only in the naming of their bound variables (we call this $\alpha$ -equivalence). For instance, the terms map (fn x -> x * 2) xs and map (fn y -> y * 2) xs are $\alpha$ -equivalent and clearly implement the same operation, but an e-graph would consider them to be entirely different. In practice, this makes the size of the e-graph explode and the optimiser is clogged by a flood of spuriously different terms. Using DeBruijn indices instead of variable names helps to some extent, but still leads to an explosion of terms that made optimising higher-order programs with equality saturation mostly impractical.

Slotted e-graphs¹ are a recent innovation that dramatically improves the ability of e-graphs to deal with variables. On practical examples where e-graphs exploded in size, the slotted e-graph identified many more opportunities for deduplication and stayed manageable in size. For a full explanation of slotted e-graphs, I encourage you to read the orignal paper; it is a bit tricky but it is well explained. The remainer of this post will highlight some of the ideas, presented from a different angle.

Union-Find

An e-graph keeps track of a set of terms together with an equivalence relation on that set. When applying a rewrite to a term, the target term of the rewrite is inserted into the e-graph but the original is not removed. Instead the equivalence relation on terms is extended so that the source and target terms of the rewrite become equivalent. Algorithmically, the equivalence relation is encoded via the union-find data structure. This is the part of e-graphs that we will concentrate on in this post.

**Figure 1:** An equivalence relation with three equivalence classes, represented as a graph. Due to the transitivity property of equivalence relations, each equivalence class is a complete graph.

For reasons that will become apparent very soon, we will look at equivalence relations in a slightly unorthodox way. For every equivalence relation $\sim$ on a set $X$ there exists a groupoid $\mathcal{X}$ whose objects are the elements of $X$ and such that there is a map $a \to b$ in $\mathcal{X}$ if and only if $a \sim b$ . The resulting groupoid is thin: for every pair of objects there exists at most one map between them. Conversely, every thin groupoid induces an equivalence relation on its set of objects, relating two objects if there exists a map in between them. Enabled by this equivalence, we will treat equivalence relations and thin groupoids as interchangeable from now on.

Encoding equivalence relations naively as a graph would be rather expensive: an equivalence class of size $n$ contains $\tfrac{n(n - 1)}{2}$ pairs of related elements. Moreover, taking the union of two equivalence classes of size $n$ and size $m$ would involve adding $n m$ more edges to the graph.

**Figure 2:** Adding the relation $c \sim d$ to the equivalence relation of Figure 1 requires adding three new edges to the graph.

The union find data structure therefore employs a trick: instead of storing the whole graph, it suffices to store a spanning tree. Any spanning tree of the graph will do, leaving the freedom to pick one that can be manipulated most efficiently. By being clever about balancing the tree, the union-find data structure allows us to manipulate equivalence relations in (very nearly) linear time.

**Figure 3:** To encode an equivalence relation it suffices to store a spanning tree. This reduces the number of edges that we need to remember and maintain.

**Figure 4:** The new equivalence $c \sim d$ can be recorded by adding a single edge, regardless of the size of the equivalence classes. Contrast this to Figure 2 in which the number of new edges depended on the size of the equivalence classes of the newly related elements.

We can also formulate the spanning tree idea from the categorical perspective. Every category $\mathcal{C}$ has a groupoid completion $\mathcal{C} \to \Pi_1(\mathcal{C})$ that is obtained by freely making every morphism in $\mathcal{C}$ invertible. For a thin groupoid $\mathcal{X}$ we can pick a subcategory $\mathcal{C} \to \mathcal{X}$ so that $\mathcal{X}$ is the groupoid completion of $\mathcal{C}$ . There are many different categories that have equivalent groupoid completions, mirroring the freedom of choice that we had for spanning trees.

Labelled Union-Find

Say we added named variables to the terms that we track in an e-graph. The names help us match up definition and use sites of variables, but a variable's name is immaterial in the sense that a term's meaning shouldn't change if we consistently rename all variables while avoiding clashes. For instance, the terms ?a + ?b and ?x + ?y are essentially the same after renaming ?a to ?x and ?x to ?y. However ?a + ?b and ?a + ?a are fundamentally different: one is a sum of two potentially different values while the other is restricted to sum a value with itself.

Say that we now noticed that we have the terms ?a + ?b and ?x + ?y in our e-graph. We can not simply unify the two: if we did and were unlucky enough to also have the terms ?a * (?a + ?b) and ?a * (?x + ?y) around, congruence closure would force us to unify these two terms as well even though they are clearly different. Here traditional e-graphs give up and accept an explosion of terms that only differ in the naming of their variables.

Slotted e-graphs get around this issue: when we record that the two terms ?a + ?b and ?x + ?y are equivalent, we also record the renaming which makes them equivalent. Congruence closure then can then use this information: when replacing ?a + ?b with the equivalent ?x + ?y in ?a * (?a + ?b) we also must rename the leading ?a leading to ?x * (?x + ?y).

Remark: There is an analogy with gauge theory: To do calculations in physics we need to a coordinate system. The choice of coordinate system does not matter in the actual physical system, but we still have to make sure that we account for coordinate transformations between calculations that were made in different coordinate systems. In fact the mathematics we will develop rhymes very strongly with the tools developed for gauge theory.

Bare equivalence relations and the ordinary union-find data structure is no longer sufficient now. Labelled union-find⁸ extends union-find with the ability to attach labels to relations. The labels come from a group $G$ , the elements of which we will call symmetries, and the labelling has to satisfy some coherence properties:

Every element is related to itself via the neutral element $e \in G$ .
When $a \sim b$ via $g \in G$ and $b \sim c$ via $h \in G$ then $a \sim c$ via $gh \in G$ .
When $a \sim b$ via $g \in G$ then $b \sim a$ via $g^{-1} \in G$ .

**Figure 5:** An equivalence relation on three elements together with their labels. When $g$ and $h$ are given, all of the other labels are uniquely determined by the coherence laws.

Squinting enough you might recognise these conditions as the laws of a functor, and here is where our groupoid perspective begins to pay off. For a group $G$ the delooping of $G$ is the groupoid $\text{B} G$ with a single object $*$ such that $G$ is the group of automorphisms of $*$ . To label an equivalence relation with symmetries in a group $G$ , in the sense of the labelled union-find data structure, is to give a functor $\mathcal{X} \to \text{B} G$ from the thin groupoid $\mathcal{X}$ representing the equivalence relation.

The group that we care about for terms with free variables is the group of renamings. Formally, we pick some some countable set of names $\mathbb{A}$ and define a renaming to be a bijection $\sigma : \mathbb{A} \to \mathbb{A}$ so that $\sigma(a) \neq a$ for only finitely many $a \in \mathbb{A}$ (since any term has finitely many variables, this is not a big restriction). Renamings compose and have inverses so that they form a group $\text{Rename}$ which acts on terms by, you guessed it, renaming free variables.

Storing the entire graph with its labels is inefficient. Like its unlabelled counterpart, labelled union-find therefore only stores a spanning tree. Due to the coherence laws, the labels on the remaining edges are sufficient to reconstruct the label between any two equivalent elements. This is argued by hand in the paper on labelled union-find, but it also follows from abstract nonesense in our categorical point of view.

Remember that we encoded the spanning tree of an equivalence relation as a category $\mathcal{C}$ so that its groupoid completion $\mathcal{X}$ is a thin groupoid. When we have a functor $L : \mathcal{C} \to \text{B} G$ that labels only the spanning tree, then the universal property of the groupoid completion guarantees that the labels extend coherently and uniquely to a functor $L' : \mathcal{X} \to \text{B} G$ .

**Figure 6:** Just like with ordinary union-find, it also suffices to store a spanning tree in the labelled case. The labels of the omitted edges can then be reconstructed uniquely from the coherence laws, as can be seen in Figure 5.

Self-Symmetries

The labelled union find data structure has a crucial limitation. Suppose that we have two elements $a, b$ such that $a \sim b$ via a symmetry $g \in G$ . When we try to union $a$ and $b$ again with a new symmetry $h \in G$ , the algorithm described in the paper will fail with an error if $g \neq h$ . This is mirrored in our categorical semantics: when $L : \mathcal{X} \to \text{B} G$ is a labelled thin groupoid, there is at most one map $r: a \to b$ for any pair of elements $a, b \in \mathcal{X}$ and therefore also at most one label $L(r) \in G$ which relates the elements.

Self-symmetries are a particular case that is excluded by this restriction. An element $a$ is automatically related to itself via the identity symmetry $e \in G$ . Adding any non-trivial self-symmetry to $a$ would therefore run afoul of the rule that any relation between elements must have a unique label.

Terms with free variables however do regularly have non-trivial self-symmetries! Because addition is commutative, we notice that ?a + ?b and ?b + ?a are equivalent without any renaming. However ?b + ?a is related to ?a + ?b via the renaming $\sigma$ that swaps the names of the two variables. By composing the two, ?a + ?b is related to itself via $\sigma$ .

Labelled union find does not accomodate this. But in our categorical semantics we can solve this issue by simply dropping the requirement that the groupoid $\mathcal{X}$ is thin so that there can be any number of maps between two objects of $\mathcal{X}$ . We add the requirement that labellings $L : \mathcal{X} \to \text{B} G$ are faithful: there is no reason for us to track duplicate relations with the same label. Now every object $x \in \mathcal{X}$ has a group of automorphisms $\text{Aut}(x)$ and a faithful labelling $L : \mathcal{X} \to \text{B} G$ induces a group embedding $\text{Aut}(x) \hookrightarrow G$ .

At first it appears as if this would require us to store an impractical amount of data. However, just as above it suffices to store a spanning tree together with the generators of the automorphism group for just one object in every connected component. As a data structure, this is labelled union-find augmented with a group for every root in the spanning tree. Subgroups of $\text{Rename}$ can be encoded relatively efficiently using the Schreier-Sims algorithm.

**Figure:** The automorphism groups of related elements are isomorphic. The loop on $a$ determines a loop on $b$ : start at $b$ , go to a $a$ , go around the loop, then return to $b$ .

**Figure:** Since the automorphism groups of related elements determine each other, it suffices to only store one of them. Moreover, we don't need to store the entire group but only its generators.

Finite Support

There is a detail that we have swept under the rug so far: Any term is invariant under an infinite number of renamings! Since any given term only has finitely many free variables, we have an infinite number of variables that aren't free in the term. Permuting those variables has no effect: Nothing happens when we swap ?x and ?y in the term ?a + ?b. So if our labelled groupoid $\mathcal{X} \to \text{B} \text{Rename}$ accurately reflects the renaming symmetries of terms, every automorphism group is infinite and not even finitely generated. That is not an ideal situation for a data structure.

While the stabilisers of terms are not finitely generated, they do however admit a finite description. When we store the set of free variables for a term, the stabiliser group decomposes as a direct sum of the finite subgroup that actually involves the free variables, together with the infinite group that allows to permute any variable that doesn't actually occur free in the term.

E-Graphs

Union-Find

Labelled Union-Find

Self-Symmetries

Finite Support

Footnotes