LLMs have proven very effective in dealing with code. However their ability to deeply optimise code while maintaining the meaning of the original program is limited so far. This might improve in the long term; for now we can help the AI along with a sprinkle of symbolic techniques. In this post we take a look at the tricks used by slotted e-graphs1 to deal with higher-order languages. From the full slotted e-graph data structure we distill a nominal version of union-find which can be useful in isolation. We show that the nominal union-find data structure is a practical implementation of nominal sets2, and develop a categorical semantics on the way.
E-Graphs
Many program optimisations can be phrased in terms of term rewriting: starting out with a program supplied by the user, a compiler identifies and replaces patterns that can be rephrased in a more efficient way. This typically leads to an engineering challenge called the phase ordering problem: depending on the order in which rewrites are applied, opportunities for optimisation might be missed.
Equality saturation3 addresses the phase ordering problem by applying
rewrites constructively: instead of replacing a pattern in the code with its
optimised form, removing the pattern in the process, the compiler uses an
e-graph that can store a superposition of both versions of the program.
This is a monotone process, so applying a rewrite never takes away any
opportunities to apply a different optimisation instead.
Rewrites are applied either until no further versions of the code can
be found (saturation) or until an optimisation budget is exhausted.
In the end, a final program is extracted from the e-graph according to a cost model.
Equality saturation had a surge in popularity after the relase of the
egg library4.
E-graphs have been used in conjunction with machine learning techniques. While LLMs are good at identifying high-level changes, they often make mistakes and struggle with fine grained rewriting. This can be compensated by letting the LLM propose checkpoints and letting an e-graph rewriting system figure out if and how we can reach that checkpoint via correctness-preserving transformations5. The ASPEN framework6 uses LLMs to figure out which parts of an extracted program are responsible for the practically measured performance. Aurora7 uses reinforcement learning and graph neural networks to guide the exploration of the e-graph during rewriting.
E-graphs notoriously struggle with programs that contain bound variables. The
efficiency of e-graphs depends on deduplicating subterms, but in the presence of
variable binding, there are many terms that are not syntatically equal yet
differ only in the naming of their bound variables (we call this -equivalence).
For instance, the terms map (fn x -> x * 2) xs and
map (fn y -> y * 2) xs are -equivalent and clearly implement the same
operation, but an e-graph would consider them to be entirely different.
In practice, this makes the size of the e-graph explode and the optimiser is
clogged by a flood of spuriously different terms. Using DeBruijn indices instead
of variable names helps to some extent, but still leads to an explosion of terms
that made optimising higher-order programs with equality saturation mostly impractical.
Slotted e-graphs1 are a recent innovation that dramatically improves the ability of e-graphs to deal with variables. On practical examples where e-graphs exploded in size, the slotted e-graph identified many more opportunities for deduplication and stayed manageable in size. For a full explanation of slotted e-graphs, I encourage you to read the orignal paper; it is a bit tricky but it is well explained. The remainer of this post will highlight some of the ideas, presented from a different angle.
Union-Find
An e-graph keeps track of a set of terms together with an equivalence relation on that set. When applying a rewrite to a term, the target term of the rewrite is inserted into the e-graph but the original is not removed. Instead the equivalence relation on terms is extended so that the source and target terms of the rewrite become equivalent. Algorithmically, the equivalence relation is encoded via the union-find data structure. This is the part of e-graphs that we will concentrate on in this post.
Figure 1: An equivalence relation with three equivalence classes, represented as a graph. Due to the transitivity property of equivalence relations, each equivalence class is a complete graph.
For reasons that will become apparent very soon, we will look at equivalence relations in a slightly unorthodox way. For every equivalence relation on a set there exists a groupoid whose objects are the elements of and such that there is a map in if and only if . The resulting groupoid is thin: for every pair of objects there exists at most one map between them. Conversely, every thin groupoid induces an equivalence relation on its set of objects, relating two objects if there exists a map in between them. Enabled by this equivalence, we will treat equivalence relations and thin groupoids as interchangeable from now on.
Encoding equivalence relations naively as a graph would be rather expensive: an equivalence class of size contains pairs of related elements. Moreover, taking the union of two equivalence classes of size and size would involve adding more edges to the graph.
Figure 2: Adding the relation to the equivalence relation of Figure 1 requires adding three new edges to the graph.
The union find data structure therefore employs a trick: instead of storing the whole graph, it suffices to store a spanning tree. Any spanning tree of the graph will do, leaving the freedom to pick one that can be manipulated most efficiently. By being clever about balancing the tree, the union-find data structure allows us to manipulate equivalence relations in (very nearly) linear time.
Figure 3: To encode an equivalence relation it suffices to store a spanning tree. This reduces the number of edges that we need to remember and maintain.
Figure 4: The new equivalence can be recorded by adding a single edge, regardless of the size of the equivalence classes. Contrast this to Figure 2 in which the number of new edges depended on the size of the equivalence classes of the newly related elements.
We can also formulate the spanning tree idea from the categorical perspective. Every category has a groupoid completion that is obtained by freely making every morphism in invertible. For a thin groupoid we can pick a subcategory so that is the groupoid completion of . There are many different categories that have equivalent groupoid completions, mirroring the freedom of choice that we had for spanning trees.
Labelled Union-Find
Say we added named variables to the terms that we track in an e-graph. The names
help us match up definition and use sites of variables, but a variable's name
is immaterial in the sense that a term's meaning shouldn't change if we consistently
rename all variables while avoiding clashes. For instance, the terms ?a + ?b and ?x + ?y
are essentially the same after renaming ?a to ?x and ?x to ?y. However
?a + ?b and ?a + ?a are fundamentally different: one is a sum of two potentially
different values while the other is restricted to sum a value with itself.
Say that we now noticed that we have the terms ?a + ?b and ?x + ?y in our e-graph.
We can not simply unify the two: if we did and were unlucky enough to also have
the terms ?a * (?a + ?b) and ?a * (?x + ?y) around, congruence closure would
force us to unify these two terms as well even though they are clearly different.
Here traditional e-graphs give up and accept an explosion of terms that only differ
in the naming of their variables.
Slotted e-graphs get around this issue: when we record that the two terms
?a + ?b and ?x + ?y are equivalent, we also record the renaming which makes them equivalent.
Congruence closure then can then use this information: when replacing
?a + ?b with the equivalent ?x + ?y in ?a * (?a + ?b) we also must rename the leading ?a
leading to ?x * (?x + ?y).
Remark: There is an analogy with gauge theory: To do calculations in physics we need to a coordinate system. The choice of coordinate system does not matter in the actual physical system, but we still have to make sure that we account for coordinate transformations between calculations that were made in different coordinate systems. In fact the mathematics we will develop rhymes very strongly with the tools developed for gauge theory.
Bare equivalence relations and the ordinary union-find data structure is no longer sufficient now. Labelled union-find8 extends union-find with the ability to attach labels to relations. The labels come from a group , the elements of which we will call symmetries, and the labelling has to satisfy some coherence properties:
- Every element is related to itself via the neutral element .
- When via and via then via .
- When via then via .
Figure 5: An equivalence relation on three elements together with their labels. When and are given, all of the other labels are uniquely determined by the coherence laws.
Squinting enough you might recognise these conditions as the laws of a functor, and here is where our groupoid perspective begins to pay off. For a group the delooping of is the groupoid with a single object such that is the group of automorphisms of . To label an equivalence relation with symmetries in a group , in the sense of the labelled union-find data structure, is to give a functor from the thin groupoid representing the equivalence relation.
The group that we care about for terms with free variables is the group of renamings. Formally, we pick some some countable set of names and define a renaming to be a bijection so that for only finitely many (since any term has finitely many variables, this is not a big restriction). Renamings compose and have inverses so that they form a group which acts on terms by, you guessed it, renaming free variables.
Storing the entire graph with its labels is inefficient. Like its unlabelled counterpart, labelled union-find therefore only stores a spanning tree. Due to the coherence laws, the labels on the remaining edges are sufficient to reconstruct the label between any two equivalent elements. This is argued by hand in the paper on labelled union-find, but it also follows from abstract nonesense in our categorical point of view.
Remember that we encoded the spanning tree of an equivalence relation as a category so that its groupoid completion is a thin groupoid. When we have a functor that labels only the spanning tree, then the universal property of the groupoid completion guarantees that the labels extend coherently and uniquely to a functor .
Figure 6: Just like with ordinary union-find, it also suffices to store a spanning tree in the labelled case. The labels of the omitted edges can then be reconstructed uniquely from the coherence laws, as can be seen in Figure 5.
Self-Symmetries
The labelled union find data structure has a crucial limitation. Suppose that we have two elements such that via a symmetry . When we try to union and again with a new symmetry , the algorithm described in the paper will fail with an error if . This is mirrored in our categorical semantics: when is a labelled thin groupoid, there is at most one map for any pair of elements and therefore also at most one label which relates the elements.
Self-symmetries are a particular case that is excluded by this restriction. An element is automatically related to itself via the identity symmetry . Adding any non-trivial self-symmetry to would therefore run afoul of the rule that any relation between elements must have a unique label.
Terms with free variables however do regularly have non-trivial self-symmetries!
Because addition is commutative, we notice that
?a + ?b and ?b + ?a are equivalent without any renaming. However ?b + ?a is related
to ?a + ?b via the renaming that swaps the names of the two variables.
By composing the two, ?a + ?b is related to itself via .
Labelled union find does not accomodate this. But in our categorical semantics we can solve this issue by simply dropping the requirement that the groupoid is thin so that there can be any number of maps between two objects of . We add the requirement that labellings are faithful: there is no reason for us to track duplicate relations with the same label. Now every object has a group of automorphisms and a faithful labelling induces a group embedding .
At first it appears as if this would require us to store an impractical amount of data. However, just as above it suffices to store a spanning tree together with the generators of the automorphism group for just one object in every connected component. As a data structure, this is labelled union-find augmented with a group for every root in the spanning tree. Subgroups of can be encoded relatively efficiently using the Schreier-Sims algorithm.
Figure: The automorphism groups of related elements are isomorphic. The loop on determines a loop on : start at , go to a , go around the loop, then return to .
Figure: Since the automorphism groups of related elements determine each other, it suffices to only store one of them. Moreover, we don't need to store the entire group but only its generators.
Finite Support
There is a detail that we have swept under the rug so far: Any term is invariant
under an infinite number of renamings! Since any given term only has finitely
many free variables, we have an infinite number of variables that aren't free in the term.
Permuting those variables has no effect: Nothing happens when we swap ?x and ?y in the term ?a + ?b.
So if our labelled groupoid accurately
reflects the renaming symmetries of terms, every automorphism group is infinite
and not even finitely generated. That is not an ideal situation for a data structure.
While the stabilisers of terms are not finitely generated, they do however admit a finite description. When we store the set of free variables for a term, the stabiliser group decomposes as a direct sum of the finite subgroup that actually involves the free variables, together with the infinite group that allows to permute any variable that doesn't actually occur free in the term.
Footnotes
-
Slotted E-Graphs: First-Class Support for (Bound) Variables in E-Graphs. (2025) ↩ ↩2
-
Nominal Sets: Names and Symmetry in Computer Science. (2013) ↩
-
Equality Saturation: A New Approach to Optimization. (2010) ↩
-
Equality Saturation Guided by Large Language Models. (2025) ↩
-
ASPEN: LLM-Guided E-Graph Rewriting for RTL Datapath Optimization. (2025) ↩
-
Learned Graph Rewriting with Equality Saturation: A New Paradigm in Relational Query Rewrite and Beyond. (2024) ↩
-
Relational Abstractions Based on Labeled Union-Find. (2025) ↩