Foundations for hyperefficient algorithms that can exploit the next generation of neural accelerators
About This Research
Sparsity is the observation that much of the computation in a neural network is redundant: weights are often near-zero, and many paths of activity through the network contribute little to its output. Our research investigates how two forms of sparsity, weight sparsity (the static pruning of near-zero connections) and activity sparsity (the dynamic gating of activations within the network), yield computational savings with minimal loss in representation quality, at least until sparsity levels become very high. Computational neuroscience has long held that the brain relies crucially on both kinds of sparsity for its extreme computational efficiency, but how to use this insight productively for training efficient networks has typically eluded ML researchers.
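To make the two forms concrete, the following sketch applies static weight pruning and dynamic top-k activation gating in a single toy forward pass. It is illustrative only: the threshold, layer sizes, and value of k are arbitrary choices for this example, not the configuration used in our experiments.

import numpy as np

def prune_weights(W, threshold=0.05):
    # Weight sparsity: statically zero out near-zero connections.
    return np.where(np.abs(W) < threshold, 0.0, W)

def topk_gate(h, k):
    # Activity sparsity: for each input, keep only the k most active units.
    gated = np.zeros_like(h)
    idx = np.argsort(h, axis=-1)[..., -k:]
    np.put_along_axis(gated, idx, np.take_along_axis(h, idx, axis=-1), axis=-1)
    return gated

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 784))                                # small batch of flattened 28x28 digits
W = prune_weights(rng.normal(scale=0.1, size=(784, 128)))    # pruned weight matrix
h = np.maximum(x @ W, 0.0)                                   # ReLU hidden activations
h_sparse = topk_gate(h, k=16)                                # only 16 of 128 hidden units fire per input
print(f"weight sparsity: {(W == 0).mean():.2f}, activity sparsity: {(h_sparse == 0).mean():.2f}")

Because pruned weights never contribute to the matrix multiply and only a fraction of hidden units are active for any given input, the two savings compound multiplicatively.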
Here we train two autoencoders side by side on handwritten digits: a dense baseline and a sparse model that applies both forms of sparsity simultaneously. The neuronal activity visualisation shows which hidden units fire for each input: different digits activate different sparse subnetworks, which can be read as the model learning input-dependent computational pathways.
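As a rough sketch of how such a sparse model could be structured (the layer sizes, the value of k, and the use of a simple top-k gate are assumptions made for illustration, not the exact configuration trained in the demo), a PyTorch autoencoder with a gated hidden layer looks like this:

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    # Minimal sketch: an autoencoder whose hidden layer is gated so that only
    # the k most active units fire for each input (activity sparsity).
    # Weight sparsity would be layered on top by pruning small encoder/decoder
    # weights; that step is omitted here for brevity.
    def __init__(self, n_in=784, n_hidden=128, k=16):
        super().__init__()
        self.encoder = nn.Linear(n_in, n_hidden)
        self.decoder = nn.Linear(n_hidden, n_in)
        self.k = k

    def forward(self, x):
        h = torch.relu(self.encoder(x))
        # Keep only the top-k activations per example; the rest are zeroed, so
        # different inputs light up different sparse subnetworks.
        topk = torch.topk(h, self.k, dim=-1)
        mask = torch.zeros_like(h).scatter_(-1, topk.indices, 1.0)
        h_sparse = h * mask
        return self.decoder(h_sparse), h_sparse

model = SparseAutoencoder()
x = torch.rand(8, 784)                       # stand-in for a batch of digit images
recon, codes = model(x)                      # `codes` is what a per-input firing map would plot
loss = nn.functional.mse_loss(recon, x)      # reconstruction objective
loss.backward()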
Because the underlying mathematics is general, these techniques are architecture-agnostic. Promising future directions include: ultra-large models, orders of magnitude larger than present efficiency constraints allow; extreme mixture-of-experts routings, where activity sparsity provides a novel, principled gating mechanism; on-device inference, where the multiplicative savings enable deployment on resource-constrained hardware; and interpretability research, where sparse activation patterns may reveal semantically meaningful circuits within trained networks.
This line of research is particularly urgent given the emerging generation of neural accelerators explicitly designed to exploit dynamic sparsity. Hardware such as Cerebras' wafer-scale engines, neuromorphic chips like SpiNNaker / SpiNNCloud and Intel's Loihi, and analog computing substrates based on memristors and optoelectronic components all promise order-of-magnitude improvements in efficiency and latency, but only for appropriately designed, novel architectures. These platforms represent a potential paradigm shift in neural computation, yet most contemporary algorithm research remains focused on dense matrix operations optimised for GPUs.