Learning to Learn: Memory-Guided Evolution of Computation Graphs via Self-Compressing Fitness-Regularized Latent Space
- Existing Literature
- Proposed Solution
- Other Papers that Might be Useful
- Future Ideas
- Maybe Ideas
- References
- Limitations of current methods
- Limited adaptability: The final solution is static (even ones with continued online learning have a static learning paradigm)
- Most focus only on topology and/or weights
- “Such experimental trials convinced us that to solve the [Neural Architecture Search] dilemma, the connectionist paradigm alone is not adequate.” 1
- Weight sharing speeds up the search by reducing the size of the search space but "there is evidence that [weight sharing] inhibits the search for optimal architectures (Yu et al., 2020)" though this would only apply under some circumstances. 5
- Can take a while to find solution
- 5 addresses this by linearizing the loss function of untrained networks at the data points in each batch and using Kendall’s Tau correlation coefficient, which measures the strength and direction of association that exists between two variables, to determine that an initialized network is not worth training. It is uncertain whether this generalizes to harder tasks. It also remains to be seen whether it can be used for an unsupervised context. It might be adaptable to reinforcement learning but that has not been shown and could prove unstable due to the general sparsity of signal in the reinforcement context. It's also unclear if this method can be adapted to spiking neural networks.
A central innovation is the use of Pareto-based multi-objective optimization for evaluating and selecting candidate networks. Each network (or "individual") in the population is evaluated on multiple objectives rather than a single metric. One set of objectives reflect task-specific performance and another reflects computational cost. Instead of combining these objectives into a single weighted fitness score, the algorithm employs Pareto ranking via non-dominated sorting: an individual is considered fitter if it is Pareto-superior-that is, better in at least one objective without being worse in any other-compared to others in the population (53).
Using Pareto-based selection yields a diverse Pareto front of solutions, each representing a different balance of objectives (for example, one network might be extremely simple but moderately accurate, while another is more complex but achieves higher accuracy). This approach has several advantages for evolving learning systems:
- Robust, Generalizable Solutions: Multi-objective evolution tends to produce more robust models than optimizing a single metric. The algorithm doesn't commit to one arbitrary trade-off; instead, it preserves a spectrum of high-performing solutions. Researchers can then analyze this Pareto front for insights into how different architectures trade off performance vs. complexity (53).
- Favoring Minimal Architectures: By explicitly treating computational cost as an objective to minimize, the evolution naturally favors minimal efficient networks that align with the Minimum Description principle-the best solution is the simplest one that explains the data (55). The smallest networks that still perform well can be seen as minimum description estimates (MDEs) of the task. These MDEs are not just efficient; they are also scientifically interpretable, as they highlight the core components necessary to implement a given function or behavior.
- No Manual Tuning of Trade-offs: Pareto ranking removes the need to hand-tune weights between objectives (such as how much to penalize complexity). Instead of a single "optimal" network according to a weighted sum, set of Pareto-optimal networks is obtained that captures different compromises. This is especially useful in research contexts, where one might prefer simpler models for interpretation unless complexity is absolutely required for performance (53).
In summary, multi-objective evolutionary selection ensures that the project optimizes not only for how well a network learns, but also for how elegant or tractable its design is. This keeps the search grounded, preventing bloated solutions and steering evolution toward general-purpose networks that are both high-performing and low-complexity.
Another key innovation is a Variational Autoencoder (VAE) that enables unrestricted structural recombination of neural network architectures. Traditional neuroevolution methods like NEAT use crossover within the same "species" of networks and rely on aligning genes (nodes and connections) based on historical markers to recombine two parents. Those methods struggle to recombine vastly different topologies because the correspondence between parts of two very different networks is ambiguous. The approach addresses this by learning a continuous encoding of network graphs, allowing any two networks-even of entirely different designs-to mate in a meaningful way (54).
The implemented module (called SelfCompressingFitnessRegularizedDAGVAE, in search_space_compression.py
) acts as a generative recombination mechanism or a "cross-species mating system" for networks:
- Graph Encoding and Decoding: The VAE is trained to take arbitrary computation graphs and encode them into a continuous latent vector space. It can then decode a latent vector back into a network graph. This provides a common representation for all networks, regardless of their topology (54).
- Fitness-Regularized Compression: The VAE's training is not just a standard graph autoencoding. It is regularized with fitness prediction: part of the VAE's objective is to predict the network's performance and complexity from the latent encoding. This means the VAE is encouraged to organize the latent space such that important features (those that correlate with high fitness) are captured in the representation. Dimensions of the latent vector that do not contribute to explaining variation in performance tend to be pruned out automatically (using techniques like Automatic Relevance Determination). The result is a compressed, fitness-informed search space: a smooth landscape where distances reflect meaningful differences in network capability.
- Cross-Species Mating via Latent Interpolation: Once networks are encoded in this latent space, any two networks can be recombined by interpolating or randomly mixing their latent vectors and then decoding the result. In other words, the VAE enables cross-species crossover-two parent networks from entirely different niches or architectures can produce an offspring network. This generative crossover bypasses the historical gene-matching problem that NEAT faced without needing to explicitly align nodes or connections between parents.
- Sharing Innovations Across Lineages: This mechanism effectively breaks down barriers between separate evolutionary lineages. Useful innovations that arise in one species can be transferred to very different architectures. Because a portion of each generation's offspring are generated via latent-space recombination, new architectures can emerge that combine traits from wildly different ancestors. This helps inject novel structural patterns that pure mutation or traditional crossover (limited to similar parents) might never produce.
- Expanded Exploration with Guided Search: The VAE-driven crossover significantly expands the exploration of the search space while still being guided towards promising regions (due to the fitness-informed latent space). It serves as a kind of surrogate model that directs evolution: by sampling in latent space, it effectively samples networks that are expected to be high-performing or at least valid. This compresses the combinatorially vast search space of all possible networks into a more tractable form. The net effect is increased diversity of candidate solutions and the ability to discover innovative network motifs that conventional genetic operators might miss (54).
Importantly, the evolutionary algorithm combines this Graph-VAE crossover with more traditional NEAT-style mating within species. In practice, this means there are two crossover pathways: (1) standard crossover between similar individuals (preserving fine-tuned structures within a species), and (2) occasional graph-VAE generated offspring that mix across species. This balance ensures both exploitation and exploration: the population can refine known good solutions while still injecting radically new variations.
- On the Relationship Between Variational Inference and Auto-Associative Memory
- "In order to improve the memory capacity, modern Hopfield networks [22, 21, 8] propose several variants of the energy function using polynomial or exponential interactions. Extending these models to the continuous case, [30] proposed the Modern Continuous Hopfield Network (MCHN) with update rules implementing self attention, that they relate to the transformer model [36]. In [26], the authors introduce a general Hopfield network framework where the update rules are built using three components: a similarity function, a separation function, and a projection function."
- "It has been shown that overparameterized auto-encoders also implement AM [28, 33]. These methods embed the stored patterns as attractors through training, and retrieval is performed by iterating over the auto-encoding loop."
- Multifactorial Evolutionary Algorithm with Online Transfer Parameter Estimation: MFEA-II
- STCA: Spatio-Temporal Credit Assignment with Delayed Feedback in Deep Spiking Neural Networks
- Sparse Distributed Memory using N-of-M Codes
- A discrete time neural network model with spiking neurons: II: Dynamics with noise
- The Information Bottleneck Problem and Its Applications in Machine Learning
- “The information bottleneck (IB) theory recently emerged as a bold information-theoretic paradigm for analyzing DL systems. Adopting mutual information as the figure of merit, it suggests that the best representation T should be maximally informative about Y while minimizing the mutual information with X.” (a.k.a. compression)
- Neurons detect cognitive boundaries to structure episodic memories in humans
- Why think step-by-step? Reasoning emerges from the locality of experience
- Accurate online training of dynamical spiking neural networks through Forward Propagation Through Time
- Dynamics-Aware Unsupervised Discovery of Skills
- Frahmentation Instability in Aggregating Systems
- Rethinking the performance comparison between SNNs and ANNs
- A survey on computationally efficient neural architecture search
- Scaling Laws for Reward Model Overoptimization
- Interpreting neural computations by examining intrinsic and embedding dimensionality of neural activity
- "Task-relevant computations usually depend on structured neural activity whose dimensionality is lower than that of the state space"
- "we refer to the number of Euclidean dimensions that are required to capture a given fraction of the structured activity as the embedding dimensionality of the data"
- "While the embedding dimensionality carries information about structure, because of an assumption of linearity, it does not in general correspond to the number of independent variables needed to describe the data ... we call the number of independent variables the intrinsic dimension, and we refer to each independent variable describing the neural activity as a latent variable."
- talks about instances that have been shown to have higher embedding than intrinsic dimensionality
- "we focus on the key question of how latent variables and their embedding might relate to task variables and the underlying computations."
- "the intrinsic dimensionality of neural activity is largely determined by three sources of information: (1) incoming stimuli, (2) ongoing movements, and (3) the multitude of latent variables that characterize prior experiences and future expectations."
- "In early sensory areas, the intrinsic dimensionality is expected to be strongly associated with incoming stimuli."
- "signals in the primary motor cortex seem to carry information about the upcoming movement without regard to higher order latent variables ... the intrinsic dimensionality in the late stages of the motor system seem to be strongly tied to ongoing movements"
- "Although the general principles that govern embedding dimensionality are not known, several computationally inspired hypotheses have been proposed. One dominant hypothesis is that the information in any given brain area is extracted by other areas and peripheral systems through a linear readout. A linear readout scheme is a weighted average of the activity across neurons and has an intuitive geometric interpretation: it is a projection of the neural activity in the full Euclidean state space along a specific direction. One common strategy to evaluate this hypothesis is to quantify the degree to which a linear decoder can extract desired information from population activity in a brain area."
- "An extension of this hypothesis is that embedding across a population of neurons is organized such that different linear decoders can extract information about different task-relevant variables without interference. For example, movement and movement preparation signals in monkeys’ motor cortex reside in orthogonal subspaces [53]. This organization could ensure that preparation does not trigger movement [54,55], and may help protect previous motor memories [56]. Similar principles might govern interactions across cortical areas, by confining to orthogonal subspaces information that is private to an area and information that is communicated [57,58] ... We note however, that despite strong enthusiasm, direct evidence that the brain processes and/or communicates information through orthogonal linear decoders is wanting."
- "In general, high-dimensional embeddings with more degrees of freedom can facilitate extraction of task-relevant variables without interference (Rigotti et al. 2013; Cayco-Gajic et al. 2017; Cayco-Gajic and Silver 2019; Litwin-Kumar et al. 2017; Lanore et al. 2021). However, embedding information in arbitrarily high dimensional subspaces can have adverse effects for generalization [64]. To improve generalization, embeddings have to be appropriately constrained to capture the structural relationships and inherent invariances in the environment [10,38,65]. A theory for the ensuing trade-offs has been recently developed for the case where the activity is organized in multiple disjoint manifolds corresponding to different categories [66] and applied to interpret representations in the visual system [67] and in deep networks [68]. It is also possible to organize information embedding such that certain linear projection reflect information in more abstract form and therefore enable generalization, while others enable finer discrimination [15]."
- "The utility of linear decodability however, becomes less clear for intermediate stages of information processing in higher brain areas that carry information about latent variables that support flexible mental computations. While an experimenter may apply linear decoders to find information about a hypothesized latent variable in a certain brain area, there is no a priori reason to assume that the brain relies on such decoders."
- "For instance, ring-like manifolds, on which activity is represented by a single angular latent variable, can emerge from only weak structure in the connectivity"
- Model-agnostic Measure of Generalization Difficulty
- Adaptive Inference through Early-Exit Networks: Design, Challenges and Directions
- very interesting paper about progressive inference using an exit policy based on context
- probably use the "vanilla backbone networks, enhanced with early exits along their depth" approach because modularity would be useful for the gene representations
- "when disentangling the backbone network’s design from the early exits, one can have the flexibility of lazily selecting the architecture of the latter ones"
- "existence of residual connections spanning across early exits can help generalisability of the network"
- "maintaining multiple feature size representations, can prove detrimental in terms of model footprint"
- even though using a non-uniform architecture for the early exits increases the search space, it also allows for tradeoffs between "The number (and type) of exit-specific layers accuracy vs. their overhead"
- "too many early classifiers can negatively impact convergence when training end-to-end"
- equidistant vs variable distance "decision depends on the use-case, the exit rate and the accuracy of each early exit"
- "inter-exit distance is not actual 'depth', but can be quantified by means of FLOPs or parameters in the network"
- train network and early exits together
- "joint loss function is shaped which sums intermediate and the last output losses (𝐿(𝑖)) in a 𝑡𝑎𝑠𝑘 weighted manner (Eq. 1) and then backpropagates the signals to the respective parts of the network"
- "accuracy of this approach can be higher both for the intermediate (𝑦𝑖<𝑁) and the last exit (𝑦𝑁)" but "not guaranteed due to cross-talk between exits"
- "interplay of multiple backpropagation signals and the relative weighting (𝑤𝑖) of the loss components" finnicky w.r.t. "enabl[ing] the extraction of reusable features across exits"
- train separately
- first train backbone then freeze it, place and train the exit policies
- "means that each exit is only fine-tuning its own layers and does not affect the convergence of the rest of the network"
- no "cross talk between classifiers nor need to hand-tune the loss function", which allows "more exit variants [to] be placed at arbitrary positions in the network and be trained in parallel, offering scalability in training while leaving the selection of exit heads for deployment time"
- "more restrictive in terms of degrees of freedom on the overall model changes, and thus can yield lower accuracy than an optimised jointly trained variant."
- can use knowledge distillation setup where "the student 𝑖 is typically an early exit and the teacher 𝑗 can be a subsequent or the last exit"
- hyperparameters:
- distillation temperature (𝑇) "controls how 'peaky' the teacher softmax (soft labels) should be"
- alpha (𝛼) "balances the learning objective between ground truth (𝑦) and soft labels (𝑦𝑗)"
- hyperparameters:
- to deal with non-IID and inference-context-specific variation, can customize early exits "while retaining the performance of the last exit in the source global domain"
- can still be used in conjunction with knowledge distillation
- 3 main deployment methods: up to specific exit, full adaptive inference (each sample exits early depending on difficulty), budgeted adaptive inference ("maximise the throughput of correct predictions" given i.e. a total latency budget)
- rule-based exiting "need[s] to manually define an arbitrary threshold"
- learned early exiting can be modeled with or without independence from previous states of exit decisions
- Network of Evolvable Neural Units: Evolving to Learn at a Synaptic Level
- uses 4-gate recurrent nodes (reset, update, cell, output)
- each node uses same shared gate parameters
- each gate is a single layer feedforward network with nonlinear activation that has the same number of outputs as there are memory states
- this reduces the required chromosomes to two
- Leveraging dendritic properties to advance machine learning and neuro-inspired computing
- Dendrites and Efficiency: Optimizing Performance and Resource Utilization
- Faith and Fate: Limits of Transformers on Compositionality
- Building transformers from neurons and astrocytes
- An Empirical Review of Automated Machine Learning
- AutoML: A survey of the state-of-the-art
- Automated machine learning: Review of the state-of-the-art and opportunities for healthcare
- A Systematic Literature Review of the Successors of “NeuroEvolution of Augmenting Topologies”
- Neural Architecture Search without Training
- Spiking Neural Networks and online learning: An overview and perspectives
- Supervised learning in spiking neural networks: A review of algorithms and evaluations
- A review of learning in biologically plausible spiking neural networks
- Training Spiking Neural Networks Using Lessons From Deep Learning
- Deep Residual Learning in Spiking Neural Networks
- Event-based backpropagation can compute exact gradients for spiking neural networks
- Symmetric spike timing-dependent plasticity at CA3–CA3 synapses optimizes storage and recall in autoassociative networks
- Attention Spiking Neural Networks
- Foundations and Modeling of Dynamic Networks Using Dynamic Graph Neural Networks: A Survey
- The synchronized dynamics of time-varying networks
- Multimodality in Meta-Learning: A Comprehensive Survey
- Meta-Learning in Neural Networks: A Survey
- Model-Agnostic Meta-Learning for Fast Adaptation of Deep Neural networks
- Perceiver: General Perception with Iterative Attention
- A Generalist Agent
- PaLM-E: An Embodied Multimodal Language Model
- Attention is all you need
- Investigating the Limitations of Transformers with Simple Arithmetic Tasks
- Transformers In Vision: A Survey
- Coinductive guide to inductive transformer heads
- Addressing Some Limitations Of Transformers With Feedback Memory
- Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
- Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers
- In-context Learning and Induction Heads
- Emergence of Scale-free Graphs in Dynamical Spiking Neural Networks
- Learning place cells, grid cells and invariances with excitatory and inhibitory plasticity
- On the approximation by single hidden layer feedforward neural networks with fixed weights
- Compressive Transformers For Long-Range Sequence Modelling
- Compressed Learning: A Deep Neural Network Approach
- Rethinking Lossy Compression: The Rate-Distortion-Perception Tradeoff
- Lecture from Stanford by Jack Rae - Compression for AGI
- How learning unfolds in the brain: toward an optimization view
- Orbitofrontal Cortex as a Cognitive Map of Task Space
- Population coding in the cerebellum: a machine learning perspective
- A bio-inspired bistable recurrent cell allows for long-lasting memory
- Scale-Invariant Memory Representations Emerge from Moiré Interference between Grid Fields That Produce Theta Oscillations: A Computational Model
- Scale-Invariance and Self-Similar 'Wavelet' Transforms: an Analysis of Natural Scenes and Mammalian Visual Systems
- A Comprehensive Survey of Continual Learning: Theory, Method and Application
- Riemannian walk for incremental learning: Understanding forgetting and intransigence
- Gradient Episodic Memory for Continual Learning
- Signaling Mechanisms Linking Neuronal Activity to Gene Expression
- Plasticity of the Nervous System & Neuronal Activity–Regulated Gene Transcription in Synapse Development and Cognitive Function
- Neural Coding in Spiking Neural Networks: A Comparative Study for Robust Neuromorphic Systems
- Lifelong learning of human actions with deep neural network self-organization
- Brain-inspired learning in artificial neural networks: a review
- There's Plenty of Room Right Here: Biological Systems as Evolved, Overloaded, Multi-scale Machines
- Universal Mechanical Polycomputation in Granular Matter
- A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II
- GraphVAE: Towards Generation of Small Graphs Using Variational Autoencoders
- Modeling by shortest data description