Learning to Learn: Memory-Guided Evolution of Computation Graphs via Self-Compressing Fitness-Regularized Latent Space

Existing Literature

Neural Architecture Search

Limitations of current methods
- Limited adaptability: The final solution is static (even ones with continued online learning have a static learning paradigm)
- Most focus only on topology and/or weights
  - “Such experimental trials convinced us that to solve the [Neural Architecture Search] dilemma, the connectionist paradigm alone is not adequate.” 1
- Weight sharing speeds up the search by reducing the size of the search space but "there is evidence that [weight sharing] inhibits the search for optimal architectures (Yu et al., 2020)" though this would only apply under some circumstances. 5
- Can take a while to find solution
  - 5 addresses this by linearizing the loss function of untrained networks at the data points in each batch and using Kendall’s Tau correlation coefficient, which measures the strength and direction of association that exists between two variables, to determine that an initialized network is not worth training. It is uncertain whether this generalizes to harder tasks. It also remains to be seen whether it can be used for an unsupervised context. It might be adaptable to reinforcement learning but that has not been shown and could prove unstable due to the general sparsity of signal in the reinforcement context. It's also unclear if this method can be adapted to spiking neural networks.

Proposed Solution

Multi-Objective Fitness (Pareto Optimization)

A central innovation is the use of Pareto-based multi-objective optimization for evaluating and selecting candidate networks. Each network (or "individual") in the population is evaluated on multiple objectives rather than a single metric. One set of objectives reflect task-specific performance and another reflects computational cost. Instead of combining these objectives into a single weighted fitness score, the algorithm employs Pareto ranking via non-dominated sorting: an individual is considered fitter if it is Pareto-superior-that is, better in at least one objective without being worse in any other-compared to others in the population (53).

Using Pareto-based selection yields a diverse Pareto front of solutions, each representing a different balance of objectives (for example, one network might be extremely simple but moderately accurate, while another is more complex but achieves higher accuracy). This approach has several advantages for evolving learning systems:

Robust, Generalizable Solutions: Multi-objective evolution tends to produce more robust models than optimizing a single metric. The algorithm doesn't commit to one arbitrary trade-off; instead, it preserves a spectrum of high-performing solutions. Researchers can then analyze this Pareto front for insights into how different architectures trade off performance vs. complexity (53).
Favoring Minimal Architectures: By explicitly treating computational cost as an objective to minimize, the evolution naturally favors minimal efficient networks that align with the Minimum Description principle-the best solution is the simplest one that explains the data (55). The smallest networks that still perform well can be seen as minimum description estimates (MDEs) of the task. These MDEs are not just efficient; they are also scientifically interpretable, as they highlight the core components necessary to implement a given function or behavior.
No Manual Tuning of Trade-offs: Pareto ranking removes the need to hand-tune weights between objectives (such as how much to penalize complexity). Instead of a single "optimal" network according to a weighted sum, set of Pareto-optimal networks is obtained that captures different compromises. This is especially useful in research contexts, where one might prefer simpler models for interpretation unless complexity is absolutely required for performance (53).

In summary, multi-objective evolutionary selection ensures that the project optimizes not only for how well a network learns, but also for how elegant or tractable its design is. This keeps the search grounded, preventing bloated solutions and steering evolution toward general-purpose networks that are both high-performing and low-complexity.

Generative Cross-Species Crossover (Graph-VAE)

Another key innovation is a Variational Autoencoder (VAE) that enables unrestricted structural recombination of neural network architectures. Traditional neuroevolution methods like NEAT use crossover within the same "species" of networks and rely on aligning genes (nodes and connections) based on historical markers to recombine two parents. Those methods struggle to recombine vastly different topologies because the correspondence between parts of two very different networks is ambiguous. The approach addresses this by learning a continuous encoding of network graphs, allowing any two networks-even of entirely different designs-to mate in a meaningful way (54).

The implemented module (called SelfCompressingFitnessRegularizedDAGVAE, in search_space_compression.py) acts as a generative recombination mechanism or a "cross-species mating system" for networks:

Graph Encoding and Decoding: The VAE is trained to take arbitrary computation graphs and encode them into a continuous latent vector space. It can then decode a latent vector back into a network graph. This provides a common representation for all networks, regardless of their topology (54).
Fitness-Regularized Compression: The VAE's training is not just a standard graph autoencoding. It is regularized with fitness prediction: part of the VAE's objective is to predict the network's performance and complexity from the latent encoding. This means the VAE is encouraged to organize the latent space such that important features (those that correlate with high fitness) are captured in the representation. Dimensions of the latent vector that do not contribute to explaining variation in performance tend to be pruned out automatically (using techniques like Automatic Relevance Determination). The result is a compressed, fitness-informed search space: a smooth landscape where distances reflect meaningful differences in network capability.
Cross-Species Mating via Latent Interpolation: Once networks are encoded in this latent space, any two networks can be recombined by interpolating or randomly mixing their latent vectors and then decoding the result. In other words, the VAE enables cross-species crossover-two parent networks from entirely different niches or architectures can produce an offspring network. This generative crossover bypasses the historical gene-matching problem that NEAT faced without needing to explicitly align nodes or connections between parents.
Sharing Innovations Across Lineages: This mechanism effectively breaks down barriers between separate evolutionary lineages. Useful innovations that arise in one species can be transferred to very different architectures. Because a portion of each generation's offspring are generated via latent-space recombination, new architectures can emerge that combine traits from wildly different ancestors. This helps inject novel structural patterns that pure mutation or traditional crossover (limited to similar parents) might never produce.
Expanded Exploration with Guided Search: The VAE-driven crossover significantly expands the exploration of the search space while still being guided towards promising regions (due to the fitness-informed latent space). It serves as a kind of surrogate model that directs evolution: by sampling in latent space, it effectively samples networks that are expected to be high-performing or at least valid. This compresses the combinatorially vast search space of all possible networks into a more tractable form. The net effect is increased diversity of candidate solutions and the ability to discover innovative network motifs that conventional genetic operators might miss (54).

Importantly, the evolutionary algorithm combines this Graph-VAE crossover with more traditional NEAT-style mating within species. In practice, this means there are two crossover pathways: (1) standard crossover between similar individuals (preserving fine-tuned structures within a species), and (2) occasional graph-VAE generated offspring that mix across species. This balance ensures both exploitation and exploration: the population can refine known good solutions while still injecting radically new variations.

Other Papers that Might be Useful

On the Relationship Between Variational Inference and Auto-Associative Memory
- "In order to improve the memory capacity, modern Hopfield networks [22, 21, 8] propose several variants of the energy function using polynomial or exponential interactions. Extending these models to the continuous case, [30] proposed the Modern Continuous Hopfield Network (MCHN) with update rules implementing self attention, that they relate to the transformer model [36]. In [26], the authors introduce a general Hopfield network framework where the update rules are built using three components: a similarity function, a separation function, and a projection function."
- "It has been shown that overparameterized auto-encoders also implement AM [28, 33]. These methods embed the stored patterns as attractors through training, and retrieval is performed by iterating over the auto-encoding loop."
Multifactorial Evolutionary Algorithm with Online Transfer Parameter Estimation: MFEA-II
STCA: Spatio-Temporal Credit Assignment with Delayed Feedback in Deep Spiking Neural Networks
Sparse Distributed Memory using N-of-M Codes
A discrete time neural network model with spiking neurons: II: Dynamics with noise
The Information Bottleneck Problem and Its Applications in Machine Learning
- “The information bottleneck (IB) theory recently emerged as a bold information-theoretic paradigm for analyzing DL systems. Adopting mutual information as the figure of merit, it suggests that the best representation T should be maximally informative about Y while minimizing the mutual information with X.” (a.k.a. compression)
Neurons detect cognitive boundaries to structure episodic memories in humans
Why think step-by-step? Reasoning emerges from the locality of experience
Accurate online training of dynamical spiking neural networks through Forward Propagation Through Time
Dynamics-Aware Unsupervised Discovery of Skills
Frahmentation Instability in Aggregating Systems
Rethinking the performance comparison between SNNs and ANNs
A survey on computationally efficient neural architecture search
Scaling Laws for Reward Model Overoptimization
Interpreting neural computations by examining intrinsic and embedding dimensionality of neural activity
- "Task-relevant computations usually depend on structured neural activity whose dimensionality is lower than that of the state space"
- "we refer to the number of Euclidean dimensions that are required to capture a given fraction of the structured activity as the embedding dimensionality of the data"
- "While the embedding dimensionality carries information about structure, because of an assumption of linearity, it does not in general correspond to the number of independent variables needed to describe the data ... we call the number of independent variables the intrinsic dimension, and we refer to each independent variable describing the neural activity as a latent variable."
- talks about instances that have been shown to have higher embedding than intrinsic dimensionality
- "we focus on the key question of how latent variables and their embedding might relate to task variables and the underlying computations."
- "the intrinsic dimensionality of neural activity is largely determined by three sources of information: (1) incoming stimuli, (2) ongoing movements, and (3) the multitude of latent variables that characterize prior experiences and future expectations."
- "In early sensory areas, the intrinsic dimensionality is expected to be strongly associated with incoming stimuli."
- "signals in the primary motor cortex seem to carry information about the upcoming movement without regard to higher order latent variables ... the intrinsic dimensionality in the late stages of the motor system seem to be strongly tied to ongoing movements"
- "Although the general principles that govern embedding dimensionality are not known, several computationally inspired hypotheses have been proposed. One dominant hypothesis is that the information in any given brain area is extracted by other areas and peripheral systems through a linear readout. A linear readout scheme is a weighted average of the activity across neurons and has an intuitive geometric interpretation: it is a projection of the neural activity in the full Euclidean state space along a specific direction. One common strategy to evaluate this hypothesis is to quantify the degree to which a linear decoder can extract desired information from population activity in a brain area."
- "An extension of this hypothesis is that embedding across a population of neurons is organized such that different linear decoders can extract information about different task-relevant variables without interference. For example, movement and movement preparation signals in monkeys’ motor cortex reside in orthogonal subspaces [53]. This organization could ensure that preparation does not trigger movement [54,55], and may help protect previous motor memories [56]. Similar principles might govern interactions across cortical areas, by confining to orthogonal subspaces information that is private to an area and information that is communicated [57,58] ... We note however, that despite strong enthusiasm, direct evidence that the brain processes and/or communicates information through orthogonal linear decoders is wanting."
- "In general, high-dimensional embeddings with more degrees of freedom can facilitate extraction of task-relevant variables without interference (Rigotti et al. 2013; Cayco-Gajic et al. 2017; Cayco-Gajic and Silver 2019; Litwin-Kumar et al. 2017; Lanore et al. 2021). However, embedding information in arbitrarily high dimensional subspaces can have adverse effects for generalization [64]. To improve generalization, embeddings have to be appropriately constrained to capture the structural relationships and inherent invariances in the environment [10,38,65]. A theory for the ensuing trade-offs has been recently developed for the case where the activity is organized in multiple disjoint manifolds corresponding to different categories [66] and applied to interpret representations in the visual system [67] and in deep networks [68]. It is also possible to organize information embedding such that certain linear projection reflect information in more abstract form and therefore enable generalization, while others enable finer discrimination [15]."
- "The utility of linear decodability however, becomes less clear for intermediate stages of information processing in higher brain areas that carry information about latent variables that support flexible mental computations. While an experimenter may apply linear decoders to find information about a hypothesized latent variable in a certain brain area, there is no a priori reason to assume that the brain relies on such decoders."
- "For instance, ring-like manifolds, on which activity is represented by a single angular latent variable, can emerge from only weak structure in the connectivity"
Model-agnostic Measure of Generalization Difficulty
Adaptive Inference through Early-Exit Networks: Design, Challenges and Directions
- very interesting paper about progressive inference using an exit policy based on context
- probably use the "vanilla backbone networks, enhanced with early exits along their depth" approach because modularity would be useful for the gene representations
  - "when disentangling the backbone network’s design from the early exits, one can have the flexibility of lazily selecting the architecture of the latter ones"
- "existence of residual connections spanning across early exits can help generalisability of the network"
- "maintaining multiple feature size representations, can prove detrimental in terms of model footprint"
- even though using a non-uniform architecture for the early exits increases the search space, it also allows for tradeoffs between "The number (and type) of exit-specific layers accuracy vs. their overhead"
- "too many early classifiers can negatively impact convergence when training end-to-end"
- equidistant vs variable distance "decision depends on the use-case, the exit rate and the accuracy of each early exit"
- "inter-exit distance is not actual 'depth', but can be quantified by means of FLOPs or parameters in the network"
- train network and early exits together
  - "joint loss function is shaped which sums intermediate and the last output losses (𝐿(𝑖)) in a 𝑡𝑎𝑠𝑘 weighted manner (Eq. 1) and then backpropagates the signals to the respective parts of the network"
  - "accuracy of this approach can be higher both for the intermediate (𝑦𝑖<𝑁) and the last exit (𝑦𝑁)" but "not guaranteed due to cross-talk between exits"
  - "interplay of multiple backpropagation signals and the relative weighting (𝑤𝑖) of the loss components" finnicky w.r.t. "enabl[ing] the extraction of reusable features across exits"
- train separately
  - first train backbone then freeze it, place and train the exit policies
  - "means that each exit is only fine-tuning its own layers and does not affect the convergence of the rest of the network"
  - no "cross talk between classifiers nor need to hand-tune the loss function", which allows "more exit variants [to] be placed at arbitrary positions in the network and be trained in parallel, offering scalability in training while leaving the selection of exit heads for deployment time"
  - "more restrictive in terms of degrees of freedom on the overall model changes, and thus can yield lower accuracy than an optimised jointly trained variant."
- can use knowledge distillation setup where "the student 𝑖 is typically an early exit and the teacher 𝑗 can be a subsequent or the last exit"
  - hyperparameters:
    - distillation temperature (𝑇) "controls how 'peaky' the teacher softmax (soft labels) should be"
    - alpha (𝛼) "balances the learning objective between ground truth (𝑦) and soft labels (𝑦𝑗)"
- to deal with non-IID and inference-context-specific variation, can customize early exits "while retaining the performance of the last exit in the source global domain"
  - can still be used in conjunction with knowledge distillation
- 3 main deployment methods: up to specific exit, full adaptive inference (each sample exits early depending on difficulty), budgeted adaptive inference ("maximise the throughput of correct predictions" given i.e. a total latency budget)
- rule-based exiting "need[s] to manually define an arbitrary threshold"
- learned early exiting can be modeled with or without independence from previous states of exit decisions
Network of Evolvable Neural Units: Evolving to Learn at a Synaptic Level
- uses 4-gate recurrent nodes (reset, update, cell, output)
- each node uses same shared gate parameters
- each gate is a single layer feedforward network with nonlinear activation that has the same number of outputs as there are memory states
- this reduces the required chromosomes to two
Leveraging dendritic properties to advance machine learning and neuro-inspired computing
Dendrites and Efficiency: Optimizing Performance and Resource Utilization
Faith and Fate: Limits of Transformers on Compositionality
Building transformers from neurons and astrocytes

Name		Name	Last commit message	Last commit date
Latest commit History 215 Commits
computation_graphs		computation_graphs
tests		tests
.flake8		.flake8
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
README.md		README.md
attributes.py		attributes.py
compare_encoders.py		compare_encoders.py
complexity.py		complexity.py
create-optimizer-graphs.sh		create-optimizer-graphs.sh
data.py		data.py
genes.py		genes.py
genome.py		genome.py
graph_builder.py		graph_builder.py
main.py		main.py
metrics.py		metrics.py
models.py		models.py
neat-config		neat-config
pareto.py		pareto.py
population.py		population.py
pyproject.toml		pyproject.toml
reproduction.py		reproduction.py
requirements.txt		requirements.txt
search_space_compression.py		search_space_compression.py
tasks.py		tasks.py
utility.py		utility.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Learning to Learn: Memory-Guided Evolution of Computation Graphs via Self-Compressing Fitness-Regularized Latent Space

Contents

Existing Literature

Neural Architecture Search

Proposed Solution

Multi-Objective Fitness (Pareto Optimization)

Generative Cross-Species Crossover (Graph-VAE)

Other Papers that Might be Useful

References

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

TimeDelta/learning-to-learn

Folders and files

Latest commit

History

Repository files navigation

Learning to Learn: Memory-Guided Evolution of Computation Graphs via Self-Compressing Fitness-Regularized Latent Space

Contents

Existing Literature

Neural Architecture Search

Proposed Solution

Multi-Objective Fitness (Pareto Optimization)

Generative Cross-Species Crossover (Graph-VAE)

Other Papers that Might be Useful

References

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages