Continuous sparsification strategies are among the most effective methods for reducing the inference costs and memory demands of large-scale neural networks. A key factor in their success is the implicit L1 regularization induced by jointly learning both mask and weight variables, which has been shown experimentally to outperform explicit L1 regularization. We provide a theoretical explanation for this observation by analyzing the learning dynamics, revealing that early continuous sparsification is governed by an implicit L2 regularization that gradually transitions to an L1 penalty over time. Leveraging this insight, we propose a method to dynamically control the strength of this implicit bias. Through an extension of the mirror flow framework, we establish convergence and optimality guarantees in the context of underdetermined linear regression. Our theoretical findings may be of independent interest, as we demonstrate how to enter the rich regime and show that the implicit bias can be controlled via a time-dependent Bregman potential. To validate these insights, we introduce PILoT, a continuous sparsification approach with novel initialization and dynamic regularization, which consistently outperforms baselines in standard experiments.
@inproceedings{jacobs2025mask,title={Mask in the Mirror: Implicit Sparsification},author={Jacobs, Tom and Burkholz, Rebekka},booktitle={The Thirteenth International Conference on Learning Representations},year={2025},url={https://openreview.net/forum?id=U47ymTS3ut},}
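For intuition about the implicit bias analyzed above, the following toy sketch (ours, not the paper's PILoT method) reparameterizes each weight as a mask-weight product in an underdetermined linear regression, the setting of the paper's theory; gradient descent on the product then lands on a markedly sparser interpolator than plain gradient descent, which returns the minimum-L2 solution.

```python
# Toy illustration (ours, not the paper's PILoT method): reparameterizing each weight as a
# mask-weight product w = m * theta and running gradient descent on (m, theta) biases
# underdetermined linear regression toward sparse interpolators, whereas plain gradient
# descent on w returns the dense minimum-L2 interpolator.
import torch

torch.manual_seed(0)
n, d = 20, 100                                 # fewer samples than features
X = torch.randn(n, d)
w_star = torch.zeros(d)
w_star[:5] = 1.0                               # sparse ground truth
y = X @ w_star

def train(params, forward, steps=20000, lr=1e-2):
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((X @ forward() - y) ** 2).mean()
        loss.backward()
        opt.step()
    return forward().detach()

# Plain parameterization: implicit L2 bias.
w = torch.zeros(d, requires_grad=True)
w_plain = train([w], lambda: w)

# Mask-times-weight parameterization with small initialization: implicit sparsity bias.
alpha = 0.1
m = torch.full((d,), alpha, requires_grad=True)
theta = (alpha * torch.randn(d)).requires_grad_(True)
w_masked = train([m, theta], lambda: m * theta)

print("L1 norm, plain GD   :", round(w_plain.abs().sum().item(), 2))
print("L1 norm, mask*weight:", round(w_masked.abs().sum().item(), 2))
```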
Maximizing the spectral gap through graph rewiring has been proposed to enhance the performance of message-passing graph neural networks (GNNs) by addressing over-squashing. However, as we show, minimizing the spectral gap can also improve generalization. To explain this, we analyze how rewiring can benefit GNNs within the context of stochastic block models. Since spectral gap optimization primarily influences community strength, it improves performance when the community structure aligns with node labels. Building on this insight, we propose three distinct rewiring strategies that explicitly target community structure, node labels, and their alignment: (a) community structure-based rewiring (ComMa), a more computationally efficient alternative to spectral gap optimization that achieves similar goals; (b) feature similarity-based rewiring (FeaSt), which focuses on maximizing global homophily; and (c) a hybrid approach (ComFy), which enhances local feature similarity while preserving community structure to optimize label-community alignment. Extensive experiments confirm the effectiveness of these strategies and support our theoretical insights.
@inproceedings{rubio-madrigal2025gnns,title={{GNN}s Getting ComFy: Community and Feature Similarity Guided Rewiring},author={Rubio-Madrigal, Celia and Jamadandi, Adarsh and Burkholz, Rebekka},booktitle={The Thirteenth International Conference on Learning Representations},year={2025},url={https://openreview.net/forum?id=g6v09VxgFw},}
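The following sketch illustrates feature-similarity-guided rewiring in the spirit of FeaSt; the scoring and the number of added edges are illustrative choices, not the paper's exact procedure.

```python
# Sketch of feature-similarity-guided rewiring in the spirit of FeaSt (scoring and the
# number of added edges are illustrative choices, not the paper's exact procedure): add
# the non-edges whose endpoint features are most similar, pushing toward higher homophily.
import numpy as np

def feast_like_rewire(adj, features, n_add=5):
    """adj: symmetric 0/1 adjacency (n x n); features: node features (n x d)."""
    n = adj.shape[0]
    unit = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    sim = unit @ unit.T                                   # pairwise cosine similarity
    candidate = np.triu(np.ones((n, n), dtype=bool), 1) & (adj == 0)
    scores = np.where(candidate, sim, -np.inf)
    new_adj = adj.copy()
    for _ in range(n_add):
        i, j = np.unravel_index(np.argmax(scores), scores.shape)
        if not np.isfinite(scores[i, j]):
            break
        new_adj[i, j] = new_adj[j, i] = 1
        scores[i, j] = -np.inf
    return new_adj

# Two tight feature clusters: the added edges connect nodes within the same cluster.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(-4, 1, (5, 8)), rng.normal(4, 1, (5, 8))])
print(feast_like_rewire(np.zeros((10, 10), dtype=int), feats, n_add=3))
```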
Message Passing Graph Neural Networks are known to suffer from two problems that are sometimes believed to be diametrically opposed: over-squashing and over-smoothing. The former results from topological bottlenecks that hamper the information flow from distant nodes and are mitigated by spectral gap maximization, primarily by means of edge additions. However, such additions often promote over-smoothing that renders nodes of different classes less distinguishable. Inspired by the Braess phenomenon, we argue that deleting edges can address over-squashing and over-smoothing simultaneously. This insight explains how edge deletions can improve generalization, thus connecting spectral gap optimization to a seemingly disconnected objective of reducing computational resources by pruning graphs for lottery tickets. To this end, we propose a computationally efficient spectral gap optimization framework to add or delete edges and demonstrate its effectiveness on the Long Range Graph Benchmark and on larger heterophilous datasets.
@inproceedings{jamadandi2024spectral,title={Spectral Graph Pruning Against Over-Squashing and Over-Smoothing},author={Jamadandi, Adarsh and Rubio-Madrigal, Celia and Burkholz, Rebekka},booktitle={Thirty-eighth Conference on Neural Information Processing Systems},year={2024},url={https://openreview.net/forum?id=EMkrwJY2de},}
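As a minimal illustration of the Braess-style intuition (not the paper's optimization framework), the sketch below greedily deletes the single edge whose removal most increases the spectral gap, measured as the second-smallest eigenvalue of the normalized graph Laplacian.

```python
# Minimal illustration of the Braess-style intuition (not the paper's optimization
# framework): greedily delete the single edge whose removal most increases the spectral
# gap, measured as the second-smallest eigenvalue of the normalized graph Laplacian.
import numpy as np

def spectral_gap(adj):
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    lap = np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt
    return np.sort(np.linalg.eigvalsh(lap))[1]

def delete_best_edge(adj):
    best_gap, best_adj = spectral_gap(adj), adj
    for i, j in zip(*np.triu_indices_from(adj, k=1)):
        if adj[i, j]:
            trial = adj.copy()
            trial[i, j] = trial[j, i] = 0.0
            gap = spectral_gap(trial)
            if gap > best_gap:
                best_gap, best_adj = gap, trial
    return best_adj

rng = np.random.default_rng(0)
adj = np.triu((rng.random((8, 8)) < 0.4).astype(float), 1)
adj = adj + adj.T
pruned = delete_best_edge(adj)
print("gap before:", round(spectral_gap(adj), 4), "after deletion:", round(spectral_gap(pruned), 4))
```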
Graph neural networks (GNNs) with a rescale invariance, such as GATs, can be re-parameterized during optimization through dynamic rescaling of network parameters and gradients while keeping the loss invariant. In this work, we explore dynamic rescaling as a tool to influence GNN training dynamics in two key ways: i) balancing the network with respect to various criteria, and ii) controlling the relative learning speeds of different layers. We gain novel insights, unique to GNNs, that reveal distinct training modes for different tasks. For heterophilic graphs, achieving balance based on relative gradients leads to faster training and better generalization. In contrast, homophilic graphs benefit from delaying the learning of later layers. Additionally, we show that training in balance supports larger learning rates, which can improve generalization. Moreover, controlling layer-wise training speeds is linked to grokking-like phenomena, which may be of independent interest.
@inproceedings{mustafa2024dynamic,title={Dynamic Rescaling for Training {GNN}s},author={Mustafa, Nimrah and Burkholz, Rebekka},booktitle={Thirty-eighth Annual Conference on Neural Information Processing Systems},year={2024},url={https://openreview.net/forum?id=IfZwSRpqHl},}
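The rescale invariance that dynamic rescaling exploits can be illustrated on a plain two-layer ReLU network (the paper works with GATs, but the mechanism is the same for any positively homogeneous architecture): scaling a hidden neuron's incoming weights by a positive factor and its outgoing weights by the inverse leaves the function unchanged, so per-neuron norms can be rebalanced at zero cost in loss.

```python
# Sketch of the rescale invariance exploited by dynamic rescaling, illustrated on a plain
# two-layer ReLU network rather than a GAT: scaling a hidden neuron's incoming weights by
# alpha > 0 and its outgoing weights by 1/alpha leaves the function unchanged, so per-neuron
# norms can be rebalanced (one of the balance criteria discussed in the paper) for free.
import torch

torch.manual_seed(0)
W1 = torch.randn(16, 8)     # incoming weights of 16 hidden ReLU units
W2 = torch.randn(4, 16)     # outgoing weights

def forward(x, W1, W2):
    return torch.relu(x @ W1.T) @ W2.T

def rebalance(W1, W2):
    """Per-neuron rescaling so that incoming and outgoing norms become equal."""
    alpha = torch.sqrt(W2.norm(dim=0) / W1.norm(dim=1))
    return W1 * alpha[:, None], W2 / alpha[None, :]

x = torch.randn(5, 8)
B1, B2 = rebalance(W1, W2)
print("function unchanged:", torch.allclose(forward(x, W1, W2), forward(x, B1, B2), atol=1e-5))
print("max norm imbalance after:", (B1.norm(dim=1) - B2.norm(dim=0)).abs().max().item())
```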
The practical utility of machine learning models in the sciences often hinges on their interpretability. It is common to assess a model’s merit for scientific discovery, and thus novel insights, by how well it aligns with already available domain knowledge, a dimension that is currently largely disregarded in the comparison of neural network models. While pruning can simplify deep neural network architectures and excels in identifying sparse models, as we show in the context of gene regulatory network inference, state-of-the-art techniques struggle with biologically meaningful structure learning. To address this issue, we propose DASH, a generalizable framework that guides network pruning by using domain-specific structural information in model fitting and leads to sparser, more interpretable models that are more robust to noise. Using both synthetic data with ground truth information as well as real-world gene expression data, we show that DASH, using knowledge about gene interaction partners within the putative regulatory network, outperforms general pruning methods by a large margin and yields deeper insights into the biological systems being studied.
@inproceedings{hossain2024pruning,title={Pruning neural network models for gene regulatory dynamics using data and domain knowledge},author={Hossain, Intekhab and Fischer, Jonas and Burkholz, Rebekka and Quackenbush, John},booktitle={Thirty-eighth Conference on Neural Information Processing Systems},year={2024},url={https://openreview.net/forum?id=FNtsZLwkGr},}
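A minimal sketch of domain-knowledge-guided pruning in the spirit of DASH (the blended score below is an illustrative choice, not DASH's exact criterion): combine weight magnitude with a prior-knowledge matrix so that connections supported by putative gene-gene interactions are pruned last.

```python
# Sketch of domain-knowledge-guided pruning in the spirit of DASH (the blended score below
# is an illustrative choice, not DASH's exact criterion): combine weight magnitude with a
# prior-knowledge matrix so that connections supported by putative interactions survive.
import torch

def guided_prune_mask(weight, prior, sparsity, lam=0.5):
    """weight: layer weights; prior: same-shape matrix in [0, 1] encoding domain knowledge
    (e.g., confidence in a regulator-target interaction); sparsity: fraction to remove."""
    w = weight.abs() / (weight.abs().max() + 1e-12)
    score = (1 - lam) * w + lam * prior
    k = int(sparsity * weight.numel())
    threshold = score.flatten().kthvalue(k).values if k > 0 else -float("inf")
    return (score > threshold).float()

torch.manual_seed(0)
weight = torch.randn(10, 10)
prior = (torch.rand(10, 10) < 0.2).float()    # sparse putative interaction matrix
mask = guided_prune_mask(weight, prior, sparsity=0.8)
print("kept fraction:", mask.mean().item())
print("kept prior-supported edges:", int((mask * prior).sum()), "of", int(prior.sum()))
```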
Gene regulatory network (GRN) models that are formulated as ordinary differential equations (ODEs) can accurately explain temporal gene expression patterns and promise to yield new insights into important cellular processes, disease progression, and intervention design. Learning such gene regulatory ODEs is challenging, since we want to predict the evolution of gene expression in a way that accurately encodes the underlying GRN governing the dynamics and the nonlinear functional relationships between genes. Most widely used ODE estimation methods either impose too many parametric restrictions or are not guided by meaningful biological insights; either shortcoming impedes scalability, explainability, or both. This work introduces biologically informed NeuralODEs that incorporate prior knowledge of the regulatory network structure while remaining flexible enough to capture nonlinear, genome-wide regulatory dynamics.
@article{hossain2024biologically,author={Hossain, Intekhab and Fanfani, Viola and Fischer, Jonas and Quackenbush, John and Burkholz, Rebekka},title={Biologically informed NeuralODEs for genome-wide regulatory dynamics},journal={Genome Biology},year={2024},month=may,volume={25},url={https://doi.org/10.1186/s13059-024-03264-0},}
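A minimal sketch of the general setup (not the paper's model): a NeuralODE for expression dynamics dx/dt = f_theta(x) whose first layer is masked by a putative GRN adjacency, rolled out here with a simple explicit Euler scheme.

```python
# Minimal sketch of the general setup (not the paper's model): a NeuralODE for gene
# expression dynamics dx/dt = f_theta(x), where the first layer is masked by a putative
# GRN adjacency and the trajectory is rolled out with a simple explicit Euler scheme
# (a full implementation would use a proper ODE solver and fit observed time courses).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeneODE(nn.Module):
    def __init__(self, n_genes, hidden, grn_mask):
        super().__init__()
        self.lin1 = nn.Linear(n_genes, hidden)
        self.lin2 = nn.Linear(hidden, n_genes)
        self.register_buffer("mask", grn_mask)             # (hidden, n_genes) prior connectivity

    def f(self, x):
        h = torch.tanh(F.linear(x, self.lin1.weight * self.mask, self.lin1.bias))
        return self.lin2(h)

    def forward(self, x0, t_steps=50, dt=0.1):
        traj = [x0]
        for _ in range(t_steps):
            traj.append(traj[-1] + dt * self.f(traj[-1]))   # explicit Euler step
        return torch.stack(traj)

torch.manual_seed(0)
n_genes, hidden = 6, 12
grn_mask = (torch.rand(hidden, n_genes) < 0.5).float()
model = GeneODE(n_genes, hidden, grn_mask)
trajectory = model(torch.rand(1, n_genes))
print(trajectory.shape)                                     # (t_steps + 1, 1, n_genes)
```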
Graph Attention Networks (GATs) are designed to provide flexible neighborhood aggregation that assigns weights to neighbors according to their importance. In practice, however, GATs are often unable to switch off task-irrelevant neighborhood aggregation, as we show experimentally and analytically. To address this challenge, we propose GATE, a GAT extension that holds three major advantages: i) It alleviates over-smoothing by addressing its root cause of unnecessary neighborhood aggregation. ii) Similarly to perceptrons, it benefits from higher depth as it can still utilize additional layers for (non-)linear feature transformations in case of (nearly) switched-off neighborhood aggregation. iii) By down-weighting connections to unrelated neighbors, it often outperforms GATs on real-world heterophilic datasets. To further validate our claims, we construct a synthetic test bed to analyze a model’s ability to utilize the appropriate amount of neighborhood aggregation, which could be of independent interest.
@inproceedings{mustafa2024gate,title={{GATE}: How to Keep Out Intrusive Neighbors},author={Mustafa, Nimrah and Burkholz, Rebekka},booktitle={Forty-first International Conference on Machine Learning},year={2024},url={https://openreview.net/forum?id=Sjv5RcqfuH},}
Learning Rate Rewinding (LRR) has been established as a strong variant of Iterative Magnitude Pruning (IMP) to find lottery tickets in deep overparameterized neural networks. While both iterative pruning schemes couple structure and parameter learning, understanding how LRR excels in both aspects can bring us closer to the design of more flexible deep learning algorithms that can optimize diverse sets of sparse architectures. To this end, we conduct experiments that disentangle the effect of mask learning and parameter optimization and how both benefit from overparameterization. The ability of LRR to flip parameter signs early and stay robust to sign perturbations seems to make it not only more effective in mask identification but also in optimizing diverse sets of masks, including random ones. In support of this hypothesis, we prove in a simplified single hidden neuron setting that LRR succeeds in more cases than IMP, as it can escape initially problematic sign configurations.
@inproceedings{gadhikar2024masks,title={Masks, Signs, And Learning Rate Rewinding},author={Gadhikar, Advait Harshal and Burkholz, Rebekka},booktitle={The Twelfth International Conference on Learning Representations},year={2024},url={https://openreview.net/forum?id=qODvxQ8TXW},}
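The difference between the two iterative schemes can be summarized in a few lines; the toy regression below is illustrative, not the paper's experimental setup. Both prune the smallest surviving magnitudes each round, but IMP rewinds the remaining weights to their initial values, whereas LRR keeps training them from their current values.

```python
# Toy comparison of Iterative Magnitude Pruning (IMP) and Learning Rate Rewinding (LRR)
# on linear regression (illustrative setup, not the paper's experiments). Both prune the
# smallest surviving magnitudes each round; IMP restarts the survivors from their initial
# values, LRR continues from the trained values (with the learning rate schedule restarted).
import torch

torch.manual_seed(0)
X, y = torch.randn(256, 20), torch.randn(256, 1)

def train(w0, mask, epochs=500, lr=0.05):
    w = w0.clone().requires_grad_(True)
    opt = torch.optim.SGD([w], lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((X @ (w * mask) - y) ** 2).mean()
        loss.backward()
        opt.step()
    return (w * mask).detach()

def iterative_prune(rewind_to_init, rounds=4, prune_frac=0.3):
    torch.manual_seed(1)                      # same initialization for both variants
    w_init = 0.1 * torch.randn(20, 1)
    mask = torch.ones(20, 1)
    w = train(w_init, mask)
    for _ in range(rounds):
        k = int(prune_frac * int(mask.sum()))
        surviving = w.abs()[mask.bool()]
        thresh = surviving.sort().values[k - 1] if k > 0 else -1.0
        mask = mask * (w.abs() > thresh).float()
        w = train(w_init if rewind_to_init else w, mask)
    return w, mask

w_imp, m_imp = iterative_prune(rewind_to_init=True)    # IMP-style
w_lrr, m_lrr = iterative_prune(rewind_to_init=False)   # LRR-style
mse = lambda w: ((X @ w - y) ** 2).mean().item()
print("IMP:", int(m_imp.sum()), "weights, train MSE", round(mse(w_imp), 3))
print("LRR:", int(m_lrr.sum()), "weights, train MSE", round(mse(w_lrr), 3))
```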
Normalization layers, of which Batch Normalization (BN) is a popular example, are an integral part of many deep learning architectures and contribute significantly to the learning success. We provide a partial explanation for this phenomenon by proving that training normalization layers alone is already sufficient for universal function approximation if the number of available, potentially random features matches or exceeds the number of weight parameters of the target networks that can be expressed. Our bound on the number of required features not only improves on a recent result for fully-connected feed-forward architectures but also applies to CNNs with and without residual connections and to almost arbitrary activation functions (which include ReLUs). Our explicit construction of a given target network solves a depth-width trade-off that is driven by architectural constraints and can explain why switching off entire neurons can have representational benefits, as has been observed empirically. To validate our theory, we explicitly match target networks that outperform experimentally obtained networks with trained BN parameters by utilizing a sufficient number of random features.
@inproceedings{burkholz2024batch,title={Batch normalization is sufficient for universal function approximation in {CNN}s},author={Burkholz, Rebekka},booktitle={The Twelfth International Conference on Learning Representations},year={2024},url={https://openreview.net/forum?id=wOSYMHfENq},}
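The corresponding experiment is easy to set up: freeze all randomly initialized convolution and linear weights and optimize only the BatchNorm scale and shift parameters. The small CNN below is an illustrative stand-in, not the architecture used in the paper.

```python
# Sketch of the "train only the normalization layers" experiment: all randomly initialized
# convolution and linear weights are frozen; only the BatchNorm scale/shift parameters are
# optimized. The small CNN below is an illustrative stand-in for the paper's architectures.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)

# Freeze everything except BatchNorm affine parameters.
for module in model.modules():
    for p in module.parameters(recurse=False):
        p.requires_grad = isinstance(module, nn.BatchNorm2d)

trainable = [p for p in model.parameters() if p.requires_grad]
total = sum(p.numel() for p in model.parameters())
print(f"training {sum(p.numel() for p in trainable)} of {total} parameters")

opt = torch.optim.Adam(trainable, lr=1e-3)
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()
opt.step()
```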
While the expressive power and computational capabilities of graph neural networks (GNNs) have been theoretically studied, their optimization and learning dynamics, in general, remain largely unexplored. Our study examines the Graph Attention Network (GAT), a popular GNN architecture in which a node’s neighborhood aggregation is weighted by parameterized attention coefficients. We derive a conservation law of GAT gradient flow dynamics, which explains why a large fraction of parameters in GATs with standard initialization struggle to change during training. This effect is amplified in deeper GATs, which perform significantly worse than their shallow counterparts. To alleviate this problem, we devise an initialization scheme that balances the GAT network. Our approach i) allows more effective propagation of gradients and in turn enables trainability of deeper networks, and ii) attains a considerable speedup in training and convergence time in comparison to the standard initialization. Our main theorem serves as a stepping stone to studying the learning dynamics of positive homogeneous models with attention mechanisms.
@inproceedings{mustafa2023are,title={Are {GAT}s Out of Balance?},author={Mustafa, Nimrah and Bojchevski, Aleksandar and Burkholz, Rebekka},booktitle={Thirty-seventh Conference on Neural Information Processing Systems},year={2023},url={https://openreview.net/forum?id=qY7UqLoora},}
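The following numerical check illustrates a conservation law of this kind on a plain two-layer ReLU network (the paper derives the analogous law for GAT parameters): for each hidden neuron, the gap between squared incoming and outgoing weight norms barely moves under small-step-size training.

```python
# Numerical illustration of a norm conservation law (shown on a plain two-layer ReLU
# network; the paper derives the analogous law for GAT parameters): per hidden neuron,
# ||incoming weights||^2 - ||outgoing weights||^2 drifts only slightly under training
# with a small step size, which approximates gradient flow.
import torch

torch.manual_seed(0)
X, y = torch.randn(64, 10), torch.randn(64, 1)
W1 = (0.5 * torch.randn(32, 10)).requires_grad_(True)   # incoming weights per hidden unit
W2 = (0.5 * torch.randn(1, 32)).requires_grad_(True)    # outgoing weights per hidden unit

def invariant():
    return W1.detach().norm(dim=1) ** 2 - W2.detach().norm(dim=0) ** 2

before = invariant()
opt = torch.optim.SGD([W1, W2], lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    loss = ((torch.relu(X @ W1.T) @ W2.T - y) ** 2).mean()
    loss.backward()
    opt.step()

drift = (invariant() - before).abs().max().item()
print("typical |invariant|:", before.abs().mean().item(), "max drift:", drift)
```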
Random masks define surprisingly effective sparse neural network models, as has been shown empirically. The resulting sparse networks can often compete with dense architectures and state-of-the-art lottery ticket pruning algorithms, even though they do not rely on computationally expensive prune-train iterations and can be drawn initially without significant computational overhead. We offer a theoretical explanation of how random masks can approximate arbitrary target networks if they are wider by a logarithmic factor in the inverse sparsity $1 / \log(1/\text{sparsity})$. This overparameterization factor is necessary at least for 3-layer random networks, which elucidates the observed degrading performance of random networks at higher sparsity. At moderate to high sparsity levels, however, our results imply that sparser networks are contained within random source networks so that any dense-to-sparse training scheme can be turned into a computationally more efficient sparse-to-sparse one by constraining the search to a fixed random mask. We demonstrate the feasibility of this approach in experiments for different pruning methods and propose particularly effective choices of initial layer-wise sparsity ratios of the random source network. As a special case, we show theoretically and experimentally that random source networks also contain strong lottery tickets.
@inproceedings{pmlr-v202-gadhikar23a,title={Why Random Pruning Is All We Need to Start Sparse},author={Gadhikar, Advait Harshal and Mukherjee, Sohom and Burkholz, Rebekka},booktitle={Proceedings of the 40th International Conference on Machine Learning},pages={10542--10570},year={2023},editor={Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},volume={202},series={Proceedings of Machine Learning Research},publisher={PMLR},url={https://proceedings.mlr.press/v202/gadhikar23a.html},}
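A sketch of the resulting sparse-to-sparse recipe: draw one random 0/1 mask per layer with chosen layer-wise densities and train only the surviving weights. The densities below are illustrative, not the layer-wise ratios proposed in the paper.

```python
# Sketch of sparse-to-sparse training with a fixed random mask: draw one 0/1 mask per
# layer (the layer-wise densities here are illustrative, not the paper's proposed ratios)
# and train only the weights that survive, never updating the mask.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
layers = [nn.Linear(784, 256), nn.Linear(256, 256), nn.Linear(256, 10)]
densities = [0.1, 0.1, 0.5]                   # fraction of weights kept per layer
masks = [(torch.rand_like(l.weight) < d).float() for l, d in zip(layers, densities)]

def forward(x):
    h = x
    for i, (layer, mask) in enumerate(zip(layers, masks)):
        h = F.linear(h, layer.weight * mask, layer.bias)
        if i < len(layers) - 1:
            h = torch.relu(h)
    return h

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
opt = torch.optim.Adam([p for l in layers for p in l.parameters()], lr=1e-3)
loss = nn.CrossEntropyLoss()(forward(x), y)
loss.backward()
opt.step()
# Masked positions stay inactive in the forward pass; a stricter implementation would
# also zero their gradients or re-apply the mask after each optimizer step.
print("overall density:", (sum(m.sum() for m in masks) / sum(m.numel() for m in masks)).item())
```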
The strong lottery ticket hypothesis has highlighted the potential for training deep neural networks by pruning, which has inspired interesting practical and theoretical insights into how neural networks can represent functions. For networks with ReLU activation functions, it has been proven that a target network with depth L can be approximated by the subnetwork of a randomly initialized neural network that has double the target’s depth 2L and is wider by a logarithmic factor. We show that a depth L+1 is sufficient. This result indicates that we can expect to find lottery tickets at realistic, commonly used depths while only requiring logarithmic overparametrization. Our novel construction approach applies to a large class of activation functions and is not limited to ReLUs. Code is available on Github (RelationalML/LT-existence).
@inproceedings{NEURIPS2022_76bf7786,author={Burkholz, Rebekka},booktitle={Advances in Neural Information Processing Systems},editor={Koyejo, S. and Mohamed, S. and Agarwal, A. and Belgrave, D. and Cho, K. and Oh, A.},pages={18707--18720},publisher={Curran Associates, Inc.},title={Most Activation Functions Can Win the Lottery Without Excessive Depth},url={https://papers.nips.cc/paper_files/paper/2022/hash/76bf7786d311217077bc8bb021946cd9-Abstract-Conference.html},volume={35},year={2022},}
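The core proof ingredient can be illustrated with a small subset-sum experiment: a single target weight is approximated by the sum of a well-chosen subset of a modest pool of random weights, which is what pruning a wider random layer effectively implements. The brute-force search below is only for illustration.

```python
# Brute-force illustration of the subset-sum idea behind lottery-ticket existence proofs:
# a single target weight is approximated by the sum of a suitable subset of a small pool
# of random weights, which is what pruning a wider random layer effectively implements.
import itertools
import numpy as np

rng = np.random.default_rng(0)
target = 0.73
candidates = rng.uniform(-1, 1, size=15)      # random weights available to the pruner

best_err, best_subset = abs(target), ()
for r in range(1, len(candidates) + 1):
    for subset in itertools.combinations(range(len(candidates)), r):
        err = abs(candidates[list(subset)].sum() - target)
        if err < best_err:
            best_err, best_subset = err, subset

print(f"approximated {target} with error {best_err:.1e} using {len(best_subset)} of 15 weights")
```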
The Lottery Ticket Hypothesis continues to have a profound practical impact on the quest for small-scale deep neural networks that solve modern deep learning tasks at competitive performance. These lottery tickets are identified by pruning large randomly initialized neural networks with architectures that are as diverse as their applications. Yet, theoretical insights that attest to their existence have mostly focused on deep fully-connected feed-forward networks with ReLU activation functions. We prove that modern architectures consisting of convolutional and residual layers, which can be equipped with almost arbitrary activation functions, also contain lottery tickets with high probability.
@inproceedings{pmlr-v162-burkholz22a,title={Convolutional and Residual Networks Provably Contain Lottery Tickets},author={Burkholz, Rebekka},booktitle={Proceedings of the 39th International Conference on Machine Learning},pages={2414--2433},year={2022},editor={Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},volume={162},series={Proceedings of Machine Learning Research},publisher={PMLR},url={https://proceedings.mlr.press/v162/burkholz22a.html},}