Forget the Data and Fine-tuning! Just Fold the Network to Compress

Accepted by ICLR 2025

Dong Wang*¹, Haris Šikić*¹, Lothar Thiele², Olga Saukh¹,³

¹Graz University of Technology, Austria
²ETH Zurich, Switzerland
³Complexity Science Hub Vienna, Austria
*Equal contribution

Abstract

Model Folding is a data-free compression technique that merges structurally similar neurons using $k$-means clustering and novel variance control techniques. Experiments on ResNet18 and LLaMA-7B show that Model Folding matches data-driven compression methods and outperforms recent data-free approaches, especially at high sparsity levels, making it well suited to resource-constrained deployments.

How does it work?

$3\times3$ filter weights in conv1 of a pre-trained ResNet18.

Inspired by the structural similarities found in pre-trained models, we propose model folding, which clusters and merges similar neurons instead of zeroing them out. We also propose two data-free repair algorithms that correct the BatchNorm statistics of a folded model.

Folding a model takes only three steps: cluster, merge, and repair. No ~~Data~~ No ~~Fine-tuning~~
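The cluster and merge steps can be illustrated on a pair of fully connected layers. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names `kmeans_labels` and `fold_pair` are hypothetical, the layers are plain linear layers with a ReLU in between, and the repair step (which corrects BatchNorm statistics) is omitted because this page does not describe it in detail.

```python
import numpy as np

def kmeans_labels(x, k, iters=20):
    """Cluster the rows of x into k groups and return one label per row."""
    # Deterministic farthest-point initialisation (an assumption made for
    # reproducibility; any k-means initialisation could be used instead).
    centers = [x[0]]
    for _ in range(k - 1):
        d = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    centers = np.stack(centers).astype(float)
    for _ in range(iters):
        # Squared distance from every row to every center.
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):  # keep the old center if a cluster is empty
                centers[c] = x[labels == c].mean(axis=0)
    return labels

def fold_pair(w1, b1, w2, k):
    """Fold the n hidden neurons between two linear layers down to k.

    w1: (n, d_in) weights and b1: (n,) biases of the first layer.
    w2: (d_out, n) weights of the layer that consumes its output.
    """
    # Cluster: neurons are similar if their incoming weights and bias are.
    labels = kmeans_labels(np.concatenate([w1, b1[:, None]], axis=1), k)
    onehot = np.eye(k)[labels]                 # (n, k) cluster membership
    counts = np.maximum(onehot.sum(axis=0), 1)  # avoid division by zero
    # Merge: average the clustered neurons in the first layer ...
    average = (onehot / counts).T              # (k, n)
    w1_folded, b1_folded = average @ w1, average @ b1
    # ... and sum the matching input columns of the next layer.
    w2_folded = w2 @ onehot                    # (d_out, k)
    return w1_folded, b1_folded, w2_folded
```

When the neurons in a cluster are exact duplicates, the folded pair computes the same function as the original pair. When they are only similar, averaging shifts the activation statistics, which is the effect the repair step is meant to correct.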

Model Folding Animation

Results

We compare the following variants of model folding with state-of-the-art baselines:

Fold-Naïve: model folding without repair
Fold-R: model folding with repair
Fold-AR: model folding with approximate repair
Fold-DIR: model folding with deep inversion repair
SP L1/L2: L1/L2 structured pruning
IFM: Iterative Feature Merging
INN: Integral neural networks
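For contrast with folding, the idea behind the structured magnitude pruning baselines (SP L1/L2) can be sketched as follows: rank hidden neurons by the norm of their incoming weights and drop the weakest ones entirely. This is an illustrative NumPy sketch on a pair of linear layers. The function name `prune_by_norm` is hypothetical, and the exact pruning configuration used in the experiments is not stated on this page.

```python
import numpy as np

def prune_by_norm(w1, b1, w2, keep, order=1):
    """Keep the `keep` hidden neurons whose incoming weights have the largest norm.

    w1: (n, d_in) weights and b1: (n,) biases of the first layer.
    w2: (d_out, n) weights of the layer that consumes its output.
    order: 1 for L1 magnitude, 2 for L2 magnitude.
    """
    if not 1 <= keep <= w1.shape[0]:
        raise ValueError("keep must be between 1 and the number of neurons")
    norms = np.linalg.norm(w1, ord=order, axis=1)  # one norm per neuron
    kept = np.sort(np.argsort(norms)[-keep:])      # strongest neurons, original order
    # Removed neurons vanish: their rows in w1/b1 and columns in w2 are dropped.
    return w1[kept], b1[kept], w2[:, kept]
```

Unlike folding, the contribution of every removed neuron is discarded rather than merged into a surviving neuron.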

Comparison with structured magnitude pruning and IFM

Comparison with IFM and structured magnitude pruning. Tested on ResNet18 (top row) and VGG11-BN (bottom row) trained on CIFAR10 (left column) and ImageNet (right column), model folding outperforms IFM, and its advantage grows at higher sparsity and as dataset difficulty increases.

Comparison with IFM and INN

Comparison of model folding with IFM and INN using ResNet18 on CIFAR10. In the original experimental setup of the IFM and INN papers, where only the last two blocks of ResNet18 are pruned, folding is significantly better than INN; it matches IFM at lower sparsities and becomes significantly better at higher sparsities.

Folding LLaMA-1-7B

Performance of structured pruning methods on LLaMA-7B without post-tuning, showing perplexity on WikiText2 and zero-shot performance across tasks. The "Average" is computed over four tasks. "Wanda_sp" represents an adapted Wanda method for structured pruning. Despite not using data or fine-tuning, model folding achieves comparable performance to data-driven methods.

📑 Citation


@inproceedings{wang2025forget,
  title     = {Forget the Data and Fine-tuning!\\Just Fold the Network to Compress},
  author    = {Dong Wang and Haris \v{S}iki\'{c} and Lothar Thiele and Olga Saukh},
  booktitle = {Proceedings of the International Conference on Learning Representations (ICLR)},
  year      = {2025},
  url       = {https://openreview.net/forum?id=W2Wkp9MQsF}
}

Model Folding Team w/ ❤️