Forget the Data and Fine-tuning! Just Fold the Network to Compress

 

Accepted by ICLR 2025
Dong Wang*,1, Haris Šikić*,1, Lothar Thiele2, Olga Saukh1,3
1Graz University of Technology, Austria
2ETH Zurich, Switzerland
3Complexity Science Hub Vienna, Austria
*Equal contribution

 

Abstract

Model Folding is a data-free model compression technique that merges structurally similar neurons across layers, reducing model size without fine-tuning or training data. It preserves data statistics using k-means clustering and novel variance control techniques. Experiments on ResNet18 and LLaMA-7B show that Model Folding matches data-driven compression methods and outperforms recent data-free approaches, especially at high sparsity levels, making it ideal for resource-constrained deployments.

 

 

How does it work?

Models trained with SGD tend to have correlated patterns or similar parameters in weight space. The top-right plot in the following figure shows the 3×3 filter weights in conv1 of a pre-trained ResNet18.

 

Inspired by the structural similarities in pre-trained models, we propose model folding, which clusters similar neurons and merges them instead of zeroing them out. We also propose two data-free repair algorithms that correct the BatchNorm statistics of a folded model.
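A standard variance identity gives one way to see why merging similar neurons calls for a repair step. The notation here (n merged activations a_1, ..., a_n, a shared variance σ², a shared pairwise correlation ρ) is assumed for illustration and is not taken from the text above; this is a generic sketch, not necessarily the exact rule used by the two repair algorithms.

```latex
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} a_i\right)
  = \frac{1}{n^{2}}\Bigl(n\,\sigma^{2} + n(n-1)\,\rho\,\sigma^{2}\Bigr)
  = \sigma^{2}\,\frac{1 + (n-1)\,\rho}{n}
  \;\le\; \sigma^{2},
\qquad
s = \sqrt{\frac{n}{1 + (n-1)\,\rho}} .
```

Unless the merged activations are perfectly correlated (ρ = 1), their average has a smaller variance than each original activation. Under these assumptions, multiplying the merged activation by s restores the original variance without needing any data.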

 

Folding a model takes only three steps: cluster, merge, and repair. No data and no fine-tuning are required.
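The cluster and merge steps can be sketched on a toy two-layer ReLU network. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`kmeans`, `fold_hidden_layer`, `forward`) are hypothetical, the k-means is a tiny pure-Python version with a deterministic farthest-point initialisation, and the repair step is left out because the toy network has no BatchNorm layers.

```python
def kmeans(rows, k, iters=20):
    """Tiny Lloyd's k-means with deterministic farthest-point initialisation."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centers = [list(rows[0])]
    while len(centers) < k:
        # Next center: the row farthest from its nearest existing center.
        far = max(rows, key=lambda r: min(dist2(r, c) for c in centers))
        centers.append(list(far))

    labels = [0] * len(rows)
    for _ in range(iters):
        # Assign every row to its nearest center.
        labels = [min(range(k), key=lambda j: dist2(r, centers[j])) for r in rows]
        # Move every non-empty center to the mean of its members.
        for j in range(k):
            members = [r for r, l in zip(rows, labels) if l == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def fold_hidden_layer(W1, b1, W2, k):
    """Fold the hidden layer of y = W2 @ relu(W1 @ x + b1) down to k neurons."""
    # Cluster: each hidden neuron is described by its incoming weights and bias.
    rows = [list(w) + [b] for w, b in zip(W1, b1)]
    labels, centers = kmeans(rows, k)
    # Merge (incoming side): every cluster is replaced by its centroid.
    W1_folded = [c[:-1] for c in centers]
    b1_folded = [c[-1] for c in centers]
    # Merge (outgoing side): sum the next layer's columns within each cluster.
    W2_folded = [
        [sum(w for w, l in zip(out_row, labels) if l == j) for j in range(k)]
        for out_row in W2
    ]
    return W1_folded, b1_folded, W2_folded


def forward(W1, b1, W2, x):
    """Forward pass of the toy two-layer ReLU network."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]


if __name__ == "__main__":
    # Four hidden neurons, two exact duplicates of each kind.
    W1 = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    b1 = [0.0, 0.0, 0.5, 0.5]
    W2 = [[1.0, 2.0, 3.0, 4.0]]
    W1f, b1f, W2f = fold_hidden_layer(W1, b1, W2, k=2)
    x = [2.0, 3.0]
    print(forward(W1, b1, W2, x), forward(W1f, b1f, W2f, x))
```

When the neurons in a cluster are exact duplicates, as in this example, the folded network computes the same function as the original. When they are only similar, the centroid is an approximation, and that is the situation the repair step is meant to address.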

[Animation: model folding]

 

 

Results

 

We compared model folding to other SOTA methods including:

 

Comparison with structured magnitude pruning and IFM

 

Comparison with IFM and structured magnitude pruning. Model folding is tested on ResNet18 (top row) and VGG11-BN (bottom row) trained on CIFAR10 (left column) and ImageNet (right column). It outperforms IFM at higher sparsity and as dataset difficulty increases.

 

 

Comparison with IFM and INN

Comparison of model folding with IFM and INN using ResNet18 on CIFAR10. In the original experimental setup defined in the IFM and INN papers, where only the last two blocks of a ResNet18 are pruned, folding is significantly better than INN. It matches the performance of IFM at lower sparsities and becomes significantly better at higher sparsities.

 

Folding LLaMA-1-7B

 

Performance of structured pruning methods on LLaMA-7B without post-tuning, showing perplexity on WikiText2 and zero-shot performance across tasks. The "Average" is computed over four tasks. "Wanda_sp" represents an adapted Wanda method for structured pruning. Despite not using data or fine-tuning, model folding achieves comparable performance to data-driven methods.

 

 


