Neural Network Pruning Explained

Reduce on-CPU prediction and model storage costs by zeroing-out weights while minimally increasing the loss.

Pruning performs lossy compression of the model during repeated training steps to reduce prediction costs, and model size. In decision trees, pruning at first improves test accuracy (generalization), until a maximum is reached, after which it decreases accuracy, which corresponds to bias-variance trade-off. Neural networks use various regularization allowing high over-parametrization without overfitting, but we can still cut computation cost with pruning.

Neural Network Pruning Explained
JS disabled! Watch Neural Network Pruning Explained on Youtube
Watch video "Neural Network Pruning Explained"

General Pruning Steps

In the first step we find or pre-select some initial network. Then we iteratively train, prune, and repeat. We end after the training step, as the weights are optimized for the pruned network, which has a different architecture in which pruned weight are fixed to zero.

pruning steps
pruning steps

Decision Trees

Decision trees are composed of a tree of if-else statements of the input variables only without use of any internal representations (embeddings), and consequently they are highly interpretable. The if-else condition are of type: input variable x is greater than value z. They overfit without regularization is applied by encoding entire dataset. Commonly regularization methods are pruning, minimum samples per leaf node, or maximum tree depth. Popular algorithms available:

decision tree regression with data points and two maximum depths
decision tree regression with data points and two maximum depths

Minimal Cost-Complexity Pruning in CART Decision Trees

Minimal Cost-Complexity Pruning is a greedy algorithm, which iteratively removes the best to prune subtrees until reaching a specified limit alpha. Increasing the pruning intensity by increasing alpha, improves generalization up to a point and then leads to decrease in generalization.

decision tree cost complexity pruning improves test accuracy until a maximum, scikit docs
decision tree cost complexity pruning improves test accuracy until a maximum, scikit docs

Minimal Cost-Complexity Pruning Algorithm

For each non-terminal node t and we can calculate cost complexity of its subtree: def cost_complexity(t): misclassification_rate(t) + alpha * n_terminal_nodes(t)

We start with alpha_j of 0 and increase it until we find a node, for which cost_complexity(t) would be lower if pruned, and so we prune the node’s subtree. We repeat this until we reach a specified limit alpha. Note that cost_complexity above has similarities to lasso regularization.

decision tree pruning example (Kijsirikul 2001)
decision tree pruning example (Kijsirikul 2001)

Decision Tree vs Neural Network

neural network as a multivariate decision tree for a parabola dataset
neural network as a multivariate decision tree for a parabola dataset

Pruning in Neural Networks

In general, we start with a random initialization or a pretrained model of a certain architecture and then prune the model, producing sparser architecture, and train again. Various methods exists:

  • Unstructured Pruning prunes individual weights. Structured Pruning prunes entire architectural blocs (layers, heads). Semi-structured Pruning prunes square blocks of weights.
  • Magnitude Pruning prunes the smallest weights, the most obvious idea.
  • Pruning based on estimating change of loss corrensponding to a task are detailed below.

Pruning Neural Networks by Estimating Loss Change

Similar to decision trees, we can zero-out weights that don’t change the loss based on Taylor series approximation. First-order Pruning prunes based on training loss gradients. For example Movement Pruning removes those weights with gradient pointing towards zero during fine-tuning on a downstream task.

If the network weights are optimized and the first order gradient is zero, we need to use the second order derivative. Second-order Pruning prunes based on Hessian prunes by minimization of the second-order loss change approximation e.g., M-FAC method.

Pruning Methods in Neural Networks

pruning synapses, neurons, layers
pruning synapses, neurons, layers

Optimal BERT Surgeon Pruning Method

oBERT paper from Neural Magic achieves the same latency on 4-Core CPU as on A100 GPU with batch size 1, metrics within 1% difference, and 10x smaller BERT model. oBERT extends Second-order to Structured Pruning, reusing Fisher Information Matrix approximation of the Hessian, speeds ups inversion with WSM inversion formula and block-wise approximation, and fine-tunes with Knowledge Distillation. Fine-tuning for another task on pruned models is possible, but requires distillation to preserve the accuracy. Initial training stays dense and there is additional step of pruning fine-tune, so pruning helps only with prediction and not training. oBERT optimized for CPU achieves 8.4x speed up, while when tuned for GPU only 2.3x speed up.

optimal bert surgeon evaluation  comparission of compression methods  obert paper
optimal bert surgeon evaluation comparission of compression methods obert paper

Neural Magic Engine

Neural Magic Engine is a closed source inference software for sparse models optimized for CPU. Popular models like BERT and Resnet50 in vision and text available for experimentation on their website.

But once FLOPs are reduced, the computation becomes more “memory bound”, that is, there is less compute per data item, so the cost of data movement in and out of memory becomes crucially important. ... rather than execute the network layer after layer, we are able to execute the neural network in depth-wise stripes we call Tensor Columns. ... Each Tensor Column stays completely in the CPU’s large fast caches along the full execution length of the neural network, ... In this way, we almost completely avoid moving data in and out of memory.
From Nvidia: GPU vs CPU in CUDA documentation
From Nvidia: GPU vs CPU in CUDA documentation

Neural Magic Engine CPU vs GPU Costs Estimate

In my experiment, oBERT outperformed Distilbert latency 2.3x on my CPU with 12 cores with batch size 1, while oBERT should have higher accuracy. Predicting on 4-Core CPU should save 2x to 4x costs with more device availability in contrast to T4 GPU on Ohio. See below table for AWS Ohio pricing.

Instance name On-Demand hourly rate vCPU Memory
g4dn.xlarge (T4 GPU) $0.526 4 16 GiB
t4g.xlarge (4 core CPU) $0.1344 4 16 GiB


Sparsification libraries SparseML are Apache 2 licenced. But the inference engine has restrictive licence not for production use (“solely for evaluation use in development and testing environments, and not for production use.”)

Neural Magic Management

Nir Shavit won Dijkstra price and Godel price for research in computer science, mainly in memory and parallel computation. Other ML Performance Research Papers on Neural Magic

Neural Magic Solution Summary

  • could save 2x money minus pricing of NeuralMagic by inferring on commodity CPU with similar time cost as on GPU.
    • the next best thing would be just use similar methods without their specific technology
    • How much they charge?
  • training still requires GPU (often not important)
  • vendor lock in
  • pricing?

Created on 23 Oct 2022. Updated on: 30 Mar 2024.
Thank you

About Vaclav Kosar How many days left in this quarter? Twitter Bullet Points to Copy & Paste Averaging Stopwatch Privacy Policy
Copyright © Vaclav Kosar. All rights reserved. Not investment, financial, medical, or any other advice. No guarantee of information accuracy.