Pruning performs **lossy compression** of the model during repeated training steps to **reduce prediction costs, and model size**.
In decision trees, pruning **at first improves test accuracy (generalization)**, until a maximum is reached, after which it decreases accuracy, which corresponds to bias-variance trade-off.
Neural networks use various regularization **allowing high over-parametrization without overfitting**, but we can still cut computation cost with pruning.

## General Pruning Steps

In the first step we find or pre-select some initial network.
Then we iteratively **train, prune, and repeat**.
We **end after the training step**, as the weights are optimized for the pruned network, which has **a different architecture** in which pruned weight are **fixed to zero**.

## Decision Trees

Decision trees are composed of **a tree of if-else statements of the input variables** only without use of any internal representations (embeddings), and consequently they are highly **interpretable**.
The if-else condition are of type: **input variable x is greater than value z**.
They **overfit without regularization** is applied by encoding entire dataset.
Commonly **regularization methods are pruning, minimum samples per leaf node, or maximum tree depth**.
Popular algorithms available:

- C4.5 and its predesor ID3, extension C5.0. C4.5 is one of the top 10 algorithms listed in ML in 2008 survey paper.
- CART another popular similar approach. Used by Scikit-learn.

### Minimal Cost-Complexity Pruning in CART Decision Trees

Minimal Cost-Complexity Pruning is **a greedy algorithm, which iteratively removes the best to prune subtrees** until reaching a specified limit `alpha`

.
Increasing the pruning intensity by increasing alpha, improves generalization up to a point and then leads to decrease in generalization.

#### Minimal Cost-Complexity Pruning Algorithm

For each non-terminal node `t`

and we can calculate cost complexity of its subtree:
`def cost_complexity(t): misclassification_rate(t) + alpha * n_terminal_nodes(t)`

We start with `alpha_j`

of `0`

and increase it until we find a node, **for which cost_complexity(t) would be lower if pruned**, and so we prune the node’s subtree.
We repeat this until we reach a specified limit

`alpha`

.
Note that `cost_complexity`

above has similarities to lasso regularization.## Decision Tree vs Neural Network

- Vanilla
**decision trees split on the input variables only**and have no embeddings, while neural networks also train linear transformations of the values and thus create embeddings. - Neural networks with piecewise activation functions (e.g. ReLU) are equivalent to extension of a decision tree with linear manifold decision boundaries called Multivariate Decision Trees.

## Pruning in Neural Networks

In general, we start with a random initialization or a pretrained model of a certain architecture and then prune the model, producing sparser architecture, and train again. Various methods exists:

**Unstructured Pruning**prunes individual weights.**Structured Pruning**prunes entire architectural blocs (layers, heads).**Semi-structured Pruning**prunes square blocks of weights.**Magnitude Pruning prunes the smallest weights**, the most obvious idea.- Pruning based on estimating change of loss corrensponding to a task are detailed below.

### Pruning Neural Networks by Estimating Loss Change

Similar to decision trees, we can zero-out weights that don’t change the loss based on Taylor series approximation.
**First-order Pruning prunes based on training loss gradients**.
For example Movement Pruning removes those weights with gradient pointing towards zero during fine-tuning on a downstream task.

If the network weights are optimized and the first order gradient is zero, we need to use the second order derivative.
**Second-order Pruning prunes based on Hessian** prunes by minimization of the second-order loss change approximation e.g., M-FAC method.

### Pruning Methods in Neural Networks

- Learning both Weights and Connections for Efficient Neural Networks paper, starts with random initialization, trains, prunes the smallest weights (small magnitude), and then crucially trains again the new network, and repeat. Tensorflow pruning API uses this method and sparsifies the layer’s weights during training.
- The Lottery Ticket Hypothesis paper prunes 90% its smallest weights, and
**finds a subnetworks in the random initialization**, that trains to near optima. - A notable oBERT method of 2022 is described below.

### Optimal BERT Surgeon Pruning Method

oBERT paper from Neural Magic achieves **the same latency on 4-Core CPU as on A100 GPU** with batch size 1, metrics within 1% difference, and 10x smaller BERT model.
oBERT extends **Second-order to Structured Pruning**, reusing Fisher Information Matrix **approximation of the Hessian**, speeds ups inversion with WSM inversion formula and block-wise approximation, and fine-tunes with **Knowledge Distillation**.
**Fine-tuning for another task on pruned models is possible**, but requires distillation to preserve the accuracy.
**Initial training stays dense** and there is additional step of pruning fine-tune, so pruning helps only with **prediction and not training**.
oBERT optimized for CPU achieves 8.4x speed up, while when tuned for GPU only 2.3x speed up.

### Neural Magic Engine

Neural Magic Engine is a **closed source inference software** for sparse models optimized for CPU.
Popular models like BERT and Resnet50 in vision and text available for experimentation on their website.

But once FLOPs are reduced, the computation becomes more “memory bound”, that is, there is less compute per data item, so the cost of data movement in and out of memory becomes crucially important. ... rather than execute the network layer after layer, we are able toexecute the neural network in depth-wise stripes we call Tensor Columns. ... Each Tensor Column stays completely in the CPU’s large fast cachesalong the full execution length of the neural network, ... In this way, we almost completely avoid moving data in and out of memory.

#### Neural Magic Engine CPU vs GPU Costs Estimate

**In my experiment**, oBERT outperformed Distilbert latency 2.3x on my CPU with 12 cores with batch size 1, while oBERT should have higher accuracy.
**Predicting on 4-Core CPU should save 2x to 4x costs** with more device availability in contrast to T4 GPU on Ohio.
See below table for AWS Ohio pricing.

Instance name | On-Demand hourly rate | vCPU | Memory |

g4dn.xlarge (T4 GPU) | $0.526 | 4 | 16 GiB |

t4g.xlarge (4 core CPU) | $0.1344 | 4 | 16 GiB |

#### Licence

Sparsification libraries SparseML are Apache 2 licenced. But **the inference engine** has restrictive licence not for production use (“solely for evaluation use in development and testing environments, and not for production use.”)

#### Neural Magic Management

**Nir Shavit won Dijkstra price and Godel price** for research in computer science, mainly in memory and parallel computation.
Other ML Performance Research Papers on Neural Magic

#### Neural Magic Solution Summary

- could save 2x money minus pricing of NeuralMagic by inferring on commodity CPU with similar time cost as on GPU.
- the next best thing would be just use similar methods without their specific technology
- How much they charge?

- training still requires GPU (often not important)
- vendor lock in
- switching cost?
- no clear alternative to me now (maybe Vertigo Applied Intelligence - Vertigo.ai ?)

- pricing?