In the development of the python confidence interval library, for the analytic confidence intervals of some metrics I've been relying on results from the remarkable paper Confidence interval for micro-averaged F1 and macro-averaged F1 scores by Kanae Takahashi, Kouji Yamamoto, Aya Kuchiba and Tatsuki Koyama.
In the paper they derive confidence intervals for Micro F1 and Macro F1 (and by extension for Micro Precision and Micro Recall, since those are equal to Micro F1).
However, there are a few common variants that the paper didn't address: binary F1, macro precision and macro recall.
The next sections derive the confidence intervals for these missing metrics in the spirit of the paper above, using the delta method. Some of the sections are a bit more verbose than the paper, which (elegantly) combines some steps together; I found it helpful to break things down a bit more.
These were implemented in the python confidence interval library.
This is the computation flow we're going to go through:
\(C_{ij}\) is the confusion matrix: the number of predictions with ground truth category i that were actually predicted as j. Note that here we keep the scikit-learn notation, instead of the notation in the paper, which is transposed. We have an actual observed confusion matrix \(\hat{C}_{ij}\), but we assume it was sampled from a distribution \(C_{ij}\).
The core assumption here is that \(C_{ij}\) has a multinomial distribution with parameters \(p_{ij}\).
\[E(C_{ij}) = n p_{ij}\] \[Cov(C_{ij}, C_{ij}) = Var(C_{ij}) = np_{ij}(1-p_{ij})\]And when \((i,j) \neq (k,l)\):
\[Cov(C_{ij}, C_{kl}) = -np_{ij}p_{kl}\]By combining the two cases above, and treating \(p\) as a flattened vector, the covariance matrix of the multinomial distribution can be written as: \(Cov(C) = n[diag(p) - pp^T]\)
We don't know what \(p_{ij}\) actually is. But our best guess for it, the maximum likelihood estimator, is:
\(\hat{p_{ij}} = \frac{C_{ij}}{n}\).
\(n = \sum_{ij}C_{ij}\) is the total number of predictions.
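As a concrete sketch (with a made-up 2x2 confusion matrix), the estimate and its multinomial covariance from the formulas above can be computed like this:

```python
import numpy as np

# A hypothetical 2x2 confusion matrix (scikit-learn convention:
# rows are ground truth, columns are predictions).
C = np.array([[40, 10],
              [5, 45]])
n = C.sum()

# Maximum likelihood estimate of the multinomial parameters,
# flattened into a vector.
p = (C / n).flatten()

# Covariance of the estimate: (diag(p) - p p^T) / n.
cov = (np.diag(p) - np.outer(p, p)) / n
```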
\(C_{ij}\) can be seen as the sum of n individual trial binary variables \(X_{ij}\), where \(X_{ij}=1\) with probability \(p_{ij}\).
\[\hat{p}_{ij} = \frac{\sum_{k=1}^{n}{X_{ijk}}}{n} = \frac{\hat{C}_{ij}}{n}\]From the central limit theorem, since \(\hat{p}_{ij}\) is the average of many variables, we know that it approximately has a normal distribution. We also know its mean and covariance, since \(\hat{p}_{ij} = \frac{\hat{C}_{ij}}{n}\) and we know from above what the distribution of \(C_{ij}\) is.
\[E[\hat{p}] = \frac{E[C]}{n}, Cov(\hat{p}) = \frac{Cov(C)}{n^2}\] \[\hat{p} \sim Normal(p, \frac{diag(p) - pp^T}{n})\]Binary F1: \(metric(p) = F1_{binary} = \frac {2p_{11} }{2p_{11} + p_{01} + p_{10}} = \frac {2p_{11} }{d}\)
Macro Recall: \(metric(p) = R = \frac{1}{r}\sum_{i=1}^{r} \frac{p_{ii}}{\sum_j{p_{ij}}}\)
Macro Precision: \(metric(p) = P = \frac{1}{r}\sum_{i=1}^{r} \frac{p_{ii}}{\sum_j{p_{ji}}}\)
\(r\) = number of categories
If we plug in our estimate for p, we get the (point estimation of the) metric.
The various metrics above are functions \(metric(\hat{p})\). We know from above that \(\hat{p}\) approximately has a normal distribution. The multivariate delta method gives us a recipe for the distribution of \(metric(\hat{p})\):
\[metric(\hat{p}) \sim Normal(metric(p), \frac{\partial metric(p)^T}{\partial p} Cov(p) \frac{\partial metric(p)}{\partial p})\]We also know from above that \(Cov(p) = \frac {diag(p) - pp^T}{n}\)
Now the only thing missing is to compute those derivatives!
\(\frac{\partial f1}{\partial p_{10}} = \frac{\partial f1}{\partial p_{01}} = -2\frac {p_{11}} {d^2} = -\frac {f1} {d}\) \(\frac{\partial f1}{\partial p_{11}} = \frac 2 {d} - \frac {4 p_{11}} {d^2} = \frac {2(1-f1)} {d}\)
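Putting the pieces together, a minimal numpy sketch of the delta-method interval for binary F1 (the confusion matrix counts here are made up, and 1.96 is the normal quantile for a 95% interval; the library's actual implementation may differ):

```python
import numpy as np

# Hypothetical binary confusion matrix: rows = ground truth, cols = predicted.
C = np.array([[40, 10],
              [5, 45]])
n = C.sum()
p = C / n

# Point estimate: F1 = 2 p11 / (2 p11 + p01 + p10).
d = 2 * p[1, 1] + p[0, 1] + p[1, 0]
f1 = 2 * p[1, 1] / d

# Gradient of F1 with respect to the flattened (p00, p01, p10, p11).
grad = np.array([0.0, -f1 / d, -f1 / d, 2 * (1 - f1) / d])

# Delta method: Var(f1_hat) = grad^T Cov(p_hat) grad.
p_flat = p.flatten()
cov = (np.diag(p_flat) - np.outer(p_flat, p_flat)) / n
var = grad @ cov @ grad
ci = (f1 - 1.96 * np.sqrt(var), f1 + 1.96 * np.sqrt(var))  # 95% interval
```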
The code can be found here.
\(metric(p) = R = \frac{1}{r}\sum_{i=1}^{r} \frac{p_{ii}}{\sum_j{p_{ij}}} =\frac{1}{r}\sum_{i=1}^{r} \frac{p_{ii}}{d_i} = \frac{1}{r}\sum_{i=1}^{r}R_i\)
\(\frac{\partial R}{\partial p_{ii}} =\frac{1}{r} [\frac{1}{d_i} - \frac{p_{ii}}{d_i^2}] = \frac{1}{r} \frac{1-R_i}{d_i}\) \(\frac{\partial R}{\partial p_{ij}} = -\frac{1}{r} \frac{p_{ii}}{d_i^2} = -\frac{1}{r} \frac{R_i}{d_i}\) (for \(j \neq i\))
In terms of computation, the off-diagonal elements of row i all share the same expression in that row's \(R_i\).
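A small numpy sketch of the macro recall point estimate and gradient (the probability matrix here is made up):

```python
import numpy as np

# Hypothetical 3-class probability estimate of the confusion matrix.
p = np.array([[0.30, 0.03, 0.02],
              [0.04, 0.25, 0.01],
              [0.02, 0.03, 0.30]])
r = p.shape[0]

d = p.sum(axis=1)         # row sums d_i
R_i = np.diag(p) / d      # per-class recall
R = R_i.mean()            # macro recall

# Off-diagonal entries of row i all share the value -R_i / (r d_i);
# the diagonal entry of row i is (1 - R_i) / (r d_i).
grad = -(R_i / d)[:, None] / r * np.ones((r, r))
grad[np.arange(r), np.arange(r)] = (1 - R_i) / (r * d)
```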
The code can be found here
\(metric(p) = P = \frac{1}{r}\sum_{i=1}^{r} \frac{p_{ii}}{\sum_j{p_{ji}}} =\frac{1}{r}\sum_{i=1}^{r} \frac{p_{ii}}{d_i} = \frac{1}{r}\sum_{i=1}^{r}P_i\)
\[\frac{\partial P}{\partial p_{ii}} =\frac{1}{r} [\frac{1}{d_i} - \frac{p_{ii}}{d_i^2}] = \frac{1}{r} \frac{1-P_i}{d_i}\] \[\frac{\partial P}{\partial p_{ji}} = -\frac{1}{r} \frac{p_{ii}}{d_i^2} = -\frac{1}{r} \frac{P_i}{d_i}\]Note how for precision we differentiate with respect to \(p_{ji}\) instead of \(p_{ij}\). In terms of computation, the off-diagonal elements of column i all share the same expression in that column's \(P_i\).
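Since precision is just recall on the transposed confusion matrix, the sketch mirrors the recall one, with column sums instead of row sums (again with a made-up probability matrix):

```python
import numpy as np

p = np.array([[0.30, 0.03, 0.02],
              [0.04, 0.25, 0.01],
              [0.02, 0.03, 0.30]])
r = p.shape[0]

d = p.sum(axis=0)         # column sums d_i
P_i = np.diag(p) / d      # per-class precision
P = P_i.mean()            # macro precision

# Off-diagonal entries (j, i) of column i share -P_i / (r d_i);
# the diagonal entry of column i is (1 - P_i) / (r d_i).
grad = -(P_i / d)[None, :] / r * np.ones((r, r))
grad[np.arange(r), np.arange(r)] = (1 - P_i) / (r * d)
```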
The code can be found here
In the last few months before writing this post, there seems to be a sort of a breakthrough in bringing Transformers into the world of Computer Vision.
To list a few notable works about this:
If I can make a prediction for 2021 - in the next year we are going to see A LOT of papers about using Transformers in vision tasks (feel free to comment here in one year if I'm wrong).
But what is going on inside Vision Transformers? How do they even work? Can we poke at them and dissect them into pieces to understand them better?
"Explainability" might be an ambitious and overloaded term that means different things to different people, but when I say Explainability I mean the following things:
(useful for the developer) What's going on inside when we run the Transformer on this image? Being able to look at intermediate activation layers. In computer vision these are usually images! They are somewhat interpretable, since you can display the different channel activations as 2D images.
(useful for the developer) What did it learn? Being able to investigate what kind of patterns (if any) the model learned. Usually this is framed as the question "What input image maximizes the response from this activation?", and you can use variants of "Activation Maximization" for that.
(useful for both the developer and the user) What did it see in this image? Being able to answer "What part of the image is responsible for the network prediction?", sometimes called "Pixel Attribution".
So we are going to need this for Vision Transformers as well!
In this post we will go over my attempt to do this for Vision Transformers.
Everything here is going to be done with the recently released "DeiT Tiny" model from Facebook, i.e.:
model = torch.hub.load('facebookresearch/deit:main', 'deit_tiny_patch16_224', pretrained=True)
And we are going to assume 224x224 input images to make it easier to follow the shapes, although it doesn't have to be that size.
Python code is released here: https://github.com/jacobgil/vit-explain
The rest of this post assumes you understand how Vision Transformers work.
They are basically vanilla Transformers, but the image is split into 14x14 = 196 tokens, where every token represents a 16x16 patch from the image.
Before continuing, you might want to read the two papers given above, and these blog posts about them:
https://ai.googleblog.com/2020/12/transformers-for-image-recognition-at.html
https://ai.facebook.com/blog/data-efficient-image-transformers-a-promising-new-technique-for-image-classification
A Vision Transformer is composed of a few Encoding blocks, where every block has:
It's as simple as this, taken from Ross Wightman's amazing PyTorch Image Models package implementation of Vision Transformers:
def forward(self, x):
    x = x + self.drop_path(self.attn(self.norm1(x)))
    x = x + self.drop_path(self.mlp(self.norm2(x)))
    return x
Inside every attention head (the "DeiT Tiny" model has 3 attention heads in every layer), the players are Q, K and V.
The shape of each of these is 3x197x64:
There are 3 attention heads.
Each attention heads sees 197 tokens.
Every token has a feature representation of length 64.
Among these 197 tokens, 196 represent the original 14x14=196 image patches, and the first token is a class token that flows through the Transformer and will be used at the end to make the prediction.
For every attention head separately, if we look inside the second dimension with 197 tokens, we can peek at the last 14x14=196 tokens.
This gives us an image of size 14x14x64 which we can then visualize.
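As a sketch of the reshaping involved (with random arrays standing in for the real activations, which you would capture with a forward hook):

```python
import numpy as np

# Hypothetical activations: inside one attention layer of deit_tiny,
# q (and k, v) have shape (num_heads=3, num_tokens=197, head_dim=64)
# for a single input image.
q = np.random.randn(3, 197, 64)

head = 0
# Drop the class token (index 0) and fold the remaining 196 patch tokens
# back into their 14x14 spatial layout, one 14x14 image per channel.
q_img = q[head, 1:, :].reshape(14, 14, 64)
channel_26 = q_img[:, :, 26]   # one channel, displayable as a 2D image
```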
Each row of Q and K is a 64-length feature vector that represents a location in the image.
We can then think of Q, K and V in the following way: for every image patch with query \(q_i\), information is going to flow from locations in the image that have keys \(k_j\) similar to that \(q_i\).
image from http://jalammar.github.io/illustrated-transformer
Let's look at an example.
The input to the network is this image of a plane:
We can now look at the Q and K images in different layers, and visualize them for one of the 64 channels c.
This activation vector is going to be a 14x14 image, with positive and negative values, that seem to be in the range [-5, 5].
Here is a tricky part:
For every location j in K (remember that it comes from one of the 14x14 patches in the original image), we can ask âhow is that location going to spread information to other parts of the image?â
Since we take the dot product between the token vectors (every \(q_{i}\) and \(k_{j}\)), there are two scenarios:
Two tokens, in the same channel c, \(q_{ic}\) and \(k_{jc}\), have the same sign (both are positive or negative)- their multiplication is positive.
This means that the image location j and channel c - \(k_{jc}\) - is going to contribute to flowing information into that image location \(q_{i}\).
Two tokens, in the same channel c, \(q_{ic}\) and \(k_{jc}\), have different signs (one is positive and one is negative)- their multiplication is negative.
This means that the image location j and channel c - \(k_{jc}\) - is NOT going to contribute to flowing information into that image location \(q_{i}\).
To contrast the negative and positive pixels, we're going to pass every image through a torch.nn.Sigmoid()
layer (the bright values are positive, the dark values are negative).
From looking at the Q, K visualizations for different channels, I think two kinds of patterns emerge.
Layer 8, channel 26, first attention head:
Query image | Key image | Original |
---|---|---|
For most locations in the Query image, since they are positive, information is going to flow to them only from the positive locations in the Key image - that come from the Airplane.
Q, K here are telling us -
We found an airplane, and we want all the locations in the image to know about this!
Layer 11, channel 59, first attention head:
Query image | Key image | Original |
---|---|---|
The information flows in two directions here:
The top part of the plane (negative values in the Key) is going to spread into all the image (negative values in the Query).
Hey, we found this plane, let's tell the rest of the image about it.
Information from the âNon Planeâ parts of the image (positive values in the Key) is going to flow into the bottom part of the Plane (positive values in the Query).
Let's tell the plane more about what's around it.
Another thing we can do is visualize how the attention flows for the class token, in different layers in the network.
Since we have multiple attention heads, to keep it simple we will just look at the first one.
The attention matrix (\(QK^T\)) has a shape of 197x197.
If we look at the first row (shape 197) and discard the first value (leaving shape 196=14x14), that's how the information flows from the different locations in the image to the class token.
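As a sketch (with a random matrix standing in for a real post-softmax attention matrix):

```python
import numpy as np

# Hypothetical attention matrix for one head: softmax(Q K^T / sqrt(d)),
# shape (197, 197). Row 0 is the class token's attention over all tokens.
A = np.random.rand(197, 197)
A = A / A.sum(axis=1, keepdims=True)   # rows sum to 1, like a softmax

# How much each of the 196 image patches contributes to the class token:
class_attention = A[0, 1:].reshape(14, 14)
```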
Here is how the class attention activations look through the layers:
It looks like from layer 7 the network was able to segment the plane pretty well.
However - if we look at consecutive layers, some plane parts are lost, and then re-appear again:
And we can thank the residual connections for this!
Although the attention suddenly discarded parts of the plane (the middle image above), we don't lose that information, since we have a residual connection from the previous layer.
The images above show us how individual activations look, but they don't show us how the attention flows from start to end throughout the Transformer.
To quantify this we can use a technique called "Attention Rollout" from Quantifying Attention Flow in Transformers, by Samira Abnar and Willem Zuidema.
This is also what the authors at An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale suggested.
At every Transformer block we get an attention Matrix \(A_{ij}\) that defines how much attention is going to flow from token j in the previous layer to token i in the next layer.
We can multiply the matrices between every two layers, to get the total attention flow between them.
However - we also have the residual connections (like we saw in the previous section).
We can model them by adding the identity matrix I to the layer Attention matrices: \(A_{ij} + I\).
The Attention Rollout paper suggests taking the average of the heads. As we will see, it can make sense to use other choices: the minimum, the maximum, or different weights.
Finally we get a way to recursively compute the Attention Rollout matrix at layer L:
\[AttentionRollout_{L} = (A_L + I) \cdot AttentionRollout_{L-1}\]We also have to normalize the rows, to keep the total attention flow at 1.
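A minimal numpy sketch of this procedure, with the head-fusion choice and an optional discarding of the lowest attention values as parameters (the function name is mine; a full PyTorch version lives in the repository linked above):

```python
import numpy as np

def attention_rollout(attentions, head_fusion="mean", discard_ratio=0.0):
    """attentions: list of per-layer attention matrices (post-softmax),
    each of shape (num_heads, tokens, tokens); token 0 is the class token."""
    tokens = attentions[0].shape[-1]
    result = np.eye(tokens)
    for A in attentions:
        fused = {"mean": A.mean, "min": A.min, "max": A.max}[head_fusion](axis=0)
        if discard_ratio > 0:
            # Zero out the lowest attention values, but never the entries
            # attending to the class token (column 0).
            flat = fused.flatten()
            k = int(flat.size * discard_ratio)
            low = flat.argsort()[:k]
            low = low[low % tokens != 0]
            flat[low] = 0
            fused = flat.reshape(tokens, tokens)
        # Model the residual connection with the identity matrix, then
        # re-normalize the rows so the total attention flow stays 1.
        fused = fused + np.eye(tokens)
        fused = fused / fused.sum(axis=1, keepdims=True)
        result = fused @ result
    # How much attention flows from each image patch into the class token.
    return result[0, 1:]
```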
I implemented this and ran it on the recent "Data Efficient" models from Facebook, but the results weren't quite as nice as in the An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale paper.
Results were very noisy, and the attention doesn't seem to focus only on the interesting part of the image.
Trying to get this to work, I noticed two things:
For example, here is how the result looks if we take the minimum value among the attention heads, instead of the mean value suggested in the Attention Rollout paper:
Image | Mean Fusion | Min Fusion |
---|---|---|
Different attention heads look at different things, so I guess taking the minimum removes noise by finding their common denominator.
However, combined with discarding low attention pixels (next section), fusing the attention heads with the maximum operator seems to work best.
Discarding the lowest attention values has a huge effect on how the results look.
Here is how it looks as we increase the portion of attention pixels we discard:
As you can see, the more pixels we remove, the better we are able to isolate the salient object in the image.
Finally, here is how it looks for a few different images:
Image | Vanilla Attention Rollout | With discarding lowest pixels + max fusion |
---|---|---|
Edit - it turns out there is another technique out there about this!
Apart from what I implemented below, I refer you to Hila Chefer's Transformer Interpretability Beyond Attention Visualization and their github repo.
Another question we can ask is - "What in the image contributes to a higher output score for category 42?"
Or in other words, Class Specific Explainability.
When fusing the attention heads in every layer, we can weight all the attentions (in the current implementation it's the attentions after the softmax, but maybe it makes sense to change that) by the target class gradient, and then take the average over the attention heads.
\[A_{ij} * grad_{ij}\]
Another thing we can do, is apply Activation Maximization, to find the kind of image inputs that maximize different parts in the network.
In Vision Transformers the images are split into 14x14 independent patches (each representing 16x16 pixels).
We also see this in the Activation Maximization result below - instead of getting a continuous image, we get 14x14 patches.
Since the positional embeddings are added to the inputs, nearby patches should get more similar outputs.
I think you can see this in the image below - many neighboring patches look similar, but they also have a discontinuity between them.
I guess future work can be about using some kind of a spatial continuity constraint between the patches here (and maybe also incorporate that into how the transformers process the images).
In this post we applied Explainability techniques for Vision Transformers.
This was my attempt to better understand how they work and what's going on inside them.
You can access the code here: https://github.com/jacobgil/vit-explain
I hope you enjoyed.
To train supervised machine learning algorithms, we need: (1) annotated data, and (2) algorithms that learn from data.
Most of the focus of the machine learning community is about (2), creating better algorithms for learning from data. But getting useful annotated datasets is difficult. Really difficult. It can be expensive, time consuming, and you still end up with problems like annotations missing from some categories.
I think that being able to build practical machine learning systems is a lot about tools to annotate data, and that a lot of the future innovation in building systems that solve real problems will be about being able to annotate high quality datasets quickly.
Active Learning is a great building block for this, and is in my opinion under-utilized.
In this post I will give a short introduction to Classical Active Learning, and then go over several papers that focus on Active Learning for Deep Learning.
In many scenarios we will actually have access to a lot of data, but it will be infeasible to annotate everything. Just a few examples:
There are two closely related fields that help us deal with these scenarios:
Semi-supervised Learning. Exploit the unannotated data to get better feature representations and improve the algorithms learned on the annotated data.
Active Learning. Choose the data that is going to be annotated.
Image from https://arxiv.org/abs/1703.02910
The image above is a typical image in a paper about active learning. The x axis is the size of the dataset. The y axis is accuracy on the test set.
A good active learning algorithm is able to cleverly select the data we are going to annotate, so that when we train on it, our model will be better than if we had trained on other data.
So if for example we have the constraint of being able to annotate only 400 images, the goal of active learning will be to select the best 400 images to annotate.
The most common scenario considered in the active learning literature, which is also most similar to what happens in real-life problems, is the unlabeled pool scenario. This is a good place to say that I'm going to use the words label and annotation interchangeably, and I'm going to assume we're using images, because the principles are the same for other domains.
The unlabeled pool scenario:
Then we train a model.
In active learning papers, at this stage the model is usually trained from scratch on the new dataset, but in real life you're probably going to continue from the previous model you had.
At this point I want to write something obvious, that's probably worth mentioning anyway. A really good strategy would be to select the images the model is just wrong about. But we can't do that, since we don't know the real labels.
There are two main approaches that most of the active learning works follow. Sometimes they are a combination of the two.
A function that gets an image and returns a ranking score is often called an "acquisition function" in the literature.
Let's look at a few examples.
Let's look at a few classic uncertainty acquisition functions. We will cover some more when we get to the deep learning methods.
Entropy \(H(p) = -\sum p_i Log_2(p_i)\)
\(H( [0.5, 0.5] )\) = 1.0
\(H( [1.0, 0.0] )\) = 0.0
This is probably the most important example to understand. The idea is that when the model output is the same for all the categories, it is completely confused between the categories.
The rank will be highest in this case, because the entropy function is maximized when all its inputs are equal, so we will select the image. \(H\) grows as the probabilities p tend to be more uniform, and shrinks when a few of the categories get most of the probability mass.
Variation Ratio: \(1-max(p)\)
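In code, these two acquisition functions are one-liners (a sketch; the function names are mine):

```python
import math

def entropy(p):
    # Shannon entropy in bits; maximal when the distribution p is uniform.
    return -sum(x * math.log2(x) for x in p if x > 0)

def variation_ratio(p):
    # 1 minus the probability of the most likely category.
    return 1.0 - max(p)
```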
Another concept from classical active learning papers is QBC - Query By Committee. The idea here is that instead of measuring the uncertainty of a single model, we can train an ensemble of many different models (maybe with different seeds, hyper-parameters, or structures). Then, for a given image, we can check if the output changes a lot between models. If it does, the models aren't very consistent about this image, and some of them aren't doing a good job on it.
In the typical QBC implementation, every model decides the output category and votes for it, and a vector with the vote count is created.
(Minimize) The difference between the two categories with the most votes.
(Maximize) Vote Entropy: \(-\sum_c \frac{V(c)}{C}Log(\frac{V(c)}{C})\), where \(V(c)\) is the number of votes for category c and \(C\) is the number of committee members.
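A small sketch of these two committee disagreement measures (the helper names are mine):

```python
import math
from collections import Counter

def vote_entropy(votes):
    """votes: the predicted category from each committee member."""
    C = len(votes)
    return -sum((v / C) * math.log2(v / C) for v in Counter(votes).values())

def vote_margin(votes):
    """Difference between the two most voted categories (minimized to acquire)."""
    counts = sorted(Counter(votes).values(), reverse=True)
    return counts[0] - (counts[1] if len(counts) > 1 else 0)
```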
Combining Active Learning and deep learning is hard.
Deep neural networks aren't really good at telling when they are not sure. The output from the final softmax layer tends to be over-confident.
Deep neural networks are computationally heavy, so you usually want to select a batch with many images to annotate at once.
But the acquisition functions we saw so far tell us how to select the single best image, not a batch of images.
How should we select a batch of images at once? Thatâs an open research question, and we will cover a few batch aware methods below.
Now let's cover a few papers about Active Learning for Deep Learning.
The paper: https://arxiv.org/abs/1703.02910
In my opinion this is currently the most important paper about active learning for deep learning, so we are going to cover it in detail.
The idea is that Bayesian neural networks give better uncertainty measures.
In a Bayesian neural network, every parameter in the model is sampled from a distribution. Then, when doing inference, we need to integrate over all the possible parameters. So weâre using an ensemble of infinite different networks to compute the output.
An uncertainty measure from a single network might be flawed (maybe itâs over confident in the output), but the idea is that going over many networks is going to improve that.
Intuitively, if most models agree about one of the categories, the ensembled network will have high confidence for that category. If they disagree, we will get large outputs for several of the categories.
It's intractable to integrate over all possible parameter values in the distribution, so instead Monte Carlo integration can be used.
With Monte Carlo dropout, the idea is that we will simulate a case where every neuron output has a Bernoulli prior, multiplied by some value M (the actual value of that neuron output).
So a parameter \(i\) is going to be 0 with some probability p, and \(M_i\) otherwise.
Now we can sample from the neuron priors by running the network and applying dropout at test time. If we apply dropout many times and sum the results, we're doing Monte Carlo integration. Let's break this into steps:
We have a Bayesian neural network and an input image x. To get the output for the category c, weâre going over all the possible weight configurations, weighting every configuration by its probability.
We donât know the actual real parameter distribution, but we can approximate them assuming they belong to the Bernoulli distribution. We can simulate the Bernoulli distribution, simply by using Dropout.
To further approximate, we apply Monte Carlo integration, by running the network with dropout many times, and summing the results. As T gets larger, the approximation will get better.
This gives us a simple recipe: to approximate the output probability for every category, run the model many times with dropout and take the average of all the runs.
An example of Uncertainty Sampling
The Entropy uncertainty acquisition function is:
\[H = -\sum_c p(y=c|x)Log(p(y=c|x))\]If we plug in the approximation from above, we get:
\[H \approx-\sum_c(\frac{1}{T}\sum_tp_c^t)Log(\frac{1}{T}\sum_tp_c^t)\]We need to run the network multiple times with dropout, average the outputs, and take the entropy.
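As a sketch, given the softmax outputs of T stochastic forward passes stacked into a (T, C) array (collecting those passes, with dropout enabled at test time, is assumed to happen elsewhere; the helper name is mine):

```python
import numpy as np

def predictive_entropy(probs):
    """probs: (T, C) array of softmax outputs from T forward passes
    with dropout left enabled at test time."""
    mean_p = probs.mean(axis=0)
    # Entropy of the averaged prediction, in bits.
    return -np.sum(mean_p * np.log2(mean_p + 1e-12))
```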
Let's see how this measure behaves in two important cases that achieve a high H:
Every time we run the model with dropout, it's confident about a different category. This means that some of the models were very wrong about the input X. These models are each associated with a set of parameters that survived the dropout.
If we select this image, we can correct those parameters that caused the models to be wrong.
Many of the runs end up being not so confident, and are confused between different categories.
But the many models we sampled actually agree about the input X. They all think it's a confusing example. For example, who knows, maybe there is a large occlusion in the image and we can't clearly see the object we're looking for. How can we ever be correct on this image? If we label this image, how should the model change?
This leads us to a modification that handles the second case:
An example of Uncertainty Sampling
In Uncertainty sampling, the ultimate goal would be to find images the model is wrong about.
We can't do that, since we don't know the labels. So the idea in BALD is to instead find examples that many of the different sampled networks are wrong about.
If we sample many networks using MC Dropout, and they disagree about the output, this means that some of them are wrong.
Let's get to the math. The objective is to find the image that maximizes the mutual information between the model output and the model parameters.
\[I(y; \omega | x, D_{train}) = H(y | x, D_{train}) - E_{p(\omega|D_{train})} [H(y|x, \omega, D_{train})]\]If we plug in the Monte Carlo approximation again, we get:
\[I(y; \omega | x, D_{train}) \approx-\sum_c(\frac{1}{T}\sum_tp_c^t)Log(\frac{1}{T}\sum_tp_c^t) + \frac 1 T \sum_{t, c} p_c^t Log p_c^t\]To get the first term, we make many runs, average the output, and measure the entropy.
To get the second term, we make many runs, measure the entropy of every run, and take the average.
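A sketch of the BALD score, given the same (T, C) array of MC-dropout softmax outputs as before (the helper name is mine):

```python
import numpy as np

def bald_score(probs):
    """probs: (T, C) softmax outputs from T MC-dropout passes.
    Mutual information = entropy of the mean - mean of the entropies."""
    eps = 1e-12
    mean_p = probs.mean(axis=0)
    predictive = -np.sum(mean_p * np.log2(mean_p + eps))
    expected = -np.mean(np.sum(probs * np.log2(probs + eps), axis=1))
    return predictive - expected
```

Note how this separates the two cases above: confident-but-disagreeing passes give a high score, while passes that all agree the example is confusing give a score near zero.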
An example of Uncertainty Sampling
The paper: https://arxiv.org/abs/1905.03677
The main idea here is to try to predict what would be the loss of the learning algorithm for a given input.
A higher loss means it's a more difficult image, and it should be annotated.
They extract features from intermediate layers and combine them to predict the network loss.
To learn the loss, you could just add an additional term to the loss: MSE(Real loss, predicted loss).
But during training the scale of the loss changes all the time (since the loss usually goes down), and it's difficult to learn.
Instead, in this paper they suggest comparing the predicted losses between images.
Then during the active learning rounds, you select the images with the highest loss.
Image from https://arxiv.org/abs/1811.03897
A possible problem with all these methods is that during training your dataset won't necessarily be balanced. The acquisition strategy might favor bringing in more images from some of the categories.
https://arxiv.org/abs/1811.03897 showed this happens for common uncertainty based acquisition functions: random acquisition ends up bringing a much more balanced dataset than active learning.
I think this is a really nice visualization showing that scoring images by how difficult they are won't be good enough, and that we have to somehow also optimize for having a diverse dataset.
Of course, it's not enough to have enough images from all the categories. Ideally we would capture the sub-categories inside every category as well.
This leads us to the next part about "batch aware" methods, where we will see some works that try to combine Uncertainty sampling and Diversity sampling.
Most active learning works select a single image at every active learning round.
However, for expensive/data hungry methods like Deep Learning, we need to select a batch of many images at every round. After all, we're going to train on the entire new dataset and not only on the new images, so we can't afford to train again every time a single new image is added to the dataset.
Problem: Datasets can have many near duplicates. If we just select the top ranking images, we might select many near duplicate images that all rank high.
Image from https://arxiv.org/abs/1906.08158
Next we are going to go over a few "batch aware" methods that try to select a good batch of images.
An example of Diversity sampling
The paper: https://arxiv.org/abs/1708.00489
This paper is an example from the diversity sampling family of active learning algorithms, and is also a "batch aware" method that tries to choose a batch of B images at once.
The main idea here is that we want the training set to capture the diversity in the dataset. To find the data that isn't yet represented well by the training set, we need to find the "Core Set" at every step.
The core set: B images such that, when added to the training set, the largest distance between an image in the unlabeled pool and its closest image in the training set is minimized.
Finding the ideal B images is NP-hard. Instead, a simple greedy approximation is used:
How do we define distances between images? It's an open question with room for creativity.
For images, they use the Euclidean distance between feature vectors extracted from the end of the network.
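A sketch of the greedy k-center selection on feature vectors (a simplified version of the paper's greedy approximation; the names are mine):

```python
import numpy as np

def greedy_k_center(labeled, unlabeled, budget):
    """Greedily pick `budget` indices from `unlabeled` (feature vectors),
    each time taking the point farthest from everything selected so far."""
    # Distance from each unlabeled point to its nearest labeled point.
    dists = np.linalg.norm(
        unlabeled[:, None, :] - labeled[None, :, :], axis=-1).min(axis=1)
    selected = []
    for _ in range(budget):
        i = int(dists.argmax())
        selected.append(i)
        # Update nearest-center distances with the newly selected point.
        new_d = np.linalg.norm(unlabeled - unlabeled[i], axis=1)
        dists = np.minimum(dists, new_d)
    return selected
```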
An example of combining Uncertainty and Diversity Sampling
In BALD, the acquisition function for a single data point was the mutual information between the model parameters and the model output:
\[I(y; \omega | x, D_{train}) = H(y | x, D_{train}) - E_{p(\omega|D_{train})} [H(y|x, \omega, D_{train})]\]In this work the goal is to select a batch of images at once, so they change the acquisition function to:
\[I(y_1,...,y_B; \omega | x_1,..,x_B, D_{train}) = H(y_1,...,y_B | x_1,...,x_B, D_{train}) - E_{p(\omega|D_{train})} [H(y_1,...,y_B|x_1,...,x_B, \omega, D_{train})]\]Let's say we have a batch size of 3, two images that have a high BALD acquisition score, a and b, and another image a' that is a duplicate of a.
If we select the batch images one by one, we will select a, a' and b, since they all have high scores.
a' is redundant, since it already exists in the batch.
BatchBALD won't select a', since it doesn't contribute anything to the total mutual information:
\[I(a,b,a') = I(a, b)\]This encourages adding informative images that are different from the rest of the images in the batch.
Approximating \(H(y_1,...,y_B | x_1,...,x_B, D_{train})\) involves quite a bit of math, so refer to the paper for the details.
An example of combining Uncertainty and Diversity Sampling
Paper: https://arxiv.org/abs/1901.05954
The idea here is to combine uncertainty sampling and diversity sampling.
For diversity sampling, they cluster the data into K clusters using K-means (in the case of images, as features they use features extracted from intermediate layers from the neural network classifier).
Then they select images that are closest to the centers of each of the clusters, to make sure the sampled images are diverse.
To incorporate uncertainty sampling and select difficult images, they use weighted K-means, where every image is assigned a weight from an uncertainty acquisition function.
Since K-means can be slow, they pre-filter the unlabeled images to keep the top \(\beta*K\) images with the highest uncertainty scores, and do the clustering only on them (\(\beta\) is typically 10).
An example of combining Uncertainty and Diversity Sampling
This is really similar in nature to the previous paper, and in my opinion also to Core Sets, but has a unique twist.
For uncertainty sampling, instead of using the model output like is usually done, they compute the gradient of the predicted category, with respect to the parameters of the last layer. There are many parameters, so the gradient is a vector.
They call these... drum roll... gradient embeddings.
The rationale here is that when the gradient norm is large, the parameters need to change a lot for the model to become more confident about the category.
They give the rational that itâs more natural to use gradients as uncertainty measures in neural networks, since gradients are used to train the network.
But this also gives us an embedding we can cluster to chose diverse points. So it kills two birds with one stone: a way to choose uncertain images, and a way to chose diverse images.
They then proceed to cluster the embeddings to chose diverse points. Instead of using K-means to choose the batch points, they use a K-means initialization algorithm called K-means++, which is much faster to compute.
What's K-means++? From Wikipedia:
1. Choose one center uniformly at random from among the data points.
2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2.
4. Repeat Steps 2 and 3 until k centers have been chosen.
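The seeding procedure above is only a few lines of NumPy. This is a sketch of the algorithm itself, not of any particular library's implementation:

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """K-means++ seeding: each new center is sampled with probability
    proportional to the squared distance to the nearest chosen center."""
    centers = [X[rng.integers(len(X))]]            # step 1: uniform choice
    while len(centers) < k:
        # Step 2: D(x)^2, squared distance to the nearest chosen center.
        diffs = X[:, None, :] - np.asarray(centers)[None, :, :]
        d2 = (diffs ** 2).sum(-1).min(axis=1)
        # Step 3: sample the next center with probability proportional to D(x)^2.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centers)                     # step 4: stop at k centers

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
centers = kmeans_pp_init(X, k=4, rng=rng)
```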
If you recall the Core-Set work, it's really similar to K-means++! In Core-Sets, the embeddings were features from one of the last layers. In this work it's almost the same thing, but with gradient embeddings.
Thanks for reading. We went over active learning methods for Deep Learning.
These methods are really creative, and it was a joy to write.
We were focusing on images, but these methods can be used for other domains like text.
I hope this helps demystify active learning for Deep Learning.
Here is the github repository with all the code for this post.
Scroll to the end if you just want to see images.
In this post I will describe two experiments I did with Dlib's deep learning face detector:
Dlib's deep learning face detector is one of the most popular open source face detectors. It is used in many open source projects like the OpenFace project, and in countless industry applications as well. It is trained with the clever max-margin object detection algorithm, which penalizes objects that are not exactly in the center of the scanning window, thus learning non-maximum suppression and giving very accurate localization.
This is a good place to say that Dlib is a remarkable piece of software, and its creator Davis King is one of the heroes of the internet.
At this point Dlib only has support for converting the model weights to Caffe, so I decided to jump in and add support for converting the face detector model to PyTorch. From PyTorch it can easily be ported to many other platforms with the ONNX format, so getting Dlib's face detector to work in mobile deep learning frameworks should be straightforward from here.
The first part here was saving the face detector model in an XML format, using net_to_xml, like in this dlib example.
The XML is fairly easy to parse in python, with each layer's parameters (like the layer type, padding, kernel size etc.) stored in XML attributes, followed by a list of floats for each layer's biases and weights.
Batch normalization is implemented a bit differently in Dlib, without a running mean and running variance as part of the layer parameters, so a running mean of 0 and a running variance of 1 are used in PyTorch.
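A minimal sketch of that conversion, assuming the per-channel gamma/beta were already parsed out of the XML (`make_batchnorm` is a hypothetical helper, not part of the actual converter): with running statistics fixed at 0 and 1, an eval-mode BatchNorm2d reduces to the per-channel affine transform y = gamma * x + beta that Dlib's layer computes.

```python
import torch

def make_batchnorm(gamma, beta):
    """BatchNorm2d equivalent to Dlib's affine layer: fix the running
    statistics to mean 0 / variance 1 and use the parsed gamma/beta."""
    bn = torch.nn.BatchNorm2d(num_features=gamma.numel())
    bn.weight.data = gamma.clone()       # scale
    bn.bias.data = beta.clone()          # shift
    bn.running_mean.zero_()              # running mean = 0
    bn.running_var.fill_(1.0)            # running variance = 1
    bn.eval()                            # use the fixed running statistics
    return bn

bn = make_batchnorm(torch.tensor([2.0, 0.5]), torch.tensor([1.0, -1.0]))
y = bn(torch.ones(1, 2, 3, 3))           # channel 0 -> ~3.0, channel 1 -> ~-0.5
```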
get_model gets the XML path, and returns a PyTorch Sequential model.
The purpose of this section was to make sure the ported model is usable. You can skip to the next section for the face hallucinations.
On an i7 processor, inference took between 30ms and 150ms on a 640x480 feed from a webcam, depending on the scales used, which isn't bad at all. Running it on higher end mobile devices (after porting to ONNX) should give a much faster inference time.
Dlib's face detector is a fully convolutional network that slides over an input image and outputs a score for each window in the image. The network is aimed at detecting faces of a certain size, determined by the receptive field of the network.
To get scale invariance, Dlib resizes the input images to different sizes, and packs them into a single image with padding between the scaled images. This trades off more inference time for scale invariance. Inference on the packed larger image also gives better GPU utilization. Since I was doing this on a CPU, I didn't really have a motivation for the image packing, so instead I just did multiple forward passes on resized images.
After detection, non-maximum suppression is done between the different scales, and the box size is the receptive field size multiplied by the scale that best detected the object.
Here is the code for face detection on a webcam.
Now that we have the PyTorch model, we can use activation maximization to find images that cause a large response in specific filters. The idea is to perform gradient ascent iterations on the input image pixels, until a large activation in filter output is caused.
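The gradient ascent loop can be sketched like this. It's a toy setup: the model, the filter choice and the hyper-parameters are placeholders, not the ones used for the Dlib detector (which also needed the tricks listed below):

```python
import torch

def activation_maximization(model, filter_idx, steps=30, lr=0.1, size=64):
    """Optimize the input pixels to maximize the mean response
    of one filter in the model's output feature map."""
    img = torch.randn(1, 3, size, size, requires_grad=True)
    optimizer = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        response = model(img)[0, filter_idx].mean()
        (-response).backward()           # gradient *ascent* on the response
        optimizer.step()
    return img.detach()

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, padding=1), torch.nn.ReLU())
img = activation_maximization(model, filter_idx=0)
```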
I tried a lot of things until I managed to get this to work. Here is a short summary of some of the things I used:
Peeking at the second-to-last convolutional layer. The output from the last convolutional layer uses a combination of the outputs of the layer before it, and tends to return multiple faces (often in different poses) in the same image. This kind of makes sense, since there are many different types of faces that can all cause a face to be detected.
On the other hand, maximizing the responses of the second-to-last convolutional layer returns single faces, probably because its filters learned to be much more selective in the kind of faces they respond to. Different runs on the same filter often return different poses and expressions of the same face.
These are selected filters. Some filters did not correspond to faces, or had multiple faces. For each filter, there were 900 iterations of gradient ascent, repeated 10 times to create 10 different images.
Notebook contributed to TensorLy.
In this post I will cover a few low rank tensor decomposition methods for taking layers in existing deep learning models and making them more compact. I will also share PyTorch code that uses Tensorly for performing CP decomposition and Tucker decomposition of convolutional layers.
Although hopefully most of the post is self contained, a good review of tensor decompositions can be found here. The author of Tensorly also created some really nice notebooks about tensor basics. They helped me get started, and I recommend going through them.
Together with pruning, tensor decompositions are practical tools for speeding up existing deep neural networks, and I hope this post will make them a bit more accessible.
These methods take a layer and decompose it into several smaller layers. Although there will be more layers after the decomposition, the total number of floating point operations and weights will be smaller. Some reported results are on the order of x8 for entire networks (not aimed at large tasks like imagenet, though), or x4 for specific layers inside imagenet. My experience was that with these decompositions I was able to get a speedup of between x2 to x4, depending on the accuracy drop I was willing to take.
In this blog post I covered a technique called pruning for reducing the number of parameters in a model. Pruning requires making a forward pass (and sometimes a backward pass) on a dataset, and then ranks the neurons according to some criterion on the activations in the network.
Quite different from that, tensor decomposition methods use only the weights of a layer, with the assumption that the layer is over parameterized and its weights can be represented by a matrix or tensor with a lower rank. This means they work best in cases of over parameterized networks. Networks like VGG are over parameterized by design. Another example of an over parameterized model is fine tuning a network for an easier task with fewer categories.
Similarly to pruning, after the decomposition usually the model needs to be fine tuned to restore accuracy.
One last thing worth noting before we dive into details, is that while these methods are practical and give nice results, they have a few drawbacks:
There are works that try to address these issues, and it's still an active research area.
The first reference I could find of using this for accelerating deep neural networks, is in the Fast-RCNN paper. Ross Girshick used it to speed up the fully connected layers used for detection. Code for this can be found in the pyfaster-rcnn implementation.
The singular value decomposition lets us decompose any matrix A with n rows and m columns:
\[A_{nxm} = U_{nxn} S_{nxm} V^T_{mxm}\]S is a diagonal matrix with non negative values along its diagonal (the singular values), and is usually constructed such that the singular values are sorted in descending order. U and V are orthogonal matrices: \(U^TU=V^TV=I\)
If we take the largest t singular values and zero out the rest, we get an approximation of A: \(\hat{A} = U_{nxt}S_{txt}V^T_{mxt}\)
\(\hat{A}\) has the nice property of being the rank t matrix that has the Frobenius-norm closest to A, so \(\hat{A}\) is a good approximation of A if t is large enough.
A fully connected layer essentially does matrix multiplication of its input by a matrix A, and then adds a bias b:
\(Ax+b\).
We can take the SVD of A, and keep only the first t singular values.
\((U_{nxt}S_{txt}V^T_{mxt})x + b\) = \(U_{nxt} ( S_{txt}V^T_{mxt} x ) + b\)
Instead of a single fully connected layer, this guides us in implementing it as two smaller ones:
The total number of weights dropped from nxm to t(n+m).
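As a sketch, here is how a PyTorch Linear layer could be split into the two smaller layers (when t equals the full rank, as in the usage below, the reconstruction is exact; in practice you would pick a smaller t):

```python
import torch

def truncated_svd_linear(layer, t):
    """Replace y = Ax + b (A is n x m) with two smaller layers:
    first S_t V_t^T (t x m), then U_t (n x t) plus the bias."""
    A = layer.weight.data                               # (n, m)
    U, S, Vh = torch.linalg.svd(A, full_matrices=False) # A = U diag(S) Vh
    first = torch.nn.Linear(A.shape[1], t, bias=False)
    first.weight.data = torch.diag(S[:t]) @ Vh[:t, :]   # (t, m)
    second = torch.nn.Linear(t, A.shape[0], bias=True)
    second.weight.data = U[:, :t].clone()               # (n, t)
    second.bias.data = layer.bias.data.clone()
    return torch.nn.Sequential(first, second)

layer = torch.nn.Linear(100, 50)
approx = truncated_svd_linear(layer, t=50)  # t = full rank here, so exact
x = torch.randn(4, 100)
```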
A 2D convolutional layer is a multi dimensional matrix (from now on - tensor) with 4 dimensions:
cols x rows x input_channels x output_channels.
Following the SVD example, we would want to somehow decompose the tensor into several smaller tensors. The convolutional layer would then be approximated by several smaller convolutional layers.
For this we will use the two popular (well, at least in the world of Tensor algorithms) tensor decompositions: the CP decomposition and the Tucker decomposition (also called higher-order SVD and many other names).
1412.6553 Speeding-up Convolutional Neural Networks Using Fine-tuned CP-Decomposition shows how CP-Decomposition can be used to speed up convolutional layers. As we will see, this factors the convolutional layer into something that resembles mobile nets.
They were able to use this to accelerate a network by more than x8 without significant decrease in accuracy. In my own experiments I was able to use this to get a x2 speedup on a network based on VGG16 without an accuracy drop.
My experience with this method is that the fine-tuning learning rate needs to be chosen very carefully to get it to work, and the learning rate should usually be very small (around \(10^{-6}\)).
A rank R matrix can be viewed as a sum of R rank 1 matrices, where each rank 1 matrix is a column vector multiplying a row vector: \(\sum_1^Ra_i*b_i^T\)
The SVD gives us a way for writing this sum for matrices using the columns of U and V from the SVD: \(\sum_1^R \sigma_i u_i*v_i^T\).
If we choose an R that is less than the full rank of the matrix, then this sum is just an approximation, like in the case of truncated SVD.
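A quick NumPy check of this rank-1 view: summing all R rank-1 terms rebuilds the matrix exactly, and truncating gives an approximation whose Frobenius error is exactly the norm of the discarded singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A as a sum of rank-1 matrices sigma_i * u_i v_i^T.
full = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s)))

# Keeping only R terms gives the truncated-SVD approximation.
R = 2
approx = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(R))
err = np.linalg.norm(A - approx)          # equals sqrt(s[2]^2 + s[3]^2)
```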
The CP decomposition lets us generalize this for tensors.
Using CP-Decomposition, our convolutional kernel, a 4 dimensional tensor \(K(i, j, s, t)\), can be approximated similarly for a chosen R:
\(\sum_{r=1}^R K^x_r(i)K^y_r(j)K^s_r(s)K^t_r(t)\).
We will want R to be small for the decomposition to be efficient, but large enough to keep a high approximation accuracy.
To forward the layer, we do convolution with an input \(X(i, j, s)\):
\(V(x, y, t) = \sum_i \sum_j \sum_sK(i, j, s, t)X(x-i, y-j, s)\) \(= \sum_r\sum_i \sum_j \sum_sK^x_r(i)K^y_r(j)K^s_r(s)K^t_r(t)X(x-i, y-j, s)\) \(= \sum_rK^t_r(t) \sum_i \sum_j K^x_r(i)K^y_r(j)\sum_sK^s_r(s)X(x-i, y-j, s)\)
This gives us a recipe for doing the convolution:
First do a pointwise (1x1xS) convolution with \(K^s_r(s)\). This reduces the number of input channels from S to R. The next convolutions will be done on a smaller number of channels, making them faster.
Perform separable convolutions in the spatial dimensions with \(K^x_r,K^y_r\). Like in mobilenets the convolutions are depthwise separable, done in each channel separately. Unlike mobilenets the convolutions are also separable in the spatial dimensions.
Do another pointwise convolution to change the number of channels from R to T. If the original convolutional layer had a bias, add it at this point.
Notice the combination of pointwise and depthwise convolutions like in mobilenets. While with mobilenets you have to train a network from scratch to get this structure, here we can decompose an existing layer into this form.
As with mobile nets, to get the most speedup you will need a platform that has an efficient implementation of depthwise separable convolutions.
Image taken from the paper. The bottom row is an illustration of the convolution steps after CP-decomposition.
import torch
import torch.nn as nn
from tensorly.decomposition import parafac

def cp_decomposition_conv_layer(layer, rank):
    """ Gets a conv layer and a target rank,
        returns a nn.Sequential object with the decomposition """
    # Perform CP decomposition on the layer weight tensor with tensorly.
    last, first, vertical, horizontal = \
        parafac(layer.weight.data, rank=rank, init='svd')

    pointwise_s_to_r_layer = torch.nn.Conv2d(in_channels=first.shape[0],
        out_channels=first.shape[1], kernel_size=1, stride=1, padding=0,
        dilation=layer.dilation, bias=False)

    depthwise_vertical_layer = torch.nn.Conv2d(in_channels=vertical.shape[1],
        out_channels=vertical.shape[1], kernel_size=(vertical.shape[0], 1),
        stride=1, padding=(layer.padding[0], 0), dilation=layer.dilation,
        groups=vertical.shape[1], bias=False)

    depthwise_horizontal_layer = torch.nn.Conv2d(in_channels=horizontal.shape[1],
        out_channels=horizontal.shape[1], kernel_size=(1, horizontal.shape[0]),
        stride=layer.stride, padding=(0, layer.padding[0]),
        dilation=layer.dilation, groups=horizontal.shape[1], bias=False)

    pointwise_r_to_t_layer = torch.nn.Conv2d(in_channels=last.shape[1],
        out_channels=last.shape[0], kernel_size=1, stride=1,
        padding=0, dilation=layer.dilation, bias=True)
    pointwise_r_to_t_layer.bias.data = layer.bias.data

    depthwise_horizontal_layer.weight.data = \
        torch.transpose(horizontal, 1, 0).unsqueeze(1).unsqueeze(1)
    depthwise_vertical_layer.weight.data = \
        torch.transpose(vertical, 1, 0).unsqueeze(1).unsqueeze(-1)
    pointwise_s_to_r_layer.weight.data = \
        torch.transpose(first, 1, 0).unsqueeze(-1).unsqueeze(-1)
    pointwise_r_to_t_layer.weight.data = last.unsqueeze(-1).unsqueeze(-1)

    new_layers = [pointwise_s_to_r_layer, depthwise_vertical_layer,
                  depthwise_horizontal_layer, pointwise_r_to_t_layer]
    return nn.Sequential(*new_layers)
1511.06530 Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications is a really cool paper that shows how to use the Tucker Decomposition for speeding up convolutional layers, with even better results. I also used this to accelerate an over-parameterized VGG based network, with better accuracy than CP Decomposition. As the authors note in the paper, it lets us do the fine-tuning using higher learning rates (I used \(10^{-3}\)).
The Tucker Decomposition, also known as the higher order SVD (HOSVD) and many other names, is a generalization of SVD for tensors. \(K(i, j, s, t) = \sum_{r_1=1}^{R_1}\sum_{r_2=1}^{R_2}\sum_{r_3=1}^{R_3}\sum_{r_4=1}^{R_4}\sigma_{r_1 r_2 r_3 r_4} K^x_{r1}(i)K^y_{r2}(j)K^s_{r3}(s)K^t_{r4}(t)\)
The reason it's considered a generalization of the SVD is that often the components of \(\sigma_{r_1 r_2 r_3 r_4}\) are orthogonal, but this isn't really important for our purpose. \(\sigma_{r_1 r_2 r_3 r_4}\) is called the core tensor, and defines how the different axes interact.
In the CP Decomposition described above, the decomposition along the spatial dimensions \(K^x_r(i)K^y_r(j)\) caused a spatially separable convolution. The filters are quite small anyway, typically 3x3 or 5x5, so the separable convolution isn't saving us a lot of computation, and is an aggressive approximation.
The Tucker decomposition has the useful property that it doesn't have to be decomposed along all the axes (modes). We can perform the decomposition along the input and output channels instead (a mode-2 decomposition):
\[K(i, j, s, t) = \sum_{r_3=1}^{R_3}\sum_{r_4=1}^{R_4}\sigma_{i j r_3 r_4} K^s_{r_3}(s)K^t_{r_4}(t)\]Like for CP decomposition, let's write the convolution formula and plug in the kernel decomposition:
\[V(x, y, t) = \sum_i \sum_j \sum_sK(i, j, s, t)X(x-i, y-j, s)\] \[V(x, y, t) = \sum_i \sum_j \sum_s\sum_{r_3=1}^{R_3}\sum_{r_4=1}^{R_4}\sigma_{i j r_3 r_4}K^s_{r_3}(s)K^t_{r_4}(t)X(x-i, y-j, s)\] \[V(x, y, t) = \sum_i \sum_j \sum_{r_4=1}^{R_4}\sum_{r_3=1}^{R_3}K^t_{r_4}(t)\sigma_{i j r_3 r_4} \sum_s K^s_{r_3}(s)X(x-i, y-j, s)\]This gives us the following recipe for doing the convolution with the Tucker decomposition:
Point wise convolution with \(K^s_{r3}(s)\) for reducing the number of channels from S to \(R_3\).
Regular (not separable) convolution with the core \(\sigma_{i j r_3 r_4}\). Instead of S input channels and T output channels like the original layer had, this convolution has \(R_3\) input channels and \(R_4\) output channels. If these ranks are smaller than S and T, this is where the reduction comes from.
Pointwise convolution with \(K^t_{r4}(t)\) to get back to T output channels like the original convolution. Since this is the last convolution, at this point we add the bias if there is one.
One way would be trying different values and checking the accuracy. I played with heuristics like \(R_3 = S/3\) , \(R_4 = T/3\) with good results.
Ideally selecting the ranks should be automated.
The authors suggested using variational Bayesian matrix factorization (VBMF) (Nakajima et al., 2013) as a method for estimating the rank.
VBMF is complicated and out of the scope of this post, but in a really high level summary, what they do is approximate a matrix \(V_{LxM}\) as the sum of a lower-rank matrix \(B_{LxH}A^T_{HxM}\) and Gaussian noise. After A and B are found, H is an upper bound on the rank.
To use this for tucker decomposition, we can unfold the s and t components of the original weight tensor to create matrices. Then we can estimate \(R_3\) and \(R_4\) as the rank of the matrices using VBMF.
I used this python implementation of VBMF and got convinced it works :-)
VBMF usually returned ranks very close to what I previously found with careful and tedious manual tuning.
This could also be used for estimating the rank for Truncated SVD acceleration of fully connected layers.
import tensorly
# VBMF: the Python implementation from https://github.com/CasvandenBogaard/VBMF
import VBMF

def estimate_ranks(layer):
    """ Unfolds the 2 modes of the tensor the decomposition will
        be performed on, and estimates the ranks of the matrices using VBMF """
    weights = layer.weight.data.numpy()
    unfold_0 = tensorly.base.unfold(weights, 0)
    unfold_1 = tensorly.base.unfold(weights, 1)
    _, diag_0, _, _ = VBMF.EVBMF(unfold_0)
    _, diag_1, _, _ = VBMF.EVBMF(unfold_1)
    ranks = [diag_0.shape[0], diag_1.shape[1]]
    return ranks
from tensorly.decomposition import partial_tucker

def tucker_decomposition_conv_layer(layer):
    """ Gets a conv layer,
        returns a nn.Sequential object with the Tucker decomposition.
        The ranks are estimated with a Python implementation of VBMF
        https://github.com/CasvandenBogaard/VBMF """
    ranks = estimate_ranks(layer)
    print(layer, "VBMF Estimated ranks", ranks)
    core, [last, first] = \
        partial_tucker(layer.weight.data, modes=[0, 1], ranks=ranks, init='svd')

    # A pointwise convolution that reduces the channels from S to R3
    first_layer = torch.nn.Conv2d(in_channels=first.shape[0],
        out_channels=first.shape[1], kernel_size=1,
        stride=1, padding=0, dilation=layer.dilation, bias=False)

    # A regular 2D convolution layer with R3 input channels
    # and R4 output channels
    core_layer = torch.nn.Conv2d(in_channels=core.shape[1],
        out_channels=core.shape[0], kernel_size=layer.kernel_size,
        stride=layer.stride, padding=layer.padding, dilation=layer.dilation,
        bias=False)

    # A pointwise convolution that increases the channels from R4 to T
    last_layer = torch.nn.Conv2d(in_channels=last.shape[1],
        out_channels=last.shape[0], kernel_size=1, stride=1,
        padding=0, dilation=layer.dilation, bias=True)
    last_layer.bias.data = layer.bias.data

    first_layer.weight.data = \
        torch.transpose(first, 1, 0).unsqueeze(-1).unsqueeze(-1)
    last_layer.weight.data = last.unsqueeze(-1).unsqueeze(-1)
    core_layer.weight.data = core

    new_layers = [first_layer, core_layer, last_layer]
    return nn.Sequential(*new_layers)
In this post we went over a few tensor decomposition methods for accelerating layers in deep neural networks.
Truncated SVD can be used for accelerating fully connected layers.
CP Decomposition decomposes convolutional layers into something that resembles mobile-nets, although it is even more aggressive since it is also separable in the spatial dimensions.
Tucker Decomposition reduces the number of input and output channels the 2D convolution operates on, and uses pointwise convolutions to switch the number of channels before and after the 2D convolution.
I think itâs interesting how common patterns in network design, pointwise and depthwise convolutions, naturally appear in these decompositions!
Pruning neural networks is an old idea going back to 1990 (with Yann LeCun's optimal brain damage work) and before. The idea is that among the many parameters in the network, some are redundant and don't contribute a lot to the output.
If you could rank the neurons in the network according to how much they contribute, you could then remove the low ranking neurons from the network, resulting in a smaller and faster network.
Getting faster/smaller networks is important for running these deep learning networks on mobile devices.
The ranking can be done according to the L1/L2 mean of the neuron weights, their mean activations, the number of times a neuron wasn't zero on some validation set, and other creative methods. After the pruning, the accuracy will drop (hopefully not too much if the ranking is clever), and the network is usually trained some more to recover.
If we prune too much at once, the network might be damaged so much it won't be able to recover.
So in practice this is an iterative process - often called "Iterative Pruning": Prune / Train / Repeat.
The image is taken from [1611.06440 Pruning Convolutional Neural Networks for Resource Efficient Inference]
There are a lot of papers about pruning, but I've never encountered pruning used in real life deep learning projects.
Which is surprising considering all the effort on running deep learning on mobile devices. I guess the reason is a combination of:
So, I decided to implement pruning myself and see if I could get good results with it.
In this post we will go over a few pruning methods, and then dive into the implementation details of one of the recent methods.
We will fine tune a VGG network to classify cats/dogs on the Kaggle Dogs vs Cats dataset, which represents a kind of transfer learning that I think is very common in practice.
Then we will prune the network and speed it up by a factor of almost x3, and reduce the size by a factor of almost x4!
In VGG16 90% of the weights are in the fully connected layers, but those account for 1% of the total floating point operations.
Up until recently most of the works focused on pruning the fully connected layers. By pruning those, the model size can be dramatically reduced.
We will focus here on pruning entire filters in convolutional layers.
But this has a cool side effect of also reducing memory. As observed in the [1611.06440 Pruning Convolutional Neural Networks for Resource Efficient Inference] paper, the deeper the layer, the more it will get pruned.
This means the last convolutional layer will get pruned a lot, and a lot of neurons from the fully connected layer following it will also be discarded!
When pruning the convolutional filters, another option would be to reduce the weights in each filter, or to remove a specific dimension of a single kernel. You can end up with filters that are sparse, but it's not trivial to get a computational speed up. Recent works advocate "structured sparsity", where entire filters are pruned instead.
One important thing several of these papers show, is that by training and then pruning a larger network, especially in the case of transfer learning, they get results that are much better than training a smaller network from scratch.
Let's now briefly review a few methods.
In this work they advocate pruning entire convolutional filters. Pruning a filter with index k affects the layer it resides in, and the following layer. All the input channels at index k in the following layer will have to be removed, since they won't exist any more after the pruning.
The image is from [1608.08710 Pruning filters for efficient convnets]
In case the following layer is a fully connected layer, and the size of the feature map of that channel would be MxN, then MxN neurons will be removed from the fully connected layer.
The neuron ranking in this work is fairly simple. It's the L1 norm of the weights of each filter.
At each pruning iteration they rank all the filters, prune the m lowest ranking filters globally among all the layers, retrain and repeat.
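This ranking is a one-liner in PyTorch. Here is a sketch of it together with the global selection of the m lowest-ranked filters (the helper names are mine, not from the paper):

```python
import torch

def l1_filter_ranks(conv):
    """L1 norm of each filter's weights; the weight shape is
    (out_channels, in_channels, kH, kW), so sum |w| over dims 1-3."""
    return conv.weight.data.abs().sum(dim=(1, 2, 3))

def lowest_ranked_filters(convs, m):
    """Globally collect the m lowest-ranked (score, layer, filter) triples."""
    scored = [(score.item(), layer_idx, filter_idx)
              for layer_idx, conv in enumerate(convs)
              for filter_idx, score in enumerate(l1_filter_ranks(conv))]
    return sorted(scored)[:m]

convs = [torch.nn.Conv2d(3, 16, 3), torch.nn.Conv2d(16, 32, 3)]
to_prune = lowest_ranked_filters(convs, m=4)
```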
This work seems similar, but the ranking is much more complex. They keep a set of N particle filters, which represent N convolutional filters to be pruned.
Each particle is assigned a score based on the network accuracy on a validation set, when the filter represented by the particle was not masked out. Then based on the new score, new pruning masks are sampled.
Since running this process is heavy, they used a small validation set for measuring the particle scores.
This is a really cool work from Nvidia.
First they state the pruning problem as a combinatorial optimization problem: choose a subset of weights B, such that when pruning them the network cost change will be minimal.
Notice how they used the absolute difference and not just the difference. Using the absolute difference enforces that the pruned network won't decrease the network performance too much, but also that it shouldn't increase it. In the paper they show this gives better results, presumably because it's more stable.
Now all ranking methods can be judged by this cost function.
VGG16 has 4224 convolutional filters. The "ideal" ranking method would be brute force - prune each filter, and then observe how the cost function changes when running on the training set. Since they are from Nvidia and have access to a gazillion GPUs, they did just that. This is called the oracle ranking - the best possible ranking for minimizing the network cost change. To measure the effectiveness of other ranking methods, they compute the Spearman correlation with the oracle. Surprise surprise, the ranking method they came up with (described next) correlates best with the oracle.
They come up with a new neuron ranking method based on a first order (meaning fast to compute) Taylor expansion of the network cost function.
Pruning a filter h is the same as zeroing it out.
C(W, D) is the average network cost function on the dataset D, when the network weights are set to W. Now we can evaluate C(W, D) as an expansion around C(W, D, h = 0). They should be pretty close, since removing a single filter shouldn't affect the cost too much.
The ranking of h is then abs(C(W, D, h = 0) - C(W, D)).
The rankings in each layer are then normalized by the L2 norm of the ranks in that layer. I guess this is kind of empirical, and I'm not sure why it's needed, but it greatly affects the quality of the pruning.
This rank is quite intuitive. We could've used either the activation or the gradient as a ranking method by itself. If either of them is high, that means it is significant to the output. Multiplying them gives us a way to throw away the filter if either the gradients or the activations are very low, and keep it if both are high.
This makes me wonder - did they pose the pruning problem as minimizing the difference of the network costs, and then come up with the Taylor expansion method, or was it the other way around, and the difference of network costs oracle was a way to back up their new method? :-)
In the paper their method outperformed other methods in accuracy, too, so it looks like the oracle is a good indicator.
Anyway I think this is a nice method that's friendlier to code and test than, say, a particle filter, so we will explore it further!
So let's say we have a transfer learning task where we need to create a classifier from a relatively small dataset. Like in this Keras blog post.
Can we use a powerful pre-trained network like VGG for transfer learning, and then prune the network?
If many features learned in VGG16 are about cars, people and houses - how much do they contribute to a simple dog/cat classifier?
This is a kind of a problem that I think is very common.
As a training set we will use 1000 images of cats, and 1000 images of dogs, from the Kaggle Dogs vs Cats data set. As a testing set we will use 400 images of cats, and 400 images of dogs.
The accuracy dropped from 98.7% to 97.5%.
The network size reduced from 538 MB to 150 MB.
On a i7 CPU the inference time reduced from 0.78 to 0.277 seconds for a single image, almost a factor x3 reduction!
We will take VGG16, drop the fully connected layers, and add three new fully connected layers. We will freeze the convolutional layers, and retrain only the new fully connected layers. In PyTorch, the new layers look like this:
After training for 20 epochs with data augmentation, we get an accuracy of 98.7% on the testing set.
To compute the Taylor criterion, we need to perform a forward+backward pass on our dataset (or on a smaller part of it if it's too large, but since we have only 2000 images let's use all of them).
Now we need to somehow get both the gradients and the activations for convolutional layers. In PyTorch we can register a hook on the gradient computation, so a callback is called when they are ready:
Now we have the activations in self.activations, and when a gradient is ready, compute_rank will be called:
This does a pointwise multiplication of each activation in the batch and its gradient, and then for each activation (that is an output of a convolution) we sum over all dimensions except the dimension of the output.
For example, if the batch size was 32, the number of outputs for a specific activation was 256, and the spatial size of that activation was 112x112, such that the activation/gradient shapes were 32x256x112x112, then the output will be a 256-sized vector representing the ranks of the 256 filters in this layer.
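The computation itself is compact in PyTorch. A sketch with small toy shapes, including the per-layer L2 normalization from the paper:

```python
import torch

def taylor_ranks(activation, gradient):
    """Pointwise-multiply activations and gradients, sum over all
    dimensions except the filter dimension, take the absolute value,
    and L2-normalize within the layer (as in the Nvidia paper)."""
    values = (activation * gradient).sum(dim=(0, 2, 3)).abs()
    return values / (values.norm() + 1e-8)

# Toy shapes standing in for e.g. 32x256x112x112 activations/gradients.
activation = torch.randn(8, 256, 14, 14)
gradient = torch.randn(8, 256, 14, 14)
ranks = taylor_ranks(activation, gradient)   # one rank per filter
```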
Now that we have the ranking, we can use a min heap to get the N lowest ranking filters. Unlike in the Nvidia paper where they used N=1 at each iteration, to get results faster we will use N=512! This means that at each pruning iteration, we will remove 12% of the original number of 4224 convolutional filters.
The distribution of the low ranking filters is interesting. Most of the pruned filters are from the deeper layers. Here is a peek at which filters were pruned after the first iteration:
| Layer number | Number of pruned filters |
| --- | --- |
| Layer 0 | 6 |
| Layer 2 | 1 |
| Layer 5 | 4 |
| Layer 7 | 3 |
| Layer 10 | 23 |
| Layer 12 | 13 |
| Layer 14 | 9 |
| Layer 17 | 51 |
| Layer 19 | 35 |
| Layer 21 | 52 |
| Layer 24 | 68 |
| Layer 26 | 74 |
| Layer 28 | 73 |
At this stage, we unfreeze all the layers and retrain the network for 10 epochs, which was enough to get good results on this dataset. Then we go back to step 1 with the modified network, and repeat.
This is the real price we pay - that's 50% of the number of epochs used to train the network, at a single iteration. On this toy dataset we can get away with it since the dataset is small. If you're doing this for a huge dataset, you'd better have lots of GPUs.
I think pruning is an overlooked method that is going to get a lot more attention and use in practice. We showed how we can get nice results on a toy dataset. I think many problems deep learning is used to solve in practice are similar to this one, using transfer learning on a limited dataset, so they can benefit from pruning too.
Video using the hypercolumns and occlusion map methods described below
This post is about understanding how a self driving deep learning network decides to steer the car wheel.
NVIDIA published a very interesting paper, that describes how a deep learning network can be trained to steer a wheel, given a 200x66 RGB image from the front of a car.
This repository shared a Tensorflow implementation of the network described in the paper, and (thankfully!) a dataset of image/steering angle pairs collected from a human driving a car. The dataset is quite small, and there are much larger datasets available, like the one from the Udacity challenge.
However it is great for quickly experimenting with these kinds of networks, and visualizing when the network is overfitting is also interesting. I ported the code to Keras, trained a (very over-fitting) network based on the NVIDIA paper, and made visualizations.
I think that if this kind of network eventually finds use in a real-world self-driving car, being able to debug it and understand its output will be crucial.
Otherwise the first time the network decides to make a very wrong turn, critics will say that this is just a black box we don't understand, and it should be replaced!
The first thing we will try won't require any knowledge about the network; in fact we won't peek inside the network at all, just look at the output. We'll create an occlusion map for a given image: we take many windows in the image, mask them out, run the network, and see how the regressed angle changed. If the angle changed a lot, that window contains information that was important for the network's decision. We can then assign each window a score based on how much the angle changed!
We need to take many windows, with different sizes, since we don't know in advance the sizes of important features in the image.
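The procedure above can be sketched like this (a single window size for brevity; `predict_angle` is a stand-in for a function that runs the network on one image and returns the regressed angle):

```python
import numpy as np

def occlusion_map(image, predict_angle, window=16, stride=8):
    """Score each window by how much masking it changes the regressed angle.

    `predict_angle` is assumed to map an HxWx3 image to a scalar angle;
    the window/stride values are illustrative, and in practice we repeat
    this for several window sizes.
    """
    base = predict_angle(image)
    h, w = image.shape[:2]
    heat = np.zeros((h, w))
    counts = np.zeros((h, w))
    for y in range(0, h - window + 1, stride):
        for x in range(0, w - window + 1, stride):
            masked = image.copy()
            masked[y:y + window, x:x + window] = 0   # mask the window out
            # The score of a window is how much the output angle changed.
            score = abs(predict_angle(masked) - base)
            heat[y:y + window, x:x + window] += score
            counts[y:y + window, x:x + window] += 1
    return heat / np.maximum(counts, 1)
```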
Now we can make nice effects like filtering the occlusion map, and displaying the focused area on top of a blurred image:
Some problems with this: it's expensive to create the visualization since we need many sliding windows, and it's possible that masking out the windows created artificial features, like sharp edges, that were used by the network. Also, this tells us which areas were important for the network, but it doesn't give us any insight into why. Can we do better?
So we want to understand what kind of features the network saw in the image, and how it used them for its final decision. Let's use a heuristic: take the outputs of the convolutional layers, resize them to the input image size, and aggregate them. The collection of these outputs is called hypercolumns, and here is a good blog post about getting them with Keras. One way of aggregating them is by just multiplying them, so pixels that had high activations in all layers will get a high score. We will take the average output image from each layer, normalize it, and multiply these values from the wanted layers. In the NVIDIA model, the output from the last convolutional layer is an 18x1 image. If we peek only at that layer, we basically get an importance map for columns of the image:
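A minimal numpy sketch of this aggregation (the layer outputs and the nearest-neighbour resize here are stand-ins for the actual Keras hypercolumn extraction):

```python
import numpy as np

def hypercolumn_importance(layer_outputs, size=(66, 200)):
    """Multiply the normalized average output images of the chosen layers.

    Each element of `layer_outputs` is assumed shaped (channels, h, w),
    already extracted for a single input image.
    """
    product = np.ones(size)
    for out in layer_outputs:
        mean_map = out.mean(axis=0)               # average output image of the layer
        # Nearest-neighbour resize via index sampling, to avoid extra deps.
        ys = np.linspace(0, mean_map.shape[0] - 1, size[0]).astype(int)
        xs = np.linspace(0, mean_map.shape[1] - 1, size[1]).astype(int)
        resized = mean_map[np.ix_(ys, xs)]
        resized = resized / (resized.max() + 1e-8)  # normalize
        product *= resized                         # pixels high in all layers stay high
    return product
```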
Anyway, this is quite naive: it completely ignores the fully connected layers, and the fact that in certain situations some outputs are much more important than others. But it's a heuristic.
(The above image shows pixels that contribute to steering right)
Class activation maps are a technique to visualize the importance of image pixels to the final output of the network. Basically you take the output of the last convolutional layer, you take a spatial average of that (global average pooling), and you feed that into a softmax for classification. Now you can look at the softmax weights used to give a category score - large weights mean important features - and multiply them by the corresponding conv outputs.
Relative to the rest of the stuff we tried here, this technique is great. It gives us insight into how exactly each pixel was used in the overall decision process. However, this technique requires a specific network architecture: conv layers + GAP (global average pooling, the spatial mean of every channel in a feature map), so existing networks with fully connected layers, like the NVIDIA model, can't be used as is. We could just train a new model with conv layers + GAP (I actually did that), however we really want the fully connected layers here. They enable the network to reason spatially about the image: if it finds interesting features in the left part of the image, perhaps that road is blocked?
This paper solves the issue, and generalizes class activation maps. To get the importance of each conv output image, you use back propagation: you take the gradient of the target output with respect to the pixels in the conv output images. Conv output images that are important for the final classification decision will contain a lot of positive gradients. So to assign them an importance value, we can just take a spatial average of the gradients in each conv output image (global average pooling again).
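Given the conv output and its gradients for one image (assumed already computed by back propagation), the weighting step looks roughly like this:

```python
import numpy as np

def grad_cam_heatmap(conv_output, grads):
    """Weight each conv output image by the spatial average of its gradients.

    Both inputs are assumed shaped (channels, height, width) for one image.
    """
    weights = grads.mean(axis=(1, 2))     # global average pooling of the gradients
    cam = np.zeros(conv_output.shape[1:])
    for w, channel in zip(weights, conv_output):
        cam += w * channel                # weight each conv output image
    return np.maximum(cam, 0)             # keep only positive evidence
```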
I wrote some Keras code to try this out for classification networks.
So let's adapt this for the steering angle regression. We can't just always take the gradient of the output, since now a high gradient doesn't mean contributing to a certain category like in the classification case, but to a positive steering angle. And maybe the actual steering angle was negative.
Let's look at the gradient of the regressed angle with respect to some pixel in some output image. If the gradient is very positive, that means the pixel contributes to enlarging the steering angle: steering right. If the gradient is very negative, the pixel contributes to steering left. If the gradient is very small, the pixel contributes to not steering at all.
We can divide the angles into ranges: if the actual output angle was large, we can peek at the image features that contributed to a positive steering angle, etc. If the angle is small, we will just take the inverse of the steering angle as our target, since then pixels that contribute to small angles will get large gradients.
Let's look at an example. For the same image, we could target pixels that contribute to steering right:
And we could also target pixels that contribute to steering to the center:
Class activation maps are a simple technique to get the discriminative image regions used by a CNN to identify a specific class in the image. In other words, a class activation map (CAM) lets us see which regions in the image were relevant to this class. The authors of the paper show that this also allows re-using classifiers for getting good localization results, even when training without bounding box coordinates data. This also shows how deep learning networks already have some kind of a built in attention mechanism.
This should be useful for debugging the decision process in classification networks.
To be able to create a CAM, the network architecture is restricted to have a global average pooling layer after the final convolutional layer, and then a linear (dense) layer. Unfortunately this means we can't apply this technique on existing networks that don't have this structure. What we can do is modify existing networks and fine tune them to get this. Designing network architectures to support tricks like CAM is like writing code in a way that makes it easier to debug.
The first building block for this is a layer called global average pooling. After the last convolutional layer in a typical network like VGG16, we have an N-channel image, where N is the number of filters in that layer. For example, in VGG16 the last convolutional layer has 512 filters. For a 1024x1024 input image (let's discard the fully connected layers, so we can use any input image size we want), the output shape of the last convolutional layer will be 512x64x64. Since 1024/64 = 16, we have a 16x16 spatial mapping resolution. A global average pooling (GAP) layer just takes each of these 512 channels and returns their spatial average. Channels with high activations will have high signals. Let's look at Keras code for this:
The output shape of the convolutional layer will be [batch_size, number of filters, width, height]. So we can take the average in the width/height axes (2, 3). We also need to specify the output shape from the layer, so Keras can do shape inference for the next layers. Since we are creating a custom layer here, Keras doesn't really have a way to just deduce the output size by itself.
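As a minimal sketch of what the custom layer computes (plain numpy standing in for the Keras Lambda layer and its output_shape function):

```python
import numpy as np

def global_average_pooling(x):
    # x is (batch_size, n_filters, width, height); average the spatial axes.
    return x.mean(axis=(2, 3))

def global_average_pooling_shape(input_shape):
    # The output shape Keras would need for shape inference:
    # the spatial axes collapse away, leaving (batch_size, n_filters).
    return input_shape[0:2]

x = np.random.randn(8, 512, 7, 7)
print(global_average_pooling(x).shape)  # (8, 512)
```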
The second building block is to assign a weight to each output from the global average pooling layer, for each of the categories. This can be done by adding a dense linear layer + softmax, training an SVM on the GAP output, or applying any other linear classifier on top of the GAP. These weights set the importance of each of the convolutional layer outputs.
Let's combine these building blocks in Keras code:
Now to create a heatmap for a class we can just take output images from the last convolutional layer, multiply them by their assigned weights (different weights for each class), and sum.
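The heatmap computation itself is just a weighted sum over the channels, roughly (shapes and weights here are illustrative):

```python
import numpy as np

def class_activation_map(conv_outputs, class_weights):
    """CAM: weight each output image of the last conv layer by the class's
    weight for that channel, and sum over channels.

    conv_outputs: (n_filters, height, width), class_weights: (n_filters,).
    """
    return np.tensordot(class_weights, conv_outputs, axes=1)
```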
To test this out I trained a poor man's person/not-person classifier on person images from here: http://pascal.inrialpes.fr/data/human. For training, all the images are resized to 68x128, and 20% of the images are used for validation. After 11 epochs the model over-fits the training set with almost 100% accuracy, and gets about 95% accuracy on the validation set.
To speed up the training, I froze the weights of the VGG16 network (in Keras this is as simple as model.trainable=False), and trained only the weights applied on the GAP layer. Since we discarded all the layers after the last convolutional layer in VGG16, we can load a much smaller model: https://github.com/awentzonline/keras-vgg-buddy
Here are some more examples, using the weights for the âpersonâ category:
In this image it's disappointing that the person classifier made a correct decision without even using the face regions at all. Perhaps it should be trained on more images with clear faces. Class activation maps look useful for understanding issues like this.
Here's an example with weights from the "not person" category. It looks like it's using large "line-like" regions for making a "not person" decision.
The original CAM method described above requires changing the network structure and then retraining it. This work generalizes CAM so it can be applied to existing networks. In case the network already has a CAM-compatible structure, Grad-CAM reduces to CAM.
The output of grad-cam will be pixels that contribute to the maximization of this target function. If for example you are interested in what maximizes category number 20, then zero out all the other categories.
Simple tensorflow code that does this can look like:
Keras makes this quite easy to obtain, using the backend module. Python code for this can look like this:
Instead of scaling by the spatial average like in the paper, multiplying the gradient images by the conv output images seems more natural to me, since then we get a relevance coefficient for each pixel in each channel.
We can then sum all the scaled channels to obtain the heatmap.
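A minimal numpy sketch of this variant (assuming `conv_output` and `grads` for one image were already extracted from the network, e.g. via `K.gradients` and a backend function):

```python
import numpy as np

def grad_times_activation_heatmap(conv_output, grads):
    """Pixel-wise variant: multiply each conv output image by its gradient
    image, then sum the scaled channels.

    Both inputs are assumed shaped (channels, height, width).
    """
    relevance = conv_output * grads   # a relevance coefficient per pixel, per channel
    heatmap = relevance.sum(axis=0)   # sum all the scaled channels
    return np.maximum(heatmap, 0)
```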
If you are unfamiliar with bazel, then there are some quirks in getting TensorFlow to work with OpenCV, optimizations turned on, and with building shared libraries.
bazel build -c opt :project
The binary will be in bazel-bin/tensorflow/project.
Here we want to build a shared library with C++ code that uses the Tensorflow C++ API. This will probably be the common case for production use, since you will have a large code base with its own build system (like CMake), but you need to call Tensorflow. Building against Tensorflow restricts you to bazel (at least that seems the simplest way for now), but you can create a shared library that can be called from the larger code base.
The main issue is that bazel outputs a shared library containing only Tensorflow symbols (checked with nm -g), and *.o object files with the C++ files compiled. That is kind of weird behaviour, and seems to be an issue with bazel. We will deal with that by just compiling against both files: the actual dynamically loaded shared library will have the Tensorflow part, and our C++ client code in the object file will be linked statically. (You can also just wrap the object file in another shared library.)
The BUILD file should now look like this:
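A sketch of such a BUILD file, using cc_binary with linkshared (the file names here are illustrative; `project.cpp` stands for your own client code):

```
cc_binary(
    name = "libproject.so",
    srcs = ["project.cpp", "project.h"],
    linkshared = 1,
    deps = [
        "//tensorflow/core:tensorflow",
    ],
)
```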
bazel build -c opt --copt="-fPIC" :libproject.so
Now you can compile against both files, for example like this:
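For example, something along these lines (the object file path is illustrative and depends on the bazel version and build configuration, so check what your build actually produced):

```shell
g++ -std=c++11 main.cpp \
    bazel-bin/tensorflow/libproject.so \
    bazel-bin/tensorflow/_objs/libproject.so/project.o \
    -o main
```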
By default the utility uses the VGG16 model, but you can change that to something else.
The entire VGG16 model weights about 500mb.
However we don't need to load the entire model if we only want to explore the convolution filters and ignore the final fully connected layers.
You can download a much smaller model containing only the convolution layers (~50mb) from here:
https://github.com/awentzonline/keras-vgg-buddy
There is a lot of work being done about visualizing what deep learning networks learned.
This is in part due to criticism saying that it's hard to understand what these black box networks have learned, but it is also very useful for debugging them.
Many techniques that propagate gradients back to the input image have become popular lately, like Google's deep dream, or even the neural artistic style algorithm.
I found the Stanford cs231n course section to be a good starting point for all this:
http://cs231n.github.io/understanding-cnn/
This awesome Keras blog post is a very good start for visualizing filters with Keras:
http://blog.keras.io/how-convolutional-neural-networks-see-the-world.html
The idea is quite simple: we want to find an input image that produces the largest output from one of the convolution filters in one of the layers.
To do that, we can perform back propagation from the output of the filter we're interested in, back to an input image. That gives us the gradient of the output of the filter with respect to the input image pixels.
We can use that to perform gradient ascent, searching for the image pixels that maximize the output of the filter.
The output of the filter is an image. We need to define a scalar score function for computing the gradient of it with respect to the image.
One easy way of doing that is just taking the average output of that filter.
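The ascent loop itself can be sketched like this (a toy illustration: `grad_fn` stands in for the Keras backend function that returns the gradient of the filter's mean output with respect to the input image):

```python
import numpy as np

def gradient_ascent(grad_fn, shape, steps=20, step_size=1.0):
    """Search for an input image that maximizes a filter's score."""
    image = np.random.uniform(-0.1, 0.1, size=shape)  # start from a gray-ish image
    for _ in range(steps):
        grads = grad_fn(image)
        grads = grads / (np.sqrt(np.mean(grads ** 2)) + 1e-8)  # normalize the gradient
        image += step_size * grads                             # ascend: maximize the score
    return image

# Toy stand-in for a filter score gradient: a constant positive gradient,
# so the image should simply drift upward.
toy_grad = lambda img: np.ones_like(img)
result = gradient_ascent(toy_grad, (8, 8))
```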
If you look at the filters there, some look kind of noisy.
This project suggested using a combination of a few different regularizations to produce nicer-looking visualizations, and I wanted to try those out.
Let's first look at the visualizations produced with gradient ascent for a few filters from the conv5_1 layer, without any regularization:
Some of the filters did not converge at all, and some have interesting patterns but are a bit noisy.
The first simple regularization they used in "Understanding Neural Networks Through Deep Visualization" is L2 decay.
The calculated image pixels are just multiplied by a constant < 1. This penalizes large values.
Here are the same filters again, using only L2 decay, multiplying the image pixels by 0.8:
Notice how some of the filters contain more information, and a few filters that previously did not converge now do.
The next regularization just smooths the image with a gaussian blur.
In the paper above they apply it only once every few gradient ascent iterations, but here we apply it every iteration.
Here are the same filters, now using only gaussian blur with a 3x3 kernel:
Notice how the structures become thicker, while the rest becomes smoother.
This regularization zeroes pixels that had weak gradient norms.
For each pixel, the gradient is averaged over the RGB channels, and pixels whose average gradient norm falls below a chosen percentile are zeroed.
Even where a pattern doesn't appear in the filter visualization, pixels will have noisy non-zero values.
By clipping weak gradients we can get sparser outputs.
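The three regularizations can be applied together between gradient ascent steps, roughly like this (a single-channel sketch; the 3x3 box filter stands in for the Gaussian blur, and the constants are illustrative):

```python
import numpy as np

def regularize(image, grads, decay=0.8, clip_percentile=30):
    """Apply L2 decay, blur, and weak-gradient clipping to an ascent image."""
    image = image * decay                  # L2 decay: penalize large pixel values
    # Naive 3x3 smoothing (a stand-in for a 3x3 Gaussian blur kernel):
    padded = np.pad(image, 1, mode="edge")
    blurred = sum(padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
                  for dy in range(3) for dx in range(3)) / 9.0
    # Zero pixels whose gradient norm falls below the chosen percentile:
    norms = np.abs(grads)
    mask = norms >= np.percentile(norms, clip_percentile)
    return blurred * mask
```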
Here are 256 filters with the Gaussian blur and L2 decay regularizations, and a small weight for the small norm regularization: