A few months ago we had a small post here discussing different weight initializations, and I remember +Sander Dieleman and a few others had a good discussion. Good weight initialization matters quite a bit, and the rewards are non-trivial.
For example, AlexNet (the fairly popular variant from Alex's "One Weird Trick" paper) converges in 90 epochs using Alex's 0.01-stddev Gaussian initialization.
I retrained it from scratch using the weight initialization from Yann's '98 paper, and it converges to the same error within just 50 epochs, so technically +Alex Krizhevsky could've rewritten the paper with even more stellar results (training AlexNet in 8 hours with 8 GPUs).
In fact, more interestingly, with good weight initialization I could even remove the Local Response Normalization layers in AlexNet with no drop in error.
I've noticed the same trend with several other ImageNet-sized models, like OverFeat and OxfordNet: they converge in far fewer epochs than reported in their papers, just from this small change in weight initialization.
If you want the exact formulae, look at the two links below:
https://github.com/torch/nn/blob/master/SpatialConvolution.lua#L28
https://github.com/torch/nn/blob/master/Linear.lua#L18
And read Yann's '98 paper, Efficient BackProp: http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf
On that note, Surya Ganguli's talk at this year's NIPS workshop on optimal weight initializations is what triggered this post. Check out his papers on that topic; great work.
But I agree, reddit would be way better, I like it much more. But communities can't be moved forcefully, it is what it is :)
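If it helps, here's a rough numpy sketch of the fan-in-based uniform rule in those Torch files (assuming, from a quick read, that they draw from Uniform(-1/sqrt(fanin), 1/sqrt(fanin)); check the linked source for the exact convention):

```python
import numpy as np

def fan_in_uniform(fan_in, fan_out, rng):
    """Fan-in-based uniform init, as in the linked Torch code (sketch)."""
    stdv = 1.0 / np.sqrt(fan_in)  # interval bound, not the realized stddev
    return rng.uniform(-stdv, stdv, size=(fan_out, fan_in))

W = fan_in_uniform(256, 128, np.random.default_rng(0))
# the realized stddev of U[-b, b] is b/sqrt(3), i.e. 1/sqrt(3*fan_in) here
```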
As it happens, there is something particular about the initialization I used for my best network, and I still haven't figured out what it is. The differences in performance aren't huge, but they are significant (and for Kaggle competitions, every fraction of a percent matters). One interesting thing I noted was that I always get better results with initial weights drawn from a Gaussian than with weights drawn from a uniform distribution (with the same variance).
I'll have a look at Surya Ganguli's work. Unfortunately I couldn't make it to NIPS this year :(
EDIT: I realized I should clarify that the 'standard' initialization I was talking about is not the same as the one you link to. I used the version that takes into account both the fan-in and the fan-out of each unit, as in Glorot & Bengio 2010, formula (16): http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
I also tried a Gaussian initialization based on this idea, and it seemed to work better than uniform. Still not as good as my 'custom' initialization though. I suspect it might be an interaction with dropout, which I only applied in the fully connected layers of my models.
I should probably also compare with using only the fan-in (as you did).
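For reference, a quick sketch of that formula (16) 'normalized initialization', plus the Gaussian variant with matching variance:

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Formula (16), Glorot & Bengio 2010: U[+/- sqrt(6/(fanin+fanout))]."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def glorot_gaussian(fan_in, fan_out, rng):
    """Gaussian variant with the same variance, 2/(fanin+fanout)."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W_u = glorot_uniform(512, 256, rng)
W_g = glorot_gaussian(512, 256, rng)
# both have stddev sqrt(2/(fan_in+fan_out)); only the shape differs
```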
In the code, the distribution is Uniform[-stddev, stddev], but to get the desired standard deviation, this should actually be Uniform[-sqrt(3)*stddev, sqrt(3)*stddev].
With that factor of sqrt(3) missing, there isn't really a theoretical justification for this initialization. But it clearly seems to work well, going by Soumith's experience, so what's going on there?
So to get a uniform distribution with a stddev of a, you actually need to specify the interval [-sqrt(3)*a, sqrt(3)*a], and as far as I can see this doesn't happen in the code you linked.
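A quick numerical check of that sqrt(3) factor:

```python
import numpy as np

rng = np.random.default_rng(0)
a = 0.05  # desired standard deviation
x_naive = rng.uniform(-a, a, size=1_000_000)                    # U[-a, a]
x_fixed = rng.uniform(-np.sqrt(3) * a, np.sqrt(3) * a, size=1_000_000)

# U[-a, a] has stddev a/sqrt(3), not a; widening the interval by
# sqrt(3) restores the intended stddev
```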
Paper: http://arxiv.org/abs/1312.6120
Presentation: https://www.youtube.com/watch?v=Ap7atx-Ki3Q&list=PL4VIKuMdJoBETL2PAx14pvmVDLjGuIC1_&index=5
If the variance of the input is around 1, and you use f(x) = 1.7159*tanh(2/3 * x), as suggested in LeCun '98, then the variance of the output after the nonlinearity will also be around 1.
If you use tanh(x), sigmoid(x) or max(x, 0), this obviously no longer holds, but people use them anyway because they just work. The reason is probably that the missing 'correction' factor to bring the variance back to 1 is usually fairly close to 1 anyway, so if your network is not too deep, this doesn't cause huge problems.
For example, tanh(x) with x ~ N(0, 1) has a standard deviation of about 0.63. Note that this now actually depends on the shape of the distribution: tanh(x) with x ~ U[-sqrt(3), sqrt(3)] has a standard deviation of about 0.68.
With today's 20+-layer networks, though (ImageNet 2014), I think finding good initializations is going to gain importance: all these missing scale factors compound into a pretty big scale factor at the top of the network.
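A quick Monte Carlo sketch of those figures (they come out as standard deviations of the post-nonlinearity activations):

```python
import numpy as np

rng = np.random.default_rng(0)
z_gauss = rng.normal(0.0, 1.0, size=1_000_000)                 # N(0, 1)
z_unif = rng.uniform(-np.sqrt(3), np.sqrt(3), size=1_000_000)  # same variance

std_tanh_gauss = np.tanh(z_gauss).std()                  # ~0.63
std_tanh_unif = np.tanh(z_unif).std()                    # ~0.68
std_lecun = (1.7159 * np.tanh(2.0 / 3.0 * z_gauss)).std()  # in the ballpark of 1
```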
I don't think dividing by sqrt(3) helps because of ReLUs. In fact, in my experience what you want to do instead is multiply the standard deviation by sqrt(2) (i.e. double the variance) to compensate for the fact that approximately half the units will be off at the start of training (provided that the biases are initialized to zero, which is not always done).
There must be some other reason why making the target output variance smaller than 1 somehow works a bit better, and the ideal scale factor probably isn't sqrt(3). I really want to know what's going on here, "it just works a bit better" is rather unsatisfying :)
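That sqrt(2) correction (later popularized as "He initialization") is easy to see in a numpy sketch: with the extra factor, activation scale survives a stack of ReLU layers; without it, activations shrink by roughly sqrt(2) per layer.

```python
import numpy as np

def forward_relu_stack(weight_std_fn, n=1024, depth=10, seed=0):
    """Push a random batch through `depth` ReLU layers; return the RMS activation."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, 256))
    for _ in range(depth):
        W = rng.normal(0.0, weight_std_fn(n), size=(n, n))
        x = np.maximum(W @ x, 0.0)  # ReLU kills ~half the units
    return np.sqrt((x ** 2).mean())

rms_he = forward_relu_stack(lambda n: np.sqrt(2.0 / n))     # sqrt(2) correction
rms_naive = forward_relu_stack(lambda n: np.sqrt(1.0 / n))  # plain fan-in scaling
# rms_he stays O(1); rms_naive decays roughly as sqrt(2)^-depth
```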
+Alec Radford: awesome, thanks for the reference! I'll check this out and probably give it a try as well.
function W = ortho_init(sz, g)
% sz: list of layer sizes (any sizes allowed)
% g:  scale factor to apply to each layer
%
% g should be set based on the activation function:
%   linear activations: g = 1 (or greater)
%   tanh activations:   g > 1
%   ReLU activations:   g = sqrt(2) (or greater)
%
% Author: Andrew Saxe, 2014
Nl = length(sz);
for i = 1:Nl
    [R{i}, tmp] = qr(randn(sz(i), sz(i)));
end
for i = 1:Nl-1
    dim = min(size(R{i+1}, 2), size(R{i}, 1));
    W{i} = g * R{i+1}(:, 1:dim) * R{i}(:, 1:dim)';
end
It’s basically an initialization to orthogonal matrices, scaled by an overall gain g. This g should be set for different activation functions:
Activation   g
Linear       1
Tanh         greater than 1
ReLU         sqrt(2)
(our paper shows how to do this for other activation functions)
A couple of papers have reported some luck with this scheme, e.g. http://arxiv.org/pdf/1406.1078v3.pdf and http://papers.nips.cc/paper/5350-learning-to-discover-efficient-mathematical-identities.pdf.
The scaling factor seems to be more important than the orthogonal initialization in many cases, so you may be able to get away with just scaling the Glorot initialization suggestion by the above values.
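A sketch of that shortcut: Glorot-style fan-in/fan-out scaling multiplied by the gain from the table (the tanh gain is only specified as "greater than 1", so the 1.1 below is just a placeholder):

```python
import numpy as np

# gains from the table above; the tanh value is an assumed placeholder
GAINS = {"linear": 1.0, "tanh": 1.1, "relu": np.sqrt(2.0)}

def scaled_glorot(fan_in, fan_out, activation, rng):
    """Glorot/normalized init scaled by an activation-dependent gain."""
    std = GAINS[activation] * np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

W = scaled_glorot(512, 512, "relu", np.random.default_rng(0))
```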
This initialization uses orthogonal matrices, but there’s a bit of subtlety when it comes to undercomplete layers—basically you need to make sure that the paths from the input layer to output layer, through the bottleneck, are preserved. This is accomplished by reusing parts of the same orthogonal matrices across different layers of the network.
In general, scaling up the weights can cause training of even very deep networks to be fast. See for example the plot here http://youtu.be/Ap7atx-Ki3Q?t=15m39s. That shows train and test error after a fixed number of epochs for a 30 layer tanh network as a function of the scaling factor g. Two things to note: first, larger gains make the optimization go super fast (low train error); second, larger gains at some point start hurting generalization error. So that’s the downside… and I think a really interesting candidate for more analysis!
Also keep in mind our paper only analyzes the training error dynamics. From an optimization perspective, we should strive to optimize training error as fast as possible, adding regularization to achieve good generalization—in present practice, we mix these two things. Our models are implicitly regularized because of the way we optimize them by starting with small weights.
So to summarize what I think is a more accurate set of intuitions for why deep learning is hard,
1) Small random initializations are slow because they start you near a saddle point, in particular, the saddle point in which all weight matrices are zero.
2) The time spent near this saddle point is proportional to 1/epsilon where epsilon is the “initial mode strength” defined precisely in the paper, but to get the general idea, it’s a product of the strengths in each layer: epsilon = a_1 * a_2 * … a_n. To get the intuition, just think of a deep linear network with only one neuron per layer—then the initial mode strength a_i just is the initial weight value for that layer, and epsilon is the product of them all. With small random weights you’re multiplying lots of little numbers and so the overall epsilon is going to be exponentially small in depth. Learning is slow since 1/epsilon is huge. If the a_i are > 1, then the product is going to explode—also a problem. A good initialization scheme makes these a_i about equal to one, because then epsilon is about equal to one regardless of depth and learning proceeds quickly.
3) In fact, if epsilon is about 1, then very deep networks are only a constant factor slower than shallow networks.
4) Everything I just described is from deep linear networks. In a nonlinear network, you can get a good initialization by autoencoder pretraining (still improves training speed even now in the era of second order methods—our paper has a tiny lit review in the appendix). This establishes epsilon = 1. Or, you can compensate for your typically compressive nonlinearity by using slightly larger than orthogonal matrices. That’s the gain factor proposed above, and for different nonlinearities you can compute the gain factor that will cause activity to propagate indefinitely, rather than decay away in a deep network.
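The intuitions in (2) and (3) are easy to check numerically; here's a sketch with scalars for the one-neuron-per-layer picture and matrices for the general one:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 100, 20

# scalar picture: epsilon = a_1 * a_2 * ... * a_n
eps_small = np.prod(np.full(depth, 0.1))  # exponentially small in depth
eps_ortho = np.prod(np.full(depth, 1.0))  # 1.0 regardless of depth

# matrix picture: top singular value of the end-to-end linear map
P_small = np.eye(n)
P_ortho = np.eye(n)
for _ in range(depth):
    P_small = rng.normal(0.0, 0.01, size=(n, n)) @ P_small  # small random init
    Q, _ = np.linalg.qr(rng.normal(size=(n, n)))            # orthogonal init
    P_ortho = Q @ P_ortho

norm_small = np.linalg.svd(P_small, compute_uv=False)[0]  # vanishingly small
norm_ortho = np.linalg.svd(P_ortho, compute_uv=False)[0]  # stays at 1
```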
This part of your comment is intriguing, and I think I don't fully understand it: "basically you need to make sure that the paths from the input layer to output layer, through the bottleneck, are preserved. This is accomplished by reusing parts of the same orthogonal matrices across different layers of the network."
Could you elaborate on this a bit? What does this look like in practice? Many practical neural nets have undercomplete layers ("bottleneck layers" were a huge buzzword at ICASSP this year), so I'm wondering what's so special about them and how your "reuse parts of the same orthogonal matrices"-trick works.
EDIT: also, I'm curious if any special care needs to be taken with convolutional layers. I guess those will typically be 'undercomplete' in practice. For example, if a convolutional layer has 32 3x3 filters, and the input has 32 feature maps, then the number of filters is 9 times smaller than the fanin of a filter.
To my knowledge the paper claimed they couldn't train beyond 11 weight layers with standard random initialization, and this was 16 weight layers.
The paper says "We note that it might be possible to appropriately initialise learning without pre-training, e.g. by adjusting the random initialisation procedure [19]."
So I guess that's a yes then!
Suppose you have two weight matrices, W_1 and W_2, and W_1 goes to a tiny layer of neurons, e.g., it has very few rows. In the linear case, the overall function the network computes is just y = W_2W_1x, which you can rewrite as y = Wx for some W. But note that you can't represent any possible W with this network. The small W_1 layer imposes a rank constraint on the overall function. If you have 10 neurons in the W_1 layer, you can't represent more than a rank 10 linear function. If you took the SVD of W, i.e. W=USV^T, there could only be 10 nonzero elements in S.
Now about initializations: take a net with three weight matrices, W_1 through W_3, with two bottleneck layers, one between W_1 and W_2 (so W_1 has few rows), and one at the output of W_3 (so W_3 has few rows). To be concrete let’s say the undercomplete layers are size 10 and the others are size 100. We can break down each weight matrix using the SVD to get W_i = U_i S_i V_i^T. Then the overall map is y = U_3 S_3 V_3^T U_2 S_2 V_2^T U_1 S_1 V_1^Tx. At the undercomplete layers, the singular values S_1 and S_3 have the form [D_i 0] where D_i is a 10x10 matrix with the singular values on the diagonal, and the 0 there is a 10x90 matrix of zeros to fill out the rest. This means that inputs aligned with the first 10 input singular vectors of V_1 or V_3 will get passed through, but everything else won’t make it. At the overcomplete layer, S_2 has the form [D_2; 0] where the 0 is 90x10 now. All 10 input directions make it to the output, but there are 90 unused output directions in the overcomplete layer that aren’t hooked up to any input.
To be a good initialization, you want the overall function to have singular values of about one. You might think that you could just set all these D_i matrices equal to the identity to get this. This is sort of like an “orthogonal” initialization but our matrices aren’t square and so can’t be orthogonal in the standard sense that U^TU=I=UU^T. Now here’s the crucial bit: suppose you are quite unlucky and the 10 output singular vectors U_2 with nonzero singular values are aligned with 10 of the 90 input singular vectors of V_3 with zero singular values. Then W_3W_2 = 0. And so the overall function y = W_3W_2W_1x=0, regardless of the input x. The first bottleneck has funneled all of the information into directions that the second bottleneck wipes out. Or said another way, after the first bottleneck all of the activity lies in one 10 dimensional subspace of the 100 dimensional overcomplete layer, but the second bottleneck is only listening to information in some other 10 dimensional subspace.
The fix is to tie together the input singular vectors and output singular vectors across layers to make sure this doesn’t happen. I.e. recall that y = U_3 S_3 V_3^T U_2 S_2 V_2^T U_1 S_1 V_1^Tx. Choose the initialization such that V_3=U_2 and V_2=U_1. Then the overall function is y = U_3 S_bottleneck V_1^Tx where S_bottleneck has 10 nonzero elements (or whatever the dimension of the smallest bottleneck is) all equal to one, as desired. The code I posted does that (you can see each matrix R gets used twice, once as an output singular vector matrix U and once as an input singular vector matrix V^T).
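Here's a numpy sketch of that construction with hypothetical sizes (100-unit layers, 10-unit bottlenecks), mirroring the Matlab code above: each orthogonal matrix is reused as the output singular vectors of one layer and the input singular vectors of the next.

```python
import numpy as np

rng = np.random.default_rng(0)
sz = [100, 10, 100, 10]  # input, bottleneck, hidden, output (assumed sizes)

# one orthogonal matrix per layer size, reused across adjacent weights
R = [np.linalg.qr(rng.normal(size=(s, s)))[0] for s in sz]

W = []
for i in range(len(sz) - 1):
    dim = min(sz[i], sz[i + 1])
    # R[i+1] plays U (output singular vectors), R[i] plays V (input ones)
    W.append(R[i + 1][:, :dim] @ R[i][:, :dim].T)

# end-to-end map: rank 10, with all nonzero singular values equal to 1,
# because the tied R matrices cancel pairwise (R^T R = I on the kept columns)
P = W[2] @ W[1] @ W[0]
svals = np.linalg.svd(P, compute_uv=False)
```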
+Alec Radford that’s quite encouraging to hear it helps with convolutional nets! I haven’t been sure whether this analysis would apply. As +Sander Dieleman noted, the solution we found isn’t clearly applicable to convolutional nets (a side note about that, it seems conv layers can be overcomplete in practice too, typically the very first layer. For example the first layer of Krizhevsky’s net goes from a 150k-dimensional input to a 250k-dimensional output). Alec’s approach of using orthogonal filters would seem to give independence between filters, which apparently is helpful, but you might also be able to reduce coupling between a single filter and itself at nearby locations. It’d be great to see some solutions of the gradient descent dynamics specialized for linear conv nets if that’s possible. I should try that :).
As you mentioned convolutional layers are sometimes overcomplete, but in most cases they seem to be undercomplete, so if you stack 20 of them I guess the probability of running into this problem becomes nonnegligible. I'd love to see some quantitative results :)
Interesting side note: very deep nets can learn using tanh and small update sizes / Nesterov momentum even with standard initializations. A 30-layer MNIST net learns when you just use uniform(-0.05, 0.05), but it's slow compared to orthogonal. ReLU with uniform/normal doesn't ever seem to learn in this setup.
This seems to line up with observations that ReLU RNNs are difficult to train, which raises the question: maybe they'll work now if orthogonal initialization is used?
Between layers, there is no problem - just like a feedforward net. ReLU, dropout, etc. are all helpful; see http://arxiv.org/abs/1409.2329 .
Dunno what to comment on the deep deep MNIST thing, but my ongoing RNN work is here: https://github.com/kastnerkyle/net/blob/master/net.py
Is this another case of someone originally choosing an arbitrary number that seemed to work, and others sticking with it? I'm thinking of the 0.01-stddev Gaussian weight initialization from the AlexNet paper.
It varied from problem to problem, but for MNIST-style problems I got away with upper bounds of 2.0; for larger networks I had to up the bound to about 8.0. This got me within 1% of the error of ReLU nets.
(cf. Fig 4)
http://ericbattenberg.com/school/drum_patterns_ismir2012.pdf
I chose 20 to get a specific noise level on the reconstructions, but I'm thinking 20 is some sort of universal magic number. Might as well stick with it. ;)
https://github.com/iassael/torch-linear-orthogonal
thank you for your constant support!
Do you understand why AlexNet used LRN at all? The OverFeat "fast" model didn't, and it's a slight variation on the one-column AlexNet (with more feature maps) and "Krizhevsky initialization". In fact, it's very rare to see LRN elsewhere. In the paper they do say that it helps, but why would it help AlexNet and not OverFeat "fast"?
Krizhevsky et al. are quite explicit in the paper that LRN does not help learning, but
"""
However, we still find that the following local normalization scheme aids
generalization.
"""
and they tend to specify those things explicitly, for example:
"""
We found that this small amount of weight decay was important for the model to learn. In other words, weight decay here is not merely a regularizer: it reduces the model’s training error.
"""
Instead of deriving gain coefficients to ensure mean-0, variance-1 layer outputs, one could use data.
Initialize with Saxe's orthogonal scheme, do a forward pass, then subtract the mean and divide by the std of each layer's output.
It is very simple and works quite well, as found independently by me (http://arxiv.org/abs/1511.06422) and by a Berkeley group (http://arxiv.org/abs/1511.06856).
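For concreteness, a minimal sketch of that procedure (simplified relative to the papers: square layers, only the variance-rescaling step, with the mean subtraction left to the biases):

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [64, 64, 64, 64]        # hypothetical all-square net for simplicity
X = rng.normal(size=(256, 64))  # a batch of "data"

weights = []
h = X
for i in range(len(sizes) - 1):
    Q, _ = np.linalg.qr(rng.normal(size=(sizes[i], sizes[i + 1])))  # Saxe-style init
    pre = h @ Q
    Q /= pre.std()                 # data-dependent rescaling: unit pre-activation std
    h = np.maximum(h @ Q, 0.0)     # ReLU forward pass with the corrected weights
    weights.append(Q)
```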
P.S. Sorry for necro-posting and the self-promotion :)
I'm a little late to this party, but I'm interested in the Ganguli talk you mentioned. I can't quite figure out which work of his you're referring to. Do you know if there is a paper / video of this online somewhere?
And here's an earlier talk from Saxe on the same paper: https://www.youtube.com/watch?v=Ap7atx-Ki3Q