The One-Cycle Policy: What it is, and what it’s good for

Deep learning is getting more and more effective at generating stunning results, and it’s also becoming increasingly accessible for non-experts. However, one area in which experience (or luck, or blind pattern matching) has continued to be necessary is in setting hyper-parameters. (In deep learning, parameters refer to what is actually learned; hyper-parameters refer to values that affect the training process itself.) The one-cycle approach described in Leslie Smith’s paper arXiv:1803.09820 is a systematic and efficient way to find good hyper-parameters. The ones Smith focuses on here are learning rate, batch size, momentum, and weight decay, with the observation that good settings for these parameters are interdependent, so it makes sense to set them together.

Central to this paper is the idea of over- versus under-fitting your data. If your model has under-fit the data, this means that there is complexity in the underlying structure of the data that the model doesn’t adequately capture. Another way of putting this is that the model isn’t powerful enough to describe the data well. An extreme example of this would be an image classifier that always says an image is of a cat, regardless of the image. While there are lots of pictures of cats on the internet, there are also pictures of non-cats, and our faulty classifier would be missing this nuance.

On the other hand, if the model has over-fit the data, there’s too much complexity in the model for the underlying data; some of what has been fit by the model is actually just noise. An image classification example of this would be a model learning to recognize the specific cat pictures in the training set but not being able to recognize cats in new pictures. The error a model makes on new, unseen data is called its “generalization error,” and overfitting drives it up.

By looking at a couple of values early in the training process, we can get clues as to whether we’re under- or overfitting the data, and can make adjustments accordingly. In particular, we’re going to look at the training loss (a measure of how “off” the model is on the training set) and compare it to the test loss (a similar measure, but using novel data from the same or a similar distribution as the training data).

According to this paper, the primary hallmark of underfitting is that the test loss keeps decreasing during training instead of levelling off. Why isn’t this just a sign of training too slowly? I don’t know. Either way, we’d like to minimize our test loss with the least amount of training, so if we’re not finding a minimum we’re being inefficient with our resources. Conversely, the primary hallmark of overfitting is that the test loss – the loss on the data that the model hasn’t seen before – decreases for a while with training (this part is normal), but then starts increasing. This is the sign that the model is starting to generalize worse. The overall strategy here is going to be running through our training data a few times in order to find good hyper-parameters before we commit to a lot of training.
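
To make this diagnostic concrete, here’s a rough sketch (mine, not from the paper) of reading under- and over-fitting off a test/validation loss curve; the loss values in the example are made up.

```python
# A minimal sketch of the heuristics above: is the test loss still falling,
# has it levelled off, or has it started rising again? The numbers are made up.
def diagnose(val_losses):
    """Rough read on under-/over-fitting from the validation loss curve so far."""
    if len(val_losses) < 3:
        return "too early to tell"
    if val_losses[-1] > min(val_losses[:-1]):
        return "possible overfitting: test loss has started rising"
    if val_losses[-1] < val_losses[-2] < val_losses[-3]:
        return "possible underfitting: test loss is still steadily falling"
    return "test loss has roughly levelled off"

print(diagnose([2.3, 1.8, 1.5, 1.3, 1.2]))   # still falling -> maybe underfitting
print(diagnose([2.3, 1.8, 1.6, 1.7, 1.9]))   # rising again -> maybe overfitting
```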

There have been two main conventional ways of evaluating different combinations of hyper-parameters: grid search and random search. In a grid search, you pick a set of evenly spaced candidate values for each hyper-parameter and evaluate how every combination of them performs (hence the grid). In a random search, you do basically the same thing but picking random points in the space of possible hyper-parameters instead of evenly spaced ones. Since there are so many possible hyper-parameter combinations, doing searches this way can be really time consuming, hence the need for a new strategy.
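
Here’s a toy sketch of what those two search styles look like in code; train_and_score is a made-up stand-in for a full training run, and the hyper-parameter ranges are arbitrary.

```python
import itertools
import random

# Dummy stand-in for training a model and returning a validation score.
def train_and_score(lr, weight_decay):
    return -(abs(lr - 0.01) + abs(weight_decay - 1e-4))  # pretend score

# Grid search: evenly spaced candidate values, every combination tried.
lrs = [0.001, 0.01, 0.1]
wds = [0.0, 1e-4, 1e-3]
grid_results = [(lr, wd, train_and_score(lr, wd))
                for lr, wd in itertools.product(lrs, wds)]

# Random search: same budget of trials, but sampled points instead of a grid.
random.seed(0)
random_results = []
for _ in range(9):
    lr = 10 ** random.uniform(-4, -1)   # sample on a log scale
    wd = 10 ** random.uniform(-6, -2)
    random_results.append((lr, wd, train_and_score(lr, wd)))

print(max(grid_results, key=lambda r: r[2]))
print(max(random_results, key=lambda r: r[2]))
```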

The central characters in our new strategy are the learning rate and the momentum. In gradient descent, we find the direction in which the loss decreases most quickly, and we take a step downwards in that direction. The learning rate determines the size of our step. Finding a good setting for the learning rate is important because if it’s too small, training will take a really long time, but if it’s too big, training might overshoot the minimum and fail to converge.
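
To make the learning rate’s role concrete, here’s a bare-bones sketch of gradient descent on a toy one-dimensional loss (not a neural network); the learning rate scales every step.

```python
# Gradient descent on a toy quadratic loss, to show the learning rate as step size.
def loss(w):
    return (w - 3.0) ** 2       # minimized at w = 3

def grad(w):
    return 2.0 * (w - 3.0)      # derivative of the loss

w = 0.0
learning_rate = 0.1             # too small: slow progress; around 1.0 or above: overshoots
for step in range(50):
    w = w - learning_rate * grad(w)   # step downhill, scaled by the learning rate
print(w, loss(w))               # close to (3.0, 0.0)
```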

A recent development in learning rate setting has been the use of cyclical learning rates, or CLR. In CLR, the learning rate is not constant but instead varies between two values. The user still needs to determine the minimum and maximum values though, as well as how long (in terms of either epochs or iterations) each cycle should take.
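
A basic triangular CLR schedule is easy to sketch; the minimum, maximum, and cycle length below are placeholder values, not recommendations.

```python
# Triangular cyclical learning rate: rise linearly from min_lr to max_lr over the
# first half of each cycle, then fall back down over the second half.
def triangular_clr(iteration, min_lr=1e-4, max_lr=1e-3, cycle_len=2000):
    half = cycle_len / 2
    position = iteration % cycle_len
    if position < half:
        fraction = position / half                 # rising phase
    else:
        fraction = 1.0 - (position - half) / half  # falling phase
    return min_lr + fraction * (max_lr - min_lr)

# One full cycle: 1e-4 up to 1e-3 by iteration 1000, back to 1e-4 by iteration 2000.
for it in (0, 500, 1000, 1500, 2000):
    print(it, triangular_clr(it))
```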

How do we pick the minimum and maximum learning rates to cycle between? By using a learning rate range test. The idea is that we do a pre-training run, in which we start training with a small learning rate. With this small learning rate, the model starts converging, and the loss decreases. Then, we gradually increase the learning rate and see how big it can get before training starts to diverge (which we observe as the loss suddenly getting extremely large). This gives us a maximum usable learning rate. To pick the minimum rate, the recommendation for CLR is to pick a value 3 or 4 times smaller than the maximum; for one cycle (which we’ll get to momentarily) the recommendation is to pick a value 10 or 20 times smaller.
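
Here’s roughly what a learning rate range test looks like in code; compute_batch_loss is a hypothetical helper that runs one mini-batch update at a given learning rate and returns the training loss, and the dummy version below exists only so the sketch runs.

```python
import math

# Learning rate range test: increase the learning rate exponentially and record
# the loss until training diverges (loss blows up relative to the best seen).
def lr_range_test(compute_batch_loss, start_lr=1e-7, end_lr=10.0, num_steps=100):
    lrs, losses = [], []
    best_loss = math.inf
    for step in range(num_steps):
        lr = start_lr * (end_lr / start_lr) ** (step / (num_steps - 1))
        loss = compute_batch_loss(lr)
        lrs.append(lr)
        losses.append(loss)
        best_loss = min(best_loss, loss)
        if loss > 4 * best_loss:        # loss has exploded: training is diverging
            break
    return lrs, losses                  # plot these and read off the max usable rate

def demo_loss(lr):
    # Dummy stand-in: the "loss" improves with bigger steps until lr passes 1.0, then blows up.
    return 1.0 / (1.0 + 100 * lr) if lr < 1.0 else 100.0 * lr

lrs, losses = lr_range_test(demo_loss)
print(lrs[-1])   # roughly where divergence set in
```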

The big change proposed in this paper is using only one cycle through the data, and the reason for this is something called super-convergence. Super-convergence seems to be a cutting-edge, not-totally-understood development that’s described in more detail in another recent paper (Smith and Topin, 2017). In super-convergence, training can be done with learning rates that are much higher than normally used, with many fewer iterations to reach benchmark results (e.g. 10,000 iterations instead of 80,000), and in just one cycle of the learning rate (meaning one increasing and one decreasing phase). This means that training can go much faster, and our model will be less prone to overfitting. One insight seems to be that since larger learning rates have a regularizing effect, you need to turn down other sources of regularization to reach super-convergence. However, this is not the main point of the paper; the main point is the heuristic of cycling the learning rate up and then down again, with one slight modification: an additional segment at the end in which the learning rate keeps decreasing below the initial minimum. This learning rate schedule is the 1cycle policy. Basically, the learning rate starts off small so the model starts to converge. Next, the learning rate increases, so that training can proceed more quickly. (There’s an additional benefit that larger learning rates oppose overfitting.) Finally, the learning rate decreases again so the fit can be fine-tuned.
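
Here’s a rough sketch of the 1cycle schedule as I understand it; the phase lengths and the specific rates are placeholders of mine, not values from the paper.

```python
# 1cycle schedule: one ramp up from min_lr to max_lr, one ramp back down, then a
# short final phase that keeps decreasing well below the initial minimum.
def one_cycle_lr(step, total_steps, min_lr=4e-4, max_lr=4e-3):
    ramp = int(total_steps * 0.45)            # 45% up, 45% down, 10% final decay (an assumption)
    if step < ramp:                           # phase 1: small -> large
        return min_lr + (max_lr - min_lr) * step / ramp
    if step < 2 * ramp:                       # phase 2: large -> small
        return max_lr - (max_lr - min_lr) * (step - ramp) / ramp
    # phase 3: decay below the initial minimum for the last stretch of training
    remaining = (step - 2 * ramp) / (total_steps - 2 * ramp)
    return min_lr * (1.0 - 0.99 * remaining)

for s in (0, 450, 900, 1000):
    print(s, one_cycle_lr(s, total_steps=1000))
```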

Momentum is similar to learning rate, in that it also affects the size of the updates during gradient descent. In gradient descent with momentum, information about previous gradients is used during each update step. If previous gradients have been steep in a particular direction, the size of the update will be larger in that direction. In contrast, if previous gradients have been varying, the update step will be smaller. So setting momentum well can make training much more efficient. Unlike with learning rate, though, Smith found experimentally that using a momentum range test wasn’t useful. How then should we find good values for momentum?
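
In code, the momentum update looks something like this; the toy quadratic loss and the specific values are just for illustration.

```python
# Gradient descent with momentum on a toy quadratic loss: past gradients
# accumulate in a velocity term, so steps grow when gradients keep agreeing.
def grad(w):
    return 2.0 * (w - 3.0)      # gradient of (w - 3)^2

w, velocity = 0.0, 0.0
learning_rate, momentum = 0.1, 0.9
for step in range(100):
    velocity = momentum * velocity + grad(w)   # previous gradients keep contributing
    w = w - learning_rate * velocity           # update scaled by the accumulated velocity
print(w)                                       # close to 3.0
```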

The advice is to use cyclical momentum, cycling in the opposite direction from the learning rate and varying between a maximum somewhere in the range of 0.9 to 0.99 and a minimum of 0.8 or 0.85. At the beginning of training, the momentum starts off relatively large and decreasing, while the learning rate is small and increasing. Later in training, the learning rate is decreasing and the momentum is increasing. Smith reports that using cyclical momentum gave better results than just decreasing momentum, though he doesn’t offer an explanation for why that might be. He also compared results on shallow and deep networks, and found the same pattern. For the deep learning practitioner, this is really good to know, because it means we can probably act on this advice regardless of the kind of network architecture we’re working with.
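
If you happen to be working in PyTorch, its built-in OneCycleLR scheduler can cycle the momentum in the opposite direction to the learning rate along these lines; here’s a toy usage sketch (a linear model on random data) with placeholder settings rather than anything from the paper.

```python
import torch

# Toy setup: a linear model on random data, just to show the scheduler wiring.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.95)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,              # placeholder; in practice, found with a range test
    total_steps=100,
    cycle_momentum=True,     # momentum is cycled opposite to the learning rate
    base_momentum=0.85,
    max_momentum=0.95,
)

x, y = torch.randn(100, 10), torch.randn(100, 1)
for step in range(100):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()         # update learning rate and momentum once per batch
```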

Finally, Smith also addresses weight decay and batch size, but the guidance for these is much more straightforward. He recommends using a grid search to find a good value for weight decay (and to try values of 10⁻³, 10⁻⁴, 10⁻⁵, and 0 if you have no idea), and to use the largest batch size possible given your hardware. It’s important to set weight decay and batch size well, and it’s nice to have a systematic approach to them. But the most exciting point of the paper in my opinion is the discovery of super-convergence and the advice for taking advantage of it. Excitingly, this paper is listed as Part I; Smith says that Part II will tackle architecture, regularization, data set, and task.
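
As a closing aside, that weight-decay search is simple enough to sketch; train_and_validate below is a hypothetical stand-in for a full training-plus-evaluation run, so the loop actually runs.

```python
# Grid search over the suggested weight decay values, keeping everything else fixed.
def train_and_validate(weight_decay):
    return abs(weight_decay - 1e-4)          # dummy validation loss; pretend 1e-4 is best

candidates = [1e-3, 1e-4, 1e-5, 0.0]         # the values suggested above
results = {wd: train_and_validate(wd) for wd in candidates}
best_wd = min(results, key=results.get)
print(results, best_wd)
```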