63

I'm currently going through Bergstra and Bengio's Random Search for Hyper-Parameter Optimization [1], where the authors claim that random search is more efficient than grid search at achieving approximately equal performance.

My question is: Do people here agree with that claim? In my work I've been using grid search mostly because of the lack of tools available to perform random search easily.

What is the experience of people using grid vs. random search?

3
  • Random search is better and should always be preferred. However, it would be even better to use dedicated libraries for hyperparameter optimization, such as Optunity, hyperopt or bayesopt. Commented Mar 2, 2016 at 0:36
  • Bengio et al. write about it here: papers.nips.cc/paper/… So, GP works the best, but RS also works great.
    – Guy L
    Commented Mar 27, 2016 at 10:25
  • 14
    @Marc When you provide a link to something you're involved with, you should make your association with it clear (one or two words can suffice, even something as brief as referring to it as our Optunity should do); as the help on behavior says, "if some ... happen to be about your product or website, that’s okay. However, you must disclose your affiliation"
    – Glen_b
    Commented Mar 27, 2016 at 13:33

6 Answers

63

Random search has a 95% probability of finding a combination of parameters within the top 5% of the search space with only 60 iterations. Also, unlike some other methods, it doesn't get bogged down in local optima.

Check out this great blog post at Dato by Alice Zheng, specifically the section Hyperparameter tuning algorithms.

I love movies where the underdog wins, and I love machine learning papers where simple solutions are shown to be surprisingly effective. This is the storyline of “Random search for hyperparameter optimization” by Bergstra and Bengio. [...] Random search wasn’t taken very seriously before. This is because it doesn’t search over all the grid points, so it cannot possibly beat the optimum found by grid search. But then came along Bergstra and Bengio. They showed that, in surprisingly many instances, random search performs about as well as grid search. All in all, trying 60 random points sampled from the grid seems to be good enough.

In hindsight, there is a simple probabilistic explanation for the result: for any distribution over a sample space with a finite maximum, the maximum of 60 random observations lies within the top 5% of the true maximum, with 95% probability. That may sound complicated, but it’s not. Imagine the 5% interval around the true maximum. Now imagine that we sample points from this space and see if any of them land within that interval. Each random draw has a 5% chance of landing in that interval; if we draw $n$ points independently, then the probability that all of them miss the desired interval is $\left(1-0.05\right)^{n}$. So the probability that at least one of them succeeds in hitting the interval is 1 minus that quantity. We want at least a 0.95 probability of success. To figure out the number of draws we need, just solve for $n$ in the equation:

$$1-\left(1-0.05\right)^{n} > 0.95$$

We get $n\geqslant60$. Ta-da!

The moral of the story is: if the close-to-optimal region of hyperparameters occupies at least 5% of the grid surface, then random search with 60 trials will find that region with high probability.

You can improve that chance with a higher number of trials.

All in all, if you have too many parameters to tune, grid search may become infeasible. That's when I try random search.
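If you want to sanity-check that arithmetic yourself, here is a minimal Python sketch (the 5%/95% figures are just the values from the quote; strictly the inequality gives n = 59, which the blog rounds up to a convenient 60):

```python
import math
import random

# Smallest n with 1 - (1 - 0.05)**n > 0.95, i.e. 0.95**n < 0.05
n = math.ceil(math.log(0.05) / math.log(1 - 0.05))
print(n)  # 59 -- the blog rounds this up to 60

# Monte Carlo check: chance that at least one of 60 uniform draws
# lands in the top 5% of the unit interval
trials = 100_000
hits = sum(
    any(random.random() > 0.95 for _ in range(60)) for _ in range(trials)
)
print(hits / trials)  # roughly 0.95
```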

2
  • 4
    Link to the blog post is down :( Could this be the same article? oreilly.com/ideas/evaluating-machine-learning-models/page/5/…
    – n1k31t4
    Commented Jun 25, 2017 at 22:02
  • @DexterMorgan Hey, thanks for the heads up. Yeah, the blog is apparently down, and I'm not sure I should link to other sources which might not be "official", so I'll just leave it as is for now I think.
    – Firebug
    Commented Jun 28, 2017 at 4:40
15

Look again at the graphic from the paper (Figure 1). Say that you have two parameters. With a 3x3 grid search you check only three different values of each parameter (three rows and three columns on the plot on the left), while with nine random trials you check nine (!) different values of each parameter (nine distinct rows and nine distinct columns).

[Figure: grid search vs. random search, Figure 1 from the paper]

Obviously, random search may, by chance, fail to be representative of the full range of each parameter, but as the sample size grows, the chance of that gets smaller and smaller.
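To put numbers on this, here is a minimal sketch comparing how many distinct values of each parameter nine grid trials and nine random trials explore (the 3x3 grid over the unit square is just an illustrative choice):

```python
import random

random.seed(0)

# 3x3 grid: nine trials, but only three distinct values per parameter
grid = [(x, y) for x in (0.25, 0.5, 0.75) for y in (0.25, 0.5, 0.75)]

# Nine random trials: nine distinct values per parameter (almost surely)
rand = [(random.random(), random.random()) for _ in range(9)]

for name, points in (("grid", grid), ("random", rand)):
    xs = {x for x, _ in points}
    ys = {y for _, y in points}
    print(name, len(xs), len(ys))  # grid: 3 3, random: 9 9
```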

2
  • You could get the same effect with a rotated grid. Commented Sep 12, 2020 at 13:59
  • If the dimensionality n is larger than two, then random sampling is guaranteed to win over the grid sampling. This is well known in Monte-Carlo integration.
    – Youjun Hu
    Commented Nov 25, 2022 at 3:12
8

If you can write a function to do grid search, it's probably even easier to write a function to do random search, because you don't have to pre-specify and store the grid up front.
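For instance, a bare-bones random search can look like the sketch below; `score` and the samplers in `sample_space` are hypothetical stand-ins for whatever your modelling setup provides:

```python
import random

def random_search(score, sample_space, n_trials=60, seed=0):
    """Draw hyperparameters independently and keep the best trial.

    score:        callable taking a dict of hyperparameters -> float
    sample_space: dict mapping parameter name -> sampler taking a
                  random.Random instance
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: draw(rng) for name, draw in sample_space.items()}
        s = score(params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score

# No grid to pre-specify or store: just one sampler per hyperparameter
space = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-4, -1),
    "max_depth": lambda rng: rng.randint(2, 10),
}
```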

Setting that aside, methods like LIPO, particle swarm optimization and Bayesian optimization make intelligent choices about which hyperparameters are likely to be better, so if you need to keep the number of models fit to an absolute minimum (say, because it's expensive to fit a model), these tools are promising options. They're also global optimizers, so they have a high probability of locating the global maximum. Some of the acquisition functions of BO methods have provable regret bounds, which make them even more attractive.
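As one concrete option, scikit-optimize exposes Bayesian optimization through `gp_minimize`; a rough sketch (the objective is a toy stand-in for an expensive model fit, and the bounds are made-up values):

```python
from skopt import gp_minimize  # scikit-optimize

# Toy stand-in for an expensive cross-validated loss (lower is better)
def objective(params):
    learning_rate, max_depth = params
    return (learning_rate - 0.01) ** 2 + (max_depth - 5) ** 2 / 100

result = gp_minimize(
    objective,
    dimensions=[(1e-4, 1e-1), (2, 10)],  # learning rate, max depth
    n_calls=30,  # keep the number of (expensive) evaluations small
    random_state=0,
)
print(result.x, result.fun)
```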

More information can be found in these questions:

What are some of the disadvantages of Bayesian hyperparameter optimization?

Optimization when Cost Function Slow to Evaluate

3

By default, random search and grid search are terrible algorithms unless one of the following holds.

  • Your problem does not have a global structure, e.g., it is multimodal and the number of local optima is huge.
  • Your problem is noisy, i.e., evaluating the same solution twice leads to different objective function values.
  • The budget of objective function calls is very small compared to the number of variables, e.g., smaller than 1x or 10x the number of variables.
  • The number of variables is very small, e.g., smaller than 5 (in practice).
  • A few other conditions hold.

Most people claim that random search is better than grid search. However, note that when the total number of function evaluations is predefined, grid search will lead to a good coverage of the search space, no worse than random search with the same budget, and the difference between the two is negligible, if any. If you start to add some assumptions, e.g., that your problem is separable or almost separable, then you will find arguments to support grid search. Overall, both are comparably terrible except in very few cases. Thus, there is no need to distinguish between them unless some additional assumptions about the problem are considered.
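To make the coverage claim concrete, here is a rough sketch that estimates the dispersion (worst-case distance from any point of the space to its nearest sample) of a 4x4 grid versus 16 random points; the budget of 16 and the mesh resolution are arbitrary choices:

```python
import math
import random

random.seed(0)
budget = 16  # 4x4 grid vs 16 random points in the unit square

grid = [((i + 0.5) / 4, (j + 0.5) / 4) for i in range(4) for j in range(4)]
rand = [(random.random(), random.random()) for _ in range(budget)]

def dispersion(samples, mesh=50):
    """Worst-case distance from a mesh point to its nearest sample."""
    worst = 0.0
    for i in range(mesh):
        for j in range(mesh):
            p = ((i + 0.5) / mesh, (j + 0.5) / mesh)
            worst = max(worst, min(math.dist(p, s) for s in samples))
    return worst

print(dispersion(grid))  # small, fixed coverage gap for the grid
print(dispersion(rand))  # typically larger for any single random draw
```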

1
  • 1
    Can you propose something better? How can we know what is best if we don't try? Seems to me random search over many models is the best compromise solution.
    – JPErwin
    Commented Nov 20, 2018 at 16:53
0

Finding a spot within the top 5% of a 2D topography with only one maximum can be done with a quadrant search: each round of four observations narrows the candidate region to 25%, then 6.25%, then 1.5625% of the original area, so about 16 observations suffice, as long as each round of four observations correctly determines which quadrant the maximum (extremum) is in. A 1D topography can be bisected the same way: 50%, 25%, 12.5%, 6.25%, 3.125%, i.e., about 5 rounds of 2 observations. I guess people searching for multiple far-flung local maxima use a big initial grid search and then regression or some other prediction method. In 1D, a grid of 60 observations should have one observation within 100/60 ≈ 1.66% of the extremum. See Global optimization on Wikipedia. I still think there is always a better method than randomness.
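Here is a hypothetical sketch of that quadrant idea for a unimodal 2D function; picking the quadrant by its centre value is only a heuristic, and as noted above the scheme works only if each round identifies the correct quadrant:

```python
def quadrant_search(f, lo=(0.0, 0.0), hi=(1.0, 1.0), rounds=4):
    """Each round spends 4 observations on the quadrant centres and keeps
    the best quadrant: 25%, 6.25%, 1.5625%, ... of the area remains."""
    def centre(box):
        (x0, y0), (x1, y1) = box
        return ((x0 + x1) / 2, (y0 + y1) / 2)

    for _ in range(rounds):
        mx, my = centre((lo, hi))
        quads = [((lo[0], lo[1]), (mx, my)), ((mx, lo[1]), (hi[0], my)),
                 ((lo[0], my), (mx, hi[1])), ((mx, my), (hi[0], hi[1]))]
        # Keep the quadrant whose centre scores best
        lo, hi = max(quads, key=lambda q: f(centre(q)))
    return centre((lo, hi))

# Toy unimodal objective with its maximum at (0.3, 0.7)
best = quadrant_search(lambda p: -((p[0] - 0.3) ** 2 + (p[1] - 0.7) ** 2))
print(best)  # close to (0.3, 0.7) after 4 rounds = 16 observations
```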

1
  • Simulated annealing is one form of random search that has been around for a number of years. Commented Apr 7, 2017 at 22:43
-3

As Tim showed, you can test more parameter values with random search than with grid search. This is especially efficient if some of the parameters you test turn out to have little impact on your problem, like the 'Unimportant parameter' in Figure 1 of the paper.

[Figure 1 from the paper: grid search vs. random search over an important and an unimportant parameter]

I wrote a post about hyperparameter tuning where I explain the differences between grid search, random search, and Bayesian optimization. You can check it out (and let me know if it was useful; feedback is appreciated!)

