63

I'm currently going through Bergstra and Bengio's Random Search for Hyper-Parameter Optimization [1], where the authors claim that random search is more efficient than grid search at achieving approximately equal performance.

My question is: Do people here agree with that claim? In my work I've been using grid search mostly because of the lack of tools available to perform random search easily.

What is the experience of people using grid vs. random search?

3
  • Random search is better and should always be preferred. However, it would be even better to use dedicated libraries for hyperparameter optimization, such as Optunity, hyperopt or bayesopt. Commented Mar 2, 2016 at 0:36
  • Bengio et al. write about it here: papers.nips.cc/paper/… So, GP works the best, but RS also works great.
    – Guy L
    Commented Mar 27, 2016 at 10:25
  • 14
    @Marc When you provide a link to something you're involved with, you should make your association with it clear (one or two words can suffice, even something as brief as referring to it as our Optunity should do); as the help on behavior says, "if some ... happen to be about your product or website, that’s okay. However, you must disclose your affiliation"
    – Glen_b
    Commented Mar 27, 2016 at 13:33

6 Answers

63

Random search has a 95% probability of finding a combination of parameters within the top 5% of the search space with only 60 iterations. Also, unlike some other methods, it doesn't get bogged down in local optima.

Check out this great blog post at Dato by Alice Zheng, specifically the section Hyperparameter tuning algorithms.

I love movies where the underdog wins, and I love machine learning papers where simple solutions are shown to be surprisingly effective. This is the storyline of “Random search for hyperparameter optimization” by Bergstra and Bengio. [...] Random search wasn’t taken very seriously before. This is because it doesn’t search over all the grid points, so it cannot possibly beat the optimum found by grid search. But then came along Bergstra and Bengio. They showed that, in surprisingly many instances, random search performs about as well as grid search. All in all, trying 60 random points sampled from the grid seems to be good enough.

In hindsight, there is a simple probabilistic explanation for the result: for any distribution over a sample space with a finite maximum, the maximum of 60 random observations lies within the top 5% of the true maximum, with 95% probability. That may sound complicated, but it’s not. Imagine the 5% interval around the true maximum. Now imagine that we sample points from this space and see if any of them land within that interval. Each random draw has a 5% chance of landing in that interval; if we draw $n$ points independently, then the probability that all of them miss the desired interval is $\left(1-0.05\right)^{n}$. So the probability that at least one of them succeeds in hitting the interval is 1 minus that quantity. We want at least a 0.95 probability of success. To figure out the number of draws we need, just solve for $n$ in the equation:

$$1-\left(1-0.05\right)^{n} > 0.95$$

We get $n\geqslant60$. Ta-da!

The moral of the story is: if the close-to-optimal region of hyperparameters occupies at least 5% of the grid surface, then random search with 60 trials will find that region with high probability.

You can improve that chance with a higher number of trials.

All in all, if you have too many parameters to tune, grid search may become infeasible. That's when I try random search.
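If you want to sanity-check that arithmetic yourself, here is a minimal Python sketch (the 5%/95% figures are just the values from the quote; strictly the inequality gives n = 59, which the blog rounds up to a convenient 60):

```python
import math
import random

# Smallest n with 1 - (1 - 0.05)**n > 0.95, i.e. 0.95**n < 0.05
n = math.ceil(math.log(0.05) / math.log(1 - 0.05))
print(n)  # 59 -- the blog rounds this up to 60

# Monte Carlo check: chance that at least one of 60 uniform draws
# lands in the top 5% of the unit interval
trials = 100_000
hits = sum(
    any(random.random() > 0.95 for _ in range(60)) for _ in range(trials)
)
print(hits / trials)  # roughly 0.95
```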

2
  • 4
    Link to the blog post is down :( Could this be the same article? oreilly.com/ideas/evaluating-machine-learning-models/page/5/…
    – n1k31t4
    Commented Jun 25, 2017 at 22:02
  • @DexterMorgan Hey, thanks for the heads up. Yeah, the blog is apparently down, and I'm not sure I should link to other sources which might not be "official", so I'll just leave it as is for now I think.
    – Firebug
    Commented Jun 28, 2017 at 4:40
15

Look again at the graphic from the paper (Figure 1). Say that you have two parameters. With a 3x3 grid search you check only three different values of each parameter (three rows and three columns on the plot on the left), while with nine random trials you check nine (!) different values of each parameter (nine distinct rows and nine distinct columns).

[Figure: grid search vs. random search, Figure 1 from the paper]

Obviously, random search may, by chance, fail to be representative of the full range of each parameter, but as the sample size grows, the chance of that gets smaller and smaller.
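To put numbers on this, here is a minimal sketch comparing how many distinct values of each parameter nine grid trials and nine random trials explore (the 3x3 grid over the unit square is just an illustrative choice):

```python
import random

random.seed(0)

# 3x3 grid: nine trials, but only three distinct values per parameter
grid = [(x, y) for x in (0.25, 0.5, 0.75) for y in (0.25, 0.5, 0.75)]

# Nine random trials: nine distinct values per parameter (almost surely)
rand = [(random.random(), random.random()) for _ in range(9)]

for name, points in (("grid", grid), ("random", rand)):
    xs = {x for x, _ in points}
    ys = {y for _, y in points}
    print(name, len(xs), len(ys))  # grid: 3 3, random: 9 9
```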

2
  • You could get the same effect with a rotated grid. Commented Sep 12, 2020 at 13:59
  • If the dimensionality n is larger than two, then random sampling is guaranteed to win over the grid sampling. This is well known in Monte-Carlo integration.
    – Youjun Hu
    Commented Nov 25, 2022 at 3:12
8

If you can write a function to do grid search, it's probably even easier to write a function to do random search, because you don't have to pre-specify and store the grid up front.
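For instance, a bare-bones random search can look like the sketch below; `score` and the samplers in `sample_space` are hypothetical stand-ins for whatever your modelling setup provides:

```python
import random

def random_search(score, sample_space, n_trials=60, seed=0):
    """Draw hyperparameters independently and keep the best trial.

    score:        callable taking a dict of hyperparameters -> float
    sample_space: dict mapping parameter name -> sampler taking a
                  random.Random instance
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: draw(rng) for name, draw in sample_space.items()}
        s = score(params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score

# No grid to pre-specify or store: just one sampler per hyperparameter
space = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-4, -1),
    "max_depth": lambda rng: rng.randint(2, 10),
}
```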

Setting that aside, methods like LIPO, particle swarm optimization and Bayesian optimization make intelligent choices about which hyperparameters are likely to be better, so if you need to keep the number of models fit to an absolute minimum (say, because it's expensive to fit a model), these tools are promising options. They're also global optimizers, so they have a high probability of locating the global maximum. Some of the acquisition functions of BO methods have provable regret bounds, which make them even more attractive.
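As one concrete option, scikit-optimize exposes Bayesian optimization through `gp_minimize`; a rough sketch (the objective is a toy stand-in for an expensive model fit, and the bounds are made-up values):

```python
from skopt import gp_minimize  # scikit-optimize

# Toy stand-in for an expensive cross-validated loss (lower is better)
def objective(params):
    learning_rate, max_depth = params
    return (learning_rate - 0.01) ** 2 + (max_depth - 5) ** 2 / 100

result = gp_minimize(
    objective,
    dimensions=[(1e-4, 1e-1), (2, 10)],  # learning rate, max depth
    n_calls=30,  # keep the number of (expensive) evaluations small
    random_state=0,
)
print(result.x, result.fun)
```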

More information can be found in these questions:

What are some of the disadvantages of Bayesian hyperparameter optimization?

Optimization when Cost Function Slow to Evaluate

3

By default, random search and grid search are terrible algorithms unless one of the following holds.

  • Your problem does not have a global structure, e.g., it is multimodal and the number of local optima is huge.
  • Your problem is noisy, i.e., evaluating the same solution twice leads to different objective function values.
  • The budget of objective function calls is very small compared to the number of variables, e.g., smaller than 1x or 10x the number of variables.
  • The number of variables is very small, e.g., smaller than 5 (in practice).
  • A few other conditions hold.

Most people claim that random search is better than grid search. However, note that when the total number of function evaluations is predefined, grid search will lead to a good coverage of the search space, no worse than random search with the same budget, and the difference between the two is negligible, if any. If you start to add some assumptions, e.g., that your problem is separable or almost separable, then you will find arguments to support grid search. Overall, both are comparably terrible except in very few cases. Thus, there is no need to distinguish between them unless some additional assumptions about the problem are considered.
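To make the coverage claim concrete, here is a rough sketch that estimates the dispersion (worst-case distance from any point of the space to its nearest sample) of a 4x4 grid versus 16 random points; the budget of 16 and the mesh resolution are arbitrary choices:

```python
import math
import random

random.seed(0)
budget = 16  # 4x4 grid vs 16 random points in the unit square

grid = [((i + 0.5) / 4, (j + 0.5) / 4) for i in range(4) for j in range(4)]
rand = [(random.random(), random.random()) for _ in range(budget)]

def dispersion(samples, mesh=50):
    """Worst-case distance from a mesh point to its nearest sample."""
    worst = 0.0
    for i in range(mesh):
        for j in range(mesh):
            p = ((i + 0.5) / mesh, (j + 0.5) / mesh)
            worst = max(worst, min(math.dist(p, s) for s in samples))
    return worst

print(dispersion(grid))  # small, fixed coverage gap for the grid
print(dispersion(rand))  # typically larger for any single random draw
```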

1
  • 1
    Can you propose something better? How can we know what is best if we don't try? Seems to me random search over many models is the best compromise solution.
    – JPErwin
    Commented Nov 20, 2018 at 16:53
0

Finding a spot within the top 5% of a 2D topography with only one maximum can be done with a quadrant search: each round of four observations narrows the candidate region to 25%, then 6.25%, then 1.5625% of the original area, so about 16 observations suffice, as long as each round of four observations correctly determines which quadrant the maximum (extremum) is in. A 1D topography can be bisected the same way: 50%, 25%, 12.5%, 6.25%, 3.125%, i.e., about 5 rounds of 2 observations. I guess people searching for multiple far-flung local maxima use a big initial grid search and then regression or some other prediction method. In 1D, a grid of 60 observations should have one observation within 100/60 ≈ 1.66% of the extremum. See Global optimization on Wikipedia. I still think there is always a better method than randomness.
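Here is a hypothetical sketch of that quadrant idea for a unimodal 2D function; picking the quadrant by its centre value is only a heuristic, and as noted above the scheme works only if each round identifies the correct quadrant:

```python
def quadrant_search(f, lo=(0.0, 0.0), hi=(1.0, 1.0), rounds=4):
    """Each round spends 4 observations on the quadrant centres and keeps
    the best quadrant: 25%, 6.25%, 1.5625%, ... of the area remains."""
    def centre(box):
        (x0, y0), (x1, y1) = box
        return ((x0 + x1) / 2, (y0 + y1) / 2)

    for _ in range(rounds):
        mx, my = centre((lo, hi))
        quads = [((lo[0], lo[1]), (mx, my)), ((mx, lo[1]), (hi[0], my)),
                 ((lo[0], my), (mx, hi[1])), ((mx, my), (hi[0], hi[1]))]
        # Keep the quadrant whose centre scores best
        lo, hi = max(quads, key=lambda q: f(centre(q)))
    return centre((lo, hi))

# Toy unimodal objective with its maximum at (0.3, 0.7)
best = quadrant_search(lambda p: -((p[0] - 0.3) ** 2 + (p[1] - 0.7) ** 2))
print(best)  # close to (0.3, 0.7) after 4 rounds = 16 observations
```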

1
  • Simulated annealing is one form of random search that has been around for a number of years. Commented Apr 7, 2017 at 22:43
-3

As Tim showed, you can test more parameter values with random search than with grid search. This is especially efficient if some of the parameters you test turn out to have little impact on your problem, like the 'Unimportant parameter' in Figure 1 of the paper.

[Figure 1 from the paper: grid search vs. random search over an important and an unimportant parameter]

I wrote a post about hyperparameter tuning where I explain the differences between grid search, random search, and Bayesian optimization. You can check it out (and let me know if it was useful; feedback is appreciated!)

