33
$\begingroup$

I have been studying Bayesian Optimization lately and made the following notes about this topic:

  • Unlike deterministic functions, real-world objective functions are constructed from physical measurements

  • Measurements always carry error (e.g. human error, design and experimental error, random error, measurement error) - if you record the same measurements under the same conditions at a future time, it is very likely that they will differ from the previous measurements

  • Thus, an objective function that is based on physical measurements is naturally "unstable" - two people might record the same quantities, end up with different values, and as a result end up with two different objective functions.

  • A "noisy" function is also an "unstable" function - if we were top optimize this "noisy function", the optimization results might not correspond to natural system we are studying due to inherent errors while recording measurements. This means that in some sense, we are dealing with a more complicated version of "apples and oranges".

  • Bayesian Optimization attempts to solve this problem by using the recorded measurements as "pegs" and fitting a "circus tent" over them in the form of a Gaussian Process. This acts like "probabilistic smoothing" and tries to statistically account for the uncaptured variation in the measurements - provided the assumption that the "data generating process" is well represented by a Gaussian Process roughly holds.

  • Thus, Bayesian Optimization tries to "smooth out" the noise/variation/error in the objective function, adding a natural "robustness" to the final optimization solution. All this means is that Bayesian Optimization has the potential to give us better results (see the sketch below).
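As a minimal sketch of this "probabilistic smoothing" idea (a toy 1-D example of my own, using scikit-learn's GaussianProcessRegressor; the function, noise level and kernel settings are made-up illustrative choices):

```
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)
X = rng.uniform(0, 5, size=(30, 1))
y = np.sin(X).ravel() + 0.2 * rng.normal(size=30)   # noisy "physical measurements"

# RBF models the underlying smooth trend; WhiteKernel absorbs the measurement noise
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.05)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

X_grid = np.linspace(0, 5, 200).reshape(-1, 1)
mean, std = gp.predict(X_grid, return_std=True)     # smoothed surrogate + uncertainty band
```

The posterior mean plays the role of the "circus tent" over the measurement "pegs", and the predictive standard deviation quantifies the uncaptured variation in the measurements.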

Advantages of Bayesian Optimization:

  • Robustness (as described above)
  • Does not require the objective function to be differentiable (e.g. useful in discrete and combinatorial optimization problems)
  • Since it does not calculate derivatives, it has the potential to be more "computationally efficient" than gradient-based optimization methods.

Disadvantages of Bayesian Optimization:

  • Requires the true objective function to be well modelled by a Gaussian Process
  • Empirically, it has been observed to perform poorly on high-dimensional objective functions (i.e. more than 20 dimensions) - however, I don't understand why.

I have often heard this claim being made about Bayesian Optimization performing poorly in more than 20 dimensions, but I have never been able to understand why this is. I tried to consult some references online:

1) "High-Dimensional Bayesian Optimization with Sparse Axis-Aligned Subspaces" (Eriksson et al 2021)

  • "High-dimensional BO presents a particular challenge, in part because the curse of dimensionality makes it difficult to define—as well as do inference over—a suitable class of surrogate models."

  • "While BO has become a workhorse algorithm that is employed in a wide variety of settings, successful applications are often limited to low-dimensional problems, e.g. fewer than twenty dimensions [Frazier, 2018]. Applying BO to high-dimensional problems remains a significant challenge. The difficulty can be traced to both of the algorithm components mentioned above, although we postulate that suitable function priors are especially important for good performance. In particular, in order for BO to be sample-efficient in high-dimensional spaces, it is crucial to define surrogate models that are sufficiently parsimonious that they can be inferred from a small number of query points. An overly flexible class of models is likely to suffer from overfitting, which severely limits its effectiveness in decision-making. Likewise, an overly rigid class of models is unlikely to capture enough features of the objective function. A compromise between flexibility and parsimony is essential."

2) "High-dimensional Bayesian optimization using low-dimensional feature spaces" (Moriconi et al, 2020)

  • "However, BO (Bayesian Optimization) is practically limited to optimizing 10–20 parameters. To scale BO to high dimensions, we usually make structural assumptions on the decomposition of the objective and/or exploit the intrinsic lower dimensionality of the problem, e.g. by using linear projections. We could achieve a higher compression rate with nonlinear projections, but learning these nonlinear embeddings typically requires much data. This contradicts the BO objective of a relatively small evaluation budget."

3) "A Tutorial on Bayesian Optimization" (Frazier, 2018)

  • "It (Bayesian Optimization) is best-suited for optimization over continuous domains of less than 20"

  • "The input x is in R-d for a value of d that is not too large. Typically d ≤ 20 in most successful applications of BayesOpt."

My Question: Nowhere in these papers do they explain why "20 dimensions" seems to be the relevant threshold beyond which the performance of Bayesian Optimization begins to deteriorate.

  • Can someone please explain why "20 Dimensions" is said to be the maximum threshold for Bayesian Optimization?

  • Some explanations of why Bayesian Optimization struggles in higher dimensions are given above - can someone please help me understand this in more detail?

References:

  • Eriksson et al. (2021), "High-Dimensional Bayesian Optimization with Sparse Axis-Aligned Subspaces" (PDF)
  • Frazier (2018), "A Tutorial on Bayesian Optimization" (PDF)
  • Moriconi et al. (2020), "High-dimensional Bayesian optimization using low-dimensional feature spaces" (PDF)

$\endgroup$
3
  • 6
    $\begingroup$ On this site there's no need to say "thank you" at the end of your post - it might seem rude at first, but it's part of the philosophy of this site to "Ask questions, get answers, no distractions", and it means future readers of your question don't need to read through the pleasantries. $\endgroup$ Commented Feb 16, 2022 at 15:34
  • $\begingroup$ For this kind of problem it would be interesting to learn about Principal Component Analysis, or PCA. $\endgroup$
    – raulmd13
    Commented Feb 18, 2022 at 8:21
  • $\begingroup$ Very nice to see SAASBO linked here (not affiliated). I used it within Ax platform to optimize 23 hyperparameters of a transformer neural network, and it set a new state-of-the-art benchmark on a materials science task. Not meant to counter other comments about problems being structure-dependent, but it's certainly become my go-to for high-dimensional optimization (see tutorial). $\endgroup$
    – Sterling
    Commented Apr 15, 2022 at 17:46

4 Answers

43
$\begingroup$

To be completely honest, it's because everything performs poorly in more than 20 dimensions. Bayesian optimization isn't special here.

Trying to optimize any function in a lot of dimensions is hard, because the volume of a high-dimensional space goes up exponentially with the number of dimensions. Consider a line segment on $[0, k]$; that has length $k$. A square with side $k$? That has area $k^2$. And so on. So the amount of space that you have to search when you're looking for a possible solution goes up very, very fast. I recommend looking up the term "Curse of Dimensionality" for more.
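As a quick, hypothetical back-of-the-envelope check (a naive grid search, nothing specific to BO): covering $[0,1]^d$ at a spacing of 0.1 takes 10 points per axis, so $10^d$ points in total:

```
# grid points needed to cover [0, 1]^d at spacing 0.1
for d in (1, 2, 5, 10, 20):
    print(d, 10**d)
# at d = 20 that is already 1e20 evaluations -- hopeless for any realistic budget
```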

This will always be true, regardless of what algorithm you use -- unless you're willing to make some strong simplifying assumptions about the shape of that function. For example, gradient descent can do quite well in high dimensions -- as long as your function is convex and differentiable. If you have a function where the gradient is 0 somewhere besides the minimum, you're screwed. And if you have a function with multiple minima, you're screwed.

Bayesian optimization is exactly the same. The papers you've linked point out that if your function has an interesting structure, you can exploit this by picking good priors. Namely, you need to assume sparsity (that only a few of those dimensions are important and the rest can be ignored), or differentiability, in which case you can use gradient-enhanced Gaussian processes. But if you don't have that structure, you're screwed.

As for 20 dimensions, that's a rule of thumb. There's no "threshold" or anything, but it gets hard exponentially quickly.

$\endgroup$
12
  • 5
    $\begingroup$ I do not find the argument about the volume convincing. Compare the volume of a unit cube and a unit ball. The volume of the unit cube is one, while the volume of the unit ball goes to zero with increasing dimension. Since the maximum is reached in dimension 5 this would imply optimisation is EASIER in higher dimensions than in lower. Furthermore, I have never seen an argument that BO is possible on balls but not on cubes. $\endgroup$
    – g g
    Commented Feb 16, 2022 at 12:06
  • 5
    $\begingroup$ @gg Probably more relevant is the fact that the ratio of the volume of the (R-)ball compared to the volume of the (R-) cube goes to zero with increasing dimension. So if you think about a "random step" as moving somewhere from your start point to somewhere else in the corresponding $\varepsilon$-ball, then as the number of dimensions increases, the relative proportion of the corresponding $\varepsilon$-cube that you have explored goes to zero. So I would argue that the two points about high-dimensional volume are actually concordant with one another and both point to optimization being hard. $\endgroup$ Commented Feb 16, 2022 at 16:43
  • 9
    $\begingroup$ @gg Statements like "the n dimensional sphere has maximum volume in dimension 5" are only true for spheres of radius 1. Spheres of radius 2 are largest in dimension 24. Spheres of radius 3 are largest in dimension 56. The truth, as boring as it may be, is that it is meaningless to compare volumes of different dimensions. You wouldn't say that the volume of a sphere is larger than the area of a circle because volume and area have different units. For the same reason, it is nonsensical to say that a 5-dimensional sphere has larger volume than a sphere in any other dimension. $\endgroup$
    – Brady Gilg
    Commented Feb 16, 2022 at 17:56
  • 1
    $\begingroup$ @Brady Yeah but this is exactly my point: The dimension/volume argument makes no sense and if it did, it would indicate high dimensions are easier. $\endgroup$
    – g g
    Commented Feb 16, 2022 at 18:27
  • 2
    $\begingroup$ @gg You're right, the explanation I gave was very handwavey. Trying to compare "Volume" across dimensions doesn't make sense. That being said, there's a way to make this formal in terms of measure theory. The relevant fact is that the number of points needed to "Cover" a shape goes up exponentially with dimension. See here. $\endgroup$ Commented Feb 16, 2022 at 19:13
7
$\begingroup$

You will not find a theoretical/scientific justification for this statement as there is none!

The difficulty of optimisation is related to a lot of things, dimension being just one of them and most likely not even a very important one. For example, if you just assume continuity and not differentiability of your objective function, the question of the dimension of the domain becomes completely moot anyway. Using space-filling curves, you can always reparametrise an $n$-dimensional continuous function as a function of a single variable.

So it is really easy to find one-dimensional functions which are impossible to optimise using BO. And it is also easy to find functions in hundreds or even thousands of dimensions which are simple to optimise, be it with Bayesian methods or otherwise.

Then why do some people make such statements and why do others believe them?

I think there are two reasons for this:

  1. They are (in those cases the researcher is interested in or knows about) a good heuristic.
  2. The full qualification of those statements is just too long and would contain too many complicated "purely theoretical" qualifications. And many of those qualifications are probably "obvious" to people working in the field, they "go without saying".
$\endgroup$
1
  • 1
    $\begingroup$ I think one way to reason about this is that space filling curves of differentiable functions will always give curves that are convex nowhere, and the derivatives (if they even exist) will be totally useless. This removes pretty much all the tricks that can normally be used to optimize efficiently. $\endgroup$ Commented Feb 17, 2022 at 15:02
7
$\begingroup$

Yes, high-dimensional space is tough in general, but there are a couple of things that make Bayesian optimization with Gaussian processes particularly perplexing on those prickly problems.

In my view, there isn't a single reason high dimension makes BO difficult, but the difficulty is rather due to a confluence of multiple factors.

Firstly, Bayesian optimization is classically conducted with a Gaussian process using something like a squared exponential kernel. This kernel gives great flexibility, which is perfect in low dimension but can become a liability in high dimension, as it puts reasonable probability mass on too many explanations of the point cloud. So the first issue is that the model we're using already struggles to understand what's going on.

This is related to the volume argument from the other answer. Since GPs depend explicitly on distances, when all interpoint distances are nearly equal and large relative to the lengthscale, the GP has little discrimination power.

See how as we go into higher dimension, the distances between random points vary less than in low dimension:

[Figure: distributions of pairwise distances between random points as the dimension increases, from "A survey on high-dimensional Gaussian process modeling with application to Bayesian optimization" by Mickaël Binois and Nathan Wycoff]
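If you want to see this effect yourself, here is a small, self-contained simulation (my own toy setup, not taken from the survey above) of how the relative spread of pairwise distances between uniform random points shrinks as the dimension grows:

```
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(200, d))
    # pairwise Euclidean distances via the squared-norm expansion
    sq = (X**2).sum(1)[:, None] + (X**2).sum(1)[None, :] - 2 * X @ X.T
    dist = np.sqrt(np.clip(sq, 0, None))
    iu = np.triu_indices_from(dist, k=1)
    # mean distance and relative spread (std / mean): the latter shrinks with d
    print(d, round(dist[iu].mean(), 2), round(dist[iu].std() / dist[iu].mean(), 3))
```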

As you mention, putting structure into the kernel function so that we learn a mapping into a low-dimensional space, on which we place a "standard" Gaussian process, is a good way to go here. Another option is to assume a kernel function which combines information from the input dimensions in a more frugal manner, such as additive kernels ("Additive Gaussian Processes", Duvenaud, Nickisch, and Rasmussen, NeurIPS 2011) or ANOVA kernels ("ANOVA decomposition of conditional Gaussian processes for sensitivity analysis with dependent inputs", Chastaing and Le Gratiet).

Secondly, we can't forget that BO is a nested optimization procedure: every time a BO algorithm wants to suggest the next point to evaluate, it has to solve an entire sub-optimization problem over the entire space! Unlike the original objective, the acquisition function defined by our Gaussian process (whose optimizer is our next candidate in the outer search) usually has a known gradient and Hessian, which is indeed helpful in finding a local solution.

However, the acquisition function is notoriously nonconvex, which means that in high dimensional spaces, however quickly we can find a local optimum, we may have little chance of finding anything close to a global optimum without considerable effort. Though this doesn't impact the ability of the Gaussian process to model the unknown objective, it does impact our ability to exploit its knowledge for optimization.

When combined with kernel shenanigans, sometimes the acquisition function can be optimized in a lower-dimensional space, which can make things easier (or sometimes harder in practice: linear dimension reduction means we're now solving a linearly constrained problem rather than an unconstrained one, and it also doesn't play well with hyperbox constraints).
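To make the "nested optimization" point concrete, here is a rough sketch of one inner step, assuming a scikit-learn GP surrogate and SciPy's L-BFGS-B with random restarts; the toy objective `f`, the Expected Improvement helper `neg_ei`, and all settings are illustrative choices of mine, not a canonical BO implementation:

```
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
d = 10

def f(x):                                   # hypothetical toy objective (to be minimized)
    return np.sum((x - 0.5) ** 2, axis=-1)

X = rng.uniform(0, 1, size=(20, d))         # evaluations gathered so far
y = f(X)

# surrogate with one lengthscale per dimension (ARD)
gp = GaussianProcessRegressor(RBF(length_scale=np.ones(d)), alpha=1e-6, normalize_y=True)
gp.fit(X, y)
best = y.min()

def neg_ei(x):                              # negative Expected Improvement (minimization)
    mu, sigma = gp.predict(x.reshape(1, -1), return_std=True)
    sigma = np.maximum(sigma, 1e-12)
    z = (best - mu) / sigma
    return -((best - mu) * norm.cdf(z) + sigma * norm.pdf(z))[0]

# the inner problem: a nonconvex search over the whole box, attacked with multi-start L-BFGS-B
starts = rng.uniform(0, 1, size=(50, d))
results = [minimize(neg_ei, x0, bounds=[(0, 1)] * d, method="L-BFGS-B") for x0 in starts]
x_next = min(results, key=lambda r: r.fun).x   # candidate to evaluate next
```

Each of those 50 restarts only finds a local optimum of the acquisition surface, which is exactly the difficulty described above once $d$ gets large.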

And third is hyperparameter estimation risk. Most popular these days are "separable", "ARD", "tensor product" or otherwise axis-aligned anisotropic kernels, which look something like $k(\mathbf{x}_1,\mathbf{x}_2) = \sigma^2 \exp\left(-\sum_{p=1}^P \frac{(x_{1,p}-x_{2,p})^2}{2\ell_p^2}\right)$, so we have one additional lengthscale to estimate for each input dimension, and estimating Gaussian process hyperparameters is tough, both from a statistical inference perspective and from a numerical one.
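A tiny illustration of how the hyperparameter count grows (using scikit-learn's kernel classes; the dimension 50 is an arbitrary example):

```
import numpy as np
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

d = 50                                                        # arbitrary input dimension
kernel = ConstantKernel(1.0) * RBF(length_scale=np.ones(d))   # ARD squared exponential
print(kernel.theta.shape)   # (51,): one log-lengthscale per dimension plus the variance
```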

Using a parameterized mapping into low dimension only makes the estimation risk worse (but may be offset by a substantially lower variance class of functions a priori).

Fourth: computation time. GPs are known for their small-data statistical efficiency (in terms of error per data point) and large-data computational inefficiency (in terms of inference per second). In high dimension, if we really want to find a global optimum, we are probably going to have to evaluate the objective function a huge number of times, and thus end up with a large dataset for our GP to handle. This has simply not been the GP's historical niche, as classical, decomposition-based GP inference (i.e. where you actually take a Cholesky of the kernel matrix) scales as $n^3$, so it gets intractable quickly.
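Here is a bare-bones sketch of that exact-inference pipeline (toy data, an isotropic squared-exponential kernel and an arbitrary jitter value, all assumed for illustration), where the Cholesky factorization of the $n \times n$ kernel matrix is the $O(n^3)$ bottleneck:

```
import numpy as np
from scipy.linalg import cholesky, cho_solve

rng = np.random.default_rng(0)
n, d = 2000, 20
X = rng.uniform(size=(n, d))
y = rng.normal(size=n)                      # stand-in for recorded objective values

def sqexp(A, B, ell=1.0):
    # squared-exponential kernel via the expanded squared-distance formula
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * np.clip(sq, 0, None) / ell**2)

K = sqexp(X, X) + 1e-6 * np.eye(n)          # jitter for numerical stability
L = cholesky(K, lower=True)                 # the O(n^3) step that dominates as n grows
alpha = cho_solve((L, True), y)             # weights for the posterior mean

X_star = rng.uniform(size=(5, d))
post_mean = sqexp(X_star, X) @ alpha        # posterior mean at a few test points
```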

Optimization of the acquisition function also gets more expensive as $n$ goes up. And keep in mind that you have to do this between every optimization iteration. So if we have an optimization budget of $10{,}000$, naively we would need to do all of this numerical optimization literally $10{,}000 - n_{\textrm{init}}$ times (though of course in practice we would evaluate batches of points at a time and only optimize the hyperparameters every few iterations).

Large-scale GP inference in high dimension has really matured over the last few decades, however, and it has been the engine behind some of the recent large-scale GP-BO papers.

$\endgroup$
5
  • $\begingroup$ Could you explain your statement "[...] become a liability in high dimension, as it puts reasonable probability mass on too many explanations of the point cloud." more precisely? I fail to understand this. The feature space of quadratic exponential kernels is always infinite dimensional. What then has the dimension of the input space to do with the flexibility of the model? $\endgroup$
    – g g
    Commented Feb 21, 2022 at 8:56
  • $\begingroup$ I also have trouble with the statement "when all interpoint distances are equal and large, the GP has little discrimination power." GP regression/optimisation works perfectly on grids where all distances are equal or very similar. And since you can always scale the inputs, what does "large" mean exactly? $\endgroup$
    – g g
    Commented Feb 21, 2022 at 9:00
  • $\begingroup$ @gg 1) Let's take a step back and think about the complexity of individual functions rather than the complexity of a kernel class. This complexity is measured not in terms of how many "features" (i.e. basis elements w.r.t. a certain basis) a function employs, but rather by what's called its Reproducing Kernel Hilbert Space norm. As the input dimension increases, you can find more and more functions which interpolate a dataset and which have similar RKHS norm, which is all your GP prior is really considering. $\endgroup$ Commented Feb 21, 2022 at 21:58
  • $\begingroup$ @gg 2) pairwise distances are not equal on a grid. The distance between two adjacent points is rather different from the distance between opposite sides of the grid. Large here is relative to the lengthscale (which will change under reparameterization). Great questions! $\endgroup$ Commented Feb 21, 2022 at 22:00
  • $\begingroup$ Sorry for my doggedness but I still do not understand. Call the minimum norm interpolant in some RKHS $f$ and its norm $||f||$. Could you explain how you would measure the number/amount of "nearby" interpolants with similar norm $||f|| + \epsilon$ and how these numbers could be compared and related to the dimension of the input space? $\endgroup$
    – g g
    Commented Feb 23, 2022 at 7:26
1
$\begingroup$

I think what you really want to ask is how a standard GP performs compared to all those HDBO methods. We actually found that standard BO performs great on high-dimensional benchmarks ("Standard Gaussian Process is All You Need for High-Dimensional Bayesian Optimization")! And we indeed think that "BO performs poorly in more than 20 dimensions" is a rumor.

To make things clear: we believe the standard Gaussian process is under-investigated for high-dimensional problems. In most HDBO papers, you will find that the baselines do not even include standard BO!

Back to the topic. The most common approaches for GP-based HDBO problems are:

  1. Additive structure
  2. Random projection
  3. Sparse prior

All of these approaches seem to make sense at first. But if you think about it closely, even if the function is fully additive, learning the true additive decomposition is hard in the high-dimensional setting. And in this paper ("Are Random Decompositions all we need in High Dimensional Bayesian Optimisation?"), it is even shown that a random decomposition graph performs better than a learnt one.

Next, random projection. Clearly, learning the true projection is impossible in the BO setting, where you only have a limited evaluation budget. Thus, most researchers opt for a random projection. However, by doing so, we face other difficulties: the low-dimensional embedding space will not map exactly onto the high-dimensional bounded region (most BO problems are bounded optimization problems)! For details, see Section 5 of the ALEBO paper.

Now for the one that we found works best empirically: the sparse prior. Before we even dared to question the rumor, we found that SaasBO performs best in most cases, despite being slow due to NUTS sampling. Then we thought: an ARD prior also induces sparsity, right (link)? So why not try it out. And we found that a GP with an ARD kernel actually performs really well.
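As a rough illustration of that intuition (a toy example of my own, not a benchmark from any of the papers): fit a GP with an ARD RBF kernel to a function that depends on only 2 of 10 inputs, and the learned lengthscales of the irrelevant dimensions tend to become large, effectively switching those dimensions off:

```
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
d, n = 10, 60
X = rng.uniform(size=(n, d))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.01 * rng.normal(size=n)   # only dims 0 and 1 matter

gp = GaussianProcessRegressor(RBF(length_scale=np.ones(d)), alpha=1e-6,
                              normalize_y=True, n_restarts_optimizer=5)
gp.fit(X, y)
print(np.round(gp.kernel_.length_scale, 2))   # small for dims 0-1, typically much larger for the rest
```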

In general, everything is hard in high dimension. But that does not mean that more complicated approaches (additive structures, ...) will give better results. Hence we advocate for more in-depth evaluations of the standard GP on high-dimensional problems.

$\endgroup$
