I have been studying Bayesian Optimization lately and made the following notes about this topic:
Unlike deterministic functions, real world functions are constructed using physical measurements
Measurements can always have error (e.g. human error, design and experiment error, random error, measurement error) - if you record the same measurements at the same conditions but at a future time, it's very likely that these measurements might be different from the previous measurements
Thus, an objective function that is based on physical measurements is naturally "unstable" - two people might record the same measurements, end up with different values, and as a result end up with two different objective functions.
A "noisy" function is also an "unstable" function - if we were top optimize this "noisy function", the optimization results might not correspond to natural system we are studying due to inherent errors while recording measurements. This means that in some sense, we are dealing with a more complicated version of "apples and oranges".
Bayesian Optimization attempts to solve this problem by using the recorded measurements as "pegs" and fitting a "circus tent" over these measurements through the form of a Gaussian Process. This sort of acts like "probabilistic smoothing" and tries to statistically account for all possible uncaptured variations in the measurements that exist - provided the assumption of the "data generating process" being well represented by a Gaussian Process is somewhat true.
Thus, Bayesian Optimization tries to "smoothens out" the noise/variation/error in the objective function, adding a natural "robustness" to the final optimization solution. All this means is that Bayesian Optimization has the potential to give us better results.
Advantages of Bayesian Optimization:
- Robustness (as descibed above)
- Does not require the objective function to be differentiable (i.e. useful in discrete and combinatorial optimization problems)
- Since it does not calculate the derivative, it has the potential to be more "computationally efficient" compared to gradient based optimization methods.
Disadvantages of Bayesian Optimization:
- Requires the true objective function to be well modelled by a Gaussian Process
- Empirically has been observed to perform poorly on high dimensional objective functions (i.e. higher than 20 dimensions) - however, I don't understand why.
I have often heard this claim being made about Bayesian Optimization performing poorly in more than 20 dimensions, but I have never been able to understand why this is. I tried to consult some references online:
1) "High-Dimensional Bayesian Optimization with Sparse Axis-Aligned Subspaces" (Eriksson et al 2021)
"High-dimensional BO presents a particular challenge, in part because the curse of dimensionality makes it difficult to define—as well as do inference over—a suitable class of surrogate models."
"While BO has become a workhorse algorithm that is employed in a wide variety of settings, successful applications are often limited to low-dimensional problems, e.g. fewer than twenty dimensions [Frazier, 2018]. Applying BO to high-dimensional problems remains a significant challenge. The difficulty can be traced to both of the algorithm components mentioned above, although we postulate that suitable function priors are especially important for good performance. In particular, in order for BO to be sample-efficient in high-dimensional spaces, it is crucial to define surrogate models that are sufficiently parsimonious that they can be inferred from a small number of query points. An overly flexible class of models is likely to suffer from overfitting, which severely limits its effectiveness in decision-making. Likewise, an overly rigid class of models is unlikely to capture enough features of the objective function. A compromise between flexibility and parsimony is essential."
2) "High-dimensional Bayesian optimization using low-dimensional feature spaces" (Moriconi et al, 2020)
- "However, BO (Bayesian Optimization) is practically limited to optimizing 10–20 parameters. To scale BO to high dimensions, we usually make structural assumptions on the decomposition of the objective and/or exploit the intrinsic lower dimensionality of the problem, e.g. by using linear projections. We could achieve a higher compression rate with nonlinear projections, but learning these nonlinear embeddings typically requires much data. This contradicts the BO objective of a relatively small evaluation budget."
3) "A Tutorial on Bayesian Optimization" (Frazier, 2018)
"It (Bayesian Optimization) is best-suited for optimization over continuous domains of less than 20"
"The input x is in R-d for a value of d that is not too large. Typically d ≤ 20 in most successful applications of BayesOpt."
My Question : No where in these papers do they explain why "20 Dimensions" seems to be a relevant threshold for deciding the conditions in which the performance of Bayesian Optimization begins to deteriorate.
Can someone please explain why "20 Dimensions" is said to be the maximum threshold for Bayesian Optimization?
Even though some explanations are provided that explain the difficulty of Bayesian Optimization in higher dimensions - can someone please help me understand this in more detail?
References:
High-Dimensional Bayesian Optimization with Sparse Axis-Aligned Subspaces (PDF)
A Tutorial on Bayesian Optimization (PDF)
High-dimensional Bayesian optimization using low-dimensional feature spaces (PDF)