
According to some epistemologists, what bias leads to Occam's razor? Occam's razor seems like a bias: thinking that the simplest explanation should be favored seems to imply something about the universe itself, and that thinking, or that bias, does not seem epistemologically justified. Or is it epistemologically justified?

  • That the simplest thing should be favored (other things being equal) does not imply anything about the universe, it merely acknowledges our limited capabilities. It is of the same nature as the advice to look under the streetlight. Not that the thing is necessarily there, but looking there is perfectly justified because we can't see in the dark anyway. Simple is where the light is. Not that it doesn't lead to the oversimplification fallacy and the like.
    – Conifold
    Commented Jul 7, 2021 at 1:25

1 Answer


Well, first of all, the probability of a conjunction of two events is never more, and usually less, than the probability of either event alone: P(A ∩ B) <= P(A) and P(A ∩ B) <= P(B) for any events A and B. Therefore, if a "complex" hypothesis is a conjunction of several simpler ones, it must have lower probability than (or at best the same probability as) any of the simpler ones. "Dave went to get groceries and was hit by a truck" can never be more probable than just "Dave was hit by a truck."
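As a quick sanity check, here is a minimal numeric sketch of that inequality; the probabilities below are invented purely for illustration:

    # Hypothetical numbers: the conjunction "groceries AND hit by a truck"
    # can never be more probable than "hit by a truck" alone.
    p_hit = 0.001                  # assumed P(B): Dave was hit by a truck
    p_groceries_given_hit = 0.3    # assumed P(A | B): he was out for groceries, given the accident

    p_both = p_hit * p_groceries_given_hit   # P(A ∩ B) = P(B) * P(A | B)

    assert p_both <= p_hit         # adding a conjunct can only lose probability
    print(p_both, p_hit)           # 0.0003 vs 0.001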

Solomonoff induction makes use of Occam's razor in a rigorous way to provide an idealized Bayesian theory of evaluating hypotheses based on observations. It is in essence a solution to the problem of induction, and although somewhat idealized and impractical it provides a simplified model for how we actually think.
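To give a flavor of the idealization, here is a toy, length-based prior in the spirit of Solomonoff's. The real construction weights programs of a universal prefix machine by 2^(-length), not English sentences by character count, so treat this sketch as a loose illustration only:

    # Toy sketch: weight each hypothesis by 2^(-description length).
    # Solomonoff's actual prior uses program lengths on a universal prefix
    # machine, not sentence lengths; this is only an illustration.
    def length_prior(hypothesis: str) -> float:
        """Unnormalized prior weight: shorter descriptions get exponentially more mass."""
        return 2.0 ** (-len(hypothesis))

    hypotheses = [
        "Dave was hit by a truck",
        "Dave went to get groceries and was hit by a truck",
    ]
    for h in hypotheses:
        print(h, length_prior(h))   # the shorter description gets the larger weight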

In Bayesian inference, when we have a countable set of mutually exclusive hypotheses (as we do in the case of Solomonoff induction), Occam's razor is a result of the fact that any probability mass function over the hypotheses must place almost all of its weight on the relatively "short" ones, in order for the probabilities to sum to 1.

In Bayesian inference, we have a prior distribution over possible hypotheses. We observe evidence, and then update the prior using Bayes' rule, P(h|e) = P(h) P(e|h)/P(e). The hypotheses may be considered to be from a discrete set; for example, each hypothesis may be a proposition written as a sequence of symbols in some language. Since hypotheses are discrete, we may place them in correspondence with the positive integers, by writing out all possible hypotheses as propositions, and numbering the list starting at 1. When the hypotheses are mutually exclusive, to have a prior distribution over hypotheses is to have a prior distribution over the positive integers.
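A minimal sketch of such an update over an enumerated, mutually exclusive hypothesis set; the prior and likelihood values here are made up for illustration:

    # Hypothetical discrete Bayesian update over hypotheses indexed 1, 2, 3.
    prior = {1: 0.5, 2: 0.25, 3: 0.25}        # P(h); must sum to 1
    likelihood = {1: 0.1, 2: 0.7, 3: 0.2}     # P(e | h); assumed values

    p_e = sum(prior[h] * likelihood[h] for h in prior)               # P(e), by total probability
    posterior = {h: prior[h] * likelihood[h] / p_e for h in prior}   # Bayes' rule

    print(posterior)   # still sums to 1; the evidence reweights the prior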

Any probability mass function over the positive integers must place most of its weight on "small" integers. This means that "small" integers must be more likely than "large" ones, in whatever Bayesian prior you choose. "Small" correlates with "simple"; thus any prior favors simpler hypotheses.

We can say this precisely. Say we have a random variable X over the positive integers, with a distribution given by a probability mass function f_X. Given any real ε > 0, there is a positive integer N such that P(X < N) > 1 - ε. For instance, there is a positive integer N such that the probability that X is smaller than N is at least 99.99999999999%. Because almost all of the integers are larger than N (even if N is very large), X is strongly biased towards small numbers.

Why is this? Because the probabilities of any probability distribution must sum to 1. As you add up P(X=1) + P(X=2) + P(X=3) + ..., eventually the sum must approach 1 as closely as you wish. Once you have summed the first N terms, you have P(X <= N), so P(X <= N) gets as close to 1 as you wish for a large enough N.
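To make this concrete with one particular prior (the geometric prior P(X = n) = 2^(-n), chosen only as an example), the partial sums pass any threshold below 1 after only a few dozen terms:

    # With P(X = n) = 2**(-n), the partial sum after n terms is 1 - 2**(-n),
    # so almost all of the probability mass sits on the first few integers.
    epsilon = 1e-13
    cumulative, n = 0.0, 0
    while cumulative <= 1 - epsilon:
        n += 1
        cumulative += 2.0 ** (-n)
    print(n, cumulative)   # n = 44 already gives P(X <= 44) > 1 - 1e-13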

There is no way to have all the integers be equally probable. If you assign the same probability p > 0 to every integer, the sum of the probabilities is infinite. If you assign probability 0 to every integer, you do not have a probability mass function, and you cannot sample from the distribution. (According to some definitions of "probability space", but not others, a more general construction is conceivable; but you cannot sample an integer from it, and assigning a probability of exactly 0 to every hypothesis is not useful in Bayesian inference.)

  • The problem with this view (be it the Solomonoff or the Bayesian one) is that the different statements offered belong to the same class. It can't decide between two conceptually different statements (theories) offered as explanations. Commented Jul 7, 2021 at 11:36
