108
$\begingroup$

I have trouble understanding the massive importance that is afforded to Bayes' theorem in undergraduate courses in probability and popular science.

From the purely mathematical point of view, I think it would be uncontroversial to say that Bayes' theorem does not amount to a particularly sophisticated result. Indeed, the relation $$P(A|B)=\frac{P(A\cap B)}{P(B)}=\frac{P(B\cap A)P(A)}{P(B)P(A)}=\frac{P(B|A)P(A)}{P(B)}$$ is a one line proof that follows from expanding both sides directly from the definition of conditional probability. Thus, I expect that what people find interesting about Bayes' theorem has to do with its practical applications or implications. However, even in those cases I find the typical examples being used as a justification of this to be a bit artificial.


To illustrate this, the classical application of Bayes' theorem usually goes something like this: Suppose that

  1. 1% of women have breast cancer;
  2. 80% of mammograms are positive when breast cancer is present; and
  3. 10% of mammograms are positive when breast cancer is not present.

If a woman has a positive mammogram, then what is the probability that she has breast cancer?

I understand that Bayes' theorem allows one to compute the desired probability with the given information, and that this probability is counterintuitively low. However, I can't help but feel that the premise of this question is wholly artificial. The only reason why we need to use Bayes' theorem here is that the full information with which the other probabilities (i.e., 1% have cancer, 80% true positive, etc.) have been computed is not provided to us. If we had access to the sample data with which these probabilities were computed, then we could directly find $$P(\text{cancer}|\text{positive test})=\frac{\text{number of women with cancer and positive test}}{\text{number of women with positive test}}.$$ In mathematical terms, if you know how to compute $P(B|A)$, $P(A)$, and $P(B)$, then this means that you know how to compute $P(A\cap B)$ and $P(B)$, in which case you already have your answer.
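
For concreteness, here is a minimal sketch of both routes to the answer in Python. The variable names and the hypothetical cohort of 10,000 women are illustrative assumptions; only the three percentages come from the problem statement.

```python
# Route 1: Bayes' theorem applied to the three stated percentages.
p_cancer = 0.01             # 1% of women have breast cancer
p_pos_given_cancer = 0.80   # 80% of mammograms are positive when cancer is present
p_pos_given_healthy = 0.10  # 10% of mammograms are positive when cancer is absent

# P(positive), by the law of total probability.
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1 - p_cancer)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos

# Route 2: direct counting in a hypothetical cohort consistent with those percentages.
cohort = 10_000
with_cancer = cohort // 100                      # 100 women
true_pos = with_cancer * 80 // 100               # 80 women
false_pos = (cohort - with_cancer) * 10 // 100   # 990 women
ratio_from_counts = true_pos / (true_pos + false_pos)

print(f"{p_cancer_given_pos:.3f}", f"{ratio_from_counts:.3f}")
```

Both routes give 8/107, roughly 7.5%, which is the point of the objection: the counting route needs the raw counts, while the Bayes route needs only the three summary percentages.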


From the above arguments, it seems to me that Bayes' theorem is essentially only useful for the following reasons:

  1. In an adversarial context, i.e., someone who has access to the data only tells you about $P(B|A)$ when $P(A|B)$ is actually the quantity that is relevant to your interests, hoping that you will get confused and will not notice.
  2. An opportunity to dispel the confusion between $P(A|B)$ and $P(B|A)$ with concrete examples, and to explain that these are very different when the ratio between $P(A)$ and $P(B)$ deviates significantly from one.

Am I missing something big about the usefulness of Bayes' theorem? In light of point 2., especially, I don't understand why Bayes' theorem stands out so much compared to, say, the Borel-Kolmogorov paradox, or the "paradox" that $P[X=x]=0$ when $X$ is a continuous random variable, etc.

$\endgroup$
13
  • 20
    $\begingroup$ It’s elementary. How interesting it is is subjective I guess $\endgroup$ Commented Jan 27, 2021 at 16:10
  • 41
    $\begingroup$ Why “Bayes” gets talked about a lot has more to do with Bayesian inference methods and the Bayesian interpretation of probability than just the theorem itself. $\endgroup$ Commented Jan 27, 2021 at 16:17
  • 11
    $\begingroup$ The mammogram example is not artificial. See opinionator.blogs.nytimes.com/2010/04/25/chances-are $\endgroup$ Commented Jan 28, 2021 at 16:06
  • 9
    $\begingroup$ When building a breast cancer test, you'd likely want to test it in a known positive and known negative population, rather than a random sample of the population. If you just tested randomly, you'd need a lot more data, since you'd only be able to evaluate the prognostic value on positive patients in 1% of your entire cohort. The breast cancer example isn't contrived - Bayes theorem will let you apply fixed characteristics of the test to populations with very different characteristics, which is often what's done when optimizing and applying a medical test, for example. $\endgroup$ Commented Jan 28, 2021 at 17:26
  • 13
    $\begingroup$ mandatory xkcd: xkcd.com/1132 $\endgroup$
    – Franky
    Commented Jan 29, 2021 at 13:12

8 Answers

107
$\begingroup$

You are mistaken in thinking that what you perceive as "the massive importance that is afforded to Bayes' theorem in undergraduate courses in probability and popular science" is really "the massive importance that is afforded to Bayes' theorem in undergraduate courses in probability and popular science." But it's probably not your fault: This usually doesn't get explained very well.

What is the probability of a Caucasian American having brown eyes? What does that question mean? By one interpretation, commonly called the frequentist interpretation of probability, it asks merely for the proportion of persons having brown eyes among Caucasian Americans.

What is the probability that there was life on Mars two billion years ago? What does that question mean? It has no answer according to the frequentist interpretation. "The probability of life on Mars two billion years ago is $0.54$" is taken to be meaningless because one cannot say it happened in $54\%$ of all instances. But the Bayesian, as opposed to frequentist, interpretation of probability works with this sort of thing.

The Bayesian interpretation applied to statistical inference is immune to various pathologies afflicting that field.

Possibly you have seen that some people attach massive importance to the Bayesian interpretation of probability and mistakenly thought it was merely massive importance attached to Bayes's theorem. People who do consider Bayesianism important seldom explain this very clearly, primarily because that sort of exposition is not what they care about.

$\endgroup$
19
  • 9
    $\begingroup$ For a more "grounded" example, I like to use "the probability that candidate X wins the election" - elections are deterministic, so if you imagine holding the election many times under identical circumstances, the same people will vote or not-vote in the same ways and you will get the same results, every time. But surely the various polling models we see on Fivethirtyeight and similar sites cannot be completely meaningless (even if we might criticize them for other reasons). So Bayesian probability is required for them to make sense. $\endgroup$
    – Kevin
    Commented Jan 28, 2021 at 7:26
  • 11
    $\begingroup$ Did I miss a joke here or is there a typo in the first sentence? The two phrases in quotes are identical ... Should the second occurence of "Bayes' theorem" be "Bayesian statistics"? $\endgroup$
    – CL.
    Commented Jan 28, 2021 at 9:47
  • 10
    $\begingroup$ @CL.: I don't think there's a typo - I think it's drawing a distinction between "what you think of as the importance given to (etc)" and "the actual importance given to (etc)" $\endgroup$
    – psmears
    Commented Jan 28, 2021 at 11:13
  • 13
    $\begingroup$ So what does “what is the probability that there was life on Mars two billion years ago” mean according to the Bayesian interpretation? $\endgroup$
    – Sweeper
    Commented Jan 28, 2021 at 15:00
  • 11
    $\begingroup$ I'm not a fan of saying something like the life-on-Mars question is meaningless in the frequentist interpretation. When we talk about the probability of life on Mars, it's meant to be understood as the probability given the information we have available, i.e. a conditional probability. Frequentists would interpret this as, out of all possible histories (or historical models) that are consistent with the information we do have, in what fraction of them did life on Mars exist two billion years ago? Not to take anything away from the Bayesian approach, but it's far from meaningless. $\endgroup$
    – David Z
    Commented Jan 29, 2021 at 10:13
72
$\begingroup$

While I agree with Michael Hardy's answer, there is a sense in which Bayes' theorem is more important than any random identity in basic probability. Write Bayes' Theorem as

$$\text{P(Hypothesis|Data)}=\frac{\text{P(Data|Hypothesis)P(Hypothesis)}}{\text{P(Data)}}$$

The left hand side is what we usually want to know: given what we've observed, what should our beliefs about the world be? But the main thing that probability theory gives us is in the numerator on the right side: the frequency with which any given hypothesis will generate particular kinds of data. Probabilistic models in some sense answer the wrong question, and Bayes' theorem tells us how to combine this with our prior knowledge to generate the answer to the right question.
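
One way to make the denominator concrete, assuming the competing hypotheses $H_1,\dots,H_n$ are mutually exclusive and exhaustive (an assumption added here for illustration), is to expand it with the law of total probability:

$$\text{P(Data)}=\sum_{i=1}^{n}\text{P(Data|}H_i\text{)}\,\text{P(}H_i\text{)},\qquad\text{so}\qquad\text{P(}H_j\text{|Data)}=\frac{\text{P(Data|}H_j\text{)}\,\text{P(}H_j\text{)}}{\sum_{i=1}^{n}\text{P(Data|}H_i\text{)}\,\text{P(}H_i\text{)}}$$

Written this way, every term on the right is either a model output, P(Data|$H_i$), or a prior, P($H_i$).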

Frequentist methods that try not to use the prior have to reason about the quantity on the left by indirect means or else claim the left side is meaningless in many applications. They work, but frequently confuse even professional scientists. E.g. the common misconceptions about $p$-values come from people assuming that they are a left-side quantity when they are a right-side quantity.

$\endgroup$
2
  • 1
    $\begingroup$ How do we know P(Hypothesis)? $\endgroup$
    – Jonas Frey
    Commented Feb 22, 2021 at 22:15
  • 5
    $\begingroup$ @JonasFrey We don't, which is why frequentist statistics doesn't like it. This formula tells you how to move from a set of old probabilities to a set of new ones, but doesn't tell you where to start. In practice we either treat all hypotheses equally (frequentist) or propose a particular P(Hypothesis) based on past experience as an explicit input into our model (Bayesian). You can also imagine this as P(Hypothesis|past data) so that we are chaining new data onto old results. That can be metaphorical or literal depending on the circumstances. $\endgroup$ Commented Feb 26, 2021 at 2:55
26
$\begingroup$

You might know only $\Pr[A\mid B]$ and not $\Pr[B\mid A]$, not because someone "adversarially told you the wrong one", but because one of those is a natural quantity to compute, and the other is a natural quantity to want to know.

I am about to teach Bayes' theorem in an undergraduate course in probability. The general setting I want to consider is when:

  • We have several competing hypotheses about the world. (Several candidates for $B$.)
  • If we assume one of these hypotheses, then we get a nice and easy probability problem where it's easy to find the probability of $A$: some observations that we've made. (Outside undergraduate probability courses, "nice and easy" is a relative term.)
  • We want to figure out which hypothesis is likelier.

The mammogram example might be natural, but it's less obviously natural because we have to track down where the numbers that are given to us come from, and ask why we couldn't be given the other quantities in the problem. So here are some examples where we have fewer numbers coming to us out of thin air.

  1. Suppose you are communicating over a binary channel which flips bits $10\%$ of the time. (This part is given to us out of nowhere, but it's the natural quantity to ask about first.) Your friend has several possible messages they might send you: these are the hypotheses $B_1, B_2, \dots, B_n$. You receive a message: that's the observation $A$. Then $\Pr[A \mid B_i]$ is just $(0.1)^k (0.9)^{n-k}$ if $B_i$ is an $n$-bit message that differs from the one you received in $k$ places. On the other hand, $\Pr[B_i \mid A]$ is the quantity we want: it will tell us how likely it is that your friend sent each message.
  2. You have a coin, and you don't know anything about its fairness. One possible assumption is that it lands heads with probability $p$, where $p \sim \text{Uniform}(0,1)$, but we could vary this. Then you flip the coin $n$ times and see $k$ heads. There are infinitely many hypotheses $B_p$, one for each possible $p$; under each of them, $\Pr[A \mid B_p]$ is just a binomial probability. Knowing the conditional PDF of $p$, which is what Bayes' theorem tells us, tells us more about how likely the coin is to land heads.
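
Both examples can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the three candidate messages, the uniform prior over messages, and the grid approximation for the coin are all choices made here, not part of the examples themselves.

```python
from math import comb

def channel_posterior(received, candidates, flip=0.1, prior=None):
    """Return Pr[B_i | A] for each candidate message B_i, given received bits A."""
    if prior is None:
        # Assumption: every candidate message is equally likely a priori.
        prior = [1 / len(candidates)] * len(candidates)
    weights = []
    for message, p in zip(candidates, prior):
        n = len(message)
        k = sum(a != b for a, b in zip(message, received))  # bits that differ
        likelihood = flip**k * (1 - flip)**(n - k)           # Pr[A | B_i]
        weights.append(likelihood * p)
    total = sum(weights)  # Pr[A], by the law of total probability
    return [w / total for w in weights]

def coin_posterior(n, k, steps=1000):
    """Grid approximation of the posterior of p after k heads in n flips."""
    grid = [(i + 0.5) / steps for i in range(steps)]
    # Uniform(0,1) prior: the posterior is proportional to the binomial likelihood.
    weights = [comb(n, k) * p**k * (1 - p)**(n - k) for p in grid]
    total = sum(weights)
    return grid, [w / total for w in weights]

message_posterior = channel_posterior("0110", ["0110", "0111", "1001"])
grid, mass = coin_posterior(n=10, k=7)
posterior_mean = sum(p * w for p, w in zip(grid, mass))
print(message_posterior)
print(round(posterior_mean, 3))
```

In the channel example, the candidate that matches the received bits exactly gets most of the posterior mass, but not all of it. In the coin example, the grid estimate of the posterior mean should land close to $(k+1)/(n+2)$, the exact value under a uniform prior.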
$\endgroup$
20
$\begingroup$

There are two main issues here. One is that on a Bayesian interpretation of probability (this term doesn't reference the theorem, but they're both named for Bayes), probability quantifies how well we know individual events, not detailed available frequency statistics. The best-of-both-worlds hope, if you combine Bayesian and frequentist perspectives, is that past data give us the mammogram values you cited, and an individual woman can be diagnosed based on Bayes's theorem.

The second issue is that $P(A|B)$ need not be remotely close to $P(B|A)$. To wit:

  • A test that's usually right may still have most of its positives be false, which warrants some scepticism, as well as further testing.
  • Conflating $P(A|B)$ with $P(B|A)$ is a danger in the legal system. Will we arrest people based on accuracy, precision etc., even if their guilt is unlikely? Will "this evidence is unlikely if they're innocent" get them convicted, even though it may not mean their innocence is unlikely? And yes, this has had real-world fallout in both policing and court decisions.
  • Statistics tests what probability assumes (e.g. "if this is Gaussian then..."). Statistical tests often boil down to, "we can't measure the probability the null hypothesis is true, but we'll assess it based on the probability, under the null hypothesis, that data at least this surprising would occur". Indeed, which statement gets to be the null hypothesis is more about its facilitating such calculations than its being a "default" or "reasonable" assumption.
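
To see how far apart the two conditional probabilities can be, here is a small Python sketch of the legal example. All the numbers are illustrative assumptions, not real data: the evidence is taken to have probability 0.01 if the defendant is innocent and 0.99 if guilty, and the prior probability of guilt is varied.

```python
def posterior_guilt(prior_guilt, p_evidence_given_guilty, p_evidence_given_innocent):
    """P(guilty | evidence) via Bayes' theorem, for a two-hypothesis model."""
    numerator = p_evidence_given_guilty * prior_guilt
    denominator = numerator + p_evidence_given_innocent * (1 - prior_guilt)
    return numerator / denominator

# Same evidence, three assumed priors: a person picked at random, a weak suspect,
# and an even-odds suspect.
for prior in (0.001, 0.1, 0.5):
    print(prior, round(posterior_guilt(prior, 0.99, 0.01), 3))
```

With these assumed numbers the posterior is only about 9% under the 1-in-1000 prior and rises to 99% under the even-odds prior, even though "the evidence is unlikely if they're innocent" holds in every case.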
$\endgroup$
5
  • 5
    $\begingroup$ I see. So perhaps the second reason I stated (i.e., confusing $P(A|B)$ and $P(B|A)$) is common enough and has dire enough potential consequences that it justifies extended discussion. $\endgroup$
    – user78270
    Commented Jan 27, 2021 at 17:28
  • 1
    $\begingroup$ I agree with the legal system danger of confusing $P(A|B)$ and $P(B|A)$. However, if (for example) you were to stipulate "this evidence is unlikely if the defendant is innocent" and you were also to stipulate that the evidence is not unlikely when the defendant is guilty, then the evidence does provide a math basis for concluding that the defendant is probably (i.e. greater than 50%) guilty. $\endgroup$ Commented Jan 27, 2021 at 18:49
  • 6
    $\begingroup$ @user2661923 But this is not true - or am I misunderstanding your statement? You can have $P(E|I) = 1/100$ (evidence unlikely if the defendant is innocent), $P(E|G) = 99/100$ (evidence not unlikely if the defendant is guilty), yet if the prior probability of guilt is sufficiently low (let's say the defendant was just randomly picked on the street), the conditional probability of guilt given evidence will be low as well - for instance for $P(G) = 1/1000$, we get $P(G|E) \approx 0.09$. $\endgroup$
    – aekmr
    Commented Jan 28, 2021 at 11:05
  • $\begingroup$ @aekmr +1: very good catch - I totally overlooked your analysis - good rebuttal. In defense, I was influenced by the fact that a Police Dept is a political organization that hates to be embarrassed, so they won't generally arrest someone at random. However, your rebuttal certainly stands. $\endgroup$ Commented Jan 28, 2021 at 14:32
  • 2
    $\begingroup$ @user2661923 To be honest, I had to check my calculation a few times to verify I'm not saying something stupid. Which, together with the fact you made that mistake in the first place, seems like a good illustration that using Bayes' theorem in everyday situations isn't something people understand intuitively and that it benefits from a good exposition :). There are many examples of even well educated people getting conditional probabilities very wrong - to take the example of breast cancer screening from op, see here. $\endgroup$
    – aekmr
    Commented Jan 29, 2021 at 8:55
9
$\begingroup$

Let me start with a memory. From my undergraduate days, 30 years ago, I vividly remember the time when Bayes was introduced. We had spent a lot of time and effort on sampling theory and how to know if things could be proved. And to me, at the time, it always ended up that we needed to have a sample size of x (as I recall, a sample size of 7 was often the minimum).

To me, Bayes represented a totally different approach, one more in alignment with my view of reality. In sampling we looked at groups; with Bayes we started with individual things. So for me this was a very eye-opening addition to the field of probability praxis (and theory of course, but that came later for me). The book we had, written by Raiffa I believe, was about decision theory. 30 years later I still remember the discussion about whether to do one more test drilling in the oil field.

So, just maybe, in your curriculum the importance placed on Bayes is there to show that statistics does have several different branches, not only sampling theory or how to present graphs as correctly as possible.

$\endgroup$
7
$\begingroup$

You are correct that Bayes' theorem follows trivially from axioms of probability that everyone accepts. The difference between Bayesians and frequentists is a cultural one. The actual mathematical axioms they subscribe to are trivially homologous.

The cultural divide is a pretty stark one though.

  • Frequentists tend to think computation is a dirty word and they don't care to analyse problems that they cannot approach analytically, so basically they would prefer to think that everything is a Gaussian. Also, some of them tend to do this funny numerology thing where they fetishise numbers like 0.01 and 0.05.

  • Bayesians think that if they write down a uniform prior as a formula it looks more like real mathematics and less like a stupid assumption that rarely applies (appeals to 'entropy' make them feel great too); and they delude themselves into thinking that labelling part of their likelihood function a prior makes them special; as if frequentists couldn't multiply different likelihood functions together to get a joint one just fine.

Actual examples where non-strawman versions of the two approaches yield different results for the same problem do not actually exist, because there are no differences in the fundamental axioms they subscribe to. That being said, it is not as if the language, computational tools, and modelling approaches you use are unimportant in guiding your thought process. It'd be better if teaching methods focussed more on said homology, though.

$\endgroup$
3
  • 4
    $\begingroup$ Keith Winstein provides an excellent, neutral explanation of the difference between the two approaches. His explanation was my first encounter with a neutral comparison between the two approaches and he has convinced me that neither approach is superior. $\endgroup$
    – Brian
    Commented Jan 28, 2021 at 19:27
  • 2
    $\begingroup$ It's not cultural: Bayesians follow a degree-of-belief interpretation of probability whereas frequentists assign probabilities only when they can be interpreted as relative frequencies. $\endgroup$ Commented Jan 30, 2021 at 22:25
  • 1
    $\begingroup$ Those are nice words. And they might indeed be influential in steering one's mind in one direction or another. The mathematical axioms are homologous though. And I've never met a practicing statistician who strictly abided by those categorisations. $\endgroup$ Commented Jan 31, 2021 at 10:51
1
$\begingroup$

Not exactly an answer to the posted question, but Bayesian ideology is important in many practical problems in artificial intelligence, including character recognition, medical diagnosis, and more, the key structure being a Bayesian inference network.

$\endgroup$
0
$\begingroup$

First see the comments following this answer, especially the last few comments. I was totally unaware that Bayes Theorem is simply a consequence of axioms around the definition of Conditional Probability. Based on this assertion, I can't refute the idea that the following problem can be solved without Bayes Theorem.


Hard to imagine attacking a conditional probability problem without it. Imagine traveling back in time 1000 years. You are the captain of a ship. You have two sailors, A and B that you independently use to predict rain.

A is right 90% of the time and B is right 80% of the time.
A says it will rain today, and B says it won't rain today.
Absent Bayes Theorem, and absent any info on how often (in general) it rains, how do you (intuitively) determine the chance that it will rain today? Clearly, the problem is well defined, so the problem has a meaningful answer. Absent Bayes Theorem, or anything like it, how do you compute the answer?
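
A minimal Python sketch of one way to get a number, under assumptions that the problem statement does not spell out: each sailor is equally accurate on rainy and dry days, the two sailors err independently given the weather, and rain and no rain are taken as equally likely beforehand (the default `prior_rain=0.5`). The function name and parameters are hypothetical.

```python
def p_rain_given_split(prior_rain=0.5, acc_a=0.9, acc_b=0.8):
    """P(rain | A predicts rain, B predicts no rain) under the stated assumptions."""
    # If it rains, A is right and B is wrong.
    rain_branch = prior_rain * acc_a * (1 - acc_b)
    # If it stays dry, A is wrong and B is right.
    dry_branch = (1 - prior_rain) * (1 - acc_a) * acc_b
    return rain_branch / (rain_branch + dry_branch)

print(p_rain_given_split())                # even-odds prior
print(p_rain_given_split(prior_rain=0.1))  # an assumed drier climate
```

With the even-odds prior this gives 9/13, about 69%. With an assumed 10% prior for rain it drops to 20%, so the number depends on what is assumed about how often it rains.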

$\endgroup$
18
  • 1
    $\begingroup$ (1/2) I accept that some exercises that are given in undergraduate problem sets are stated in such a way that Bayes' theorem is required. One example of this is the mammogram problem I posed in my question, and another is what you stated here. However, I think that this fits into the purely artificial "adversarial" type problem. Who is telling us that A is right 90% of the time and that B is right 80% of the time? In order to compute these two probabilities in the first place, you need to know something about how often it rains in general. $\endgroup$
    – user78270
    Commented Jan 27, 2021 at 16:57
  • 1
    $\begingroup$ (2/2) In short, if it is truly impossible to compute $P(A|B)$ directly, then it should not be possible to compute $P(B|A)$, $P(A)$, and $P(B)$ either. If someone is unable to provide information on how the probability that A\B is right is 90%/80% of the time (information with which you could compute the conditional probabilities you are interested in directly), then why should you trust that these numbers are accurate? $\endgroup$
    – user78270
    Commented Jan 27, 2021 at 17:01
  • 1
    $\begingroup$ I still don't understand where we're going with any of this, and how this relates to Bayes. You have defined two independent events, say $E_A=\{A\text{'s prediction is correct}\}$ and $E_B=\{B\text{'s prediction is correct}\}$, with respective probabilities 0.9 and 0.8. Now you've computed $P(E_A\cap E_B^c)/(P(E_A\cap E_B^c)+P(E_A^c\cap E_B))$. How does this relate to the initial problem and Bayes' Theorem? $\endgroup$
    – user78270
    Commented Jan 27, 2021 at 17:44
  • 2
    $\begingroup$ Yes, I understand where independence and disjointness are coming from. More importantly: the conclusion, then, is that we do not actually need Bayes' theorem in your problem at all. I would submit to you that the content of your answer then should exclusively focus on your claim that "the definition of conditional probability is based on Bayes' theorem." Without additional context this is in fact incorrect. Under the Kolmogorov/frequentist approach to probability, $P(A|B)$ is defined without any mention whatsoever of Bayes' theorem. $\endgroup$
    – user78270
    Commented Jan 27, 2021 at 18:25
  • 2
    $\begingroup$ Bayes’ theorem is then proved using the definition of conditional probability, not the other way around. The problem you posed can be solved entirely without Bayes’ theorem under the standard Kolmogorov axioms. In this case, I would consider editing your answer to focus on your claim that the definition of conditional probability comes from Bayes’ theorem (which, again, without additional information, such as “if we adopt a different axiomatization or philosophy than Kolmogorov’s”, is objectively incorrect.) $\endgroup$
    – user78270
    Commented Jan 27, 2021 at 18:25
