
My question comes from the following observation. I have been reading posts, blogs, lectures, and books on machine learning. My impression is that machine learning practitioners seem indifferent to many things that statisticians and econometricians care about. In particular, machine learning practitioners emphasize prediction accuracy over inference.

One such example occurred when I was taking Andrew Ng's Machine Learning course on Coursera. When he discusses the simple linear model, he mentions nothing about the BLUE property of the estimators, or how heteroskedasticity would "invalidate" confidence intervals. Instead, he focuses on the implementation of gradient descent and on the concepts of cross-validation and the ROC curve. These topics were not covered in my econometrics/statistics classes.
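For concreteness, the gradient-descent fitting of a simple linear model that the course emphasizes can be sketched as follows (a minimal illustration; the learning rate, epoch count, and toy data are my own choices, not from the course):

```python
import numpy as np

def gradient_descent(x, y, lr=0.01, epochs=1000):
    """Fit y ~ w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        pred = w * x + b
        # Gradients of (1/n) * sum((pred - y)^2) with respect to w and b
        grad_w = (2.0 / n) * np.dot(pred - y, x)
        grad_b = (2.0 / n) * np.sum(pred - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Noiseless toy data generated from y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
w, b = gradient_descent(x, y, lr=0.05, epochs=5000)  # w ~ 2, b ~ 1
```

Note that nothing in this procedure involves standard errors or distributional assumptions: it is purely an optimization of predictive loss, which is exactly the contrast with the econometrics treatment.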

Another example occurred when I participated in Kaggle competitions. I was reading other participants' code and discussions, and a large share of them just throw everything into an SVM, a random forest, or XGBoost.

Yet another example is stepwise model selection. This technique is widely used, at least online and on Kaggle, and many classical machine learning textbooks cover it, such as An Introduction to Statistical Learning. However, according to this answer (which is quite convincing), stepwise model selection faces many problems, especially when it comes to "discovering the true model". It seems that there are only two possibilities: either machine learning practitioners do not know the problems with stepwise selection, or they do know but do not care.
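To make the technique in question concrete, forward stepwise selection can be sketched like this (a toy illustration of the greedy procedure, not a recommendation; the data and variable names are made up). The greedy, data-driven choice at each step is precisely what invalidates the usual inferential statistics afterwards:

```python
import numpy as np

def forward_stepwise(X, y, k):
    """Greedy forward selection: at each step, add the feature that
    most reduces the residual sum of squares, until k features are chosen."""
    n, d = X.shape
    selected = []
    for _ in range(k):
        best_rss, best_j = None, None
        for j in range(d):
            if j in selected:
                continue
            cols = selected + [j]
            w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            resid = y - X[:, cols] @ w
            rss = resid @ resid
            if best_rss is None or rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
    return selected

# Toy data: only column 0 actually matters
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=200)
chosen = forward_stepwise(X, y, k=2)  # column 0 is picked first
```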

So here are my questions:

  1. Is it true that (in general) machine learning practitioners focus on prediction and thus do not care about a lot of things which statisticians/economists care about?
  2. If it is true, then what is the reason behind it? Is it because inference is more difficult in some sense?
  3. There are tons of materials on machine learning (that is, prediction) online. However, if I am interested in learning how to do inference, what are some online resources I can consult?

Update: I just realized that the word "inference" could mean many things. What I mean by "inference" refers to questions such as

  1. Did $X$ cause $Y$, or did $Y$ cause $X$? Or more generally, what are the causal relations among $X_1, X_2, \cdots, X_n$?

  2. Since "all models are wrong", how "wrong" is our model from the true model?

  3. Given the information in a sample, what can we say about the population, and how confident can we be in saying it?

Due to my very limited statistics knowledge, I am not even sure whether these questions fall within the realm of statistics. But these are the types of questions that machine learning practitioners do not seem to care about. Perhaps statisticians do not care either? I don't know.

  • "To paraphrase provocatively, machine learning is statistics minus any checking of models and assumptions." Brian D. Ripley is cited with this at useR! 2004, and the phrase has become part of the fortunes package on CRAN. This is just to say that you are not alone with the impression that mathematical rigor is not always the main concern in machine learning.
    – Bernhard, Sep 13, 2016 at 6:03
  • Leo Breiman tackles exactly this question in his 2001 paper "Statistical Modeling: The Two Cultures", which is a great read.
    – skd, Sep 16, 2016 at 9:19

1 Answer


First, I would offer a different perspective on machine learning. What you mention, Andrew Ng's Coursera lectures and Kaggle competitions, is not 100% of machine learning but only the branches targeted at practical applications. Machine learning research proper is the work that invented the random forest, the SVM, and the gradient boosting model, and that work is fairly close to statistics and mathematics.

I would agree that machine learning practitioners focus more on accuracy than statisticians and economists do. There are reasons people are interested in getting better accuracy rather than in "inference about the true distribution." The major reason is that the way we collect and use data has changed over the past decades.

Statistics has been established for hundreds of years, but in the past no one would have imagined having billions of data points for training and billions more for testing (for example, the number of images on the Internet). Therefore, with a relatively small amount of data, assumptions from domain knowledge were needed to do the work; you can also think of this as a way to "regularize" the model. Once the assumptions are made, there are inference problems about the "true" distribution.
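The idea of "regularizing" when data are scarce can be sketched with ridge regression, which shrinks coefficients rather than trusting a small sample fully (a minimal sketch; the penalty strength and toy data are illustrative choices of mine):

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression: minimizes ||Xw - y||^2 + alpha * ||w||^2."""
    d = X.shape[1]
    # Normal equations with an L2 penalty added to the diagonal
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

# With few, noisy observations, the penalty keeps the weights modest.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=10)
w_ols = ridge_fit(X, y, alpha=0.0)    # ordinary least squares
w_ridge = ridge_fit(X, y, alpha=5.0)  # shrunk estimates
```

The shrinkage encodes the assumption "coefficients are probably not huge", which plays the same role that domain-knowledge assumptions played in classical small-sample statistics.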

However, if we think about it carefully, can we be sure these assumptions are true and the inferences valid? I would like to cite George Box:

All models are wrong but some are useful

Now, let us get back to the practical approach that puts more emphasis on accuracy than on assumptions and inference. It is a good approach when we have a huge amount of data.

Suppose we are building a model, at the pixel level, for all images containing human faces. First, it is very hard to propose pixel-level assumptions for billions of images: no one has that domain knowledge. Second, we can try all possible ways of fitting the data, and because the data set is huge, all the models we have may still not be expressive enough (it is almost impossible to overfit).

This is also why "deep learning / neural networks" became popular again. Under the condition of big data, we can pick a really complex model and fit it as well as we can, and we may still be fine, because our computational resources are limited compared to all the real data in the world.

Finally, if the models we build perform well on a huge test set, then they are good and valuable, even though we may not know the underlying assumptions or the true distribution.


I also want to point out that the word "inference" has different meanings in different communities.

  • In the statistics community, it usually means recovering information about the true distribution, in a parametric or non-parametric way.
  • In the machine learning community, it usually means computing certain probabilities from a given distribution. See Murphy's Graphical Models Tutorial for examples.
  • In machine learning, people use the word "learning" for "estimating the parameters of the true distribution", which is similar to "inference" in the statistics community.
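The machine-learning sense of "inference", computing probabilities under a fixed, fully specified model, can be illustrated with a two-variable toy distribution (all the probabilities here are made-up numbers for illustration):

```python
# Hand-picked (illustrative) model: P(rain) and P(wet | rain)
p_rain = 0.2
p_wet_given_rain = {True: 0.9, False: 0.1}

# Build the joint: P(rain, wet) = P(rain) * P(wet | rain)
joint = {}
for rain in (True, False):
    pr = p_rain if rain else 1 - p_rain
    for wet in (True, False):
        pw = p_wet_given_rain[rain] if wet else 1 - p_wet_given_rain[rain]
        joint[(rain, wet)] = pr * pw

# "Inference" in the ML sense: condition on the evidence wet=True
p_wet = joint[(True, True)] + joint[(False, True)]
p_rain_given_wet = joint[(True, True)] / p_wet  # 0.18 / 0.26, about 0.69
```

Here the distribution is given; no parameters are being estimated from data. That estimation step is what this community calls "learning".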

So you can see that, essentially, many people in machine learning are also doing "inference".

In addition, you might also consider that people in academia like to "re-brand their work and re-sell it": coming up with new terms can help show the novelty of the research. In fact, there is a lot of overlap among artificial intelligence, data mining, and machine learning, and they are all closely related to statistics and algorithm design. Again, there are no clear boundaries between doing "inference" and not doing it.

  • I can see where you are coming from. An alternate take might be: prediction = focus on observed variables, inference = focus on hidden variables. So in a sense inference is trying to produce new types of measurements, while prediction is more about new realizations of measurements that could in principle be observed? (This is compatible with your answer, of course.)
    – GeoMatt22, Sep 13, 2016 at 3:50
