496
$\begingroup$

Last year, I read a blog post from Brendan O'Connor entitled "Statistics vs. Machine Learning, fight!" that discussed some of the differences between the two fields. Andrew Gelman responded favorably to this:

Simon Blomberg:

From R's fortunes package: To paraphrase provocatively, 'machine learning is statistics minus any checking of models and assumptions'. -- Brian D. Ripley (about the difference between machine learning and statistics) useR! 2004, Vienna (May 2004) :-) Season's Greetings!

Andrew Gelman:

In that case, maybe we should get rid of checking of models and assumptions more often. Then maybe we'd be able to solve some of the problems that the machine learning people can solve but we can't!

There was also the "Statistical Modeling: The Two Cultures" paper by Leo Breiman in 2001 which argued that statisticians rely too heavily on data modeling, and that machine learning techniques are making progress by instead relying on the predictive accuracy of models.

Has the statistics field changed over the last decade in response to these critiques? Do the two cultures still exist or has statistics grown to embrace machine learning techniques such as neural networks and support vector machines?

$\endgroup$
10
  • 25
    $\begingroup$ Thanks @robin; made CW. Although I don't entirely see this as "argumentative"; there are two fields which have informed each other (this is a fact), and the question is how much they have evolved together over the last decade. $\endgroup$
    – Shane
    Commented Aug 9, 2010 at 14:17
  • 23
    $\begingroup$ Add a third culture: data mining. Machine learners and data miners speak quite different languages. Usually, the machine learners don't even understand what is different in data mining. To them, it's just unsupervised learning; they ignore the data management aspects and apply the buzzword data mining to machine learning, too, adding further to the confusion. $\endgroup$ Commented Dec 6, 2011 at 12:05
  • 5
    $\begingroup$ There's a similar question on data mining and statistics $\endgroup$
    – naught101
    Commented Mar 22, 2012 at 23:51
  • 2
    $\begingroup$ An interesting discussion in Wasserman's blog. $\endgroup$
    – user10525
    Commented Jun 16, 2012 at 10:43
  • 4
    $\begingroup$ It seems to me that actually the link between ML and statistics is not being emphasized enough. Many CS students ignore learning anything about statistics during their foundational days because they don't understand the critical importance of a sound statistics grounding in carrying out ML tasks. Maybe a lot of CS departments around the world are also slow to act on this. It would prove to be a very costly mistake, and I certainly hope there's more awareness about the importance of statistics knowledge in CS. Basically ML = statistics in a lot of senses. $\endgroup$
    – xji
    Commented Nov 30, 2017 at 15:09

20 Answers

232
$\begingroup$

I think the answer to your first question is simply in the affirmative. Take any issue of Statistical Science, JASA, or the Annals of Statistics from the past 10 years and you'll find papers on boosting, SVMs, and neural networks, although this area is less active now. Statisticians have appropriated the work of Valiant and Vapnik, and on the other side, computer scientists have absorbed the work of Donoho and Talagrand. I don't think there is much difference in scope and methods any more. I have never bought Breiman's argument that CS people were only interested in minimizing loss using whatever works. That view was heavily influenced by his participation in neural networks conferences and his consulting work; but PAC learning, SVMs, and boosting all have solid foundations. And today, unlike in 2001, statistics is more concerned with finite-sample properties, algorithms and massive datasets.

But I think that there are still three important differences that are not going away soon.

  1. Methodological Statistics papers are still overwhelmingly formal and deductive, whereas Machine Learning researchers are more tolerant of new approaches even if they don't come with a proof attached;
  2. The ML community primarily shares new results and publications in conferences and related proceedings, whereas statisticians use journal papers. This slows down progress in Statistics and identification of star researchers. John Langford has a nice post on the subject from a while back;
  3. Statistics still covers areas that are (for now) of little concern to ML, such as survey design, sampling, industrial Statistics etc.
$\endgroup$
6
  • 28
    $\begingroup$ Great post! Note that Vapnik had a PhD in statistics. I'm not sure there are a lot of computer scientists who know the name Talagrand, and I'm sure 0.01% of them can state from memory one result of Talagrand :) can you? I don't know the work of Valiant :) $\endgroup$ Commented Jul 29, 2010 at 11:30
  • $\begingroup$ I see the different answers when it comes to academic research and applications. I think that you answered in the context of the former. In applications I think the biggest difference is in the way the fields are expanding. ML through data science channel accepts everyone who can code, literally. In statistics you still need a formal degree in stats or near fields to enter the work force. $\endgroup$
    – Aksakal
    Commented Mar 18, 2015 at 12:31
  • 3
    $\begingroup$ Both survey sampling and industrial statistics are multi-billion dollar fields (the survey research methods section of the American Statistical Association is the third largest after biometrics and consulting, and the latter includes a great number of industrial statisticians, too. There's a separate section on quality, and there is yet more Six Sigma and other quality control material out there, not all of it entirely within statistics). Both have critical shortages of statisticians as the current workforce of baby boomers who came to work in these areas in the 1960s is retiring. $\endgroup$
    – StasK
    Commented Jul 6, 2015 at 15:29
  • 7
    $\begingroup$ While some people get their jobs by posing on the red carpet at conferences, other people find theirs by applying the methods in the real world. The latter folks don't have that much interest in identifying stars of any kind; they would much rather identify the methods that work, although on many occasions, after a few years in a given field, you are led to the same names over and over again. $\endgroup$
    – StasK
    Commented Jul 6, 2015 at 15:31
  • $\begingroup$ Why would sampling not be of concern to ML? Isn't that quite similar to the problem of having the right labelled training data in ML? $\endgroup$
    – gerrit
    Commented Jun 27, 2019 at 11:22
200
$\begingroup$

The biggest difference I see between the communities is that statistics emphasizes inference, whereas machine learning emphasizes prediction. When you do statistics, you want to infer the process by which the data you have were generated. When you do machine learning, you want to know how you can predict what future data will look like w.r.t. some variable.

Of course the two overlap. Knowing how the data was generated will give you some hints about what a good predictor would be, for example. However, one example of the difference is that machine learning has dealt with the p >> n problem (more features/variables than training samples) since its infancy, whereas statistics is just starting to get serious about this problem. Why? Because you can still make good predictions when p >> n, but you can't make very good inferences about what variables are actually important and why.
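
A toy sketch of this point (made-up numbers, and ridge regression is just a convenient choice): with far more variables than observations you can often still predict new data reasonably well, but the individual coefficient estimates -- the quantities inference cares about -- stay poor.

```python
# Illustrative only: p >> n, a regularized fit predicts tolerably,
# but the coefficient estimates are far from the truth.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 500                      # far more variables than samples
beta = np.zeros(p)
beta[:5] = 2.0                      # only 5 variables actually matter

X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)

# Ridge regression in closed form: (X'X + lam*I)^{-1} X'y (well-defined even when p > n)
lam = 10.0
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Fresh data from the same process
X_new = rng.standard_normal((1000, p))
y_new = X_new @ beta + rng.standard_normal(1000)

pred_r2 = 1 - np.mean((y_new - X_new @ beta_hat) ** 2) / np.var(y_new)
coef_err = np.max(np.abs(beta_hat - beta))

print(f"out-of-sample R^2: {pred_r2:.2f}")         # typically clearly positive
print(f"worst coefficient error: {coef_err:.2f}")  # comparable to the true effect size
```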

$\endgroup$
11
  • 13
    $\begingroup$ Could this be (overly) simplified as something like the difference between generative and discriminative models? $\endgroup$
    – Wayne
    Commented Feb 14, 2011 at 22:19
  • 5
    $\begingroup$ "One should solve the [classification] problem directly and never solve a more general problem as an intermediate step..." - Vapnik $\endgroup$
    – Wayne
    Commented Feb 14, 2011 at 22:42
  • 3
    $\begingroup$ @mbq: I didn't mean to imply that no inference can be done, just that it's not the main goal and that usually p >> n in ML, making it a lot harder. $\endgroup$
    – dsimcha
    Commented Feb 15, 2011 at 1:00
  • 3
    $\begingroup$ I strongly disagree with this view. It looks wrong. Things like recurrent neural networks also try to infer processes, and even go on and generate new sequences. $\endgroup$
    – caveman
    Commented Apr 14, 2016 at 15:14
  • 2
    $\begingroup$ So what about robotics? Probabilistic robotics is largely focused on inference, and pretty dominant in applications. But still a different "flavor" than statistics (and more engineering compared to machine/learning; i.e. real-time analysis/control) $\endgroup$
    – GeoMatt22
    Commented Aug 19, 2016 at 17:14
167
$\begingroup$

Bayesian: "Hello, Machine Learner!"

Frequentist: "Hello, Machine Learner!"

Machine Learner: "I hear you guys are good at stuff. Here's some data."

F: "Yes, let's write down a model and then calculate the MLE."

B: "Hey, F, that's not what you told me yesterday! I had some univariate data and I wanted to estimate the variance, and I calculated the MLE. Then you pounced on me and told me to divide by $n-1$ instead of by $n$."

F: "Ah yes, thanks for reminding me. I often think that I'm supposed to use the MLE for everything, but I'm interested in unbiased estimators and so on."

ML: "Eh, what's this philosophizing about? Will it help me?"

F: " OK, an estimator is a black box, you put data in and it gives you some numbers out. We frequentists don't care about how the box was constructed, about what principles were used to design it. For example, I don't know how to derive the $\div(n-1)$ rule."

ML: " So, what do you care about?"

F: "Evaluation."

ML: "I like the sound of that."

F: "A black box is a black box. If somebody claims a particular estimator is an unbiased estimator for $\theta$, then we try many values of $\theta$ in turn, generate many samples from each based on some assumed model, push them through the estimator, and find the average estimated $\theta$. If we can prove that the expected estimate equals the true value, for all values, then we say it's unbiased."

ML: "Sounds great! It sounds like frequentists are pragmatic people. You judge each black box by its results. Evaluation is key."

F: "Indeed! I understand you guys take a similar approach. Cross-validation, or something? But that sounds messy to me."

ML: "Messy?"

F: "The idea of testing your estimator on real data seems dangerous to me. The empirical data you use might have all sorts of problems with it, and might not behave according the model we agreed upon for evaluation."

ML: "What? I thought you said you'd proved some results? That your estimator would always be unbiased, for all $\theta$."

F: "Yes. While your method might have worked on one dataset (the dataset with train and test data) that you used in your evaluation, I can prove that mine will always work."

ML: "For all datasets?"

F: "No."

ML: "So my method has been cross-validated on one dataset. You haven't test yours on any real dataset?"

F: "That's right."

ML: "That puts me in the lead then! My method is better than yours. It predicts cancer 90% of the time. Your 'proof' is only valid if the entire dataset behaves according to the model you assumed."

F: "Emm, yeah, I suppose."

ML: "And that interval has 95% coverage. But I shouldn't be surprised if it only contains the correct value of $\theta$ 20% of the time?"

F: "That's right. Unless the data is truly i.i.d Normal (or whatever), my proof is useless."

ML: "So my evaluation is more trustworthy and comprehensive? It only works on the datasets I've tried so far, but at least they're real datasets, warts and all. There you were, trying to claim you were more 'conservative' and 'thorough' and that you were interested in model-checking and stuff."

B: (interjects) "Hey guys, Sorry to interrupt. I'd love to step in and balance things up, perhaps demonstrating some other issues, but I really love watching my frequentist colleague squirm."

F: "Woah!"

ML: "OK, children. It was all about evaluation. An estimator is a black box. Data goes in, data comes out. We approve, or disapprove, of an estimator based on how it performs under evaluation. We don't care about the 'recipe' or 'design principles' that are used."

F: "Yes. But we have very different ideas about which evaluations are important. ML will do train-and-test on real data. Whereas I will do an evaluation that is more general (because it involves a broadly-applicable proof) and also more limited (because I don't know if your dataset is actually drawn from the modelling assumptions I use while designing my evaluation.)"

ML: "What evaluation do you use, B?"

F (interjects): "Hey. Don't make me laugh. He doesn't evaluate anything. He just uses his subjective beliefs and runs with it. Or something."

B: "That's the common interpretation. But it's also possible to define Bayesianism by the evaluations preferred. Then we can use the idea that none of us care what's in the black box, we care only about different ways to evaluate."

B continues: "Classic example: Medical test. The result of the blood test is either Positive or Negative. A frequentist will be interested in, of the Healthy people, what proportion get a Negative result. And similarly, what proportion of Sick people will get a Positive. The frequentist will calculate these for each blood testing method that's under consideration and then recommend that we use the test that got the best pair of scores."

F: "Exactly. What more could you want?"

B: "What about those individuals that got a Positive test result? They will want to know 'of those that get a Positive result, how many will get Sick?' and 'of those that get a Negative result, how many are Healthy?' "

ML: "Ah yes, that seems like a better pair of questions to ask."

F: "HERESY!"

B: "Here we go again. He doesn't like where this is going."

ML: "This is about 'priors', isn't it?"

F: "EVIL".

B: "Anyway, yes, you're right ML. In order to calculate the proportion of Positive-result people that are Sick you must do one of two things. One option is to run the tests on lots of people and just observe the relevant proportions. How many of those people go on to die of the disease, for example."

ML: "That sounds like what I do. Use train-and-test."

B: "But you can calculate these numbers in advance, if you are willing to make an assumption about the rate of Sickness in the population. The frequentist also makes his calculations in advance, but without using this population-level Sickness rate."

F: "MORE UNFOUNDED ASSUMPTIONS."

B: "Oh shut up. Earlier, you were found out. ML discovered that you are just as fond of unfounded assumptions as anyone. Your 'proven' coverage probabilities won't stack up in the real world unless all your assumptions stand up. Why is my prior assumption so different? You call me crazy, yet you pretend your assumptions are the work of a conservative, solid, assumption-free analysis."

B (continues): "Anyway, ML, as I was saying. Bayesians like a different kind of evaluation. We are more interested in conditioning on the observed data, and calculating the accuracy of our estimator accordingly. We cannot perform this evaluation without using a prior. But the interesting thing is that, once we decide on this form of evaluation, and once we choose our prior, we have an automatic 'recipe' to create an appropriate estimator. The frequentist has no such recipe. If he wants an unbiased estimator for a complex model, he doesn't have any automated way to build a suitable estimator."

ML: "And you do? You can automatically build an estimator?"

B: "Yes. I don't have an automatic way to create an unbiased estimator, because I think bias is a bad way to evaluate an estimator. But given the conditional-on-data estimation that I like, and the prior, I can connect the prior and the likelihood to give me the estimator."

ML: "So anyway, let's recap. We all have different ways to evaluate our methods, and we'll probably never agree on which methods are best."

B: "Well, that's not fair. We could mix and match them. If any of us have good labelled training data, we should probably test against it. And generally we all should test as many assumptions as we can. And some 'frequentist' proofs might be fun too, predicting the performance under some presumed model of data generation."

F: "Yeah guys. Let's be pragmatic about evaluation. And actually, I'll stop obsessing over infinite-sample properties. I've been asking the scientists to give me an infinite sample, but they still haven't done so. It's time for me to focus again on finite samples."

ML: "So, we just have one last question. We've argued a lot about how to evaluate our methods, but how do we create our methods."

B: "Ah. As I was getting at earlier, we Bayesians have the more powerful general method. It might be complicated, but we can always write some sort of algorithm (maybe a naive form of MCMC) that will sample from our posterior."

F (interjects): "But it might have bias."

B: "So might your methods. Need I remind you that the MLE is often biased? Sometimes, you have great difficulty finding unbiased estimators, and even when you do you have a stupid estimator (for some really complex model) that will say the variance is negative. And you call that unbiased. Unbiased, yes. But useful, no!"

ML: "OK guys. You're ranting again. Let me ask you a question, F. Have you ever compared the bias of your method with the bias of B's method, when you've both worked on the same problem?"

F: "Yes. In fact, I hate to admit it, but B's approach sometimes has lower bias and MSE than my estimator!"

ML: "The lesson here is that, while we disagree a little on evaluation, none of us has a monopoly on how to create estimator that have properties we want."

B: "Yes, we should read each other's work a bit more. We can give each other inspiration for estimators. We might find that other's estimators work great, out-of-the-box, on our own problems."

F: "And I should stop obsessing about bias. An unbiased estimator might have ridiculous variance. I suppose all of us have to 'take responsibility' for the choices we make in how we evaluate and the properties we wish to see in our estimators. We can't hide behind a philosophy. Try all the evaluations you can. And I will keep sneaking a look at the Bayesian literature to get new ideas for estimators!"

B:"In fact, a lot of people don't really know what their own philosophy is. I'm not even sure myself. If I use a Bayesian recipe, and then proof some nice theoretical result, doesn't that mean I'm a frequentist? A frequentist cares about above proofs about performance, he doesn't care about recipes. And if I do some train-and-test instead (or as well), does that mean I'm a machine-learner?"

ML: "It seems we're all pretty similar then."

$\endgroup$
8
  • 12
    $\begingroup$ For readers who will read this response to the end I would suggest to add a brief take-away message (and to provide appropriate citation if it applies). $\endgroup$
    – chl
    Commented Oct 18, 2013 at 20:30
  • 1
    $\begingroup$ With -2 votes so far, I think there's not much I can do to save it :) I think the ending, where they all agree with each other, and admit they can use each other's methods without worrying about each other's philosophy, is a 'take-away message'. $\endgroup$ Commented Oct 18, 2013 at 21:45
  • 13
    $\begingroup$ No citation required. I just made it up myself. It's probably not very well informed, it's based on my own (mis)-interpretations of arguments I've had with a small number of colleagues over the years. $\endgroup$ Commented Oct 18, 2013 at 21:46
  • 3
    $\begingroup$ I've seen such dialogue (shorter, though) in the past, and I find them interesting. I was also concerned by the downvotes, hence my suggestion to put a brief summary at the top so as to motivate readers to read the rest of your post. $\endgroup$
    – chl
    Commented Oct 19, 2013 at 5:59
  • 4
    $\begingroup$ 13/10 would argue again $\endgroup$ Commented Mar 21, 2017 at 19:22
79
$\begingroup$

In such a discussion, I always recall the famous Ken Thompson quote

When in doubt, use brute force.

In this case, machine learning is a salvation when the assumptions are hard to pin down; or at least it is much better than guessing them wrong.

$\endgroup$
2
  • 2
    $\begingroup$ With the increased computational capabilities these years and autoencoders and associated techniques, this is more true than ever. $\endgroup$
    – Firebug
    Commented Aug 16, 2016 at 17:46
    $\begingroup$ To solve a problem, engineers use formulas, techniques and procedures that they have used before and are sure will succeed... Ordinarily, this is called the use of brute force or of rules of thumb... New formulas, techniques and procedures are adopted in a step-by-step process... Engineering activities are group activities -- where engineers, technicians and manual labourers work together. When a new procedure is introduced, it takes time to train the technicians and labourers in this procedure. So modernisation is introduced in an evolutionary process. $\endgroup$
    – b.sahu
    Commented Jan 30, 2017 at 19:59
70
$\begingroup$

What enforces more separation than there should be is each discipline's lexicon.

There are many instances where ML uses one term and Statistics uses a different term--but both refer to the same thing--fine, you would expect that, and it doesn't cause any permanent confusion (e.g., features/attributes versus explanatory variables, or neural network/MLP versus projection pursuit).

What's much more troublesome is that both disciplines use the same term to refer to completely different concepts.

A few examples:

Kernel Function

In ML, kernel functions are used in classifiers (e.g., SVM) and of course in kernel machines. The term refers to a simple function (cosine, sigmoidal, RBF, polynomial) used to map non-linearly separable data into a new input space, so that the data become linearly separable in that new space (versus using a non-linear model to begin with).

In statistics, a kernel function is a weighting function used in density estimation to smooth the density curve.
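
A small sketch of the two unrelated uses (my own toy illustration; the gamma and bandwidth values are arbitrary):

```python
# Same word "kernel", two different jobs.
import numpy as np

# ML sense: a similarity function between two points, e.g. the RBF kernel used by SVMs.
def rbf_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

# Statistics sense: a weighting function for kernel density estimation.
def kde(x_grid, data, bandwidth=0.5):
    # Gaussian kernel weights, averaged over the sample, give a smoothed density estimate.
    u = (x_grid[:, None] - data[None, :]) / bandwidth
    weights = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return weights.mean(axis=1) / bandwidth

data = np.random.default_rng(2).normal(size=200)
print(rbf_kernel(np.array([0.0, 1.0]), np.array([1.0, 1.0])))  # similarity between two points
print(kde(np.linspace(-3, 3, 5), data))                        # density estimate on a grid
```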

Regression

In ML, predictive algorithms (or implementations of those algorithms) that return class labels ("classifiers") are sometimes referred to as machines--e.g., support vector machine, kernel machine. The counterpart to machines are regressors, which return a score (a continuous variable)--e.g., support vector regression.

Rarely do the algorithms have different names based on mode--e.g., "MLP" is the term used whether it returns a class label or a continuous variable.

In Statistics, if you are attempting to build a model from empirical data to predict some response variable based on one or more explanatory variables, then you are doing regression analysis. It doesn't matter whether the output is a continuous variable or a class label (e.g., logistic regression). So, for instance, least-squares regression refers to a model that returns a continuous value; logistic regression, on the other hand, returns a probability estimate which is then discretized to a class label.

Bias

In ML, the bias term in the algorithm is conceptually identical to the intercept term used by statisticians in regression modeling.

In Statistics, bias is non-random error--i.e., some phenomenon influenced the entire data set in the same direction, which in turn means that this kind of error cannot be removed by resampling or increasing the sample size.
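
To make the collision explicit, the same word names two unrelated quantities:

$$\text{ML usage:}\quad \hat{y} = w^{\top}x + b \;\;(\text{the "bias" } b \text{ is the intercept}), \qquad \text{Statistics usage:}\quad \operatorname{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta .$$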

$\endgroup$
7
  • 21
    $\begingroup$ In statistics, bias is not the same as error. Error is purely random, bias is not. You have bias when you know that the expected value of your estimate is not equal to the true value. $\endgroup$
    – Joris Meys
    Commented Sep 8, 2010 at 21:30
  • 2
    $\begingroup$ (@Joris Or even if you don't know it! Sounds trite, but just figuring out if there's bias can be a considerable practical problem. From the data alone, how sure can you be that an estimated regression parameter is free of omitted variable bias?) It's a common misconception that bias is a feature of the data, not a property of an estimator; I wonder if it stems from non-technical usage like "that survey is biased!" Statisticians also aren't always consistent about terms like "error": mean square error (of an estimator) includes a bias-squared component, so that "error" isn't "purely random". $\endgroup$
    – Silverfish
    Commented Oct 31, 2013 at 2:42
  • 2
    $\begingroup$ I think the term "machine" in SVMs should be attributed to the personal taste of Vladimir Vapnik. Nowadays, I don't think it is used to name any other classifier. $\endgroup$
    – iliasfl
    Commented Nov 3, 2014 at 5:25
  • 3
    $\begingroup$ Many of these aren't consistent with the usage I've seen in the ML community. Both types of kernels are in wide use (though Hilbert space kernels being more common), "machine" is basically only used for SVMs (as iliasfl notes), and "bias" usually means $\mathbb{E}[\hat{X} - X]$ (perhaps conditioned on something) which is not the same thing as an intercept. $\endgroup$
    – Danica
    Commented Nov 24, 2014 at 10:13
  • 1
    $\begingroup$ The statement "logistic regression on the other hand, returns a class labels." is wrong. Logistic regression returns continuous values in $[0, 1]$ that are estimates of the probability of belonging to the class coded as $1$. $\endgroup$
    – random_guy
    Commented Feb 29, 2016 at 22:03
37
$\begingroup$

The largest differences I've been noticing in the past year are:

  • Machine learning experts do not spend enough time on fundamentals, and many of them do not understand optimal decision making and proper accuracy scoring rules. They do not understand that predictive methods that make no assumptions require larger sample sizes than those that do.
  • We statisticians spend too little time learning good programming practice and new computational languages. We are too slow to change when it comes to computing and adopting new methods from the statistical literature.
$\endgroup$
1
  • 8
    $\begingroup$ Another note is that we statisticians tend to limit ourselves to methods we can prove mathematically will work well (under a set of maybe ridiculous assumptions), especially when it comes to publications. Machine learning people are very happy to use methods that empirically work well on a few datasets. As a result, I think the ML literature moves much faster but also requires more sifting through silliness. $\endgroup$
    – Cliff AB
    Commented Jan 26, 2017 at 19:23
28
$\begingroup$

Machine Learning seems to have its basis in the pragmatic: practical observation or simulation of reality. Even within statistics, mindless "checking of models and assumptions" can lead to discarding methods that are useful.

For example, years ago, the very first commercially available (and working) Bankruptcy model implemented by the credit bureaus was created through a plain old linear regression model targeting a 0-1 outcome. Technically, that's a bad approach, but practically, it worked.

$\endgroup$
6
  • 5
    $\begingroup$ it's similar to using planetary gravitational models for urban traffic. I find it absurd, but it actually works quite accurately $\endgroup$
    – dassouki
    Commented Jul 21, 2010 at 14:25
  • 5
    $\begingroup$ I am interested in the last statement: "the very first commercially available (and working) Bankruptcy model implemented by the credit bureaus was created through a plain old linear regression model targeting a 0-1 outcome". Which model was it? I believe that the first model was RiskCalc by Moody's, and even the first version was a logistic regression model. The developers of that model were not CS people with a background in ML, but rather in econometrics. $\endgroup$
    – gappy
    Commented Jul 25, 2010 at 2:58
  • 2
    $\begingroup$ I bet they used discriminant analysis before logistic regression, as DA was invented well before LR $\endgroup$ Commented Jul 26, 2010 at 22:56
  • 1
    $\begingroup$ @gappy I'm thinking of the MDS Consumer Bankruptcy model for individual credit bureau records. RiskCalc was a credit risk assessment for companies. The MDS Bankruptcy model differed from the FICO risk models of the time in that the target was Bankruptcy and NOT credit delinquency (such as FICO's original scores). My comment was less about the specifics of ML in that context (because it was barely in use -if at all- at the time the BK model was first built), but related to the fact that practical effectiveness is not necessarily at all related to theoretic restrictions or assumption violations. $\endgroup$ Commented Mar 28, 2012 at 21:56
  • $\begingroup$ Just curious why was it technically a bad approach though. Because it made too many simplifying assumptions that would vastly differ from the reality? $\endgroup$
    – xji
    Commented Nov 30, 2017 at 15:07
28
$\begingroup$

I disagree with this question as it suggests that machine learning and statistics are different or conflicting sciences.... when the opposite is true!

Machine learning makes extensive use of statistics... a quick survey of any machine learning or data mining software package will reveal clustering techniques such as k-means, which are also found in statistics... it will also show dimension reduction techniques such as principal components analysis, also a statistical technique... even logistic regression, yet another one.

In my view, the main difference is that traditionally statistics was used to prove a preconceived theory, and usually the analysis was designed around that principal theory. With data mining or machine learning, the opposite approach is usually the norm: we have the outcome, and we just want to find a way to predict it, rather than ask the question or form the theory "is this the outcome?"

$\endgroup$
24
$\begingroup$

The real problem is that this question is misguided. It is not machine learning vs. statistics, it is machine learning against real scientific advance. If a machine learning device gives the right predictions 90% of the time but I cannot understand "why", what is the contribution of machine learning to science at large? Imagine if machine learning techniques were used to predict the positions of planets: there would be a lot of smug people thinking that they can accurately predict a number of things with their SVMs, but what would they really know about the problem they have on their hands? Obviously, science does not really advance by numerical predictions; it advances by means of models (mental, mathematical) that let us see far beyond just numbers.

$\endgroup$
1
  • 1
    $\begingroup$ +1 This reminds me of the use of models in economics. Econometric models are built for a couple of purposes; namely, policy analysis and forecasting. In general, nobody really cares about forecasting - it's the policy simulations that matter most. As David Hendry has been saying, the best forecasting model is not necessarily the best model for policy analysis - and vice versa. Need to step back and think... What is the purpose of the model? What questions are we trying to answer? And how this fits in with making empirical discoveries. $\endgroup$ Commented Dec 25, 2016 at 16:37
21
$\begingroup$

I have spoken on this at a different forum, the ASA Statistical Consulting eGroup. My response was more specifically about data mining, but the two go hand in hand. We statisticians have turned up our noses at data miners, computer scientists, and engineers. That is wrong. I think part of the reason it happens is that we see some people in those fields ignoring the stochastic nature of their problem. Some statisticians call data mining data snooping or data fishing. Some people do abuse and misuse the methods, but statisticians have fallen behind in data mining and machine learning because we paint them with a broad brush. Some of the big statistical results have come from outside the field of statistics. Boosting is one important example. But statisticians like Breiman, Friedman, Hastie, Tibshirani, Efron, Gelman and others got it, and their leadership has brought statisticians into the analysis of microarrays and other large-scale inference problems. So while the cultures may never mesh, there is now more cooperation and collaboration between computer scientists, engineers and statisticians.

$\endgroup$
19
$\begingroup$

Statistical learning (AKA Machine Learning) has its origins in the quest to create software by "learning from examples". There are many tasks that we would like computers to do (e.g., computer vision, speech recognition, robot control) that are difficult to program but for which it is easy to provide training examples. The machine learning/statistical learning research community developed algorithms to learn functions from these examples. The loss function was typically related to the performance task (vision, speech recognition). And of course we had no reason to believe there was any simple "model" underlying these tasks (because otherwise we would have coded up that simple program ourselves). Hence, the whole idea of doing statistical inference didn't make any sense. The goal is predictive accuracy and nothing else.

Over time, various forces started driving machine learning people to learn more about statistics. One was the need to incorporate background knowledge and other constraints on the learning process. This led people to consider generative probabilistic models, because these make it easy to incorporate prior knowledge through the structure of the model and priors on model parameters and structure. This led the field to discover the rich statistical literature in this area. Another force was the discovery of the phenomenon of overfitting. This led the ML community to learn about cross-validation and regularization and again we discovered the rich statistical literature on the subject.

Nonetheless, the focus of most machine learning work is to create a system that exhibits a certain level of performance, rather than to make inferences about an unknown process. This is the fundamental difference between ML and statistics.

$\endgroup$
0
16
$\begingroup$

Ideally one should have a thorough knowledge of both statistics and machine learning before attempting to answer this question. I am very much a neophyte to ML, so forgive me if what I say is naive.

I have limited experience in SVMs and regression trees. What strikes me as lacking in ML from a stats point of view is a well developed concept of inference.

Inference in ML seems to boil down almost exclusively to predictive accuracy, as measured by (for example) mean classification error (MCE), balanced error rate (BER) or similar. ML is in the very good habit of dividing data randomly (usually 2:1) into a training set and a test set. Models are fit using the training set and performance (MCE, BER etc.) is assessed using the test set. This is an excellent practice that is only slowly making its way into mainstream statistics.

ML also makes heavy use of resampling methods (especially cross-validation), whose origins appear to be in statistics.

However, ML seems to lack a fully developed concept of inference - beyond predictive accuracy. This has two results.

1) There does not seem to be an appreciation that any prediction (or parameter estimate, etc.) is subject to random error and perhaps systematic error (bias). Statisticians accept that this is an inevitable part of prediction and try to estimate the error. Statistical techniques try to find an estimate that has minimum bias and random error. Their techniques are usually driven by a model of the data-generating process, but not always (e.g., the bootstrap).

2) There does not seem to be a deep understanding in ML of the limits of applying a model fitted to current data to a new sample from the same population (in spite of what I said earlier about the training-test data set approach). Various statistical techniques, among them cross-validation and penalty terms applied to likelihood-based methods, guide statisticians in the trade-off between parsimony and model complexity. Such guidelines in ML seem much more ad hoc.

I've seen several papers in ML where cross-validation is used to optimise the fitting of many models on a training dataset - producing better and better fits as the model complexity increases. There appears to be little appreciation that the tiny gains in accuracy are not worth the extra complexity, and this naturally leads to over-fitting. Then all these optimised models are applied to the test set as a check on predictive performance and to prevent overfitting. Two things have been forgotten here. First, the predictive performance will have a stochastic component. Second, multiple tests against a test set will again result in over-fitting. The "best" model will be chosen by the ML practitioner without a full appreciation that he/she has cherry-picked from one realisation of many possible outcomes of this experiment. The best of several tested models will almost certainly not reflect its true performance on new data.
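
A quick simulation of that last point (a toy sketch with arbitrary numbers): even when every candidate "model" is pure noise, picking the one with the best test-set score produces an impressive-looking accuracy that evaporates on genuinely fresh data.

```python
# Repeatedly scoring many models on the same test set, then reporting the best,
# overstates performance -- even when every "model" is pure noise.
import numpy as np

rng = np.random.default_rng(3)
n_test, n_fresh, n_models = 100, 100, 200

y_test = rng.integers(0, 2, n_test)     # labels with no structure at all
y_fresh = rng.integers(0, 2, n_fresh)

# Each "model" is just a random labelling rule (no real predictive power).
preds_test = rng.integers(0, 2, (n_models, n_test))
preds_fresh = rng.integers(0, 2, (n_models, n_fresh))

acc_test = (preds_test == y_test).mean(axis=1)
best = acc_test.argmax()

print(f"best test-set accuracy: {acc_test[best]:.2f}")                         # typically ~0.60, looks good
print(f"same model, fresh data: {(preds_fresh[best] == y_fresh).mean():.2f}")  # back near 0.50
```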

Anyway, that's my 2 cents' worth. We have much to learn from each other.

$\endgroup$
2
  • 2
    $\begingroup$ your comment about The "best" model will be choisen by the ML practitioner... applies equally well to mainstream statistics as well. For in most model selection procedures, one simply conditions on the final model as if no search of the model space had been done (given that model averaging is fairly new). So I don't think you can use that as a "club" to beat the ML practitioner with, so to speak. $\endgroup$ Commented May 29, 2011 at 16:03
  • $\begingroup$ As a ML practitioner, I don't recognise the picture you are painting. The ML literature is almost all about variations of regularisation, MDL, Bayesian, SRM and other approaches of controlling the complexity of the model. From where I sit, it seems that stat's methods of controlling complexity are less structured, but that is bias for you. $\endgroup$ Commented Aug 11, 2011 at 2:25
14
$\begingroup$

This question can also be extended to the so-called super-culture of data science, discussed in David Donoho's 2015 paper 50 Years of Data Science, where he confronts different points of view from statistics and computer science (including machine learning), for instance direct standpoints (from different people) such as:

  • Why Do We Need Data Science When We've Had Statistics for Centuries?
  • Data Science is statistics.
  • Data Science without statistics is possible, even desirable.
  • Statistics is the least important part of data science.

along with assorted historical and philosophical considerations, for instance:

It is striking how, when I review a presentation on today's data science, in which statistics is superficially given pretty short shrift, I can't avoid noticing that the underlying tools, examples, and ideas which are being taught as data science were all literally invented by someone trained in Ph.D. statistics, and in many cases the actual software being used was developed by someone with an MA or Ph.D. in statistics. The accumulated efforts of statisticians over centuries are just too overwhelming to be papered over completely, and can't be hidden in the teaching, research, and exercise of Data Science.

This essay has generated many responses and contributions to the debate.

$\endgroup$
3
  • 3
    $\begingroup$ This looks like a paper that would be worth mentioning in this recent popular thread stats.stackexchange.com/questions/195034, I think nobody mentioned it there. $\endgroup$
    – amoeba
    Commented Mar 21, 2016 at 21:12
  • 1
    $\begingroup$ I think if you post a new answer there summarizing this paper, it will be great. $\endgroup$
    – amoeba
    Commented Mar 21, 2016 at 21:15
  • $\begingroup$ I will, and need to summarize all the given answers for myself first $\endgroup$ Commented Mar 21, 2016 at 21:19
12
$\begingroup$

I don't really know what the conceptual/historical difference between machine learning and statistics is, but I am sure it is not that obvious... and I am not really interested in knowing whether I am a machine learner or a statistician. I think 10 years after Breiman's paper, lots of people are both...

Anyway, I found the question about the predictive accuracy of models interesting. We have to remember that it is not always possible to measure the accuracy of a model; more precisely, we are most often implicitly doing some modelling when measuring errors.

For example, mean absolute error in time-series forecasting is an average over time, and it measures the performance of a procedure for forecasting the median, under the assumption that performance is, in some sense, stationary and shows some ergodic property. If (for some reason) you need to forecast the mean temperature on Earth for the next 50 years, and your modelling performed well for the last 50 years... that does not mean that...
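
To make that implicit modelling step concrete (a standard fact, stated here only to unpack the claim above): the mean absolute error is minimized by the median, while the squared error is minimized by the mean,

$$\operatorname{median}(Y) = \underset{c}{\arg\min}\; \mathbb{E}\,|Y - c|, \qquad \mathbb{E}[Y] = \underset{c}{\arg\min}\; \mathbb{E}\,(Y - c)^2,$$

so the choice of error metric is already a modelling choice about which functional of the predictive distribution you are targeting.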

More generally (if I remember correctly, this is called "no free lunch"), you can't do anything without modelling... In addition, I think statistics tries to find an answer to the question "is something significant or not?", which is a very important question in science and can't be answered through a learning process. To quote John Tukey (was he a statistician?):

The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data

Hope this helps !

$\endgroup$
11
$\begingroup$

Clearly, the two fields face similar but different problems, in similar but not identical ways, with analogous but not identical concepts, and work in different departments, journals and conferences.

When I read Cressie and Read's power divergence statistic, it all snapped into place for me. Their formula (sketched below) generalizes commonly used test statistics into a single family that varies by one exponent, lambda. There are two special cases, lambda = 0 and lambda = 1.
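
In their notation (reconstructed from memory, so treat the constants as a sketch rather than a citation), the family is

$$2n I^{\lambda}(O:E) \;=\; \frac{2}{\lambda(\lambda+1)} \sum_i O_i \left[\left(\frac{O_i}{E_i}\right)^{\lambda} - 1\right],$$

with $\lambda = 1$ reducing to Pearson's $X^2$ and the limit $\lambda \to 0$ giving the likelihood-ratio statistic $G^2 = 2\sum_i O_i \log(O_i/E_i)$.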

Computer Science and Statistics fit along a continuum (that presumably could include other points). At one value of lambda, you get statistics commonly cited in Statistics circles, and at the other you get statistics commonly cited in Comp Sci circles.

Statistics

  • Lambda = 1
  • Sums of squares appear a lot
  • Variance as a measure of variability
  • Covariance as a measure of association
  • Chi-squared statistic as a measure of model fit

Computer science:

  • Lambda = 0
  • Sums of logs appear a lot
  • Entropy as a measure of variability
  • Mutual information as a measure of association
  • G-squared statistic as a measure of model fit
$\endgroup$
10
$\begingroup$

You run a fancy computer algorithm once -- and you get a CS conference presentation/statistics paper (wow, what a fast convergence!). You commercialize it and run it 1 million times -- and you go broke (ouch, why am I getting useless and irreproducible results all the time???) unless you know how to employ probability and statistics to generalize the properties of the algorithm.

$\endgroup$
4
  • 3
    $\begingroup$ I've downvoted this answer. Although with a question such as this it will inevitably involve some personal opinions, IMO we should strive for some more substantive critique. This just comes off as a rant. $\endgroup$
    – Andy W
    Commented May 6, 2012 at 14:58
  • $\begingroup$ @AndyW, this is, of course, an exaggeration of what I see around. A failure to think ahead statistically is true of academic world, too: the replicability of published results in psychology or medical sciences is at most 25% (see, e.g., simplystatistics.tumblr.com/post/21326470429/…) rather than the nominal 95%. The OP wanted statistics to embrace computer science; maybe computer science should embrace some statistics, and I gave the reasons why. $\endgroup$
    – StasK
    Commented May 7, 2012 at 3:25
  • 5
    $\begingroup$ @StasK I think you make some important points, why not try to make them a bit less aggressively? $\endgroup$
    – Gala
    Commented Jun 7, 2013 at 6:42
  • 2
    $\begingroup$ I enjoyed this pithy answer. $\endgroup$ Commented Jul 1, 2015 at 0:05
6
$\begingroup$

There is an area of application of statistics where a focus on the data-generating model makes a lot of sense. In designed experiments, e.g., animal studies, clinical trials, industrial DOEs, statisticians can have a hand in what the data-generating model is. ML tends not to spend much time on this very important problem, as it usually focuses on another very important problem: prediction based on "large" observational data. That is not to say that ML can't be applied to "large" designed experiments, but it is important to acknowledge that statistics has particular expertise on "small" data problems arising from resource-constrained experiments.

At the end of the day I think we can all agree to use what works best to solve the problem at hand. E.g., we may have a designed experiment that produces very wide data with the goal of prediction. Statistical design principles are very useful here and ML methods could be useful to build the predictor.

$\endgroup$
5
$\begingroup$

I think machine learning needs to be a sub-branch under statistics, just like, in my view, chemistry needs to be a sub-branch under physics.

I think the physics-inspired view of chemistry is pretty solid (I guess). I don't think there is any chemical reaction whose equivalent is not known in physical terms. I think physics has done an amazing job of explaining everything we can see at the chemistry level. Now the physicists' challenge seems to be explaining the tiny mysteries at the quantum level, under extreme conditions that are not observable.

Now back to machine learning. I think it too should be a sub-branch under statistics (just as chemistry is a sub-branch of physics).

But it seems to me that, somehow, either the current state of machine learning or that of statistics is not mature enough to perfectly realize this. But in the long run, I think one must become a sub-branch of the other, and I think it's ML that will end up under statistics.

I personally think that "learning" and "analyzing samples" to estimate/infer functions or predictions are all essentially a question of statistics.

$\endgroup$
3
  • 3
    $\begingroup$ Should biology, psychology, and sociology also be "sub-branches" of physics? $\endgroup$
    – amoeba
    Commented Apr 14, 2016 at 15:33
  • $\begingroup$ Right.. Psychology is just input/output involving highly complicated biological machines. One day we may need to send our cars to a psychologist to diagnose its errors (the psychologist itself might be a computer). $\endgroup$
    – caveman
    Commented Apr 14, 2016 at 15:35
  • 1
    $\begingroup$ It looks to me like Mathematics is the father of all. From there we have applied mathematics, from which physics and other things come. Statistics is one of those. I think ML need not be a branch on its own and instead get blended into statistics. But if ML becomes a branch of its own, I prefer it to be a child/sub-branch of statistics. $\endgroup$
    – caveman
    Commented Apr 14, 2016 at 15:41
5
$\begingroup$

From the Coursera course "Data Science in real life" by Brian Caffo

Machine Learning

  • Emphasize predictions
  • Evaluates results via prediction performance
  • Concern for overfitting but not model complexity per se
  • Emphasis on performance
  • Generalizability is obtained through performance on novel datasets
  • Usually, no superpopulation model specified
  • Concern over performance and robustness

Traditional statistical analysis

  • Emphasizes superpopulation inference
  • Focuses on a-priori hypotheses
  • Simpler models preferred over complex ones (parsimony), even if the more complex models perform slightly better
  • Emphasis on parameter interpretability
  • Statistical modeling or sampling assumptions connects data to a population of interest
  • Concern over assumptions and robustness
$\endgroup$
-6
$\begingroup$

As a computer scientist, I am always intrigued when looking at statistical approaches. To me it often looks like the statistical models used in statistical analysis are way too complex for the data in many situations!

For example, there is a strong link between data compression and statistics. Basically, one needs a good statistical model which is able to predict the data well, and this gives very good compression of the data. In computer science, when compressing data, the complexity of the statistical model and the accuracy of the prediction are always very important. Nobody ever wants a data file (containing sound, image or video data) to become bigger after compression!
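
(As I understand it, the link is the standard one between probability models and code length: an ideal prefix code built from a model $P$ spends about $-\log_2 P(x)$ bits on data $x$, so a model that predicts the data better also compresses it better; MDL then uses the total code length -- model plus data-given-model -- as a model-selection criterion.)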

I find that there are more dynamic developments in computer science regarding statistics, like for example Minimum Description Length and Normalized Maximum Likelihood.

$\endgroup$
