104

Known bounds on the Vapnik–Chervonenkis (VC) dimension of neural networks range from $O(E)$ to $O(E^2)$, with $O(E^2V^2)$ in the worst case, where $E$ is the number of edges and $V$ is the number of nodes. The number of training samples needed for a strong generalization guarantee scales linearly with the VC dimension.

This means that for a network with billions of edges, as in the case of successful deep learning models, the training dataset needs from billions of training samples in the best case to quadrillions in the worst case. The largest training sets currently have about a hundred billion samples. Since there is not enough training data, it is unlikely that deep learning models are generalizing. Instead, they are overfitting the training data. This means the models will not perform well on data that is dissimilar to the training data, which is an undesirable property for machine learning.
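
For a rough sense of the orders of magnitude involved, here is a minimal back-of-the-envelope sketch; the edge and node counts are hypothetical round numbers, and the bounds are taken at face value from the paragraph above.

    # Sample counts implied by "samples needed ~ VC dimension" for a hypothetical network.
    E = 10**9   # edges (weights); a made-up round number for a large model
    V = 10**6   # nodes; also made up
    print(f"O(E)       : ~{E:.1e} samples")            # ~1e9
    print(f"O(E^2)     : ~{E**2:.1e} samples")         # ~1e18
    print(f"O(E^2 V^2) : ~{E**2 * V**2:.1e} samples")  # ~1e30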

Given the inability of deep learning to generalize, according to VC dimensional analysis, why are deep learning results so hyped? Merely having a high accuracy on some dataset does not mean much in itself. Is there something special about deep learning architectures that reduces the VC-dimension significantly?

If you do not think the VC-dimension analysis is relevant, please provide evidence/explanation that deep learning is generalizing and is not overfitting. I.e. does it have good recall AND precision, or just good recall? 100% recall is trivial to achieve, as is 100% precision. Getting both close to 100% is very difficult.
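
As a toy illustration of the precision/recall point (the labels below are made up):

    # Why recall or precision alone is uninformative: a classifier that flags
    # everything gets perfect recall; one that flags a single sure case gets
    # perfect precision. Neither is doing anything useful.
    y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]          # 3 positives, 7 negatives

    def precision_recall(y_true, y_pred):
        tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else float("nan")
        recall = tp / (tp + fn) if tp + fn else float("nan")
        return precision, recall

    flag_everything = [1] * len(y_true)
    print(precision_recall(y_true, flag_everything))     # (0.3, 1.0): perfect recall, poor precision

    flag_one_sure_case = [1] + [0] * (len(y_true) - 1)
    print(precision_recall(y_true, flag_one_sure_case))  # (1.0, 0.33): perfect precision, poor recall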

As a contrary example, here is evidence that deep learning is overfitting. An overfit model is easy to fool since it has incorporated deterministic/stochastic noise. See the following image for an example of overfitting.

[Image: example of underfitting, fitting, and overfitting.]

Also, see lower ranked answers to this question to understand the problems with an overfit model despite good accuracy on test data.

Some have responded that regularization solves the problem of a large VC dimension. See this question for further discussion.

6
  • Comments are not for extended discussion; this conversation has been moved to chat.
    – D.W.
    Commented May 15, 2017 at 3:37
  • 8
    I don't think questions about why something is "hyped" are good. The answer is "because people". People take interest in things for a plethora of reasons, including marketing.
    – luk32
    Commented May 16, 2017 at 13:49
  • 3
    Deep learning works in practice. It might be overfitting. It might be completely unjustified. It might be learning secrets of the universe from an eldritch deity. But the hype is coming from practitioners who are suddenly able to write 30 lines of code and teach a camera to scan signatures and match them with stored ones to validate bank transactions. Or tag unknown people in photographs. Etc. Maybe you've heard the line "it's not an insult if it's true"? Well, it's not hype if it works. There are lots of problems it doesn't work on, and there is excessive popular hype. But it works in real-life applications. Commented Jun 27, 2019 at 15:51
  • 1
    @yters The fact that adversarial examples exist is somewhat beside the point of generalization. I can train a neural network on 10k data points and then test it on 100k data points and it will do amazingly well. That is generalization. The fact that we can deliberately concoct adversarial examples doesn't mean the network doesn't work, any more than the fact that we can concoct optical illusions means your vision doesn't work. Your entire question is predicated on the idea that neural networks don't generalize. But it's an empirical fact that they do. Commented Jun 27, 2019 at 20:07
  • 1
    @StellaBiderman You are conflating good out-of-sample accuracy with generalization. Generalization implies good out-of-sample accuracy, but the converse does not hold. One example I've heard about is a neural network that "learned" to distinguish criminals from non-criminals by memorizing the white background of the prison photos. While the NN had great accuracy on an out-of-sample dataset, it did not learn any general principle about criminal facial features. This is why we can concoct such absurd adversarial examples for NNs: the NN is identifying 'cat' with incidental features.
    – yters
    Commented Jun 30, 2019 at 20:39

7 Answers

96

"If the map and the terrain disagree, trust the terrain."

It's not really understood why deep learning works as well as it does, but old concepts from learning theory such as the VC dimension certainly appear not to be very helpful.

The matter is hotly debated; see e.g.:

C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Understanding Deep Learning Requires Rethinking Generalization (2016).

Regarding the issue of adversarial examples, the problem was discovered in:

C. Szegedy et al., Intriguing Properties of Neural Networks (2013).

It is further developed in:

I. Goodfellow, J. Shlens, C. Szegedy, Explaining and Harnessing Adversarial Examples (2014).

There is a lot of follow-on work.

Update March 2020. A new hypothesis that appears to explain some of the mismatch between the clear over-parameterisation of modern (feed-forward) NNs and their good recognition performance is Frankle and Carbin's Lottery Ticket Hypothesis from 2018:

J. Frankle, M. Carbin, The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks (2018).

The claim is that a "randomly-initialised, dense [feed-forward] neural network contains a subnetwork that is initialised such that when trained in isolation it can match the test accuracy of the original network after training for at most the same number of iterations." Regarding the original question, the Lottery Ticket Hypothesis might be understood as saying that:

  • Training by stochastic gradient descent searches for small subnetworks that work well and deemphasises the rest of the overparameterised network's learning capacity.

  • The bigger the original network, the more likely it is to contain a small subnetwork with good performance on the task at hand.

This has found empirical support, e.g. in

and theoretical support in:

As far as I'm aware, it has not yet been possible to generalise the Lottery Ticket Hypothesis to recurrent NNs.
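
As an illustration of the pruning procedure behind the hypothesis, here is a minimal sketch (not the authors' code) of iterative magnitude pruning with rewinding to the original initialisation. It assumes PyTorch and a generic two-class dataset of shape (n_samples, n_features); network sizes and hyperparameters are placeholders.

    import copy
    import torch
    import torch.nn as nn

    def train(model, data, targets, mask, steps=1000, lr=0.1):
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(steps):
            # Keep pruned weights pinned at zero throughout training.
            with torch.no_grad():
                for p, m in zip(model.parameters(), mask):
                    p.mul_(m)
            opt.zero_grad()
            loss_fn(model(data), targets).backward()
            opt.step()
        with torch.no_grad():
            for p, m in zip(model.parameters(), mask):
                p.mul_(m)
        return model

    def find_winning_ticket(data, targets, rounds=3, prune_frac=0.2):
        model = nn.Sequential(nn.Linear(data.shape[1], 64), nn.ReLU(), nn.Linear(64, 2))
        init_state = copy.deepcopy(model.state_dict())            # remember the original initialisation
        mask = [torch.ones_like(p) for p in model.parameters()]   # 1 = keep, 0 = pruned
        for _ in range(rounds):
            train(model, data, targets, mask)
            # Prune a fraction of the smallest-magnitude surviving parameters
            # (for simplicity, biases are treated the same way as weights here).
            for p, m in zip(model.parameters(), mask):
                surviving = p[m.bool()].abs()
                if surviving.numel() == 0:
                    continue
                m[p.abs() < surviving.quantile(prune_frac)] = 0.0
            # Rewind the surviving weights to their original initial values.
            model.load_state_dict(init_state)
        return model, mask   # the "winning ticket" is the original initialisation times the mask

To test the claim quoted above, one would retrain only this masked subnetwork from its original initialisation and compare its test accuracy with that of the full network.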

Update March 2021. The original 2016 paper Understanding Deep Learning Requires Rethinking Generalization has been updated. Here is the new version:

C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Understanding Deep Learning (Still) Requires Rethinking Generalization.

5
  • Comments are not for extended discussion; this conversation has been moved to chat.
    – D.W.
    Commented May 15, 2017 at 3:36
  • When you say "There is a lot of follow-on work", are you referring to the last 2014 paper? The first two papers you mention are fairly recent. Could you update with the papers you're referring to?
    – VF1
    Commented Jun 17, 2017 at 18:55
  • 13
    Strong +1 for "If the map and the terrain disagree, trust the terrain." The models work extremely well in practice regardless of whether the math says they should. From a scientific POV, this happens all the time and, if anything, makes problems more interesting. Nobody read Razborov and Rudich's work on natural proofs and went "well, I guess P vs NP isn't an interesting question after all." They went and figured out that it might be possible to use algebraic geometry to do complexity theory. From the point of view of science, problems that transcend our understanding are better, not worse. Commented Jun 27, 2019 at 15:45
  • Thank you for this nice reference list and (+1) for the quote. I would like to update and add this discussion: discuss.huggingface.co/t/… Commented Jul 9, 2021 at 13:30
  • @stackExchangeUser Feel free to add the references. Commented Jul 11, 2021 at 20:33
71

"Given the inability of Deep Learning to generalize, according to VC dimensional analysis [...]"

No, that's not what VC dimensional analysis says. VC dimensional analysis gives some sufficient conditions under which generalization is guaranteed. But the converse ain't necessarily so. Even if you fail to meet those conditions, the ML method still might generalize.

Put another way: deep learning works better than VC dimensional analysis would lead you to expect (better than VC analysis "predicts"). That's a shortcoming of VC dimensional analysis, not a shortcoming of deep learning. It doesn't imply that deep learning is flawed. Rather, it means that we don't know why deep learning works as well as it does -- and VC analysis is unable to provide any useful insights.

High VC dimension does not imply that deep learning can be fooled. High VC dimension doesn't guarantee anything at all about whether it can be fooled in practical situations. VC dimension provides a unidirectional, worst-case bound: if you meet these conditions, then good things happen, but if you don't meet these conditions, we don't know what will happen (maybe good things will still happen anyway, if nature behaves better than the worst possible case; VC analysis doesn't promise that good things can't/won't happen).

It could be that the VC dimension of the model space is large (it can represent very complex patterns), but nature is explained by simple patterns, and the ML algorithm learns the simple pattern present in nature (e.g., because of regularization) -- in this case, the VC dimension would be high but the model would still generalize (for the particular kind of pattern that is present in nature).
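
As a toy numerical illustration of this scenario, with ridge-regularized polynomial regression standing in for "a high-capacity model class plus regularization" (all numbers and model choices here are arbitrary):

    # A degree-20 polynomial model class can represent very complex patterns,
    # yet the regularized fit recovers the simple pattern behind the data and
    # scores well on held-out points.
    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    x_train = rng.uniform(-1, 1, size=(50, 1))
    x_test = rng.uniform(-1, 1, size=(500, 1))
    y_train = 3 * x_train[:, 0] + 1 + 0.1 * rng.standard_normal(50)   # nature is simple
    y_test = 3 * x_test[:, 0] + 1

    model = make_pipeline(PolynomialFeatures(degree=20), Ridge(alpha=1.0))
    model.fit(x_train, y_train)
    print("held-out R^2:", model.score(x_test, y_test))   # typically close to 1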

That said... there is growing evidence that deep learning can be fooled by adversarial examples. But be careful about your chain of reasoning. The conclusions you are drawing don't follow from the premises you started with.

4
  • 8
    High VC dimension does imply that it is harder to generalize (in some sense, at least when dealing with arbitrary distributions). The $\Omega\left(\sqrt{\frac{d}{n}}\right)$ generalization error lower bound means exactly that for a number of samples small compared to the VC dimension, there exists a distribution such that relative to it any algorithm will experience high generalization error (with high probability).
    – Ariel
    Commented May 14, 2017 at 6:49
  • 7
    -1 for "High VC dimension doesn't guarantee anything at all." This is not true: high VC dimension implies sample complexity lower bounds for PAC learning. A good answer should address worst-case vs. "real-life" distributions. Commented May 15, 2017 at 4:19
  • 1
    @SashoNikolov, good point -- thank you! Edited.
    – D.W.
    Commented May 15, 2017 at 16:56
  • This post was in the low-quality review queue. Given its content, length, votes, and quality, that is ridiculous. I am pointing it out here, but it may need a meta post, because something is really wrong.
    – Evil
    Commented Sep 25, 2019 at 0:02
26

Industry people have no regard for VC dimension, hooligans...

On a more serious note, although the PAC model is an elegant way to think about learning (in my opinion at least), and is complex enough to give rise to interesting concepts and questions (such as VC dimension and its connection to sample complexity), it has very little to do with real life situations.

Remember that in the PAC model you are required to handle arbitrary distributions; this means your algorithm should handle adversarial distributions. When trying to learn some phenomenon in the real world, no one is giving you "adversarial data" to mess up your results, so requiring a concept class to be PAC learnable might be way too strong. Sometimes you can bound the generalization error independently of the VC dimension, for a specific class of distributions. This is the case for margin bounds, which are formulated independently of the VC dimension. They can promise low generalization error if you can guarantee a high empirical margin (which, of course, cannot happen for all distributions: e.g., take two close points in the plane with opposite labels and concentrate the distribution on them).
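
For concreteness, here is one representative margin bound of the kind alluded to above (this particular form follows the standard Rademacher-complexity treatment for linear predictors with $\|w\|\le 1$ on data with $\|x\|\le R$; constants and the exact form vary across sources): with probability at least $1-\delta$ over an i.i.d. sample of size $m$,

$$\Pr_{(x,y)\sim\mathcal{D}}\bigl[y\,h(x)\le 0\bigr] \;\le\; \underbrace{\frac{1}{m}\sum_{i=1}^{m}\mathbf{1}\bigl[y_i\,h(x_i) < \gamma\bigr]}_{\text{empirical margin violations}} \;+\; O\!\left(\frac{R}{\gamma\sqrt{m}}\right) \;+\; \sqrt{\frac{\ln(1/\delta)}{2m}}.$$

The VC dimension of the full class of linear separators never appears; only the achieved margin $\gamma$, the data radius $R$, and the sample size $m$ do.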

So, putting the PAC model and VC dimension aside, I think the hype comes from the fact that they just seem to work and succeed in tasks that were previously not possible (one of the latest achievements that comes to mind is AlphaGo). I know very little about neural nets, so I hope someone with more experience will pitch in, but to my knowledge there are no good guarantees yet (definitely not like in the PAC model). Perhaps under the right assumptions one could formally justify the success of neural nets (I assume there is work on the formal treatment of neural nets and "deep learning", so I'm hoping people with more knowledge of the subject can link some papers).

1
  • Comments are not for extended discussion; this conversation has been moved to chat.
    – D.W.
    Commented May 15, 2017 at 3:37
15

Given the inability of Deep Learning to generalize,

I don't know where you get that from. Empirically, generalization is measured as the score (e.g. accuracy) on unseen data.

The reason CNNs are used is simple: CNNs work much better than anything else. See ImageNet 2012, for example:

  • CNNs: 15.315% top-5 error (an early result; CNNs have since improved to about 4% top-5 error)
  • Best non-CNN: 26.172% top-5 error (source; to my knowledge, techniques that do not use CNNs have not gotten below 25% top-5 error)

Create a classifier which is better and people will shift to that.
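
For reference, a minimal sketch of how the top-5 error figures above are computed; the scores below are random stand-ins for a real model's outputs.

    # Top-5 error: a prediction counts as correct if the true class is among
    # the model's five highest-scoring classes.
    import numpy as np

    def top_k_error(scores, labels, k=5):
        # scores: (n_samples, n_classes), labels: (n_samples,) true class ids
        top_k = np.argsort(scores, axis=1)[:, -k:]
        hits = np.any(top_k == labels[:, None], axis=1)
        return 1.0 - hits.mean()

    rng = np.random.default_rng(0)
    scores = rng.random((1000, 1000))      # 1000 fake images, 1000 classes
    labels = rng.integers(0, 1000, size=1000)
    print(top_k_error(scores, labels))     # ~0.995 for random guessing; ~0.26 / ~0.04 for the models above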

UPDATE: I will award an answer to anyone providing published evidence that machine learning in general is easily fooled, like this evidence for Deep Learning.

This is not the case. You can create a classifier which is extremely simple on a simple dataset. It will not be possible to fool it (it doesn't even matter what "easy" means), but it is also not interesting.

13
  • 3
    A low error does not imply generalization. It is a necessary, but not sufficient, condition.
    – yters
    Commented May 15, 2017 at 2:05
  • 3
    @yters Please define generalization, then. Commented May 16, 2017 at 6:39
  • 6
    @yters, this comment makes me think you haven't read much about machine learning. Martin said accuracy on unseen data. You're talking about accuracy on training data. You're basically correct about what generalization is, but please realize that everyone else here understands that too. Commented May 16, 2017 at 15:32
  • 2
    @yters I am pretty sure Ken (and many people on this site, including myself) knows this. If your test set does not represent your dataset, however, you can't make any statement about generalization. While it is worth keeping this in mind, I do not see how it helps you in any way with this question. You just have to assume / make sure that your test set represents your data at production time. In fact, it is really easy to show that you can make any classifier arbitrarily bad if the training samples do not represent the distribution. Commented May 16, 2017 at 19:26
  • 3
    That's obvious. You can't expect a model to generalize well if it's trained and validated on the wrong data. You need better data, not a better model.
    – Emre
    Commented May 18, 2017 at 5:24
12

The one-word answer is "regularization". The naive VC-dimension formula does not really apply here, because regularization means the weights cannot be arbitrary: only a tiny (infinitesimal?) proportion of weight combinations have acceptable loss after regularization. The effective dimension is many orders of magnitude smaller as a result, so generalization can occur with the training sets we have. Real-life results bear out that overfitting isn't generally happening.
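
As a toy numerical illustration of the point that only a tiny proportion of the available parameters is effectively used after regularization, here is a sketch with L1 regularization (chosen because it produces exact zeros and so makes the effect easy to count; deep nets more commonly use weight decay or dropout, but the capacity-reduction intuition is the same). The data and model are made-up stand-ins, not a deep network.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n_samples, n_params = 200, 5000                 # far more parameters than samples
    X = rng.standard_normal((n_samples, n_params))
    true_w = np.zeros(n_params)
    true_w[:10] = 1.0                               # nature only uses 10 of them
    y = X @ true_w + 0.1 * rng.standard_normal(n_samples)

    model = Lasso(alpha=0.05, max_iter=5000).fit(X, y)
    used = int(np.sum(model.coef_ != 0))
    print(f"nonzero weights after L1 regularization: {used} / {n_params}")  # typically a few dozen at most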

3
  • 3
    I've seen the repeated claim that real-life results show deep learning generalizes. What exactly are the results that show generalization? All I've seen so far is that DL achieves low error rates on particular datasets, which does not in itself mean that DL generalizes.
    – yters
    Commented May 16, 2017 at 1:44
  • 4
    It shows good results ("good" = better than other ML methods) on data that it was not trained on. I'm not sure how else you want to practically measure generalization.
    – lvilnis
    Commented May 16, 2017 at 13:53
  • Concise and correct answer.
    – Ryan
    Commented May 13, 2021 at 14:06
3

We address the paper Understanding Deep Learning Requires Rethinking Generalization in:

Charles H. Martin and Michael W. Mahoney, Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior.

See: https://arxiv.org/pdf/1710.09553.pdf

Basically, we argue that the VC bounds are too loose because the fundamental approach, and the way the statistical limit is taken, are unrealistic.

A better approach lies in Statistical Mechanics, which considers a class of data-dependent functions and takes the Thermodynamic limit (not just the limit of large numbers).

Moreover, we also point out how the natural discontinuities in deep nets lead to phase transitions in the learning curve, which we believe is being observed in the Google paper (above).

With regard to the limits, see section 4.2 of our paper:

"Clearly, if we fix the sample size m and let [the size of the function class] N → ∞ , [or vise versa, fix N, let m → ∞] the we should not expect a non-trivial result, since [N] is becoming larger but the sample size is fixed. Thus, [in Statistical Mechanics] one typically considers the case that m, N → ∞ such that α = m/N is a fixed constant."

That is, very rarely would we just add more data (m) to a deep net. We always increase the size of the net (N) too, because we know that we can capture more detailed features / information from the data. Instead, in practice we do what we argue for in the paper: take the limit of large size, with the ratio m/N fixed (as opposed to, say, fixing m and letting N increase).
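
In symbols, the contrast drawn in the quoted passage is between

$$\text{classical limit: } N \text{ fixed},\; m \to \infty \qquad \text{vs.} \qquad \text{thermodynamic limit: } m, N \to \infty \text{ with } \alpha = \tfrac{m}{N} \text{ fixed},$$

which is just a restatement of the quote, not an additional claim.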

These results are well known in the Statistical Mechanics of Learning. The analysis is more complicated, but the results lead to a much richer structure that explains many phenomena in deep learning.

Also, and in particular, it is known that many bounds from statistics either become trivial or do not apply to non-smooth probability distributions, or when the variables take on discrete values. With neural networks, non-trivial behavior arises because of discontinuities (in the activation functions), leading to phase transitions (which arise in the thermodynamic limit).

The paper we wrote tries to explain the salient ideas to a computer science audience.

Vapnik himself realized that his theory was not really applicable to neural networks...way back in 1994

"The extension of [the VC dimension] to multilayer networks faces [many] difficulties..the existing learning algorithms can not be viewed as minimizing the empirical risk over the entire set of functions implementable by the network...[because] it is likely...the search will be confined to a subset of [these] functions...The capacity of this set can be much lower than the capacity of the whole set...[and] may change with the number of observations.  This may require a theory that considers the notion of a non-constant capacity with an 'active' subset of functions"
Vapnik, Levin, and LeCun 1994

http://yann.lecun.com/exdb/publis/pdf/vapnik-levin-lecun-94.pdf

While this is not easy to treat with VC theory, it is not an issue for stat mech... and what they describe looks very much like the Energy Landscape Theory of protein folding (which will be the topic of a future paper).

3
  • This sounds interesting, but I'm not sure I follow your argument. Can you elaborate on the first sentence, i.e., on how the fundamental approach / statistical limit is unrealistic, in a self-contained way that doesn't require understanding statistical mechanics? What assumptions do VC bounds make, and why are they unrealistic? Perhaps you can edit your answer to include that information?
    – D.W.
    Commented Nov 26, 2017 at 18:40
  • I added a reference to the original work by Vapnik and LeCun (1994) that discusses the issue. Commented Nov 27, 2017 at 19:47
  • And added some clarification. Commented Nov 27, 2017 at 19:54
2

No one seems to have pointed out in the above answers that the VC-dimension formula quoted is only for a one-layer neural network. My guess is that the VC dimension actually grows exponentially as the number of layers L increases. My reasoning is based on considering deep neural networks where the activation functions are replaced by polynomial ones. Then the degree of the composed polynomials grows exponentially as the layers increase.
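
A one-line version of the degree-counting behind this guess: if each layer applies an affine map followed by a degree-$d$ polynomial activation, then the network as a whole computes a polynomial of degree

$$\deg\bigl(p_L \circ p_{L-1} \circ \cdots \circ p_1\bigr) \;=\; d^{\,L},$$

since affine maps do not raise the degree and each composition multiplies it by $d$; so the richness of the representable function class grows exponentially with the depth $L$.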

