30
$\begingroup$

In the context of machine learning, what is the difference between

  • unsupervised learning
  • supervised learning and
  • semi-supervised learning?

And what are some of the main algorithmic approaches to look at?

$\endgroup$
2
  • 8
    $\begingroup$ First, two lines from wiki: "In computer science, semi-supervised learning is a class of machine learning techniques that make use of both labeled and unlabeled data for training - typically a small amount of labeled data with a large amount of unlabeled data. Semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data)." Does that help? $\endgroup$
    – user28
    Commented Jul 22, 2010 at 16:25
  • $\begingroup$ What do you have in mind with "Algorithmic approaches"? I gave some examples of applications in my answer, is that what you are looking for? $\endgroup$
    – Peter Smit
    Commented Jul 22, 2010 at 16:49

3 Answers

24
$\begingroup$

Generally, the problems of machine learning may be considered variations on function estimation for classification, prediction, or modeling.

In supervised learning one is furnished with inputs ($x_1$, $x_2$, ...) and outputs ($y_1$, $y_2$, ...) and is challenged with finding a function that approximates this behavior in a generalizable fashion. The output could be a class label (in classification) or a real number (in regression) -- these outputs are the "supervision" in supervised learning.
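
As a minimal sketch of the supervised case, here is a toy regression fit by ordinary least squares in plain Python (the data points are made up for illustration):

```python
# Supervised learning sketch: fit y ~ slope*x + intercept by least squares.
# Both the inputs (xs) and the target outputs (ys) are given -- the "supervision".
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # noisy observations of roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

# The fitted function generalizes to unseen inputs:
def predict(x):
    return slope * x + intercept
```

The same skeleton -- fit on labeled pairs, then apply the fitted function to new inputs -- carries over to classification, with class labels in place of real-valued outputs.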

In the case of unsupervised learning, in the base case, you receive inputs $x_1$, $x_2$, ..., but neither target outputs nor rewards from the environment are provided. Depending on the problem (classification or prediction) and your background knowledge of the space sampled, you may use various methods: density estimation (estimating an underlying PDF for prediction), k-means clustering (grouping unlabeled real-valued data), k-modes clustering (grouping unlabeled categorical data), etc.
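
To illustrate the unsupervised case, here is a bare-bones k-means loop on unlabeled 1-D points (the data and the choice of $k = 2$ are invented for the example):

```python
# Unsupervised learning sketch: k-means on unlabeled 1-D points.
# No target outputs are given; structure is inferred from the inputs alone.
points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centers = [0.0, 10.0]  # initial guesses for k = 2 cluster centers

for _ in range(10):  # alternate assignment and update steps
    clusters = [[], []]
    for p in points:
        # assign each point to its nearest center
        nearest = min(range(2), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    # move each center to the mean of its assigned points
    centers = [sum(c) / len(c) for c in clusters]
```

After a few iterations the centers settle near the two natural groups in the data, even though no labels were ever supplied.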

Semi-supervised learning involves function estimation on labeled and unlabeled data. This approach is motivated by the fact that labeled data is often costly to generate, whereas unlabeled data is generally not. The challenge here mostly involves the technical question of how to treat data mixed in this fashion. See this Semi-Supervised Learning Literature Survey for more details on semi-supervised learning methods.
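
One simple way to treat data mixed in this fashion is self-training: fit on the labeled points, pseudo-label the unlabeled ones, and fold them back into the training set. A toy sketch with a 1-nearest-neighbour classifier (the data and labels are invented for illustration):

```python
# Semi-supervised learning sketch: self-training with 1-nearest-neighbour.
labeled = [(1.0, "a"), (9.0, "b")]   # small amount of labeled data
unlabeled = [1.5, 2.0, 8.0, 8.5]     # larger amount of unlabeled data

def predict(x, examples):
    # 1-NN: return the label of the closest labeled example
    return min(examples, key=lambda e: abs(x - e[0]))[1]

# Pseudo-label the unlabeled data with the current model,
# then add the pseudo-labeled points to the training set.
labeled += [(x, predict(x, labeled)) for x in unlabeled]
```

Real self-training schemes typically add only the most confident pseudo-labels per round and iterate; this sketch does a single pass to show the idea.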

In addition to these kinds of learning, there are others, such as reinforcement learning, whereby the learning method interacts with its environment by producing actions $a_1$, $a_2$, ..., which in turn yield rewards or punishments $r_1$, $r_2$, ...

$\endgroup$
6
  • 1
$\begingroup$ Your answer kind of implies that supervised learning is preferable to semi-supervised learning, wherever feasible. Is that correct? If not, when might semi-supervised learning be better? $\endgroup$
    – naught101
    Commented Aug 26, 2013 at 3:25
  • $\begingroup$ @naught101 How do you read that from his answer? I agree with what John says, but I would say the opposite of what you say, namely that semi-supervised learning is preferable to supervised learning wherever possible. That is, if you have some labeled data and some unlabeled data (usually much more than the amount of labeled data), you'd do better if you could make use of all data than if you could only make use of the labeled data. The whole point of using semi-supervised learning is to surpass the performance obtained by doing either supervised learning or unsupervised learning. $\endgroup$ Commented Jun 7, 2017 at 20:31
  • $\begingroup$ @HelloGoodbye: because the only benefit specified for semi-supervised learning is that it's cheaper in some cases, but it's got the added drawback of being more challenging. It seems reasonable to me that fully supervised learning would be easier, and more accurate (all other things being equal), given that more ground truth data is supplied. So I was just asking for examples where, given the choice between the two, semi-supervised would be preferred. Your comment does make sense, but is there a case where all data is labeled and you'd still prefer semi-supervised? $\endgroup$
    – naught101
    Commented Jun 8, 2017 at 1:01
  • $\begingroup$ @naught101 I guess if all data is labeled, you don't win very much by using semi-supervised learning instead of using normal supervised learning. When you have a lot of unlabeled data and do semi-supervised learning, the main reason you see improved performance is because you do transfer learning and are able to draw experience from the unlabeled data as well. $\endgroup$ Commented Jun 9, 2017 at 21:05
  • $\begingroup$ @naught101 However, by giving the network the task of reproducing the input data as well as possible from the output data (i.e. implementing an autoencoder, which is a kind of unsupervised learning), the network is forced to learn good representations of the data. This may act as a kind of regularisation, which in turn can also prove beneficial. So there could perhaps be a small win in using semi-supervised learning instead of normal supervised learning, even if all data were labeled. How big this effect is, though, I don't know. $\endgroup$ Commented Jun 9, 2017 at 21:05
14
$\begingroup$

Unsupervised Learning

Unsupervised learning is when you have no labeled data available for training. Clustering methods are typical examples.

Supervised Learning

In this case your training data consists of labeled data. The problem you solve here is often predicting the labels for data points that have none.

Semi-Supervised Learning

In this case both labeled and unlabeled data are used. Deep belief networks are one example: some layers learn the structure of the data (unsupervised), while one layer is trained to make the classification (supervised).

$\endgroup$
0
8
$\begingroup$

I don't think that supervised/unsupervised is the best way to think about it. For basic data mining, it's better to think about what you are trying to do. There are four main tasks:

  1. Prediction. If you are predicting a real number, it is called regression; if you are predicting a whole number or class, it is called classification.

  2. Modeling. Modeling is the same as prediction, but the model is comprehensible by humans. Neural networks and support vector machines work great, but do not produce comprehensible models [1]. Decision trees and classic linear regression are examples of easy-to-understand models.

  3. Similarity. If you are trying to find natural groups of attributes, it is called factor analysis; if you are trying to find natural groups of observations, it is called clustering.

  4. Association. It is much like correlation, but for enormous binary datasets.
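
The association task above boils down to counting co-occurrences. A tiny sketch computing the support and confidence of one rule over a made-up transaction set:

```python
# Association sketch: support and confidence for the rule {bread} -> {butter}.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]

n = len(transactions)
# support: fraction of transactions containing the itemset
support_bread = sum("bread" in t for t in transactions) / n
support_both = sum({"bread", "butter"} <= t for t in transactions) / n
# confidence: conditional frequency of butter given bread
confidence = support_both / support_bread
```

Association-rule miners such as Apriori do exactly this, but prune the exponential space of candidate itemsets using minimum-support thresholds.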

[1] Apparently Goldman Sachs created tons of great neural networks for prediction, but then no one understood them, so they had to write other programs to try to explain the neural networks.

$\endgroup$
3
  • $\begingroup$ Can you give more information on the GS story? (Not sure why I can't comment directly on your comment.) $\endgroup$
    – Y A
    Commented Jul 19, 2011 at 8:50
  • $\begingroup$ I can't remember exactly where I read that, but here is some more info on AI @ GS: hplusmagazine.com/2009/08/06/… $\endgroup$ Commented Jul 19, 2011 at 22:47
  • $\begingroup$ I have this feeling that 1,2 describe learning in a supervised setting and 3,4 reside in an unsupervised setting. Also: what if you look for similarities in order to predict? Is that considered modelling? $\endgroup$ Commented May 11, 2017 at 7:34
