32
$\begingroup$

I've come across the terms object detection, semantic segmentation and localization in quite a lot of publications, and I would like clear definitions that show how they differ from one another. It would be nice if you could give sources for your definitions.

$\endgroup$
1

5 Answers

19
$\begingroup$

I have read a lot of papers about Object Detection, Object Recognition, Object Segmentation, Image Segmentation and Semantic Image Segmentation. Here are my conclusions, which may not be entirely correct:

Object Recognition: Given an image, you have to detect all objects (from a restricted set of classes that depends on your dataset), localize each one with a bounding box, and assign a class label to that bounding box. The image below shows a typical output of a state-of-the-art object recognition system.

object recognition

Object Detection: This is like object recognition, but there are only two classes: object bounding boxes and non-object bounding boxes. For example, in car detection you have to detect all cars in a given image and return their bounding boxes.

Object Detection

Object Segmentation: As in object recognition, you recognize all objects in an image, but the output marks each object by classifying the pixels of the image rather than by drawing a bounding box.

object segmentation

Image Segmentation: Here you partition the image into regions. The output does not label the segments; pixels that are consistent with each other should simply end up in the same segment. Extracting superpixels from an image and foreground-background segmentation are examples of this task.

image segmentation

Semantic Segmentation: Here you have to label each pixel with a class, covering both objects (Car, Person, Dog, ...) and non-objects (Water, Sky, Road, ...). In other words, in semantic segmentation you label every region of the image.

semantic segmentation
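To make the differences concrete, here is a minimal sketch of what the output of each task could look like as plain Python data. The names (`LabeledBox`, `detect`, `to_object_segmentation`) and the toy values are my own illustration, not taken from any paper or library, and treating object segmentation as "semantic segmentation with the non-object classes blanked out" is only one reading of the definitions above.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

# Assumed box convention for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[int, int, int, int]


@dataclass
class LabeledBox:
    box: Box
    label: str


# Object recognition: every object gets a bounding box AND a class label.
recognition_output = [
    LabeledBox((2, 3, 40, 30), "car"),
    LabeledBox((45, 5, 60, 35), "person"),
    LabeledBox((50, 2, 90, 28), "car"),
]


def detect(recognitions: List[LabeledBox], target: str) -> List[Box]:
    """Object detection in the two-class sense: keep only the boxes
    of one target class (object vs. non-object)."""
    return [r.box for r in recognitions if r.label == target]


# Image segmentation: one anonymous region id per pixel, no class names.
image_segmentation = [
    [0, 0, 0, 0],
    [1, 2, 2, 0],
    [1, 2, 2, 1],
]

# Semantic segmentation: one class name per pixel, objects and non-objects.
semantic_segmentation = [
    ["sky", "sky", "sky", "sky"],
    ["road", "car", "car", "sky"],
    ["road", "car", "car", "road"],
]


def to_object_segmentation(
    semantic: List[List[str]], object_classes: Set[str]
) -> List[List[Optional[str]]]:
    """Object segmentation (as read here): keep per-pixel labels for
    object classes only and mark everything else as background (None)."""
    return [
        [label if label in object_classes else None for label in row]
        for row in semantic
    ]


car_boxes = detect(recognition_output, "car")
object_segmentation = to_object_segmentation(semantic_segmentation, {"car"})
```

Note that `image_segmentation` and `semantic_segmentation` have the same shape; the only difference is whether the per-pixel values carry a meaning (class names) or are just region ids.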

$\endgroup$
3
  • $\begingroup$ nice answer. I will note that cs231n.stanford.edu/slides/winter1516_lecture8.pdf slide 8 uses a different definition of object detection that detects multiple classes and multiple instances within the same class (I do not know if there is a single accepted definition or not, so this may just be due to ambiguity). $\endgroup$
    – Keith
    Commented Apr 29, 2016 at 15:59
  • 1
    $\begingroup$ instance segmentation, like semantic segmentation but one has to label the cows as separate $\endgroup$
    – titus
    Commented Jun 20, 2016 at 19:44
  • 2
    $\begingroup$ The slides from the first comment are here now :- cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf $\endgroup$
    – Shatu
    Commented Jun 7, 2017 at 8:57
11
$\begingroup$

Since this issue is still not quite clear even now in 2019, and it might help new ML-Learners choose, here is a very good image showing the differences:

(Localisation is the bounding box around the "sheep" class, drawn after the image has been classified.) Source: https://towardsdatascience.com/detection-and-segmentation-through-convnets-47aa42de27ea

$\endgroup$
3
$\begingroup$

I believe just "localization" means "single object classification + localization using a 2D or 3D bounding box".

"Object detection" is localizing + classifying all instances of known object classes in question.

Semantic Segmentation is basically per-pixel classification.

Regarding the metrics involved (source: https://devblogs.nvidia.com/parallelforall/deep-learning-object-detection-digits/ ):

Precision is the ratio of the accurately identified objects to the total number of predicted objects (ratio of true positives to true positives plus false positives).

Recall is the ratio of the accurately identified objects to the total number of actual objects in the images (ratio of true positives to true positives plus false negatives).

mAP: a simplified mean Average Precision score based on the product of the precision and recall for DetectNet. It’s a good combined measure for how sensitive the network is to objects of interest and how well it avoids false alarms.
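As a rough illustration of how precision and recall could be computed for bounding boxes, here is a minimal sketch. It assumes a prediction counts as "accurately identified" when its intersection-over-union (IoU) with a not-yet-matched actual box is at least 0.5, and it uses simple greedy matching; the threshold, the matching strategy and the function names are my assumptions and not how DetectNet is documented to work. Missed objects (false negatives) go into the recall denominator, so that it equals the total number of actual objects.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predicted, actual, iou_threshold=0.5):
    """Greedily match each predicted box to the best remaining actual box."""
    unmatched = list(actual)
    true_positives = 0
    for box in predicted:
        best = max(unmatched, key=lambda gt: iou(box, gt), default=None)
        if best is not None and iou(box, best) >= iou_threshold:
            true_positives += 1
            unmatched.remove(best)  # each actual object can be matched once
    false_positives = len(predicted) - true_positives
    false_negatives = len(unmatched)  # actual objects that were missed
    precision = (
        true_positives / (true_positives + false_positives) if predicted else 0.0
    )
    recall = (
        true_positives / (true_positives + false_negatives) if actual else 0.0
    )
    return precision, recall


predicted_boxes = [(0, 0, 10, 10), (20, 20, 30, 30), (50, 50, 60, 60)]
actual_boxes = [(0, 0, 10, 10), (21, 21, 31, 31), (100, 100, 110, 110), (200, 200, 210, 210)]
precision, recall = precision_recall(predicted_boxes, actual_boxes)
```

In this toy example two of the three predictions match an actual object and two of the four actual objects are missed, so precision is 2/3 and recall is 2/4.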

$\endgroup$
0
2
$\begingroup$

The term localization is unclear. I will therefore discuss the terms object detection and semantic segmentation.

In object detection, each image pixel is classified according to whether it belongs to a particular class (e.g. face) or not. In practice, this is simplified by grouping pixels into bounding boxes, thereby reducing the problem to deciding whether a bounding box is a tight fit around the object. Because pixels can belong to multiple objects (e.g. face, eye), they can hold multiple labels at the same time.

Semantic segmentation, on the other hand, assigns a class label to each image pixel. It allows for better localization accuracy because it does not rely on the bounding-box simplification, but it strictly enforces a single label per pixel.
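The relationship between the two outputs can be sketched with a toy example: starting from a per-pixel label map (exactly one label per pixel), the tightest bounding box for a class can be derived, and the box will usually contain pixels that do not belong to the object. The helper names `mask_to_box` and `box_fill_ratio` and the inclusive `(x_min, y_min, x_max, y_max)` convention are assumptions made for this illustration.

```python
def mask_to_box(mask, target):
    """Tightest box (x_min, y_min, x_max, y_max), inclusive, around all
    pixels labelled `target`; None if the class does not appear."""
    coords = [
        (x, y)
        for y, row in enumerate(mask)
        for x, label in enumerate(row)
        if label == target
    ]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


def box_fill_ratio(mask, target):
    """Fraction of the bounding box that is actually covered by the object."""
    box = mask_to_box(mask, target)
    if box is None:
        return 0.0
    x_min, y_min, x_max, y_max = box
    area = (x_max - x_min + 1) * (y_max - y_min + 1)
    object_pixels = sum(row.count(target) for row in mask)
    return object_pixels / area


# Toy semantic segmentation output: exactly one label per pixel.
mask = [
    ["sky", "sky",  "sky",  "sky",  "sky"],
    ["sky", "face", "face", "face", "sky"],
    ["sky", "face", "face", "face", "sky"],
    ["sky", "sky",  "face", "sky",  "sky"],
]

face_box = mask_to_box(mask, "face")
face_fill = box_fill_ratio(mask, "face")
```

Here the box spans a 3x3 area but only 7 of its 9 pixels are labelled "face", which is the localization accuracy that is lost by the bounding-box simplification.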

$\endgroup$
-2
$\begingroup$

Semantic segmentation: the task of clustering together the parts of an image that belong to the same object class, e.g. detecting road signs.

$\endgroup$
1
  • 2
    $\begingroup$ But detecting road signs is object detection. Can you explain the difference? $\endgroup$ Commented Mar 22, 2017 at 10:34
