Keeping Up with Recent Research Trends: With a Focus on Deep Learning
Lecture 1, Fei-Fei Li & Justin Johnson & Serena Yeung, 4/4/2017
[Diagram: Computer Vision and its related fields. Labels: Neuroscience; Machine learning; Speech, NLP; Information retrieval; Mathematics; Computer Science; Biology; Engineering; Physics; Robotics; Cognitive sciences; Psychology; graphics, algorithms, theory, ...; Image processing; systems, architecture, ...; optics]
R-CNN
Piotr Dollár, Ross Girshick
search (FAIR)
[Figure labels: RoIAlign, conv, class, box]
Figure 1. The Mask R-CNN framework for instance segmentation.
a fixed set of categories without differentiating object instances. Given this, one might expect a complex method is required to achieve good results. However, we show that a surprisingly simple, flexible, and fast system can surpass
Show and Tell: A Neural Image Caption Generator
Oriol Vinyals (Google, vinyals@google.com), Alexander Toshev (Google, toshev@google.com), Samy Bengio (Google, bengio@google.com), Dumitru Erhan (Google, dumitru@google.com)
Abstract
Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU-1 score (the higher the
[Figure: an input image passes through a vision deep CNN and then a language generating RNN, producing captions such as "A group of people shopping at an outdoor market." and "There are many vegetables at the fruit stand."]
Figure 1. NIC, our model, is based end-to-end on a neural network consisting of a vision CNN followed by a language generating RNN. It generates complete sentences in natural language from an input image, as shown on the example above.
existing solutions of the above sub-problems, in order to go from an image to its description [6, 16]. In contrast, we
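The training objective named in the Show and Tell abstract above (maximize the likelihood of the target sentence given the image) can be illustrated with a small sketch. It assumes the sentence likelihood factorizes word by word via the chain rule, and it uses hand-written toy distributions in place of a real RNN; the function name and the data are hypothetical and are not taken from the paper's code.

```python
import math

def sentence_log_likelihood(step_probs, target_words):
    """Sum of log p(word_t | image, earlier words) over one sentence.

    step_probs[t] is a dict mapping each vocabulary word to the probability
    a (hypothetical) captioning model assigns to it at step t.
    """
    if len(step_probs) != len(target_words):
        raise ValueError("need exactly one distribution per target word")
    return sum(math.log(p[w]) for p, w in zip(step_probs, target_words))

# Toy distributions standing in for the RNN's per-step softmax outputs.
toy_probs = [
    {"a": 0.5, "dog": 0.25, "runs": 0.25},
    {"a": 0.25, "dog": 0.5, "runs": 0.25},
    {"a": 0.25, "dog": 0.25, "runs": 0.5},
]
log_lik = sentence_log_likelihood(toy_probs, ["a", "dog", "runs"])
# Training would adjust the model so that this value increases,
# which is the same as minimizing the negative log-likelihood.
loss = -log_lik
```

Working in log space turns the product of per-word probabilities into a sum, which avoids numerical underflow on long sentences.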
Perceptual Generative Adversarial Networks for Small Object Detection
Jianan Li, Xiaodan Liang, Yunchao Wei, Tingfa Xu, Jiashi Feng, Shuicheng Yan
Abstract
Detecting small objects is notoriously challenging due to their low resolution and noisy representation. Existing object detection pipelines usually detect small objects through learning representations of all the objects at multiple scales. However, the performance gain of such ad hoc architectures is usually limited to pay off the computational cost. In this work, we address the small object detection problem by developing a single architecture that internally lifts representations of small objects to “super-resolved” ones, achieving similar characteristics as large objects and thus more discriminative for detection. For this purpose, we propose a new Perceptual Generative Adversarial Network (Perceptual GAN) model that improves small object
[Figure: the Perceptual GAN maps features for a small instance to super-resolved features that approximate (≈) the features for a large instance]
Figure 1. Large and small objects exhibit different representations from high-level convolutional layers of a CNN detector. The representations of large objects are discriminative while those of small objects are of low resolution, which hurts the detection accuracy. In this work, we introduce the Perceptual GAN model to enhance the representations for small objects to be similar to real large objects, thus improve detection performance on the small objects.
cs.CV] 20 Jun 2017
and Cityscapes (bottom) using a single ResNet-101-FPN network.
Method         PQ    PQ^Th  PQ^St  mIoU  AP
DIN [1]        53.8  42.5   62.1   -     28.6
Panoptic FPN   58.1  52.0   62.5   75.7  33.0
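The slide does not define the PQ column, so the following is a minimal sketch of panoptic quality as it is commonly defined in the panoptic segmentation literature: the summed IoU of matched segment pairs divided by |TP| + 0.5 |FP| + 0.5 |FN|. PQ^Th and PQ^St are then the same quantity averaged over "thing" and "stuff" classes. The function name and the toy numbers are hypothetical.

```python
def panoptic_quality(matched_ious, num_false_pos, num_false_neg):
    """PQ for one class.

    matched_ious: IoU of each predicted/ground-truth segment pair that
    counts as a match (conventionally pairs with IoU > 0.5).
    num_false_pos: predicted segments without a match.
    num_false_neg: ground-truth segments without a match.
    """
    num_true_pos = len(matched_ious)
    denom = num_true_pos + 0.5 * num_false_pos + 0.5 * num_false_neg
    if denom == 0:
        return 0.0  # class absent from both prediction and ground truth
    return sum(matched_ious) / denom

# Two matches with IoU 0.75 each, one spurious and one missed segment.
pq = panoptic_quality([0.75, 0.75], num_false_pos=1, num_false_neg=1)
```

The table reports values such as 58.1 on a 0 to 100 scale, so a result from this sketch would be multiplied by 100 before comparison.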
Features for Amodal 3D Object Detection
Zhixin Wang and Kui Jia
Abstract: In this work, we propose a novel method termed Frustum ConvNet (F-ConvNet) for amodal 3D object detection from point clouds. Given 2D region proposals in an RGB image, our method first generates a sequence of frustums for each region proposal, and uses the obtained frustums to group local points. F-ConvNet aggregates point-wise features as frustum-level feature vectors, and arrays these feature vectors as a feature map for use of its subsequent component of fully convolutional network (FCN), which spatially fuses frustum-level features and supports an end-to-end and continuous estimation of oriented boxes in the 3D space. We also propose component variants of F-ConvNet, including an FCN variant that extracts multi-resolution frustum features, and a refined use of F-ConvNet over a reduced 3D space. Careful ablation studies verify the efficacy of these component variants. F-ConvNet assumes no prior knowledge of the working 3D environment, and is thus dataset-agnostic. We present experiments on both the indoor SUN-RGBD and outdoor KITTI datasets. F-ConvNet outperforms all existing methods on SUN-RGBD, and at the time of submission it outperforms all published works on the KITTI benchmark. We will make the code of F-ConvNet publicly available.
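The grouping step described in the abstract above (slice a proposal's viewing frustum along depth, collect the points in each slice, and line up one feature vector per slice) can be sketched roughly as follows. Several choices here are assumptions made only for illustration and are not taken from the paper: a pinhole camera with unit focal length, non-overlapping depth slices of equal thickness, and element-wise max as the aggregation. The function name is hypothetical.

```python
def frustum_feature_map(points, feats, box, z_min, z_max, num_frustums):
    """Group points into depth slices of one 2D proposal's frustum.

    points: list of (x, y, z) in camera coordinates, z is depth.
    feats:  list of per-point feature lists, aligned with points.
    box:    (u_min, v_min, u_max, v_max) in normalized image coordinates.
    Returns num_frustums feature vectors; empty slices are all zeros.
    """
    if z_min <= 0 or z_max <= z_min:
        raise ValueError("need 0 < z_min < z_max")
    u_min, v_min, u_max, v_max = box
    dim = len(feats[0]) if feats else 0
    step = (z_max - z_min) / num_frustums
    groups = [[] for _ in range(num_frustums)]
    for (x, y, z), f in zip(points, feats):
        if not (z_min <= z < z_max):
            continue
        u, v = x / z, y / z  # assumed pinhole projection, unit focal length
        if u_min <= u <= u_max and v_min <= v <= v_max:
            idx = min(int((z - z_min) / step), num_frustums - 1)
            groups[idx].append(f)
    # Aggregate each slice with an element-wise max (an assumed choice).
    return [
        [max(col) for col in zip(*g)] if g else [0.0] * dim
        for g in groups
    ]

pts = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.5), (0.0, 0.1, 3.0), (5.0, 0.0, 2.0)]
fts = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [9.0, 9.0]]
fmap = frustum_feature_map(pts, fts, (-0.5, -0.5, 0.5, 0.5), 1.0, 5.0, 4)
```

The returned list is ordered by depth, so it can be treated as a 1D feature map over which a convolutional network slides, which is the role the abstract assigns to the FCN.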
I. INTRODUCTION
Detection of object instances in 3D sensory data has tremendous importance in many applications including autonomous driving, robotic object manipulation, and augmented reality. Among others, RGB-D images and LiDAR point clouds are the most representative formats of 3D
Fig. 1: Illustration for how a sequence of frustums are generated for a region proposal in a RGB image.
or volumes, these methods suffer from loss of critical 3D information in the projection or quantization process.
With the progress of point set deep learning [11], [12], recent methods [13], [14] resort to learning features directly from raw point clouds. For example, the seminal work of F-PointNet [13] first finds local points corresponding to pixels inside a 2D region proposal, and then uses PointNet [11] to segment from these local points the foreground ones; the amodal 3D box is finally estimated from the foreground points. Performance of this method is limited due to the reasons that (1) it is not of end-to-end learning,
.01864v1 [cs.CV] 5 Mar 2019
TABLE (method column only): MV3D [5], VoxelNet [14], F-PointNet [13], AVOD-FPN [6], SECOND [15], IPOD [22], PointPillars [16], PointRCNN-v1.1 [23], Ours
Fig. 7: Qualitative results on the different categories, with green f
DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion
Chen Wang (2), Danfei Xu (1), Yuke Zhu (1), Roberto Martín-Martín (1), Cewu Lu (2), Li Fei-Fei (1), Silvio Savarese (1)
(1) Department of Computer Science, Stanford University
(2) Department of Computer Science, Shanghai Jiao Tong University
Abstract
A key technical challenge in performing 6D object pose estimation from an RGB-D image is to fully leverage the two complementary data sources. Prior works either extract information from the RGB image and depth separately or use costly post-processing steps, limiting their performances in highly cluttered scenes and real-time applications. In this work, we present DenseFusion, a generic framework for estimating 6D pose of a set of known objects from RGB-D images. DenseFusion is a heterogeneous architecture that processes the two data sources individually and uses a novel dense fusion network to extract pixel-wise dense feature embedding, from which the pose is estimated. Furthermore, we integrate an end-to-end iterative pose refinement
[Figure labels: RGB-D, DenseFusion]
Figure 1. We develop an end-to-end deep network model for 6D
1 [cs.CV] 15 Jan 2019
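One way to picture the "pixel-wise dense feature embedding" mentioned in the DenseFusion abstract above is to pair each pixel's color embedding with the geometry embedding of the corresponding depth point. The sketch below does this by plain concatenation and then appends an averaged global feature to every pixel. Both choices are assumptions made for illustration; the abstract does not spell out the fusion network, and the function name is hypothetical.

```python
def dense_fuse(color_emb, geo_emb):
    """Fuse per-pixel color and geometry embeddings.

    color_emb[i] and geo_emb[i] are feature lists for the same pixel i.
    Returns one fused vector per pixel: [color | geometry | global mean].
    """
    if len(color_emb) != len(geo_emb):
        raise ValueError("embeddings must be aligned pixel by pixel")
    # Per-pixel concatenation keeps the two sources tied to one location.
    per_pixel = [c + g for c, g in zip(color_emb, geo_emb)]
    if not per_pixel:
        return []
    n, dim = len(per_pixel), len(per_pixel[0])
    # Assumed global context: the mean of the per-pixel features.
    global_feat = [sum(p[i] for p in per_pixel) / n for i in range(dim)]
    return [p + global_feat for p in per_pixel]

fused = dense_fuse([[1.0], [3.0]], [[2.0], [4.0]])
```

A pose estimator could then predict one pose hypothesis per fused vector, which is what makes the embedding "dense".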
Deep Learning for Generic Object Detection: A Survey
Li Liu (1,2) · Wanli Ouyang (3) · Xiaogang Wang (4) · Paul Fieguth (5) · Jie Chen (2) · Xinwang Liu (1) · Matti Pietikäinen (2)
Received: 12 September 2018
Li Liu (li.liu@oulu.fi)
Wanli Ouyang (wanli.ouyang@sydney.edu.au)
Xiaogang Wang (xgwang@ee.cuhk.edu.hk)
Paul Fieguth (pfieguth@uwaterloo.ca)
Jie Chen (jie.chen@oulu.fi)
Xinwang Liu (xinwangliu@nudt.edu.cn)
Matti Pietikäinen (matti.pietikainen@oulu.fi)
1 National University of Defense Technology, China
2 University of Oulu, Finland
3 University of Sydney, Australia
4 Chinese University of Hong Kong, China
5 University of Waterloo, Canada
Abstract Generic object detection, aiming at locating object instances from a large number of predefined categories in natural images, is one of the most fundamental and challenging problems in computer vision. Deep learning techniques have emerged in recent years as powerful methods for learning feature representations directly from data, and have led to remarkable breakthroughs in the field of generic object detection. Given this time of rapid evolution, the goal of this paper is to provide a comprehensive survey of the recent achievements in this field brought by deep learning techniques. More than 250 key contributions are included in this survey, covering many aspects of generic object detection research: leading detection frameworks and fundamental subproblems including object feature representation, object proposal generation, context information modeling and training strategies; evaluation issues, specifically benchmark datasets, evaluation metrics, and state of the art performance. We finish by identifying promising directions for future research.
Keywords Object detection · deep learning · convolutional neural networks · object recognition
1 Introduction
As a longstanding, fundamental and challenging problem in computer vision, object detection has been an active area of research for several decades. The goal of object detection is to determine whether or not there are any instances of objects from the given categories (such as humans, cars, bicycles, dogs and cats) in some given image and, if present, to return the spatial location and extent of each object instance (e.g., via a bounding box [53, 179]). As the cornerstone of image understanding and computer vision, object detection forms the basis for solving more complex or high level vision tasks such as segmentation, scene understanding, object tracking, image captioning, event detection, and activity recognition. Object detection has a wide range of applications in many areas of artificial intelligence and information technologies, including robot vision, consumer electronics, security, autonomous driving, human computer interaction, content based image retrieval, intelligent video surveillance, and augmented reality.
Recently, deep learning techniques [81, 116] have emerged as powerful methods for learning feature representations automatically from data. In particular, these techniques have provided significant improvement for object detection, a problem which has attracted enormous attention in the last five years, even though it has been studied for decades by psychophysicists, neuroscientists, and engineers.
Object detection can be grouped into one of two types [69, 240]: detection of specific instances and detection of specific categories. The first type aims at detecting instances of a particular object (such as Donald Trump’s face, the Pentagon building, or my dog Penny), whereas the goal of the second type is to detect different instances of predefined object categories (for example humans,
[Figure: panels (a) and (b); axis labels "VOC year" and "ILSVRC year"; annotations "Results on VOC2012 Data" and "Turning Point in 2012: Deep Learning Achieved Record Breaking Image Classification Result"]
Fig. 1 Recent evolution of object detection performance. We can observe significant performance (mean average precision) improvement since deep learning entered the scene in 2012. The performance of the best detector has been steadily increasing by a significant amount on a yearly basis. (a) Results on the PASCAL VOC datasets: Detection results of winning entries in the VOC2007-2012 competitions (using only provided training data). (b) Top object detection competition results in ILSVRC2013-2017 (using only provided training data).
arXiv:1809.02165v1 [cs.CV] 6 Sep 2018
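The survey excerpt above describes detections as bounding boxes and reports mean average precision. A building block that such evaluations typically rely on is the intersection over union (IoU) between a predicted box and a ground-truth box. The sketch below is an illustration under assumptions not stated in the excerpt: boxes are (x1, y1, x2, y2) corner coordinates, and 0.5 is used as the match threshold, as is conventional for PASCAL VOC. The function names are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A detection counts as correct if it overlaps the ground truth enough."""
    return box_iou(pred_box, gt_box) >= threshold

# Two 2x2 boxes offset by one unit: intersection 1, union 7.
overlap = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Average precision for a class is then computed from the ranked list of detections labeled this way, and mean average precision averages that value over classes.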

More Related Content

最近の研究情勢についていくために - Deep Learningを中心に -

  • 15. Lecture 1 -Fei-Fei Li & Justin Johnson & Serena Yeung Computer Vision Neuroscience Machine learning Speech, NLP Information retrieval Mathematics Computer
 Science Biology Engineering Physics Robotics Cognitive sciences Psychology graphics, algorithms, theory,… Image processing 4/4/20174 systems, architecture, … optics
  • 17. R-CNN Piotr Doll´ar Ross Girshick search (FAIR) RoIAlignRoIAlign class box convconv convconv Figure 1. The MaskR-CNN framework for instance segmentation. a fixed set of categories without differentiating object in- stances.1 Given this, one might expect a complex method is required to achieve good results. However, we show that a surprisingly simple, flexible, and fast system can surpass Show and Tell: A Neural Image Caption Generator Oriol Vinyals Google vinyals@google.com Alexander Toshev Google toshev@google.com Samy Bengio Google bengio@google.com Dumitru Erhan Google dumitru@google.com Abstract Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep re- current architecture that combines recent advances in com- puter vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target de- scription sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descrip- tions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU-1 score (the higher the A group of people shopping at an outdoor market. ! There are many vegetables at the fruit stand. Vision! Deep CNN Language ! Generating! RNN Figure 1. NIC, our model, is based end-to-end on a neural net- work consisting of a vision CNN followed by a language gener- ating RNN. It generates complete sentences in natural language from an input image, as shown on the example above. existing solutions of the above sub-problems, in order to go from an image to its description [6, 16]. 
In contrast, we Perceptual Generative Adversarial Networks for Small Object Detection Jianan Li Xiaodan Liang Yunchao Wei Tingfa Xu Jiashi Feng Shuicheng Yan Abstract Detecting small objects is notoriously challenging due to their low resolution and noisy representation. Exist- ing object detection pipelines usually detect small objects through learning representations of all the objects at multi- ple scales. However, the performance gain of such ad hoc architectures is usually limited to pay off the computational cost. In this work, we address the small object detection problem by developing a single architecture that internally lifts representations of small objects to “super-resolved” ones, achieving similar characteristics as large objects and thus more discriminative for detection. For this purpose, we propose a new Perceptual Generative Adversarial Net- work (Perceptual GAN) model that improves small object Perceptual GAN Features For Small Instance Super-resolved Features Features For Large Instance ≈ Figure 1. Large and small objects exhibit different representation from high-level convolutional layers of a CNN detector. The repr sentations of large objects are discriminative while those of sma objects are of low resolution, which hurts the detection accurac In this work, we introduce the Perceptual GAN model to enhanc the representations for small objects to be similar to real large ob jects, thus improve detection performance on the small objects. cs.CV]20Jun2017
  • 18. and Cityscapes (bottom) using a single ResNet-101-FPN network. PQ PQTh PQSt mIoU AP DIN [1] 53.8 42.5 62.1 - 28.6 Panoptic FPN 58.1 52.0 62.5 75.7 33.0 O (top) and Cityscapes (bottom) using a single ResNet-101-FPN network. PQSt PQ PQTh PQSt mIoU AP Features for Amodal 3D Object Detection Zhixin Wang and Kui Jia Abstract— In this work, we propose a novel method termed Frustum ConvNet (F-ConvNet) for amodal 3D object detection from point clouds. Given 2D region proposals in a RGB image, our method first generates a sequence of frustums for each region proposal, and uses the obtained frustums to group local points. F-ConvNet aggregates point-wise features as frustum- level feature vectors, and arrays these feature vectors as a feature map for use of its subsequent component of fully convolutional network (FCN), which spatially fuses frustum- level features and supports an end-to-end and continuous estimation of oriented boxes in the 3D space. We also propose component variants of L-ConvNet, including a FCN variant that extracts multi-resolution frustum features, and a refined use of L-ConvNet over a reduced 3D space. Careful ablation studies verify the efficacy of these component variants. L- ConvNet assumes no prior knowledge of the working 3D envi- ronment, and is thus dataset-agnostic. We present experiments on both the indoor SUN-RGBD and outdoor KITTI datasets. L- ConvNet outperforms all existing methods on SUN-RGBD, and at the time of submission it outperforms all published works on the KITTI benchmark. We will make the code of L-ConvNet publicly available. I. INTRODUCTION Detection of object instances in 3D sensory data has tremendous importance in many applications including au- tonomous driving, robotic object manipulation, and aug- mented reality. Among others, RGB-D images and LiDAR point clouds are the most representative formats of 3D Fig. 1: Illustration for how a sequence of frustums are generated for a region proposal in a RGB image. 
or volumes, these methods suffer from loss of critical 3D information in the projection or quantization process. With the progress of point set deep learning [11], [12], recent methods [13], [14] resort to learning features directly from raw point clouds. For example, the seminal work of F-PointNet [13] first finds local points corresponding to pixels inside a 2D region proposal, and then uses PointNet [11] to segment from these local points the foreground ones; the amodal 3D box is finally estimated from the foreground points. Performance of this method is limited due to the reasons that (1) it is not of end-to-end learning, .01864v1[cs.CV]5Mar2019 Method MV3D [5] VoxelNet [14] F-PointNet [13] AVOD-FPN [6] SECOND [15] IPOD [22] PointPillars [16] PointRCNN-v1.1 [23] Ours TABLE Fig. 7: Qualitative results on the different categories, with green f DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion Chen Wang2 Danfei Xu1 Yuke Zhu1 Roberto Mart´ın-Mart´ın1 Cewu Lu2 Li Fei-Fei1 Silvio Savarese1 1 Department of Computer Science, Stanford University 2 Department of Computer Science, Shanghai Jiao Tong University Abstract A key technical challenge in performing 6D object pose estimation from RGB-D image is to fully leverage the two complementary data sources. Prior works either extract in- formation from the RGB image and depth separately or use costly post-processing steps, limiting their performances in highly cluttered scenes and real-time applications. In this work, we present DenseFusion, a generic framework for estimating 6D pose of a set of known objects from RGB- D images. DenseFusion is a heterogeneous architecture that processes the two data sources individually and uses a novel dense fusion network to extract pixel-wise dense fea- ture embedding, from which the pose is estimated. Further- more, we integrate an end-to-end iterative pose refinement RGB-D DenseFusion Figure 1. We develop an end-to-end deep network model for 6D 1[cs.CV]15Jan2019
  • 19. Deep Learning for Generic Object Detection: A Survey
    Li Liu 1,2 · Wanli Ouyang 3 · Xiaogang Wang 4 · Paul Fieguth 5 · Jie Chen 2 · Xinwang Liu 1 · Matti Pietikäinen 2
    Received: 12 September 2018
    arXiv:1809.02165v1 [cs.CV] 6 Sep 2018

    Li Liu (li.liu@oulu.fi), Wanli Ouyang (wanli.ouyang@sydney.edu.au), Xiaogang Wang (xgwang@ee.cuhk.edu.hk), Paul Fieguth (pfieguth@uwaterloo.ca), Jie Chen (jie.chen@oulu.fi), Xinwang Liu (xinwangliu@nudt.edu.cn), Matti Pietikäinen (matti.pietikainen@oulu.fi)
    1 National University of Defense Technology, China
    2 University of Oulu, Finland
    3 University of Sydney, Australia
    4 Chinese University of Hong Kong, China

    Abstract: Generic object detection, aiming at locating object instances from a large number of predefined categories in natural images, is one of the most fundamental and challenging problems in computer vision. Deep learning techniques have emerged in recent years as powerful methods for learning feature representations directly from data, and have led to remarkable breakthroughs in the field of generic object detection. Given this time of rapid evolution, the goal of this paper is to provide a comprehensive survey of the recent achievements in this field brought by deep learning techniques. More than 250 key contributions are included in this survey, covering many aspects of generic object detection research: leading detection frameworks and fundamental subproblems including object feature representation, object proposal generation, context information modeling and training strategies; evaluation issues, specifically benchmark datasets, evaluation metrics, and state of the art performance. We finish by identifying promising directions for future research.

    Keywords: Object detection · deep learning · convolutional neural networks · object recognition

    1 Introduction
    As a longstanding, fundamental and challenging problem in computer vision, object detection has been an active area of research for several decades. The goal of object detection is to determine whether or not there are any instances of objects from the given categories (such as humans, cars, bicycles, dogs and cats) in some given image and, if present, to return the spatial location and extent of each object instance (e.g., via a bounding box [53, 179]). As the cornerstone of image understanding and computer vision, object detection forms the basis for solving more complex or high level vision tasks such as segmentation, scene understanding, object tracking, image captioning, event detection, and activity recognition. Object detection has a wide range of applications in many areas of artificial intelligence and information technologies, including robot vision, consumer electronics, security, autonomous driving, human computer interaction, content based image retrieval, intelligent video surveillance, and augmented reality.

    Recently, deep learning techniques [81, 116] have emerged as powerful methods for learning feature representations automatically from data. In particular, these techniques have provided significant improvement for object detection, a problem which has attracted enormous attention in the last five years, even though it has been studied for decades by psychophysicists, neuroscientists, and engineers.

    Object detection can be grouped into one of two types [69, 240]: detection of specific instance and detection of specific categories. The first type aims at detecting instances of a particular object (such as Donald Trump’s face, the Pentagon building, or my

    [Figure 1: panels (a) and (b); axes "VOC year" and "ILSVRC year"; annotations "Results on VOC2012 Data" and "Turning Point in 2012: Deep Learning Achieved Record Breaking Image Classification Result"]
    Fig. 1 Recent evolution of object detection performance. We can observe significant performance (mean average precision) improvement since deep learning entered the scene in 2012. The performance of the best detector has been steadily increasing by a significant amount on a yearly basis. (a) Results on the PASCAL VOC datasets: Detection results of winning entries in the VOC2007-2012 competitions (using only provided training data). (b) Top object detection competition results in ILSVRC2013-2017 (using only provided training data).
  • 22. Deep Learning for Generic Object Detection: A Survey
  • 23. Deep Learning for Generic Object Detection: A Survey
    Li Liu 1,2 · Wanli Ouyang 3 · Xiaogang Wang 4 · Paul Fieguth 5 · Jie Chen 2 · Xinwang Liu 1 · Matti Pietikäinen 2
    Received: 12 September 2018
    arXiv:1809.02165v1 [cs.CV] 6 Sep 2018

    Li Liu (li.liu@oulu.fi), Wanli Ouyang (wanli.ouyang@sydney.edu.au), Xiaogang Wang (xgwang@ee.cuhk.edu.hk), Paul Fieguth (pfieguth@uwaterloo.ca), Jie Chen (jie.chen@oulu.fi), Xinwang Liu (xinwangliu@nudt.edu.cn), Matti Pietikäinen (matti.pietikainen@oulu.fi)
    1 National University of Defense Technology, China
    2 University of Oulu, Finland
    3 University of Sydney, Australia
    4 Chinese University of Hong Kong, China
    5 University of Waterloo, Canada

    Abstract: Generic object detection, aiming at locating object instances from a large number of predefined categories in natural images, is one of the most fundamental and challenging problems in computer vision. Deep learning techniques have emerged in recent years as powerful methods for learning feature representations directly from data, and have led to remarkable breakthroughs in the field of generic object detection. Given this time of rapid evolution, the goal of this paper is to provide a comprehensive survey of the recent achievements in this field brought by deep learning techniques. More than 250 key contributions are included in this survey, covering many aspects of generic object detection research: leading detection frameworks and fundamental subproblems including object feature representation, object proposal generation, context information modeling and training strategies; evaluation issues, specifically benchmark datasets, evaluation metrics, and state of the art performance. We finish by identifying promising directions for future research.

    Keywords: Object detection · deep learning · convolutional neural networks · object recognition

    1 Introduction
    As a longstanding, fundamental and challenging problem in computer vision, object detection has been an active area of research for several decades. The goal of object detection is to determine whether or not there are any instances of objects from the given categories (such as humans, cars, bicycles, dogs and cats) in some given image and, if present, to return the spatial location and extent of each object instance (e.g., via a bounding box [53, 179]). As the cornerstone of image understanding and computer vision, object detection forms the basis for solving more complex or high level vision tasks such as segmentation, scene understanding, object tracking, image captioning, event detection, and activity recognition. Object detection has a wide range of applications in many areas of artificial intelligence and information technologies, including robot vision, consumer electronics, security, autonomous driving, human computer interaction, content based image retrieval, intelligent video surveillance, and augmented reality.

    Recently, deep learning techniques [81, 116] have emerged as powerful methods for learning feature representations automatically from data. In particular, these techniques have provided significant improvement for object detection, a problem which has attracted enormous attention in the last five years, even though it has been studied for decades by psychophysicists, neuroscientists, and engineers.

    Object detection can be grouped into one of two types [69, 240]: detection of specific instance and detection of specific categories. The first type aims at detecting instances of a particular object (such as Donald Trump’s face, the Pentagon building, or my dog Penny), whereas the goal of the second type is to detect different instances of predefined object categories (for example humans,

    [Figure 1: panels (a) and (b); axes "VOC year" and "ILSVRC year"; annotations "Results on VOC2012 Data" and "Turning Point in 2012: Deep Learning Achieved Record Breaking Image Classification Result"]
    Fig. 1 Recent evolution of object detection performance. We can observe significant performance (mean average precision) improvement since deep learning entered the scene in 2012. The performance of the best detector has been steadily increasing by a significant amount on a yearly basis. (a) Results on the PASCAL VOC datasets: Detection results of winning entries in the VOC2007-2012 competitions (using only provided training data). (b) Top object detection competition results in ILSVRC2013-2017 (using only provided training data).
  • 28. 🍆