
arXiv:1409.3505v1 [cs.CV] 11 Sep 2014

DeepID-Net: multi-stage and deformable deep convolutional neural networks for object detection

Wanli Ouyang, Ping Luo, Xingyu Zeng, Shi Qiu, Yonglong Tian, Hongsheng Li, Shuo Yang, Zhe Wang, Yuanjun Xiong, Chen Qian, Zhenyao Zhu, Ruohui Wang, Chen-Change Loy, Xiaogang Wang, Xiaoou Tang

The Chinese University of Hong Kong

wlouyang,xgwang@ee.cuhk.edu.hk

Abstract

In this paper, we propose multi-stage and deformable deep convolutional neural networks for object detection. This new deep learning object detection pipeline has innovations in multiple aspects. In the proposed new deep architecture, a new deformation constrained pooling (def-pooling) layer models the deformation of object parts with geometric constraint and penalty. With the proposed multi-stage training strategy, multiple classifiers are jointly optimized to process samples at different difficulty levels. A new pre-training strategy is proposed to learn feature representations that are more suitable for the object detection task and have good generalization capability. By changing the net structures and training strategies, and by adding and removing key components in the detection pipeline, a set of models with large diversity is obtained, which significantly improves the effectiveness of model averaging. The proposed approach ranked #2 in ILSVRC 2014. It improves the mean averaged precision obtained by RCNN, the state-of-the-art in object detection, from 31% to 45%. Detailed component-wise analysis is also provided through extensive experimental evaluation.

1. Introduction

Object detection is one of the fundamental challenges in computer vision. It has attracted a great deal of research interest [9, 48, 20]. The main challenges of this task are caused by the intra-class variation in appearance, lighting, backgrounds, and deformation. In order to handle these challenges, a group of interdependent components in the object detection pipeline are important. First, features should capture the most discriminative information of object classes. Well-known features include hand-crafted features such as Haar-like features [55], SIFT [32], HOG [9], and learned deep CNN features [46, 29, 23]. Second, deformation models should handle the deformation of object parts, e.g. the torso, head, and legs of a human. The state-of-the-art deformable part-based model (DPM) in [20] allows object parts to deform with geometric constraint and penalty. Finally, a classifier decides whether a candidate window shall be detected as enclosing an object. SVM [9], latent SVM [20], multi-kernel classifiers [52], generative models [35], random forests [14], and their variations are widely used.

In this paper, we propose the multi-stage deformable DEEP generIc object Detection convolutional neural NETwork (DeepID-Net). In DeepID-Net, we learn the following key components: 1) feature representations for a large number of object categories, 2) deformation models of object parts, and 3) contextual information for objects in an image. We also investigate many aspects of effectively and efficiently training and aggregating the deep models, including bounding box rejection, training schemes, the objective function of the deep model, and model averaging. The proposed new pipeline significantly advances the state-of-the-art for deep learning based generic object detection, such as the well known RCNN [23] framework. With this new pipeline, our method ranks #2 in object detection on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014. This paper also provides detailed component-wise experimental results on how our approach improves the mean Averaged Precision (AP) obtained by RCNN [23] from 31.0% to 45% step-by-step on the ImageNet object detection challenge validation 2 dataset.

The contributions of this paper are as follows:

1. A new deep learning pipeline for object detection. It effectively integrates feature representation learning, part deformation learning, sub-box feature extraction, context modeling, model averaging, and bounding box location refinement into the detection system.

2. A new scheme for pretraining the deep CNN model. We propose to pretrain the deep model on the ImageNet image classification dataset with 1000-class object-level annotations instead of with image-level annotations, which are commonly used in existing deep learning object detection [23]. Then the deep model is fine-tuned on the ImageNet object detection dataset with 200 classes, which are the target object classes of the ImageNet object detection challenge.

3. A new deformation constrained pooling (def-pooling) layer, which enriches the deep model by learning the deformation of visual patterns of parts. The def-pooling layer can be used for replacing the max-pooling layer and learning the deformation properties of parts at any information abstraction level.

4. We show the effectiveness of the multi-stage training scheme in generic object detection. With the proposed deep architecture, the classifier at each stage handles samples at a different difficulty level. All the classifiers at multiple stages are jointly optimized. The proposed new stage-by-stage training procedure adds regularization constraints to parameters and better solves the overfitting problem compared with standard BP.

5. A new model averaging strategy. Different from existing works that combine deep models learned with the same structure and training strategy, we obtain multiple models by using different network structures and training strategies, adding or removing different types of layers and some key components in the detection pipeline. Deep models learned in this way have large diversity on the 200 object classes in the detection challenge, which makes model averaging more effective. It is observed that the effectiveness of different deep models varies a lot across different object categories. This motivates us to select and combine models differently for each individual class, which is also different from existing works [62, 46, 25] that use the same model combination for all the object classes.

2. Related Work

It has been shown that deep models are potentially more capable than shallow models in handling complex tasks [4]. Deep models have achieved spectacular progress in computer vision [26, 27, 43, 28, 30, 37, 29, 63, 33, 50, 18, 42]. Because of their power in learning feature representations, deep models have been widely used for object recognition and object detection in recent years [46, 62, 25, 47, 67, 24, 31, 23]. In existing deep CNN models, max pooling and average pooling are useful in handling deformation but cannot learn the deformation penalty and geometric model of object parts. The deformation layer was first proposed in our earlier work [38] for pedestrian detection. In this paper, we extend it to general object detection on ImageNet. In [38], the deformation layer was constrained to be placed after the last convolutional layer, while in this work the def-pooling layer can be placed after all the convolutional layers to capture geometric deformation at all the information abstraction levels. Also different from [38], the def-pooling layer in this paper can be used for replacing all the pooling layers. In [38], it was assumed that a pedestrian only has one instance of a body part, so each part filter only has one optimal response in a detection window. In this work, it is assumed that an object can have multiple instances of a body part (e.g. a car has many wheels), so each part filter is allowed to have multiple response peaks in a detection window. This new model is more suitable for general object detection.

Since some objects have non-rigid deformation, the ability to handle deformation improves detection performance. Deformable part-based models were used in [20, 65, 41, 39] for handling the translational movement of parts. To handle more complex articulations, the size change and rotation of parts were modeled in [21], and mixtures of part appearance and articulation types were modeled in [6, 60, 10]. In these approaches, features are manually designed, and deformation and features are not jointly learned.

The widely used classification approaches include various boosting classifiers [14, 15, 56], linear SVM [9], histogram intersection kernel SVM [34], latent SVM [20], multiple kernel SVM [53], structural SVM [65], and probabilistic models [3, 36]. In these approaches, classifiers are adapted to training data, but features are designed manually. If useful information has been lost at feature extraction, it cannot be recovered during classification. Ideally, classifiers should guide feature learning.

Research on visual cognition, computer vision and cognitive neuroscience has shown that the ability of human and computer vision systems to recognize objects is affected by contextual information such as non-target objects and contextual scenes. The context information investigated in previous works includes regions surrounding objects [9, 12, 22], object-scene interaction [13], and the presence, location, orientation and size relationships among objects [3, 57, 58, 11, 41, 22, 49, 13, 61, 12, 59, 40, 10, 45, 51]. In this paper, we utilize the image classification result from the deep model as the contextual information.

In summary, previous works treat the components individually or sequentially. This paper takes a global view of these components and is an important step towards joint learning of them for object detection.

3. Dataset overview

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014 [44] contains two different datasets: 1) the classification and localization dataset and 2) the detection dataset.

The classification and localization (Cls-Loc) dataset is split into three subsets: train, validation (val), and test data. The train data contains 1.2 million images with labels of 1,000 categories. The val and test data consist of 150,000 photographs, collected from Flickr and other search engines, hand labeled with the presence or absence of 1,000 object categories. The 1,000 object categories contain both internal nodes and leaf nodes of ImageNet, but do not overlap with each other. A random subset of 50,000 of the images is used as val data and released with labels of the 1,000 categories. The remaining 100,000 images are used as the test data and are released without labels at test time. The val and test data do not overlap with the train data.

The detection (Det) dataset contains 200 object categories and is split into three subsets: train, validation (val), and test data, which contain 395,918, 20,121 and 40,152 images, respectively. The manually annotated object bounding boxes on the train and val data are released, while those on the test data are not. The train data is drawn from the Cls-Loc data. In the Det val and test subsets, images from the Cls-Loc dataset where the target object is too large (greater than 50% of the image area) are excluded. Therefore, the Det val and test data have similar distributions. However, the distribution of Det train is different from the distributions of Det val and test. For a given object class, the train data has extra negative images that do not contain any object of this class. These extra negative images are not used in this paper. We follow the RCNN [23] in splitting the val data into val1 and val2. Val1 is used for training models while val2 is used for validating the performance of models. The val1/val2 split is the same as that in [23].

4. Method

4.1. The RCNN approach

A brief description of the RCNN approach is provided to give the context of our approach. RCNN uses the selective search in [48] for obtaining candidate bounding boxes from both training and testing images. An overview of this approach is shown in Fig. 1.

At the testing stage, the AlexNet in [29] is used for extracting features from bounding boxes, and then 200 one-versus-all linear classifiers are used for deciding the existence of an object in these bounding boxes. Each classifier provides the classification score on whether a bounding box contains a specific object class or not, e.g. person or non-person. The bounding box locations are refined using the AlexNet in order to reduce localization errors.

At the training stage, the ImageNet Cls-Loc dataset with 1,000 object classes is used to pretrain the AlexNet, and then the ImageNet Det dataset with 200 object classes is used to fine-tune the AlexNet. The features extracted by the AlexNet are then used for learning 200 one-versus-all SVM classifiers for the 200 classes. Based on the features extracted by the AlexNet, a linear regressor is learned to refine the bounding box location.
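For concreteness, a minimal sketch of this one-versus-all scoring step is given below. It is not the RCNN code; the feature dimension, weight shapes and variable names are illustrative assumptions.

    import numpy as np

    def score_boxes(features, W, b):
        """Apply one-versus-all linear classifiers to box features.

        features: (num_boxes, feat_dim) CNN features, one row per candidate box.
        W:        (num_classes, feat_dim) weights of the per-class linear SVMs.
        b:        (num_classes,) biases.
        Returns a (num_boxes, num_classes) score matrix; entry (i, j) is the
        confidence that box i contains an object of class j.
        """
        return features @ W.T + b

    # Illustrative usage with random data (feat_dim = 4096, as in an fc7 layer).
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(5, 4096))
    W = rng.normal(scale=0.01, size=(200, 4096))
    b = np.zeros(200)
    scores = score_boxes(feats, W, b)
    print(scores.shape)  # (5, 200)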

Figure 1. Overview of RCNN in [23]. Selective search [48] is used for proposing candidate bounding boxes that may contain objects. AlexNet is used to extract features from the cropped bounding box regions. Based on the extracted features, SVM is used to decide the existence of objects. Bounding box regression is used to refine the bounding box location and reduce localization errors.

4.2. Overview of the proposed approach

An overview of our proposed approach is shown in Fig. 2. In this model:

1. The selective search in [48] is used for obtaining candidate bounding boxes. Details are given in Section 4.3.

2. An existing detector is used for rejecting bounding boxes that are most likely to be background. Details are given in Section 4.4.

3. The remaining bounding boxes are cropped and warped into 227×227 images. Each 227×227 cropped image goes through the DeepID-Net in order to obtain 200 detection scores. Each detection score measures the confidence on the cropped image containing one specific object class, e.g. person. Details are given in Section 5.

4. The 1000-class image classification scores of a deep model on the whole image are used as the contextual information for refining the 200 detection scores of each candidate bounding box. Details are given in Section 5.7.

5. Averaging of multiple deep model outputs is used to improve the detection accuracy. Details are given in Section 6.

6. The bounding box regression in RCNN is used to reduce localization errors.

4.3. Bounding box proposal by selective search

Many approaches have been proposed to generate class-independent bounding box proposals. Recent approaches include objectness [1], selective search [48], category independent object proposals [16], constrained parametric min-cuts [7], combinatorial grouping [2], binarized normed gradients [8], deep learning [17], and edge boxes [66]. The selective search approach in [48] is adopted in order to have a fair comparison with the RCNN in [23]. We strictly followed the RCNN in using selective search, where selective search was run in fast mode on each image in val1, val2 and test, and each image was resized to a fixed width (500 pixels) before running selective search. In this way, selective search resulted in an average of 2,403 bounding box proposals per image with a 91.6% recall of all ground-truth bounding boxes when choosing the Intersection over Union (IoU) threshold as 0.5.
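As an illustration of how such a recall figure can be computed, here is a small sketch under the assumption of (x1, y1, x2, y2) box coordinates; the helper names are ours, not part of any released evaluation code.

    import numpy as np

    def iou(box, boxes):
        """IoU between one box and an array of boxes, all in (x1, y1, x2, y2)."""
        x1 = np.maximum(box[0], boxes[:, 0])
        y1 = np.maximum(box[1], boxes[:, 1])
        x2 = np.minimum(box[2], boxes[:, 2])
        y2 = np.minimum(box[3], boxes[:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_a = (box[2] - box[0]) * (box[3] - box[1])
        area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        return inter / (area_a + area_b - inter)

    def proposal_recall(gt_boxes, proposals, thresh=0.5):
        """Fraction of ground-truth boxes covered by at least one proposal."""
        if len(gt_boxes) == 0:
            return 1.0
        hits = [iou(gt, proposals).max() >= thresh for gt in gt_boxes]
        return float(np.mean(hits))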

Figure 2. Overview of DeepID-Net. Selective search is used for proposing candidate bounding boxes that may contain objects. Then RCNN is used for rejecting 94% of the candidate bounding boxes. Each remaining bounding box goes through the DeepID-Net in order to obtain 200 detection scores. Each score measures the confidence on whether the bounding box contains a specific object class, e.g. person, or not. After that, context is used for refining the 200 scores of each bounding box. Model averaging and bounding box regression are then used to improve the accuracy. Text in red highlights the steps that are not present in RCNN [23].

4.4. Bounding box rejection

On the val data, selective search generates 2,403 bounding boxes per image. On average, 10.24 seconds per image are required using the Titan GPU (about 12 seconds per image using a GTX 670) for extracting features from bounding boxes. Features in val and test should be extracted for training SVM or validating performance. This feature extraction takes around 2.4 days on the val dataset and around 4.7 days on the test dataset. The feature extraction procedure is time consuming and slows down the training and testing of new models. In order to speed up the feature extraction for new models, we use an existing approach, RCNN [23] in our implementation, for rejecting bounding boxes that are most likely to be background. Denote by s_i the detection scores for 200 classes of the i-th bounding box. The i-th bounding box is rejected if the following rejection condition is satisfied:

||s_i||_∞ < τ,   (1)

where ||s_i||_∞ = max_j {s_{i,j}}, s_{i,j} is the j-th element in s_i, and τ is a threshold chosen at or below −1. Since the elements in s_i are SVM scores, negative samples with scores smaller than −1 are not support vectors for the SVM. When ||s_i||_∞ falls below the threshold, the box is very unlikely to contain any of the 200 object classes and is rejected. With this rejection step, feature extraction for the remaining boxes takes about 1/9 of the 10.24 seconds per image on the Titan GPU required for the 100% bounding boxes. In terms of detection accuracy, bounding box rejection can improve the mean AP by around 1%.
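A sketch of the rejection rule follows; the threshold value tau is an assumption (the text only fixes it at or below −1), and the score matrix is whatever an existing detector, RCNN here, produces.

    import numpy as np

    def reject_boxes(svm_scores, tau=-1.0):
        """Keep a box only if at least one of its 200 class scores exceeds tau.

        svm_scores: (num_boxes, 200) detection scores s_i from an existing
                    detector (RCNN in the paper's implementation).
        Returns a boolean mask over boxes; False means the box is rejected
        as likely background and skipped during feature extraction.
        """
        max_scores = svm_scores.max(axis=1)  # ||s_i||_inf as defined in the text
        return max_scores >= tau

    # Example: with scores centered well below tau, most boxes are rejected.
    rng = np.random.default_rng(0)
    scores = rng.normal(loc=-2.0, scale=0.5, size=(2403, 200))
    keep = reject_boxes(scores)
    print(keep.mean())  # fraction of boxes kept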

5. Bounding box classification by DeepID-Net

5.1. Overview of DeepID-Net

An overview of the DeepID-Net is given in Fig. 3. This deep model contains four parts:

(a) The baseline deep model. The input is the image region cropped by a candidate bounding box. The input image region is warped to 227×227. The Clarifai-fast model in [62] is used as the baseline deep model in our best-performing single model. The Clarifai-fast model contains 5 convolutional layers (conv1-conv5) and two fully connected layers (fc6 and fc7). conv1 is the result of convolving its previous layer, the input image, with learned filters; similarly for conv2-conv5, fc6, and fc7. Max pooling layers, which are not shown in Fig. 3, are used after conv1, conv2 and conv5.

(b) Fully connected layers learned by the multi-stage training scheme, which is detailed in Section 5.3. The input of these layers is the pooling layer after conv5 of the baseline model.

(c) Layers with the def-pooling layer. The input of these layers is the conv5 layer of the baseline model. The conv5 layer is convolved by filters with variable sizes, and then the proposed def-pooling layer in Section 5.4.2 is used for learning the deformation constraint of these part filters. Parts (a)-(c) output the 200-class object detection scores. For the example in Fig. 3, the ideal output has a high score for the object class horse but low scores for the other classes for the cropped image region that contains a horse.

(d) The deep model (Clarifai-fast) for obtaining the image classification scores of 1000 classes. The input is the whole image. The image classification scores are used as contextual information for refining the scores of the bounding boxes. Details are given in Section 5.7.

Parts (a)-(d) are learned by back-propagation (BP).

5.2. New pretraining strategy

The training scheme of the RCNN in [23] is as follows:

1. Pretrain the deep model using the image classification task, i.e. using image-level annotations of 1000 classes from the ImageNet Cls-Loc train data.

2. Fine-tune the deep model for the object detection task, i.e. using object-level annotations of 200 classes from the ImageNet Det train and val1 data.

The deep model structures at the pretraining and fine-tuning stages are only different in the last fully connected layer for predicting labels (1,000 classes vs. 200 classes). Except for the last fully connected layers for classification, the parameters learned at the pretraining stage are used to initialize the model at the fine-tuning stage.


5.3. Multi-stage training

The multi-stage training scheme was proposed in our earlier work [64] for pedestrian detection. In this paper, we extend it to general object detection on ImageNet.

Algorithm 1: Multi-stage training of the extra classifiers.
Input: Training set: warped images and their labels from the fine-tuning training data; parameters Θ for the baseline deep model obtained by pretraining.
Output: Parameters Θ for the baseline deep model; parameters W_{l,t}, l = 6, 7, 8, t = 1, ..., T for the extra layers.
1: Set the elements in W_{l,t} to 0;
2: Use BP to fine-tune Θ, while keeping W_{l,t} at 0;
3: for t = 1 to T do
4:   Randomly initialize W_{l,t}, l = 6, 7;
5:   Use BP to jointly update Θ and the classifiers W_{l,t'} for t' ≤ t;
6: end for

Denotations. The pooling layer after conv5 is denoted by pool5. As shown in Fig. 4, besides fc6, pool5 is connected to T extra fully connected layers of size 4096. Denote the T extra layers connected to the pool5 layer as fc6_1, fc6_2, ..., fc6_T. Denote fc7_1, fc7_2, ..., fc7_T as the T layers separately connected to the layers fc6_1, fc6_2, ..., fc6_T. Denote the weights connected to fc_{l,t} by W_{l,t}, l = 6, 7, t = 1, ..., T. Denote the weights from fc7_t to the classification scores as W_{8,t}, t = 1, ..., T. The path from pool5 through fc6_t and fc7_t to the classification scores can be considered as the extra classifier at stage t.

The multi-stage training procedure is summarized in Algorithm 1. It consists of two steps.

• Step 1 (line 2 in Algorithm 1): BP is used for fine-tuning all the parameters in the baseline deep model.

• Step 2.1 (line 4 in Algorithm 1): parameters W_{l,t}, l = 6, 7 are randomly initialized at stage t in order to search for extra discriminative information in the next step.

• Step 2.2 (lines 5-6 in Algorithm 1): the multi-stage classifiers W_{l,t} for l = 6, 7, t = 1, ..., T are trained using BP stage-by-stage. In stage t, the classifiers up to stage t are jointly updated.

The baseline deep model is first trained by excluding the extra classifiers to reach a good initialization point. Training this simplified model avoids overfitting. Then the extra classifiers are added stage-by-stage. At stage t, all the existing classifiers up to stage t are jointly optimized. Each round of optimization finds a better local minimum around the good initialization point reached in the previous training stages.
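The control flow of this stage-by-stage procedure can be sketched as below; train_bp and init_extra are hypothetical callables standing in for back-propagation and random initialization, so this is an outline of Algorithm 1, not the authors' implementation.

    def multi_stage_training(theta, num_stages, init_extra, train_bp):
        """Stage-by-stage training of the extra classifiers (Algorithm 1 sketch).

        theta:       parameters of the baseline deep model (pretrained).
        num_stages:  T, the number of extra classifiers.
        init_extra:  callable returning randomly initialized W_{6,t}, W_{7,t}.
        train_bp:    callable running BP on the given list of parameter groups.
        """
        extra = []                      # W_{l,t} for the stages trained so far
        train_bp([theta])               # Step 1: fine-tune baseline, extras fixed at 0
        for t in range(num_stages):     # Step 2: add one stage per iteration
            extra.append(init_extra())  # Step 2.1: random init of W_{l,t}
            train_bp([theta] + extra)   # Step 2.2: jointly update classifiers up to stage t
        return theta, extra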

Figure 4. The baseline deep model and the fully connected layers with multi-stage training. The layer pool5 is the result of max pooling over the conv5 layer in Fig. 3. Different stages of classifiers deal with samples of different difficulty levels.

In the stage-by-stage training procedure, classifiers at the previous stages jointly work with the classifier at the current stage in dealing with misclassified samples. Existing cascaded classifiers only pass a single score to the next stage, while our deep model uses multiple hidden nodes to transfer information.

Detailed analysis of the multi-stage training scheme is provided in [64]. A brief summary is given as follows. First, it simulates the soft-cascade structure. A new classifier is introduced at each stage to help deal with misclassified samples, while the correctly classified samples have no influence on the new classifier. Second, the cascaded classifiers are jointly optimized at stage t in Step 2.2, such that these classifiers can better cooperate with each other. Third, the whole training procedure helps to avoid overfitting. The supervised stage-by-stage training can be considered as adding regularization constraints to parameters, i.e. some parameters are constrained to be zeros in the early training stages. At each stage, the whole network is initialized with a good point reached by the previous training stages, and the additional classifiers deal with misclassified samples. It is important to set W_{l,t} = 0 in the previous training stages; otherwise, it becomes standard BP. With standard BP, even an easy training sample can influence any classifier. Training samples would not be assigned to different classifiers according to their difficulty levels, the parameter space of the whole model would be huge, and it would be easy to overfit.

5.4. The def-pooling layer

5.4.1 Generating the part detection map

Since object parts have different sizes, we design filters with variable sizes and convolve them with the conv5 layer in the baseline model. Fig. 5 shows the layers with def-pooling layers. It contains the following four parts:

(a) The conv5 layer is convolved by filters of sizes 3×3, 5×5, and 9×9 separately in order to obtain the part detection maps of 128 channels, which are denoted by conv6_1, conv6_2, and conv6_3 as shown in Fig. 5. In comparison, the path from conv5 through fc6 and fc7 to the classification scores can be considered as a holistic model.

(b) The part detection maps are separately fed into the def-pooling layers, denoted by def6_1, def6_2, and def6_3, in order to learn their deformation constraints.

(c) The outputs of the def-pooling layers, i.e. def6_1, def6_2, and def6_3, are separately convolved with filters of size 1×1 with 128 channels to produce the outputs conv7_1, conv7_2, and conv7_3, which can be considered as fully connected layers over the 128 channels for each location.

(d) The fc7 layer in the Clarifai-fast model and the outputs of the layers conv7_1, conv7_2, and conv7_3 are used for estimating the class label of the candidate bounding box.

Figure 5. The baseline deep model and def-pooling layers.

5.4.2 Learning the deformation

Motivation. The effectiveness of learning the deformation constraints of object parts has been proved in object detection by many existing non-deep-learning detectors, e.g. [20]. However, it is missing in current deep learning models. In deep CNN models, max pooling and average pooling are useful in handling deformation but cannot learn the deformation constraint and geometric model of object parts. We design the def-pooling layer for deep models so that the deformation constraint of object parts can be learned by deep models.

Denote by M, of size V × H, the result of the convolutional layer, e.g. conv6_1. The def-pooling layer takes small blocks of size (2R+1) × (2R+1) from M and subsamples M to B of size (V/k_y) × (H/k_x), producing a single output from each block as follows:

b(x,y) = max_{i,j ∈ {−R,...,R}} { m(k_x·x + i, k_y·y + j) − Σ_{n=1}^{N} c_n d_n^{i,j} },   (2)

where (k_x·x, k_y·y) is the center of the block, k_x and k_y are subsampling steps, b(x,y) is the (x,y)-th element of B, and c_n and d_n^{i,j} are deformation parameters to be learned.

Example 1. Suppose c_n = 0; then there is no penalty for placing a part with center (k_x·x, k_y·y) at any location in {(k_x·x+i, k_y·y+j) | i, j = −R, ..., R}. In this case, the def-pooling layer degenerates to a max-pooling layer with subsampling step (k_x, k_y) and kernel size (2R+1)×(2R+1). Therefore, the difference between def-pooling and max-pooling is the term −Σ_{n=1}^{N} c_n d_n^{i,j} in (2), which is the deformation constraint learned by def-pooling. In short, def-pooling is max-pooling with a deformation constraint.

Figure 6. The deformation layer when the deformation map is defined in (3). The part detection map M and the deformation constraint are summed up to obtain the summed map M̃. Global max pooling is then performed on M̃ to obtain the score b.

Example 2. Suppose V = k_y, H = k_x, i = 1, ..., V, and j = 1, ..., H; then the def-pooling layer degenerates to the deformation layer in [38]. There is only one output for M in this case. The deformation layer can represent the widely used quadratic deformation constraint in the deformable part-based model [20]. Details are given in Appendix A. Fig. 6 illustrates this example.

Example 3. Suppose N = 1 and c_1 = 1; then the deformation constraint d_1^{i,j} is learned for each displacement bin (i,j) from the center location (k_x·x, k_y·y). In this case, d_1^{i,j} is the deformation cost of moving an object part from the center location (k_x·x, k_y·y) to the location (k_x·x+i, k_y·y+j). As an example, if d_1^{0,0} = 0 and d_1^{i,j} = ∞ for (i,j) ≠ (0,0), then the part is not allowed to move away from the center location (k_x·x, k_y·y). As a second example, if d_1^{i,j} = 0 for j ≤ 0 and d_1^{i,j} = ∞ for j > 0, then the part can move freely upward but should not move downward. As a third example, if d_1^{0,0} = 0 and d_1^{i,j} = 1 for (i,j) ≠ (0,0), then the part has no penalty at the center location (k_x·x, k_y·y) but has penalty 1 elsewhere. The R in (2) controls the movement range: parts are only allowed to move within the horizontal and vertical range [−R, R] from the center location.
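To make Eq. (2) and the examples concrete, here is a small numpy sketch of def-pooling on a single part detection map; the parameter shapes, names and indexing convention are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def def_pooling(M, c, d, R=1, kx=2, ky=2):
        """Def-pooling of Eq. (2): max-pooling with a learned deformation penalty.

        M:  (V, H) part detection map (e.g. one channel of conv6_1).
        c:  (N,) deformation weights c_n.
        d:  (N, 2R+1, 2R+1) deformation bases d_n^{i,j} for offsets i, j in [-R, R].
        kx, ky: subsampling steps; R: maximum allowed displacement.
        Returns B of size (V // ky, H // kx).
        """
        V, H = M.shape
        penalty = np.tensordot(c, d, axes=1)         # (2R+1, 2R+1): sum_n c_n d_n^{i,j}
        B = np.full((V // ky, H // kx), -np.inf)
        for y in range(B.shape[0]):
            for x in range(B.shape[1]):
                cy, cx = ky * y, kx * x              # block center (k_x*x, k_y*y)
                for j in range(-R, R + 1):
                    for i in range(-R, R + 1):
                        yy, xx = cy + j, cx + i
                        if 0 <= yy < V and 0 <= xx < H:
                            val = M[yy, xx] - penalty[j + R, i + R]
                            B[y, x] = max(B[y, x], val)
        return B

    # With c = 0 the penalty vanishes and def-pooling reduces to max-pooling (Example 1).
    M = np.arange(36, dtype=float).reshape(6, 6)
    print(def_pooling(M, c=np.zeros(1), d=np.zeros((1, 3, 3))))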

The deformation layer was proposed in our recently published work [38], which showed significant improvement in pedestrian detection. The def-pooling layer in this paper is different from the deformation layer in [38] in the following aspects.

1. The work in [38] only allows for one output, while this paper performs block-wise pooling and allows for multiple outputs at different spatial locations. Because of this difference, the deformation layer can only be put after the final convolutional layer, while the def-pooling layer can be put after any convolutional layer, like the max-pooling layer. Therefore, the def-pooling layer can capture geometric deformation at all levels of abstraction, while the deformation layer was only applied to a single layer corresponding to pedestrian body parts.

2. It was assumed in [38] that a pedestrian only has one instance of a body part, so each part filter only has one optimal response in a detection window. In this work, it is assumed that an object can have multiple instances of a part (e.g. a building has many windows, a traffic light has many light bulbs), so each part filter is allowed to have multiple response peaks. This new model is more suitable for general object detection. For example, the traffic light can have three response peaks to the light bulb in Fig. 7 for the def-pooling layer but only one peak in Fig. 6 for the deformation layer in [38].

3. The approach in [38] only considers one object class, e.g. pedestrians. In this work, we consider 200 object classes. The patterns can be shared across different object classes. As shown in Fig. 8, circular patterns are shared by the wheels of cars, the light bulbs of traffic lights, the wheels of carts and the keys of iPods. Similarly, the pattern of instrument keys is shared by accordions and pianos. In this work, our design of the deep model in Fig. 7 considers this property, learns the shared patterns through the layers conv6_1, conv6_2 and conv6_3, and uses these shared patterns for the 200 object classes.

5.5. Fine-tuning the deep model with hinge-loss

RCNN fine-tunes the deep model with the softmax loss, then fixes the deep model and uses the hidden layer fc7 as features to learn 200 one-versus-all SVM classifiers. This scheme requires extra time for extracting features from the training data. Even with the bounding box rejection, it still takes around 60 hours to prepare features from the ILSVRC 2013 Det train and val1 for SVM training. In our approach, we replace the softmax loss of the deep model by the hinge loss when fine-tuning deep models. The deep model fine-tuning and SVM learning steps in RCNN are thus merged into one step. In this way, the extra training time required for extracting features is saved in our approach.
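A hedged sketch of this loss swap is shown below: the last layer is trained with a one-versus-all hinge loss so that the network itself plays the role of the 200 SVM classifiers. The exact hinge variant (margin, squared or not) is not restated in this excerpt, so the standard form below is an assumption.

    import numpy as np

    def ova_hinge_loss(scores, labels, margin=1.0):
        """One-versus-all hinge loss over detection scores.

        scores: (batch, num_classes) outputs of the last layer.
        labels: (batch, num_classes) targets in {+1, -1}
                (+1 if the box contains that class, -1 otherwise).
        Returns the mean loss and its gradient w.r.t. the scores, which can be
        back-propagated through the deep model in place of the softmax gradient.
        """
        slack = np.maximum(0.0, margin - labels * scores)   # per-class hinge
        loss = slack.mean()
        grad = np.where(slack > 0, -labels, 0.0) / scores.size
        return loss, grad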

Figure 7. The def-pooling layer. The part detection map and the deformation constraint are summed up. Block-wise max pooling is then performed on the summed map to obtain the output B of size (V/k_y) × (H/k_x).

Figure 8. The circular patterns (a) and musical instrument key patterns (b) shared across different object classes.

5.6. Sub-box features

Figure 9. A box r_0 with its four sub-boxes r_1, ..., r_4 (a) and examples of the bounding boxes on cattle (b).

A bounding box denoted by r_0 can be divided into N sub-boxes r_1, ..., r_N, with N = 4 in our implementation. r_0 is called the root box in this paper. For example, the bounding box for cattle in Fig. 9 can be divided into 4 sub-boxes corresponding to the head, torso, forelegs and hind legs. The features of these sub-boxes can be used to improve the object detection accuracy. In our implementation, sub-boxes have half the width and height of the root box r_0. The four sub-boxes are located at the four corners of the root box r_0. Denote by B_s the set of bounding boxes generated by selective search. The features for these bounding boxes have already been generated by the deep model. The following steps are used for obtaining the sub-box features:

1. For a sub-box r_n, n = 1, ..., 4, its overlap with the boxes in B_s is calculated. The box in B_s having the largest IoU with r_n is used as the selected box b_{s,n} for the sub-box r_n.

2. The features of the selected box b_{s,n} are used as the features f_n for the sub-box r_n.

3. Element-wise max-pooling over the four feature vectors f_n for n = 1, 2, 3, 4 is used for obtaining the max-pooling feature vector f_max, i.e. f_{i,max} = max_{n=1,...,4} f_{i,n}, where f_{i,max} is the i-th element in f_max and f_{i,n} is the i-th element in f_n.

4. Element-wise average-pooling over the four feature vectors f_n for n = 1, 2, 3, 4 is used for obtaining the average-pooling feature vector f_avg, i.e. f_{i,avg} = (1/4) Σ_{n=1}^{4} f_{i,n}.
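The four steps can be sketched as follows; an IoU helper such as the one sketched earlier can be reused, and how f_max and f_avg are finally combined with the root-box feature is not specified in this excerpt, so the sketch stops there.

    import numpy as np

    def sub_boxes(root):
        """Four corner sub-boxes with half the width and height of the root box."""
        x1, y1, x2, y2 = root
        w, h = (x2 - x1) / 2.0, (y2 - y1) / 2.0
        return np.array([
            [x1, y1, x1 + w, y1 + h],        # top-left
            [x2 - w, y1, x2, y1 + h],        # top-right
            [x1, y2 - h, x1 + w, y2],        # bottom-left
            [x2 - w, y2 - h, x2, y2],        # bottom-right
        ])

    def sub_box_features(root, proposal_boxes, proposal_feats, iou_fn):
        """Steps 1-4: match each sub-box to its best-overlapping proposal,
        reuse that proposal's deep feature, then pool element-wise."""
        feats = []
        for r in sub_boxes(root):
            best = np.argmax(iou_fn(r, proposal_boxes))   # step 1
            feats.append(proposal_feats[best])            # step 2
        feats = np.stack(feats)
        f_max = feats.max(axis=0)                         # step 3
        f_avg = feats.mean(axis=0)                        # step 4
        return f_max, f_avg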

6. Model averaging

Table 1. Models used for model averaging submitted to ILSVRC 2014. The mAP results are on val2. For net design, A denotes AlexNet, C denotes Clarifai-fast, D-D denotes DeepID-Net with def-pooling layers, and D-MS denotes DeepID-Net with multi-stage training. In A and C, only the baseline deep model (Clarifai-fast or AlexNet) is used, without def-pooling layers or multi-stage training. In D-D and D-MS, the baseline deep model is chosen as Clarifai-fast, and extra layers from def-pooling or multi-stage training are included. For pretrain, [23] denotes the pretraining scheme of RCNN, 1 denotes Scheme 1 in Section 5.2, and 2 denotes Scheme 2 in Section 5.2.

mAP (%)           31.0  31.2  32.1  33.6  35.3  36.0  37.0  37.0  37.1  37.4

Averaging scheme  all-cls  all-cls  all-cls  per-cls
After deadline    n        n        y        y
Evaluation data   val2     test     val2     val2

The model averaging results are shown in Table 3. The mAP on val2 is 42.4%.

In existing works and the model averaging approach described above, the same model combination is applied to all the 200 classes in detection. However, we observe that the effectiveness of different models varies a lot across different object categories. Therefore, it is better to do model selection for each class separately. With this strategy, we achieve mAP 45% on val2.
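One way to realize per-class model selection is a greedy search over models on the validation set, sketched below; the greedy strategy and the average_precision helper are illustrative assumptions, since the exact selection procedure is not detailed in this excerpt.

    import numpy as np

    def select_models_per_class(val_scores, val_labels, average_precision):
        """val_scores: list of (num_boxes, num_classes) score matrices, one per model.
        val_labels:    (num_boxes, num_classes) binary ground truth on val2.
        Returns, for each class, the indices of the models chosen for averaging."""
        num_classes = val_labels.shape[1]
        selection = []
        for c in range(num_classes):
            chosen, best_ap = [], -1.0
            remaining = list(range(len(val_scores)))
            improved = True
            while improved and remaining:
                improved = False
                for m in list(remaining):
                    trial = chosen + [m]
                    avg = np.mean([val_scores[i][:, c] for i in trial], axis=0)
                    ap = average_precision(val_labels[:, c], avg)
                    if ap > best_ap:
                        best_ap, best_m = ap, m
                        improved = True
                if improved:
                    chosen.append(best_m)
                    remaining.remove(best_m)
            selection.append(chosen)
        return selection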

7. Experimental Results

The ImageNet Det val2 data is used for evaluating separate components, and the ImageNet Det test data is used for evaluating the overall performance. The RCNN approach in [23] is used as the baseline for comparison. The source code provided by the authors is used for reproducing their results. Without bounding box regression, we obtain mean AP 29.9 on val2, which is close to the 29.7 reported in [23]. Table 2 summarizes the results from the ILSVRC 2014 object detection challenge. It includes the best results on the test data submitted to ILSVRC 2014 from our team, GoogLeNet, DeepInsight, UvA-Euvision, and Berkeley Vision, which ranked top five among all the teams participating in the challenge. It also includes our most recent results on the test data obtained after the competition deadline. All these best results were obtained with model averaging.

Table 4. Ablation study of bounding box (bbox) rejection and the baseline deep model on ILSVRC 2014 val2.

mAP (%)        29.9  30.9  31.8
median AP (%)  28.9  29.4  30.5

Table 2. Experimental results on ILSVRC 2014 for the top ranked approaches.

ours  ours new
40.9  45
40.7  n/a

net structure     A-net  A-net  A-net   C-net  C-net
bbox rejection    n      n      n       y      y
class number      200    1000   1000    1000   1000
annotation level  image  image  object  image  object

Table 6. Ablation study of the two pretraining schemes in Section 5.2 on ILSVRC 2014 val2. Scheme 1 uses the image-level annotation while Scheme 2 does not.

mAP (%)             31.2   34.3   33.4   36.0
median AP (%)       29.7   33.4   33.1   34.9
net structure       A-net  C-net  D-MS   D-Def
bbox rejection      n      y      y      y
pretraining scheme  2      2      2      2

As shown in Table 6, Scheme 2 performs better than Scheme 1 by 2.6% mAP. This experiment shows that image-level annotation is not needed in pretraining the deep model when object-level annotation is available.

7.1.3 Investigation on deep model designs

Based on the pretraining Scheme 2 in Section 5.2, different deep model structures are investigated and the results are shown in Table 7. Our DeepID-Net that uses multi-stage training for multiple fully connected layers in Fig. 4 is denoted as D-MS. Our DeepID-Net that uses def-pooling layers as shown in Fig. 5 is denoted as D-Def. Using the C-net as the baseline deep model, the DeepID-Net that uses multi-stage training in Fig. 4 improves mAP by 1.5%. Using the C-net as the baseline deep model, the DeepID-Net that uses the def-pooling layer in Fig. 5 improves mAP by 2.5%. This experiment shows the effectiveness of the multi-stage training and the def-pooling layer for generic object detection.

7.1.4 Investigation on the overall pipeline

Table 8 and Table 9 summarize how the performance is improved by adding each component step-by-step into our pipeline. RCNN has mAP 29.9%. With bounding box rejection, mAP is improved by about 1%, denoted by ~1%. Based on that, changing A-net to C-net improves mAP by ~1%. Replacing image-level annotation by object-level annotation for pretraining, mAP increases by ~4%. The def-pooling layer further improves mAP by 2.5%. After adding the contextual information from image classification scores, mAP increases by ~1%. Bounding box regression improves mAP by ~1%. With model averaging, the best result is 45%. Table 9 summarizes the contributions of the different components. More results on the test data will be available in the next version soon.

Figure 11. Object detection results for RCNN and our approach.

8. Appendix A: Relationship between the deformation layer and the DPM in [20]

The quadratic deformation constraint in [20] can be represented as follows:

m̃(i,j) = m(i,j) − c_1 (i − a_i + c_3/(2c_1))² − c_2 (j − a_j + c_4/(2c_2))²,   (3)

where m(i,j) is the (i,j)-th element of the part detection map M and (a_i, a_j) is the predefined anchor location of the p-th part. They are adjusted by c_3/(2c_1) and c_4/(2c_2), which are automatically learned. c_1 and c_2 in (3) decide the deformation cost. There is no deformation cost if c_1 = c_2 = 0. Parts are not allowed to move if c_1 = c_2 = ∞. (a_i, a_j) and (c_3/(2c_1), c_4/(2c_2)) jointly decide the center of the part. The quadratic constraint in Eq. (3) can be represented using Eq. (2) as follows:

m̃(i,j) = m(i,j) − c_1 d_1^{i,j} − c_2 d_2^{i,j} − c_3 d_3^{i,j} − c_4 d_4^{i,j} − c_5,
d_1^{i,j} = (i − a_i)²,  d_2^{i,j} = (j − a_j)²,  d_3^{i,j} = i − a_i,
d_4^{i,j} = j − a_j,  c_5 = c_3²/(4c_1) + c_4²/(4c_2).   (4)

In this case, c_1, c_2, c_3 and c_4 are parameters to be learned and d_n^{i,j} for n = 1, 2, 3, 4 are predefined. c_5 is the same at all locations and need not be learned. The final output is:

b = max_{(i,j)} m̃(i,j),   (5)

where m̃(i,j) is the (i,j)-th element of the matrix M̃ in (3).
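The identity between (3) and (4) is easy to verify numerically; the following sketch evaluates both forms on a random part detection map with arbitrarily chosen parameters and confirms that they agree.

    import numpy as np

    rng = np.random.default_rng(0)
    V, H = 5, 7
    M = rng.normal(size=(V, H))
    c1, c2, c3, c4 = 0.7, 0.3, -0.4, 0.9            # deformation parameters
    ai, aj = 2, 3                                    # anchor location of the part

    i = np.arange(V)[:, None]                        # row index i
    j = np.arange(H)[None, :]                        # column index j

    # Quadratic deformation constraint of the DPM, Eq. (3).
    m3 = M - c1 * (i - ai + c3 / (2 * c1)) ** 2 - c2 * (j - aj + c4 / (2 * c2)) ** 2

    # The same constraint written in the def-pooling form, Eq. (4).
    c5 = c3 ** 2 / (4 * c1) + c4 ** 2 / (4 * c2)
    m4 = (M - c1 * (i - ai) ** 2 - c2 * (j - aj) ** 2
            - c3 * (i - ai) - c4 * (j - aj) - c5)

    assert np.allclose(m3, m4)                       # the two forms coincide
    print(m3.max())                                  # b in Eq. (5): global max of the summed map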

9. Conclusion

This paper proposes a deep learning pipeline that jointly learns four components for generic object detection: feature extraction, deformation handling, context modeling and classification.

Table 8. Ablation study of the overall pipeline for the single model tested on ILSVRC 2014 val2. It shows the mean AP after adding each key component step-by-step.

mAP (%)             29.9  30.9             31.8            36.0                   38.5  39.2      40.1
median AP (%)       28.9  29.4             30.5            34.9                   37.4  38.7      40.3
detection pipeline  RCNN  +bbox rejection  A-net to C-net  image to bbox pretrain +Def  +context  +bbox regression


[26] G. E. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.
[27] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006.
[28] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In CVPR, 2009.
[29] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[30] Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012.
[31] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014.
[32] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[33] P. Luo, X. Wang, and X. Tang. Hierarchical face parsing via deep learning. In CVPR, 2012.
[34] S. Maji, A. C. Berg, and J. Malik. Classification using intersection kernel support vector machines is efficient. In CVPR, 2008.
[35] K. Mikolajczyk, B. Leibe, and B. Schiele. Multiple object class detection with a generative model. In CVPR, volume 1, pages 26–36. IEEE, 2006.
[36] K. Mikolajczyk, B. Leibe, and B. Schiele. Multiple object class detection with a generative model. In CVPR, 2006.
[37] M. Norouzi, M. Ranjbar, and G. Mori. Stacks of convolutional restricted Boltzmann machines for shift-invariant feature learning. In CVPR, 2009.
[38] W. Ouyang and X. Wang. Joint deep learning for pedestrian detection. In ICCV, 2013.
[39] W. Ouyang and X. Wang. Single-pedestrian detection aided by multi-pedestrian detection. In CVPR, 2013.
[40] W. Ouyang, X. Zeng, and X. Wang. Modeling mutual visibility relationship in pedestrian detection. In CVPR, 2013.
[41] D. Park, D. Ramanan, and C. Fowlkes. Multiresolution models for object detection. In ECCV, 2010.
[42] H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In UAI, 2011.
[43] M. Ranzato, F. J. Huang, Y.-L. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.
[44] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge, 2014.
[45] M. A. Sadeghi and A. Farhadi. Recognition using visual phrases. In CVPR, pages 1745–1752. IEEE, 2011.
[46] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
[47] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLR, 2014.
[48] A. Smeulders, T. Gevers, N. Sebe, and C. Snoek. Segmentation as selective search for object recognition. In ICCV, 2011.
[49] Z. Song, Q. Chen, Z. Huang, Y. Hua, and S. Yan. Contextualizing object detection and classification. In CVPR, 2011.
[50] Y. Sun, X. Wang, and X. Tang. Hybrid deep learning for computing face similarities. In ICCV, 2013.
[51] S. Tang, M. Andriluka, A. Milan, K. Schindler, S. Roth, and B. Schiele. Learning people detectors for tracking in crowded scenes. In ICCV, 2013.
[52] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple kernels for object detection. In ICCV, 2009.
[53] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple kernels for object detection. In ICCV, 2009.
[54] P. Viola and M. Jones. Robust real-time face detection. IJCV, 57(2):137–154, 2004.
[55] P. Viola, M. J. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. IJCV, 63(2):153–161, 2005.
[56] B. Wu and R. Nevatia. Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors. In ICCV, 2005.
[57] B. Wu and R. Nevatia. Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors. IJCV, 75(2):247–266, 2007.
[58] J. Yan, Z. Lei, D. Yi, and S. Z. Li. Multi-pedestrian detection in crowded scenes: A global view. In CVPR, 2012.
[59] Y. Yang, S. Baker, A. Kannan, and D. Ramanan. Recognizing proxemics in personal photos. In CVPR, 2012.
[60] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In CVPR, 2011.
[61] B. Yao and L. Fei-Fei. Modeling mutual context of object and human pose in human-object interaction activities. In CVPR, 2010.
[62] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks. arXiv preprint arXiv:1311.2901, 2013.
[63] M. D. Zeiler, G. W. Taylor, and R. Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In ICCV, 2011.
[64] X. Zeng, W. Ouyang, and X. Wang. Multi-stage contextual deep learning for pedestrian detection. In ICCV, 2013.
[65] L. Zhu, Y. Chen, A. Yuille, and W. Freeman. Latent hierarchical structural learning for object detection. In CVPR, 2010.
[66] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, 2014.
[67] W. Y. Zou, X. Wang, M. Sun, and Y. Lin. Generic object detection with dense neural patterns and regionlets. In BMVC, 2014.
