

Pose Estimation Based on 3D Models

Chuiwen Ma, Hao Su, Liang Shi

1 Introduction

This project aims to estimate the pose of an object in an image. Pose estimation is an open and crucial problem in computer vision. Many real-world tasks depend heavily on, or can be improved by, good pose estimation. For example, by knowing the exact pose of an object, a robot can decide where to sit, how to grasp, and how to avoid collisions when moving around. Pose estimation is also applicable to autonomous driving: with good pose estimates of surrounding cars, a driving system can maneuver accordingly. Moreover, pose estimation can benefit image search and 3D reconstruction, and has a large potential impact on many other fields.

Previously, most pose estimation work relied on manually captured and labeled datasets, collected with multi-camera rigs or depth cameras. However, creating such a dataset is extremely time-consuming, laborious, and error-prone. Therefore, only limited information can be learned from those datasets, and research-specific datasets lead to poor comparability among previous works. In this project, we instead utilized the power of 3D shape models. To be specific, we built a large, balanced and precisely labeled training dataset from ShapeNet [5], a large 3D model pool which contains millions of 3D shape models in thousands of object categories. By rendering 3D models into 2D images from different viewpoints, we can easily control the size, the pose distribution, and the precision of the dataset. A learning model trained on this dataset will help us better solve the pose estimation task.

In this work, we proposed a pose estimation system based on a rendered-image training set, which predicts the pose of objects in real images, given the object category and a tight bounding box. Although the approach is generic, we chose chairs as our primary research object. Our system takes a properly cropped chair image as input, and outputs a probability vector over the discretized pose space. Given a test image, we first divide it into an N × N grid of overlapping patches. For each patch, a multi-class classifier is trained to estimate the probability that the patch shows pose v. Then, scores from all patches are combined to generate a probability vector for the whole image.

Although we created a larger and more precise training dataset from rendered images, there is an obvious drawback of this approach: the statistical properties of the training set and the test set are different. For instance, in the real world there is a prior probability distribution over poses, which might be non-uniform. Furthermore, real image features might be more diverse than rendered image features. In this project, we also focused on information transfer between 2D images and 3D models, and therefore proposed a method that iteratively learns from classification results and in return improves the classification algorithm. This approach corrects for the influence of the different prior probability distributions in the training and test sets. Details and experimental results are given in the following sections.

1.1 Related Works

Object pose estimation is a classical problem in computer vision. In general, there are two typical research lines: one based on 2D representations and the other based on 3D information. Among 2D-based approaches, [6, 7, 8] rely on point matching, which is now outdated. By linking together diagnostic parts of an object from different views, [9] represents an object category as a collection of view-invariant regions. Sun et al. [10] and Su et al. [17] used a generative approach to group local features into parts and then learn part locations across viewpoints. [11] used SIFT-like [18] spatial pyramids of histogram features to train an SVM classifier for each discrete pose. Inspired by the success of the Deformable Part Model [12], [13] trained a DPM using a semi-latent approach, where the components correspond to discrete viewpoints. [15] used convolutional neural network features for pose estimation. [16] proposed a Hough Forest based method for simultaneous object detection and continuous pose estimation. These widely different works, although successful to varying degrees, do not learn from the structural information of objects as humans do.

arXiv:1506.06274v1 [cs.CV] 20 Jun 2015

Recently, 3D model based approaches have achieved good performance on the pose estimation task. [19] extended deformable part models to 3D, where part appearances and spatial deformations are represented in 3D. Using an off-the-shelf approach, [20] first obtained a rough localization and viewpoint of the object, and then estimated a continuous pose by using annotated 3D CAD models. Hejrati et al. [21] estimated poses of cars using an explicit 3D shape model and viewpoint learned from structure-from-motion (SFM). In general, the methods above rely on sophisticated handling of 3D models, because only a limited number of models was available. To our knowledge, no previous work utilizes a large 3D model database to solve the pose estimation task.

2 Data Collection and Processing

2.1 Training Data

As mentioned in Section 1, we collected our training data from ShapeNet, an emerging 3D shape model database. With 9,135135 semantically annotated 3D models in thousands of object categories, ShapeNet can provide abundant information for many vision tasks. In our task, we used the 5057 chair models in ShapeNet. We rendered each model from 16 viewpoints, evenly distributed on the horizontal circle, as shown in Figure 1.

Figure 1: Chair models and rendering process

We chose 4000 models, and accordingly 64,000 images, to build the training dataset, and left the remaining 1057 models as our rendered image test set (validation set). Before extracting image features, we first resize each image to 112 × 112 pixels, and then divide it into a 6 × 6 grid of overlapping patches, with patch size 32 × 32 and a stride of 16 on both axes. After preprocessing, we extract a 576-dimensional HoG feature [2] from each patch, so the whole image is represented by a 36 × 576 = 20736-dimensional feature vector. These 64,000 feature vectors constitute our training dataset.
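The patch layout described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the original implementation; the helper name `patch_grid` and its parameter names are illustrative assumptions, while the sizes (112, 32, 16, 576) are the ones stated in the text.

```python
def patch_grid(image_size=112, patch_size=32, stride=16):
    """Return the (top, left) corner of every patch in the overlapping grid."""
    # Number of patch positions along one axis: (112 - 32) / 16 + 1 = 6.
    per_axis = (image_size - patch_size) // stride + 1
    offsets = [k * stride for k in range(per_axis)]
    return [(top, left) for top in offsets for left in offsets]


HOG_DIM_PER_PATCH = 576  # per-patch HoG dimensionality stated in the text

corners = patch_grid()
num_patches = len(corners)                     # 6 x 6 = 36 patches
feature_dim = num_patches * HOG_DIM_PER_PATCH  # 36 * 576 = 20736
print(num_patches, feature_dim, corners[-1])
```

Adjacent patches overlap by half their width, so every interior pixel is covered by up to four patches.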

Figure 2: Image preprocessing and feature extraction

2.2 Test Data

To comprehensively evaluate the performance of our learning algorithm, we built three different test sets of increasing difficulty: a rendered image test set, a clean background real image test set, and a cluttered background real image test set. The rendered image test set, as mentioned in Section 2.1, consists of 1057 × 16 rendered images, which also come from ShapeNet. The clean background and cluttered background real image test sets are collected from ImageNet [3], containing 1309 and 1000 images respectively, both with manually labeled ground truth viewpoints. Some sample images are shown in Figure 3. These three datasets are increasingly noisy and thus increasingly difficult.

Figure 3: Clean background & cluttered background

For image preprocessing and feature extraction on the test sets, we used the same scheme as for the training set. That is, each image is converted into a 20736-dimensional HoG feature.

3 Model

Rather than using a global image feature as the input to classification, our pose estimation model is patch-based. By dividing the image into patches and training a classifier for each patch, our model can be more robust to occlusion and background noise. This approach also reduces the feature dimension for each classifier, and thus the sample complexity. We did try global features, but the classification accuracy is about 30% lower than that of the patch-based method on the clean background test set, as shown in Table 1. The mathematical formulation of our patch-based model is as follows.

Define $F_i$ as the HoG feature of patch $i$, $I = (F_1, \cdots, F_{N^2})$ as the HoG feature of the whole image, and $\mathcal{V} = \{1, \cdots, V\}$ as the discretized pose space. For each patch, we build a classifier, which learns from training data and gives a prediction of the conditional probability $P(v|F_i)$. To represent $P(v|I)$ in terms of $P(v|F_i)$, $i = 1, \cdots, N^2$, we assume $P(v|I) \propto \prod_{i=1}^{N^2} P(v|F_i)$. So, we can calculate $P(v|I)$ and the corresponding $\bar{v}$ using the following formulas:

$$P(v|I) = \frac{\prod_{i=1}^{N^2} P(v|F_i)}{\sum_{v=1}^{V} \prod_{i=1}^{N^2} P(v|F_i)}$$

$$\bar{v} = \arg\max_{v} P(v|I)$$

In sum, our model takes $F_i$, $i = 1, \cdots, N^2$ as input, and outputs $P(v|I)$ and $\bar{v}$.
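The combination rule can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function name `combine_patches` is an assumption, poses are indexed from 0, and the product is evaluated in the log domain (a numerical choice not stated in the text) so that multiplying 36 small probabilities does not underflow. It assumes every per-patch probability is strictly positive.

```python
import math


def combine_patches(patch_probs):
    """Combine per-patch vectors P(v|F_i) into P(v|I) and the arg-max pose.

    patch_probs: list of probability vectors, one per patch, all of length V,
    with strictly positive entries.
    """
    num_poses = len(patch_probs[0])
    # Sum of logs equals the log of the product over patches.
    log_scores = [sum(math.log(p[v]) for p in patch_probs)
                  for v in range(num_poses)]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_scores)
    unnormalized = [math.exp(s - peak) for s in log_scores]
    total = sum(unnormalized)
    posterior = [u / total for u in unnormalized]
    best_pose = max(range(num_poses), key=posterior.__getitem__)
    return posterior, best_pose


# Two patches, three poses: products are 0.10, 0.18, 0.04 (sum 0.32).
posterior, best_pose = combine_patches([[0.5, 0.3, 0.2], [0.2, 0.6, 0.2]])
print(posterior, best_pose)
```

In this toy input the first patch alone prefers pose 0, but the combined evidence selects pose 1.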

4 Methods

4.1 Learning Algorithms

4.1.1 Random Forest

In this project, we chose random forest [1] as the primary classification algorithm because of the following advantages:

• Suitable for multiclass classification.

• Non-parametric, easy to tune.

• Fast, easy to parallelize.

• Robust, due to randomized processing.

During classification, 36 random forest classifiers are trained, one for each of the 36 patches. As a trade-off between space and time complexity and performance, we set the forest size to 100 trees. We also tuned the maximum depth of the trees using cross-validation, and found the optimal depth to be 20. Each random forest outputs a probability vector $P(v|F_i)$. After Laplace smoothing, we calculate $P(v|I)$ and estimate the pose as $\bar{v} = \arg\max_{v} P(v|I)$.
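Laplace smoothing matters here because the per-patch probabilities are multiplied: a single patch that assigns probability zero to a pose would otherwise eliminate that pose for the whole image. The text does not specify how the smoothing was applied, so the sketch below makes an assumption: it treats the forest output as per-pose tree vote counts and adds one pseudo-count per pose. The name `laplace_smooth` and the `pseudo_count` parameter are hypothetical.

```python
def laplace_smooth(votes, pseudo_count=1.0):
    """Turn per-pose vote counts into strictly positive probabilities."""
    # Assumption: add-one smoothing, (count + 1) / (total + V).
    total = sum(votes) + pseudo_count * len(votes)
    return [(count + pseudo_count) / total for count in votes]


# 100 trees voting over 16 poses; most poses receive no votes.
votes = [60, 40] + [0] * 14
smoothed = laplace_smooth(votes)
print(smoothed[0], smoothed[1], smoothed[2])
```

With 100 trees and 16 poses, an unvoted pose receives probability 1/116 instead of 0, so it can still win if other patches support it strongly.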

4.2 Optimization

Constructing the training dataset from rendered images has many advantages, but there are also drawbacks. As mentioned in Section 1, the prior probability of poses in real images can be very different from that in rendered images. The pose distribution in the training set is uniform; in real images, however, there are far more front-view chairs than back-view chairs. Fortunately, this difference can be analyzed and modeled as follows.

4.2.1 Probability Calibration

In the classification step, each classifier $C_i$ outputs a probability vector $\hat{P}(v|F_i)$. Using Bayes' formula, we have:

$$\hat{P}(v|F_i) = \frac{\hat{P}(v)\,\hat{P}(F_i|v)}{\hat{P}(F_i)}$$

Although we may not learn $\hat{P}(v)$, $\hat{P}(F_i|v)$ and $\hat{P}(F_i)$ explicitly during training, we can use them to describe the statistical properties of the training data. The real $P(v|F_i)$, which satisfies the following formula, could be different from $\hat{P}(v|F_i)$. Here, $P(v)$, $P(F_i|v)$ and $P(F_i)$ are statistical properties of the test set.

$$P(v|F_i) = \frac{P(v)\,P(F_i|v)}{P(F_i)}$$

Assume the training data and the test data have at least some similarity. Specifically, assume $P(F_i|v) = \hat{P}(F_i|v)$ and $P(F_i) = \hat{P}(F_i)$; then we have:

$$P(v|F_i) = \hat{P}(v|F_i)\,\frac{P(v)}{\hat{P}(v)} \propto \hat{P}(v|F_i)\,P(v)$$
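As a small worked example of this calibration rule, the sketch below rescales a classifier output by the ratio of test prior to training prior. It is illustrative only: the function name `calibrate` is an assumption, and the final renormalization is an added practical step, since the two equality assumptions above rarely hold exactly and the rescaled values need not sum to one.

```python
def calibrate(p_hat, test_prior, train_prior=None):
    """Rescale a classifier output p_hat(v|F) by P(v) / P_hat(v)."""
    num_poses = len(p_hat)
    if train_prior is None:
        # The rendered training set is balanced, so its prior is uniform.
        train_prior = [1.0 / num_poses] * num_poses
    unnormalized = [p * t / r
                    for p, t, r in zip(p_hat, test_prior, train_prior)]
    # Added step (assumption): renormalize so the result sums to one.
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# The classifier is undecided, but front views are four times more common.
calibrated = calibrate([0.5, 0.5], test_prior=[0.8, 0.2])
print(calibrated)
```

When the training prior is uniform, the division by it only changes the overall scale, which is why the proportional form with $P(v)$ alone is sufficient.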

To recover $P(v|F_i)$, we just need a good estimate of $P(v)$. One possible method is to randomly choose some samples from the test set, manually label their ground truth viewpoints, and regard the ground truth pose distribution of these samples as an estimate of the overall $P(v)$. However, this still requires manual labeling.

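Noting that the above formula can also be written as:

$$\frac{P(v|F_i)}{P(v)} = \frac{\hat{P}(v|F_i)}{\hat{P}(v)}; \quad \hat{P}(v) = \frac{1}{V}, \ \forall v \in \mathcal{V}$$

we came up with another idea to automatically improve the classification result. For $\hat{P}(v|F_i)$, we have:

$$P(v) > \frac{1}{V} \ \Rightarrow\ \hat{P}(v|F_i) < P(v|F_i)$$

$$P(v) < \frac{1}{V} \ \Rightarrow\ \hat{P}(v|F_i) > P(v|F_i)$$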

That means that, at test time, frequently appearing poses are underestimated, while uncommon poses are overestimated. Here, we propose an iterative method to counterbalance this effect. Basically, we use $\hat{P}(v|F_i)$ to generate an estimate $\hat{P}(v)$ of the test-set prior distribution (from here on, $\hat{P}(v)$ denotes this estimate rather than the uniform training prior); assume that $P(v)$ and $\hat{P}(v)$ have similar common and uncommon views (in other words, $P(v)$ and $\hat{P}(v)$ have the same trend); smooth $\hat{P}(v)$ to keep the trend while reducing the fluctuation range; multiply the original $\hat{P}(v|F_i)$ by the smoothed $\hat{P}(v)$; and iteratively repeat the above steps. Finally, due to the damping effect of the combination step, $\hat{P}(v)$ will converge, and $\hat{P}(v|F_i)$ gets closer to $P(v|F_i)$. The iterative algorithm is formulated as follows:

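1. Calculate $\hat{P}(v|I^{(j)})$, $j = 1, \cdots, m$.

$$\hat{P}(v|I^{(j)}) = \frac{\prod_{i=1}^{N^2} \hat{P}(v|F_i^{(j)})}{\sum_{v=1}^{V} \prod_{i=1}^{N^2} \hat{P}(v|F_i^{(j)})}$$

2. Accumulate $\hat{P}(v|I^{(j)})$ over all test samples to calculate $\hat{P}(v)$.

$$\hat{P}(v) = \frac{1}{m} \sum_{j=1}^{m} \hat{P}(v|I^{(j)})$$

3. Smooth $\hat{P}(v)$ by a factor $\alpha$.

$$\hat{P}_s(v) = \frac{\hat{P}(v) + \alpha}{1 + 16\alpha}$$

4. Estimate $P(v|F_i)$ by letting:

$$\bar{P}(v|F_i) = \hat{P}(v|F_i)\,\hat{P}_s(v)$$

5. Use $\bar{P}(v|F_i)$ to re-calculate $\hat{P}(v|I^{(j)})$ in step 1, while keeping $\hat{P}(v|F_i)$ in step 4 unchanged, and repeat the above steps.

4.2.2 Parameter Automatic Selection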

After several iterations, the algorithm converges, and we get a final estimate $\bar{P}(v|F_i)$ of $P(v|F_i)$. However, different values of $\alpha$ lead to very different converged results, as shown in Figure 4. From the experimental results in Figure 5, we observed that if $\alpha$ is too small, the viewpoint with the highest initial probability $\hat{P}(v)$ soon dominates the other viewpoints, and $\hat{P}(v)$ converges to a totally biased distribution. If $\alpha$ is too large, the smoothed prior is nearly uniform and has almost no influence on $\bar{P}(v|F_i)$, resulting in $\bar{P}(v|F_i) = \hat{P}(v|F_i)$ up to normalization. However, there exists an intermediate value of $\alpha$ that maximizes the classification accuracy and leads to an optimal estimate $\bar{P}(v|F_i)$. In Figures 4 and 5, $\alpha_{opt}$ is 0.8.
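The five-step iteration of Section 4.2.1 and the role of $\alpha$ can be sketched as below. This is a minimal sketch under stated assumptions, not the original implementation: the function names are hypothetical, a fixed iteration count stands in for a convergence test, the constant 16 in the smoothing formula is written as the number of poses, and $\alpha$ must be positive so that every smoothed prior entry stays above zero.

```python
import math


def image_posterior(patch_probs):
    """Step 1: normalized product of per-patch vectors, in the log domain."""
    num_poses = len(patch_probs[0])
    logs = [sum(math.log(p[v]) for p in patch_probs) for v in range(num_poses)]
    peak = max(logs)
    unnormalized = [math.exp(s - peak) for s in logs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def estimate_prior(test_images, alpha, num_iters=20):
    """test_images[j][i] is the classifier output for patch i of image j."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    num_poses = len(test_images[0][0])
    # A uniform smoothed prior makes the first pass equal to the raw step 1.
    smoothed = [1.0 / num_poses] * num_poses
    for _ in range(num_iters):
        posteriors = []
        for patches in test_images:
            # Step 4: always rescale the ORIGINAL classifier outputs.
            rescaled = [[p * s for p, s in zip(patch, smoothed)]
                        for patch in patches]
            posteriors.append(image_posterior(rescaled))  # step 1
        # Step 2: average the image posteriors over the test set.
        prior = [sum(post[v] for post in posteriors) / len(posteriors)
                 for v in range(num_poses)]
        # Step 3: smooth; the denominator keeps the sum equal to one.
        smoothed = [(p + alpha) / (1.0 + num_poses * alpha) for p in prior]
    return prior, smoothed, posteriors
```

Calling `estimate_prior(images, alpha=0.8)` on a list of per-image patch outputs returns the estimated prior, its smoothed version, and the calibrated image posteriors. A very large `alpha` drives the smoothed prior towards uniform, which reproduces the uncalibrated result described in the text.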

Figure 4: Classification accuracy change w.r.t. $\alpha$

Figure 5: Stable distribution $\hat{P}(v)$ w.r.t. $\alpha$

To find the optimal $\alpha$, we analyzed the relationship between the stable $\hat{P}(v)$ and $\alpha$ in depth. We found three patterns of relationship between $\hat{P}(v_j)$ and $\alpha$, shown in Figure 6. For some viewpoints, $\hat{P}(v)$ is almost monotonically increasing with respect to $\alpha$ (the blue curves); for some it is monotonically decreasing (the black curve); and for others it first increases and then decreases (the red curves). Recalling how the distribution changes with respect to $\alpha$ in Figure 5, we found that $\hat{P}(v)$ first approximates $P(v)$ and is then smoothed. Therefore, the patterns with turning points are a good reflection of this trend. Summing over those components, we get Figure 7, and take the turning point of the curve as our estimated $\bar{\alpha}$. Here $\bar{\alpha}$ is 1, very close to the optimal value $\alpha_{opt} = 0.8$.
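One possible reading of this selection rule is sketched below. The text does not define "turning point" precisely, so two assumptions are made and should be treated as such: a viewpoint curve counts as "increase then decrease" when its maximum lies strictly inside the sampled range of $\alpha$, and the turning point of the summed curve is taken to be its maximum. The function name `select_alpha` and its inputs are hypothetical.

```python
def select_alpha(alphas, stable_priors):
    """Pick alpha at the peak of the summed rise-then-fall viewpoint curves.

    alphas: increasing list of candidate values.
    stable_priors[k][v]: converged prior estimate for pose v at alphas[k].
    Returns None when no viewpoint curve has an interior maximum.
    """
    num_poses = len(stable_priors[0])
    last = len(alphas) - 1
    summed = [0.0] * len(alphas)
    found = False
    for v in range(num_poses):
        curve = [row[v] for row in stable_priors]
        peak = max(range(len(curve)), key=curve.__getitem__)
        if 0 < peak < last:  # assumption: interior maximum = rise then fall
            summed = [s + c for s, c in zip(summed, curve)]
            found = True
    if not found:
        return None
    best = max(range(len(summed)), key=summed.__getitem__)
    return alphas[best]


# Synthetic curves: pose 0 rises, pose 1 falls, pose 2 rises then falls.
alphas = [0.1, 0.5, 1.0, 2.0, 4.0]
stable_priors = [
    [0.1, 0.80, 0.10],
    [0.2, 0.60, 0.20],
    [0.3, 0.40, 0.30],
    [0.4, 0.35, 0.25],
    [0.5, 0.30, 0.20],
]
print(select_alpha(alphas, stable_priors))
```

The appeal of such a rule is that it needs no labels: it looks only at how the converged prior estimate responds to $\alpha$ on the unlabeled test set.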

Figure 6: $\hat{P}(v_j)$ curve with respect to $\alpha$

Figure 7: Estimated $\alpha$

5 Results

5.1 Classification Performance

In Table 1, our patch-based random forest classification algorithm (denoted RF) shows promising classification results on all three test sets. Under our scheme, random forest achieves 80% accuracy on the clean background real image test set, and 77% on the cluttered background test set. After calibrating the conditional probability $\hat{P}(v|F_i)$ using the automatically selected $\alpha$ (denoted RF_opt), performance on the clean test set is boosted by 8%, and by 2% on the cluttered set. The relatively low improvement on the cluttered test set may be because our assumptions $\hat{P}(F_i|v) = P(F_i|v)$ and $\hat{P}(F_i) = P(F_i)$ are too strong for cluttered images.

| Method | Render | Clean | Cluttered |
|---|---|---|---|
| RF (%) | 96.16 | 80.67 | 76.80 |
| RF_opt (%) | - | 88.90 | 78.70 |
| RF_GT (%) | - | 91.29 | 81.00 |
| Global (%) | 97.03 | 52.64 | 10.90 |

Table 1: Classification accuracy on three test sets

Row "RF_GT" shows the result of calibrating $\hat{P}(v|F_i)$ using the ground truth $P(v)$. Compared to our optimization approach, its accuracy is only 2% higher, indicating the effectiveness of our method.

Besides, the "Global" row shows poor classification performance with global image features. Although it achieves the best result on rendered images, performance drops significantly on real images. One possible explanation is that the global classifier overfits the patch importance learned from the training set, while patch importance on real images is different. Figure 8 supports this hypothesis. In contrast, the patch-based method gives equal importance to all patches, and hence reduces overfitting.

Figure 8: Patch importance on rendered, clean, and cluttered test sets, learned by training a global random forest classifier on the three datasets.

Figure 9 shows the confusion matrices on the three test sets. From left to right, as test difficulty increases, the confusion matrix becomes increasingly scattered. On the rendered image test set, an interesting phenomenon is that some poses are often misclassified as poses that differ from them by 90°; one possible explanation is that some chairs are roughly square in shape. Also, front views and back views are often confused, because they have a similar appearance in feature space.

Figure 9: Confusion matrices on rendered, clean, and cluttered test sets

6 Conclusion

In this paper, we proposed a novel pose estimation approach: learning from 3D models. We explained our model in a Bayesian framework, and proposed a new optimization method to transmit information from 2D images to 3D models. The promising experimental results verified the effectiveness of our scheme.

7 Future Work

We have several ideas for future work:

• Take the foreground and background information in the image into consideration, and fully utilize the information in rendered images.

• Further model the differences between the three datasets, and revise our inaccurate assumptions.

• Learn the discriminativeness of patches, and give different weights to different patches.

• Generalize our algorithm to occluded images and to other object categories.

References

[1] Breiman, Leo. "Random forests." Machine Learning 45.1 (2001): 5-32.

[2] Dalal, Navneet, and Bill Triggs. "Histograms of oriented gradients for human detection." Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. Vol. 1. IEEE, 2005.

[3] Deng, Jia, et al. "ImageNet: A large-scale hierarchical image database." Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009.

[4] Pedregosa, Fabian, et al. "Scikit-learn: Machine learning in Python." The Journal of Machine Learning Research 12 (2011): 2825-2830.

[5] Su, Hao, Qixing Huang and Leonidas Guibas. "Shapenet."

[6] Lepetit, Vincent, Julien Pilet, and Pascal Fua. "Point matching as a classification problem for fast and robust object pose estimation." Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on. Vol. 2. IEEE, 2004.

[7] Haralick, Robert M., et al. "Pose estimation from corresponding point data." Systems, Man and Cybernetics, IEEE Transactions on 19.6 (1989): 1426-1446.

[8] Gold, Steven, et al. "New algorithms for 2D and 3D point matching: pose estimation and correspondence." Pattern Recognition 31.8 (1998): 1019-1031.

[9] Savarese, Silvio, and Fei-Fei Li. "3D generic object categorization, localization and pose estimation." ICCV. 2007.

[10] Sun, Min, et al. "A multi-view probabilistic model for 3D object classes." Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009.

[11] Ozuysal, Mustafa, Vincent Lepetit, and Pascal Fua. "Pose estimation for category specific multiview object localization." Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009.

[12] Felzenszwalb, Pedro F., et al. "Object detection with discriminatively trained part-based models." Pattern Analysis and Machine Intelligence, IEEE Transactions on 32.9 (2010): 1627-1645.

[13] Lopez-Sastre, Roberto Javier, Tinne Tuytelaars, and Silvio Savarese. "Deformable part models revisited: A performance evaluation for object category pose estimation." Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on. IEEE, 2011.

[14] Gu, Chunhui, and Xiaofeng Ren. "Discriminative mixture-of-templates for viewpoint classification." Computer Vision – ECCV 2010. Springer Berlin Heidelberg, 2010. 408-421.

[15] Ghodrati, Amir, Marco Pedersoli, and Tinne Tuytelaars. "Is 2D Information Enough For Viewpoint Estimation?" Proceedings of the British Machine Vision Conference. BMVA Press. Vol. 2. No. 5. 2014.

[16] Redondo-Cabrera, Carolina, Roberto López-Sastre, and Tinne Tuytelaars. "All together now: Simultaneous object detection and continuous pose estimation using a Hough forest with probabilistic locally enhanced voting." Proceedings BMVC 2014 (2014): 1-12.

[17] Su, Hao, et al. "Learning a dense multi-view representation for detection, viewpoint classification and synthesis of object categories." Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009.

[18] Lowe, David G. "Distinctive image features from scale-invariant keypoints." International Journal of Computer Vision 60.2 (2004): 91-110.

[19] Pepik, Bojan, et al. "3D2PM - 3D deformable part models." Computer Vision – ECCV 2012. Springer Berlin Heidelberg, 2012. 356-370.

[20] Zia, M. Zeeshan, et al. "Detailed 3D representations for object recognition and modeling." Pattern Analysis and Machine Intelligence, IEEE Transactions on 35.11 (2013): 2608-2623.

[21] Hejrati, Mohsen, and Deva Ramanan. "Analyzing 3D objects in cluttered images." Advances in Neural Information Processing Systems. 2012.