
Hierarchical Feature Grouping for Multiple Object Segmentation and Tracking

Moonsub Byeon
Department of Electrical Engineering and Computer Science, ASRI
Seoul National University, Seoul, Korea
msbyeon@snu.ac.kr

Hyung Jin Chang
Department of Electrical Engineering and Computer Science, ASRI
Seoul National University, Seoul, Korea
changhj@snu.ac.kr

Jin Young Choi
Department of Electrical Engineering and Computer Science, ASRI
Seoul National University, Seoul, Korea
jychoi@snu.ac.kr

ABSTRACT

In this paper, we propose a hierarchical feature grouping method for multiple object segmentation and tracking. The proposed method aims to segment and track objects at the object level without prior knowledge about the scene or objects. We first group the motion features into region-level groups using the proposed region features, which represent homogeneous regions within an object. Object-level groups are then obtained by clustering the region-level groups based on foreground information and motion similarity. To find optimal object-level groups, we formulate an energy minimization problem, design its objective functions, and solve it using simulated annealing (SA). Through this hierarchical feature grouping, the proposed method efficiently segments and tracks various kinds of objects without an object detector, 3D model, or geometry information. Experimental results on several video clips show that our approach robustly segments and tracks multiple objects regardless of camera position and object class.

Categories and Subject Descriptors

I.4 [Image Processing and Computer Vision]: Scene Analysis

Keywords

multiple object segmentation, feature clustering, multiple object tracking, KLT, MSER

1. INTRODUCTION

Multiple object segmentation and tracking aims to localize multiple objects at the object level and track them over time. Object-level segmentation and tracking is required for high-level video surveillance tasks such as trajectory analysis and abnormal behavior detection. However, it is still a challenging problem to reliably segment and track interesting objects in all kinds of surveillance scenes. Various kinds of objects (people, cars, bicycles, and buses) may be captured in a single scene. The size and shape of an object also change dynamically depending on the position of the camera. Moreover, object occlusion makes it difficult to segment and track multiple objects robustly.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
IVCNZ '12, November 26-28, 2012, Dunedin, New Zealand
Copyright 2012 ACM 978-1-4503-1473-2/12/11 ...$15.00.

For decades, many multiple object segmentation and tracking algorithms have been proposed [1, 3, 12, 2, 13, 14, 10]. The algorithms can be roughly divided into two approaches: top-down and bottom-up. The top-down approach [12, 1] is based on an object detector. Recently, this approach has become popular thanks to improvements in the performance and speed of object detectors. This approach (called tracking-by-detection) segments multiple objects with the object detector and tracks individual objects by associating the detection results based on similarities. In particular, pedestrian detectors have been widely used for indoor people surveillance [12, 1]. This approach, however, cannot be applied to general surveillance situations, because an object detector for each class is required, and object deformation makes it hard to guarantee detection accuracy.

Bottom-up approaches [3, 2, 13, 14, 10] use multiple cues to segment and track objects from pixel-level data such as a foreground map and motion features. The feature grouping and tracking approach is based on corner features (e.g., KLT [9]) extracted and tracked frame by frame. Clustering the corner features at the object level and labeling each cluster based on the tracked features solve the segmentation and tracking problems simultaneously [3, 2, 13]. Yang et al. [13] proposed a motion feature grouping algorithm for vehicle detection and tracking; the objective function is formulated as a MAP problem and optimized by MCMC sampling. Kim [3] grouped corner features with a multi-level grouping algorithm combined with background subtraction in real time. Kanhere et al. [2] estimated the real 3D world coordinates of motion features using multi-level homography and grouped these features under a low-angle camera.

However, many of the previous bottom-up methods use scene-specific or object-specific cues, which have to be set differently in every new place and do not work well on a new class of objects. In other words, most methods need prior knowledge about the scene and objects, such as geometry information (e.g., camera calibration or homography) [2, 13, 10, 3], semi-supervised parameter learning [3], and 3D object models [13, 10]. Such information is not easy to obtain automatically and is not well suited to segmenting and tracking

feature grouping, such as the foreground map, region features, and motion feature trajectories. The extracted region features represent stable homogeneous regions where edges and corners are rarely detected. These region features become a scene-independent cue for motion feature grouping, because a homogeneous region usually belongs to a single object. We present an efficient method to find these region features.

Region-level grouping forms groups of motion features associated with homogeneous regions. Motion features near a given region feature are considered to belong to the same region-level group. We define a similarity between motion features based on these region features and find region-level groups with normalized cuts [8] applied to these similarities.

Object-level grouping is done by clustering the region-level groups that lie in the same foreground region and move with similar velocity. For object-level grouping, a cost function is defined using the foreground map and motion feature trajectories. By minimizing this cost function with simulated annealing (SA), the object-level groups are obtained.

2.1 Feature-level Processing

To group the motion features in a bottom-up fashion, we use three grouping cues: the foreground map, motion feature trajectories, and region features. At each frame, an adaptive background modeling method [11] is used to compute the foreground map, a binary image separating foreground pixels from background pixels. We use KLT feature points [9] as motion features and propose region features based on MSER [5, 6]. The motion features are tracked at each frame to obtain their trajectories. For computational efficiency, motion features and region features are extracted only in the foreground region.
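As a concrete illustration of the foreground-map cue, the following sketch uses a running-average background model with a fixed threshold. This is a deliberately simplified stand-in for the adaptive mixture model of [11] (the function name and parameters `alpha` and `thresh` are our own, not from the paper):

```python
import numpy as np

def foreground_mask(frame, bg, alpha=0.05, thresh=25):
    """Simplified stand-in for adaptive background modeling [11]:
    a running-average background with a fixed difference threshold.
    Returns the updated background model and a binary foreground map."""
    diff = np.abs(frame.astype(np.float64) - bg)
    mask = diff > thresh                   # foreground where the frame deviates
    bg = (1 - alpha) * bg + alpha * frame  # slowly adapt the background
    return bg, mask

# toy example: static dark background with a bright 2x2 blob
bg = np.zeros((6, 6))
frame = np.zeros((6, 6))
frame[2:4, 2:4] = 200.0
bg, mask = foreground_mask(frame, bg)
print(mask.sum())  # 4 foreground pixels
```

Motion features and region features would then be extracted only where `mask` is true, as described above.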

Figure 1: Overall scheme of the proposed method.

MSERs express stable homogeneous regions in the image. They are extremal regions, meaning that all pixels inside the region have greater (MSER+) or smaller (MSER-) intensity values than the pixels on the region's boundary. In addition, the size of the region remains stable over a range of thresholds. Figure 2 shows MSERs extracted from occluded objects. Not all of these MSERs, however, give helpful grouping cues, because they vary in size and are not extracted at a single object level. To select informative MSERs that provide grouping cues for object-level clustering, we use the component tree.

The component tree is a data structure that efficiently represents the inclusion relationships of MSERs. It is a rooted, connected tree and can be built for any image that consists of a single connected region. Each node of the component tree is an extremal region of the input image, and its edges express the inclusion relationships between connected regions. For example, Figure 2 shows the relationship between the component tree and MSERs. An MSER is a node of the component tree whose size does not change significantly as the threshold g varies.

The MSERs located at leaf nodes of the component tree, which we call Leaf-MSERs, provide effective grouping cues. A Leaf-MSER has the following properties:

1. A Leaf-MSER has the same properties as an MSER, because the Leaf-MSERs are a subset of the MSERs.

2. A Leaf-MSER+ (Leaf-MSER-) has no child MSER located at a child node of the component tree.

3. Leaf-MSER+s (Leaf-MSER-s) are pairwise disjoint.

Note: Leaf-MSER+s can be detected from the MSER+s and the component tree of the input image; likewise, Leaf-MSER-s are detected from the MSER-s and the component tree of the inverted input image.

Figure 2: Component tree of MSERs.

Both MSER+s and MSER-s, together with their component trees, can be calculated in linear time using [6]. The Leaf-MSERs are then obtained easily from this information.
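The Leaf-MSER selection above can be sketched on a toy component tree. Here extremal regions are represented as sets of pixel indices nested by inclusion, and a leaf is a region that contains no other region; the helper name `leaf_regions` is ours, not the paper's:

```python
def leaf_regions(regions):
    """Given extremal regions as sets of pixels (nested by inclusion, as in
    a component tree), return the leaves: regions containing no other region.
    Property 3 (pairwise disjointness of leaves) is checked at the end."""
    leaves = [r for r in regions
              if not any(s < r for s in regions)]  # strict subset = descendant
    for i, a in enumerate(leaves):
        for b in leaves[i + 1:]:
            assert a.isdisjoint(b)  # leaves of a component tree never overlap
    return leaves

# toy nested extremal regions: root contains A and B, and A contains A1
root = frozenset(range(10))
A = frozenset({0, 1, 2, 3})
A1 = frozenset({0, 1})
B = frozenset({6, 7})
print(sorted(map(len, leaf_regions([root, A, A1, B]))))  # [2, 2]: A1 and B
```

Only `A1` and `B` survive as leaves; `root` and `A` are discarded because they contain other extremal regions, mirroring property 2.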

to a homogeneous region in a single object, as shown in Figure 3(c). We use this regional property as a cue for grouping motion features. We define the distance between a motion feature f and a Leaf-MSER S as

d(f, S) = \min_{p \in S} \| f - p \|_2.  (1)
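Eq. (1) is simply the distance from the feature to the nearest pixel of the region. A minimal sketch (the function name `dist_to_region` and the coordinate layout are our assumptions):

```python
import numpy as np

def dist_to_region(f, S):
    """Eq. (1): minimum Euclidean distance from motion feature f (x, y)
    to any pixel p in Leaf-MSER S, given as an (n, 2) coordinate array."""
    S = np.asarray(S, dtype=float)
    f = np.asarray(f, dtype=float)
    return np.min(np.linalg.norm(S - f, axis=1))

# feature at (5, 5); region S occupies pixels (0,0)..(1,1)
S = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(dist_to_region((5, 5), S))  # sqrt(32): nearest pixel is (1, 1)
```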

Motion features that share the same nearest Leaf-MSER are clustered together. However, applying this rule directly gives unstable groupings, because the repeatability of Leaf-MSERs is not guaranteed, as shown in Figure 4: a Leaf-MSER may merge, split, or even disappear. To avoid this unstable clustering, we calculate the probability of similarity between features within a temporal sliding window. The similarity between motion features f_i and f_j is defined as

W(i, j) = \frac{\#SameLeafMSER(f_i, f_j)}{\min(|f_i|, |f_j|)},  (2)

where \#SameLeafMSER(f_i, f_j) is the number of frames in which the two motion features f_i and f_j fall into the same Leaf-MSER, and |f_i|, |f_j| are the trajectory lengths of f_i and f_j.
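The window-based similarity of Eq. (2) can be sketched as follows. Each feature's trajectory is represented as a list of per-frame Leaf-MSER ids (`None` when the feature hit no Leaf-MSER that frame); this encoding is our assumption for illustration:

```python
def similarity(assignments_i, assignments_j):
    """Eq. (2): fraction of shared frames in which two motion features fall
    into the same Leaf-MSER, normalized by the shorter trajectory length."""
    same = sum(1 for a, b in zip(assignments_i, assignments_j)
               if a is not None and a == b)
    return same / min(len(assignments_i), len(assignments_j))

# two features sharing a Leaf-MSER in 3 of 4 overlapping frames
fi = [1, 1, 2, 1]
fj = [1, 1, 1, 1, 1]       # longer trajectory
print(similarity(fi, fj))  # 3 / min(4, 5) = 0.75
```

Averaging over the sliding window in this way smooths out frames where a Leaf-MSER merges, splits, or disappears.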

Normalized cuts [8] is applied to segment the region-level groups. It clusters motion features connected by high similarity W(i, j). A graph is built for the normalized cuts: each node represents a motion feature, and each edge carries the similarity W(i, j) between f_i and f_j. Normalized cuts finds the graph cut that minimizes

Ncut(A, B) = \frac{cut(A, B)}{assoc(A, V)} + \frac{cut(A, B)}{assoc(B, V)},  (3)

cut(A, B) = \sum_{a \in A, b \in B} W(a, b), \qquad assoc(A, V) = \sum_{a \in A, v \in V} W(a, v),

Figure 3: Region-level grouping. (a) Leaf-MSERs. (b) Motion features. (c) Groups clustered by Leaf-MSER. (d) Final region-level result.

Figure 4: Example of feature grouping failure (Leaf-MSER).

where A and B are a disjoint partition of the graph vertices V.

The graph partitioning continues until the Ncut value exceeds a threshold or the number of groups equals the number of Leaf-MSERs. In this way, normalized cuts groups together features connected by high edge weights: one region-level group contains the features that frequently fall into the same Leaf-MSER.
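The Ncut value of Eq. (3) for a candidate bipartition can be computed directly from the weight matrix. A small sketch (the function name and matrix layout are our assumptions; a real implementation would use the spectral relaxation of [8] rather than enumerating cuts):

```python
import numpy as np

def ncut(W, A):
    """Eq. (3): normalized-cut value of the bipartition (A, V \\ A) of a
    graph with symmetric weight matrix W; A is an array of node indices."""
    n = W.shape[0]
    B = np.setdiff1d(np.arange(n), A)
    cut = W[np.ix_(A, B)].sum()       # weight crossing the cut
    assoc_A = W[A, :].sum()           # total weight touching A
    assoc_B = W[B, :].sum()           # total weight touching B
    return cut / assoc_A + cut / assoc_B

# two tight clusters {0,1} and {2,3} joined by one weak edge
W = np.array([[0.0, 1.0, 0.1, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.1, 0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
print(round(ncut(W, np.array([0, 1])), 3))  # 0.095: the natural cut is cheap
```

Splitting along the weak edge gives a much lower Ncut than any cut through a tight cluster, which is why the recursion stops once every remaining cut exceeds the threshold.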

2.3 Object-level Grouping

To merge region-level clusters into object-level groups, we find an object-level configuration C. The object-level configuration specifies which region-level clusters should be grouped into each object-level group. This can be understood through the concept of a partition: an object-level configuration C is a partition of the region-level groups L satisfying

\forall C_i \in C: \; C_i \subset L, \quad C_i = \{ L_{k_1}, \ldots, L_{k_{n_i}} \},  (4)

\forall C_i, C_j \in C: \; C_i \cap C_j = \emptyset, \qquad \bigcup_{C_k \in C} C_k = L,  (5)

where n_i is the number of region-level groups involved in C_i. We aim to find the optimal object-level configuration C* by solving the energy minimization problem

C^* = \arg\min_C E(C) = \arg\min_C \left( E_f(C) + \alpha \cdot E_m(C) \right),  (6)

where E_m is the motion cost function (described in Sec. 2.3.2) and E_f the foreground cost function (described in Sec. 2.3.1). We define each cost function in the following subsections.

2.3.1 Foreground cost function

The foreground cost function measures how exactly the current object-level groups express the foreground region. In other words, the more foreground pixels lie within the same object-level group, the lower that group's cost. Likewise, the

Figure 5: Example of the foreground cost function. The object-level configuration C = {C_1, C_2, C_3} consists of C_1 = {L_1, L_3, L_5, L_9}, C_2 = {L_2, L_6, L_7}, and C_3 = {L_4, L_8}. (a) Delaunay triangulation. (b) Inner edges shown in red and outer edges in blue. Edge e_{L_3,L_7} covers many foreground pixels, so its cost E_f(e_{L_3,L_7}) is high. The cost E_f(e_{L_2,L_4}) is low because edge e_{L_2,L_4} covers many background pixels.

more background pixels lie between different object-level groups, the lower the cost.

The details of the foreground cost computation are as follows; Figure 5 shows an example.

1. The centers of the region-level groups are connected by Delaunay triangulation [7]. The center of a region-level group is the mean of the Euclidean coordinates of its motion features.

2. Depending on the object-level configuration C, the edges of the Delaunay triangulation are divided into inner and outer edges of an object.

3. The foreground cost is calculated by summing the edge costs defined by

E_f(C) = \sum_{C_i, C_j \in C} \; \sum_{L_p \in C_i, L_q \in C_j} E_f(e_{L_p, L_q}),  (7)

and the cost of each edge is defined by

E_f(e_{L_p, L_q}) =
\begin{cases}
w_B N_B(e_{L_p, L_q}) - w_F N_F(e_{L_p, L_q}) & \text{if } L_p, L_q \in C_i \\
w_F N_F(e_{L_p, L_q}) - w_B N_B(e_{L_p, L_q}) & \text{otherwise,}
\end{cases}  (8)

where e_{L_p, L_q} is the edge between L_p and L_q, N_B(·) and N_F(·) are the numbers of background and foreground pixels on the edge, and w_B and w_F denote the weighting factors for background and foreground pixels. We fix w_B = 1 and w_F = 10 in the experiments.
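The per-edge cost of Eq. (8) is easy to check on toy numbers. In this sketch (function name and argument names are our assumptions), `same_object` marks an inner edge:

```python
def edge_cost(n_bg, n_fg, same_object, w_b=1.0, w_f=10.0):
    """Eq. (8): cost of one Delaunay edge from its background (n_bg) and
    foreground (n_fg) pixel counts, with the paper's weights w_B=1, w_F=10.
    Inner edges are penalized for background pixels, outer edges for
    foreground pixels."""
    if same_object:
        return w_b * n_bg - w_f * n_fg
    return w_f * n_fg - w_b * n_bg

# an inner edge crossing mostly foreground is cheap (negative cost) ...
print(edge_cost(n_bg=2, n_fg=20, same_object=True))   # 2 - 200 = -198
# ... while an outer edge crossing the same pixels is expensive
print(edge_cost(n_bg=2, n_fg=20, same_object=False))  # 200 - 2 = 198
```

Summing these edge costs over all pairs, as in Eq. (7), drives the minimization toward keeping foreground-covered edges inside objects and background-covered edges between them.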

2.3.2 Motion cost function

The motion cost function is based on motion similarity. We assume that region-level groups in the same object-level group usually move with similar velocity, whereas those in different object-level groups move with different velocities. We design the cost function to be high when the average velocity of a region-level group differs greatly from those of the other region-level groups in the same object-level group, or when its velocity is close to those of region-level groups in different

Figure 6: Edges associated with L_1 for the motion cost function. If the average velocity of the red features is similar to that of the blue features in the same object, the cost becomes low. On the other hand, if the average velocity of the red features is much different from that of the blue features in a different object, the cost also becomes low.

object-level groups. The cost function is defined as

E_m(C) = \sum_{C_i, C_j \in C} \; \sum_{L_p \in C_i, L_q \in C_j} E_m(L_p, L_q),  (9)

where

E_m(L_p, L_q) =
\begin{cases}
\| v_{L_p} - v_{L_q} \| & \text{if } L_p, L_q \in C_i \\
-\| v_{L_p} - v_{L_q} \| & \text{if } L_p \in C_i, L_q \in C_j, i \neq j,
\end{cases}  (10)

and v_{L_i} is the average velocity of region-level group L_i, i.e., the average velocity of the motion features involved in L_i. For example, Figure 6 shows the edges associated with L_1 when calculating the motion cost.
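The pairwise term of Eq. (10) can be sketched directly (function and argument names are ours): velocity differences are penalized inside an object and rewarded across objects, so similar-moving groups are pulled together by the minimization.

```python
import numpy as np

def motion_cost(v_p, v_q, same_object):
    """Eq. (10): pairwise motion cost between the average velocities of two
    region-level groups (2D velocity vectors)."""
    d = np.linalg.norm(np.asarray(v_p, float) - np.asarray(v_q, float))
    return d if same_object else -d

# two groups moving in opposite directions:
print(motion_cost([3, 0], [-2, 0], same_object=True))   # 5.0: costly inside one object
print(motion_cost([3, 0], [-2, 0], same_object=False))  # -5.0: cheap across objects
```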

2.3.3 Energy minimization by Simulated Annealing

We find the optimal object-level configuration C* using simulated annealing (SA) [4]. SA searches the solution space for an optimal solution in a manner similar to random sampling. To search the solution space efficiently, the search operations must be defined effectively. Given the current object-level configuration C, a candidate configuration C' is obtained by one of the following three operations:

1. Merge: Randomly select two neighboring object-level groups C_i, C_j \in C and merge them into a new group, i.e., C_k = C_i \cup C_j and C' = C \setminus \{C_i, C_j\} \cup \{C_k\}.

2. Split: Randomly select one object-level group C_k \in C and split it into two connected object-level groups.

3. Transfer: Randomly select two neighboring object-level groups C_i, C_j \in C and transfer one leaf-level cluster L_k from one to the other, i.e.,

C' =
\begin{cases}
C \setminus \{C_i, C_j\} \cup \{C_i \setminus \{L_k\}, C_j \cup \{L_k\}\} & \text{if } L_k \in C_i \\
C \setminus \{C_i, C_j\} \cup \{C_j \setminus \{L_k\}, C_i \cup \{L_k\}\} & \text{if } L_k \in C_j.
\end{cases}

By iteratively applying these three operations under the simulated annealing strategy, we find a solution minimizing the energy function, which is expected to be a near-optimal object-level configuration C*.
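The three moves and the SA loop can be sketched on a toy partition. Everything below is an illustrative sketch under our own assumptions, not the authors' exact schedule: neighborhood structure is ignored (any two parts may interact), the split chooses a random half rather than a connected one, and the cooling schedule is arbitrary. The energy callback would be Eq. (6) in the real system; here a toy energy rewards two known cliques of region-level groups.

```python
import math
import random

def propose(config, rng):
    """One merge/split/transfer move (Sec. 2.3.3) on a partition `config`,
    given as a list of disjoint non-empty sets of region-level group ids."""
    config = [set(c) for c in config]  # work on a copy
    move = rng.choice(["merge", "split", "transfer"])
    if move == "merge" and len(config) > 1:
        i, j = rng.sample(range(len(config)), 2)
        config[i] |= config[j]
        del config[j]
    elif move == "split" and any(len(c) > 1 for c in config):
        k = rng.choice([i for i, c in enumerate(config) if len(c) > 1])
        part = set(rng.sample(sorted(config[k]), len(config[k]) // 2))
        config[k] -= part
        config.append(part)
    elif move == "transfer" and len(config) > 1:
        i, j = rng.sample(range(len(config)), 2)
        x = rng.choice(sorted(config[i]))
        config[i].discard(x)
        config[j].add(x)
        config = [c for c in config if c]  # drop emptied parts
    return config

def anneal(config, energy, steps=2000, t0=2.0, cooling=0.995, seed=1):
    """Minimize `energy` (e.g. Eq. (6)) over configurations with SA [4]."""
    rng, t = random.Random(seed), t0
    best = cur = config
    for _ in range(steps):
        cand = propose(cur, rng)
        d = energy(cand) - energy(cur)
        if d < 0 or rng.random() < math.exp(-d / t):  # Metropolis acceptance
            cur = cand
            if energy(cur) < energy(best):
                best = cur
        t *= cooling  # geometric cooling
    return best

# toy energy: ids 0-2 belong together, ids 3-5 belong together
def toy_energy(config):
    pairs = {(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)}
    cost = 0
    for c in config:
        for a in c:
            for b in c:
                if a < b:
                    cost += -1 if (a, b) in pairs else +2
    return cost

best = anneal([{0, 1, 2, 3, 4, 5}], toy_energy)
print(toy_energy(best))  # global optimum for this toy energy is -6
```

Because `best` only ever updates on strict improvement, the returned configuration is never worse than the starting one, while the temperature schedule lets early iterations escape local minima.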

Figure 7: Results on our SNUmaingate dataset and on i-LIDS AB and PV. Points indicate motion features in the current frame, and bounding boxes are the final segmentation and tracking results.

3. EXPERIMENTAL RESULTS

The performance of the proposed method is evaluated on our SNUmaingate sequence and on the well-known i-LIDS AB and PV video sequences.1 Feature-level processing is implemented in C++, and the region-level and object-level grouping in MATLAB, on an Intel Core i5-2500 3.3 GHz CPU with 2 GB RAM. The OpenCV library2 is used to extract and track KLT feature points [9] and to compute MSERs [5]. Feature-level processing takes 30 to 40 ms per 360×240 frame.

Performance analysis. The results of the proposed method are shown in Figure 7. Our method efficiently segments and tracks each object at the object level without prior knowledge about the scene or objects. This is because the region features have scene-independent and object-independent properties, representing a homogeneous region of an object. The SNUmaingate sequence is challenging because various kinds of objects appear, such as people, cars, buses, and bikes. Though a bus can be over ten times the size of a person, each object is successfully segmented and tracked. Since the i-LIDS AB sequence is captured with a low-angle camera, a short distance between motion features in the image plane does not always mean a short distance in real-world coordinates, which makes it hard to group the motion features at the object level. For this reason, other methods need prior knowledge about the scene, such as geometry information [2, 13, 3]. However,

1 https://www.wendangku.net/doc/c215271229.html/~andrea/avss2007_d.html
2 https://www.wendangku.net/doc/c215271229.html/downloads.html

our method can successfully group the motion features by using the scene-independent region features. The i-LIDS PV sequence is an outdoor road surveillance scene with large variation in object size. This makes segmentation and tracking difficult, because the appropriate distance between features in the same object varies significantly with position in the video. Our method successfully segments each car regardless of its distance from the camera.

Quantitative evaluation and convergence. We use the SNUmaingate sequence for quantitative evaluation. Ground truth is manually labeled every 20 frames, giving 213 labeled frames and 914 labeled objects of different classes (people, cars, bicycles, and buses). By comparing the ground truth with the resulting bounding boxes on a frame-by-frame basis, we calculate the precision and recall shown in Table 1. Figure 8 shows an example of how the object configuration changes as the iterations proceed, and Figure 9 shows the corresponding energy value of the simulated annealing minimization at each iteration. In this example, convergence takes only hundreds of iterations because we use region-level groups as the smallest object-level unit.
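The frame-by-frame comparison against ground-truth boxes can be sketched as follows. The paper does not spell out its matching rule, so the greedy one-to-one matching at a fixed IoU threshold below is our assumption, as are the function names:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def precision_recall(results, truth, thr=0.5):
    """Greedy one-to-one matching of result boxes to ground-truth boxes
    at IoU threshold `thr`; returns (precision, recall) for one frame."""
    tp = 0
    unmatched = list(truth)
    for r in results:
        hit = next((g for g in unmatched if iou(r, g) >= thr), None)
        if hit is not None:
            tp += 1
            unmatched.remove(hit)  # each ground-truth box matches at most once
    return tp / len(results), tp / len(truth)

res = [(0, 0, 10, 10), (50, 50, 60, 60), (90, 0, 100, 10)]  # one false positive
gt = [(1, 1, 11, 11), (50, 50, 60, 60)]
p, r = precision_recall(res, gt)
print(round(p, 2), round(r, 2))  # 0.67 1.0
```

Aggregating true positives over all 213 labeled frames before dividing would give the sequence-level precision and recall of Table 1.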

4. CONCLUSION

A new multiple object segmentation and tracking method based on feature tracking and grouping has been proposed. We introduced a hierarchical feature grouping method combining background subtraction, region features, and motion trajectories. By grouping the features into homogeneous region-level groups and forming object-level groups by simulated annealing, object-level segmentation was successfully achieved. In consequence, the proposed approach can segment and track various kinds of objects without prior knowledge such as an object detector, 3D model, or geometry information.

Figure 8: Examples of the energy minimization process using simulated annealing (21st, 55th, 87th, and 197th iterations).

5. ACKNOWLEDGMENTS

This research was supported by the SNU Brain Korea 21 Information Technology program and the IT R&D program of MKE & KEIT [10041610, The development of the recognition technology for user identity, behavior and location that has a performance approaching recognition rates of 99% on 30 people by using perception sensor network in the real environment].

6. REFERENCES

[1] M. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool. Robust tracking-by-detection using a detector confidence particle filter. In IEEE 12th International Conference on Computer Vision, pages 1515-1522, 2009.

[2] N. Kanhere, S. Pundlik, and S. Birchfield. Vehicle segmentation and tracking from a low-angle off-axis camera. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 1152-1157, 2005.

[3] Z. Kim. Real time object tracking based on dynamic feature grouping with background subtraction. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1-8, 2008.

[9] J. Shi and C. Tomasi. Good features to track. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 593-600, 1994.

[10] X. Song and R. Nevatia. Detection and tracking of moving vehicles in crowded scenes. In IEEE Workshop on Motion and Video Computing (WMVC'07), page 4, 2007.

[11] C. Stauffer and W. Grimson. Adaptive background mixture models for real-time tracking. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 246-252, 1999.

[12] B. Yang and R. Nevatia. An online learned CRF model for multi-target tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2034-2041, 2012.

[13] J. Yang, Y. Wang, G. Ye, A. Sowmya, B. Zhang, and J. Xu. Feature clustering for vehicle detection and tracking in road traffic surveillance. In 16th IEEE International Conference on Image Processing (ICIP), pages 1145-1148, 2009.

[14] J. Yu, D. Farin, and H. Loos. Multi-cue based visual tracking in clutter scenes with occlusions. In Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance, pages 158-163, 2009.
