
Multiple-Human Tracking by Iterative Data Association and Detection Update

Lu Wang, Nelson Hon Ching Yung, Senior Member, IEEE, and Lisheng Xu, Member, IEEE

Abstract—Multiple-object tracking is an important task in automated video surveillance. In this paper, we present a multiple-human-tracking approach that takes the single-frame human detection results as input and associates them to form trajectories while improving the original detection results by making use of reliable temporal information in a closed-loop manner. It works by first forming tracklets, from which reliable temporal information is extracted, and then refining the detection responses inside the tracklets, which also improves the accuracy of the tracklets' quantities. After this, local conservative tracklet association is performed, and reliable temporal information is propagated across tracklets so that more detection responses can be refined. The global tracklet association is done last to resolve association ambiguities. Experimental results show that the proposed approach improves both the association and detection results. Comparison with several state-of-the-art approaches demonstrates the effectiveness of the proposed approach.

Index Terms—Data association, detection update, multiple-human tracking, video surveillance.

I. INTRODUCTION

AUTOMATED video surveillance of human objects is an important aspect in intelligent transportation systems. For example, pedestrian number estimation at road intersections is vital for the design of an adaptive signal control system [1]. Trajectory data obtained by human tracking serve as the basic input to the studies of pedestrian flows [2], which can be applied to human traffic prediction, transportation infrastructure design, and evacuation control [3]. Human behavior understanding would be helpful for detecting and predicting abnormal/dangerous activities in transit systems such as airports, subway terminals, and train stations [4].

Manuscript received September 9, 2013; revised December 10, 2013; accepted January 15, 2014. Date of publication February 28, 2014; date of current version September 26, 2014. This work was supported in part by the Research Grant Council of the Hong Kong Special Administrative Region, China, under Grant HKU719608E; by the Postgraduate Studentship of The University of Hong Kong; by the National Natural Science Foundation of China under Grant 61202258; and by the National Science and Technology Support Program of China under Grant 2012BAK24B01. The Associate Editor for this paper was K. Wang.

L. Wang is with the College of Information Science and Engineering, Northeastern University, Shenyang 110819, China (e-mail: wanglu@…).

N. H. C. Yung is with the Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong (e-mail: nyung@eee.hku.hk).

L. Xu is with the Sino-Dutch Biomedical and Information Engineering School, Northeastern University, Shenyang 110819, China (e-mail: xuls@bmie.…).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TITS.2014.2303196

This paper deals with the multiple-human-tracking problem, and it is a continuation of our previous work on multiple-human detection [5]. Specifically, object tracking in video surveillance aims at extracting objects' spatial–temporal information, which is mandatory for higher level activity recognition. However, it is not trivial due to difficulties such as low figure–ground contrast, changes in object appearance over time, and abrupt motions. Multiple-object tracking is even more challenging as interobject occlusions exist prevalently, which may cause identity switches or premature trajectory terminations.

Object tracking approaches can be mainly classified into two types, i.e., the bottom-up category free tracking [6]–[9] and the top-down association-based tracking by detection [10]–[12]. Approaches of the first type usually require manual labeling of the region to be tracked in the first frame, whereas approaches of the second type associate detection responses of a pretrained object detector based on appearance similarity, motion consistency, etc. Compared with category free tracking, tracking by detection is fully automatic, and it can effectively avoid the drifting problem caused by accumulated tracking error. In addition, tracking by detection is robust to occasional detection failures, i.e., isolated false alarms or missed detections are less likely to lead to tracking failures. Therefore, tracking by detection is more effective for solving the multiple-object-tracking problem.

In this paper, we present an iterative data association and detection response update approach for multiple-human tracking in surveillance scenarios, with the assumptions that the camera is static, people walk on a ground plane, and camera parameters can be obtained. Unlike most of the previous data association works that only consider how to ensure correct linking, we also attempt to improve the detections, and hence the tracklets, when reliable temporal information can be obtained. To this end, we first generate tracklets by conservative linking of detections and extract the appearance, size, and position information of those reliable detections that show high temporal and spatial consistency. Then, the extracted information is propagated to detections within the tracklets by refining the detections' shape models, which also improves the accuracy of tracklet quantities. After this, local conservative tracklet association based on the Hungarian algorithm [13] is performed so that, in addition to forming longer trajectories, reliable temporal information can be further propagated to improve the accuracy of more detections and tracklets. The iteration stops when there are no new detection updates or new tracklet associations. Finally, the Hungarian algorithm is globally applied to resolve ambiguous situations. The whole process ends when neither update nor association can be performed. The outputs of the approach are the updated detection responses and the associated trajectories.
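The control flow described in this paragraph can be sketched as follows. This is only one reading of the text, not the authors' implementation: the three callables are hypothetical placeholders for the detection update, local association, and global association stages, each assumed to return True when it changed something, and `max_passes` is a safety cap added for the sketch.

```python
def run_tracking(update_detections, associate_locally, associate_globally,
                 max_passes=50):
    """Alternate detection update and association until nothing changes.

    Each argument is a hypothetical callable that mutates shared tracking
    state and returns True if it changed anything. Returns the number of
    outer passes that were executed.
    """
    for passes in range(1, max_passes + 1):
        changed = False
        # Local phase: repeat update + conservative association to a fixed point.
        while True:
            updated = update_detections()
            linked = associate_locally()
            if not (updated or linked):
                break
            changed = True
        # Global phase: resolve the remaining ambiguous links.
        if associate_globally():
            changed = True
        if not changed:
            return passes
    return max_passes
```

The outer loop reflects the assumption that a successful global association may enable further updates, so the local phase is revisited until a full pass makes no change.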

1524-9050 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

In terms of human object representation, our approach belongs to the explicit-shape-model-based approaches, such as [14], where a human body is represented by the combination of several body parts. Our approach differs from [14] in that [14] is performed in an online manner, and there is no differentiation between reliable and unreliable temporal information, whereas our approach is performed by considering multiple frames simultaneously, and we only propagate the temporal information obtained from reliable detections.

In terms of temporal information propagation, our approach belongs to detection guided tracking. References [15] and [16] are examples of this type of tracking approach. Our research differs from them in the target position localization aspect. In [15] and [16], detection responses are used to guide a particle filter, whereas we use a coarse-to-fine model matching approach similar to [17] to incorporate temporal priors naturally, which is the advantage brought by the shape-model-based detection [5].

In terms of data association, our proposed approach is similar to [11], where data association is formulated as a maximum a posteriori (MAP) problem and solved by the Hungarian algorithm. We make three improvements over the approach proposed in [11]. First, we propose a local tracklet association procedure before global association, which is more conservative and less likely to make errors. Second, we use the reliable temporal information to recover missed head or tail parts of tracklets, which shortens the gaps between tracklets, hence enabling associations to be made more robustly and making the resulting tracks more complete. Third, we detect tracklets that may violate the first-order Markov chain assumption and approximate the second-order Markov chain on them, resulting in a reduced number of identity switches.

It should be noted that our approach relies on reliable detections, which are a subset of all the detections, to improve the detection and association performance. If there are no reliable detections, our approach degenerates to an ordinary data association approach.

In summary, our main contributions are threefold:

1) improving the accuracy of human detection by using reliable temporal information;

2) a new iterative hierarchical data association framework;

3) when performing global data association, explicitly detecting tracklets that may violate the first-order Markov chain assumption and approximating the second-order Markov chain on them.

The rest of this paper is organized as follows. In Section II, we review related work about object tracking by detection. Section III describes the proposed iterative data association and detection update approach. In Section IV, we demonstrate the performance of this approach with experimental results on several data sets. Finally, we conclude this paper in Section V.

II. RELATED WORK

Tracking by detection approaches can be mainly divided into two classes, i.e., recursive approaches that perform association between trajectories estimated up to the previous frame and detections in the current frame, and approaches that seek optimal global data association over an extended period of time.

Recursive approaches are suitable for time-critical applications. Wu and Nevatia [10] use a discriminatively trained part-based human detector for detection and apply a greedy detection–trajectory association strategy. Okuma et al. [15] introduce the detection result into the proposal distribution of the particle filter to deal with multiple-hockey-player tracking, where a particle filter is used to jointly represent a varying number of targets. Cai et al. [18] extend Okuma's approach by using independent particle filters for each target. Breitenstein et al. [12] use greedy detection–trajectory association to guide the particle filter, taking into account the continuous detection confidence when formulating the observation model. Li et al. [16] make a compromise between category free tracking and tracking by detection, where each sampling procedure of the proposed cascade particle filter samples a reduced number of particles with a more reliable observation model. Their approach is able to reduce the high computational cost for object detection and meanwhile increase the robustness of visual tracking. For tracking of high-density crowds, in [19], human heads are tracked using particle filters guided by head detections, and a 3-D head plane is estimated online to reduce false alarms.

As recursive tracking approaches cannot make use of future evidence, they tend to make mistakes when the information provided by the past and current frames is ambiguous. If a time delay is allowed, global data association approaches are likely to give better performance as many frames are considered at the same time.

The classical multiple hypothesis tracking (MHT) method defers making the trajectory and detection association decision by expanding each of the current trajectory hypotheses to a set of new hypotheses when new detections become available. To avoid the exponentially growing number of hypotheses, a sliding window is used where a decision has to be made at the rear end of the window. The disadvantage of MHT is that its complexity limits its window size to be very small.

Integer linear programming (ILP) and its related methods have been applied to formulate the global data association problem of multiple target tracking [20]–[24]. Jiang et al. [20] use detections to construct a graph model, and additional nodes are created to handle occluded objects. Then, the data association problem is formulated into an ILP and solved by linear programming (LP) with relaxation. Zhang et al. [21] propose a MAP formulation of multiple people tracking and map it into a cost-flow network, which is solved by iteratively using the minimal cost-flow method. Pirsiavash et al. [25] significantly improve the efficiency of [21] by proposing a greedy algorithm to solve the minimal cost-flow problem. To cope with both missed detections and object localization errors, in [22]–[24], continuous detection confidence is used, whereas the location space is discretized into a finite number of locations to reduce the size of the hypothesis space, with the assumption that two separate trajectories cannot share the same location. To deal with partial occlusion, Izadinia et al. [26] apply the flow network formulation [25] to generate both pedestrian and body part tracklets, where the body part detections are the by-products of the pedestrian detector [27]. Then, the body part tracklets are used to split incorrect pedestrian tracklets and stitch the pedestrian tracklets when occlusion exists.

Fig. 1. Diagram of the proposed data association and detection update approach. In the illustrated subfigures, dark dots represent detections generated by a human detector, red dots represent reliable detections found by the approach, purple dots represent updated detections using reliable temporal information, and solid lines denote tracklets or trajectories.

Markov chain Monte Carlo (MCMC), quadratic Boolean programming (QBP), and gradient descent have also been applied to find the optimal global association of detections. For example, Yu et al. [28] apply data-driven MCMC to find a near-optimal solution efficiently. Benfold and Reid [29] employ MCMC to track multiple heads, where motion is exploited to detect false positive (FP) detections. Leibe et al. [30] propose to perform coupled multiple-object detection and tracking by applying the minimum description length principle, formulate it as a QBP, and solve it by an expectation maximization (EM)-based algorithm. Milan et al. [31] use gradient descent to find strong local minima of a complex nonconvex energy that captures image evidence and various physical constraints for tracking.

As a good compromise between association accuracy and computation complexity, tracklet (i.e., track fragment)-based approaches [11], [32]–[37] have become more and more popular. In these approaches, tracklets are first generated by conservative linking of detections of consecutive frames, which helps reduce the possible linking space significantly. Then, given the affinities between potentially linkable tracklets, the association problem is solved, typically by the Hungarian algorithm [13]. For example, Stauffer [32] associates tracklets using the Hungarian algorithm with an extended transition matrix that considers the likelihood of each tracklet being the initialization and termination of trajectories. This approach performs iterative tracklet association and scene entrance/exit estimation using EM. Perera et al. [33] adapt the Hungarian algorithm to deal with merging and splitting of tracklets in multiple-object tracking when long-time occlusion exists. Both [32] and [33] define the tracklet affinity only once, which may not be accurate enough due to the errors introduced by inaccurate localization in the detection phase. Huang et al. [11] propose a hierarchical data association strategy, in which tracklet affinities are refined whenever new tracklets are formed during the progressive tracklet linking procedure. To further increase the robustness of the affinity measures, a few approaches have been recently proposed. For example, Li et al. [34] propose a HybridBoost algorithm to learn the affinity models between two tracklets. Kuo et al. [35] propose global online discriminative appearance models, where descriptors are predefined, and later they propose to use automatic important feature selection by learning from a large number of local image descriptors [36]. Yang and Nevatia [37] propose to learn nonlinear motion patterns to better explain direction changes.

In this paper, the idea of hierarchical data association proposed in [11] is adopted, and we attempt to improve the association accuracy in the following aspects: 1) strengthening the robustness of tracklet affinity calculation by updating detection responses using the reliable temporal information; 2) reducing the possibility of identity switches by explicitly detecting and dealing with ambiguous tracklets; and 3) allowing incorrectly linked tracklets to break up. Our work does not learn appearance models, motion models, or affinity models so far; however, incorporating these techniques into the proposed framework will definitely improve its performance.

III. PROPOSED APPROACH

In this section, we present the details of the proposed association approach. This section is organized as follows. Section III-A introduces how initial tracklets are generated. Section III-B presents how reliable temporal information is extracted and how it is used to update original detections. Sections III-C and III-D depict the details of local and global data associations, respectively. The diagram of the approach is illustrated in Fig. 1.

A. Human Detection and Tracklet Formation

We first use the foreground segmentation and shape-model-based human detection approach [5] to obtain the detections in each individual frame. Each detection response $r_i$ is represented by a 3-D shape model $m$, whose parameters include the 3-D location $z$, the orientation $o$, the size $s$, and the leg posture $\mathit{pose}$.

Then, given all the detection responses $R = \{r_i\}$, they are conservatively associated to form the tracklets. The affinity $A(r_i, r_j)$ between any two detection responses $r_i$ and $r_j$ is defined as

$$A(r_i, r_j) = \begin{cases} A_{app}(r_i, r_j)\, A_{pos}(r_i, r_j), & \text{if } f_j - f_i = 1 \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

where $A_{app}(r_i, r_j)$ and $A_{pos}(r_i, r_j)$ represent the appearance similarity and position proximity, respectively, and $f_i$ denotes $r_i$'s frame index.

As the human model is part based, to make the appearance model more discriminative, we define a three-part appearance model $a = \{a_{pt} \mid pt = h, t, l\}$ for each detection, where $pt$ denotes the body part; $h$, $t$, and $l$ represent the head, torso, and legs, respectively; and each $a_{pt}$ is an $8 \times 8 \times 8$ RGB color histogram. The appearance affinity is calculated as

$$A_{app}(r_i, r_j) = \frac{\sum_{pt \in \{h,t,l\}} w_{pt} \min(v_{i,pt}, v_{j,pt})\, BC(a_{i,pt}, a_{j,pt})}{\sum_{pt \in \{h,t,l\}} w_{pt} \min(v_{i,pt}, v_{j,pt})} \tag{2}$$

where $v_{i,pt}$ is the visible ratio of part $pt$ of $r_i$, $BC$ calculates the Bhattacharyya coefficient of two histograms, and $w_{pt}$ is the weight for part $pt$. For a human object, as the head and torso are more accurately described by the model than the legs, they should have higher weights than the legs. Therefore, we set $w_h = w_t = 0.4$ and $w_l = 0.2$. Slight changes of these weights make no obvious difference.
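As an illustration of (2), the following is a minimal sketch rather than the authors' code. The dictionary layout of a detection (part name mapped to a visible ratio and a normalized histogram) and the zero return value when no part is jointly visible are assumptions made for the example; the part weights are the ones stated in the text.

```python
import math

# Part weights from the text: head and torso 0.4 each, legs 0.2.
PART_WEIGHTS = {"h": 0.4, "t": 0.4, "l": 0.2}


def bhattacharyya_coefficient(hist_a, hist_b):
    """Bhattacharyya coefficient of two histograms that each sum to 1."""
    return sum(math.sqrt(p * q) for p, q in zip(hist_a, hist_b))


def appearance_affinity(det_i, det_j, weights=PART_WEIGHTS):
    """Visibility-weighted average of per-part Bhattacharyya coefficients.

    Each detection is a hypothetical dict: part -> (visible_ratio, histogram).
    """
    numerator = 0.0
    denominator = 0.0
    for part, w in weights.items():
        v_i, hist_i = det_i[part]
        v_j, hist_j = det_j[part]
        shared = w * min(v_i, v_j)
        numerator += shared * bhattacharyya_coefficient(hist_i, hist_j)
        denominator += shared
    # Assumption: return 0 when no part is visible in both detections.
    return numerator / denominator if denominator > 0 else 0.0
```

Because each term is scaled by the smaller of the two visible ratios, a part that is hidden in either detection contributes nothing to the score, so occluded legs do not penalize an otherwise good match.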

The position proximity of two detection responses is defined in terms of the distance $d$ traversed by a human at a high speed within one time step:

$$A_{pos}(r_i, r_j) = \begin{cases} 1, & \text{if } |z_i - z_j| \le d \\ G\big(\max(|z_i - z_j| - d, 0);\, 0, \sigma_z\big), & \text{otherwise} \end{cases} \tag{3}$$

where $G(\cdot\,; x_0, \sigma_x)$ represents the Gaussian distribution with mean $x_0$ and standard deviation $\sigma_x$. In our experiment, we set $d$ to be 0.7 m when the time step is 0.4 s. Considering that the detected feet position can hardly be accurate if the lower body is occluded, we also calculate the head position affinity $A^{head}_{pos}$, and the final position proximity is taken as $\max(A_{pos}, A^{head}_{pos})$.
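A small sketch of (3) is given below under stated assumptions. The value `sigma_z = 0.5` is a placeholder, since the text does not give it, and the Gaussian is taken to be unnormalized so that the proximity is continuous at distance $d$; neither choice is confirmed by the source. Function and parameter names are illustrative.

```python
import math


def position_proximity(z_i, z_j, d=0.7, sigma_z=0.5):
    """Proximity of two positions (in meters) in the spirit of (3).

    d = 0.7 m is the value from the text; sigma_z = 0.5 is a placeholder.
    Assumption: the Gaussian is unnormalized, so it equals 1 at zero excess.
    """
    dist = math.dist(z_i, z_j)
    if dist <= d:
        return 1.0
    excess = max(dist - d, 0.0)
    return math.exp(-0.5 * (excess / sigma_z) ** 2)


def final_position_proximity(feet_i, feet_j, head_i, head_j, **kwargs):
    """Take the larger of the feet-based and head-based proximities."""
    return max(position_proximity(feet_i, feet_j, **kwargs),
               position_proximity(head_i, head_j, **kwargs))
```

Taking the maximum of the two proximities means that a poorly localized feet position, for example under lower-body occlusion, cannot veto a link that the head position supports.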

Having the affinity values, the two-threshold strategy, as presented in [11], is used to generate the tracklets, i.e., two responses are linked if and only if their affinity is high enough and significantly higher than the affinity of any of their conflicting pairs.
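The two-threshold rule can be sketched as follows. This is an illustrative reading of the sentence above, not the implementation of [11]: the threshold values are placeholders, and a "conflicting pair" is assumed to be any other pair that shares one of the two detections (same row or same column of the affinity matrix).

```python
def two_threshold_links(affinity, theta_high=0.8, theta_margin=0.2):
    """Return (i, j) pairs linked under a two-threshold rule.

    affinity[i][j] is the affinity between detection i in one frame and
    detection j in the next frame. A pair is linked only if its affinity
    exceeds theta_high and beats every conflicting pair (same row or same
    column) by more than theta_margin. Both thresholds are placeholders.
    """
    links = []
    for i, row in enumerate(affinity):
        for j, a_ij in enumerate(row):
            if a_ij <= theta_high:
                continue
            rivals = [row[k] for k in range(len(row)) if k != j]
            rivals += [affinity[k][j] for k in range(len(affinity)) if k != i]
            if all(a_ij - rival > theta_margin for rival in rivals):
                links.append((i, j))
    return links
```

With a clear diagonal such as `[[0.95, 0.1], [0.2, 0.9]]` both pairs are linked, whereas a row like `[0.95, 0.9]` produces no link because neither candidate dominates the other by the required margin.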

B. Reliable Temporal Information Extraction and Detection Response Update

Given the tracklets, reliable temporal information can be extracted from them and used to refine adjacent detections by means of model matching. By reliable temporal information, we mean that the appearance, position, and size information of the corresponding detection is accurate enough, as defined in Section III-B1, so that these pieces of information can be used to guide the update of detections of the same identity in neighboring frames.

Fig. 2. Illustration of reliable detections.

1) Reliable Temporal Information Extraction: To extract reliable temporal information, we first look for the reliable detections according to the following two criteria.

a) The detection's head contour is well aligned with the foreground contour.

b) The detection has high appearance, head position, and feet position affinities with its adjacent detections.

Fig. 2 illustrates some reliable detection responses found by the above criteria.

Next, we refine the models of unoccluded reliable detections by using the temporal information provided by the tracklets. (The occluded ones will be refined once their occluding objects have been refined.) The refined model $M$ for a reliable detection is selected as

$$M = \arg\max_{m(\mathit{pose}, s, o, p_h)} G(s; s_0, \sigma_s)\, G(o; o_0, \sigma_o)\, L_s\big(B(m(\mathit{pose}, s, o, p_h))\big) \tag{4}$$

where $m$ represents the 3-D model; $B(m)$ is the model's boundary on the image; $s_0$ is the average size of the reliable detections within the corresponding tracklet; $o_0$ is the detection's tangential direction in the tracklet; $L_s$ is the shape likelihood measuring how well the model matches the image edges (see [5] for details); and $p_h$ is the original model's head position on the image, which is assumed to be accurate and need not be refined. Fig. 3 shows an example of the best matched model before and after using the temporal information, where the original wrong orientation estimation has been corrected.

Having the updated models for the reliable detections, we then extract the corresponding appearance, size, and position information from them and propagate it to the adjacent frames.

2) Temporal Information Propagation: For an unoccluded detection adjacent to an updated detection in the same tracklet, we refine it by using the reliable temporal information. A detection is considered unoccluded if the corresponding person is fully visible, is occluded only by the image border, or is occluded only by detections that have already been updated. To refine such a detection, we first search for the best head position according to

$$p_h = \arg\max_{p} G(s; s_0, \sigma_s)\, G(o; o_0, \sigma_o)\, A_{app}\big(ubm(s, o, p), r_{ref}\big)\, L_s\big(B(ubm(s, o, p))\big) \tag{5}$$

where $ubm$ represents the upper body model, used to avoid the high computational cost of searching for the optimal leg pose, and $r_{ref}$ is the referenced detection response in the adjacent frame. Then, model matching is performed by

$$M = \arg\max_{m(\mathit{pose}, s, o, p_h)} G(s; s_0, \sigma_s)\, G(o; o_0, \sigma_o)\, A_{app}\big(m(\mathit{pose}, s, o, p_h), r_{ref}\big)\, L_s\big(B(m(\mathit{pose}, s, o, p_h))\big). \tag{6}$$

Fig. 3. Illustration of the best matched models without and with the temporal information. (a) Without temporal information. (b) With temporal information.

The difference between (6) and (4) is the appearance affinity term $A_{app}$ because, in the current stage, we have an appearance model to refer to. To avoid drifting, the updated model is accepted only when it has a higher affinity to the referenced appearance model than the original model.

After obtaining the shape model, the detection's appearance is updated by

$$a = \alpha v\, a_{model} + (1 - \alpha v)\, a_{ref} \tag{7}$$

where $a_{model}$ is the appearance model of the newly obtained shape model; $a_{ref}$ is the referenced appearance model; $v = \{v_{pt} \mid pt = h, t, l\}$, and $v_{pt}$ is the visible ratio of part $pt$; and $\alpha$ is the smoothing factor, which helps avoid large changes of the appearance model caused by incorrect detection update and is set to be 0.2. A body part $pt$ is not updated if $v_{pt}$ is less than 0.5.
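The per-part update in (7) can be sketched as below. The dictionary-of-histograms layout is a hypothetical representation chosen for the example; the smoothing factor 0.2 and the visibility cutoff 0.5 are the values given in the text.

```python
def update_appearance(model_hists, ref_hists, visible, alpha=0.2,
                      min_visible=0.5):
    """Blend new and reference part histograms following (7).

    All three inputs are hypothetical dicts keyed by part ("h", "t", "l").
    A part whose visible ratio is below min_visible keeps the reference
    histogram unchanged.
    """
    updated = {}
    for part, ref in ref_hists.items():
        v = visible[part]
        if v < min_visible:
            updated[part] = list(ref)
            continue
        weight = alpha * v
        updated[part] = [weight * m + (1.0 - weight) * r
                         for m, r in zip(model_hists[part], ref)]
    return updated
```

Since the blend weight is $\alpha v_{pt}$, a fully visible part moves at most 20% of the way toward the new observation, and a half-visible part only 10%, which limits the damage of a single incorrect update.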

3) Occlusion Order Determination: As stated above, our approach requires that a detection can be updated only when it becomes unoccluded. To avoid wrong occlusion order estimation, for two detections whose heads are at a similar horizontal level in the image and whose torsos intersect for only a small percentage (5% is used in our experiment), we consider that they are not mutually occluded. In addition, we assume that, if it can be definitely determined that a detection A occludes another detection B in one frame $f$, it is impossible that B can occlude A in frame $f-1$ and frame $f+1$. In this way, we avoid estimating occlusion order in ambiguous cases.

Furthermore, FPs, as depicted in Fig. 4, may introduce problems when propagating the reliable temporal information because FPs may occlude some true detections and FPs are very unlikely to be updated as they cannot link to any reliable detections. Therefore, we collect the detections that cannot be linked to any other detection responses as candidate FPs. A detection response is allowed to be updated if it is only occluded by candidate FPs. If a candidate FP is found to have a high affinity to an updated detection, it is no longer taken as an FP.

Fig. 4. Illustration of an FP detection response.

4) Recovery From Likely Identity Switches: Identity switches may exist in some tracklets; they are usually caused by occlusions, where accurate detection is difficult. As the detection update proceeds, the renewed detection may deviate farther and farther from the original detection due to the guidance of the reliable temporal information. When the deviation becomes significant, i.e., the intersection ratio between the updated detection and the original detection is small or the appearance affinity between them is not high enough, we suspect an inconsistency. In this situation, we break up the tracklet at that point and look for a possibly better association for the two resulting tracklets.

C. Conservative Local Data Association

After all the possible detection updates have been made inside each tracklet, the tracklets can be associated. However, as only a small portion of detections may have been updated at this stage, the tracklets' link probability may still contain many inaccuracies, making global association of tracklets risky. Therefore, we introduce an intermediate tracklet association step, i.e., local conservative Hungarian linking, which only aims at performing association for tracklets that exhibit high link probability and low ambiguity and, at the same time, have no gaps in between. In addition, if an end of a tracklet has been updated but has no other tracklets to link to, we infer that some object might be missed at the detection stage and use the detection update method as a detector to recover the missed detections.

From here on, we use double subscripts to represent the quantities corresponding to the detections of a tracklet, with the first denoting the index of the tracklet and the second denoting the detection's index inside the tracklet. For example, the $k$th detection of a tracklet $T_i$ is denoted as $r_{i,k}$, where $k \in \{1, 2, \ldots, |T_i|\}$. We also denote by $T^{end}_{f} = \{T_i\}$ the set of all the tracklets that end at frame $f$ and by $T^{start}_{f+1} = \{T_j\}$ the set of all the tracklets that start at frame $f+1$.

Fig. 5. Illustration of an associated tracklet pair in local data association. (a1), (a2), and (a3) are the last three detections of $T_i$; (b1), (b2), and (b3) are the first three detections of $T_j$; and (a3) and (b1) are from two consecutive frames. (a3) and (b1) are not linked in the tracklet formation step because the brightness of the woman suddenly changes from strong to moderate. In the local data association step, as $T_i$ and $T_j$ have acceptable link probability and they are not linkable to any other tracklet that found no correspondence by using the Hungarian algorithm, they are associated.

In conservative tracklet association, the link probability between two tracklets $T_i \in T^{end}_{f}$ and $T_j \in T^{start}_{f+1}$ is defined as

$$P_{link\_local}(T_i, T_j) = P_{app}(T_i, T_j)\, P_{local\_motion}(T_i, T_j). \tag{8}$$

$P_{app}(T_i, T_j)$ calculates the affinity according to (2) between the average appearance model of the last three detections of $T_i$ and that of the first three detections of $T_j$.

Denoting $T_i$'s predicted model at its rear end as $r_{i,|T_i|+1}$ and $T_j$'s predicted model at its front end as $r_{j,0}$, $P_{local\_motion}(T_i, T_j)$ is calculated as the average of the intersection ratios between each predicted model and the corresponding detection:

$$P_{local\_motion}(T_i, T_j) = \frac{1}{2}\left(\frac{\mathrm{area}(r_{i,|T_i|+1} \cap r_{j,1})}{\mathrm{area}(r_{i,|T_i|+1} \cup r_{j,1})} + \frac{\mathrm{area}(r_{i,|T_i|} \cap r_{j,0})}{\mathrm{area}(r_{i,|T_i|} \cup r_{j,0})}\right). \tag{9}$$

For each frame $f$, we apply the Hungarian algorithm on the resulting link probability matrix to obtain the tracklet correspondence. As the correspondence is only calculated locally, we cannot accept all of the correspondences but only the reliable ones. Therefore, we first accept the links with high reliability using the two-threshold strategy; we then accept a correspondence with moderate link probability provided that neither of its two tracklets is linkable to any other tracklet that has no correspondence, ensuring that the accepted correspondences introduce no controversial association. Fig. 5 shows an example of local data association.
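The two ingredients of this step, the averaged intersection ratio of (9) and a one-to-one assignment on the link probability matrix, can be sketched as follows. Several simplifications are assumptions of the sketch: detections are approximated by axis-aligned boxes `(x1, y1, x2, y2)` rather than projected shape models, the assignment maximizes the sum of link probabilities, and an exhaustive search over permutations stands in for the Hungarian algorithm (a library routine such as SciPy's `linear_sum_assignment` would normally be used for larger, possibly non-square matrices).

```python
from itertools import permutations


def intersection_ratio(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def local_motion_probability(pred_after_i, first_of_j, last_of_i, pred_before_j):
    """Average of the two intersection ratios in (9)."""
    return 0.5 * (intersection_ratio(pred_after_i, first_of_j)
                  + intersection_ratio(last_of_i, pred_before_j))


def best_assignment(link_prob):
    """Exhaustive one-to-one assignment with the largest total link probability.

    Stand-in for the Hungarian algorithm; only practical for small square
    matrices. Returns a list of (row, column) pairs.
    """
    n = len(link_prob)
    best = max(permutations(range(n)),
               key=lambda perm: sum(link_prob[i][perm[i]] for i in range(n)))
    return list(enumerate(best))
```

The pairs returned by `best_assignment` are only candidates; the conservative acceptance tests described above would still be applied to each of them before any tracklets are merged.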

In addition to the accepted correspondences, if a tracklet has one end detection updated but is not linkable to any other tracklet, while that end is not at the image border or in the scene occluder areas, we infer that the corresponding object has been missed by the detector. In this case, we use the procedure stated in Section III-B2 to detect the missed object. To avoid drifting, we accept the detection only if it has a high appearance affinity (threshold 0.93) to the reference model, the shape matching score is high (threshold 0.65), and it does not overlap significantly (threshold 0.9) with other existing detections in the frame. Fig. 6 shows an example of recovered missed detections.

After the local association, reliable temporal information introduced in Section III-B2 can be propagated again, and new associations can be made when more detections have been updated. The iteration continues until there are no new detection updates or local data associations.

Fig. 6. Illustration of the result of tracklet extension. (a) Single-frame detection result. (b) Detection result after detection update and recovery.

D. Global Data Association Using the Hungarian Algorithm

When no further update or association can be made by the local association, we resort to the global Hungarian tracklet association.

The global link probability is defined as

$$P_{link\_global}(T_i, T_j) = P_{app}(T_i, T_j)\, P_{global\_motion}(T_i, T_j)\, P_{gap}(T_i, T_j). \tag{10}$$

$P_{app}(T_i, T_j)$ is the same as the one defined in (8). For the motion link probability $P_{global\_motion}(T_i, T_j)$, instead of making a constant velocity assumption, we only use it to exclude impossible connections. Specifically, for any two tracklets $T_i$ and $T_j$, if $|z_{i,|T_i|} - z_{j,1}| > (f_{j,1} - f_{i,|T_i|})\, d_{max}$, $P_{global\_motion}(T_i, T_j)$ is set to be 0, where $d_{max}$ is the maximum distance that can be traversed by a human object in one time step (1 m in our experiment). Otherwise, we consider the similarity between the end orientation $o_{i,|T_i|}$ and the start orientation $o_{j,1}$ by defining

$$P_{global\_motion}(T_i, T_j) = \delta + (1 - \delta) \max\big(0, \langle o_{i,|T_i|}, o_{j,1} \rangle\big) \tag{11}$$

where $\delta$ controls the importance of $P_{global\_motion}$ in $P_{link\_global}$ and is set to be 0.9.
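A minimal sketch of the gating and orientation term in (11) follows. It assumes that orientations are given as angles in radians on the ground plane, so that the similarity between the two unit orientation vectors reduces to the cosine of their difference; this parameterization is an assumption of the example. The values of `d_max` and `delta` are taken from the text.

```python
import math


def global_motion_probability(z_end, z_start, frame_gap, o_end, o_start,
                              d_max=1.0, delta=0.9):
    """Gate by reachable distance, then score orientation agreement.

    z_end and z_start are positions in meters, frame_gap is the number of
    time steps between the last detection of T_i and the first of T_j,
    and o_end, o_start are orientation angles in radians (assumed form).
    """
    if math.dist(z_end, z_start) > frame_gap * d_max:
        return 0.0  # physically unreachable, so the link is excluded
    # Inner product of two unit orientation vectors = cosine of their angle.
    cosine = math.cos(o_end - o_start)
    return delta + (1.0 - delta) * max(0.0, cosine)
```

With $\delta = 0.9$, any reachable pair scores between 0.9 and 1.0, so the orientation term acts as a mild tie-breaker rather than a hard constraint, which is consistent with the stated intent of only excluding impossible connections.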

Fig. 7. Illustration of the calculation of $P_{gap}$: (a) Input frame $I$ at the current interpolated position; (b) predicted shape model $m$; (c) corresponding occupancy map $I_{occ}$; (d) difference map between $m$ and $I_{occ}$; (e) image region inside $m$; (f) image region of $r_{i,|T_i|}$; and (g) image region of $r_{j,1}$. It can be seen from (d) that only a small part of $m$ overlaps with $I_{occ}$; thus, this position cannot be explained as a missed detection caused by occlusion. Then, the appearance similarity between $m$ and the two tracklets is checked. However, the two appearance affinities are both low. Therefore, $p^{i,j}_{gap}$ at this position is set to be $\eta$, meaning that this position is unlikely to bridge $T_i$ and $T_j$.

$P_{\mathrm{gap}}(T_i, T_j)$ measures how well the gap between $T_i$ and $T_j$ can be explained and is defined as

$$P_{\mathrm{gap}}(T_i, T_j) = \prod_{k=1}^{f_{j,1} - f_{i,|T_i|}} p^{i,j}_{\mathrm{gap}}(k) \tag{12}$$

where $p^{i,j}_{\mathrm{gap}}(k)$ calculates how likely the detection at the $k$th position in the gap is a missed detection. To do this, we first linearly interpolate the real-world positions within the gap. Then, for the $k$th interpolated position, we check whether more than 50% of it is occluded by other detections; if so, it is taken as a missed detection and $p^{i,j}_{\mathrm{gap}}(k)$ is set to the missed detection rate $p_{\mathrm{miss}}$. Otherwise, we check the upper body appearance of the predicted model at this position. If the appearance is similar to both $r_{i,|T_i|}$ and $r_{j,1}$, $p^{i,j}_{\mathrm{gap}}(k)$ is set to $p_{\mathrm{miss}}$ as well; otherwise, $p^{i,j}_{\mathrm{gap}}(k)$ is set to $\eta$ ($\eta \ll p_{\mathrm{miss}}$). Fig. 7 illustrates how $P_{\mathrm{gap}}$ is calculated.
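The gap term can be illustrated by the following Python sketch. It assumes that the per-position terms are multiplied together, that the occlusion fraction and the appearance test have already been evaluated for each interpolated position, and that the values of `p_miss` and `eta` shown as defaults are placeholders; none of the names or default values come from our implementation.

```python
def interpolate_gap(z_end, z_start, n_missing):
    """Linearly interpolate n_missing ground positions strictly between
    the last detection of T_i (z_end) and the first detection of T_j."""
    return [
        tuple(a + (b - a) * k / (n_missing + 1) for a, b in zip(z_end, z_start))
        for k in range(1, n_missing + 1)
    ]


def position_gap_probability(occluded_fraction, similar_to_both, p_miss, eta):
    """Per-position term: p_miss if the miss can be explained, eta otherwise."""
    if occluded_fraction > 0.5:
        return p_miss  # explained by occlusion from other detections
    if similar_to_both:
        return p_miss  # explained by appearance agreeing with both tracklets
    return eta         # unexplained gap position


def gap_probability(observations, p_miss=0.1, eta=0.001):
    """Multiply the per-position terms over the whole gap.

    observations: one (occluded_fraction, similar_to_both) pair per
        interpolated position; p_miss and eta are placeholder values.
    """
    probability = 1.0
    for occluded_fraction, similar_to_both in observations:
        probability *= position_gap_probability(
            occluded_fraction, similar_to_both, p_miss, eta)
    return probability
```

Because $\eta$ is much smaller than $p_{\mathrm{miss}}$, a single unexplained position is enough to make the link between the two tracklets very unlikely.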

Having $P_{\mathrm{link\_global}}$, the tracklet association problem is formulated as a MAP problem as proposed in [11], which considers track initialization, termination, tracklet association, and the probability of tracklets being false alarms. Convergence is guaranteed by reducing the initialization and termination probabilities of each track after each iteration until they reach a predefined lower bound.

Fig. 8. Wrong association caused by ambiguous tracklet. Green dots represent the detections of one human; blue dots represent the detections of another human; and red dots represent detections where occlusion happens and ambiguous tracklets are produced. (a) Ambiguous tracklet linkable to two tracklets at each end. (b) Two degenerate tracklets that lack motion information.

Fig. 9. Illustration of an associated case in global data association. (a), (b), (c), and (d) are from four consecutive frames. (a) Last detection of $T_i$. In (b) and (c), detections for the man are missed. (d) First detection of $T_j$. $T_i$ and $T_j$ are linked in global data association as the gap between them can be properly explained as missed detections.

The difference between our approach and [11] is that we specifically deal with the ambiguous tracklets that may violate the first-order Markov chain assumption, i.e., whether two tracklets should be linked depends not only on themselves but also on some other tracklets. We consider a tracklet to be ambiguous when it is linkable to two tracklets at the same end, which usually happens when there are missed detections. In addition, degenerate tracklets (i.e., tracklets consisting of a single detection) are also considered ambiguous because they tend to introduce identity switches due to the lack of motion information. Fig. 8 depicts the two types of ambiguous tracklets.

To approximate a second-order Markov chain on the ambiguous tracklets, given the Hungarian association results, we only accept the connection of an ambiguous tracklet at the end with the higher link probability. The connection at the other end is left for association in the following iterations, when there may be fewer ambiguities (e.g., detection update might have been performed to correct the detections or retrieve the missed detections, the appearance model may have been updated, or the degenerate tracklet may have been linked to another tracklet and hence contain motion information).
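The acceptance rule for ambiguous tracklets can be sketched as a filter applied to the links returned by the Hungarian step. The data layout (a list of `(i, j, p)` triples meaning that the tail of tracklet `i` is linked to the head of tracklet `j` with probability `p`), the function name, and the tie-breaking choice are assumptions made for this illustration only.

```python
def accept_links(links, ambiguous):
    """Keep, for each ambiguous tracklet, only its stronger link.

    links: list of (i, j, p) from a one-to-one assignment, where the tail
        of tracklet i is linked to the head of tracklet j with probability p.
    ambiguous: set of ambiguous tracklet ids.
    Returns (accepted, deferred); deferred links are retried in later
    iterations.
    """
    incoming = {j: (i, j, p) for i, j, p in links}  # link at the head of j
    outgoing = {i: (i, j, p) for i, j, p in links}  # link at the tail of i
    rejected = set()
    for t in ambiguous:
        head, tail = incoming.get(t), outgoing.get(t)
        if head is not None and tail is not None:
            # Ties keep the head link; this choice is arbitrary.
            weaker = head if head[2] < tail[2] else tail
            rejected.add(weaker[:2])
    accepted = [link for link in links if link[:2] not in rejected]
    deferred = [link for link in links if link[:2] in rejected]
    return accepted, deferred
```

A tracklet that is linked at only one end is unaffected by the filter, so unambiguous chains are formed exactly as in the plain Hungarian result.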

Fig. 9 illustrates an example of the associated tracklets in the global data association step.

After the global Hungarian association, new links may have been established and we can go on performing detection update and local tracklet linking. The whole process ends when no new links can be found using the global Hungarian association.
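The overall control flow of the closed loop can be summarized by the following Python sketch. The three callables stand in for detection update, local tracklet linking, and global Hungarian association; each is assumed to modify the shared state and to return whether it changed anything. The callables, the `state` object, and the `max_iters` safety cap are hypothetical and are not part of the method description.

```python
def run_tracking(state, update_detections, link_locally, link_globally,
                 max_iters=20):
    """Alternate local refinement and global association.

    Returns the number of outer iterations that were run.
    """
    for iteration in range(1, max_iters + 1):
        # Local phase: repeat until neither step changes anything.
        changed = True
        while changed:
            changed = update_detections(state)
            changed = link_locally(state) or changed
        # Global phase: stop once the Hungarian step finds no new links.
        if not link_globally(state):
            return iteration
    return max_iters
```

The local phase is exhausted before every global step, so the global association always works on the most refined detections available at that point.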

IV. EXPERIMENTAL RESULTS

In this section, we first show the improvement made by the proposed approach on detection and association accuracy, and then we compare our approach with others. For the first part of the experiment, we use our human detection results [5] as the input to the association approach, whereas for the second part, to compare fairly with other approaches, we use the same detection results, ground truth data, and evaluation tools as those used in the compared approaches [31], [37].

In our experiment, parameters not specified manually are learned from 90 ground truth trajectories of The University of Hong Kong (HKU) video, where mutual occlusion happens frequently, and they are kept exactly the same for all the experiments.

To determine whether a target is being tracked, we first obtain the bounding box for each shape model and then apply the commonly used PASCAL criterion, i.e., an intersection over union greater than 0.5, in all the experiments except for the PETS 2009 data set, where a Euclidean distance in world coordinates smaller than 1 m is used.
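The two matching criteria can be written compactly as follows. This is a minimal sketch that assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)` and world positions given as coordinate tuples in metres; the function names are illustrative only.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def is_tracked_pascal(box, gt_box, threshold=0.5):
    """PASCAL criterion: IoU with the ground truth box above the threshold."""
    return iou(box, gt_box) > threshold


def is_tracked_world(position, gt_position, max_distance=1.0):
    """World-coordinate criterion: Euclidean distance below max_distance (m)."""
    return math.dist(position, gt_position) < max_distance
```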

For quantitative evaluation of the proposed approach, we follow the currently most widely accepted protocol, i.e., the CLEAR MOT [38] metrics. The MultiObject Tracking Accuracy (MOTA) combines three types of errors, namely, false positives (FPs), missed targets (FNs), and identity switches (IDs), and is normalized such that a score of 100% corresponds to no errors (all three error types are equally weighted in our evaluation). The MultiObject Tracking Precision (MOTP) measures the alignment of the tracker output with respect to the ground truth. We also report recall, precision, false alarms per frame (Fa/F), as well as mostly lost (ML), partially tracked (PT), and mostly tracked (MT) scores, and the number of identity switches (IDs) and fragmentations (Frag) of the produced trajectories compared with ground truth trajectories according to [34].
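For reference, the equally weighted MOTA score and the detection-level recall and precision can be computed from accumulated counts as in the sketch below. It follows the usual definitions of these quantities; the function names and the way the counts are passed in are assumptions for illustration, and the evaluation tools used in our experiments are not reproduced here.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA with equally weighted errors, normalized by the total number
    of ground truth objects over all frames (1.0 means no errors)."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt


def recall(true_positives, num_gt):
    """Fraction of ground truth detections that are correctly matched."""
    return true_positives / num_gt


def precision(true_positives, false_positives):
    """Fraction of output detections that are correctly matched."""
    return true_positives / (true_positives + false_positives)
```

Note that MOTA can become negative when the number of errors exceeds the number of ground truth objects.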

A. Detection and Tracking Results

In this part, we tested our iterative data association and detection update approach on two data sets: The first one is the CAVIAR benchmark data set [39], with a resolution of 384 × 288, and the second one is an outdoor scene video taken at the HKU campus, with a resolution of 1280 × 720. In both data sets, object mutual occlusion is severe in a certain portion of frames and less severe in the other frames. Therefore, reliable temporal information of a target can be properly extracted from some frames where the target is well separated from the other objects and then propagated to the frames where the target overlaps with the other objects. The original frame rate of both data sets is 25 f/s. As the proposed approach requires additional computational time to perform detection update and missed detection recovery, to reduce the run time, we sample one frame out of every ten frames from the video sequences for tracking, i.e., the frame rate of the input to the tracking approach is 2.5 f/s. We found through experiment that using such a low frame rate does not impair the performance of the proposed approach.

For the CAVIAR data set, we have tested all 26 sequences; for the HKU data set, we have tested 5 sequences, each containing 100 frames (corresponding to 1000 frames in the original sequence). The HKU data set used here is not exactly the same as the one used in [5] because, in [5], we only picked the frames that are crowded, whereas within a sequence, the human distribution in some frames may be sparse. For the CAVIAR data set, as the image quality of the video is low, some of the human objects at the far end of the scene are hard to identify even by human eyes, particularly when there are multiple human objects. In addition, some true human objects of low resolution are not labeled in the ground truth data. Therefore, only humans with an image width greater than 24 pixels are counted when calculating the detection and association accuracy on the CAVIAR data set, as is done in some other research works [10], [11], [21].

TABLE I
DETECTION RESULT BEFORE AND AFTER APPLYING THE PROPOSED APPROACH

1) Detection Update Results: Table I shows the detection results of the single-frame detection and the detection results after detection update for the two tested data sets. It is shown that, for both data sets, the detection accuracy has been improved by about 2%, whereas the FP rate has been reduced by more than 2%. It should be noted that the FP rate of 1.79% for the CAVIAR data set is still relatively high. This is partly caused by the incorrect labeling of the ground truth data: some of the humans that appear only partially in the scene are not labeled, so when our approach correctly detects them, they are counted as false alarms.

Fig. 10 illustrates some human detection results before and after applying our approach on the CAVIAR data set. It is shown that many ambiguities have been resolved by using the reliable temporal information. In Fig. 10(a1)–(b1), the negative effect of shadows has been removed, resulting in more accurate estimation of the human size, and the head of the man on the lower right of the image has been more accurately localized as well. In Fig. 10(a2)–(b2), a false alarm caused by rich texture is eliminated, and the localization accuracy of all the four humans on the lower part of the image has been improved. Fig. 10(b3) corrects the wrong localization of the human in red by using the reliable appearance information for resegmentation. Fig. 10(b3) and (b4) each recover a missed detection and localize it correctly.

Fig. 11 illustrates the human detection results before and after applying the proposed approach on the HKU campus data set. It is shown that some human objects that are quite difficult to detect from single images have been successfully detected using the proposed approach [e.g., the two ladies in white and black, respectively, in Fig. 11(a1)] and a false alarm in Fig. 11(b2) has been removed. Meanwhile, human orientations, postures, and sizes have been more accurately estimated as well.

2) Human Tracking Results: The results of our association approach without and with detection update are shown in Table II. It is shown that detection update has improved the association accuracy by 2.8% and 10.1% for the CAVIAR and HKU data sets, respectively. The improvement is mainly made on very difficult cases such as severe occlusion or low figure–ground contrast. The improvement on the HKU data set is more significant because the video was recorded in high resolution, where reliable temporal information exists more prevalently and can be propagated more accurately as well.

We use the bounding box representation instead of models to illustrate the tracking results. This is because, for those missed detections found by association, we may not have the reliable temporal information to refine them. In such cases, we just estimate the positions of the detection’s head and feet and draw a bounding box with a fixed aspect ratio to represent the detection. Different colors of the bounding boxes represent different object identities. In addition, the bounding box is thickened when a new track is initiated.

Fig. 10. (a) Results of single-frame human detection and (b) results of applying the proposed approach for detection update on the CAVIAR data set.

Fig. 11. Illustration of the detection result on the HKU data set. (a) Original frame. (b) Single-frame detection result. (c) Detection update result.

TABLE II
TRACKING RESULT OF THE PROPOSED APPROACH WITH AND WITHOUT DETECTION UPDATE

Fig. 12. Tracking result of the proposed approach on the CAVIAR data set. Example 1.

Figs. 12 and 13 depict our tracking results on two sequences of the CAVIAR data set. For the first sequence, all human objects have been tracked successfully, although there are many occlusions. For the second sequence, however, there are several errors. The first kind of error is the missed detections and identity switches at the far end of the scene, where both the resolution and the temporal consistency of the objects’ appearance are low. The second type of error is missed detections caused by persistent complete occlusion. It is shown in Fig. 13 that the man in blue has been fully occluded for a relatively long time and, meanwhile, exhibits an abrupt change of motion. The proposed approach does not associate the detections of the man before and after the occlusion because it is considered to be too risky.

Fig. 14 illustrates the tracking result of the proposed approach on one HKU campus sequence. It can be seen that most of the human objects in this sequence have been successfully tracked, although some of them have clothing of very similar colors and long-time occlusion exists. Missed detection of the young man at the far end of the scene is persistent because he is fully occluded by his companion for a long time.

B. Tracking Performance Comparison With Other Approaches

In this part, we compare our association approach with four state-of-the-art trackers. We first compare our approach with [36] and [37] on the CAVIAR data set. In [36], a robust appearance model is learned for each target, and in [37], both appearance models and motion patterns are learned. We then compare our approach with [22] and [31] on the PETS 2009 data set. Reference [22] proposed a tracker based on the k-shortest paths algorithm on a regularly discretized grid, and in [31], occlusion is explicitly modeled and gradient descent is applied to minimize a continuous nonconvex energy function. The detections, ground truth, and evaluation tools are downloaded from the homepages of the authors of [37] (footnote 1) and [31] (footnote 2), respectively.

1 …/people/yangbo/downloads.html

2 …rmatik.tu-darmstadt.de/~aandriye/

Fig. 13. Tracking result of the proposed approach on the CAVIAR data set. Example

Fig. 14. Tracking result of the proposed approach on the HKU campus data set.

As the detection responses are provided in the form of bounding boxes while our approach needs shape models as input, we use our model matching approach [5] to find the best matched shape model for each detection. For occlusion order determination, detection A is assumed to occlude detection B if the bottom of A is lower than the bottom of B.
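The occlusion order heuristic amounts to sorting detections by the bottom edge of their bounding boxes. The sketch below assumes image coordinates in which y grows downward, so that a lower bottom edge means a larger `y_max`, and boxes given as `(x_min, y_min, x_max, y_max)`; the function names are illustrative.

```python
def occludes(box_a, box_b):
    """True if A is assumed to be in front of B, i.e. the bottom of A is
    lower in the image (larger y_max) than the bottom of B."""
    return box_a[3] > box_b[3]


def occlusion_order(boxes):
    """Indices of the boxes sorted from front (nearest) to back."""
    return sorted(range(len(boxes)), key=lambda i: boxes[i][3], reverse=True)
```

The heuristic relies on the common surveillance geometry in which the camera looks down on a ground plane, so that people standing nearer to the camera have their feet lower in the image.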

For the CAVIAR data set, we use the 2.5-f/s frame rate, whereas for the PETS 2009 data set, as the sequences were recorded at a low frame rate (7 f/s), we did not perform any further sampling.

TABLE III
COMPARISON OF TRACKING RESULTS ON CAVIAR DATA SET (LESS CROWDED)

TABLE IV
COMPARISON OF TRACKING RESULTS ON PETS 2009 DATA SET

Fig. 15. Visual comparison of the tracking result on PETS 2009 S2L2 of [31] and our approach. The first row is the tracking result illustrated in [31], and the second row is the tracking result of our approach.

Twenty sequences of the CAVIAR data set have been evaluated, as done in [36] and [37], and Table III lists the comparison of the results. It is shown that our approach outperforms [36] and [37] in terms of recall, precision, number of ML tracks, and identity switches. However, the number of fragmentations of our approach is higher than that of both [36] and [37].

For the PETS 2009 data set, we evaluated the three sequences collected for tracking. From S2L1 to S2L3, the human density varies from low to high. As the evaluation is made in world coordinates in [31], we perform the same evaluation for fair comparison. The result is listed in Table IV.

For sequences S2L1 and S2L2, where humans can be well separated in a certain number of frames, our approach achieves a tracking accuracy, i.e., MOTA, that is 1.5% and 5.4% higher than [31], respectively. However, for S2L3, humans are mostly mutually occluded throughout the whole sequence, so very limited reliable temporal information can be extracted; our approach then works approximately as an ordinary data association approach, and its performance is 0.5% lower than [31]. For all the three sequences, the numbers of ML tracks and identity switches of our approach are the lowest.

Fig. 15 gives a visual comparison between the results of [31] and our approach on the S2L2 sequence. It is shown that our approach is able to deal with occlusion better and hence track more humans. For example, in frames 40, 76, and 138, two humans partially occluded by the plate are tracked as one person by [31] (identified as 7, 27, and 29, respectively), whereas our approach can track them separately. In addition, our approach can also track more humans than [31] at the upper part of the scene, where some humans are overexposed and the resolution is low.

It can be seen from the above results that one drawback of our approach is that, for many of the tested sequences, the number of fragmentations of our approach is higher than that of the other approaches. This is mainly caused by the applied part-based appearance model, for which inaccurate segmentation, which occurs frequently at the spatial–temporal locations where occlusion exists, results in an ambiguous appearance model. In addition, as the appearance model is based on color histograms, it has relatively low discriminability. These two reasons make it difficult for our approach to deal with some very ambiguous situations. To reduce the possibility of identity switches, a conservative strategy is applied in our approach: If the link probability is low (i.e., the association is likely to introduce identity switches), we choose to discard the association, thus resulting in more fragmentation. We expect that this problem can be much alleviated if more features are used in addition to colors and discriminative training of appearance models is applied.

C. Computational Cost Analysis

Our approach is currently implemented in MATLAB and run on an Intel Core i7 2.93-GHz CPU. Most of the computational time is spent on the detection update, which depends on the computational time for each detection update and the total number of detections that need to be updated. For each detection update, the search of the head position usually takes 1 s, where the optimal orientation and size are searched within a small neighborhood of the expected orientation $o_0$ and size $s_0$. Then, given the head position, the optimal model is selected, where, in addition to the optimal orientation and size, the leg pose is also searched using the hierarchical model matching introduced in [5]. This step usually takes 2 s. The total number of detection updates depends on the density of the crowd, which is hard to predict, and its upper bound is the total number of human objects in all frames. In very crowded scenes, as very little reliable temporal information can be extracted, our approach degenerates to an ordinary data association approach and the computational cost becomes low because very few detection updates need to be performed.

The whole association process terminates within 20 iterations for all the tested sequences. For the first several iterations, there are both detection update and local and global tracklet associations. For the remaining iterations, as no detections can be updated anymore, only global associations take place. Overall, the average frame processing time of our approach in the MATLAB implementation is 35 s for the HKU data set, 6 s for the CAVIAR data set, 6 s for the PETS 2009 S2L1 sequence, 15 s for the PETS 2009 S2L2 sequence, and 8 s for the PETS 2009 S2L3 sequence. Considering that both the model matching and detection update processes can run in parallel, if our approach is implemented in C++ with GPU acceleration and code optimization, we expect that the approach has the potential to run in real time for a moderately crowded scene, e.g., ten humans.

V. CONCLUSION

In this paper, we have proposed an iterative data association and detection update approach for human tracking in crowded scenes. Both the detection and association results of the proposed approach have shown obvious improvement over the single-frame detection results and the association results without detection update. Using the reliable temporal information for detection update and tracklet extension, many of the missed detections and false alarms in the original detection results have been eliminated, and the human location, size, and orientation estimation errors have been alleviated as well. Comparison with several state-of-the-art approaches demonstrates the effectiveness of the proposed approach in tracking crowds of densities from low to high, except for those very crowded scenarios where reliable temporal information can hardly be extracted.

The detection and tracking results can be expected to improve further by introducing discriminatively trained affinity and appearance models into the proposed framework. Our approach can also be improved by exploiting high-level scene understanding to resolve more ambiguities, e.g., scene occluder detection, either by specifically detecting certain commonly seen occluders such as trees and pillars or by statistical analysis of the obtained trajectories.

REFERENCES

[1] Z. Guohui and W. Yinhai, “Optimizing minimum and maximum green time settings for traffic actuated control at isolated intersections,” IEEE Trans. Intell. Transp. Syst., vol. 12, no. 1, pp. 164–173, Mar. 2011.

[2] X. Song and H. B. L. Duh, “A simulation of bonding effects and their impacts on pedestrian dynamics,” IEEE Trans. Intell. Transp. Syst., vol. 11, no. 1, pp. 153–161, Mar. 2010.

[3] A. Shende, M. P. Singh, and P. Kachroo, “Optimization-based feedback control for pedestrian evacuation from an exit corridor,” IEEE Trans. Intell. Transp. Syst., vol. 12, no. 4, pp. 1167–1176, Dec. 2011.

[4] J. Candamo, M. Shreve, D. B. Goldgof, D. B. Sapper, and R. Kasturi, “Understanding transit scenes: A survey on human behavior-recognition algorithms,” IEEE Trans. Intell. Transp. Syst., vol. 11, no. 1, pp. 206–224, Mar. 2010.

[5] L. Wang and N. Yung, “Three-dimensional model based human detection in crowded scenes,” IEEE Trans. Intell. Transp. Syst., vol. 13, no. 2, pp. 691–703, Jun. 2012.

[6] C. Rasmussen and G. D. Hager, “Probabilistic data association methods for tracking complex visual objects,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 6, pp. 560–576, Jun. 2001.

[7] D. Comaniciu, V. Ramesh, and P. Meer, “Kernel-based object tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, pp. 564–575, May 2003.

[8] X. Wang, G. Hua, and T. Han, “Discriminative tracking by metric learning,” in Proc. Eur. Conf. Comput. Vis., 2010, pp. 200–214.

[9] Z. Kalal, K. Mikolajczyk, and J. Matas, “Tracking-learning-detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, pp. 1409–1422, Jul. 2012.

[10] B. Wu and R. Nevatia, “Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors,” Int. J. Comput. Vis., vol. 75, no. 2, pp. 247–266, Nov. 2007.

[11] C. Huang, B. Wu, and R. Nevatia, “Robust object tracking by hierarchical association of detection responses,” in Proc. Eur. Conf. Comput. Vis., 2008, pp. 788–801.

[12] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool, “Online multiperson tracking-by-detection from a single, uncalibrated camera,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 9, pp. 1820–1833, Sep. 2011.

[13] J. Munkres, “Algorithms for the assignment and transportation problems,” J. Soc. Ind. Appl. Math., vol. 5, no. 1, pp. 32–38, 1957.

[14] T. Zhao, R. Nevatia, and B. Wu, “Segmentation and tracking of multiple humans in crowded environments,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 7, pp. 1198–1211, Jul. 2008.

[15] K. Okuma, A. Taleghani, N. d. Freitas, J. J. Little, and D. G. Lowe, “A boosted particle filter: Multitarget detection and tracking,” in Proc. Eur. Conf. Comput. Vis., 2004, pp. 28–39.

[16] Y. Li, H. Ai, T. Yamashita, S. Lao, and M. Kawade, “Tracking in low frame rate video: A cascade particle filter with discriminative observers of different lifespans,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Beijing, China, 2007, pp. 1–8.

[17] D. Gavrila, “Pedestrian detection from a moving vehicle,” in Proc. Eur. Conf. Comput. Vis., 2000, pp. 37–49.

[18] Y. Cai, N. Freitas, and J. Little, “Robust visual tracking for multiple targets,” in Proc. Eur. Conf. Comput. Vis., 2006, pp. 107–118.

[19] I. Ali and M. N. Dailey, “Multiple human tracking in high-density crowds,” Image Vis. Comput., vol. 30, no. 12, pp. 966–977, Dec. 2012.

[20] H. Jiang, S. Fels, and J. J. Little, “A linear programming approach for multiple object tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2007, pp. 1–8.

[21] L. Zhang, Y. Li, and R. Nevatia, “Global data association for multi-object tracking using network flows,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2008, pp. 1–8.

[22] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua, “Multiple object tracking using k-shortest paths optimization,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 9, pp. 1806–1819, Sep. 2011.

[23] H. B. Shitrit, J. Berclaz, F. Fleuret, and P. Fua, “Tracking multiple people under global appearance constraints,” in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 137–144.

[24] A. Andriyenko and K. Schindler, “Globally optimal multi-target tracking on a hexagonal lattice,” in Proc. Eur. Conf. Comput. Vis., 2010, pp. 466–479.

[25] H. Pirsiavash, D. Ramanan, and C. Fowlkes, “Globally-optimal greedy algorithms for tracking a variable number of objects,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2011, pp. 1201–1208.

[26] H. Izadinia, I. Saleemi, W. Li, and M. Shah, “(MP)2T: Multiple people multiple parts tracker,” in Proc. Eur. Conf. Comput. Vis., 2012, pp. 100–114.

[27] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1627–1645, Sep. 2010.

[28] Q. Yu, G. Medioni, and I. Cohen, “Multiple target tracking using spatio-temporal Markov chain Monte Carlo data association,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2007, pp. 1–8.

[29] B. Benfold and I. Reid, “Stable multi-target tracking in real-time surveillance video,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2011, pp. 3457–3464.

[30] B. Leibe, K. Schindler, N. Cornelis, and L. Van Gool, “Coupled object detection and tracking from static cameras and moving vehicles,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 10, pp. 1683–1698, Oct. 2008.

[31] A. Milan, S. Roth, and K. Schindler, “Continuous energy minimization for multi-target tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 1, pp. 58–72, 2014.

[32] C. Stauffer, “Estimating tracking sources and sinks,” in Proc. Conf. Comput. Vis. Pattern Recog. Workshop, 2003, p. 35.

[33] A. G. A. Perera, C. Srinivas, A. Hoogs, G. Brooksby, and W. Hu, “Multi-object tracking through simultaneous long occlusions and split–merge conditions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2006, pp. 666–673.

[34] Y. Li, C. Huang, and R. Nevatia, “Learning to associate: HybridBoosted multi-target tracker for crowded scene,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2009, pp. 2953–2960.

[35] C.-H. Kuo, C. Huang, and R. Nevatia, “Multi-target tracking by on-line learned discriminative appearance models,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2010, pp. 685–692.

[36] C.-H. Kuo and R. Nevatia, “How does person identity recognition help multi-person tracking?” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2011, pp. 1217–1224.

[37] B. Yang and R. Nevatia, “Multi-target tracking by online learning of non-linear motion patterns and robust appearance models,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2012, pp. 1918–1925.

[38] R. Stiefelhagen, K. Bernardin, R. Bowers, J. S. Garofolo, D. Mostefa, and P. Soundararajan, “The CLEAR 2006 evaluation,” in Proc. Int. Eval. Conf. Classification Events, Activities Relationships, 2006, pp. 1–44.

[39] CAVIAR Test Case Scenarios. [Online]. Available: http://homepages.inf.…/rbf/CAVIARDATA1/

Lu Wang received the B.Eng. and M.Eng. degrees from Harbin Institute of Technology, Harbin, China, and the Ph.D. degree from The University of Hong Kong, Pokfulam, Hong Kong.

She is currently an Assistant Professor with the College of Information Science and Engineering, Northeastern University, Shenyang, China. Her research interests include computer vision and pattern recognition.

Nelson Hon Ching Yung (M’85–SM’96) received the B.Sc. and Ph.D. degrees from Newcastle University, Newcastle upon Tyne, U.K.

He was a Lecturer with Newcastle University from 1985 to 1990. From 1990 to 1993, he was a Senior Research Scientist with the Department of Defence of Australia. In late 1993, he joined The University of Hong Kong (HKU), Pokfulam, Hong Kong, as an Associate Professor. He is the Founding Director of the Laboratory for Intelligent Transportation Systems Research, HKU. He acts as Consultant to government units and a number of local and international companies. He has coauthored five books and book chapters and has published more than 150 journal and conference papers in the areas of digital image processing, parallel algorithms, visual traffic surveillance, autonomous vehicle navigation, and learning algorithms.

Prof. Yung is a Chartered Electrical Engineer. He is a member of The Hong Kong Institution of Engineers and the Institution of Electrical Engineers. He was the Regional Secretary of the IEEE Asia-Pacific Region, a Council Member and the Chairman of Standards Committee of Intelligent Transportation Systems-Hong Kong (ITS-HK), and the Chair of Computer Division, International Institute for Critical Infrastructures. He was a member of the Advisory Panel of the ITS Strategy Review, Transport Department, Government of the Hong Kong Special Administrative Region. He was a Guest Editor of the SPIE Journal of Electronic Imaging. He also serves as a reviewer for a number of IEEE, Institution of Engineering and Technology, and SPIE journals. His biography has been published in Who’s Who in the World since 1998.

Lisheng Xu (M’08) received the B.Sc., M.Sc., and Ph.D. degrees from Harbin Institute of Technology, Harbin, China.

From 2006 to 2009, he was with The Chinese University of Hong Kong, Shatin, Hong Kong, as a Postdoctoral Fellow. Since 2009, he has been a Full Professor with the Sino-Dutch Biomedical and Information Engineering School, Northeastern University, Shenyang, China. His current research interests include nonlinear medical signal processing, computational electromagnetic simulation, computer vision, and pattern recognition.