
Human activity recognition in videos using a single example ☆

Mehrsan Javan Roshtkhari ⁎, Martin D. Levine

Centre for Intelligent Machines, Department of Electrical and Computer Engineering, McGill University, Montreal, QC H3A 2A7, Canada

Article history:
Received 25 January 2013
Received in revised form 18 July 2013
Accepted 29 August 2013
Available online 6 September 2013

Keywords:
Action recognition
Bag of video words
Hierarchical codebook
Spatio-temporal contextual information
Probabilistic modeling
Context
Ensemble of volumes

Abstract

This paper presents a novel approach for action recognition, localization and video matching based on a hierarchical codebook model of local spatio-temporal video volumes. Given a single example of an activity as a query video, the proposed method finds similar videos to the query in a target video dataset. The method is based on the bag of video words (BOV) representation and does not require prior knowledge about actions, background subtraction, motion estimation or tracking. It is also robust to spatial and temporal scale changes, as well as some deformations. The hierarchical algorithm codes a video as a compact set of spatio-temporal volumes, while considering their spatio-temporal compositions in order to account for spatial and temporal contextual information. This hierarchy is achieved by first constructing a codebook of spatio-temporal video volumes. Then a large contextual volume containing many spatio-temporal volumes (ensemble of volumes) is considered. These ensembles are used to construct a probabilistic model of video volumes and their spatio-temporal compositions. The algorithm was applied to three available video datasets for action recognition with different complexities (KTH, Weizmann, and MSR II) and the results were superior to other approaches, especially in the case of a single training example and cross-dataset¹ action recognition.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Human activity analysis is required for video surveillance systems, human–computer interaction, sports interpretation, and video retrieval for content-based search engines [1,2]. Moreover, given the tremendous amount of video data available online these days, there is a great demand for automated systems that analyze and understand the contents of these videos. Recognizing and localizing human actions in a video is the primary component of such a system, and also the most important, as it affects the performance of the whole system significantly. Although there are many methods to determine human actions in highly controlled environments, this task remains a challenge in real world environments due to camera motion, cluttered backgrounds, occlusion, and scale/viewpoint/perspective variations [3–6]. Moreover, the same action performed by two persons can appear to be very different. In addition, clothing, illumination and background changes can increase this dissimilarity [7–9].

To date, in the computer vision community, “action” has largely been taken to be a human motion performed by a single person, taking up to a few seconds, and containing one or more events. Walking, jogging, jumping, running, hand waving, picking up something from the ground, and swimming are some examples of such human actions [1,2,6]. In this paper, our main goal is to address the problem of action recognition and localization in real environments using a hierarchical probabilistic video-to-video matching framework. This problem is also referred to as action spotting [10]. To achieve this, we have developed a fast data-driven approach, which finds videos in a “target” set that are similar to a single labeled “query” video. Assuming that the latter contains an action of interest, e.g., walking, we find all videos in the target set that are similar to the query, which implies the same activity. This video-to-video comparison also makes it possible to label activities, the so-called action classification problem. An overview of the algorithm is presented in Fig. 1. The major benefit of our approach is that it does not require long video training sequences, object segmentation, tracking or background subtraction. The method can be considered as an extension of the original bag of video words (BOV) approach for action recognition.

Although an initial spatio-temporal volumetric representation of human activity may eliminate some pre-processing steps, for example background subtraction and tracking, it suffers from some major drawbacks. For example, in general, BOV-based approaches for activity recognition in the literature involve salient point detection. They usually ignore the geometrical and temporal structure of these visual volumes, as they store STVs in an unordered manner. Also, they are unable to handle scale variations (spatial, temporal, or spatio-temporal) because they are too local, in the sense that


☆ This paper has been recommended for acceptance by Sinisa Todorovic.
⁎ Corresponding author.

E-mail addresses: javan@cim.mcgill.ca (M. Javan Roshtkhari), levine@cim.mcgill.ca (M.D. Levine).
¹ We use this term to denote a query that is selected from a particular dataset when the target videos originate from another dataset. In this situation, the two datasets have been recorded under different conditions.

0262-8856/$ – see front matter © 2013 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.imavis.2013.08.005


they consider just a few neighboring video volumes (e.g., five nearest neighbors in [11] or just one neighbor in [4]). To overcome these issues, we have developed a multi-scale, hierarchical codebook of BOVs for densely sampled videos, which incorporates spatio-temporal compositions and their uncertainties. This permits the use of statistical inference to recognize the activities. We also note that, in order to measure similarity between a query and a target dataset, it is necessary to use information regarding the most informative spatio-temporal video volumes (STVs) in the video, i.e., the salient foreground objects. To select these space-time regions, we use the information obtained from our hierarchical BOV method, which, in a sense, can be viewed as a context-based spatio-temporal segmentation method.

In this paper we present a hierarchical probabilistic codebook method for action recognition in videos, which is based on STV construction. The method uses both local and global compositional information of the volumes, which are obtained by dense sampling at various scales. Similar to other volumetric methods, we do not require background subtraction, motion estimation, or complex models of body configurations and kinematics. Moreover, the method tolerates variations in appearance, scale, rotation, and movement.

As shown in Fig. 1, the proposed algorithm consists of two main components: hierarchical codebook construction of salient STVs and an inference mechanism for measuring the similarity between salient STVs of the query and target videos. Hierarchical codebook construction consists of four steps: coding the video to construct STVs and low-level probabilistic codebook formation while considering the uncertainties in the STVs; constructing ensembles of video volumes for each pixel in a video frame, containing a large number of STVs, and probabilistic models of their spatio-temporal compositions; high-level codebook construction of the ensembles; and finally, analyzing codewords as a function of time in order to construct a codebook of salient regions. The inference mechanism is based on a set of codewords constructed for each query video. It determines the most similar compositions of STVs in the target videos that match the query video. There are two important differences between our proposed hierarchical approach and previously reported ones. First, the latter are unable to handle both local and global compositional information. Second, they always select the informative regions at the lowest level of the hierarchy.

The main contributions of this paper are as follows:

• We introduce a hierarchical codebook structure for action detection and labeling. This is achieved by considering a large volume containing many STVs and constructing a probabilistic model of this volume to capture the spatio-temporal configurations of STVs. Consequently, similarity between two videos is calculated by measuring the similarity both between spatio-temporal video volumes and between their compositional structures.

• We select the salient pixels in the video frames by analyzing codewords obtained at the highest level of the hierarchical codebook's structure. This differs from conventional background subtraction and salient point detection methods.

In order to evaluate the capability of our approach for action matching and classification, we have conducted experiments using three datasets: KTH [12], Weizmann [13] and MSR II [14].² Three types of experiments were performed: action matching and retrieval, single dataset video classification, and cross-dataset action recognition. A preliminary version of this paper appeared in the International Conference on Computer and Robot Vision [15]. This paper differs from [15] in the following respects: 1) we provide more detailed descriptions of how the proposed algorithm learns visual context, 2) we have formulated the contextual graphs and similarity measurement in a spatio-temporal context, 3) a multiscale approach is implemented to deal with large variations in the scale of the actions, and 4) the effects of different parameters have been evaluated by conducting extensive experiments. The rest of this paper is organized as follows. Section 2 reviews recent work on action recognition. Section 3 describes the proposed approach for hierarchical codebook construction and the steps of the algorithm. Section 4 describes the action matching algorithm. Section 5 then presents the experimental results and, finally, Section 6 concludes the paper.

2. Related work

Many studies have focused on the action recognition problem by invoking human body models, tracking-based methods, and local descriptors [1]. The early work often depended on tracking [16–19], in which humans, body parts, or some interest points were tracked between

Fig. 1. Overview. The goal is to find videos similar to the query video in the target set. This is achieved by constructing an activity model for the query video and then measuring the similarity between it and the target videos.

² http://research.microsoft.com/en-us/um/people/zliu/ActionRecoRsrc/


consecutive frames to obtain the overall appearance and motion trajectory. Clearly, the performance of these algorithms is highly dependent on tracking, which sometimes fails for real world video data [20]. Recently, tracking a fixed number of interest points between video frames has become more popular than other tracking-based approaches, since it is capable of coding some contextual information regarding local spatio-temporal features. This method functions by tracking the interest point features between consecutive frames and thereby obtaining a set of trajectories [21,22,19]. The contextual information is then computed as the spatial relationship between trajectories [21] or temporal associations between interest points on a single trajectory [22]. In addition to the normal issues associated with tracking, these approaches are based on an implicit assumption of a static background, since moving objects in the background might produce trajectories similar to those of an object in the region of interest.

Alternatively, shape template matching has been employed for activity recognition, e.g., 2D shape matching [23] or its 3D extensions, optical flow matching [13,24,25]. In this case, action templates are constructed to model the actions and used to locate similar motion patterns. Other studies have combined both shape and motion features to achieve more robust results [26,27], claiming that this representation is somewhat robust to object appearance [26]. In a more recent study [27], shape and motion descriptors are employed to construct a shape-motion prototype for human activities in a hierarchical tree structure, and action recognition is performed in the joint shape and motion feature space. Although these approaches seem well suited to action localization, they require a priori high-level representations of the human motion. Moreover, they depend on such image pre-processing stages as segmentation, object tracking, and background subtraction [28], which are extremely challenging in real-world unconstrained environments.

In order to eliminate such pre-processing, Derpanis et al. [10] have proposed so-called “action templates”. These are calculated as oriented local spatio-temporal energy features that are computed as the response of a set of tuned 3D Gaussian third order derivative filters. Sadanand et al. [29] introduced action banks to make these template-based recognition approaches more robust to viewpoint and scale variations. Recently, tracking and template-based approaches have been combined to improve action detection accuracy [18,30].

In a completely different vein, models based on a bag of local visual features³ have recently been studied extensively and have shown promising results for action recognition [7,31,32,26,33,3,11,4,9,28,34]. These approaches extract and quantize the video data to produce a set of video volumes that form a “visual vocabulary”. In general, the potential real-time performance of these methods is related to the number of video volume samples and their associated features [26]. Usually, these features are gradients (spatial, temporal, or spatio-temporal), body landmarks, or color information. Combining them makes it possible to capture motion and the scene context simultaneously without requiring reliable trajectories of the objects of interest [35]. The video volumes are constructed either by extracting a limited set of interest points or by densely sampling the video. In the former case, due to the sparse nature of the space–time interest points, the method becomes computationally efficient and hence is popular in the action recognition literature [3,12,36,34,37]. On the other hand, the selection of appropriate interest points that are guaranteed to contain a salient and discriminative motion pattern in their local context is a difficult challenge [38]. In addition, it has been shown recently that densely sampling the video always achieves better results than a sparse set of interest points [39].

A major advantage of using volumetric representations of videos is that it permits the localization and classification of actions using data-driven nonparametric approaches instead of requiring the training of sophisticated parametric models. In the literature, action inference is usually determined by using a wide range of classification approaches, ranging from sub-volume matching [24], nearest neighbor classifiers [40] and their extensions [37], support [32] and relevance vector machines [11], to more complicated classifiers employing probabilistic Latent Semantic Analysis (pLSA) [3]. On the other hand, Boiman et al. [40] have shown that a rather simple nearest neighbor image classifier in the space of the local image descriptors is equally as efficient as these more sophisticated classifiers. This also implies that the particular classification method chosen is not as critical as might be thought, and that the main challenge for action representation is using appropriate features.

However, we note that classical BOV approaches suffer from a significant challenge. That is, the video volumes are grouped solely based on their similarity, in order to reduce the vocabulary size. Unfortunately, this destroys the compositional information concerning the relationships between volumes [41,3]. Thus, the likelihood of each video volume is calculated as its similarity to the other volumes in the dataset, without considering the spatio-temporal properties of the neighboring contextual volumes. This makes the classical BOV approach excessively dependent on very local data and unable to capture significant spatio-temporal relationships. In addition, it has been shown recently that detecting actions using an “order-less” BOV does not produce acceptable recognition results [7,31,33,38,41–43]. To overcome this challenge, contextual information must be included in the original BOV framework. One solution is to employ visual phrases instead of visual words. This has been proposed in [43], where a visual phrase is defined as a set of spatio-temporal video volumes with a specific pre-ordained spatial and temporal structure. The main drawback of this approach is that it cannot localize different activities in a video frame. Alternatively, the solution presented by Boiman and Irani [7] is to densely sample the video and store all video volumes for a video frame, along with their relative locations in space and time. Consequently, the likelihood of a query in an arbitrary space-time contextual volume can be computed and thereby used to determine an accurate label for an action using just simple nearest neighbor classifiers [40]. However, the main problem with this approach is that it requires excessive computational time and a considerable amount of memory to store all of the volumes as well as their spatio-temporal relationships. We present a competent alternative to this in the next section.

In addition to [7], several other methods have been proposed to incorporate spatio-temporal structure in the context of BOV [61]. These are often based on co-occurrence matrices that are employed to describe contextual information. For example, the well-known correlogram exploits spatio-temporal co-occurrence patterns [4]. However, only the relationship between the two nearest volumes is considered. This makes the approach too local and unable to capture complex relationships between different volumes. Another approach is to use a coarse grid and construct a histogram to subdivide the space-time volumes [35]. Similarly, in [36], contextual information is added to the BOV by employing a coarse grid at different spatio-temporal scales. An alternative that does incorporate contextual information within a BOV framework is presented in [42], in which three-dimensional spatio-temporal pyramid matching is employed. While not actually comparing the compositional graphs of image fragments, this technique is based on the original two-dimensional spatial pyramid matching of multi-resolution histograms of patch features [41]. Likewise, in [44], temporal relationships between clustered patches are modeled using ordinal criteria (e.g., equals, before, overlaps, during, after, etc.) and expressed by a set of histograms for all patches in the whole video sequence. Similar to [44], in [45] ordinal criteria are employed to model spatio-temporal compositions of clustered patches in the whole video frame during very short temporal intervals. The main problems associated with this are the large size of the spatio-temporal relationship histograms and the many parameters associated with the spatio-temporal ordinal criteria. In [46] the spatial information is coded through the concatenation of video words detected

³ Essentially, the probabilistic topic models, such as Latent Dirichlet Allocation (LDA), can also be considered as BOV approaches since they ignore the spatio-temporal order of the local features.


in different spatial regions, as well as data mining techniques, which are used to find frequently occurring combinations of features. Similarly, [47] addresses this issue by using the spatial configuration of the 2D patches, incorporating their weighted sum. In [38], these patches were represented using 3D Gaussian distributions of the spatio-temporal gradient, and the temporal relationship between these Gaussian distributions was modeled using HMMs. An interesting alternative is to incorporate mutual contextual information of objects and human body parts by using a random tree structure [28,34] to partition the input space. The likelihood of each spatio-temporal region in the video is then calculated. The primary issue with this approach [34] is that it requires background subtraction, interest point tracking and detection of regions of interest.

Hierarchical clustering seems to be an attractive way of incorporating the contextual structure of video volumes, as well as preserving the compactness of their description [33,11]. Thus a modified version of [7] was presented in [11]. It uses a hierarchical approach, in which a two-level clustering method is employed. At the first level, all similar volumes are categorized. Then clustering is performed on randomly selected groups of spatio-temporal volumes while considering the relationships in space and time between the five nearest spatio-temporal volumes. However, the small number of spatio-temporal volumes involved again makes this method local in nature. Another hierarchical approach is presented in [33], which attempts to capture the compositional information of a subset of the most discriminative video volumes. In all of these proposed solutions to date, although a higher level of quantization in the action space produces a compact subset of video volumes, it also significantly reduces the discriminative power of the descriptors, an issue addressed in [40]. Generally, all of the earlier work described above for modeling the mutual relationships between video volumes has one or more limitations, such as: considering relationships between only a pair of local video volumes [42,4]; being too local and unable to capture interactions of different body parts [33,48]; and considering either the spatial or the temporal order of volumes [4].

In this paper we present a hierarchical probabilistic codebook method for action recognition and localization in videos. The proposed codebook structure has two important characteristics: it codes the compositional information of the 3D video volumes and selects the most informative ones in the video.

3. Multi-scale hierarchical codebooks

Considering the structure presented in Fig. 1, our aim is to find the similarity between the query and all of the target videos. Our work is based on the bag of space–time features approach, in that a set of STVs is used for measuring similarity. The proposed recognition algorithm in Fig. 1 consists of two main steps: densely sampling videos from which hierarchical codebooks are constructed (see Fig. 2) and using an inference mechanism for finding the appropriate action in the target videos. In this section, we focus on the former, and Section 4 describes the inference mechanism. We first explain the sampling strategy and then describe the hierarchical codebook structure.

3.1. Low level scene representation

The first stage of the algorithm is to represent a query video by meaningful spatio-temporal descriptors. This is achieved by dense sampling, thereby producing a large number of spatio-temporal video volumes. Then similar video volumes are clustered to form a codebook. Since this is actually done on-line, frame-by-frame, the codebook is adaptive. The constructed codebook at this level is called the low-level codebook, as illustrated in Fig. 2.

3.1.1. Multi-scale dense sampling

Similar to all BOV approaches, 3D STVs in a video are constructed at the lowest level of the hierarchy. Although there are many methods for sampling the video for volume construction, dense sampling has been shown to be superior to the others in terms of retaining the informative features of a video [61]. Therefore, performance almost always increases with the number of sampled spatio-temporal volumes, making dense sampling the preferable choice [39,7,61].

Fig. 2. Overview of the scene representation and hierarchical codebook structure. First, the query video is densely sampled at different spatio-temporal scales, followed by the construction of a set of overlapping spatio-temporal video volumes. Subsequently, a two-level hierarchical probabilistic codebook is created for the video volumes. At the lower level of the hierarchy, similar video volumes are grouped to form a conventional low-level codebook, C_L, while considering the uncertainty in codeword assignment. At the higher level, a much larger spatio-temporal 3D volume around each pixel, containing many STVs, is considered in order to capture the spatio-temporal arrangement of the volumes. We refer to this graph as an ensemble of volumes. Using these graphs, similar ensembles are grouped based on the similarity between the arrangements of their video volumes, and yet another codebook is formed. The most informative codewords are then selected by examining the temporal correspondence between codewords.

The 3D spatio-temporal video volumes, v_i \in \mathbb{R}^{n_x \times n_y \times n_t}, are constructed by assuming a volume of size n_x × n_y × n_t around each pixel (in which n_x × n_y is the size of the spatial (image) window and n_t is the depth of the video volume in time). Spatio-temporal volume construction is performed at several spatial and temporal scales of a Gaussian space-time video pyramid. This yields a large number of volumes at each pixel in the video. Fig. 2 illustrates the process of spatio-temporal volume construction. These volumes are then characterized by a descriptor, which is the histogram of the spatio-temporal oriented gradients in the video, expressed in polar coordinates [49,51]. Assume that G_x(x,y,t) and G_y(x,y,t) are the spatial gradients and G_t(x,y,t) is the temporal gradient for each pixel at (x,y,t). The spatial gradient used to calculate the 3D gradient magnitude is normalized to reduce the effect of local texture and contrast. Hence, let:

G_s(x,y,t) = \sqrt{G_x(x,y,t)^2 + G_y(x,y,t)^2}, \quad (x,y,t) \in v_i

\tilde{G}_s(x,y,t) = \frac{G_s(x,y,t)}{\sum_{(x,y,t) \in v_i} G_s(x,y,t) + \varepsilon_{\max}}    (1)

where \tilde{G}_s is the normalized spatial gradient and \varepsilon_{\max} is a constant, set to 1% of the maximum spatial gradient magnitude in order to avoid numerical instabilities. Hence, the 3D normalized gradient is represented in polar coordinates (M(x,y,t), \theta(x,y,t), \phi(x,y,t)):

M(x,y,t) = \sqrt{\tilde{G}_s(x,y,t)^2 + G_t(x,y,t)^2}

\theta(x,y,t) = \tan^{-1}\!\left(\frac{G_y(x,y,t)}{G_x(x,y,t)}\right)

\phi(x,y,t) = \tan^{-1}\!\left(\frac{G_t(x,y,t)}{\tilde{G}_s(x,y,t)}\right)    (2)

where M(x,y,t) is the 3D gradient magnitude, and \phi(x,y,t) and \theta(x,y,t) are the orientations within [-\pi/2, \pi/2] and [-\pi, \pi], respectively. The descriptor vector for each video volume, taken as a histogram of oriented gradients (HOG), is constructed by quantizing \theta and \phi into n_\theta and n_\phi bins, respectively, weighted by the gradient magnitude M. The descriptor of each video volume will be referred to as h_i \in \mathbb{R}^{n_\theta + n_\phi}. This descriptor represents both motion and appearance and possesses some degree of robustness to unimportant variations in the data, such as illumination changes [49]. However, it should be noted that our algorithm does not rely on a specific descriptor for the video volumes, and other descriptors might enhance the performance of the approach. Examples of more complicated descriptors are the ones in [9], the spatio-temporal gradient filters in [52], the spatio-temporal oriented energy measurements in [10] and the popular three-dimensional Scale Invariant Feature Transform (SIFT) [50].

3.1.2. Codebook of video volumes

As the number of these volumes is extremely large (for example, about 10^6 in a one-minute video), it is advantageous to group similar STVs to reduce the dimensions of the search space. This is commonly performed in all BOV approaches [42,9,61]. Here, similar video volumes are also grouped when constructing a codebook. The procedure is straightforward [15,61]. The first codeword is made equivalent to the first observed spatio-temporal volume. After that, by measuring the similarity between each observed volume and the codewords already existing in the codebook, either the codewords are updated or a new one is formed. Then, each codeword is updated with a weight w_{i,j}, which is based on the similarity between the volume and the existing codewords. Here, we utilize the Euclidean distance for this purpose.

Thus, the normalized weight of assigning codeword c_j to video volume v_i is given by⁴:

w_{i,j} = \left( \sum_j \frac{1}{\mathrm{distance}(v_i, c_j)} \right)^{-1} \frac{1}{\mathrm{distance}(v_i, c_j)}    (3)

Another important parameter is the number of times, f_j, that a codeword has been observed [61]. The codebook is continuously pruned to eliminate codewords that are either infrequent or very similar to the others, which ultimately generates M_L different codewords that are taken as the labels for the video volumes, C_L = \{c_i\}_{i=1}^{M_L}.

After the initial codebook formation,⁵ each new 3D volume, v_i, can be assigned to all labels c_j with a degree of similarity w_{i,j}, as shown in Fig. 3a. We note that the number of labels (shown in color), M_L, is much less than the number of volumes, N. Moreover, codebook construction can be performed using any other clustering method, such as k-means, online fuzzy c-means [51], or mutual information [42].

3.2. High level scene representations

At the previous step, similar video volumes were grouped in order to construct the low level codebook. The outcome of this is a set of similar volumes, clustered regardless of their positions in space and time. This is the point at which all other BOV methods stop. As stated in the previous section, the main drawback of many BOV approaches is that they do not consider the spatio-temporal composition (context) of the video volumes. Certain methods for capturing such information have appeared in the literature (see [7,41,47]). In this paper, we present a probabilistic framework for quantifying the arrangement of the spatio-temporal volumes.

3.2.1. Ensembles of volumes

Suppose a new video is to be analyzed; we refer to it as the query. The goal is to measure the likelihood of each pixel in the target videos given the query. To accomplish this, it is necessary to analyze the spatio-temporal arrangement of the volumes in the clusters that were determined in Section 3.1. Thus, we next consider a large 3D volume around each pixel in (x,y,t) space. This large region contains many volumes with different spatial and temporal sizes, as shown in Fig. 3b. Thus it captures both the local and more distant information in the video frames. Such a set is called an ensemble of volumes around the particular pixel in the video. The ensemble of volumes, E(x,y,t), surrounding each pixel (x,y) in the video at time t, is defined as:

E(x,y,t) = \left\{ v_j^{E(x,y,t)} \right\}_{j=1}^{J} = \left\{ v_j : v_j \in R(x,y,t) \right\}_{j=1}^{J}    (4)

where R(x,y,t) \subset \mathbb{R}^3 is a region with pre-defined spatial and temporal dimensions centered at point (x,y,t) in the video (e.g., r_x × r_y × r_t) and J indicates the total number of volumes inside the ensemble. These large contextual 3D spaces are employed to construct higher-level codebooks.

3.2.2. Contextual information and spatio-temporal compositions

To capture the spatio-temporal compositions of the video volumes, we use the relative spatio-temporal coordinates of the volumes in each ensemble, as shown in Fig. 3c. Assume that the ensemble of video volumes at point (x_i, y_i, t_i) is E_i and the central video volume inside that ensemble is called v_o. Assume that v_o is located at the point (x_o, y_o, t_o)

⁴ Throughout the rest of the paper, each video volume will be represented by its descriptor vector.
⁵ Recall that initialization requires a minimum of one video frame.


in absolute coordinates. Therefore, \Delta_{v_j}^{E_i} \in \mathbb{R}^3 is the relative position (in space and time) of the j-th video volume, v_j, inside the ensemble of volumes:

\Delta_{v_j}^{E_i} = \left( x_j - x_o, \; y_j - y_o, \; t_j - t_o \right)    (5)

Then each ensemble of video volumes at point (x_i, y_i, t_i) is represented by a set of such video volumes and their relative positions, and hence Eq. (4) can be rewritten as:

E(x_i, y_i, t_i) = \left\{ \left( \Delta_{v_j}^{E_i}, v_j, v_o \right) \right\}_{j=1}^{J}    (6)

An ensemble of volumes is characterized by a set of video volumes, the central video volume, and the relative distance of each of the volumes in the ensemble to the central video volume, as represented in Eq. (6). This provides a view-based graphical spatio-temporal multi-scale description at each pixel in every frame of a video.
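In practice, the ensemble representation of Eqs. (4)–(6) amounts to collecting, for every pixel, the volumes whose centers fall inside the region R(x,y,t) together with their offsets from the central volume. The sketch below assumes that volume centers and their low-level codeword labels are already available as arrays; the region size r_x = r_y = r_t = 50 follows the values reported in Section 5, and all names are illustrative.

import numpy as np

def build_ensemble(centers, labels, center_idx, r=(50, 50, 50)):
    """Eqs. (4)-(6): gather the volumes whose centers lie inside the
    r_x x r_y x r_t region around the central volume `center_idx`,
    together with their relative spatio-temporal offsets Delta."""
    c0 = centers[center_idx]                                  # (x_o, y_o, t_o)
    half = np.asarray(r, dtype=np.float64) / 2.0
    inside = np.all(np.abs(centers - c0) <= half, axis=1)     # membership in R(x,y,t)
    return {
        "delta": centers[inside] - c0,        # Eq. (5): J x 3 relative positions
        "labels": labels[inside],             # low-level codeword of each v_j
        "central_label": labels[center_idx],  # codeword of v_o
    }

# Example: 1000 random volume centers in a 120x160x200 video, 55 codeword labels.
centers = np.random.rand(1000, 3) * np.array([120, 160, 200])
labels = np.random.randint(0, 55, size=1000)
E = build_ensemble(centers, labels, center_idx=0)
print(E["delta"].shape, E["central_label"])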

A common approach for calculating similarity between ensembles of volumes is to use the star graph model of [7,11,49]. This model uses the joint probability between a database and a query ensemble to decouple the similarity of the topologies of the ensembles and that of the actual video volumes [11]. To avoid such a decomposition, we estimate the pdf of the volume composition in an ensemble and then measure the similarity between these estimated pdfs.

During the codeword assignment process described in Section 3.1.2, each volume v_j inside each ensemble was assigned to a label c_m \in C_L with some degree of similarity w_{j,m} using Eq. (3). Given the codewords assigned to the video volumes, each ensemble of volumes can be represented by a set of codewords and their spatio-temporal relationships. Let c_m \in C_L be the codeword assigned to the video volume v_j and c_n \in C_L the codeword assigned to the central video volume v_o. Therefore, Eq. (6) can be rewritten as⁶:

E(x_i, y_i, t_i) = \left\{ (\Delta, c_m, c_n) : v_j \leftarrow c_m, \; v_o \leftarrow c_n \right\}_{j=1:J, \; m=1:M_L, \; n=1:M_L}    (7)

where \Delta denotes the relative position of the codeword c_m inside the ensemble of volumes. By representing an ensemble as a set of codewords and their spatio-temporal relationships, the topology of the ensemble, \Gamma, is defined as:

\Gamma = \left\{ \Gamma_{m,n}(\Delta) \right\}_{m=1:M_L, \; n=1:M_L}    (8)

where \Gamma is the topology of an ensemble of video volumes that encodes the spatio-temporal relationships between codewords inside the ensemble. \Gamma_{m,n}(\Delta) \in \Gamma is taken to be the spatio-temporal relationship between two codewords, c_m and c_n, in the ensemble.⁷ Therefore,

\Gamma_{m,n}(\Delta) = (\Delta, c_m, c_n)    (9)

Let v denote an observation, which is taken as a video volume inside the ensemble. Assume that its relative location is represented by \Delta_v, and that v_o is the central volume of the ensemble. The aim is to measure the probability of observing a particular ensemble model. Therefore, given an observation, (\Delta_{v_j}^{E_i}, v_j, v_o), the posterior probability of each topological model, \Gamma_{m,n}(\Delta), is written as:

P\left( \Gamma_{m,n}(\Delta) \mid \Delta_{v_j}^{E_i}, v_j, v_o \right) = P\left( \Delta, c_m, c_n \mid \Delta_{v_j}^{E_i}, v_j, v_o \right)    (10)

The posterior probability in Eq. (10) defines the probability of observing the codewords c_m and c_n and their relative location, \Delta, given the observed video volumes (\Delta_{v_j}^{E_i}, v_j, v_o) in an ensemble of volumes. Eq. (10) can be rewritten as:

P\left( \Delta, c_m, c_n \mid \Delta_{v_j}^{E_i}, v_j, v_o \right) = P\left( \Delta, c_n \mid c_m, \Delta_{v_j}^{E_i}, v_j, v_o \right) P\left( c_m \mid \Delta_{v_j}^{E_i}, v_j, v_o \right)    (11)

Since the unknown video volume, v_j, has now been replaced by a known interpretation, c_m, the first factor on the right hand side of Eq. (11) can be treated as being independent of v_j. Moreover, it is assumed that video volumes are independent. Thus v_o can be removed from the second factor on the right hand side of Eq. (11), and hence it can be rewritten as follows:

P\left( \Delta, c_m, c_n \mid \Delta_{v_j}^{E_i}, v_j, v_o \right) = P\left( \Delta, c_n \mid c_m, \Delta_{v_j}^{E_i}, v_o \right) P\left( c_m \mid \Delta_{v_j}^{E_i}, v_j \right)    (12)

Fig. 3. (a) Codeword assignment to spatio-temporal video volumes. Each codeword is assigned to a volume with a degree of similarity w_{i,j}. (b) An ensemble of spatio-temporal volumes obtained at one of the computed scales. A large 3D volume surrounding each pixel, containing many spatio-temporal volumes, is considered and referred to as an ensemble of volumes. This large 3D volume will be used both for further analysis and for measuring the likelihood of each pixel. (c) Relative spatio-temporal coordinates of a particular video volume inside an ensemble of volumes, \Delta_{v_j}^{E_i}.

⁶ ← symbolizes value assignment.
⁷ These topological models, \Gamma_{m,n}(\Delta), are obtained by assuming that the codeword entries are independent. Although in the case of overlapping video volumes such an assumption is not true, this is the standard Markovian assumption made for BOV.


On the other hand, the codeword assigned to the video volume is independent of its position, \Delta_{v_j}^{E_i}. Therefore Eq. (12) can be reduced to:

P\left( \Delta, c_m, c_n \mid \Delta_{v_j}^{E_i}, v_j, v_o \right) = P\left( \Delta, c_n \mid c_m, \Delta_{v_j}^{E_i}, v_o \right) P\left( c_m \mid v_j \right)    (13)

Rewriting Eq. (13) gives:

P\left( \Delta, c_m, c_n \mid \Delta_{v_j}^{E_i}, v_j, v_o \right) = P\left( \Delta \mid c_m, c_n, \Delta_{v_j}^{E_i}, v_o \right) P\left( c_n \mid c_m, \Delta_{v_j}^{E_i}, v_o \right) P\left( c_m \mid v_j \right)    (14)

Similarly, by assuming independence between codewords and their locations, Eq. (14) can be reduced to:

P\left( \Delta, c_m, c_n \mid \Delta_{v_j}^{E_i}, v_j, v_o \right) = P\left( \Delta \mid c_m, c_n, \Delta_{v_j}^{E_i} \right) P\left( c_n \mid v_o \right) P\left( c_m \mid v_j \right)    (15)

The first factor on the right hand side of Eq. (15) is the probabilistic vote for a spatio-temporal position, given the codeword assigned to the central video volume of the ensemble, the codeword assigned to the video volume, and its relative position. We note that, given a set of ensembles of video volumes, the probability distribution function (pdf) in Eq. (15) can be formed using either a parametric model or non-parametric estimation. Here, we approximate P(\Delta \mid c_m, c_n, \Delta_{v_j}^{E_i}), describing each ensemble in Eq. (15), using (nonparametric) histograms. P(c_m \mid v_j) and P(c_n \mid v_o) in Eq. (15) are the votes for each codeword entry, and they are obtained in the codeword assignment procedure in Section 3.1.2. Eventually, each ensemble of volumes can be represented by a set of pdfs as follows:

P(\Gamma \mid E_i) = \left\{ P\left( \Gamma_{m,n}(\Delta) \mid E_i \right) \right\}_{m=1:M_L, \; n=1:M_L}    (16)

where P(\Gamma \mid E_i) is a set of pdfs modeling the topology of the ensemble of volumes. Therefore, similarity between two video sequences can be computed simply by matching the pdfs of the ensembles of volumes at each pixel.
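Under the factorization of Eq. (15), an ensemble is summarized, for each codeword pair (c_m, c_n), by a histogram over the relative positions \Delta, weighted by the soft-assignment votes P(c_m | v_j) and P(c_n | v_o). A minimal sketch of one way to form these histograms is shown below; the bin counts and the vote-weighting scheme are illustrative assumptions, since the paper does not specify them.

import numpy as np

def ensemble_pdfs(delta, votes, central_votes, n_words, bins=(5, 5, 5), extent=50.0):
    """Approximate P(Delta | c_m, c_n) for one ensemble (Eqs. 15-16) by
    vote-weighted 3D histograms over the relative positions.
    delta:         J x 3 relative positions of the volumes in the ensemble
    votes:         J x M_L soft assignments P(c_m | v_j) for each volume
    central_votes: M_L soft assignments P(c_n | v_o) of the central volume
    Returns a dict {(m, n): normalized 3D histogram}."""
    edges = [np.linspace(-extent / 2, extent / 2, b + 1) for b in bins]
    pdfs = {}
    for m in range(n_words):
        for n in range(n_words):
            # weight of each observed offset: P(c_m | v_j) * P(c_n | v_o)
            w = votes[:, m] * central_votes[n]
            if w.sum() < 1e-12:
                continue                      # pair never observed in this ensemble
            hist, _ = np.histogramdd(delta, bins=edges, weights=w)
            pdfs[(m, n)] = hist / (hist.sum() + 1e-12)
    return pdfs

# Example with 40 volumes in an ensemble and a 10-word low-level codebook.
J, M_L = 40, 10
delta = np.random.uniform(-25, 25, size=(J, 3))
votes = np.random.dirichlet(np.ones(M_L), size=J)
central = np.random.dirichlet(np.ones(M_L))
P = ensemble_pdfs(delta, votes, central, n_words=M_L)
print(len(P), next(iter(P.values())).shape)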

3.2.3. Codebook of ensembles of spatio-temporal volumes

Once a video clip has been processed, each ensemble of spatio-temporal volumes has been represented by a set of pdfs as given in Eq. (16). Having performed the first level of clustering in Section 3.1.2, and given the representation of each ensemble obtained in Eq. (16), the aim now is to cluster the ensembles. This will then permit us to construct a behavioral model for the query video. Although clustering can be performed using many different approaches, spectral clustering methods are currently in vogue due to their superior performance compared with traditional methods. Moreover, they can be computed efficiently. Spectral clustering constructs a similarity matrix of feature vectors and seeks an optimal partition of the graph representing the similarity matrix using eigen-decomposition [53]. Usually, this is followed by either k-means or fuzzy c-means clustering. We utilize the normalized decomposition method of [54].

Employing the overall pdf P(\Gamma \mid E_i) in Eq. (16) to represent each ensemble of volumes makes it possible to use divergence functions from statistics and information theory as the appropriate dissimilarity measure. Here we use the symmetric Kullback–Leibler (KL) divergence to measure the difference between two pdfs, f and g [55]:

d(f, g) = KL(f \,\|\, g) + KL(g \,\|\, f)    (17)

where KL(f \,\|\, g) is the Kullback–Leibler (KL) divergence of f and g. Therefore, given the pdf of each ensemble of volumes in Eq. (16), the similarity between two ensembles of volumes, E(x_i, y_i, t_i) and E(x_j, y_j, t_j), is defined as:

s(E_i, E_j) = \exp\left( -\frac{d^2\left( P(\Gamma \mid E_i), P(\Gamma \mid E_j) \right)}{2\sigma^2} \right)    (18)

where P(\Gamma \mid E(x_i, y_i, t_i)) and P(\Gamma \mid E(x_j, y_j, t_j)) are the pdfs of the ensembles E(x_i, y_i, t_i) and E(x_j, y_j, t_j), respectively, obtained in Section 3.2.2, d is the symmetric KL divergence between the two pdfs in Eq. (17), and \sigma is the variance of the KL divergence over all of the observed ensembles of STVs in the query.

Given the similarity measurement of the ensembles in Eq. (18), the similarity matrix, S_N, for a set of ensembles of volumes is formed and the Laplacian is calculated as follows:

L = D^{-1/2} S_N D^{-1/2}    (19)

where D is a diagonal matrix whose i-th diagonal element is the sum of all elements in the i-th row of S_N. Subsequently, an eigenvalue decomposition is applied to L, and the eigenvectors corresponding to the largest eigenvalues are normalized and form a new representation of the data to be clustered [54]. This is followed by online fuzzy single-pass clustering [56] to produce M_H different codewords for the high-level codebook of ensembles of STVs, C_H = \{c_i\}_{i=1}^{M_H}, for each pixel.
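The grouping of ensembles into the high-level codebook can be sketched as follows: compute the symmetric KL divergence between ensemble pdfs (Eq. 17), turn it into a similarity matrix (Eq. 18), form the normalized Laplacian (Eq. 19), and cluster the leading eigenvectors. For brevity, this sketch substitutes a plain k-means step for the online fuzzy single-pass clustering of [56], represents each ensemble by a single flattened histogram, and estimates \sigma from the spread of the divergences; these are simplifying assumptions, not the authors' exact procedure.

import numpy as np

def sym_kl(p, q, eps=1e-10):
    """Eq. (17): symmetric Kullback-Leibler divergence between two histograms."""
    p = p + eps; q = q + eps
    p = p / p.sum(); q = q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def spectral_codebook(ensemble_hists, n_clusters=8, n_iters=50):
    """Eqs. (18)-(19): similarity matrix from symmetric KL, normalized Laplacian,
    eigen-decomposition, then a simple k-means on the spectral embedding."""
    N = len(ensemble_hists)
    d = np.array([[sym_kl(ensemble_hists[i], ensemble_hists[j]) for j in range(N)]
                  for i in range(N)])
    sigma2 = d.var() + 1e-12                             # spread of the divergences
    S = np.exp(-d ** 2 / (2.0 * sigma2))                 # Eq. (18)
    Dinv = np.diag(1.0 / np.sqrt(S.sum(axis=1)))
    L = Dinv @ S @ Dinv                                  # Eq. (19)
    vals, vecs = np.linalg.eigh(L)
    X = vecs[:, -n_clusters:]                            # leading eigenvectors
    X = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    rng = np.random.default_rng(0)                       # plain k-means on the rows
    centers = X[rng.choice(N, n_clusters, replace=False)]
    for _ in range(n_iters):
        assign = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(n_clusters):
            if np.any(assign == k):
                centers[k] = X[assign == k].mean(axis=0)
    return assign, centers

# Example: 60 random ensemble histograms of length 125 (5x5x5 bins, flattened).
hists = [np.random.rand(125) for _ in range(60)]
labels, _ = spectral_codebook(hists, n_clusters=5)
print(np.bincount(labels))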

3.2.4. Informative codeword selection

In order to select a particular video in a target set that contains a similar activity to the one in the query video, the uninformative regions (e.g., background) must obviously be excluded from the matching procedure. This is conventionally performed in all activity recognition algorithms. Generally, for shape-template and tracking based approaches this is done at the pre-processing stages using such methods as background subtraction and ROI selection. These have their inherent problems, discussed in Section 2. On the other hand, selecting informative rather than uninformative regions is a normal aspect of almost every BOV-based approach that constructs STVs at interest points. Clearly, these are intrinsically related to the most informative regions in the video. When we consider the framework for activity recognition proposed in this paper, the high-level codebook of ensembles of STVs is used to generate codes for all pixels in each video frame. Therefore it is crucial to select only the most informative codewords and their related pixels. Given the high-level codebook, C_H, constructed in Section 3.2.3, we saw that a codeword is assigned to each pixel p(x,y) at time t in the video. Therefore, in a video sequence of temporal length T, a particular pixel p(x,y) is represented as a sequence of assigned codewords at different times:

p(x,y) = \left\{ p(x,y) \leftarrow c_i : \forall t \in T, \; c_i \in C_H \right\}    (20)

A sample video frame and the assigned codewords are illustrated in Fig. 4. In order to remove non-informative codewords (e.g., codewords which represent the scene background), each pixel and its assigned codewords are analyzed as a function of time. As an example, Fig. 4 plots the codewords assigned to sample pixels in the video over time. It is observed that the pixels related to the background or static objects show stationary behavior. Therefore the associated codewords can be removed by employing a simple temporal filter at each pixel. This method was inspired by the pixel-based background model presented in [57], where a time series of each of the three quantized color features was created at each pixel. A more compact model of the background is then determined by temporal filtering, based on the idea of the Maximum Negative Run-Length (MNRL). The MNRL is defined as the maximum amount of time between observing two samples of a specific codeword at a particular pixel [57]. The larger the MNRL, the more likely the codeword is not the background. The main difference from [57] is that we employ the assigned codewords as the representative features for every pixel, as obtained from the high level codebook C_H (see Eq. (20)).
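The temporal filter can be illustrated with the MNRL idea borrowed from [57]: for each pixel, compute, for every high-level codeword it was assigned, the longest gap between consecutive occurrences; codewords that recur with only short gaps behave like background and are discarded. The sketch below is a simplified, non-wrapping version of that test with an illustrative gap threshold.

import numpy as np

def mnrl(code_sequence, codeword):
    """Maximum gap (in frames) between consecutive occurrences of `codeword`
    in the per-pixel time series of assigned high-level codewords."""
    t = np.flatnonzero(np.asarray(code_sequence) == codeword)
    if t.size == 0:
        return len(code_sequence)             # never observed: maximal gap
    gaps = np.diff(np.r_[-1, t, len(code_sequence)]) - 1
    return int(gaps.max())

def informative_codewords(code_sequence, max_gap):
    """Keep codewords whose MNRL exceeds `max_gap`; background-like codewords
    recur regularly (small MNRL) and are filtered out."""
    return [int(w) for w in np.unique(code_sequence)
            if mnrl(code_sequence, w) > max_gap]

# Example: a pixel that is background (codeword 3) except for a short action burst.
seq = [3] * 40 + [7] * 10 + [3] * 50
print(mnrl(seq, 3), mnrl(seq, 7))              # small gap for 3, large gap for 7
print(informative_codewords(seq, max_gap=20))  # -> [7]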

The major advantage of selecting informative codewords at the highest level of the coding hierarchy is that compositional scene information comes into play.⁸ Hence the computational cost is greatly reduced and the need for a separate background subtraction algorithm is eliminated.

In summary, the query video is first densely sampled at different spatio-temporal scales in order to construct the video volumes. Then a low level codebook is formed and each volume v_j is assigned to a codeword c_i, c_i ∈ C_L, with similarity w_{j,i}. Then a larger 3D volume around each pixel, containing many STVs, the so-called ensemble of STVs, is considered. The spatio-temporal arrangement of the volumes inside each ensemble is modeled by a set of pdfs. At the next level of the hierarchical structure, another codebook, C_H, is formed for these ensembles of STVs. The two codebooks are then employed for finding videos similar to the query.

Two main features characterize the constructed probabilistic model of the ensembles. First, the spatio-temporal probability distribution is defined independently for each codebook entry. Second, the probability distribution for each codebook entry is estimated using (non-parametric) histograms. The former renders the approach capable of handling certain deformations of an object's parts, while the latter makes it possible to model the true distribution instead of making an oversimplifying Gaussian assumption.

4. Similarity map construction and video matching

The overall goal is to find videos similar to a query video in a target set and consequently label them according to the labeled query video using the hierarchical codebook presented in Section 3. Fig. 5 summarizes the process of determining the hierarchical codebooks and how the similarity maps are constructed.

The inference mechanism is the procedure for calculating the similarity between particular spatio-temporal volume arrangements in the query and the target videos. More precisely, given a query video containing a particular activity, Q, we are interested in constructing a dense similarity map for every pixel in the target video, V, by utilizing pdfs of the volume arrangements in the video. At first, the query video is densely sampled and a low level codebook is constructed for local spatio-temporal video volumes. Then the ensembles of video volumes are formed. These data are used to create a high level codebook, C_H, for coding the spatio-temporal compositional information of the video volumes, as described in Section 3. Finally, the query video is represented by its associated codebooks.⁹ In order to construct the similarity map for the target video, V, it is densely sampled at different spatio-temporal scales and the codewords from C_L are assigned to the video volumes. Then the ensembles of video volumes are formed at every pixel and the similarity between the ensembles in V and the codewords in C_H is measured using Eq. (18). In this way, a similarity map is constructed at every pixel in the target video, S_{Q,V}(x,y,t). The procedure for similarity map construction is described in detail in Fig. 5. Note again that no background and foreground segmentation and no explicit motion estimation are required in the proposed method.

Having constructed a similarity map, it remains to find the best match to the query video.¹⁰ Generally, two scenarios are considered in activity recognition and video matching: (1) detecting and localizing an activity of interest and (2) classifying a target video given more than one query, which is usually referred to as action classification. For both of these, the region in the target video that contains a similar activity to the query must be selected at an appropriate scale. We perform multi-scale activity localization, so that ensembles of volumes are generated at each scale independently. Hence, we produce a set of independent similarity maps for each scale. Therefore, for a given ensemble of volumes, E(x,y,t), in the target video, a likelihood function is formed at each scale:

p\left( S_{Q,V}(x,y,t) \mid scale \right)    (21)

where S_{Q,V}(x,y,t) is the similarity between the ensemble of volumes in the target video, E(x,y,t), and the most similar codeword in the high-level codebook, c_k^* \in C_H, and scale represents the scale at which the similarity is measured. In order to localize the activity of interest, i.e., finding the ensemble of volumes in the target video most similar to the query, the maximum likelihood estimate of the scale at each pixel is employed. Therefore, the most appropriate scale at each pixel is the one that maximizes the following likelihood estimate:

scale^* = \arg\max_{scale} \; p\left( S_{Q,V}(x,y,t) \mid scale \right)    (22)

In order to find the most similar ensemble to the query, a detection threshold was employed. Hence, an ensemble of volumes is said to be similar to the query and to contain the activity of interest if S_{Q,V}(x,y,t) \geq \gamma at scale^*. In this way, the region in the target video that matches the query is detected.¹¹
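Given per-scale similarity maps, the localization step reduces to taking, at each pixel, the scale with the highest similarity (Eq. 22) and thresholding the result at γ times the map maximum, as in footnote 11. The sketch below assumes the per-scale maps have already been computed and stacked into one array; it illustrates the selection rule only.

import numpy as np

def localize(similarity_maps, gamma=0.7):
    """similarity_maps: array of shape (n_scales, H, W, T) holding
    S_{Q,V}(x, y, t) at each scale.  Returns the per-pixel best scale
    (Eq. 22), the similarity at that scale, and a detection mask."""
    best_scale = similarity_maps.argmax(axis=0)          # scale* per pixel
    best_sim = similarity_maps.max(axis=0)               # similarity at scale*
    mask = best_sim >= gamma * best_sim.max()            # detection threshold (footnote 11)
    return best_scale, best_sim, mask

# Example: 3 scales over a small 32x32x16 target video.
maps = np.random.rand(3, 32, 32, 16)
scale_star, sim, detections = localize(maps)
print(scale_star.shape, detections.mean())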

Fig. 4. Informative codeword selection. a) A sample video frame from the KTH dataset in which the person is running. b) High-level codewords assigned to every pixel in the video frame. c) Temporal correspondence of the codewords at each pixel: plots of the codewords assigned to two sample pixels at different times. A time series of the assigned codewords from the high level codebook is ascribed to each pixel in the video. Pixels related to the background or static objects show stationary behavior over time, and hence they are assumed to be uninformative.

⁸ Some advanced approaches for background modeling also incorporate spatio-temporal compositions of the motion-informative regions to build a background model [63,52].
⁹ The query is represented by two codebooks: the low level codebook of spatio-temporal video volumes, C_L, and the high level codebook of the ensembles of video volumes, C_H.
¹⁰ The inference mechanism is relatively simple, as our aim is to introduce and formulate a hierarchical structure for constructing a similarity map between videos based on densely sampled STVs and their spatio-temporal compositions. However, it could be replaced by a more sophisticated one.
¹¹ The threshold γ was set empirically to 0.7 of the maximum similarity value for every query video in all experiments.
¹² In most of the reported approaches for activity recognition, it is implicitly assumed that the query contains a single activity.

For the action classification problem, we consider a set of queries, Q = \cup_i \{Q_i\}, each containing a particular activity.¹² The target video is then labeled according to the most similar query video. For each query video, Q_i, two codebooks are formed and then the similarity maps are constructed as described in Fig. 5. This produces a set of similarity maps for all activities of interest. Therefore, the target video contains the particular activity, i^*, that maximizes the accumulated similarity over all ensembles of volumes in the target video, as follows:

i^* = \arg\max_i \left( \sum_{E(x,y,t) \in V} S_{Q_i,V}(x,y,t) \right), \quad Q_i \in Q    (23)
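For classification, Eq. (23) simply accumulates the similarity map produced by each query over the whole target video and picks the query with the largest total. A minimal sketch, assuming one similarity map per query has already been computed:

import numpy as np

def classify(similarity_maps_per_query):
    """Eq. (23): label the target video with the query whose similarity map,
    accumulated over all ensembles (pixels), is largest.
    similarity_maps_per_query: dict {action_label: array of S_{Q_i,V}(x,y,t)}."""
    totals = {label: float(np.sum(s_map))
              for label, s_map in similarity_maps_per_query.items()}
    return max(totals, key=totals.get), totals

# Example with three hypothetical query actions.
maps = {"walking": np.random.rand(32, 32, 16),
        "boxing": np.random.rand(32, 32, 16) + 0.2,   # slightly more similar overall
        "waving": np.random.rand(32, 32, 16)}
label, scores = classify(maps)
print(label)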

Despite the simple inference mechanism employed here for action recognition and localization, the experimental results obtained show the strength of our approach for similarity map construction between two videos. We also note that the proposed statistical model of codeword assignment and the arrangement of the spatio-temporal volumes permit small local misalignments in the relative geometric arrangement of the composition. This property, in addition to the multi-scale volume construction in each ensemble, enables the algorithm to handle certain non-rigid deformations in space and time. This, of course, is necessary since human actions are not exactly reproducible, even for the same person. Obviously, activity recognition from a single example eliminates the need for a large number of training videos for model construction and significantly reduces computational costs. On the other hand, it imposes some limitations by its nature. Learning from a single example is inherently less general than models constructed using many training examples, and therefore our approach may not generalize as well as model-based approaches. However, it should be emphasized that constructing a generic viewpoint- and scale-invariant model for an activity requires a large amount of labeled training data, which does not currently exist. Moreover, imposing strong priors by assuming particular types of activities reduces the search space of possible poses considered, which limits their application to action recognition.

We conclude this section by examining the computational complexity of our algorithm. Suppose there are K video volumes available in each ensemble and the numbers of codewords in the low- and high-level codebooks are M_L and M_H, respectively. For each ensemble, the time complexity of the low level and high level codeword assignment is O(K \times M_L) and O(M_H), respectively. Therefore the complexity of calculating each point in a similarity map is O(K \times M_L \times M_H).

5. Experimental results

The algorithm was tested on three different datasets, KTH [12], Weizmann [13] and MSR II [14], to determine its capabilities for action recognition. The Weizmann and KTH datasets are the standard benchmarks in the literature used for action recognition. The Weizmann dataset consists of ten different actions performed by nine actors, and the KTH action dataset contains six different actions, performed by twenty-five different persons in four different scenarios (indoor, outdoor, outdoor at different scales, outdoor with different clothes). The MSR II dataset consists of 54 video sequences, recorded in different environments with cluttered backgrounds in crowded scenes, and contains three types of actions similar to the KTH: boxing, hand clapping, and hand waving. We evaluated our approach in three different scenarios. The first one is “action matching and retrieval using a single example”, in which both target and query videos are selected from the same dataset. This task measures the capability of the proposed approach for video matching. The second scenario is the “single dataset action classification” task, in which more than one query video is employed to construct the model of a specific activity. Here, single dataset classification implies that both query and target videos are selected from the same dataset. Finally, in order to measure the generalization capability of our algorithm to find similar activities in videos recorded in different environments, “cross-dataset action detection” was performed. This scenario implies that the query and target videos can be selected from different datasets.

Fig. 5. The complete algorithm for similarity measurement between query and target videos. The query video is densely sampled and two codebooks are formed. The similarity between a target video and the query at each pixel is measured based on these and then employed to construct a similarity map.

Fig. 6. Confusion matrices for single video action matching: a) Weizmann dataset, b) KTH dataset. A single video is used as a query to which the other videos in the dataset were matched.

a) Weizmann dataset

        Bend  Jack  Jump  Pjump Run   Side  Skip  Walk  Wave1 Wave2
Bend    0.96  0.00  0.00  0.01  0.00  0.00  0.00  0.00  0.01  0.02
Jack    0.00  0.91  0.00  0.01  0.03  0.01  0.03  0.01  0.00  0.00
Jump    0.00  0.00  0.87  0.03  0.02  0.00  0.07  0.00  0.01  0.00
Pjump   0.01  0.00  0.04  0.90  0.02  0.02  0.01  0.00  0.00  0.00
Run     0.00  0.01  0.00  0.01  0.92  0.00  0.02  0.03  0.00  0.01
Side    0.00  0.02  0.00  0.04  0.00  0.93  0.00  0.00  0.01  0.00
Skip    0.00  0.02  0.07  0.01  0.01  0.00  0.87  0.02  0.00  0.00
Walk    0.00  0.01  0.00  0.00  0.03  0.00  0.02  0.93  0.01  0.00
Wave1   0.00  0.01  0.00  0.00  0.00  0.00  0.00  0.02  0.94  0.03
Wave2   0.01  0.00  0.00  0.00  0.00  0.00  0.00  0.01  0.02  0.96

b) KTH dataset

          Boxing  Clapping  Waving  Jogging  Running  Walking
Boxing    0.86    0.04      0.07    0.02     0.00     0.01
Clapping  0.05    0.84      0.07    0.02     0.02     0.00
Waving    0.06    0.11      0.81    0.01     0.00     0.01
Jogging   0.03    0.01      0.00    0.79     0.11     0.06
Running   0.02    0.03      0.00    0.12     0.75     0.08
Walking   0.03    0.00      0.02    0.07     0.06     0.82

Video matching and classification were performed using KTH and Weizmann, which contain single-person, single-activity videos. We used them to compare with the current state-of-the-art, even though they were collected in controlled environments. For cross-dataset action recognition, we used the KTH dataset as the query set, while the target videos were selected from the more challenging MSR II dataset. Our experiments demonstrate the effectiveness of our hierarchical codebook method for action recognition in these various categories. In all cases, we have assumed that local video volumes are of size n_x = n_y = n_t = 5, and the HOG is calculated assuming n_\theta = 16 and n_\phi = 8. The ensemble size was set to r_x = r_y = r_t = 50. The number of codewords in the low- and high-level codebooks was set to 55 and 120, respectively.¹³ Later in this section we thoroughly examine the effect of different parameters on the performance of the algorithm.
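For reference, the parameter values quoted above can be collected into a single configuration object; this is just a convenient restatement of the settings listed in the text, with hypothetical field names.

from dataclasses import dataclass

@dataclass
class ExperimentConfig:
    """Parameter settings reported in Section 5 (field names are illustrative)."""
    stv_size: tuple = (5, 5, 5)          # n_x = n_y = n_t = 5
    n_theta: int = 16                    # orientation bins for theta
    n_phi: int = 8                       # orientation bins for phi
    ensemble_size: tuple = (50, 50, 50)  # r_x = r_y = r_t = 50
    low_level_codewords: int = 55        # M_L
    high_level_codewords: int = 120      # M_H
    detection_threshold: float = 0.7     # gamma (fraction of max similarity)

print(ExperimentConfig())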

5.1. Action matching and retrieval using a single example

Since our proposed method is a video-to-video matching framework, it is not necessary to have a training sequence. This means that we can select one labeled query video for each action and find the most similar video to it in order to perform the labeling. For the Weizmann dataset, we used one person for each action as a query video and the rest (eight other persons) as the target set. This was done for all persons in the dataset and the results were averaged. The confusion matrix for the Weizmann dataset is shown in Fig. 6a, achieving an average recognition rate of 91.9% over all 10 actions. The columns of the confusion matrix represent the instances to be classified, while each row indicates the corresponding classification results.
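A minimal sketch of this single-example protocol is given below: one clip per action acts as the query, every target clip is labeled by its most similar query, and the counts are accumulated into a row-normalized confusion matrix. The `video_similarity` argument is a hypothetical stand-in for the hierarchical codebook matching score; averaging over the choice of query person is done by repeating the procedure.

```python
# A sketch of single-example matching and confusion-matrix bookkeeping.
import numpy as np

def confusion_matrix(queries, targets, labels, video_similarity):
    """queries: {action: query_video}; targets: list of (video, true_action)."""
    idx = {a: i for i, a in enumerate(labels)}
    C = np.zeros((len(labels), len(labels)))
    for video, true_action in targets:
        scores = {a: video_similarity(q, video) for a, q in queries.items()}
        predicted = max(scores, key=scores.get)       # label of most similar query
        C[idx[true_action], idx[predicted]] += 1
    # normalize so each row sums to one
    return C / np.maximum(C.sum(axis=1, keepdims=True), 1)
```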

We carried out the same experiment on the KTH dataset. The confusion matrix is shown in Fig. 6b. The average recognition rate was 81.2% over all 6 actions. The results indicate that the method proposed in this paper outperforms state-of-the-art approaches, even though it requires no background/foreground segmentation and tracking. The average accuracy of the other methods is presented in Table 1.

The overall results on the Weizmann dataset are better than those on the KTH dataset. This is predictable, since the Weizmann dataset contains videos with more static backgrounds and more stable and discriminative actions than the KTH dataset.

In order to measure the capabilities of our approach in dealing with scale and illumination variations, we report the average recognition rate for the different recording scenarios in the KTH dataset. According to [12], KTH contains four different recording conditions, which are: s1) outdoors; s2) outdoors with scale variations; s3) outdoors with different clothes; and s4) indoors. The evaluation procedure employed here is to construct four sets of target videos, each having been obtained under the same recording condition. Then, a query is selected from one of these four scenarios and the most similar video to the query is found in each target dataset in order to perform the labeling. The average recognition rates are reported in Table 2. When the target and query videos are selected from the same subset of videos with the same recording conditions, the average recognition rate is higher than when they are taken under different recording conditions. Moreover, although we have claimed that our method is scale- and illumination-invariant, it appears that, in these experiments, the recognition rate decreases when the query and target videos have been taken under different recording conditions. This is particularly evident when the target videos are recorded at different scales (see the second column of Table 2). Thus scale and clothing variations degrade the performance of our algorithm more than changes in illumination. Therefore, as we might have expected, an activity model constructed using just a single example cannot adequately account for all scale/illumination variations in a scene.

5.2. Single dataset action classification

In order to make an additional quantitative comparison of our algorithm with the state-of-the-art, we have extended it to the action classification problem. This refers to the more classical situation in which we use a set of query videos instead of just a single one, as discussed previously. We have evaluated our algorithm's ability to apply the correct label to a given video sequence when both the training (footnote 14) and target sets are obtained from the same dataset. We tested the Weizmann and KTH datasets and applied the standard experimental procedures in the literature. For the Weizmann dataset, the common approach for classification is to use leave-one-out cross-validation, i.e., eight persons are used for training and the videos of the remaining person are matched to one of the ten possible action labels. Consistent with other methods in the literature, we mixed the four scenarios for each action in the KTH dataset. We followed the standard experimental procedure for this dataset [12], in which 16 persons are used for training and nine for testing. This is done 100 times, after which the average performance over these random splits is calculated [12].
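For reference, the split-based protocol could be scripted as follows; `classify` is a hypothetical stand-in for building the codebooks from the query (training) clips and labeling a test clip.

```python
# A sketch of the KTH evaluation protocol described above: 16 actors for
# training, 9 for testing, repeated over 100 random splits, mean accuracy reported.
import random

def mean_split_accuracy(clips_by_actor, classify, n_train=16, n_splits=100, seed=0):
    """clips_by_actor: {actor_id: [(video, action), ...]}"""
    rng = random.Random(seed)
    actors = sorted(clips_by_actor)
    accs = []
    for _ in range(n_splits):
        train_actors = set(rng.sample(actors, n_train))
        train = [c for a in train_actors for c in clips_by_actor[a]]
        test = [c for a in actors if a not in train_actors for c in clips_by_actor[a]]
        correct = sum(classify(train, video) == action for video, action in test)
        accs.append(correct / len(test))
    return sum(accs) / len(accs)
```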

13. These parameters are similar to the ones used in a related study [15].

14. Although our method does not actually require any specific training sequences, we refer to the query videos as the training set for consistency with the literature.

Table 1
Action recognition comparison with the state-of-the-art for single video action matching (percentage of the average recognition rate).

Method             KTH     Weizmann
Proposed method    81.2    91.9
Thi et al. [59]    77.17   88.6
Seo et al. [9]     69      78

Table 2
Single video action matching in the KTH dataset when target videos are limited to four subsets, each obtained under a different recording condition. The query video is selected from one of the four subsets of videos with a different recording condition. Then the most similar video from each target set is found and used as the label applied to the query (percentage of the average recognition rate).

Query \ Target    s1      s2      s3      s4
s1                88.5    71.4    82.1    83.6
s2                72.1    74.2    69.7    71.6
s3                81.9    70.5    77.1    80.6
s4                82.3    73.6    81.1    84.4


The confusion matrix for the Weizmann dataset is reported in Fig. 7a, and the average recognition rate is 98.7% over all 10 actions in the leave-one-out setting. As expected from earlier experiments reported in the literature, our results indicate that the "skip" and "jump" actions are easily confused, as they appear visually similar. For the KTH dataset, we achieved an average recognition rate of 95% for the six actions, as shown in the confusion matrix in Fig. 7b. As observed from this matrix, the primary confusion occurs between jogging and running, which was also problematic for the other approaches. Obviously, this is due to the inherent similarity between the two actions. The recognition rate was also compared to other approaches (see Table 3). Comparing our results with those of the state-of-the-art, we observe that they are similar, though again we do not require any background/foreground segmentation and tracking.

5.3. Cross-dataset action matching and retrieval

Similar to other approaches for action recognition [60], we use cross-dataset recognition to measure the robustness and generalization capabilities of our algorithm. In this paradigm, the query videos are selected from one dataset (the KTH dataset in our experiments) and the targets from another (the MSR II dataset), so that we compare similar actions performed by different persons in different environments. We selected three classes of actions from the KTH dataset as the query videos: boxing, hand waving, and hand clapping, including 25 persons performing each action. A hierarchical codebook was created for each action category and the query was matched to the target videos. We varied the detection threshold, γ, to obtain the precision/recall curves for each action type, as shown in Fig. 8. This achieved an overall recognition rate of 79.8%, which is comparable to the state-of-the-art (see Table 4).

5.4. Effect of parameter variation

As our proposed method creates two codebooks to group similar video volumes and ensembles of video volumes, it is necessary to analyze the effect of different codebook sizes on the performance of the algorithm. Therefore, the overall recognition rate for different codebook sizes was determined as described previously using the KTH dataset. Various codebook sizes (M_H and M_L) were employed and the average recognition rate calculated. In Fig. 9, the average recognition rate is plotted as a function of both the low- and high-level codebook sizes (number of codewords). We observe that small low-level codebooks will not produce acceptable results, even with a large number of high-level codewords. Therefore, preserving information at the lowest level is necessary to achieve acceptable results. Recall that we have shown in the previous section how the number of codewords affects the computational cost of our algorithm.
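The parameter study itself amounts to a simple grid sweep, sketched below; `train_and_evaluate` is a hypothetical function that builds both codebooks with the requested sizes and returns the average recognition rate, and the grid values other than 55 and 120 are purely illustrative.

```python
# A sketch of the codebook-size study: evaluate the recognition rate over a
# grid of low-level (M_L) and high-level (M_H) codebook sizes.

def codebook_size_grid(train_and_evaluate,
                       low_sizes=(10, 25, 55, 100),
                       high_sizes=(20, 60, 120, 200)):
    results = {}
    for m_low in low_sizes:
        for m_high in high_sizes:
            results[(m_low, m_high)] = train_and_evaluate(m_low, m_high)
    # e.g. pick the smallest (m_low, m_high) whose rate is within tolerance of the best
    return results
```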

Similarly, using larger high-level codebooks demands more memory and dramatically increases computational time. Therefore, the number of codewords must be kept as small as possible. Although there is a trade-off between the codebook size and the performance of the algorithm, it can be inferred from our experiments that using relatively small codebooks at both the low and high levels (e.g., M_L = 55 and M_H ≈ 120) achieves acceptable results for action recognition.


Fig. 7. Confusion matrices for action classification: a) Weizmann dataset, b) KTH dataset.

Table 3
Comparison of action recognition with the state-of-the-art (percentage of the average recognition rate). For the KTH dataset, the evaluation is made using either leave-one-out or data-split as described in the original paper [12].

Method                   Evaluation approach    KTH      Weizmann
Proposed method          Split                  95.0     98.7
Seo et al. [9]           Split                  95.1     97.5
Thi et al. [59]          Split                  94.67    98.9
Tian et al. [60]         Split                  94.5     –
Liu et al. [42]          Leave one out          94.2     –
Zhang et al. [43]        Split                  94.0     –
Wang et al. [36]         Split                  93.8     –
Yao et al. [28]          Split                  93.5     97.8
Bregonzio et al. [31]    Leave one out          93.17    96.6
Ryoo et al. [44]         Split                  91.1     –
Yu et al. [45]           Leave one out          95.67    –
Mikolajczyk et al. [8]   Split                  95.3     –
Jiang et al. [27]        Leave one out          95.77


Fig. 8. The precision–recall curves for cross-dataset action recognition. The query videos are selected from the KTH dataset and the targets from the MSR II dataset. Three activities were selected for classification: boxing, hand waving, and hand clapping.
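The curves in Fig. 8 are traced by sweeping the detection threshold γ over per-video similarity scores; a sketch of that bookkeeping is given below, assuming the scores and ground-truth labels are available from the matching stage.

```python
# A sketch of producing a precision/recall curve by sweeping the threshold gamma.
import numpy as np

def precision_recall_curve(scores, is_positive, gammas):
    """scores: similarity of each target video to the query action;
    is_positive: True where the target truly contains that action."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    precision, recall = [], []
    for g in gammas:
        detected = scores >= g
        tp = np.sum(detected & is_positive)
        precision.append(tp / max(np.sum(detected), 1))
        recall.append(tp / max(np.sum(is_positive), 1))
    return np.array(precision), np.array(recall)
```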


6. Conclusion and future work

We have presented a new hierarchical approach based on spatio-temporal volumes for the challenging problem of video-to-video matching, and tested it on the problem of human action recognition in videos. At the lowest level in the data hierarchy, our approach is an extension of conventional BOV approaches. However, this is only the bottom level of a more descriptive data hierarchy that is based on representing a video by compositional contextual data. The hierarchical structure consists of three main levels (a simplified code sketch follows the list):

• Densely sampling and coding a video using spatio-temporal volumes to produce a low-level codebook. This codebook is similar to the one constructed in conventional BOV approaches.

• Constructing ensembles of video volumes and representing their structure using probabilistic modeling of the compositions of the spatio-temporal volumes. This is followed by the construction of a high-level codebook for the volume ensembles.

• Analyzing the codewords assigned to each pixel as a function of time in order to determine salient regions.
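The following self-contained sketch mirrors the data flow of these three levels. The descriptor (simple mean/std statistics), the clustering routine (a tiny k-means), and the histogram-based ensemble representation are deliberate simplifications; only the overall structure follows the description above, and none of the names correspond to the authors' implementation.

```python
# A simplified skeleton of the three-level hierarchy described above.
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Tiny k-means, used here as a stand-in for codebook construction."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        assign = np.argmin(((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centres[j] = X[assign == j].mean(axis=0)
    return centres

def volume_descriptors(video, n=5, step=5):
    """Level 1: densely sampled n x n x n volumes described by mean/std
    statistics (a placeholder for the HOG descriptor)."""
    T, H, W = video.shape
    feats, positions = [], []
    for t in range(0, T - n + 1, step):
        for y in range(0, H - n + 1, step):
            for x in range(0, W - n + 1, step):
                v = video[t:t+n, y:y+n, x:x+n]
                feats.append([v.mean(), v.std()])
                positions.append([t, y, x])
    return np.array(feats), np.array(positions)

def build_model(video, m_low=55, m_high=120, r=50):
    feats, pos = volume_descriptors(video)
    low_cb = kmeans(feats, min(m_low, len(feats)))              # low-level codebook
    words = np.argmin(((feats[:, None, :] - low_cb[None, :, :]) ** 2).sum(-1), axis=1)

    # Level 2: an ensemble = histogram of codewords inside an r x r x r region,
    # a crude proxy for the probabilistic composition model.
    cells = pos // r
    ensembles = []
    for c in np.unique(cells, axis=0):
        members = words[np.all(cells == c, axis=1)]
        ensembles.append(np.bincount(members, minlength=len(low_cb)))
    ensembles = np.array(ensembles, dtype=float)
    high_cb = kmeans(ensembles, min(m_high, len(ensembles)))    # high-level codebook

    # Level 3: codeword assignments over time would then be compared against a
    # target video to build the per-pixel similarity map.
    return low_cb, high_cb
```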

Given a single query video (an example of a particular activity), the method computes the similarity of each pixel in each frame of the target videos to the query, and finds the subset of target videos that are similar to that query. This is accomplished by analyzing a relatively large contextual region around the pixel, while considering the compositional structure using a probabilistic framework. The algorithm was tested on three popular benchmarks: KTH, Weizmann, and MSR II. We showed that it is effective and robust for both action matching and cross-dataset recognition. Moreover, the results are highly competitive with state-of-the-art methods. Furthermore, a major advantage of our approach is that it does not require background and foreground segmentation and tracking, and is amenable to on-line, real-time analysis. The proposed method can easily be extended to multi-action retrieval and action localization by modifying the inference mechanism. Since the proposed method codes the video using spatio-temporal video volumes and their compositional information, it does not impose any constraints on the video contents and therefore can be extended to unconstrained video matching and content-based search engines. One of the major advantages of the proposed algorithm for event recognition in videos is that it does not require a model of the event. However, it does have some drawbacks that need to be addressed in future work. Clearly, such a video representation of activities in a scene cannot be applied to long-term behavior understanding, e.g., behaviors that consist of a number of activities that occur sequentially. Some form of event segmentation might deal with this issue. Future research will extend the approach by adding another level of analysis to the hierarchical structure, which models the spatial and temporal connectivity of the learnt activities.

References

[1] R. Poppe, A survey on vision-based human action recognition, Image Vision Comput. 28 (6) (2010) 976–990.
[2] P. Turaga, R. Chellappa, V.S. Subrahmanian, O. Udrea, Machine recognition of human activities: a survey, IEEE Trans. Circuits Syst. Video Technol. 18 (11) (2008) 1473–1488.
[3] J.C. Niebles, H.C. Wang, L. Fei-Fei, Unsupervised learning of human action categories using spatial-temporal words, Int. J. Comput. Vision 79 (3) (2008) 299–318.
[4] S. Savarese, A. DelPozo, J.C. Niebles, F.-F. Li, Spatial–temporal correlations for unsupervised action classification, WMVC, 2008, pp. 1–8.
[5] L. Wang, L. Cheng, Elastic sequence correlation for human action analysis, IEEE Trans. Image Process. 20 (6) (2011) 1725–1738.
[6] D. Weinland, R. Ronfard, E. Boyer, A survey of vision-based methods for action representation, segmentation and recognition, Comput. Vision Image Underst. 115 (2) (2011) 224–241.
[7] O. Boiman, M. Irani, Detecting irregularities in images and in video, Int. J. Comput. Vision 74 (1) (2007) 17–31.
[8] K. Mikolajczyk, H. Uemura, Action recognition with appearance–motion features and fast search trees, Comput. Vision Image Underst. 115 (3) (2011) 426–438.
[9] H. Seo, P. Milanfar, Action recognition from one example, IEEE Trans. Pattern Anal. Mach. Intell. 33 (5) (2011) 867–882.
[10] K.G. Derpanis, M. Sizintsev, K. Cannons, R.P. Wildes, Efficient action spotting based on a spacetime oriented structure representation, Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, 2010, pp. 1990–1997.
[11] A. Oikonomopoulos, I. Patras, M. Pantic, Spatiotemporal localization and categorization of human actions in unsegmented image sequences, IEEE Trans. Image Process. 20 (4) (2011) 1126–1140.
[12] C. Schuldt, I. Laptev, B. Caputo, Recognizing human actions: a local SVM approach, ICPR, vol. 3, 2004, pp. 32–36.
[13] L. Gorelick, M. Blank, E. Shechtman, M. Irani, R. Basri, Actions as space-time shapes, IEEE Trans. Pattern Anal. Mach. Intell. 29 (12) (2007) 2247–2253.
[14] J. Yuan, Z. Liu, Y. Wu, Discriminative video pattern search for efficient action detection, IEEE Trans. Pattern Anal. Mach. Intell. 33 (9) (2011) 1728–1743.
[15] M.J. Roshtkhari, M.D. Levine, A multi-scale hierarchical codebook method for human action recognition in videos using a single example, Conference on Computer and Robot Vision (CRV), 2012, pp. 182–189.
[16] D. Ramanan, D.A. Forsyth, Automatic annotation of everyday movements, Adv. Neural Inf. Process. Syst. 16 (2004) 1547–1554.
[17] C. Rao, A. Yilmaz, M. Shah, View-invariant representation and recognition of actions, Int. J. Comput. Vision 50 (2) (2002) 203–226.
[18] F. Yuan, G.-S. Xia, H. Sahbi, V. Prinet, Mid-level features and spatio-temporal context for activity recognition, Pattern Recogn. 45 (12) (2012) 4182–4191.
[19] H. Wang, A. Klaser, C. Schmid, L. Cheng-Lin, Action recognition by dense trajectories, Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, 2011, pp. 3169–3176.
[20] H. Yang, L. Shao, F. Zheng, L. Wang, Z. Song, Recent advances and trends in visual tracking: a review, Neurocomputing 74 (18) (2011) 3823–3831.
[21] R. Messing, C. Pal, H. Kautz, Activity recognition using the velocity histories of tracked keypoints, Computer Vision, 2009 IEEE 12th International Conference on, 2009, pp. 104–111.
[22] J. Sun, X. Wu, S. Yan, L.F. Cheong, T.S. Chua, J. Li, Hierarchical spatio-temporal context modeling for action recognition, Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, 2009, pp. 2004–2011.
[23] A. Yilmaz, M. Shah, Actions sketch: a novel action representation, Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, 2005, pp. 984–989.
[24] E. Shechtman, M. Irani, Space-time behavior-based correlation-or-how to tell if two underlying motion fields are similar without computing them? IEEE Trans. Pattern Anal. Mach. Intell. 29 (11) (2007) 2045–2056.

Table 4
Percentage of the average correct recognition rate for cross-dataset action recognition over three different activities. The query and the target videos are selected from the KTH and MSR II datasets, respectively.

Method              Accuracy (%)
Proposed method     79.8
Tian et al. [60]    78.8
Yuan et al. [37]    59.6


Fig. 9. Effect of different codebook sizes for both low- and high-level codebooks. The average recognition rate is calculated for different codebook sizes on the KTH dataset.


[25] A.A. Efros, A.C. Berg, G. Mori, J. Malik, Recognizing action at a distance, Computer Vision (ICCV), IEEE International Conference on, 2003, pp. 726–733.
[26] Y. Ke, R. Sukthankar, M. Hebert, Volumetric features for video event detection, Int. J. Comput. Vision 88 (3) (2010) 339–362.
[27] Z. Jiang, L. Zhe, L.S. Davis, Recognizing human actions by learning and matching shape–motion prototype trees, IEEE Trans. Pattern Anal. Mach. Intell. 34 (3) (2012) 533–547.
[28] A. Yao, J. Gall, L. Van Gool, A Hough transform-based voting framework for action recognition, Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, 2010, pp. 2061–2068.
[29] S. Sadanand, J.J. Corso, Action bank: a high-level representation of activity in video, Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 2012, pp. 1234–1241.
[30] S. Khamis, V.I. Morariu, L.S. Davis, A flow model for joint action recognition and identity maintenance, Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 2012, pp. 1218–1225.
[31] M. Bregonzio, G. Shaogang, X. Tao, Recognising action as clouds of space–time interest points, Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, 2009, pp. 1948–1955.
[32] B. Chakraborty, M.B. Holte, T.B. Moeslund, J. Gonzalez, Selective spatio-temporal interest points, Comput. Vision Image Underst. 116 (3) (2011) 396–410.
[33] A. Kovashka, K. Grauman, Learning a hierarchy of discriminative space–time neighborhood features for human action recognition, Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, 2010, pp. 2046–2053.
[34] G. Yu, J. Yuan, Z. Liu, Unsupervised random forest indexing for fast action search, Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, 2011, pp. 865–872.
[35] D. Han, L. Bo, C. Sminchisescu, Selection and context for action recognition, Computer Vision (ICCV), IEEE International Conference on, 2009, pp. 1933–1940.
[36] J. Wang, Z. Chen, Y. Wu, Action recognition with multiscale spatio-temporal contexts, Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, 2011, pp. 3185–3192.
[37] J. Yuan, Z. Liu, Y. Wu, Discriminative subvolume search for efficient action detection, Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, 2009, pp. 2442–2449.
[38] L. Kratz, K. Nishino, Anomaly detection in extremely crowded scenes using spatio-temporal motion pattern models, Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, 2009, pp. 1446–1453.
[39] H. Wang, M.M. Ullah, A. Klaser, I. Laptev, C. Schmid, Evaluation of local spatio-temporal features for action recognition, BMVC, 2009.
[40] O. Boiman, E. Shechtman, M. Irani, In defense of Nearest-Neighbor based image classification, Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, 2008, pp. 1992–1999.
[41] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, vol. 2, 2006, pp. 2169–2178.
[42] J. Liu, M. Shah, Learning human actions via information maximization, Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, 2008, pp. 1–8.
[43] Y. Zhang, X. Liu, M.-C. Chang, W. Ge, T. Chen, Spatio-temporal phrases for activity recognition, European Conference on Computer Vision (ECCV), vol. 7574, Springer, Berlin/Heidelberg, 2012, pp. 707–721.
[44] M.S. Ryoo, J.K. Aggarwal, Spatio-temporal relationship match: video structure comparison for recognition of complex human activities, Computer Vision (ICCV), IEEE International Conference on, 2009, pp. 1593–1600.
[45] T.-H. Yu, T.-K. Kim, R. Cipolla, Real-time action recognition by spatiotemporal semantic and structural forests, Proceedings of the British Machine Vision Conference, 2010, p. 56.
[46] A. Gilbert, J. Illingworth, R. Bowden, Action recognition using mined hierarchical compound features, IEEE Trans. Pattern Anal. Mach. Intell. 33 (99) (2011) 883–897.
[47] M. Marszałek, C. Schmid, Spatial weighting for bag-of-features, Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, vol. 2, 2006, pp. 2118–2125.
[48] A. Gilbert, J. Illingworth, R. Bowden, Scale invariant action recognition using compound features mined from dense spatio-temporal corners, European Conference on Computer Vision (ECCV), Springer-Verlag, 2008, pp. 222–233.
[49] M. Bertini, A. Del Bimbo, L. Seidenari, Multi-scale and real-time non-parametric approach for anomaly detection and localization, Comput. Vision Image Underst. 116 (3) (2012) 320–329.
[50] P. Scovanner, S. Ali, M. Shah, A 3-dimensional SIFT descriptor and its application to action recognition, International Conference on Multimedia, 2007, pp. 357–360.
[51] M.J. Roshtkhari, M.D. Levine, Online dominant and anomalous behavior detection in videos, Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, 2013, pp. 2609–2616.
[52] H. Zhong, J. Shi, M. Visontai, Detecting unusual activity in video, Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, vol. 2, 2004, pp. 819–826.
[53] U. Von Luxburg, A tutorial on spectral clustering, Stat. Comput. 17 (4) (2007) 395–416.
[54] A.Y. Ng, M.I. Jordan, Y. Weiss, On spectral clustering: analysis and an algorithm, Adv. Neural Inf. Process. Syst. 14 (2002) 849–856.
[55] C.M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer, New York, 2006.
[56] P. Hore, L. Hall, D. Goldgof, Y. Gu, A. Maudsley, A. Darkazanli, A scalable framework for segmenting magnetic resonance images, J. Signal Proc. Syst. 54 (1) (2009) 183–203.
[57] K. Kim, T.H. Chalidabhongse, D. Harwood, L. Davis, Real-time foreground–background segmentation using codebook model, Real-Time Imaging 11 (3) (2005) 172–185.
[58] A. Mittal, A. Monnet, N. Paragios, Scene modeling and change detection in dynamic scenes: a subspace approach, Comput. Vision Image Underst. 113 (1) (2009) 63–79.
[59] T.H. Thi, L. Cheng, J. Zhang, L. Wang, S. Satoh, Integrating local action elements for action analysis, Comput. Vision Image Underst. 116 (3) (2012) 378–395.
[60] Y. Tian, L. Cao, Z. Liu, Z. Zhang, Hierarchical filtered motion for action recognition in crowded videos, IEEE Trans. Syst. Man Cybern. 42 (3) (2012) 313–323.
[61] M. Javan Roshtkhari, M.D. Levine, An on-line, real-time learning method for detecting anomalies in videos using spatio-temporal compositions, Comput. Vision Image Underst. 117 (10) (2013) 1436–1452.

