
Seeing What You're Told: Sentence-Guided Activity Recognition In Video

N. Siddharth
Stanford University

Andrei Barbu
Massachusetts Institute of Technology

Jeffrey Mark Siskind
Purdue University

Abstract

We present a system that demonstrates how the compositional structure of events, in concert with the compositional structure of language, can interplay with the underlying focusing mechanisms in video action recognition, providing a medium for top-down and bottom-up integration as well as multi-modal integration between vision and language. We show how the roles played by participants (nouns), their characteristics (adjectives), the actions performed (verbs), the manner of such actions (adverbs), and changing spatial relations between participants (prepositions), in the form of whole-sentence descriptions mediated by a grammar, guide the activity-recognition process. Further, the utility and expressiveness of our framework is demonstrated by performing three separate tasks in the domain of multi-activity video: sentence-guided focus of attention, generation of sentential description, and query-based search, simply by leveraging the framework in different manners.

1. Introduction

The ability to describe the observed world in natural language is a quintessential component of human intelligence.

A particular feature of this ability is the use of rich sentences, involving the composition of multiple nouns, adjectives, verbs, adverbs, and prepositions, to describe not just static objects and scenes, but also events that unfold over time. Furthermore, this ability appears to be learned by virtually all children. The deep semantic information learned is multi-purpose: it supports comprehension, generation, and inference. In this work, we investigate the intuition, and the precise means and mechanisms, that enable us to support this ability in the domain of activity recognition in multi-activity video.

Suppose we wanted to recognize an occurrence of an event described by the sentence The ball bounced in a video clip. Nominally, we would need to detect the ball and its position in the field of view in each frame and determine that the sequence of such detections satisfied the requirements of bounce. The sequence of such detections and their corresponding positions over time constitutes a track for that object. Here, the semantics of an intransitive verb like bounce would be formulated as a unary predicate over object tracks. Recognizing occurrences of events described by sentences containing transitive verbs, like The person approached the ball, would require detecting and tracking two objects, the person and the ball, constrained by a binary predicate.

In an ideal world, event recognition would proceed in a purely feed-forward fashion: robust and unambiguous object detection and tracking followed by application of the semantic predicates on the recovered tracks. However, the current state of the art in computer vision is far from this ideal. Object detection alone is highly unreliable. The best current average-precision scores on PASCAL VOC hover around 40%-50% [3]. As a result, object detectors suffer from both false positives and false negatives. One way around this is to use detection-based tracking [17], where one biases the detector to overgenerate, alleviating the problem of false negatives, and uses a different mechanism to select among the overgenerated detections to alleviate the problem of false positives. One such mechanism selects detections that are temporally coherent, i.e. the track motion being consistent with optical flow. Barbu et al. [2] proposed an alternate mechanism that selected detections for a track that satisfied a unary predicate such as one would construct for an intransitive verb like bounce. We significantly extend that approach, selecting detections for multiple tracks that collectively satisfy a complex multi-argument predicate representing the semantics of an entire sentence. That predicate is constructed as a conjunction of predicates representing the semantics of individual words in that sentence. For example, given the sentence The person to the left of the chair approached the trash can, we construct the logical form

PERSON(P) ∧ TO-THE-LEFT-OF(P, Q) ∧ CHAIR(Q) ∧ APPROACH(P, R) ∧ TRASHCAN(R)

Our tracker is able to simultaneously construct three tracks P, Q, and R, selecting out detections for each, in an optimal fashion that simultaneously optimizes a joint measure of detection score and temporal coherence while also satisfying the above conjunction of predicates. We obtain the aforementioned detections by employing a state-of-the-art object detector [5], where we train a model for each object (e.g. person, chair, etc.), which, when applied to an image, produces axis-aligned bounding rectangles with associated scores indicating strength of detection.
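To make the shape of this intermediate representation concrete, here is a minimal sketch, assuming the logical form is held as a list of (predicate name, argument list) pairs over track variables; the data structures and helper below are illustrative, not the authors' implementation.

```python
# Illustrative sketch: a sentence's semantics as a conjunction of predicates over
# track variables P, Q, R. Names and structure are assumptions, not the paper's code.

logical_form = [
    ("PERSON",         ["P"]),
    ("TO-THE-LEFT-OF", ["P", "Q"]),
    ("CHAIR",          ["Q"]),
    ("APPROACH",       ["P", "R"]),
    ("TRASHCAN",       ["R"]),
]

def holds(conjunction, predicates, assignment):
    """True iff every conjunct is satisfied by the tracks bound to its variables.

    predicates: maps a predicate name to a boolean function over tracks;
    assignment: maps a track variable ("P", "Q", "R") to a concrete track.
    """
    return all(predicates[name](*[assignment[v] for v in args])
               for name, args in conjunction)
```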

We represent the semantics of lexical items like person, to the left of, chair, approach, and trash can with predicates over tracks like PERSON(P), TO-THE-LEFT-OF(P, Q), CHAIR(Q), APPROACH(P, R), and TRASHCAN(R). These predicates are in turn represented as regular expressions (i.e. finite-state recognizers or FSMs) over features extracted from the sequence of detection positions, shapes, and sizes as well as their temporal derivatives. For example, the predicate TO-THE-LEFT-OF(P, Q) might be a single-state FSM where, on a frame-by-frame basis, the centers of the detections for P are constrained to have a lower x-coordinate than the centers of the detections for Q. The actual formulation of the predicates (Table 2) is more complex as it must deal with noise and variance in real-world video. What is central is that the semantics of all parts of speech, namely nouns, adjectives, verbs, adverbs, and prepositions (both those that describe spatial relations and those that describe motion), is uniformly represented by the same mechanism: predicates over tracks formulated as finite-state recognizers over features extracted from the detections in those tracks.
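As an illustration of how simple such a per-frame predicate can be, the following sketch checks the left-of constraint frame by frame over two tracks; the assumed data layout (a per-detection center x-coordinate "cx") is hypothetical, and the actual formulation in Table 2 additionally handles noise.

```python
# A minimal sketch of TO-THE-LEFT-OF as a single-state recognizer; each track is
# assumed to be a list of detections carrying the x-coordinate of its center ("cx").

def to_the_left_of(track_p, track_q):
    # Accept only if, in every frame, P's center lies strictly left of Q's center.
    return all(p["cx"] < q["cx"] for p, q in zip(track_p, track_q))

track_p = [{"cx": 100}, {"cx": 105}, {"cx": 110}]
track_q = [{"cx": 300}, {"cx": 298}, {"cx": 301}]
assert to_the_left_of(track_p, track_q)
```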

We refer to this capacity as the Sentence Tracker, a function S : (B, s, Λ) → (τ, J), that takes, as input, an overgenerated set B of detections along with a sentence s and a lexicon Λ and produces a score τ together with a set J of tracks that satisfy s while optimizing a linear combination of detection scores and temporal coherence. This can be used for three distinct purposes as shown in section 4:

focus of attention: One can apply the sentence tracker to the same video clip B, which depicts multiple simultaneous events taking place in the field of view with different participants, with two different sentences s1 and s2. In other words, one can compute (τ1, J1) = S(B, s1, Λ) and (τ2, J2) = S(B, s2, Λ) to yield two different sets of tracks J1 and J2 corresponding to the different sets of participants in the different events described by s1 and s2.

generation: One can take a video clip B as input and systematically search the space of all possible sentences s that can be generated by a context-free grammar and find that sentence s* for which (τ*, J*) = S(B, s*, Λ) yields the maximal τ*. This can be used to generate a sentence that describes an input video clip B.

retrieval: One can take a collection B = {B1, ..., BM} of video clips (or a single long video chopped into short clips) along with a sentential query s, compute (τi, Ji) = S(Bi, s, Λ) for each Bi, and find the clip Bi with maximal score τi. This can be used to perform sentence-based video search.
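The three uses differ only in what is held fixed and what is searched over. A minimal sketch, assuming the sentence tracker is available as a callable S returning (score, tracks), is:

```python
# Sketch of the three uses of S : (B, s, Λ) -> (τ, J); S is passed in as a callable
# returning (score, tracks), and all names here are illustrative.

def focus_of_attention(S, B, s1, s2, lexicon):
    (_, J1) = S(B, s1, lexicon)
    (_, J2) = S(B, s2, lexicon)
    return J1, J2                 # one track set per described event

def generate(S, B, lexicon, candidate_sentences):
    # In practice the candidate set is explored by beam search (section 4.2).
    return max(candidate_sentences, key=lambda s: S(B, s, lexicon)[0])

def retrieve(S, clips, query, lexicon):
    return max(clips, key=lambda B: S(B, query, lexicon)[0])
```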

(Prior work [19] showed how one can take a training set {(B1, s1), ..., (BM, sM)} of video-sentence pairs, where the word meanings Λ are unknown, and compute the lexicon Λ* which maximizes the sum τ1 + ··· + τM computed from (τ1, J1) = S(B1, s1, Λ*), ..., (τM, JM) = S(BM, sM, Λ*).) However, we first present the two central algorithmic contributions of this work. In section 2 we present the details of the sentence tracker, the mechanism for efficiently constraining several parallel detection-based trackers, one for each participant, with a conjunction of finite-state recognizers. In section 3 we present lexical semantics for a small vocabulary of 17 lexical items (5 nouns, 2 adjectives, 4 verbs, 2 adverbs, 2 spatial-relation prepositions, and 2 motion prepositions), all formulated as finite-state recognizers over features extracted from detections produced by an object detector, together with compositional semantics that maps a sentence to a semantic formula constructed from these finite-state recognizers where the object tracks are assigned to arguments of these recognizers.

2. The Sentence Tracker

Barbu et al. [2] address the issue of selecting detections for a track that simultaneously satisfies a temporal-coherence measure and a single predicate corresponding to an intransitive verb such as bounce. Doing so constitutes the integration of top-down high-level information, in the form of an event model, with bottom-up low-level information in the form of object detectors. We provide a short review of the relevant material in that work to introduce notation and provide the basis for our exposition of the sentence tracker.

\max_{j^1,\ldots,j^T} \left[ \sum_{t=1}^{T} f\left(b^t_{j^t}\right) + \sum_{t=2}^{T} g\left(b^{t-1}_{j^{t-1}}, b^t_{j^t}\right) \right] \qquad (1)

The first component is a detection-based tracker. For a given video clip with T frames, let j be the index of a detection and b^t_j be a particular detection in frame t with score f(b^t_j). A sequence j^1, ..., j^T of detection indices, one for each frame t, denotes a track comprising detections b^t_{j^t}. We seek a track that maximizes a linear combination of aggregate detection score, summing f(b^t_{j^t}) over all frames, and a measure of temporal coherence, as formulated in Eq. 1. The temporal-coherence measure aggregates a local measure g computed between pairs of adjacent frames, taken to be the negative Euclidean distance between the center of b^t_{j^t} and the forward-projected center of b^{t-1}_{j^{t-1}} computed with optical flow. Eq. 1 can be computed in polynomial time using dynamic programming with the Viterbi [15] algorithm. It does so by forming a lattice, whose rows are indexed by j and whose columns are indexed by t, where the node at row j and column t is the detection b^t_j. Finding a track thus reduces to finding a path through this lattice.
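A minimal sketch of this dynamic program is given below; it assumes per-frame lists of detections together with a detection score f and a pairwise coherence score g supplied as functions, and is an illustration of Eq. 1 rather than the authors' implementation.

```python
# Viterbi over the tracking lattice of Eq. 1: rows are detections, columns are frames.

def best_track(detections, f, g):
    """detections: list over frames; detections[t] is a list of detection objects.
    Returns (best score, list of chosen detection indices, one per frame)."""
    T = len(detections)
    # score[j] = best value of a track ending at detection j of the current frame
    score = [f(b) for b in detections[0]]
    back = []                                    # backpointers per frame
    for t in range(1, T):
        new_score, ptr = [], []
        for b in detections[t]:
            # choose the best predecessor under detection score plus coherence g
            j_prev = max(range(len(detections[t - 1])),
                         key=lambda j: score[j] + g(detections[t - 1][j], b))
            new_score.append(score[j_prev] + g(detections[t - 1][j_prev], b) + f(b))
            ptr.append(j_prev)
        score, back = new_score, back + [ptr]
    # trace back the optimal path
    j = max(range(len(score)), key=lambda i: score[i])
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    return max(score), list(reversed(path))
```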

\max_{k^1,\ldots,k^T} \left[ \sum_{t=1}^{T} h\left(k^t, b^t_{\hat{j}^t}\right) + \sum_{t=2}^{T} a\left(k^{t-1}, k^t\right) \right] \qquad (2)

The second component recognizes events with hidden Markov models (HMMs), by finding a MAP estimate of an event model given a track. This is computed as shown in Eq. 2, where k^t denotes the state for frame t, h(k, b) denotes the log probability of generating a detection b conditioned on being in state k, a(k', k) denotes the log probability of transitioning from state k' to k, and \hat{j}^t denotes the index of the detection produced by the tracker in frame t. This can also be computed in polynomial time using the Viterbi algorithm. Doing so induces a lattice, whose rows are indexed by k and whose columns are indexed by t.

Figure 1. The cross-product lattice used by the sentence tracker, consisting of L tracking lattices and W event-model lattices.

The two components, detection-based tracking and event recognition, can be merged by combining the cost functions from Eq. 1 and Eq. 2 to yield a unified cost function

\max_{\substack{j^1,\ldots,j^T \\ k^1,\ldots,k^T}} \left[ \sum_{t=1}^{T} f\left(b^t_{j^t}\right) + \sum_{t=2}^{T} g\left(b^{t-1}_{j^{t-1}}, b^t_{j^t}\right) + \sum_{t=1}^{T} h\left(k^t, b^t_{j^t}\right) + \sum_{t=2}^{T} a\left(k^{t-1}, k^t\right) \right]

that computes the joint MAP estimate of the best possible track and the best possible state sequence. This is done by replacing the \hat{j}^t in Eq. 2 with j^t, allowing the joint maximization over detection and state sequences. This too can be computed in polynomial time with the Viterbi algorithm, finding the optimal path through a cross-product lattice where each node represents a detection paired with an event-model state. This formulation combines a single tracker lattice with a single event model, constraining the detection-based tracker to find a track that is not only temporally coherent but also satisfies the event model. This can be used to select, from a video clip that contains multiple balls, the ball track that exhibits the motion characteristics of an intransitive verb such as bounce.
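The following sketch extends the tracking Viterbi above to the cross-product lattice, where each node pairs a detection index with an event-model state; the function signatures are assumptions chosen for illustration, not the paper's interface.

```python
# Joint Viterbi over (detection, event-model state) pairs, i.e. the unified cost above.
import itertools

def best_track_and_states(detections, f, g, h, a, K):
    """detections[t] is the list of detections in frame t; h(k, b) and a(k_prev, k)
    are log probabilities; K is the number of event-model states."""
    T = len(detections)
    nodes = lambda t: list(itertools.product(range(len(detections[t])), range(K)))
    score = {(j, k): f(detections[0][j]) + h(k, detections[0][j]) for j, k in nodes(0)}
    back = []
    for t in range(1, T):
        new_score, ptr = {}, {}
        for j, k in nodes(t):
            b = detections[t][j]
            prev = max(score, key=lambda p: score[p]
                       + g(detections[t - 1][p[0]], b) + a(p[1], k))
            new_score[(j, k)] = (score[prev] + g(detections[t - 1][prev[0]], b)
                                 + a(prev[1], k) + f(b) + h(k, b))
            ptr[(j, k)] = prev
        score, back = new_score, back + [ptr]
    node = max(score, key=score.get)
    best = score[node]
    path = [node]
    for ptr in reversed(back):
        node = ptr[node]
        path.append(node)
    return best, list(reversed(path))     # [(detection index, state), ...] per frame
```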

One would expect that encoding the semantics of a complex sentence such as The person to the right of the chair quickly carried the red object towards the trash can, which involves nouns, adjectives, verbs, adverbs, and spatial-relation and motion prepositions, would provide substantially more mutual constraint on the collection of tracks for the participants than a single intransitive verb would constrain a single track. We thus extend the approach described above by incorporating a complex multi-argument predicate that represents the semantics of an entire sentence instead of one that only represents the semantics of a single intransitive verb. This involves formulating the semantics of other parts of speech, in addition to intransitive verbs, also as HMMs. We then construct a large cross-product lattice, illustrated in Fig. 1, to support L tracks and W words. Each node in this cross-product lattice represents L detections and the states for W words. To support L tracks, we subindex each detection index j as j_l for track l. Similarly, to support W words, we subindex each state index k as k_w for word w, the number of states K for the lexical entry s_w at word w as K_{s_w}, and the HMM parameters h and a for the lexical entry s_w at word w as h_{s_w} and a_{s_w}. The argument-to-track mapping θ^i_w specifies the track that fills argument i of word w, where I_{s_w} specifies the arity, the number of arguments, of the lexical entry s_w at word w. We then seek a path through this cross-product lattice that optimizes

\max_{\substack{j^1_1,\ldots,j^T_1 \\ \cdots \\ j^1_L,\ldots,j^T_L}} \; \max_{\substack{k^1_1,\ldots,k^T_1 \\ \cdots \\ k^1_W,\ldots,k^T_W}} \left[ \sum_{l=1}^{L} \left( \sum_{t=1}^{T} f\left(b^t_{j^t_l}\right) + \sum_{t=2}^{T} g\left(b^{t-1}_{j^{t-1}_l}, b^t_{j^t_l}\right) \right) + \sum_{w=1}^{W} \left( \sum_{t=1}^{T} h_{s_w}\left(k^t_w, b^t_{j^t_{\theta^1_w}}, \ldots, b^t_{j^t_{\theta^{I_{s_w}}_w}}\right) + \sum_{t=2}^{T} a_{s_w}\left(k^{t-1}_w, k^t_w\right) \right) \right]

This can also be computed in polynomial time using the Viterbi algorithm. This describes a method by which the function S : (B, s, Λ) → (τ, J), discussed earlier, can be computed, where B is the collection of detections b^t_j and J is the collection of detection indices j^t_l.

The complexity of the sentence tracker is O(T(J^L K^W)^2) in time and O(J^L K^W) in space, where T is the number of frames in the video, W is the number of words in the sentence s, L is the number of participants, J = max{J^1, ..., J^T}, where J^t is the number of detections considered in frame t, and K = max{K_{s_1}, ..., K_{s_W}}. In practice, J ≤ 5, L ≤ 4, and K = 1 for all but verbs and motion prepositions, of which there are typically no more than three. With such, the method takes less than a second.
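As a rough illustration of these bounds (the particular numbers here are ours, not the paper's): with J = 5, L = 4, and a sentence containing two words whose recognizers have K = 3 states (all other words having K = 1), the cross-product lattice has at most J^L K^W = 5^4 · 3^2 = 5,625 nodes per frame, which is small enough for the Viterbi recurrence to run quickly on short clips.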

3. Natural-Language Semantics

The sentence tracker uniformly represents the semantics of words in all parts of speech, namely nouns, adjectives, verbs, adverbs, and prepositions (both those that describe spatial relations and those that describe motion), as HMMs. Finite-state recognizers (FSMs) are a special case of HMMs where the transition matrices a and the output models h are 0/1, which become −∞/0 in log space. Here, we formulate the semantics of a small fragment of English consisting of 17 lexical items (5 nouns, 2 adjectives, 4 verbs, 2 adverbs, 2 spatial-relation prepositions, and 2 motion prepositions), by hand, as FSMs. We do so to focus on what one can do with this approach, as discussed in section 4. It is particularly enlightening that the FSMs we use are perspicuous and clearly encode pretheoretic human intuitions about word semantics. But nothing turns on the use of hand-coded FSMs. Our framework, as described above, supports HMMs.

(a) Grammar:
S → NP VP
NP → D [A] N [PP]
D → an | the
A → blue | red
N → person | backpack | chair | trash can | object
PP → P NP
P → to the left of | to the right of
VP → V NP [Adv] [PP_M]
V → approached | carried | picked up | put down
Adv → quickly | slowly
PP_M → P_M NP
P_M → towards | away from

(b) Argument roles:
to the left of: {agent, patient, source, goal, referent}, {referent}
to the right of: {agent, patient, source, goal, referent}, {referent}
approached: {agent}, {goal}
carried: {agent}, {patient}
picked up: {agent}, {patient}
put down: {agent}, {patient}
towards: {agent, patient}, {goal}
away from: {agent, patient}, {source}
other: {agent, patient, source, goal, referent}

(c) Sentences:
1a. The backpack approached the trash can.
1b. The chair approached the trash can.
2a. The red object approached the trash can.
2b. The blue object approached the trash can.
3a. The person to the left of the trash can put down an object.
3b. The person to the right of the trash can put down an object.
4a. The person put down the trash can.
4b. The person put down the backpack.
5a. The person carried the red object.
5b. The person carried the blue object.
6a. The person picked up an object to the left of the trash can.
6b. The person picked up an object to the right of the trash can.
7a. The person picked up an object.
7b. The person put down an object.
8a. The person picked up an object quickly.
8b. The person picked up an object slowly.
9a. The person carried an object towards the trash can.
9b. The person carried an object away from the trash can.
10. The backpack approached the chair.
11. The red object approached the chair.
12. The person put down the chair.

Table 1. (a) The grammar for our lexicon of 17 lexical entries (5 nouns, 2 adjectives, 4 verbs, 2 adverbs, 2 spatial-relation prepositions, and 2 motion prepositions). Note that the grammar allows for infinite recursion. (b) Specification of the number of arguments for each word and the roles such arguments refer to. (c) A selection of sentences drawn from the grammar based on which we collected our corpus.


Nouns (e.g. person) may be represented by constructing static FSMs over discrete features, such as detector class. Adjectives (e.g. red, tall, and big) may be represented as static FSMs that describe select properties of the detections for a single participant, such as color, shape, or size, independent of other features of the overall event. Intransitive verbs (e.g. bounce) may be represented as FSMs that describe the changing motion characteristics of a single participant, such as moving downward followed by moving upward. Transitive verbs (e.g. approach) may be represented as FSMs that describe the changing relative motion characteristics of two participants, such as moving closer. Adverbs (e.g. slowly and quickly) may be represented by FSMs that describe the velocity of a single participant, independent of the direction of motion. Spatial-relation prepositions (e.g. to the left of) may be represented as static FSMs that describe the relative position of two participants. Motion prepositions (e.g. towards and away from) may be represented as FSMs that describe the changing relative position of two participants. As is often the case, even simple static properties, such as detector class, object color, shape, and size, spatial relations, and direction of motion, might hold only for a portion of an event. We handle such temporal uncertainty by incorporating garbage states into the FSMs that always accept and do not affect the scores computed. This also allows for alignment between multiple words in a temporal interval during a longer aggregate event. We formulate the FSMs for specifying the word meanings as regular expressions over predicates computed from detections. The particular set of regular expressions and associated predicates that are used in the experiments are given in Table 2. The predicates are formulated around a number of primitive functions. The function avgFlow(b) computes a vector that represents the average optical flow inside the detection b. The functions x(b), model(b), and hue(b) return the x-coordinate of the center of b, its object class, and the average hue of the pixels inside b respectively. The function fwdProj(b) displaces b by the average optical flow inside b. The functions ∠ and angleSep determine the angular component of a given vector and the angular distance between two angular arguments respectively. The function ⊥ computes a normal unit vector for a given vector. The argument v to NO-JITTER denotes a specified direction represented as a 2D unit vector in that direction. Regular expressions are formed around predicates as atoms. A given regular expression must be formed solely from output models of the same arity and denotes an FSM, i.e. an HMM with a 0/1 transition matrix and output model, which become −∞/0 in log space. We use R{n,} ≜ R ⋯ R R* (n copies of R followed by R*) to indicate that R must be repeated at least n times, and R[n,] ≜ (R [TRUE]){n,} to indicate that R must be repeated at least n times but can optionally have a single frame of noise between each repetition. This allows for some flexibility in the models.
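To make the regular-expression-over-predicates idea concrete, here is a small sketch that checks a frame sequence against a phase pattern of the form p1{n1,} p2{n2,} ..., with the optional single noise frame of R[n,] between repetitions; the data layout, helper predicates, and phase encoding are illustrative assumptions rather than the paper's code.

```python
# Sketch: accept/reject a frame sequence against a phase pattern such as
# λ_picked_up ≈ STATIONARY-CLOSE+ PICKING-UP[3,] STATIONARY-CLOSE+ .
# Each phase is (predicate, minimum repetitions, allow one noise frame between reps).

def accepts(phases, frames):
    # NFA states: (phase index, repetitions seen in that phase, just-consumed-noise flag)
    states = {(0, 0, False)}
    for frame in frames:
        nxt = set()
        for i, reps, noise in states:
            pred, n, allow_noise = phases[i]
            if pred(frame):                              # another repetition of this phase
                nxt.add((i, min(reps + 1, n), False))
            elif allow_noise and not noise and reps > 0: # absorb a single noise frame
                nxt.add((i, reps, True))
            if reps >= n and i + 1 < len(phases):        # enough repetitions: may advance
                next_pred = phases[i + 1][0]
                if next_pred(frame):
                    nxt.add((i + 1, 1, False))
        if not nxt:
            return False
        states = nxt
    last = len(phases) - 1
    return any(i == last and reps >= phases[last][1] for i, reps, _ in states)

# Hypothetical per-frame predicates over precomputed features:
stationary_close = lambda fr: fr["still"] and fr["close"]
picking_up       = lambda fr: fr["rising"]

lam_picked_up = [(stationary_close, 1, False),
                 (picking_up,       3, True),
                 (stationary_close, 1, False)]

frames = ([{"still": True,  "close": True,  "rising": False}] * 4 +
          [{"still": False, "close": True,  "rising": True}]  * 5 +
          [{"still": True,  "close": True,  "rising": False}] * 4)
assert accepts(lam_picked_up, frames)
```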

A sentence may describe an activity involving multiple tracks, where different (collections of) tracks fill the arguments of different words. This gives rise to the requirement of compositional semantics: dealing with the mappings from arguments to tracks. Argument-to-track assignment is a function Θ : s → (L, θ) that maps a sentence s to the number L of participants and the argument-to-track mapping θ^i_w. The mapping specifies which tracks fill which arguments of which words in the sentence and is mediated by a grammar and a specification of the argument arity and role types for the words in the lexicon. Given a sentence, say The person to the right of the chair picked up the backpack, along with the grammar specified in Table 1(a) and the lexicon specified in Tables 1(b) and 2, it would yield a mapping corresponding to the following formula.

PERSON(P) ∧ TO-THE-RIGHT-OF(P, Q) ∧ CHAIR(Q) ∧ PICKED-UP(P, R) ∧ BACKPACK(R)

Constants:
XBOUNDARY ≜ 300px    NEXTTO ≜ 50px    ΔSTATIC ≜ 6px    ΔJUMP ≜ 30px    ΔQUICK ≜ 80px
ΔSLOW ≜ 30px    ΔCLOSING ≜ 10px    ΔDIRECTION ≜ 30°    ΔHUE ≜ 30°

Simple predicates:
NO-JITTER(b, v) ≜ |avgFlow(b) · v| ≤ ΔJUMP
ALIKE(b1, b2) ≜ model(b1) = model(b2)
CLOSE(b1, b2) ≜ |x(b1) − x(b2)| ≤ NEXTTO
FAR(b1, b2) ≜ |x(b1) − x(b2)| ≥ XBOUNDARY
LEFT(b1, b2) ≜ 0 < x(b2) − x(b1)
RIGHT(b1, b2) ≜ 0 < x(b1) − x(b2)
HAS-COLOR(b, hue) ≜ angleSep(hue(b), hue) ≤ ΔHUE
STATIONARY(b) ≜ ‖avgFlow(b)‖ ≤ ΔSTATIC
QUICK(b) ≜ ‖avgFlow(b)‖ ≥ ΔQUICK
SLOW(b) ≜ ‖avgFlow(b)‖ ≤ ΔSLOW
PERSON(b) ≜ model(b) = person
BACKPACK(b) ≜ model(b) = backpack
CHAIR(b) ≜ model(b) = chair
TRASHCAN(b) ≜ model(b) = trashcan
BLUE(b) ≜ HAS-COLOR(b, 225°)
RED(b) ≜ HAS-COLOR(b, 0°)

Complex predicates:
STATIONARY-CLOSE(b1, b2) ≜ STATIONARY(b1) ∧ STATIONARY(b2) ∧ ¬ALIKE(b1, b2) ∧ CLOSE(b1, b2)
STATIONARY-FAR(b1, b2) ≜ STATIONARY(b1) ∧ STATIONARY(b2) ∧ ¬ALIKE(b1, b2) ∧ FAR(b1, b2)
CLOSER(b1, b2) ≜ |x(b1) − x(b2)| > |x(fwdProj(b1)) − x(b2)| + ΔCLOSING
FARTHER(b1, b2) ≜ |x(b1) − x(b2)| < |x(fwdProj(b1)) − x(b2)| + ΔCLOSING
MOVE-CLOSER(b1, b2) ≜ NO-JITTER(b1, (0,1)) ∧ NO-JITTER(b2, (0,1)) ∧ CLOSER(b1, b2)
MOVE-FARTHER(b1, b2) ≜ NO-JITTER(b1, (0,1)) ∧ NO-JITTER(b2, (0,1)) ∧ FARTHER(b1, b2)
IN-ANGLE(b, v) ≜ angleSep(∠avgFlow(b), ∠v) < ΔANGLE
IN-DIRECTION(b, v) ≜ NO-JITTER(b, ⊥(v)) ∧ ¬STATIONARY(b) ∧ IN-ANGLE(b, v)
APPROACHING(b1, b2) ≜ ¬ALIKE(b1, b2) ∧ STATIONARY(b2) ∧ MOVE-CLOSER(b1, b2)
CARRY(b1, b2, v) ≜ PERSON(b1) ∧ ¬ALIKE(b1, b2) ∧ IN-DIRECTION(b1, v) ∧ IN-DIRECTION(b2, v)
CARRYING(b1, b2) ≜ CARRY(b1, b2, (0,1)) ∨ CARRY(b1, b2, (0,−1))
DEPARTING(b1, b2) ≜ ¬ALIKE(b1, b2) ∧ STATIONARY(b2) ∧ MOVE-FARTHER(b1, b2)
PICKING-UP(b1, b2) ≜ PERSON(b1) ∧ ¬ALIKE(b1, b2) ∧ STATIONARY(b1) ∧ IN-DIRECTION(b2, (0,1))
PUTTING-DOWN(b1, b2) ≜ PERSON(b1) ∧ ¬ALIKE(b1, b2) ∧ STATIONARY(b1) ∧ IN-DIRECTION(b2, (0,−1))

Regular expressions:
λ_person ≜ PERSON+
λ_backpack ≜ BACKPACK+
λ_chair ≜ CHAIR+
λ_trash can ≜ TRASHCAN+
λ_object ≜ (BACKPACK | CHAIR | TRASHCAN)+
λ_blue ≜ BLUE+
λ_red ≜ RED+
λ_quickly ≜ TRUE+ QUICK[3,] TRUE+
λ_slowly ≜ TRUE+ SLOW[3,] TRUE+
λ_to the left of ≜ LEFT+
λ_to the right of ≜ RIGHT+
λ_approached ≜ STATIONARY-FAR+ APPROACHING[3,] STATIONARY-CLOSE+
λ_carried ≜ STATIONARY-CLOSE+ CARRYING[3,] STATIONARY-CLOSE+
λ_picked up ≜ STATIONARY-CLOSE+ PICKING-UP[3,] STATIONARY-CLOSE+
λ_put down ≜ STATIONARY-CLOSE+ PUTTING-DOWN[3,] STATIONARY-CLOSE+
λ_towards ≜ STATIONARY-FAR+ APPROACHING[3,] STATIONARY-CLOSE+
λ_away from ≜ STATIONARY-CLOSE+ DEPARTING[3,] STATIONARY-FAR+

Table 2. The finite-state recognizers corresponding to the lexicon in Table 1(a).

To do so, we first construct a parse tree of the sentence s given the grammar, using a recursive-descent parser. For each word, we then determine from the parse tree which words in the sentence are its dependents in the sense of government, and determine how many such dependents exist from the lexicon specified in Table 1(b). For example, the dependents of to the right of are determined to be person and chair, filling its first and second arguments respectively. Moreover, we determine a consistent assignment of roles, one of agent, patient, source, goal, and referent, for each participant track that fills the word arguments, from the allowed roles specified for that word and argument in the lexicon. Here, P, Q, and R are participants that play the agent, referent, and patient roles respectively.
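A minimal sketch of the argument-to-track mapping Θ is given below, assuming the parser has already produced, for each word, the list of participants filling its arguments; the data structures are illustrative, and the role sets are an abridged copy of Table 1(b).

```python
# Sketch: allocate participant variables and map word arguments to them, intersecting
# the allowed roles of Table 1(b). Data structures are illustrative, not the paper's.

LEXICON = {   # word -> allowed-role sets, one per argument (abridged from Table 1(b))
    "person":          [{"agent", "patient", "source", "goal", "referent"}],
    "chair":           [{"agent", "patient", "source", "goal", "referent"}],
    "backpack":        [{"agent", "patient", "source", "goal", "referent"}],
    "to the right of": [{"agent", "patient", "source", "goal", "referent"}, {"referent"}],
    "picked up":       [{"agent"}, {"patient"}],
}

def argument_to_track(dependencies):
    """dependencies: list of (word, [participant indices filling its arguments])."""
    theta = {}    # (word position, argument position) -> participant index
    roles = {}    # participant index -> set of roles still consistent with all words
    for w, (word, args) in enumerate(dependencies):
        for i, p in enumerate(args):
            theta[(w, i)] = p
            allowed = LEXICON[word][i]
            roles[p] = roles.get(p, set(allowed)) & allowed
    return len(roles), theta, roles

# "The person to the right of the chair picked up the backpack", with participants
# P = 0, Q = 1, R = 2 as determined by the parse (hand-coded here for illustration):
deps = [("person", [0]), ("to the right of", [0, 1]), ("chair", [1]),
        ("picked up", [0, 2]), ("backpack", [2])]
L, theta, roles = argument_to_track(deps)
assert L == 3 and roles == {0: {"agent"}, 1: {"referent"}, 2: {"patient"}}
```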

4. Experimental Evaluation

The sentence tracker supports three distinct capabilities. It can take sentences as input and focus the attention of a tracker, it can take video as input and produce sentential descriptions as output, and it can perform content-based video retrieval given a sentential input query. To evaluate these three capabilities, we filmed a corpus of 94 short video clips, of varying length, in 3 different outdoor environments. The camera was moved for each video clip so that the varying background precluded unanticipated confounds. These video clips, filmed with a variety of actors, each depicted one or more of the 21 sentences from Table 1(c). The depiction, from video clip to video clip, varied in scene layout and the actor(s) performing the event. The corpus was carefully constructed in a number of ways. First, many video clips depict more than one sentence. In particular, many video clips depict simultaneous distinct events. Second, each sentence is depicted by multiple video clips. Third, the corpus was constructed with minimal pairs: pairs of video clips whose depicted sentences differ in exactly one word. These minimal pairs are indicated as the 'a' and 'b' variants of sentences 1-9 in Table 1(c). That varying word was carefully chosen to span all parts of speech and all sentential positions: sentence 1 varies the subject noun, sentence 2 varies the subject adjective, sentence 3 varies the subject preposition, sentence 4 varies the object noun, sentence 5 varies the object adjective, sentence 6 varies the object preposition, sentence 7 varies the verb, sentence 8 varies the adverb, and sentence 9 varies the motion preposition. We filmed our own corpus as we are unaware of any existing corpora that exhibit the above properties. We annotated each of the 94 clips with ground-truth judgments for each of the 21 sentences, indicating whether the given clip depicted the given sentence. This set of 1974 judgments was used for the following analyses.

4.1. Focus of Attention

Tracking is traditionally performed using cues from motion, object detection, or manual initialization on an object of interest. However, in the case of a cluttered scene involving multiple activities occurring simultaneously, there can be many moving objects, many instances of the same object class, and perhaps even multiple simultaneously occurring instances of the same event class. This presents a significant obstacle to the efficacy of existing methods in such scenarios. To alleviate this problem, one can decide which objects to track based on which ones participate in a target event.

The sentence tracker can focus its attention on just those objects that participate in an event specified by a sentential description. Such a description can differentiate between different simultaneous events taking place between many moving objects in the scene using descriptions constructed out of a variety of parts of speech: nouns to specify object class, adjectives to specify object properties, verbs to specify events, adverbs to specify motion properties, and prepositions to specify (changing) spatial relations between objects. Furthermore, such a sentential description can even differentiate which objects to track based on the role that they play in an event: agent, patient, source, goal, or referent. Fig. 2 demonstrates this ability: different tracks are produced for the same video clip that depicts multiple simultaneous events when focused with different sentences.

We further evaluated this ability on all 9 minimal pairs, collectively applied to all 24 suitable video clips in our corpus. For 21 of these, both sentences in the minimal pair yielded tracks deemed to be correct depictions. The project website includes example video clips for all 9 minimal pairs.

4.2. Generation

Much of the prior work on generating sentences to describe images [4, 7, 8, 12, 13, 18] and video [1, 6, 9, 10, 16] uses special-purpose natural-language-generation methods. We can instead use the ability of the sentence tracker to score a sentence paired with a video clip as a general-purpose natural-language generator by searching for the highest-scoring sentence for a given video clip. However, this has a problem. Scores decrease with longer word sequences and the greater numbers of tracks that result from such. This is because both f and g are mapped to log space, i.e. (−∞, 0], via sigmoids, to match h and a, which are log probabilities. So we don't actually search for the highest-scoring sentence, which would bias the process towards short sentences. Instead, we seek complex sentences that are true of the video clip, as they are more informative.

Nominally, this search process would be intractable since the space of possible sentences can be huge and even infinite. However, we can use beam search to get an approximate answer. This is possible because the sentence tracker can score any word sequence, not just complete phrases or sentences. We can select the top-scoring single-word sequences and then repeatedly extend the top-scoring W-word sequences, by one word, to select the top-scoring (W+1)-word sequences, subject to the constraint that these (W+1)-word sequences are grammatical sentences or can be extended to grammatical sentences by insertion of additional words. We terminate the search process when the contraction threshold, the ratio between the score of a sequence and the score of the sequence expanding from it, drops below a specified value and the sequence being expanded is a complete sentence. This contraction threshold controls the complexity of the generated sentence.

When restricted to FSMs, h and a will be 0/1, which become −∞/0 in log space. Thus an increase in the number of words can only decrease a score to −∞, meaning that a sequence of words no longer describes a video clip. Since we seek sentences that do, we terminate the above beam-search process before the score goes to −∞. In this case, there is no approximation: a beam search maintaining all W-word sequences with finite score yields the highest-scoring sentence before the contraction threshold is met.
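A hedged sketch of this generation loop follows; the scoring function, grammar helpers, and beam width are stand-ins for whatever the real system uses, and a simple length cap replaces the contraction-threshold stopping rule for brevity.

```python
# Sketch of sentence generation by beam search over word sequences (FSM case).
# `score(B, words)`, `extensions(words)`, and `is_sentence(words)` are assumed helpers;
# a length cap stands in for the paper's contraction-threshold stopping rule.

def generate_description(B, score, extensions, is_sentence, beam_width=100, max_len=12):
    beam = [(w,) for w in extensions(())]
    best, best_score = None, float("-inf")
    while beam and len(beam[0]) <= max_len:
        # keep only word sequences that still describe the clip (finite score)
        scored = [(ws, score(B, ws)) for ws in beam]
        scored = [(ws, s) for ws, s in scored if s > float("-inf")]
        for ws, s in scored:
            if is_sentence(ws) and s > best_score:
                best, best_score = ws, s          # best complete sentence so far
        scored.sort(key=lambda x: x[1], reverse=True)
        beam = [ws + (w,) for ws, _ in scored[:beam_width] for w in extensions(ws)]
    return best
```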


To evaluate this approach, we searched the space of sentences generated by the grammar in Table 1(a) to find the top-scoring sentence for each of the 94 video clips in our corpus. Note that the grammar generates an infinite number of sentences due to recursion in NP. Even restricting the grammar to eliminate NP recursion yields a space of 147,123,874,800 sentences. Despite not restricting the grammar in this fashion, we are able to effectively find good descriptions of the video clips. We evaluated the accuracy of the sentence tracker in generating descriptions for our entire corpus, for multiple contraction thresholds. Accuracy was computed as the percentage of the 94 clips for which generated descriptions were deemed to describe the video by human judges. Contraction thresholds of 0.95, 0.90, and 0.85 yielded accuracies of 67.02%, 71.27%, and 64.89% respectively. We demonstrate examples of this approach in Fig. 3. The project website contains additional examples.

4.3. Retrieval

The availability of vast video corpora, such as on YouTube, has created a rapidly growing demand for content-based video search and retrieval. Existing systems, however, only provide a means to search via human-provided captions. The inefficacy of such an approach is evident. Attempting to search for even simple queries such as pick up or put down yields surprisingly poor results, let alone searching for more complex queries such as person approached horse. Furthermore, some prior work on content-based video-retrieval systems, like Sivic and Zisserman [14], searches only for objects, and other prior work, like Laptev et al. [11], searches only for events. Even combining such approaches to support conjunctive queries for video clips with specified collections of objects jointly with a specified event would not effectively rule out video clips where the specified objects did not play a role in the event or played different roles in the event. For example, it could not rule out a video clip depicting a person jumping next to a stationary ball for a query ball bounce, or distinguish between the queries person approached horse and horse approached person. The sentence tracker can serve as the basis of a much better video search and retrieval tool, one that performs content-based search with complex sentential queries to find precise, semantically relevant clips, as demonstrated in Fig. 4. The project website contains the top three scoring video clips for each query sentence from Table 1(c).
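Operationally, the retrieval mode reduces to scoring every clip against the query and ranking, as in this short sketch (sentence_tracker stands in for S and is an assumed interface):

```python
# Sketch of sentential-query video retrieval: rank clips by their sentence-tracker score.
# `sentence_tracker` stands in for S and returns (score, tracks); `clips` maps clip
# names to detection collections B.

def retrieve(clips, query, lexicon, sentence_tracker, top_k=3):
    scored = [(sentence_tracker(B, query, lexicon)[0], name) for name, B in clips.items()]
    return sorted(scored, reverse=True)[:top_k]     # best-scoring clips first
```

Ranking rather than hard thresholding matches the evaluation below, which reports how often the top-scoring clips depict the query.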

To evaluate this approach, we scored every video clip in our corpus against every sentence in Table 1(c), rank-ordering the video clips for each sentence, yielding the following statistics over the 1974 scores:

chance that a random clip depicts a given sentence: 13.12%
top-scoring clip depicts the given sentence: 94.68%
≥ 1 of the top 3 clips depicts the given sentence: 100.00%

The project website contains all 94 video clips and all 1974 scores. The judgment of whether a video clip depicted a given sentence was made using our annotation. We conducted an additional evaluation with this annotation. One can threshold the sentence-tracker score to yield a binary predicate on video-sentence pairs. We performed 4-fold cross validation on our corpus, selecting the threshold for each fold that maximized the accuracy of this predicate, relative to the annotation, on 75% of the video clips and evaluating the accuracy with this selected threshold on the remaining 25%. This yielded an average accuracy of 86.88%.

Figure 2. Sentence-guided focus of attention: different sets of tracks for the same video clip produced under guidance of different sentences ("The person picked up an object." versus "The person put down an object."). Here, and in Figs. 3 and 4, the red box denotes the agent, the blue box denotes the patient, the violet box denotes the source, the turquoise box denotes the goal, and the green box denotes the referent. These roles are determined automatically.

Figure 3. Generation of sentential description: constructing the best-scoring sentence for each video clip through a beam search (example outputs: "The backpack to the left of the chair approached the trash can." and "The person to the left of the trash can put down the chair.").

5. Conclusion

We have presented a novel framework that utilizes the compositional structure of events and the compositional structure of language to drive a semantically meaningful and targeted approach towards activity recognition. This multi-modal framework integrates low-level visual components, such as object detectors, with high-level semantic information in the form of sentential descriptions in natural language. This is facilitated by the shared structure of detection-based tracking, which incorporates the low-level object-detector components, and of finite-state recognizers, which incorporate the semantics of the words in a lexicon.

We demonstrated the utility and expressiveness of our framework by performing three separate tasks on our corpus, requiring no training or annotation, simply by leveraging our framework in different manners. The first, sentence-guided focus of attention, showcases the ability to focus the attention of a tracker on the activity described in a sentence, indicating the capability to identify such subtle distinctions as between The person picked up the chair to the left of the trash can and The person picked up the chair to the right of the trash can. The second, generation of sentential description of video, showcases the ability to produce a complex description of a video clip, involving multiple parts of speech, by performing an efficient search for the best description through the space of all possible descriptions. The final task, query-based video search, showcases the ability to perform content-based video search and retrieval, allowing for such distinctions as between The person approached the trash can and The trash can approached the person.

Acknowledgments  This research was supported, in part, by ARL, under Cooperative Agreement Number W911NF-10-2-0060, and the Center for Brains, Minds and Machines, funded by NSF STC award CCF-1231216. The views and conclusions contained in this document are those of the authors and do not represent the official policies, either express or implied, of ARL or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes, notwithstanding any copyright notation herein.

Figure 4. Sentential-query-based video search: returning the best-scoring video clip, in a corpus of 94 video clips, for a given sentence (e.g., "The person carried an object away from the trash can." and "The person picked up an object to the left of the trash can.").

References

[1] A. Barbu, A. Bridge, Z. Burchill, D. Coroian, S. Dickinson, S. Fidler, A. Michaux, S. Mussman, N. Siddharth, D. Salvi, L. Schmidt, J. Shangguan, J. M. Siskind, J. Waggoner, S. Wang, J. Wei, Y. Yin, and Z. Zhang. Video in sentences out. In Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, pages 102-112, Aug. 2012.
[2] A. Barbu, N. Siddharth, A. Michaux, and J. M. Siskind. Simultaneous object detection, tracking, and event recognition. Advances in Cognitive Systems, 2:203-220, Dec. 2012.
[3] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision, 88(2):303-338, June 2010.
[4] A. Farhadi, M. Hejrati, M. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In European Conference on Computer Vision, pages 15-29, Sept. 2010.
[5] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Cascade object detection with deformable part models. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2241-2248, June 2010.
[6] C. Fernández Tena, P. Baiget, X. Roca, and J. Gonzàlez. Natural language descriptions of human behavior from video sequences. In J. Hertzberg, M. Beetz, and R. Englert, editors, KI 2007: Advances in Artificial Intelligence, volume 4667 of Lecture Notes in Computer Science, pages 279-292. Springer Berlin Heidelberg, 2007.
[7] A. Gupta, Y. Verma, and C. Jawahar. Choosing linguistics over vision to describe images. In Twenty-Sixth National Conference on Artificial Intelligence, pages 606-612, July 2012.
[8] L. Jie, B. Caputo, and V. Ferrari. Who's doing what: Joint modeling of names and verbs for simultaneous face and pose annotation. In Neural Information Processing Systems Conference, pages 1168-1176, Dec. 2009.
[9] M. U. G. Khan and Y. Gotoh. Describing video contents in natural language. In Workshop on Innovative Hybrid Approaches to the Processing of Textual Data, pages 27-35, Apr. 2012.
[10] A. Kojima, T. Tamura, and K. Fukunaga. Natural language description of human activities from video images based on concept hierarchy of actions. International Journal of Computer Vision, 50(2):171-184, Nov. 2002.
[11] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1-8, June 2008.
[12] P. Li and J. Ma. What is happening in a still picture? In First Asian Conference on Pattern Recognition, pages 32-36, Nov. 2011.
[13] M. Mitchell, J. Dodge, A. Goyal, K. Yamaguchi, K. Stratos, X. Han, A. Mensch, A. C. Berg, T. L. Berg, and H. Daumé III. Midge: Generating image descriptions from computer vision detections. In 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 747-756, Apr. 2012.
[14] J. Sivic and A. Zisserman. Video Google: a text retrieval approach to object matching in videos. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1470-1477, Oct. 2003.
[15] A. J. Viterbi. Convolutional codes and their performance in communication systems. IEEE Transactions on Communication, 19(5):751-772, Oct. 1971.
[16] Z. Wang, G. Guan, Y. Qiu, L. Zhuo, and D. Feng. Semantic context based refinement for news video annotation. Multimedia Tools and Applications, 67(3):607-627, Dec. 2013.
[17] J. K. Wolf, A. M. Viterbi, and G. S. Dixon. Finding the best set of K paths through a trellis with application to multitarget tracking. IEEE Transactions on Aerospace and Electronic Systems, 25(2):287-296, Mar. 1989.
[18] Y. Yang, C. L. Teo, H. Daumé III, and Y. Aloimonos. Corpus-guided sentence generation of natural images. In Conference on Empirical Methods in Natural Language Processing, pages 444-454, July 2011.
[19] H. Yu and J. M. Siskind. Grounded language learning from video described with sentences. In 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 53-63, Aug. 2013.
