Modelling and Retrieving Video via Moving Objects
(Work in progress)
Mohammad Nabil
John Shepherd
Anne Ngu
School of Computer Science and Engineering
University of New South Wales (UNSW), Sydney
The Goal
Provide mechanisms to answer the following kinds of queries:
- Find an image containing a green apple on a rough wooden table
- Find a movie where Kate Winslet is lying on a piece of wreckage
  in the middle of the North Atlantic
- Find any scenes where Arnold Schwarzenegger crashes through a door
- Find the place in the 1990 World Cup Final where Maradona dribbles
  the ball into the penalty area and scores a goal
- Find the day when the elephant enters area A from the west and
  leaves to the north
... The Goal
Important characteristics of these queries:
- objects (e.g. Kate, Arnie, soccer ball, penalty area, ...)
- relationships between objects (e.g. Arnie and door)
- objects move (so relationships change over time)
Specific goal of this work:
Develop a model for describing spatiotemporal relationships among
moving objects in video, animation, etc. that is useful as a basis
for querying.
Needs to be: comprehensive, concise, efficient
Does not need to be: 100% accurate
We assume: objects have been identified and segmented in each frame
Remainder of Talk
- Model for static images (2D-PIR image graphs)
- Model for video scenes (TS-2D-PIR scene graphs)
- Scene graph operators
- Scene graph construction
- Scene graph similarity
- Prototype implementation
- Summary of progress, future work
2D-PIR Model for Static Images
Provides a concise structure for representing information about
spatial relationships between objects in a static image.
Assumes we have a static image that has been pre-processed:
- objects have been identified
(we use a unique identifier, but any set of
"distinguishing characteristics" would be OK)
- objects have been segmented
(set of pixels that comprise the object is available ...
gives object pointset/region)
Approach draws on work by Egenhofer (topology) and Allen (intervals).
Basic idea: define relationships between objects in terms of
- topological relationship (defined by pointsets (boundary,interior))
- interval relationships of object projections on x and y axes
Topological Relationships (Egenhofer)
Interval Relationships (Allen)
2D-PIR
2D-PIR = 2-Dimensional Projection Interval Relationship
A 2D-PIR Rs(A,B) is a triple (t,xp,yp), where
- t is a topological relationship
- xp is the interval relationship between x-axis projections of A and B
- yp is the interval relationship between y-axis projections of A and B
Notes:
- approach extends naturally to nD-PIR
- Rs(A,B) ≠ Rs(B,A)
(but there is a well-defined inverse)
Image Graphs
An image is modelled by a labelled digraph G(V,R) where
- V is a set of nodes labelled by object descriptors
(in current system, a descriptor is simply a unique identifier)
- R is a set of edges labelled by 2D-PIR relationships
(relationship between the objects in end-point nodes)
We include only one of Rs(A,B) or Rs(B,A),
as determined by object ordering:
for all Rs(A,B) in R, A < B.
All retrieval on spatial relationships uses image graphs rather than images.
2D-PIR Example #1
Rs(A,B) = (disjoint,~during,~during)
(or Rs(B,A) = (disjoint,during,during) )
2D-PIR Example #2
Rs(A,B) = (disjoint,before,~overlaps)
Rs(A,C) = (disjoint,before,~meets)
Rs(B,C) = (covers,~finishes,~starts)
2D-PIR Image Retrieval
Two modes:
- query language with 2D-PIR operators
- sketch-based similarity retrieval
(with user-specified threshold)
Example:
Find images containing a hawk perching on a stump
Query language:
[ P | P <- ImageDB, P.Rs(Hawk,Stump) sim (touches,~during,~meets) ]
A sim B = Sim(A,B) > Threshold
(uses list-comprehension notation from FP and Obj.Method
notation from OOP)
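The comprehension above can be sketched directly in Python. The Image class, the rs() accessor and the component-wise stand-in similarity below are illustrative assumptions, not the system's actual Sim function:

```python
# Minimal sketch of the list-comprehension query style.
# Image, rs() and sim() are illustrative stand-ins; the real system
# would use the neighbourhood-graph 2D-PIR distance of later slides.
class Image:
    def __init__(self, name, rels):
        self.name = name
        self._rels = rels          # {(A, B): (t, xp, yp)}

    def rs(self, a, b):
        return self._rels.get((a, b))

def sim(r1, r2):
    # Stand-in similarity: component-wise agreement, scaled to 0..100.
    if r1 is None or r2 is None:
        return 0
    return 100 * sum(x == y for x, y in zip(r1, r2)) // 3

def query(image_db, threshold=70):
    """[ P | P <- ImageDB, P.Rs(Hawk,Stump) sim (touches,~during,~meets) ]"""
    target = ("touches", "~during", "~meets")
    return [p for p in image_db
            if sim(p.rs("Hawk", "Stump"), target) > threshold]
```

The threshold plays the role of the user-specified similarity cutoff.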
... 2D-PIR Image Retrieval
Sketch-based query:
- user draws rough object outlines and identifies objects
- system generates image graph for sketch (query graph)
- searches database by matching query graph to image graphs
Example query:
More Example Queries
Query:
Find images containing a green apple on a rough wooden table
expressed as:
[ P | P <- ImageDB, P.Rs(Apple,Table) = (touches,?,~meets) ]
Query:
Find images with Kate Winslet lying on a piece of wreckage in the
middle of the North Atlantic
expressed as:
[ P | P <- ImageDB, P.Rs(KateWinslett,Wreckage) = (overlaps,?,?)
& P.Rs(Wreckage,NorthAtlantic) = (inside,?,?) ]
Image Distance
Image graph distance is based on distances of component 2D-PIRs:
for two graphs G1(V1,R1) and G2(V2,R2), we combine the distances
between the 2D-PIRs on their common edges.
2D-PIR distance is in turn based on distances of the component
relationships: for two 2D-PIRs Rs,1 = (t1,xp1,yp1) and
Rs,2 = (t2,xp2,yp2), we combine D(t1,t2), D(xp1,xp2) and D(yp1,yp2).
Spatial Relationship Distances
How to define, e.g., D(before,overlaps)?
We base these distance measures on observing what happens when
two objects are moved continuously relative to each other:
relationships that occur adjacent in the resulting time sequence
are treated as being closer.
... Spatial Relationship Distances
Based on this idea, we define relationship neighbourhood graphs.
If one relationship R1(A,B) transforms directly into
another R2(A,B) by continuous movement of objects
A and B, then R1 and R2
are neighbours.
Relationship distance is defined as the number of arcs in
the neighbourhood graph between two relations.
Topological relationship neighbourhood graph:
Interval relationship neighbourhood graph:
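The arc-counting distance can be sketched as a breadth-first search over hand-coded adjacency lists. The arcs below are one plausible encoding of the Egenhofer/Allen conceptual neighbourhoods; the authoritative arcs are the graphs on this slide:

```python
from collections import deque

# Assumed adjacency lists for the two neighbourhood graphs (check the
# arcs against the figures above before relying on specific distances).
TOPO_NEIGHBOURS = {
    "disjoint": ["touches"],
    "touches":  ["disjoint", "overlaps"],
    "overlaps": ["touches", "covers", "~covers"],
    "covers":   ["overlaps", "equals", "contains"],
    "~covers":  ["overlaps", "equals", "inside"],
    "equals":   ["covers", "~covers"],
    "contains": ["covers"],
    "inside":   ["~covers"],
}

INTERVAL_NEIGHBOURS = {
    "before":    ["meets"],
    "meets":     ["before", "overlaps"],
    "overlaps":  ["meets", "starts", "~finishes"],
    "starts":    ["overlaps", "equals", "during"],
    "~finishes": ["overlaps", "equals", "~during"],
    "equals":    ["starts", "~finishes", "finishes", "~starts"],
    "during":    ["starts", "finishes"],
    "~during":   ["~finishes", "~starts"],
    "finishes":  ["during", "equals", "~overlaps"],
    "~starts":   ["~during", "equals", "~overlaps"],
    "~overlaps": ["finishes", "~starts", "~meets"],
    "~meets":    ["~overlaps", "after"],
    "after":     ["~meets"],
}

def relationship_distance(graph, r1, r2):
    """Distance = number of arcs on the shortest path (BFS)."""
    if r1 == r2:
        return 0
    seen, frontier = {r1}, deque([(r1, 0)])
    while frontier:
        rel, hops = frontier.popleft()
        for nb in graph[rel]:
            if nb == r2:
                return hops + 1
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, hops + 1))
    return None  # both graphs above are connected, so never reached
```

For example, before and overlaps are two arcs apart (via meets) under this encoding.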
Image (Graph) Similarity
Based on our definition of relationship distance,
max(D(Rs,1, Rs,2)) = 20.8.
We define image similarity S as normalised, inverse distance:
S=0 means totally different; S=100 means identical.
Static Image Retrieval Prototype
Image Retrieval Costs
Quantities involved:
n  : number of objects in an image
N  : number of distinct objects in the DB
K  : number of images in the DB
Kq : number of images with Common > 0 for query q
Size of image graph: n²/2 2D-PIRs + n object descriptors
Cost in answering queries:
- matching objects
(lookup on relation of size N)
- finding images containing objects
(lookup on relation of size K)
- computing similarity for candidates
(computation of Kq·n² 2D-PIR distances)
A Model for Video Retrieval
A video work is a sequence of scenes.
A scene (aka shot) is:
- a sequence of still images (frames)
- showing continuous action, from a single viewpoint
All access to the database occurs in "units" of scenes.
A video query:
- specifies properties of the required scenes
- returns a set of scene identifiers
The user is presented with a thumbnail of each matching scene,
and then performs further manual filtering (a la QBIC).
TS-2D-PIR Model for Video Scenes
Aim is analogous to that for 2D-PIR ...
Develop a concise structure for representing
spatiotemporal information in video scenes to allow efficient retrieval.
Assumes we have a pre-processed video:
- partitioned into scenes
- objects identified/segmented in each frame
- movement of single point in each object tracked
Not (yet) feasible for general video sources, but possible
for some limited domains
(e.g. video surveillance, satellite tracking, computer-generated animation).
Time
We model time via instants and intervals.
An interval is defined by start and end instants: I = (ts, tf)
Operations on intervals:
start(I) = ts
finish(I) = tf
duration(I) = tf - ts
I1 + I2 = (start(I1), finish(I2))
All scene-time is measured relative to start of scene.
If duration of scene is T, then scene interval
IS = (0,T)
Database may store absolute start time elsewhere.
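A minimal sketch of these interval operations in Python; the operation names follow the slide, while the dataclass wrapper is an illustrative assumption:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    ts: float  # start instant
    tf: float  # finish instant

    def start(self):
        return self.ts

    def finish(self):
        return self.tf

    def duration(self):
        return self.tf - self.ts

    def __add__(self, other):
        # I1 + I2 = (start(I1), finish(I2)); sensible when the intervals
        # are contiguous, i.e. finish(I1) == start(I2)
        return Interval(self.ts, other.tf)
```

E.g. Interval(0, 3) + Interval(3, 7) yields the scene interval Interval(0, 7).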
Movement of Single Objects
Object movement is approximated by a sequence of linear movements.
Each movement represented as an iso-directional-interval
(d,r,I)
- I is a time interval
(during which object moves in constant direction)
- d is distance moved during I
(a real number)
- r is the direction moved
(one of N, NE, E, SE, S, SW, W, NW)
A track is:
[
(d1,r1,I1),
(d2,r2,I2),
...
(dn,rn,In)
]
where
finish(I1) = start(I2),
finish(I2) = start(I3),
...
(i.e. intervals are contiguous)
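The track conditions above can be sketched as a validity check, with each interval I written as a (start, finish) pair; the function name and representation are illustrative assumptions:

```python
# A track: a list of iso-directional intervals (d, r, I).  The check
# enforces the contiguity condition finish(I_k) = start(I_{k+1}) and
# basic well-formedness of each element (a stationary element may use
# a placeholder direction such as "?", as in the scene graph example).
DIRECTIONS = {"N", "NE", "E", "SE", "S", "SW", "W", "NW"}

def is_valid_track(track):
    for d, r, (ts, tf) in track:
        if d < 0 or ts > tf:
            return False
        if d > 0 and r not in DIRECTIONS:
            return False
    return all(track[k][2][1] == track[k + 1][2][0]
               for k in range(len(track) - 1))
```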
... Movement of Single Objects
Notes:
- assume that the object is "reasonably" rigid
(i.e. it consists of a single connected pointset
throughout the scene)
- this model only deals with translational motion
- simple motion => short track;
complex motion => long track
- minimum track length = 1;
maximum track length = #frames
Track Example
Track:
[
(d1, NW, I1),
(d2, N, I2),
(d3, E, I3),
(d4, NE, I4)
]
where
I1 = (t0,t1),
I2 = (t1,t2),
I3 = (t2,t3),
I4 = (t3,t4)
TS-2D-PIR
Tracks model individual objects;
we also need to model changing spatial relationships between objects
(i.e. spatiotemporal relationships)
Since spatial relationships typically persist over intervals,
we again use an interval-based data structure.
TS-2D-PIR = Temporal Sequence of 2D-PIR
A TS-2D-PIR Rst(A,B) is a sequence of (t,xp,yp,I), where
- Rs(A,B) = (t,xp,yp)
- I is the time interval over which Rs(A,B) holds
A TS-2D-PIR is also called an st-relationship.
TS-2D-PIR Example
Rst(B,G) =
[ (disjoint, ~starts, after, (t0,t1)),
(disjoint, ~finishes, after, (t1,t2)),
(disjoint, before, after, (t2,t3)) ]
Rst(B,R) =
[ (disjoint, ~meets, after, (t0,t1)),
(disjoint, ~overlaps, after, (t1,t2)),
(touches, equals, ~meets, (t2,t3)) ]
Rst(G,R) =
[ (touches, ~meets, meets, (t0,t1)),
(disjoint, ~meets , before, (t1,t2)),
(disjoint, after, before, (t2,t3)) ]
Scene Graphs
Our model for video scenes is an extension of image graphs
to incorporate temporal notions.
A scene is modelled by a labelled digraph S(M,R), where
- M is a set of nodes labelled by (object,track) pairs
- R is a set of edges labelled by TS-2D-PIR relationships
(spatiotemporal relationship between objects in end-point nodes)
Scene Graph Example
M = {
(B, [(1.4,SW,(t0,t1)), (1,S,(t1,t2)), (0,?,(t2,t3))]),
(G, [(1.4,SE,(t0,t3))]),
(R, [(1,E,(t0,t3))])
}
R = {
((B,G),
[ (disjoint, ~starts, after, (t0,t1)),
(disjoint, ~finishes, after, (t1,t2)),
(disjoint, before, after, (t2,t3)) ]
),
((B,R),
[ (disjoint, ~meets, after, (t0,t1)),
(disjoint, ~overlaps, after, (t1,t2)),
(touches, equals, ~meets, (t2,t3)) ]
),
((G,R),
[ (touches, ~meets, meets, (t0,t1)),
(disjoint, ~meets , before, (t1,t2)),
(disjoint, after, before, (t2,t3)) ]
)
}
Scene Graph Construction
Algorithm: Scene graph construction
Input: A pre-processed video scene (objects are identified,
object positions, pointsets and projections
are available for each frame)
Output: A scene graph S(M,R) representing the scene
begin
(1) M = { }; R = { };
(2) for each frame F do
(3) t = time of frame F relative to start of scene
(4) for each object X in F do
(5) if X is not already in M then
(6) include new object descriptor (X,[]) in M
(7) else
(8) compute (d,r,I) from current and previous positions
(9) append (d,r,I) to track for X
(10) endif
(11) save current position as previous
(12) for each object Y greater than X in F do
(13) compute current Rs(X,Y)
(14) if there is no edge for (X,Y) in R then
(15) include new edge (X,Y,[]) in R
(16) save current Rs(X,Y) as existing Rs(X,Y)
(17) set startOfRs(X,Y) to t
(18) else
(19) if F is the last frame or
current Rs(X,Y) differs from existing Rs(X,Y) then
(20) compute (t,xp,yp,I) from existing Rs(X,Y)
and the interval from startOfRs(X,Y) to t
(21) append (t,xp,yp,I) to the TS-2D-PIR for edge (X,Y)
(22) save current Rs(X,Y) as existing Rs(X,Y)
(23) set startOfRs(X,Y) to t
(24) endif
(25) endif
(26) endfor
(27) endfor
(28) endfor
end.
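The inner-loop bookkeeping (steps 12-25) amounts to run-length encoding an object pair's per-frame 2D-PIRs into a TS-2D-PIR. A compact sketch of just that encoding, assuming the per-frame relationships for one pair are given as (time, 2D-PIR) pairs:

```python
def build_ts2dpir(frames):
    """Collapse per-frame 2D-PIRs into a TS-2D-PIR: a list of
    (t, xp, yp, (start, finish)) entries, one per maximal run of
    frames sharing the same 2D-PIR."""
    ts2dpir = []
    existing, run_start = None, None
    for time, rel in frames:
        if existing is None:          # first frame for this object pair
            existing, run_start = rel, time
        elif rel != existing:         # relationship changed: close the run
            ts2dpir.append((*existing, (run_start, time)))
            existing, run_start = rel, time
    if existing is not None:          # close the final run at the last frame
        ts2dpir.append((*existing, (run_start, frames[-1][0])))
    return ts2dpir
```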
Scene Retrieval
As for image retrieval, there are two modes:
- query language with spatial/temporal/data operators
- sketch-based similarity retrieval (user-specified threshold)
Wider range of operators than for static image retrieval:
- extract information from within scenes (tracks, st-rels, ...)
- select scenes based on exact/inexact criteria (predicates)
- construct new scenes (filter/combine existing scenes)
Use OO notation for operators: Obj.Operator(Args)
For relevant operations, if Obj is a set (list) of objects,
then Operator is applied to each element in the set.
Trajectory Extraction
If S is a scene graph containing an object descriptor:
(A, [
(d1,r1,I1),
(d2,r2,I2), ...
(dn,rn,In) ])
then S.trajectory(A) is the sequence:
[
(d1,r1),
(d2,r2), ...
(dn,rn) ]
2D-PIR Extraction
If S is a scene graph containing:
Rst(A,B) = [
(t1,xp1,yp1,I1),
(t2,xp2,yp2,I2),...
(tn,xpn,ypn,In) ]
then S.st-seq(A,B) is the sequence:
[
(t1,xp1,yp1),
(t2,xp2,yp2),...
(tn,xpn,ypn) ]
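Both extraction operators simply drop the interval component from each element; written as free functions (an illustrative assumption), they are one-liners:

```python
# Tracks as (d, r, I) triples, TS-2D-PIRs as (t, xp, yp, I) quadruples.
def trajectory(track):
    return [(d, r) for d, r, _interval in track]

def st_seq(ts2dpir):
    return [(t, xp, yp) for t, xp, yp, _interval in ts2dpir]
```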
Temporal Extraction
If S is a scene graph and
ST = [
(t1,xp1,yp1),
(t2,xp2,yp2),...
(tn,xpn,ypn) ]
is a 2D-PIR sequence, then S.interval(A,B,ST) is an interval
I = I1+I2+...+In
such that there is a subsequence
[
(t1,xp1,yp1,I1),
(t2,xp2,yp2,I2),...
(tn,xpn,ypn,In) ]
in Rst(A,B).
If no such subsequence exists, the result is the special value Never.
Selection and Predicates
Selection uses list-comprehension notation:
[ S | S <- DB, Predicate1, Predicate2, ... ]
Assume that standard relational tests are available.
The notation Objs(S) denotes the set of objects in a scene.
The predicate has is defined as: S.has(X) = X ∈ Objs(S)
Spatial Matching
Let ST be a 2D-PIR sequence of the form
[
(t1,xp1,yp1),
(t2,xp2,yp2),...
(tn,xpn,ypn) ]
Let P be a 2D-PIR pattern:
- a 2D-PIR sequence is a pattern, matching exactly
- the symbol
?
matches any single component of
(ti,xpi,ypi)
- the symbol
..
matches any 2D-PIR sequence (shortest if multiple matches)
Let L* be the set of all subsequences of ST.
The spatial matching predicate ST.matches(P) holds iff
some sequence in L* matches the pattern P.
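A sketch of one way the predicate could be implemented, treating "subsequence" as contiguous subsequence; for a boolean predicate the shortest-match rule does not affect the result:

```python
# Pattern elements: a (t, xp, yp) triple whose components may be "?"
# (matches any single component), or the token ".." (matches any,
# possibly empty, run of elements).
def _component_match(pat_elem, elem):
    return all(p == "?" or p == e for p, e in zip(pat_elem, elem))

def _match_exact(pat, seq):
    """Does pat match all of seq?"""
    if not pat:
        return not seq
    if pat[0] == "..":
        # ".." absorbs zero or more leading elements of seq
        return any(_match_exact(pat[1:], seq[i:]) for i in range(len(seq) + 1))
    return (bool(seq) and _component_match(pat[0], seq[0])
            and _match_exact(pat[1:], seq[1:]))

def matches(st, pattern):
    """True iff pattern matches some contiguous subsequence of st."""
    return any(_match_exact(pattern, st[i:j])
               for i in range(len(st) + 1)
               for j in range(i, len(st) + 1))
```

The PassThrough pattern from the animal-tracking examples works directly with this matcher.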
Scene Construction Operators
Spatiotemporal projection:
S.project(X1,X2,...,I)
Yields a new version of S containing only the specified
objects and only information for the specified time interval.
Special cases:
- S.project(Objs(S),I) gives a clip from a scene
- S.project(objs...,IS) eliminates some objects
Overlay:
S1.overlay(S2, ts)
Combines scene graphs, with all S2 motion starting
at time ts after the start of S1.
Example: Animal Tracking
Assume we have a fixed camera watching a region of country
and animals tagged with GPS/radio-transmitters.
Monitor the region for several years, treating each day
as a scene.
Animal movement is easy to track automatically.
Each animal is identified uniquely by its transmitter.
Assume the image is oriented so that top=N, left=W, bottom=S, right=E.
This kind of data might be useful to biologists to
monitor migration, breeding, ...
Example Animal Query #1
Query:
Show the migration path of a specific animal named Elephant.
Strategy:
- find the scenes showing this particular animal
- extract the trajectory from each scene and display
Expressed as:
[ S | S <- AnimalDB, S.has(Elephant) ] . trajectory(Elephant)
Example Animal Query #2
Query:
How long is animal A inside region R?
Strategy:
- find the scenes where A is inside R
- extract the interval for which this is true
- compute the length of the interval
Expressed as:
AinsideR = [ S | S <- AnimalDB, S.interval(A,R,[(~contains,?,?)]) != Never ]
AinsideR.interval(A,R,[(~contains,?,?)]).duration
Example Animal Query #3
Query:
Find scenes where animal A enters region R from
the west side and leaves from the north side.
Strategy:
- define the notion of entering a region from west and leaving from north
- extract the 2D-PIR sequence for A and R
- match it against the defined movement
Expressed as:
EnterWest = [ (disjoint,before,?), (touches,meets,during),
(overlaps,overlaps,during), (~contains,during,during) ]
ExitNorth = [ (~contains,during,during), (overlaps,during,~overlaps),
(touches,during,~meets), (disjoint,during,after) ]
[ S | S <- AnimalDB, S.st-seq(A,R).matches(EnterWest + ExitNorth) ]
Would be simpler to do as sketch query.
Example Animal Query #4
Query:
Find scenes where some animal passes through region R.
Strategy:
- define operation for "passes through"
- select scenes that contain at least one animal doing this
Expressed as:
PassThrough = [ (disjoint,?,?) .. (~contains,?,?) .. (disjoint,?,?) ]
[ S | S <- AnimalDB,
exists [ A | A <- Objs(S), S.st-seq(A,R).matches(PassThrough) ]
]
Scene Similarity
Scene similarity is based on a combination of distances between
scene components (tracks and TS-2D-PIRs).
For two scene graphs S1(M1,R1) and S2(M2,R2), we combine track
distances and TS-2D-PIR distances, where wt and ws reflect the
weighting given to trajectories and 2D-PIR sequences respectively.
Track Similarity
Requires measuring distance between two trajectories for possible
"alignments" and choosing minimum.
Trajectory distance was tested by:
- creating 80 widely-varied trajectories
- choosing 10 of these as "queries"
- asking 10 users to rate the similarity of the other 79 trajectories against each query
- measuring distances using distance function
- comparing measured distance rank with user-assigned distance rank
Found: reasonable precision/recall performance for threshold 70%
Video Scene Retrieval Prototype
Conclusions
Have developed:
- model for spatiotemporal relationships in video data
- similarity/exact retrieval process based on this model
- prototype implementation for animated video sequences
Still to do:
- devise efficient access methods
- combine with other image/scene-matching techniques
- ...
References/Further Reading
-
J.F.Allen,
Maintaining Knowledge about Temporal Intervals,
Communications of the ACM, Nov 1983
[original description of interval relationships; in this case the context
was temporal, rather than spatial, intervals]
-
S.K.Chang, Q.Y.Shi, C.W.Yan,
Iconic Indexing by 2D Strings,
IEEE Transactions on Pattern Analysis and Machine Intelligence, 1987
[description of method to represent static images by using relationships
between object projections on axes]
-
M.J.Egenhofer
Point-set Topological Spatial Relations,
International Journal on Geographic Information Systems, 1991
[original description of point-set topological relations between image
regions, which forms the basis for the topological component of 2D-PIRs]
-
M.Nabil, A.Ngu, J.Shepherd
Picture Similarity Retrieval using the 2D Projection Interval Representation,
IEEE Transactions on Knowledge and Data Engineering, 1996
[detailed description of the 2D-PIR static-image model from this talk]
-
M.Nabil, A.Ngu, J.Shepherd
Modelling and Retrieving Video via Moving Objects,
Journal of Multimedia Technology and Applications (to appear)
[detailed description of 2D-PIR-based moving object model from this talk]
Produced: 22 May 98