Is my object in this video?

Reconstruction-based object search in videos / Tan Yu; Jingjing Meng; Junsong Yuan

Chaoran Huang
Reading Group

Fundamental premise

The task asks only *whether* the query object appears in a video; there is no concern about "when" or "where" it appears


Conventional approaches


1. Produce candidate locations by
     frame-wise sliding windows or
     frame-wise object proposals

2. Compute a matching score between each candidate and the query


Drawbacks:

1. Location information is not needed for this task
     localizing every candidate is neither necessary
     nor efficient

2. Exhaustive search is not practical
     a 30-second video usually contains more than $30\times 24= 720$ frames and
     10k+ object proposals
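A back-of-the-envelope count makes the impracticality concrete (the database size of 1,000 videos is an assumed figure for illustration):

```python
# Cost of exhaustive query-vs-proposal matching (illustrative numbers).
frames = 30 * 24                 # a 30 s clip at 24 fps -> 720 frames
proposals_per_video = 10_000     # slide's lower bound ("10k+ object proposals")
num_videos = 1_000               # assumed database size, for illustration

# Exhaustive search matches the query against every proposal of every video.
comparisons_per_query = num_videos * proposals_per_video
print(frames, comparisons_per_query)  # 720 10000000
```

Ten million distance computations per query, before the database even grows.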

Recent studies

Objects typically recur with high redundancy within a single video

$\rightarrow$ select representative objects for each video and match them against the query object.


But high recall can be crucial in some cases, and representative selection risks missing the query object
     e.g. the scenario on the cover page



Proposed approach

1. Train a compact model for each video to reconstruct all of its object proposals

2. At query time, try to reconstruct the query; the reconstruction error answers whether the model has ever "seen" it


Only cares about whether the query object is contained in a video or not

Object proposals are likely to overlap spatio-temporally, so reconstructing all of them raises the chance of reasonable recall

Offline training, online search

Problem Formulation

Training Stage

We denote by $r_{\theta_i}(\cdot): \mathbb{R}^d \rightarrow \mathbb{R}^d$ the reconstruction model learned from the set of object proposals $\mathcal{S}_i=\{\mathrm{x}_i^1,...,\mathrm{x}_i^m\}$ of the video $V_i$, where $\theta_i$ denotes the model parameters, learned by reconstructing all object proposals in the training phase:

$\theta_i= \underset{\theta}{\arg\min}\sum_{x\in\mathcal{S}_i}{||x-r_\theta(x)||_2^2}$

Problem Formulation

Search Stage

In the search phase, for each video $V_i$, we calculate the query's reconstruction error $\|r_{\theta_i}(q)-q\|_2$ under the reconstruction model learned from that video, and use it as a relevance measure between the query $q$ and the whole video $V_i$:

$dist(q, V_i) = \|q - r_{\theta_i}(q)\|_2$

The smaller the reconstruction error $\|q - r_{\theta_i}(q)\|_2$ is, the more relevant the query is to the video $V_i$.
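Search then reduces to one model evaluation per video. A minimal sketch, with each stored model exposed as a callable $r_{\theta_i}$ (the dictionary-of-callables interface is an assumption for illustration):

```python
import numpy as np

def search(q, models):
    """Rank videos by ascending reconstruction error dist(q, V_i) = ||q - r_i(q)||_2.

    `models` maps each video id to its reconstruction function r_i,
    i.e. the stored parameters theta_i wrapped as a callable.
    """
    errors = {vid: np.linalg.norm(q - r(q)) for vid, r in models.items()}
    return sorted(errors, key=errors.get), errors
```

A model that reconstructs the query well (small error) ranks its video first.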

Problem Formulation


After training the reconstruction model for the video $V_i$, we no longer rely on the object proposals $\mathcal{S}_i=\{\mathrm{x}_i^1,...,\mathrm{x}_i^m\}$ and only need to store the parameters $\theta_i$, which is more compact than $\mathcal{S}_i$.

Rather than comparing the query $q$ with every object proposal in the set $\mathcal{S}_i$, the reconstruction model only needs to compute $\|q - r_{\theta_i}(q)\|_2$ to obtain the relevance between $q$ and $\mathcal{S}_i$.
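The compactness and efficiency gains can be made concrete with a quick count for a rank-$k$ linear model (the sizes $m=10{,}000$ proposals, $d=512$-dim features, $k=32$ are assumptions for illustration):

```python
# Storage and per-query cost: keeping theta_i vs. keeping all proposals.
m, d, k = 10_000, 512, 32   # assumed: proposals per video, feature dim, model rank

storage_proposals = m * d        # floats to keep every proposal feature of S_i
storage_theta = k * d            # floats to keep theta_i instead

flops_exhaustive = m * d         # ~one d-dim dot product per proposal
flops_reconstruct = 2 * k * d    # project (theta @ q), then lift back (theta.T @ ...)

print(storage_proposals // storage_theta,   # ~312x smaller
      flops_exhaustive // flops_reconstruct)  # ~156x fewer operations
```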

Proposed implementations of reconstruction model

Subspace Projection

$r_{\theta_i}(x)=\theta_i^\top \theta_i x, \quad \text{s.t.}\ \theta_i\theta_i^\top=I$

where $\theta_i \in \mathbb{R}^{k\times d}$ has orthonormal rows, so $r_{\theta_i}$ projects $x$ onto a $k$-dimensional subspace.
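Under this orthonormality constraint, minimizing the training objective over $\mathcal{S}_i$ amounts to extracting the top-$k$ principal subspace of the proposal features, which has a closed form via the SVD. A minimal sketch (feature extraction and any centering the paper may apply are omitted):

```python
import numpy as np

def fit_subspace(S, k):
    """theta = argmin_theta sum_x ||x - theta.T @ theta @ x||^2  s.t. theta @ theta.T = I.

    S: (m, d) matrix of object-proposal features for one video.
    Solution: the top-k right singular vectors of S, i.e. the principal
    subspace of the (uncentered) proposal features.
    """
    _, _, Vt = np.linalg.svd(S, full_matrices=False)
    return Vt[:k]                      # theta: (k, d) with orthonormal rows

def reconstruction_error(q, theta):
    """dist(q, V_i) = ||q - theta.T @ theta @ q||_2."""
    return np.linalg.norm(q - theta.T @ (theta @ q))
```

A query lying in the video's proposal subspace reconstructs with near-zero error; a query orthogonal to it keeps its full norm as error.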



Sparse Dictionary Learning
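In the sparse-dictionary variant, $r_{\theta_i}(q)$ reconstructs the query as a sparse combination of learned dictionary atoms. A minimal sketch of the search side, assuming a dictionary $D$ has already been learned for the video (dictionary learning itself, e.g. K-SVD, is omitted, and the greedy matching-pursuit solver below is illustrative, not necessarily the paper's exact solver):

```python
import numpy as np

def omp_reconstruct(D, q, n_atoms):
    """Greedy sparse coding (orthogonal matching pursuit):
    pick the atom most correlated with the residual, refit on all
    selected atoms, and repeat for n_atoms steps."""
    residual, idx = q.astype(float).copy(), []
    for _ in range(n_atoms):
        idx.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, idx], q, rcond=None)
        residual = q - D[:, idx] @ coef
    return D[:, idx] @ coef            # r_theta(q): sparse reconstruction of q

def dist(q, D, n_atoms=5):
    """Reconstruction error used to rank videos, as in the subspace case."""
    return np.linalg.norm(q - omp_reconstruct(D, q, n_atoms))
```

Queries expressible by a few atoms of the video's dictionary score a small error; queries outside its span keep a large one.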