Reading Group

1. Produce candidate locations by frame-wise sliding windows or frame-wise object proposals

2. Compute the matching score between each candidate and the query
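The two steps above can be sketched as an exhaustive matching baseline. This is a minimal illustration under stated assumptions, not code from the presented work: candidates and query are assumed to be already encoded as $d$-dimensional feature vectors, cosine similarity is an assumed choice of matching score, and the name `exhaustive_search` and the toy data are hypothetical.

```python
import numpy as np

def exhaustive_search(query, candidates):
    """Score every candidate against the query and return the best match.

    query:      (d,) feature vector of the query object (assumed given).
    candidates: (n, d) feature vectors of all candidate locations
                (sliding windows or object proposals) in one video.
    Cosine similarity is an assumed choice of matching score.
    """
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = c @ q                      # one score per candidate: O(n * d)
    best = int(np.argmax(scores))
    return best, float(scores[best])

# Toy data: 4 candidates in a 3-dimensional feature space.
candidates = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])
query = np.array([2.0, 2.0, 0.0])

best_index, best_score = exhaustive_search(query, candidates)
```

The cost grows linearly with the number of candidates per video, since every candidate has to be scored for every query.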

1. No need for location information: localizing the object is neither necessary nor efficient

2. Exhaustive search is not practical: a 30-second video usually contains more than $30\times 24 = 720$ frames and 10k+ object proposals

Objects are normally highly redundant within a single video

$\rightarrow$ select representative objects for videos and match with query object.

High recall can be crucial in some cases

e.g. the scenario on the cover page

Proposed method

1. Train a compact model for each video to reconstruct all of its object proposals

2. Try to reconstruct the query with that model to determine whether the model has seen it before

Only cares about whether the query object is contained in a video or not

Object proposals are likely to overlap spatio-temporally, which increases the chance of reasonable recall

Offline training, online search

We denote by $r_{\theta_i}(\cdot): \mathbb{R}^d \rightarrow \mathbb{R}^d$ the reconstruction model learned from the set of object proposals $\mathcal{S}_i=\{\mathrm{x}_i^1,...,\mathrm{x}_i^m\}$ of the video $V_i$, where $\theta_i$ denotes the parameters of the reconstruction model, which are learned by reconstructing all object proposals in the training phase:

$\theta_i= \underset{\theta}{\arg\min}\sum_{x\in\mathcal{S}_i}{||x-r_\theta(x)||_2^2}$

In the search phase, for each video $V_i$, we calculate the query’s reconstruction error $||q - r_{\theta_i}(q)||_2$ using the reconstruction model learned from the video. We use the reconstruction error as a distance measure to determine the relevance between the query $q$ and the whole video $V_i$:

$\mathrm{dist}(q, V_i) = ||q - r_{\theta_i}(q)||_2$

The smaller the reconstruction error $||q - r_{\theta_i}(q)||_2$ is, the more relevant the query is to the video $V_i$.
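The search phase can be sketched as ranking videos by increasing reconstruction error. This is a minimal illustration only: the hand-written linear projections stand in for trained reconstruction models, and `make_projection_model`, `dist`, `rank_videos` and the video ids are hypothetical names. Any other $r_{\theta_i}$ could be plugged in as the callable.

```python
import numpy as np

def make_projection_model(basis):
    """Return r(x) = basis^T basis x for a (k, d) matrix with orthonormal rows.
    A hypothetical stand-in for a trained reconstruction model."""
    def reconstruct(x):
        return basis.T @ (basis @ x)
    return reconstruct

def dist(query, reconstruct):
    """Reconstruction error ||q - r(q)||_2, used as the query-to-video distance."""
    return float(np.linalg.norm(query - reconstruct(query)))

def rank_videos(query, models):
    """Sort video ids by increasing reconstruction error (most relevant first)."""
    errors = {vid: dist(query, r) for vid, r in models.items()}
    return sorted(errors, key=errors.get), errors

# Two toy "videos" in R^3, each summarised by a hand-written model.
models = {
    "video_a": make_projection_model(np.array([[1.0, 0.0, 0.0],
                                               [0.0, 1.0, 0.0]])),
    "video_b": make_projection_model(np.array([[0.0, 0.0, 1.0]])),
}
query = np.array([3.0, 4.0, 0.0])

ranking, errors = rank_videos(query, models)
```

Here the query lies entirely in the subspace of `video_a`, so its error there is zero, while `video_b` cannot reconstruct it at all and yields an error equal to the query norm.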

After training the reconstruction model for the video $V_i$, we no longer rely on the object proposals $\mathcal{S}_i=\{\mathrm{x}_i^1,...,\mathrm{x}_i^m\}$ and only need to store the parameters $\theta_i$, which are more compact than $\mathcal{S}_i$.
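A back-of-the-envelope calculation makes the storage saving concrete. Only the proposal count comes from the "10k+" figure quoted earlier; the feature dimension `d`, the number of parameter rows `k` (for a linear model with $\theta_i \in \mathbb{R}^{k\times d}$) and the float32 storage are assumptions chosen for illustration.

```python
m = 10_000           # object proposals per video ("10k+" in the notes)
d = 4096             # feature dimension (assumed for illustration)
k = 100              # rows of theta_i for a linear model (assumed)
bytes_per_float = 4  # float32 storage (assumed)

proposal_bytes = m * d * bytes_per_float   # cost of keeping all of S_i
model_bytes = k * d * bytes_per_float      # cost of keeping only theta_i
ratio = proposal_bytes / model_bytes       # equals m / k for this model

print(f"proposals: {proposal_bytes / 1e6:.1f} MB, "
      f"model: {model_bytes / 1e6:.2f} MB, ratio: {ratio:.0f}x")
```

Under these assumptions the saving is simply $m/k$, independent of the feature dimension.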

Rather than comparing the query $q$ with all the object proposals in the set $\mathcal{S}_i$, the reconstruction model only needs to compute $||q - r_{\theta_i}(q)||_2$ to obtain the relevance between $q$ and $\mathcal{S}_i$.

$r_{\theta_i}(x)=\theta_i^\top \theta_i x, \quad \text{s.t.}\ \theta_i\theta_i^\top=\mathit{I}$
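For this linear model with orthonormal rows, the training objective has a closed-form minimiser: by the Eckart-Young theorem, the top-$k$ right singular vectors of the (uncentred) proposal matrix. The sketch below illustrates that on synthetic data; the function names, the choice $k=2$ and the toy proposals are hypothetical and not taken from the presented work.

```python
import numpy as np

def fit_projection(proposals, k):
    """Fit theta (k, d) with orthonormal rows minimising
    sum_x ||x - theta^T theta x||_2^2 over the rows of `proposals`,
    using the top-k right singular vectors of the data matrix."""
    _, _, vt = np.linalg.svd(proposals, full_matrices=False)
    return vt[:k]

def reconstruct(theta, x):
    """r_theta(x) = theta^T theta x."""
    return theta.T @ (theta @ x)

def error(theta, x):
    """Reconstruction error ||x - r_theta(x)||_2."""
    return float(np.linalg.norm(x - reconstruct(theta, x)))

rng = np.random.default_rng(0)
# Synthetic "proposals": 200 points close to a 2-D subspace of R^5.
subspace = np.linalg.qr(rng.normal(size=(5, 2)))[0].T   # (2, 5), orthonormal rows
proposals = (rng.normal(size=(200, 2)) @ subspace
             + 0.01 * rng.normal(size=(200, 5)))

theta = fit_projection(proposals, k=2)

inside = subspace[0]                      # unit vector inside the video's subspace
outside = np.linalg.svd(subspace)[2][-1]  # unit vector orthogonal to that subspace

err_inside = error(theta, inside)
err_outside = error(theta, outside)
```

A query resembling the video's proposals (`inside`) is reconstructed almost perfectly, while an unrelated query (`outside`) keeps nearly all of its norm as error.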

$r_{\theta_i}(x)=f_2(\mathit{W}_i^2f_1(\mathit{W}_i^1x+b_i^1)+b_i^2)$
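A two-layer reconstruction model of this form can be sketched as a small autoencoder. The notes do not specify the activations or the optimiser, so the choices here are assumptions: $f_1=\tanh$, $f_2$ is the identity, and training uses plain batch gradient descent on the mean squared reconstruction error. The hidden size, learning rate, step count and synthetic data are likewise hypothetical.

```python
import numpy as np

def init_params(d, hidden, rng):
    """Small random weights W1 (hidden, d), W2 (d, hidden) and zero biases."""
    return {
        "W1": 0.1 * rng.normal(size=(hidden, d)), "b1": np.zeros(hidden),
        "W2": 0.1 * rng.normal(size=(d, hidden)), "b2": np.zeros(d),
    }

def reconstruct(p, X):
    """r(x) = f2(W2 f1(W1 x + b1) + b2) with f1 = tanh, f2 = identity
    (both activations are assumptions). X holds one sample per row.
    Returns the reconstruction and the hidden activations."""
    H = np.tanh(X @ p["W1"].T + p["b1"])
    return H @ p["W2"].T + p["b2"], H

def mean_error(p, X):
    """Mean squared reconstruction error over the rows of X."""
    R, _ = reconstruct(p, X)
    return float(np.mean(np.sum((X - R) ** 2, axis=1)))

def train(p, X, lr=0.1, steps=1000):
    """Plain batch gradient descent on the mean squared reconstruction error."""
    n = X.shape[0]
    for _ in range(steps):
        R, H = reconstruct(p, X)
        dR = 2.0 * (R - X) / n                 # gradient w.r.t. the output
        dZ = (dR @ p["W2"]) * (1.0 - H ** 2)   # backpropagate through tanh
        grads = {"W2": dR.T @ H, "b2": dR.sum(axis=0),
                 "W1": dZ.T @ X, "b1": dZ.sum(axis=0)}
        for name in p:
            p[name] -= lr * grads[name]
    return p

rng = np.random.default_rng(0)
# Synthetic "proposals": 200 points in a 2-D subspace of R^5.
basis = np.linalg.qr(rng.normal(size=(5, 2)))[0].T
X = 0.5 * rng.normal(size=(200, 2)) @ basis

params = init_params(d=5, hidden=3, rng=rng)
error_before = mean_error(params, X)
train(params, X)
error_after = mean_error(params, X)
```

After training, only the four arrays in `params` need to be kept for the video; a query is scored by passing it through `reconstruct` and measuring the error.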

$r_{\theta_i}(x)=\theta_i\mathit{h}_{\theta_i}(x)$