# Background

## Conventional approaches

##### Steps

1. Produce candidate locations by frame-wise sliding windows or frame-wise object proposals

2. Compute the matching score between each candidate and the query

##### Problems

1. Location information is not needed for this task, so computing it is neither necessary nor efficient

2. Exhaustive search is impractical: a 30-second video usually contains more than $30\times 24= 720$ frames and 10k+ object proposals

## Recent studies

Objects in a video are usually highly redundant

$\rightarrow$ select representative objects for each video and match them with the query object.

##### Problems

High recall can be crucial in some cases, e.g., the scenario on the cover page

# The proposed method

##### Steps

1. Train a compact model for each video to reconstruct all of its object proposals

2. Reconstruct the query with this model to answer whether the model has ever seen it

##### Properties

Only cares about whether the query object is contained in a video, not where it appears

The likely spatio-temporal overlap among proposals enlarges the chance of reasonable recall

Offline training, online search

## Problem Formulation

##### Training Stage

We denote by $r_{\theta_i}(\cdot): \mathbb{R}^d \rightarrow \mathbb{R}^d$ the reconstruction model learned from the set of object proposals $\mathcal{S}_i=\{\mathrm{x}_i^1,...,\mathrm{x}_i^m\}$ of the video $V_i$, where $\theta_i$ denotes the parameters of the reconstruction model, which are learned by reconstructing all object proposals in the training phase:

$\theta_i= \underset{\theta}{\arg\min}\sum_{x\in\mathcal{S}_i}{||x-r_\theta(x)||_2^2}$
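The training objective above can be sketched as a loss over any candidate model $r_\theta$. A minimal illustration (function and variable names are ours, not from the paper):

```python
import numpy as np

def reconstruction_loss(r, proposals):
    """Sum of squared reconstruction errors over a video's proposals.

    r: callable mapping a (d,) feature vector to its reconstruction.
    proposals: (m, d) array standing in for the set S_i.
    """
    return sum(np.sum((x - r(x)) ** 2) for x in proposals)

rng = np.random.default_rng(0)
S = rng.normal(size=(5, 4))       # synthetic proposal features

# The identity "model" reconstructs every proposal perfectly.
assert reconstruction_loss(lambda x: x, S) == 0.0
# A zero model's loss equals the total squared norm of the proposals.
loss = reconstruction_loss(lambda x: np.zeros_like(x), S)
```

Training amounts to choosing $\theta_i$ that minimizes this loss over $\mathcal{S}_i$; concrete parameterizations follow later.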

## Problem Formulation

##### Search Stage

In the search phase, for each video $V_i$, we calculate the query's reconstruction error $||r_{\theta_i}(q)-q||_2$ using the reconstruction model learned from the video. We use the reconstruction error as a similarity measurement to determine the relevance between the query $q$ and the whole video $V_i$:

$dist({q}, V_i) = ||q - r_{\theta_i}(q)||_2$

The smaller the reconstruction error $||q - r_{\theta_i}(q)||_2$ is, the more relevant the query is to the video $V_i$.
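The search stage then reduces to ranking videos by this distance. A toy sketch with two hypothetical per-video models (here, 1-D subspace projectors of our own construction):

```python
import numpy as np

def dist(q, r):
    """Reconstruction error ||q - r(q)||_2 used as query-video distance."""
    return np.linalg.norm(q - r(q))

def make_projector(u):
    """Model that projects onto the line spanned by unit vector u."""
    u = u / np.linalg.norm(u)
    return lambda x: u * (u @ x)

models = {"V1": make_projector(np.array([1.0, 0.0])),
          "V2": make_projector(np.array([0.0, 1.0]))}

q = np.array([0.9, 0.1])
ranking = sorted(models, key=lambda v: dist(q, models[v]))
# V1's subspace reconstructs q far better, so V1 ranks first.
```

Only the stored parameters of each model are touched at query time; the raw proposals are never revisited.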

## Problem Formulation

##### Objectives

After training the reconstruction model for the video $V_i$, we no longer rely on the object proposals $\mathcal{S}_i=\{\mathrm{x}_i^1,...,\mathrm{x}_i^m\}$ and only need to store the parameters $\theta_i$, which is more compact than $\mathcal{S}_i$.

Rather than comparing the query $q$ with all the object proposals in the set $\mathcal{S}_i$, the reconstruction model only needs to compute $||q - r_{\theta_i}(q)||_2$ to obtain the relevance between $q$ and $\mathcal{S}_i$.

## Proposed implementations of reconstruction model

##### Subspace Projection

$r_{\theta_i}(x)=\theta_i^\top \theta_i x, \quad \text{s.t.}\ \theta_i\theta_i^\top=\mathit{I}$
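With this parameterization, $\theta_i$ is a matrix with orthonormal rows, and minimizing the reconstruction loss is solved by PCA, i.e. the top right singular vectors of the proposal matrix. A sketch on synthetic proposals lying near a 2-D subspace:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for S_i: 100 proposals in a 2-D subspace of R^5.
basis = rng.normal(size=(2, 5))
S = rng.normal(size=(100, 2)) @ basis

k = 2                                  # subspace dimension (our choice)
_, _, Vt = np.linalg.svd(S, full_matrices=False)
theta = Vt[:k]                         # (k, d); rows are orthonormal
assert np.allclose(theta @ theta.T, np.eye(k))   # theta theta^T = I

def r(x, theta=theta):
    """Subspace reconstruction r(x) = theta^T theta x."""
    return theta.T @ (theta @ x)

# In-subspace proposals are reconstructed (almost) exactly.
x = S[0]
err = np.linalg.norm(x - r(x))
```

Only the $k \times d$ matrix $\theta_i$ needs to be stored per video, versus $m \times d$ for the raw proposals.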

##### Auto-encoder

$r_{\theta_i}(x)=f_2(\mathit{W}_i^2f_1(\mathit{W}_i^1x+b_i^1)+b_i^2)$
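The formula above is a two-layer auto-encoder; the slide leaves $f_1, f_2$ unspecified, so the sketch below assumes a common choice ($f_1 = \tanh$, $f_2 = $ identity) and untrained random weights, showing only the forward pass:

```python
import numpy as np

rng = np.random.default_rng(2)
d, h = 8, 3                            # illustrative input / hidden sizes
W1, b1 = rng.normal(size=(h, d)), np.zeros(h)
W2, b2 = rng.normal(size=(d, h)), np.zeros(d)

def r(x):
    """Auto-encoder reconstruction f2(W2 f1(W1 x + b1) + b2),
    with f1 = tanh and f2 = identity (assumed choices)."""
    return W2 @ np.tanh(W1 @ x + b1) + b2

x = rng.normal(size=d)
err = np.linalg.norm(x - r(x))         # large for untrained weights
```

In practice the weights $\theta_i = \{W_i^1, b_i^1, W_i^2, b_i^2\}$ would be fit by minimizing the reconstruction objective over the video's proposals.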

##### Sparse Dictionary Learning

$r_{\theta_i}(x)=\theta_i\mathit{h}_{\theta_i}(x)$, where $\mathit{h}_{\theta_i}(x)$ is the sparse code of $x$ under the dictionary $\theta_i$