Abstract
Grounding textual phrases in visual content with standalone
image-sentence pairs is a challenging task. When
we consider grounding in instructional videos, this problem
becomes profoundly more complex: the latent temporal
structure of instructional videos breaks independence assumptions
and necessitates contextual understanding for
resolving ambiguous visual-linguistic cues. Furthermore,
the scale of video data and the cost of dense annotation
make supervised approaches prohibitively costly. In this work, we propose
to tackle this new task with a weakly-supervised framework
for reference-aware visual grounding in instructional
videos, where only the temporal alignment between the transcription
and the video segment is available for supervision.
We introduce the visually grounded action graph, a
structured representation capturing the latent dependency
between grounding and references in video. For optimization,
we propose a new reference-aware multiple instance
learning (RA-MIL) objective for weak supervision of grounding
in videos. We evaluate our approach over unconstrained
videos from YouCookII and RoboWatch, augmented with new
reference-grounding test set annotations. We demonstrate
that our jointly optimized, reference-aware approach simultaneously
improves visual grounding, reference resolution,
and generalization to unseen instructional video categories.
CVPR 2018 (Oral)
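
For readers unfamiliar with the multiple instance learning setup named in the abstract, the following is a minimal, generic sketch of a max-pooled MIL ranking objective for weakly supervised grounding. It is an illustrative assumption, not the paper's RA-MIL formulation (which additionally models references through the visually grounded action graph); the score S, the region and phrase embeddings \phi and \psi, and the margin \Delta are hypothetical notation introduced here. Treating an aligned segment-phrase pair as a bag of region proposals r_1, ..., r_N,

    S(v, p) = \max_{n} \; \phi(r_n)^{\top} \psi(p)

    \mathcal{L}_{\mathrm{MIL}} = \sum_{(v,p)} \Big[ \sum_{p' \neq p} \max\!\big(0,\; \Delta - S(v,p) + S(v,p')\big)
        \;+\; \sum_{v' \neq v} \max\!\big(0,\; \Delta - S(v,p) + S(v',p)\big) \Big]

Under this kind of objective, only segment-level alignment between transcription and video is needed: the max over proposals lets the model discover which region grounds the phrase without any box-level labels.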