Grounding textual phrases in visual content with standalone image-sentence pairs is a challenging task. When we consider grounding in instructional videos, this problem becomes profoundly more complex: the latent temporal structure of instructional videos breaks independence assumptions and necessitates contextual understanding for resolving ambiguous visual-linguistic cues. Furthermore, the cost of dense annotation at video scale makes fully supervised approaches prohibitively expensive. In this work, we propose to tackle this new task with a weakly-supervised framework for reference-aware visual grounding in instructional videos, where only the temporal alignment between the transcription and the video segment is available for supervision. We introduce the visually grounded action graph, a structured representation capturing the latent dependency between grounding and references in video. For optimization, we propose a new reference-aware multiple instance learning (RA-MIL) objective for weak supervision of grounding in videos. We evaluate our approach over unconstrained videos from YouCookII and RoboWatch, augmented with new reference-grounding test set annotations. We demonstrate that our jointly optimized, reference-aware approach simultaneously improves visual grounding, reference resolution, and generalization to unseen instructional video categories.
CVPR 2018 (Oral)
[Link to Paper] [Link to Supplementary] [Link to Poster] [Link to Oral Talk]

De-An Huang*
Shyamal Buch*
Lucio Dery
Animesh Garg
Li Fei-Fei
Juan Carlos Niebles

(* denotes equal contribution as lead authors)

Download Annotations

The reference and grounding annotations are available [here] ([README]). Annotations are licensed under the Creative Commons CC-BY-4.0 License. Please email the lead authors with any questions regarding the provided annotations.


Citation

@inproceedings{huang2018finding,
  title={Finding ``It'': Weakly-Supervised, Reference-Aware Visual Grounding in Instructional Videos},
  author={Huang, De-An and Buch, Shyamal and Dery, Lucio and Garg, Animesh and Fei-Fei, Li and Niebles, Juan Carlos},
  booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2018}
}


This research was sponsored in part by grants from the Toyota Research Institute (TRI) and the Office of Naval Research (N00014-15-1-2813).