*------------------------------------------------------------------------------------------*
| Finding "It": Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos  |
| Annotations README                                                                        |
*------------------------------------------------------------------------------------------*

Below we include the subset of annotations necessary to directly compare against all the
plots/tables in Section 4 of our main paper.

Datasets:
(1) YCII = YouCookII (Dataset link: http://youcook2.eecs.umich.edu)
(2) RW   = RoboWatch (Dataset link: http://robo.watch)

The directory sub-structure is:
(1) RR - reference resolution annotations as parsed ".en.vtt" files
(2) VG - visual grounding (bbox) annotations as a JSON file

For the grounding annotations, the key hierarchy is:
(1) video_id (from youtube/youcook/robowatch)
(2) entity_id (3-digit id: [1] act id, [2] entity num, [3] sub-entity num); this corresponds
    to the entity id in the ".en.vtt" reference resolution annotations file for that video.

Evaluation list of sets and subsets:
"subsets.json" contains the subsets for YCII/RW in Section 4 of the paper, including
YC-S/M/H/All and RW-Cook/Misc/All, for direct comparison with the paper results.
YC-S/M/H are mutually exclusive splits of YC-All according to graph complexity, while
RW-Cook/Misc have a few overlapping videos. Both YC-All and RW-All are the union of their
respective subsets. Note that YC sets provide recipe IDs, while RW sets provide video IDs.

Other info for grounding annotations:
The frame id numbers were extracted from the video at 5 FPS, and the pixel coordinates are
measured from the top-left corner in absolute pixel values (see the loading sketch below).

For annotation, youtube videos were downloaded using "youtube-dl" with the parameters:

ydl_opts = {
    'ignoreerrors': True,
    'writesubtitles': False,
    'format': '18',
    'noplaylist': True,
    'outtmpl': os.path.join(our_recipe_dir, '%(id)s.%(ext)s'),
}

Format code "18" means that our videos were downloaded at a consistent height of 360px
(640x360 or 480x360, depending on whether the source video is 16:9 or 4:3). The bounding
box annotation values are given at this spatial resolution.

Also for reference, our ffmpeg command (to extract frames from the video) was:

command = 'ffmpeg -i ' + vid_fname + ' -y -an -qscale 0 -vf fps=5 ' + vid_out_imgs_name

Finally, note that the 0.1 release provides the annotations needed to evaluate grounding+RR
on YCII and grounding on RW - this is everything necessary to compare against our paper
benchmark. If reference resolution annotations for RW are needed in your work, please email
the lead authors (see below).
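For concreteness, below is a minimal Python sketch of one way the grounding annotations
could be loaded and compared against model predictions. It follows the key hierarchy and
conventions described above, but the file name "VG/grounding_annotations.json", the
per-entity field names ("frames", "frame_id", "bbox"), the [x, y, w, h] box layout, and the
IoU >= 0.5 criterion are illustrative assumptions, not the released schema or the paper's
exact metric; adapt them to the actual JSON before use.

# Minimal sketch (not part of the release) for reading the VG JSON and scoring a
# predicted box. Field names and box layout below are assumptions for illustration.
import json

FPS = 5  # frames were extracted from the 360px-high videos at 5 FPS


def frame_to_seconds(frame_id):
    """Approximate timestamp (in seconds) of an extracted frame id."""
    return frame_id / float(FPS)


def iou(box_a, box_b):
    """IoU of two boxes given as [x, y, w, h] in absolute pixels,
    with (x, y) the top-left corner (per the convention above)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


with open('VG/grounding_annotations.json') as f:  # hypothetical file name
    vg = json.load(f)

for video_id, entities in vg.items():
    for entity_id, ann in entities.items():      # entity_id matches the ".en.vtt" RR file
        for frame in ann.get('frames', []):      # assumed per-frame annotation list
            t = frame_to_seconds(frame['frame_id'])
            gt_box = frame['bbox']               # assumed [x, y, w, h]
            # Compare a model prediction for (video_id, entity_id, t) here, e.g.:
            # correct = iou(predicted_box, gt_box) >= 0.5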
***********************
* Q&A (ongoing list): *
***********************

Q: For visual grounding evaluation, do we provide the ground-truth reference resolution
   annotations to the model at test time (to compare against your work)?
A: *No.* To evaluate visual grounding according to our setup, you do not need ground-truth
   reference annotations at test time. Our reference-aware method infers them, but they are
   not part of the visual grounding evaluation metric itself (this allows consistent
   evaluation even for non-reference-aware methods).

Q: Where are the reference resolution annotations for RoboWatch?
A: We'll add them in the 0.2 release; please email the lead authors if they are necessary
   for your research direction.

Additional questions?
Email the lead authors at {dahuang, shyamal}@cs.stanford.edu (we'll likely update/clarify
additional points here, so it'll help others too).

** Update Log (README last updated: June 2019) **
- [CVPR 2018] v0.1 released with YCII/RW annotations
- [Nov 2018]  README formatting update
- [June 2019] v0.1.1 minor update to include subsets.json directly in the main annotations zip