The iVQA dataset is provided as the CSV file 'ivqa.csv', which can be loaded as a pandas dataframe.

It contains 10,000 samples (one per line), each consisting of:
a question in column 'question' (e.g. 'What is the last ingredient the woman is showing right before pouring the ingredients in the blender?') 
5 corresponding ground truth answers in columns 'answer1', 'answer2', 'answer3', 'answer4', 'answer5' (e.g. 'lemon')
5 corresponding confidence scores in columns 'conf1', 'conf2', 'conf3', 'conf4', 'conf5' (0 for 'not confident', 1 for 'maybe', 2 for 'confident')
a YouTube video ID in column 'video_id' (e.g. 'vFJDrCB_KdY')
a start time, in seconds, in column 'start' (e.g. 21)
an end time, in seconds, in column 'end' (e.g. 46)
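
As a minimal sketch of how these columns might be read with pandas: the helpers `load_ivqa`, `confident_answers` and `youtube_url` below are hypothetical names introduced for illustration, not part of the dataset release. The sketch assumes a comma-separated file whose header uses the column names listed above. The sample row reuses the example values from this description, but its 'answer2' to 'answer5' and 'conf1' to 'conf5' values are made up so that the snippet runs without the real file.

```python
import io

import pandas as pd

ANSWER_COLS = [f"answer{i}" for i in range(1, 6)]
CONF_COLS = [f"conf{i}" for i in range(1, 6)]


def load_ivqa(path_or_buffer):
    """Hypothetical helper: read an iVQA CSV and add the clip length in seconds."""
    df = pd.read_csv(path_or_buffer)  # assumption: default comma delimiter
    df["clip_length"] = df["end"] - df["start"]
    return df


def confident_answers(row, min_conf=2):
    """Hypothetical helper: keep the answers whose confidence is >= min_conf."""
    return [row[a] for a, c in zip(ANSWER_COLS, CONF_COLS) if row[c] >= min_conf]


def youtube_url(video_id, start):
    """Hypothetical helper: link to the original video at the clip start time."""
    return f"https://www.youtube.com/watch?v={video_id}&t={int(start)}s"


# Illustrative row only: answers 2-5 and all confidence scores are made up.
SAMPLE_CSV = (
    "question,answer1,answer2,answer3,answer4,answer5,"
    "conf1,conf2,conf3,conf4,conf5,video_id,start,end\n"
    '"What is the last ingredient the woman is showing right before pouring '
    'the ingredients in the blender?",lemon,lemon,lime,lemon,lemon,'
    "2,2,1,2,0,vFJDrCB_KdY,21,46\n"
)

demo = load_ivqa(io.StringIO(SAMPLE_CSV))
first = demo.iloc[0]
print(first["clip_length"])
print(confident_answers(first))
print(youtube_url(first["video_id"], first["start"]))

# With the real file, the call would presumably be: df = load_ivqa("ivqa.csv")
```

Whether to keep only 'confident' answers or also 'maybe' answers is a choice left to the user; `min_conf=1` would include both.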

Data splits are provided as the CSV files 'train.csv', 'val.csv' and 'test.csv', with similar columns. Note that iVQA has only one question type (0), which corresponds to objects, places and people. Additionally, the durations (in seconds) of the original videos from which the clips were extracted are provided in 'original_durations.pkl'.
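
The split files and the durations file could be loaded along the following lines. This is only a sketch: `load_splits` and `load_original_durations` are hypothetical helper names, the files are assumed to sit together in one directory, and the internal structure of 'original_durations.pkl' is not described here, so the code simply returns whatever object the pickle contains. As with any pickle file, it should only be loaded from a trusted source.

```python
import pickle
from pathlib import Path

import pandas as pd

SPLIT_NAMES = ("train", "val", "test")


def load_splits(data_dir="."):
    """Hypothetical helper: return {'train': df, 'val': df, 'test': df}."""
    data_dir = Path(data_dir)
    return {name: pd.read_csv(data_dir / f"{name}.csv") for name in SPLIT_NAMES}


def load_original_durations(path="original_durations.pkl"):
    """Hypothetical helper: unpickle the durations file and return its content.

    The type of the returned object (dict, dataframe, ...) is not assumed.
    """
    with open(path, "rb") as f:
        return pickle.load(f)


# Example usage, assuming the files are in the current directory:
#   splits = load_splits(".")
#   print({name: len(df) for name, df in splits.items()})
#   durations = load_original_durations("original_durations.pkl")
#   print(type(durations))
```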

If you want to obtain the raw video clips (about 7 GB), please fill in the following form: https://docs.google.com/forms/d/e/1FAIpQLSecsMA0A0jduqEWt9EXY5wa6j-TT1GbWmBM-elQBbPhXBJkkA/viewform?usp=sf_link