The dataset WebVidVQA3M is provided in the pickle file 'webvidvqa.pkl'.

webvidvqa.pkl is a dictionary mapping each of 2,404,871 Shutterstock video IDs (e.g. '14838343') to a dictionary with 3 keys: 
'text': alt-text description in WebVid used to generate the questions and answers
'question': list of questions (e.g. 'What does an elephant drink?')
'answer': list of answers (e.g. 'milk')
For a given video, the (video clip, question, answer) triplet k (e.g. k = 1) consists of the video clip, its k-th question and its k-th answer.
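
As an illustration of the structure described above, here is a minimal Python sketch that loads the pickle file and enumerates the triplets. The helper names load_webvidvqa and iter_triplets are hypothetical (they are not shipped with the dataset). The toy entry reuses the example ID, question and answer from this file purely for illustration, and its 'text' value is made up.

```python
import pickle


def load_webvidvqa(path="webvidvqa.pkl"):
    """Load the pickle file into a dict keyed by video ID (hypothetical helper)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def iter_triplets(data):
    """Yield (video_id, question, answer) for every QA pair of every video."""
    for video_id, entry in data.items():
        # The k-th question is paired with the k-th answer of the same video.
        for question, answer in zip(entry["question"], entry["answer"]):
            yield video_id, question, answer


# Toy entry mirroring the documented layout; the values are illustrative only
# and are NOT taken from the real file.
toy = {
    "14838343": {
        "text": "a baby elephant drinking milk",
        "question": ["What does an elephant drink?"],
        "answer": ["milk"],
    }
}
triplets = list(iter_triplets(toy))
```

With the real data, list(iter_triplets(load_webvidvqa())) would enumerate all triplets, assuming the file sits in the working directory.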

The train and val splits are provided in the CSV files 'train_webvidvqa.csv' and 'val_webvidvqa.csv', which can be loaded as pandas dataframes. Both files contain 2 columns:
'video_id': Shutterstock video ID
'video_path': relative path to the feature file (inside the SSD_DIR/webvid_s3d_features folder).
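
The sketch below shows one way to resolve the feature paths of a split, assuming the two column names listed above appear as a header row. It uses the standard csv module to stay dependency-free (pandas.read_csv would serve equally well). The helper read_split, the SSD_DIR value '/data/ssd' and the toy file name '14838343.npy' are assumptions made for illustration; the actual feature file format is not specified here.

```python
import csv
import io
import os


def read_split(csv_file, ssd_dir):
    """Return (video_id, full feature path) pairs from an open split file."""
    # Feature files live inside SSD_DIR/webvid_s3d_features.
    features_root = os.path.join(ssd_dir, "webvid_s3d_features")
    rows = []
    for row in csv.DictReader(csv_file):
        full_path = os.path.join(features_root, row["video_path"])
        rows.append((row["video_id"], full_path))
    return rows


# Toy stand-in for 'train_webvidvqa.csv'; the path value is invented.
toy_csv = io.StringIO("video_id,video_path\n14838343,14838343.npy\n")
split = read_split(toy_csv, ssd_dir="/data/ssd")
```

For the real split, pass open('train_webvidvqa.csv', newline='') instead of the toy buffer, and set ssd_dir to your own SSD_DIR.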