The WebVidVQA3M dataset is provided in the pickle file 'webvidvqa.pkl'. This file contains a dictionary mapping each of 2,404,871 Shutterstock video IDs (e.g. '14838343') to a dictionary with 3 keys:
- 'text': alt-text description in WebVid used to generate the questions and answers
- 'question': list of questions (e.g. 'What does an elephant drink?')
- 'answer': list of answers (e.g. 'milk')

The (video clip, question, answer) triplet k (e.g. k = 1) of a video is given by the video, its k-th question and its k-th answer.

The train and val splits are provided in the CSV files 'train_webvidvqa.csv' and 'val_webvidvqa.csv', which can be loaded as pandas dataframes. Both files contain 2 columns:
- 'video_id': Shutterstock video ID
- 'video_path': relative path to the feature file (inside the SSD_DIR/webvid_s3d_features folder)
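The layout described above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the helper names `load_webvidvqa`, `iter_triplets` and `split_triplets` are hypothetical, the triplet index `k` is assumed to be the 0-based position in the 'question' and 'answer' lists, and the 'video_id' column is read as a string on the assumption that it should match the string keys of the pickle.

```python
import pickle


def load_webvidvqa(path="webvidvqa.pkl"):
    """Load the pickle file into a dict keyed by video ID."""
    with open(path, "rb") as f:
        return pickle.load(f)


def iter_triplets(data, video_ids=None):
    """Yield (video_id, k, question, answer) for each QA pair.

    k is the 0-based position in the 'question'/'answer' lists
    (an assumption about how triplets are indexed).
    """
    ids = data.keys() if video_ids is None else video_ids
    for video_id in ids:
        entry = data[video_id]
        pairs = zip(entry["question"], entry["answer"])
        for k, (question, answer) in enumerate(pairs):
            yield video_id, k, question, answer


def split_triplets(data, csv_path):
    """Yield the triplets of the videos listed in a split CSV file."""
    import pandas as pd  # imported here so the other helpers work without pandas

    # Read IDs as strings so they can match the pickle's string keys (assumption).
    split = pd.read_csv(csv_path, dtype={"video_id": str})
    # Skip IDs that are absent from the pickle instead of raising KeyError.
    ids = [vid for vid in split["video_id"] if vid in data]
    return iter_triplets(data, ids)


if __name__ == "__main__":
    data = load_webvidvqa()
    for triplet in split_triplets(data, "val_webvidvqa.csv"):
        print(triplet)
        break
```

Passing `train_webvidvqa.csv` instead yields the training triplets. The 'video_path' column is not used here; it would be joined with SSD_DIR/webvid_s3d_features to locate the feature file of each video.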