- When all three datasets below are used, the data split strategies of papers 23 and 31 were followed.
QVHighlights
{
"qid": 8737,
"query": "A family is playing basketball together on a green court outside.",
"duration": 126,
"vid": "bP5KfdFJzC4_660.0_810.0",
"relevant_windows": [[0, 16]],
"relevant_clip_ids": [0, 1, 2, 3, 4, 5, 6, 7],
"saliency_scores": [[4, 1, 1], [4, 1, 1], [4, 2, 1], [4, 3, 2], [4, 3, 2], [4, 3, 3], [4, 3, 3], [4, 3, 2]]
}
qid: 8737
query: "A family is playing basketball together on a green court outside."
- qid is a unique identifier of a query.
- This query corresponds to a video identified by its video id, vid.
vid: "bP5KfdFJzC4_660.0_810.0"
duration: 126
- duration is an integer indicating the duration of this video.
relevant_windows: [[0, 16]]
- relevant_windows is the list of windows that localize the moments.
- Each window has two numbers: the first is the start time of the moment, the second is the end time.
relevant_clip_ids: [0, 1, 2, 3, 4, 5, 6, 7]
- relevant_clip_ids is the list of ids of the segmented 2-second clips that fall into the moments specified by relevant_windows, starting from 0.
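The relationship between relevant_windows and relevant_clip_ids can be sketched in a few lines of Python. This is only an illustration, not code from the dataset authors: the helper name windows_to_clip_ids is hypothetical, and it assumes that window boundaries are multiples of the 2-second clip length and that the end time is exclusive.

```python
def windows_to_clip_ids(windows, clip_len=2):
    """Hypothetical helper: map [start, end] windows to clip ids.

    Assumes boundaries are multiples of clip_len and that the end
    time is exclusive, so [0, 16] covers clips 0..7.
    """
    clip_ids = set()
    for start, end in windows:
        first = int(start // clip_len)
        last = int(end // clip_len)  # exclusive upper bound
        clip_ids.update(range(first, last))
    return sorted(clip_ids)


# Fields copied from the example annotation above.
record = {
    "relevant_windows": [[0, 16]],
    "relevant_clip_ids": [0, 1, 2, 3, 4, 5, 6, 7],
}
derived = windows_to_clip_ids(record["relevant_windows"])
print(derived == record["relevant_clip_ids"])
```

For the example record, the derived ids match the annotated relevant_clip_ids. Overlapping windows are merged because the ids are collected in a set.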
saliency_scores: [[4, 1, 1], [4, 1, 1], [4, 2, 1], [4, 3, 2], [4, 3, 2], [4, 3, 3], [4, 3, 3], [4, 3, 2]]
- saliency_scores contains the saliency score annotations; each sublist corresponds to a clip in relevant_clip_ids.
- e.g. 0: [4, 1, 1], 1: [4, 1, 1], ..., 7: [4, 3, 2]
- Each sublist has 3 elements, which are the scores from three different annotators.
- A score of 4 means Very Good, while 0 means Very Bad.
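As a rough sketch of how the per-clip scores line up with the clip ids, the snippet below pairs each sublist with its clip id and averages the three annotator scores. Averaging is an assumption made here for illustration; the notes above do not say how the scores should be aggregated, and mean_saliency is a hypothetical helper name.

```python
def mean_saliency(clip_ids, saliency_scores):
    """Hypothetical helper: clip id -> mean of the annotator scores."""
    if len(clip_ids) != len(saliency_scores):
        raise ValueError("expected one score sublist per clip id")
    return {
        clip_id: sum(scores) / len(scores)
        for clip_id, scores in zip(clip_ids, saliency_scores)
    }


# Fields copied from the example annotation above.
relevant_clip_ids = [0, 1, 2, 3, 4, 5, 6, 7]
saliency_scores = [
    [4, 1, 1], [4, 1, 1], [4, 2, 1], [4, 3, 2],
    [4, 3, 2], [4, 3, 3], [4, 3, 3], [4, 3, 2],
]

per_clip = mean_saliency(relevant_clip_ids, saliency_scores)
for clip_id, score in per_clip.items():
    print(clip_id, round(score, 2))
```

The length check guards against annotation records where the two lists have drifted out of sync.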
- For weakly supervised ASR pre-training
- In addition to the annotation files, the authors also provide the subtitle file used for weakly supervised ASR pre-training: subs_train.jsonl (https://github.com/jayleicn/moment_detr/blob/main/data/subs_train.jsonl).
- This file is formatted similarly to the annotation files, but without the saliency_scores entry.
- This file is not needed if you do not plan to pretrain models using it.
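A minimal sketch for reading such a file, assuming it follows the usual JSON Lines layout of one JSON object per line (suggested by the .jsonl extension, but not confirmed in these notes). The helper names parse_jsonl and load_jsonl are hypothetical, and the two sample lines are made-up stand-ins built from the annotation example above, not real entries from subs_train.jsonl.

```python
import json


def parse_jsonl(lines):
    """Parse an iterable of text lines, one JSON object per line."""
    records = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records


def load_jsonl(path):
    """Read a .jsonl file from disk (path is supplied by the caller)."""
    with open(path, "r", encoding="utf-8") as f:
        return parse_jsonl(f)


# Illustrative lines only: an annotation-style record and the same
# record without the saliency_scores entry.
sample_lines = [
    '{"qid": 8737, "vid": "bP5KfdFJzC4_660.0_810.0", '
    '"relevant_windows": [[0, 16]], "saliency_scores": [[4, 1, 1]]}',
    '{"qid": 8737, "vid": "bP5KfdFJzC4_660.0_810.0", '
    '"relevant_windows": [[0, 16]]}',
]
records = parse_jsonl(sample_lines)
has_saliency = ["saliency_scores" in r for r in records]
print(has_saliency)
```

Checking for the saliency_scores key, as in the last lines, is one simple way to tell the two record shapes apart when both kinds of file pass through the same loader.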
Charades-STA
TVSum