QD-DETR - dataset

FSA · November 22, 2024

action recognition in videos

For all three datasets below, the data splits follow the strategy of papers [23] and [31].

QVHighlights

A dataset labeled for both moment retrieval (MR) and highlight detection (HD).
10000 videos annotated with human-written text queries.
dataset (from moment_detr repo): https://github.com/jayleicn/moment_detr/tree/main/data
Download: https://nlp.cs.unc.edu/data/jielei/qvh/qvhilights_videos.tar.gz
The annotation files:
- Each file is in JSON Lines (https://velog.io/@hsbc/JSON-Lines) format.
- Each row of a file can be loaded as a single dict in Python.
- An example row is shown below.

{
    "qid": 8737, 
    "query": "A family is playing basketball together on a green court outside.", 
    "duration": 126, 
    "vid": "bP5KfdFJzC4_660.0_810.0", 
    "relevant_windows": [[0, 16]],
    "relevant_clip_ids": [0, 1, 2, 3, 4, 5, 6, 7], 
    "saliency_scores": [[4, 1, 1], [4, 1, 1], [4, 2, 1], [4, 3, 2], [4, 3, 2], [4, 3, 3], [4, 3, 3], [4, 3, 2]]
}
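
Since each row is a standalone JSON object, a file can be read line by line. Below is a minimal sketch of one way to do that; the helper name load_jsonl and the path in the usage comment are placeholders of mine, not something taken from the moment_detr repo.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file and return a list with one dict per row."""
    records = []
    with Path(path).open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Usage (hypothetical path; point it at the annotation file you downloaded):
# rows = load_jsonl("path/to/annotations.jsonl")
# print(rows[0]["query"])
```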

qid: 8737

query: "A family is playing basketball together on a green court outside."

qid is a unique identifier of a query.
This query corresponds to a video identified by its video id vid.

vid: "bP5KfdFJzC4_660.0_810.0"

The vid is formatted as {youtube_id}_{start_time}_{end_time}.
Using this information, one can retrieve the YouTube video from the URL https://www.youtube.com/embed/{youtube_id}?start={start_time}&end={end_time}&version=3.
For example, the video in this example is https://www.youtube.com/embed/bP5KfdFJzC4?start=660&end=810&version=3.
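
The vid format can be taken apart with a small helper. This is only a sketch: parse_vid and embed_url are names I made up. It splits from the right, since a YouTube ID may itself contain underscores.

```python
def parse_vid(vid):
    """Split '{youtube_id}_{start_time}_{end_time}' into its three parts."""
    # Split from the right so underscores inside the YouTube ID are preserved.
    youtube_id, start, end = vid.rsplit("_", 2)
    return youtube_id, float(start), float(end)


def embed_url(vid):
    """Build the embed URL described above from a vid string."""
    youtube_id, start, end = parse_vid(vid)
    return (
        f"https://www.youtube.com/embed/{youtube_id}"
        f"?start={int(start)}&end={int(end)}&version=3"
    )


print(embed_url("bP5KfdFJzC4_660.0_810.0"))
# https://www.youtube.com/embed/bP5KfdFJzC4?start=660&end=810&version=3
```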

duration: 126

duration is an integer indicating the duration of this video.

relevant_windows: [[0, 16]]

relevant_windows is the list of windows that localize the moments.
- Each window has two numbers: the start time and the end time of the moment.

relevant_clip_ids: [0, 1, 2, 3, 4, 5, 6, 7]

relevant_clip_ids is the list of IDs of the segmented 2-second clips that fall into the moments specified by relevant_windows, starting from 0.
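
To see how the window [0, 16] relates to clip IDs 0 through 7, the mapping can be sketched as below. The function name windows_to_clip_ids is my own, and the rule "keep only clips that lie fully inside a window" is an assumption for illustration. It reproduces the example above but is not taken from the dataset's tooling.

```python
import math

CLIP_LEN = 2  # seconds per clip, as stated in the text above


def windows_to_clip_ids(windows, clip_len=CLIP_LEN):
    """Return sorted IDs of clips lying fully inside any [start, end] window."""
    clip_ids = set()
    for start, end in windows:
        first = math.ceil(start / clip_len)   # first clip starting at or after `start`
        last = math.floor(end / clip_len)     # exclusive: clip `last` would end after `end`
        clip_ids.update(range(first, last))
    return sorted(clip_ids)


print(windows_to_clip_ids([[0, 16]]))
# [0, 1, 2, 3, 4, 5, 6, 7]
```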

saliency_scores: [[4, 1, 1], [4, 1, 1], [4, 2, 1], [4, 3, 2], [4, 3, 2], [4, 3, 3], [4, 3, 3], [4, 3, 2]]

saliency_scores contains the saliency score annotations. Each sublist corresponds to a clip in relevant_clip_ids.
e.g. 0: [4, 1, 1], 1: [4, 1, 1], ..., 7: [4, 3, 2]
Each sublist has 3 elements, which are the scores from three different annotators.
A score of 4 means Very Good, while 0 means Very Bad.
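
Because the i-th sublist belongs to the i-th entry of relevant_clip_ids, the two lists can be zipped together. The sketch below averages the three annotator scores per clip. Averaging is my own choice for illustration (the text above does not say how the scores should be aggregated), and mean_saliency is a made-up name.

```python
def mean_saliency(entry):
    """Map each relevant clip ID to the mean of its annotator scores."""
    return {
        clip_id: sum(scores) / len(scores)
        for clip_id, scores in zip(entry["relevant_clip_ids"], entry["saliency_scores"])
    }


example = {
    "relevant_clip_ids": [0, 1, 2, 3, 4, 5, 6, 7],
    "saliency_scores": [[4, 1, 1], [4, 1, 1], [4, 2, 1], [4, 3, 2],
                        [4, 3, 2], [4, 3, 3], [4, 3, 3], [4, 3, 2]],
}
per_clip = mean_saliency(example)
print(per_clip[0], per_clip[7])
# 2.0 3.0
```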

Test set
In the test set:
- relevant_clip_ids, relevant_windows, and saliency_scores are not provided.
Please refer to ../standalone_eval/README.md(https://github.com/jayleicn/moment_detr/blob/main/standalone_eval/README.md) for details on evaluating predictions on test.

For weakly supervised ASR pre-training
- In addition to the annotation files, the moment_detr authors also provide a subtitle file for weakly supervised ASR pre-training: subs_train.jsonl(https://github.com/jayleicn/moment_detr/blob/main/data/subs_train.jsonl).
- This file is formatted similarly to the annotation files, but without the saliency_scores entry.
- This file is not needed if you do not plan to pretrain models using it.

Charades-STA

Used for moment retrieval (MR).

TVSum

Used for video summarization.
