[Video Foundation Model] Paper Study List
FSA · November 20, 2024
Series: action recognition in videos (15/24)
VIDEO FOUNDATION MODEL
1. VideoChat: Chat-Centric Video Understanding
2023.05 · 566 citations
https://arxiv.org/pdf/2305.06355
https://github.com/OpenGVLab/Ask-Anything
3,100 stars
https://velog.io/@hsbc/VideoChat-Chat-Centric-Video-Understanding
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
2024.12
https://arxiv.org/pdf/2501.00574
https://github.com/OpenGVLab/VideoChat-Flash
274 stars
2. Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
https://arxiv.org/pdf/2306.02858
2023.06 · 771 citations
https://github.com/DAMO-NLP-SG/Video-LLaMA
2,900 stars
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
https://arxiv.org/pdf/2406.07476
2024.06 · 127 citations
https://github.com/DAMO-NLP-SG/VideoLLaMA2?tab=readme-ov-file
1,000 stars
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
https://arxiv.org/pdf/2501.13106
2025.01
https://github.com/DAMO-NLP-SG/VideoLLaMA3
150 stars
Revisiting Feature Prediction for Learning Visual Representations from Video
https://arxiv.org/pdf/2404.08471
2024.02 · 50 citations
https://github.com/facebookresearch/jepa?tab=readme-ov-file
2,700 stars
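
This is the V-JEPA paper: instead of reconstructing pixels, a context encoder is trained so that a predictor on top of it regresses the features a target encoder produces for masked spatiotemporal patches. A minimal PyTorch sketch of that loss, assuming toy encoder/predictor modules and illustrative dimensions (the real model uses ViT encoders with an EMA-updated target and a narrow transformer predictor):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the paper's ViT encoders; sizes are illustrative.
dim, num_tokens = 256, 196
context_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=2)
target_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=2)
predictor = nn.Linear(dim, dim)  # the real predictor is a narrow transformer

tokens = torch.randn(4, num_tokens, dim)   # patchified video clip (B, N, D)
mask = torch.rand(4, num_tokens) < 0.75    # ~75% of tokens are masked

with torch.no_grad():                      # target features carry no gradient
    targets = target_encoder(tokens)       # (EMA copy of the encoder in the paper)

# Context encoder sees only visible tokens (zeroed-out here for simplicity).
context = context_encoder(tokens * (~mask).unsqueeze(-1))
preds = predictor(context)

# L1 regression on the masked positions only.
loss = (preds - targets).abs()[mask].mean()
loss.backward()
```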
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
https://arxiv.org/abs/2408.04840
2024.08 · 45 citations
https://github.com/X-PLUG/mPLUG-Owl
2,400 stars
World Model on Million-Length Video And Language With Blockwise RingAttention
https://arxiv.org/pdf/2402.08268
2024.02 · 28 citations
https://github.com/LargeWorldModel/LWM?tab=readme-ov-file
7,200 stars
3. Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
2023.06 · 556 citations
https://arxiv.org/pdf/2306.05424
https://github.com/mbzuai-oryx/Video-ChatGPT
1,200 stars
5. VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
2023.04
https://arxiv.org/pdf/2303.16727
https://github.com/OpenGVLab/VideoMAEv2
520 stars
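
"Dual masking" in the title means masking on both sides of the autoencoder: the encoder sees only a small fraction of video tubes (as in the original VideoMAE), and the decoder reconstructs only a subset of the masked tokens rather than all of them, which is what made billion-parameter pre-training affordable. A rough PyTorch sketch of the index bookkeeping only, with illustrative ratios (not the paper's exact settings) and no actual encoder/decoder:

```python
import torch

B, N, D = 2, 1568, 768            # batch, tokens (e.g. 8x14x14 tubes), embed dim
enc_keep, dec_keep = 0.10, 0.50   # illustrative ratios, not the paper's settings

tokens = torch.randn(B, N, D)     # patchified video clip

# First mask (encoder): keep a random ~10% of tokens, drop the rest.
perm = torch.rand(B, N).argsort(dim=1)          # random permutation per sample
n_vis = int(N * enc_keep)
vis_idx, masked_idx = perm[:, :n_vis], perm[:, n_vis:]
visible = torch.gather(tokens, 1, vis_idx.unsqueeze(-1).expand(-1, -1, D))

# Second mask (decoder): reconstruct only ~50% of the masked positions,
# instead of running the decoder over every masked token.
n_dec = int(masked_idx.shape[1] * dec_keep)
dec_idx = torch.gather(
    masked_idx, 1,
    torch.rand_like(masked_idx, dtype=torch.float).argsort(dim=1)[:, :n_dec])

print(visible.shape)   # encoder input:            (2, 156, 768)
print(dec_idx.shape)   # decoder target positions: (2, 706)
```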
6. mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
2023.02
https://arxiv.org/pdf/2302.00402
https://github.com/alibaba/AliceMind/tree/main/mPLUG
The repo itself has 2,000 stars; the implementation of this paper is one component of it.
7. Unmasked Teacher: Towards Training-Efficient Video Foundation Models
2023.03
https://arxiv.org/pdf/2303.16058
https://github.com/OpenGVLab/unmasked_teacher
297 stars
8. AIM: Adapting Image Models for Efficient Video Action Recognition
2023.02
https://arxiv.org/pdf/2302.03024
https://github.com/taoyang1122/adapt-image-models
278 stars
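
AIM's recipe is parameter-efficient transfer: the pre-trained image ViT is kept frozen, and small bottleneck adapters (down-projection, nonlinearity, up-projection, plus a residual connection) are inserted and trained for video. A minimal sketch of such an adapter in PyTorch, with illustrative dimensions; the zero-initialized up-projection, which makes the adapter start as an identity function, is a common choice and stands in for the paper's exact initialization:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, GELU, up-project, residual."""
    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as identity so the frozen
        nn.init.zeros_(self.up.bias)     # backbone's behavior is preserved

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# Usage: freeze the backbone, train only the adapters.
backbone = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
for p in backbone.parameters():
    p.requires_grad = False

adapter = Adapter()
x = torch.randn(2, 196, 768)       # (batch, tokens, dim)
out = adapter(backbone(x))         # only the adapter receives gradient updates
```

In the paper the adapters sit inside each transformer block, around the (reused) spatial attention, an added temporal attention, and the MLP; applying one after a whole block above is purely for brevity.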
EVA: Visual Representation Fantasies from BAAI
https://github.com/baaivision/EVA
2,300 stars
EVA
https://openaccess.thecvf.com/content/CVPR2023/papers/Fang_EVA_Exploring_the_Limits_of_Masked_Visual_Representation_Learning_at_CVPR_2023_paper.pdf
2023 · 620 citations
EVA-02
https://arxiv.org/pdf/2303.11331
2024 · 192 citations
EVA-CLIP
https://arxiv.org/pdf/2303.15389
2023 · 365 citations
EVA-CLIP-18B
https://arxiv.org/pdf/2402.04252
2024 · 24 citations
9. SVFormer: Semi-supervised Video Transformer for Action Recognition
https://openaccess.thecvf.com/content/CVPR2023/papers/Xing_SVFormer_Semi-Supervised_Video_Transformer_for_Action_Recognition_CVPR_2023_paper.pdf
https://github.com/ChenHsing/SVFormer
84 stars
MOMENT RETRIEVAL
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
2024 · 72 citations
https://openaccess.thecvf.com/content/CVPR2024/papers/Ren_TimeChat_A_Time-sensitive_Multimodal_Large_Language_Model_for_Long_Video_CVPR_2024_paper.pdf
https://github.com/RenShuhuai-Andy/TimeChat
292 stars
UniVTG: Towards Unified Video-Language Temporal Grounding
2023 · 82 citations
https://openaccess.thecvf.com/content/ICCV2023/papers/Lin_UniVTG_Towards_Unified_Video-Language_Temporal_Grounding_ICCV_2023_paper.pdf
https://github.com/showlab/UniVTG
323 stars
Self-Chained Image-Language Model for Video Localization and Question Answering
2023 · 104 citations
https://proceedings.neurips.cc/paper_files/paper/2023/file/f22a9af8dbb348952b08bd58d4734b50-Paper-Conference.pdf
https://github.com/Yui010206/SeViLA
178 stars