[Video foundation model] Paper study list

FSA · November 20, 2024

VIDEO FOUNDATION MODEL

1. VideoChat: Chat-Centric Video Understanding

The Llama 3 Herd of Models


VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

Apollo: An exploration of video understanding in large multimodal models

ORYX MLLM: ON-DEMAND SPATIAL-TEMPORAL UNDERSTANDING AT ARBITRARY RESOLUTION

CogVLM2: Visual Language Models for Image and Video Understanding

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

PLLaVA: Parameter-free LLaVA extension from images to videos for video dense captioning

Long context transfer from language to vision

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

ShareGPT4Video: Improving video understanding and generation with better captions

ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

LLAVA-VIDEO: VIDEO INSTRUCTION TUNING WITH SYNTHETIC DATA

InternVL 2.5: Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

LLaVA-OneVision: Easy Visual Task Transfer

Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution

VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

2. Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

Revisiting Feature Prediction for Learning Visual Representations from Video

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

World Model on Million-Length Video And Language With Blockwise RingAttention

3. Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

5. VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

6. mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

7. Unmasked Teacher: Towards Training-Efficient Video Foundation Models

8. AIM: ADAPTING IMAGE MODELS FOR EFFICIENT VIDEO ACTION RECOGNITION

EVA: Visual Representation Fantasies from BAAI

9. SVFormer: Semi-supervised Video Transformer for Action Recognition

MOMENT RETRIEVAL

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

UniVTG: Towards Unified Video-Language Temporal Grounding

Self-Chained Image-Language Model for Video Localization and Question Answering
