[Video foundation model] Paper study list
FSA · November 20, 2024
action recognition in videos
VIDEO FOUNDATION MODEL
1. VideoChat: Chat-Centric Video Understanding
23, 5
566 citations
https://arxiv.org/pdf/2305.06355
https://github.com/OpenGVLab/Ask-Anything
3100 stars
https://velog.io/@hsbc/VideoChat-Chat-Centric-Video-Understanding
The Llama 3 Herd of Models
24, 7
2300 citations
https://arxiv.org/pdf/2407.21783
https://www.llama.com/
text, image, video + audio + speech modalities
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
25, 1
3 citations
https://arxiv.org/pdf/2501.01957
https://github.com/VITA-MLLM/VITA
2000 stars
text, image, video + audio + speech modalities
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
24, 7
70 citations
https://arxiv.org/pdf/2407.03320
https://github.com/InternLM/InternLM-XComposer
2700 stars
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
24, 12
3 citations
https://arxiv.org/pdf/2412.09596
https://github.com/InternLM/InternLM-XComposer
2700 stars
Apollo: An exploration of video understanding in large multimodal models
24, 12
6 citations
https://arxiv.org/pdf/2412.10360
No code released yet
ViLM built on an image-based encoder (see the sketch below, after the CogVLM2 entry)
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
24, 9
30 citations
https://arxiv.org/pdf/2409.12961
https://github.com/Oryx-mllm/Oryx
282 stars
ViLM built on an image-based encoder (see the sketch below, after the CogVLM2 entry)
CogVLM2: Visual Language Models for Image and Video Understanding
24, 8
57 citations
https://arxiv.org/pdf/2408.16500
https://github.com/THUDM/CogVLM2
2200 stars
https://github.com/THUDM/GLM-4
5800 stars
ViLM built on an image-based encoder; a rough sketch of this common pattern follows below
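To unpack the "ViLM built on an image-based encoder" tag used for Apollo, Oryx, and CogVLM2 above: the common recipe is to sample frames, encode each frame independently with an image encoder (typically a CLIP/SigLIP-style ViT), project the patch tokens into the LLM's embedding space, and concatenate them with the text tokens. The skeleton below is only a hedged illustration of that general pattern; the class, argument, and method names are made up for the example and are not taken from any of the repos linked above.

```python
import torch
import torch.nn as nn

class FrameEncoderViLM(nn.Module):
    """Illustrative image-encoder-based video LLM skeleton (hypothetical names)."""

    def __init__(self, image_encoder, llm, vision_dim, llm_dim):
        super().__init__()
        self.image_encoder = image_encoder               # per-frame ViT: (T, C, H, W) -> (T, N, vision_dim)
        self.projector = nn.Linear(vision_dim, llm_dim)  # map patch tokens into the LLM embedding space
        self.llm = llm                                   # decoder-only LM assumed to accept input embeddings

    def forward(self, frames, text_embeds):
        # frames: (T, C, H, W) sampled from the video; text_embeds: (L, llm_dim)
        patch_tokens = self.image_encoder(frames)        # (T, N, vision_dim)
        visual_tokens = self.projector(patch_tokens)     # (T, N, llm_dim)
        visual_tokens = visual_tokens.flatten(0, 1)      # concatenate frames along the token axis
        inputs = torch.cat([visual_tokens, text_embeds], dim=0)
        return self.llm(inputs_embeds=inputs.unsqueeze(0))  # HF-style interface assumed
```

The trade-off this pattern implies: temporal reasoning is left entirely to the LLM, and the visual token count grows linearly with the number of sampled frames, which is why entries like LongVU and VideoChat-Flash below focus on token compression.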
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
24, 10
16 citations
https://arxiv.org/pdf/2410.17434
https://github.com/Vision-CAIR/LongVU
347 stars
PLLaVA: Parameter-free LLaVA extension from images to videos for video dense captioning
24, 4
95 citations
https://github.com/magic-research/PLLaVA
640 stars
Long context transfer from language to vision
24, 6
65 citations
https://arxiv.org/pdf/2406.16852
https://github.com/EvolvingLMMs-Lab/LongVA
359 stars
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
24, 12
https://arxiv.org/pdf/2501.00599
https://github.com/DAMO-NLP-SG/VideoRefer
142 stars
ShareGPT4Video: Improving video understanding and generation with better captions
https://arxiv.org/pdf/2406.04325
24, 6
90 citations
https://github.com/ShareGPT4Omni/ShareGPT4Video
1000 stars
RoFormer: Enhanced Transformer with Rotary Position Embedding
https://arxiv.org/pdf/2104.09864
21, 4
1827 citations
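This one is on the list because rotary position embedding (RoPE) is the positional scheme used by the LLM backbones of most models above. Below is a minimal NumPy sketch of the core idea, written from the paper's formulation rather than any repo's code: adjacent feature dimensions are paired and each pair is rotated by an angle proportional to the token position, so query-key dot products end up depending only on relative position.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Minimal rotary position embedding sketch (RoFormer-style pairing of dims)."""
    # x: (seq_len, dim) query or key vectors, dim even; positions: (seq_len,) token indices.
    positions = np.asarray(positions, dtype=np.float64)
    seq_len, dim = x.shape
    half = dim // 2
    theta = base ** (-np.arange(half) / half)        # frequency for dim pair (2i, 2i+1)
    angles = positions[:, None] * theta[None, :]     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    rotated = np.stack([x_even * cos - x_odd * sin,  # standard 2D rotation per pair
                        x_even * sin + x_odd * cos], axis=-1)
    return rotated.reshape(seq_len, dim)
```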
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
23, 12
172 citations
https://arxiv.org/pdf/2312.14238
https://github.com/OpenGVLab/InternVL
7000 stars
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
24, 4
351 citations
https://arxiv.org/pdf/2404.16821
https://github.com/OpenGVLab/InternVL
7000 stars
LLaVA-Video: Video Instruction Tuning with Synthetic Data
24, 10
37 citations
https://arxiv.org/pdf/2410.02713
https://github.com/LLaVA-VL/LLaVA-NeXT
InternVL 2.5: Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
24, 12
25 citations
https://arxiv.org/pdf/2412.05271
https://github.com/OpenGVLab/InternVL
LLaVA-OneVision: Easy Visual Task Transfer
24, 8
250 citations
https://arxiv.org/pdf/2408.03326
https://github.com/LLaVA-VL/LLaVA-NeXT
Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution
24, 9
240 citations
https://arxiv.org/pdf/2409.12191
https://github.com/QwenLM/Qwen2.5-VL
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
24, 12
https://arxiv.org/pdf/2501.00574
https://github.com/OpenGVLab/VideoChat-Flash
274 stars
2. Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
https://arxiv.org/pdf/2306.02858
23, 6
771 citations
https://github.com/DAMO-NLP-SG/Video-LLaMA
2900 stars
ViLM built on a video encoder (rough pattern sketched below)
audio modality
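In contrast with the per-frame pattern sketched earlier, the "video encoder + audio modality" note means the visual branch encodes a clip as a whole (temporal modeling happens before the LLM) and a separate audio branch feeds the same LLM. Again a hedged sketch of the general shape with made-up names, not the actual Video-LLaMA architecture code.

```python
import torch
import torch.nn as nn

class VideoAudioLLM(nn.Module):
    """Illustrative clip-level video + audio LLM skeleton (hypothetical names)."""

    def __init__(self, video_encoder, audio_encoder, llm, vid_dim, aud_dim, llm_dim):
        super().__init__()
        self.video_encoder = video_encoder            # whole clip -> (Tv, vid_dim) temporal tokens
        self.audio_encoder = audio_encoder            # waveform/spectrogram -> (Ta, aud_dim) tokens
        self.video_proj = nn.Linear(vid_dim, llm_dim)
        self.audio_proj = nn.Linear(aud_dim, llm_dim)
        self.llm = llm                                # decoder-only LM assumed to accept input embeddings

    def forward(self, clip, audio, text_embeds):
        vid_tokens = self.video_proj(self.video_encoder(clip))    # (Tv, llm_dim)
        aud_tokens = self.audio_proj(self.audio_encoder(audio))   # (Ta, llm_dim)
        inputs = torch.cat([vid_tokens, aud_tokens, text_embeds], dim=0)
        return self.llm(inputs_embeds=inputs.unsqueeze(0))        # HF-style interface assumed
```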
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
https://arxiv.org/pdf/2406.07476
24, 6
127 citations
https://github.com/DAMO-NLP-SG/VideoLLaMA2?tab=readme-ov-file
1000 stars
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
https://arxiv.org/pdf/2501.13106
25, 1
https://github.com/DAMO-NLP-SG/VideoLLaMA3
150 stars
Revisiting Feature Prediction for Learning Visual Representations from Video
https://arxiv.org/pdf/2404.08471
24, 2
50 citations
https://github.com/facebookresearch/jepa?tab=readme-ov-file
2700 stars
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
https://arxiv.org/abs/2408.04840
24, 8
45 citations
https://github.com/X-PLUG/mPLUG-Owl
2400 stars
World Model on Million-Length Video And Language With Blockwise RingAttention
https://arxiv.org/pdf/2402.08268
24, 2
28 citations
https://github.com/LargeWorldModel/LWM?tab=readme-ov-file
7200 stars
3. Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
23, 6
556 citations
https://arxiv.org/pdf/2306.05424
https://github.com/mbzuai-oryx/Video-ChatGPT
1200 stars
5. VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
2023, 4
https://arxiv.org/pdf/2303.16727
https://github.com/OpenGVLab/VideoMAEv2
520 stars
6. mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
2023, 2
https://arxiv.org/pdf/2302.00402
https://github.com/alibaba/AliceMind/tree/main/mPLUG
The repo itself has 2000 stars; an implementation of this paper is included as one part of it.
7. Unmasked Teacher: Towards Training-Efficient Video Foundation Models
2024, 3
https://arxiv.org/pdf/2303.16058
https://github.com/OpenGVLab/unmasked_teacher
297 stars
8. AIM: ADAPTING IMAGE MODELS FOR EFFICIENT VIDEO ACTION RECOGNITION
2023, 2
https://arxiv.org/pdf/2302.03024
https://github.com/taoyang1122/adapt-image-models
278 stars
EVA: Visual Representation Fantasies from BAAI
https://github.com/baaivision/EVA
2300 stars
eva
https://openaccess.thecvf.com/content/CVPR2023/papers/Fang_EVA_Exploring_the_Limits_of_Masked_Visual_Representation_Learning_at_CVPR_2023_paper.pdf
2023, 620 citations
eva2
https://arxiv.org/pdf/2303.11331
2024, 192 citations
eva-clip
https://arxiv.org/pdf/2303.15389
2023, 365 citations
eva-clip 2
https://arxiv.org/pdf/2402.04252
2024, 24 citations
9. SVFormer: Semi-supervised Video Transformer for Action Recognition
https://openaccess.thecvf.com/content/CVPR2023/papers/Xing_SVFormer_Semi-Supervised_Video_Transformer_for_Action_Recognition_CVPR_2023_paper.pdf
https://github.com/ChenHsing/SVFormer
84 stars
MOMENT RETRIEVAL
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
2024, 72 citations
https://openaccess.thecvf.com/content/CVPR2024/papers/Ren_TimeChat_A_Time-sensitive_Multimodal_Large_Language_Model_for_Long_Video_CVPR_2024_paper.pdf
https://github.com/RenShuhuai-Andy/TimeChat
292 stars
UniVTG: Towards Unified Video-Language Temporal Grounding
2023, 82 citations
https://openaccess.thecvf.com/content/ICCV2023/papers/Lin_UniVTG_Towards_Unified_Video-Language_Temporal_Grounding_ICCV_2023_paper.pdf
https://github.com/showlab/UniVTG
323 stars
Self-Chained Image-Language Model for Video Localization and Question Answering
2023, 104 citations
https://proceedings.neurips.cc/paper_files/paper/2023/file/f22a9af8dbb348952b08bd58d4734b50-Paper-Conference.pdf
https://github.com/Yui010206/SeViLA
178 stars