[Video Foundation Model] Paper Study List
FSA · November 20, 2024
Series: action recognition in videos (15/24)
VIDEO FOUNDATION MODEL
1. VideoChat: Chat-Centric Video Understanding
2023.05 · 566 citations
https://arxiv.org/pdf/2305.06355
https://github.com/OpenGVLab/Ask-Anything
3,100 stars
https://velog.io/@hsbc/VideoChat-Chat-Centric-Video-Understanding
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
2024.12
https://arxiv.org/pdf/2501.00574
https://github.com/OpenGVLab/VideoChat-Flash
274 stars
2. Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
https://arxiv.org/pdf/2306.02858
2023.06 · 771 citations
https://github.com/DAMO-NLP-SG/Video-LLaMA
2,900 stars
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
https://arxiv.org/pdf/2406.07476
2024.06 · 127 citations
https://github.com/DAMO-NLP-SG/VideoLLaMA2?tab=readme-ov-file
1,000 stars
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
https://arxiv.org/pdf/2501.13106
2025.01
https://github.com/DAMO-NLP-SG/VideoLLaMA3
150 stars
Revisiting Feature Prediction for Learning Visual Representations from Video
https://arxiv.org/pdf/2404.08471
2024.02 · 50 citations
https://github.com/facebookresearch/jepa?tab=readme-ov-file
2,700 stars
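
This is the V-JEPA paper: instead of reconstructing pixels, a context encoder is trained so that a predictor on top of it regresses the features a target encoder produces for masked spatiotemporal patches. A minimal PyTorch sketch of that loss, assuming toy encoder/predictor modules and illustrative dimensions (the real model uses ViT encoders with an EMA-updated target and a narrow transformer predictor):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the paper's ViT encoders; sizes are illustrative.
dim, num_tokens = 256, 196
context_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=2)
target_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=2)
predictor = nn.Linear(dim, dim)  # the real predictor is a narrow transformer

tokens = torch.randn(4, num_tokens, dim)   # patchified video clip (B, N, D)
mask = torch.rand(4, num_tokens) < 0.75    # ~75% of tokens are masked

with torch.no_grad():                      # target features carry no gradient
    targets = target_encoder(tokens)       # (EMA copy of the encoder in the paper)

# Context encoder sees only visible tokens (zeroed-out here for simplicity).
context = context_encoder(tokens * (~mask).unsqueeze(-1))
preds = predictor(context)

# L1 regression on the masked positions only.
loss = (preds - targets).abs()[mask].mean()
loss.backward()
```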
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
https://arxiv.org/abs/2408.04840
2024.08 · 45 citations
https://github.com/X-PLUG/mPLUG-Owl
2,400 stars
World Model on Million-Length Video And Language With Blockwise RingAttention
https://arxiv.org/pdf/2402.08268
2024.02 · 28 citations
https://github.com/LargeWorldModel/LWM?tab=readme-ov-file
7,200 stars
3. Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
2023.06 · 556 citations
https://arxiv.org/pdf/2306.05424
https://github.com/mbzuai-oryx/Video-ChatGPT
1,200 stars
5. VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
2023.04
https://arxiv.org/pdf/2303.16727
https://github.com/OpenGVLab/VideoMAEv2
520 stars
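
"Dual masking" in the title means masking on both sides of the autoencoder: the encoder sees only a small fraction of video tubes (as in the original VideoMAE), and the decoder reconstructs only a subset of the masked tokens rather than all of them, which is what made billion-parameter pre-training affordable. A rough PyTorch sketch of the index bookkeeping only, with illustrative ratios (not the paper's exact settings) and no actual encoder/decoder:

```python
import torch

B, N, D = 2, 1568, 768            # batch, tokens (e.g. 8x14x14 tubes), embed dim
enc_keep, dec_keep = 0.10, 0.50   # illustrative ratios, not the paper's settings

tokens = torch.randn(B, N, D)     # patchified video clip

# First mask (encoder): keep a random ~10% of tokens, drop the rest.
perm = torch.rand(B, N).argsort(dim=1)          # random permutation per sample
n_vis = int(N * enc_keep)
vis_idx, masked_idx = perm[:, :n_vis], perm[:, n_vis:]
visible = torch.gather(tokens, 1, vis_idx.unsqueeze(-1).expand(-1, -1, D))

# Second mask (decoder): reconstruct only ~50% of the masked positions,
# instead of running the decoder over every masked token.
n_dec = int(masked_idx.shape[1] * dec_keep)
dec_idx = torch.gather(
    masked_idx, 1,
    torch.rand_like(masked_idx, dtype=torch.float).argsort(dim=1)[:, :n_dec])

print(visible.shape)   # encoder input:            (2, 156, 768)
print(dec_idx.shape)   # decoder target positions: (2, 706)
```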
6. mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
2023.02
https://arxiv.org/pdf/2302.00402
https://github.com/alibaba/AliceMind/tree/main/mPLUG
The repo itself has 2,000 stars; the implementation of this paper is one component of it.
7. Unmasked Teacher: Towards Training-Efficient Video Foundation Models
2023.03
https://arxiv.org/pdf/2303.16058
https://github.com/OpenGVLab/unmasked_teacher
297 stars
8. AIM: Adapting Image Models for Efficient Video Action Recognition
2023.02
https://arxiv.org/pdf/2302.03024
https://github.com/taoyang1122/adapt-image-models
278 stars
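
AIM's recipe is parameter-efficient transfer: the pre-trained image ViT is kept frozen, and small bottleneck adapters (down-projection, nonlinearity, up-projection, plus a residual connection) are inserted and trained for video. A minimal sketch of such an adapter in PyTorch, with illustrative dimensions; the zero-initialized up-projection, which makes the adapter start as an identity function, is a common choice and stands in for the paper's exact initialization:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, GELU, up-project, residual."""
    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as identity so the frozen
        nn.init.zeros_(self.up.bias)     # backbone's behavior is preserved

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# Usage: freeze the backbone, train only the adapters.
backbone = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
for p in backbone.parameters():
    p.requires_grad = False

adapter = Adapter()
x = torch.randn(2, 196, 768)       # (batch, tokens, dim)
out = adapter(backbone(x))         # only the adapter receives gradient updates
```

In the paper the adapters sit inside each transformer block, around the (reused) spatial attention, an added temporal attention, and the MLP; applying one after a whole block above is purely for brevity.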
EVA: Visual Representation Fantasies from BAAI
https://github.com/baaivision/EVA
2,300 stars
EVA
https://openaccess.thecvf.com/content/CVPR2023/papers/Fang_EVA_Exploring_the_Limits_of_Masked_Visual_Representation_Learning_at_CVPR_2023_paper.pdf
2023 · 620 citations
EVA-02
https://arxiv.org/pdf/2303.11331
2024 · 192 citations
EVA-CLIP
https://arxiv.org/pdf/2303.15389
2023 · 365 citations
EVA-CLIP-18B
https://arxiv.org/pdf/2402.04252
2024 · 24 citations
9. SVFormer: Semi-supervised Video Transformer for Action Recognition
https://openaccess.thecvf.com/content/CVPR2023/papers/Xing_SVFormer_Semi-Supervised_Video_Transformer_for_Action_Recognition_CVPR_2023_paper.pdf
https://github.com/ChenHsing/SVFormer
84 stars
MOMENT RETRIEVAL
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
2024 · 72 citations
https://openaccess.thecvf.com/content/CVPR2024/papers/Ren_TimeChat_A_Time-sensitive_Multimodal_Large_Language_Model_for_Long_Video_CVPR_2024_paper.pdf
https://github.com/RenShuhuai-Andy/TimeChat
292 stars
UniVTG: Towards Unified Video-Language Temporal Grounding
2023 · 82 citations
https://openaccess.thecvf.com/content/ICCV2023/papers/Lin_UniVTG_Towards_Unified_Video-Language_Temporal_Grounding_ICCV_2023_paper.pdf
https://github.com/showlab/UniVTG
323 stars
Self-Chained Image-Language Model for Video Localization and Question Answering
2023 · 104 citations
https://proceedings.neurips.cc/paper_files/paper/2023/file/f22a9af8dbb348952b08bd58d4734b50-Paper-Conference.pdf
https://github.com/Yui010206/SeViLA
178 stars