[Video foundation model] Paper study list

FSA · November 20, 2024

VIDEO FOUNDATION MODEL

1. VideoChat: Chat-Centric Video Understanding

The Llama 3 Herd of Models


VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

Apollo: An exploration of video understanding in large multimodal models

ORYX MLLM: ON-DEMAND SPATIAL-TEMPORAL UNDERSTANDING AT ARBITRARY RESOLUTION

CogVLM2: Visual Language Models for Image and Video Understanding

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

PLLaVA: Parameter-free LLaVA extension from images to videos for video dense captioning

Long context transfer from language to vision

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

ShareGPT4Video: Improving video understanding and generation with better captions

ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

LLAVA-VIDEO: VIDEO INSTRUCTION TUNING WITH SYNTHETIC DATA

InternVL 2.5: Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

LLaVA-OneVision: Easy Visual Task Transfer

Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution

VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

2. Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

Revisiting Feature Prediction for Learning Visual Representations from Video

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

World Model on Million-Length Video And Language With Blockwise RingAttention

3. Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

5. VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

6. mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

7. Unmasked Teacher: Towards Training-Efficient Video Foundation Models

8. AIM: ADAPTING IMAGE MODELS FOR EFFICIENT VIDEO ACTION RECOGNITION

EVA: Visual Representation Fantasies from BAAI

9. SVFormer: Semi-supervised Video Transformer for Action Recognition

MOMENT RETRIEVAL

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

UniVTG: Towards Unified Video-Language Temporal Grounding

Self-Chained Image-Language Model for Video Localization and Question Answering
