CMU MMML - Lecture 1.2 Datasets

Hwangbo Gyeom · April 22, 2023

[Study] CMU MMML Lecture

Multimodal Research Tasks

1980 ~ 1990 : Audio-visual speech recognition
1990 ~ 2000 :
1) Content-based video retrieval. Much more digital video was becoming available at this time.
2) Affect and emotion recognition. "Affective Computing" was born.
2000 ~ 2010 :
1) Video event recognition (TRECVID)
2) Multimodal sentiment analysis
2010 ~ 2015 : Image Captioning. "Language and Vision" research born.
2015 ~ 2016 :
1) Video captioning & "grounding"
2) Visual question answering (image-based). For example, you ask about a specific area in the image.
2016 ~ 2017 : Video QA & referring expressions
2017 ~ 2018 :
1) Multimodal Dialogue
2) Large-scale video event retrieval (e.g. YouTube8M)
2018 ~ 2019 :
1) Language, vision and Navigation
2) Self-driving multimodal navigation

More recently, these methods have been applied to robotics, healthcare, education, and so on.

Affective Computing?
Affective computing is the study and development of systems and devices that can recognize, interpret, process, and simulate human affects. It is an interdisciplinary field spanning computer science, psychology, and cognitive science. Affective computing technologies sense the emotional state of a user (via sensors, microphone, cameras and/or software logic).

Grounding?
Marking the specific part of the image that a user's description refers to, typically with a bounding box.

Real world tasks tackled by MMML

A. Affect Recognition

helps computers to understand emotion, personality, sentiment, etc.

Emotion
Personalities
Sentiment

B. Media Description

Describes the media that is given.

Image and video captioning

C. Multimodal QA

answering a question about the image/video.

Image and video QA
Visual reasoning

D. Multimodal Navigation

combines reinforcement learning and robotics with understanding language/vision.

Language guided navigation
Autonomous driving

E. Multimodal Dialog

Grounded Dialog

F. Event Recognition

recognize the activity in the video, and also segment out the part of the video where that activity occurs

Action recognition
Segmentation

G. Multimedia Information Retrieval
retrieve similar image or video

Content-based / cross-media retrieval

Question from Student
Q. Is it possible to generate an image from a caption?
A. It is a very, very difficult problem. You have to generate high-dimensional images, and it is also very hard to evaluate what makes a good image.

Affective Computing

Affective states :
emotions, moods, and feelings.
Common Topics : recognizes emotions. Anger, Fear, Shame, Positivity, etc.

Cognitive states :
thinking and information processing.
Can the computer tell whether a person is thinking, concentrating, curious, and so on?
Common Topics : Engagement, Interest, Surprise, Agreement, Doubt, etc.

Personality :
patterns of acting, feeling, and thinking.
Needs long-term reasoning: a judgement, formed over a long period, of the personality a person might have.
Common Topics : Long-term reasoning. Outgoing, Moody, Artistic, etc.

Pathology :
health, functioning, and disorders.
Looks at a person and decides whether they are prone to various mental health disorders.
Common Topics : Depression, Trauma, Antagonism, Detachment, etc.

Social processes :
groups, cultures, and perception.
Looks at how humans interact with each other in order to reason about these social processes.
Common Topics : Understands social relations. Rapport, Cooperation, etc.

Audio-visual Emotion Challenge 2011/2012
goal : take audio-visual content and be able to predict both discrete and continuous emotions.

Part of larger SEMAINE corpus
Sensitive Artificial Listener paradigm
Labeled for four dimensional emotions (per frame)
- Arousal, expectancy, power, valence
- These are annotated on a continuous spectrum, which makes it a regression problem.
Has Transcripts
- recent datasets have transcripts.
- Can process texts without looking at the audio features.

Audio-visual Emotion Challenge 2013/2014

Reading specific text in a subset of videos
Labeled for emotion per frame (valence, arousal, dominance)
Performing HCI task
- Reading aloud a text in German
- Responding to a number of questions
100 audio-visual sessions
Provide extracted audio-visual features

Audio-visual Emotion Challenge 2015/2016

RECOLA dataset
Audio-Visual emotion recognition
Labeled for dimensional emotion per frame(arousal, valence)
Includes physiological data
27 participants
French, audio, video, ECG, and EDA
Collaboration task in video conference
Broader range of emotive expressions

It is a very popular dataset for 2 reasons.
1) It contains fine-grained annotations at the frame level.

By contrast, many datasets provide only one single annotation for an entire video.

2) Apart from audio and visual features, it also includes physiological data.

One of them is electrocardiography (ECG).
- a graph of voltage vs. time of the electrical activity of the heart.
Models can integrate verbal features, non-verbal features, and these physiological data sources to get better predictions.

Multimodal Sentiment Analysis

Multimodal Corpus of Sentiment Intensity and subjectivity Analysis in Online Opinion Videos(MOSI)
89 speakers with 2199 opinion segments
Audio-visual data with transcriptions
Labels for sentiment/opinion
- Subjective vs. Objective
- Positive vs. Negative

Sentiment?
A concept in language: whether a person is reacting positively or negatively, for example to the subject of a particular video.

People use non-verbal gestures when they speak.

Multimodal Sentiment Analysis

Multimodal sentiment and emotion recognition
CMU-MOSEI : 23,453 annotated video segments from 1,000 distinct speakers and 250 topics.
Fine-grained five-class sentiment and also discrete emotions.

Multi-Party Emotion Recognition

MELD : Multi-party dataset for emotion recognition in conversations
Dataset collected from the 'Friends' TV Show, annotating each of the characters separately.

Goal : look at overall sentiment, and emotions of the entire conversation.

Question from student
Q. Does it make sense to have emotion labels at a frame level?
A. Features from images, language, and so on are not necessarily going to be aligned. Not all the modalities line up perfectly at the same time scale, so you have to deal with alignment problems.
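
One common way to handle this mismatch is to resample one modality onto the time scale of another. The sketch below is only an illustration, not something shown in the lecture: it averages frame-level visual features over each word's time span so that vision and text share word-level steps. The function name, the constant frame rate, and the toy data are all assumptions.

```python
def align_frames_to_words(frame_feats, fps, word_spans):
    """Average frame-level features over each word's time span.

    frame_feats: list of feature vectors, one per video frame
    fps: frames per second (assumed constant)
    word_spans: list of (start_sec, end_sec), one per transcribed word
    """
    aligned = []
    for start, end in word_spans:
        first = int(start * fps)
        last = max(first + 1, int(end * fps))  # cover at least one frame
        window = frame_feats[first:last]
        if not window:  # the word lies beyond the last available frame
            aligned.append(None)
            continue
        dim = len(window[0])
        aligned.append([sum(f[d] for f in window) / len(window) for d in range(dim)])
    return aligned


# Toy example: 4 frames at 2 fps with 1-D features, and two one-second words.
frames = [[0.0], [1.0], [2.0], [3.0]]
spans = [(0.0, 1.0), (1.0, 2.0)]
word_level = align_frames_to_words(frames, fps=2, word_spans=spans)
print(word_level)
```

Averaging is the simplest choice; it discards within-word dynamics, which is one reason alignment remains an open challenge rather than a solved preprocessing step.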

What are the Core Challenges Most Involved in Affective Recognition?
1) Representation

you need to have suitable levels of extraction to get really good multimodal representations

2) Alignment

Most of these datasets are temporal datasets consisting of videos. This is a very big challenge, because you have to align verbal and non-verbal signals (different modalities).

3) Fusion

you have multiple data sources, some of which are more useful than others at different time periods.
goal : how to best leverage all these data sources to make a prediction.

4) Co-Learning

area which has arisen in affective computing.
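
To make the fusion challenge above concrete, here is a minimal sketch of weighted late fusion, where each modality makes its own prediction and the predictions are combined with reliability weights. This is one simple technique chosen for illustration, not the method from the lecture; the function name, the weights, and the scores are hypothetical.

```python
def late_fusion(predictions, weights):
    """Combine per-modality predictions using normalized weights.

    predictions: dict of modality -> predicted score (e.g. sentiment in [-1, 1])
    weights: dict of modality -> non-negative reliability weight
    """
    total = sum(weights[m] for m in predictions)
    if total == 0:
        raise ValueError("at least one modality needs a positive weight")
    return sum(predictions[m] * weights[m] for m in predictions) / total


preds = {"text": 0.8, "audio": 0.2, "video": -0.4}
# Hypothetical weights: at this time step the audio is noisy, so it counts less.
w = {"text": 2.0, "audio": 0.5, "video": 1.5}
fused = late_fusion(preds, w)
print(round(fused, 3))
```

In practice the weights would change over time and be learned rather than set by hand, which is exactly what makes fusion hard.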

Media Description

Media description produces a free-form text description of a given piece of media (an image, a video, or an audio-visual clip).
It is one of the main enablers of multi-modal machine learning research in terms of really large datasets that have images and captions.

Large-Scale Image Captioning Dataset

Microsoft Common Objects in COntext (MS COCO)
120,000 images
Each image is accompanied by five free-form sentences describing it (at least 8 words each)
Sentences were collected using crowdsourcing (Mechanical Turk)
Also contains object detections, boundaries, and keypoints.
It is one of the datasets that led the boom in multimodal research.

Evaluating Image Caption Generations

Has an evaluation server
- Training and validation - 80k images(400k captions)
- Testing - 40k images (380k captions). A subset contains more captions for better evaluation. These are kept private (to avoid over-fitting and cheating).
Evaluation is difficult as there is no one 'correct' answer for describing an image in a sentence. This is the main challenge of media description.
Given a candidate sentence, it is evaluated against a set of "ground truth" sentences.
goal : take a candidate sentence and evaluate it against a set of ground truth sentences, using either human judges or automatic evaluation metrics.

Evaluating Image Captioning Results

A challenge was done with actual human evaluations of the captions(CVPR 2015)
Main Challenge : annotation.
- You don't want to build a model and then have humans annotate its outputs every time. Human labels are very expensive.
What about automatic evaluation?
- Human labels are expensive...
Have automatic ways to evaluate
- There must be a clear correlation between automatic and human evaluations.
- CIDEr-D, METEOR, ROUGE, BLEU
- The automatic metrics above do not correspond well to human judgments.
- Under the automatic metrics, models can score higher than human-written captions.
  - Clearly, there are issues with automatic metrics. They allow models to overfit to the training data.
Lots of research is ongoing on building better models, and also on evaluation metrics that are more reliable.
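
To show what evaluating a candidate against a set of ground truth sentences can look like, the sketch below computes clipped unigram precision, the basic building block of BLEU. It is a simplified illustration only: full BLEU also uses higher-order n-grams and a brevity penalty, and the example sentences are invented.

```python
from collections import Counter


def clipped_unigram_precision(candidate, references):
    """Fraction of candidate words that appear in the references.

    Each word is credited at most as many times as it occurs in the
    single reference where it is most frequent (the "clipping" in BLEU).
    """
    cand_counts = Counter(candidate.lower().split())
    max_ref = Counter()
    for ref in references:
        for word, n in Counter(ref.lower().split()).items():
            max_ref[word] = max(max_ref[word], n)
    matched = sum(min(n, max_ref[word]) for word, n in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0


refs = ["a dog runs on the grass", "the dog is running outside"]
good = clipped_unigram_precision("a dog runs outside", refs)
odd = clipped_unigram_precision("the the the the", refs)
print(good, odd)
```

Word overlap rewards captions that reuse reference vocabulary and says nothing about whether the caption is actually true of the image, which is one reason such metrics can disagree with human judgments.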

Video Captioning

Alignment is a challenge, since the description can come after the video segment it describes.
Only one single caption per clip, which makes evaluation a challenge.

How to Address the Challenge of Evaluation

Referring Expressions : Generate / Comprehend a noun phrase which identifies a particular object in an image.

What are the Core Challenges Most Involved in Media Description

1) Representation

You need a good representation that is aligned between the two modalities to be able to perform translation accurately.

2) Translation

The goal is to map data from one modality to semantically meaningful, high-dimensional data in another modality, e.g. images to captions.

3) Alignment

Multimodal QA

Motivation

Addresses some of the challenges of image captioning and video captioning, where generated text is really hard to evaluate.

Visual

Task : Given an image and a question, answer the question.

Multimodal QA dataset1 - VQA(C1)

Real Images
- 200k MS COCO images
- 600k questions
- 6M answers
- 1.8M plausible answers
Abstract Images
- 50k scenes
- 150k questions
- 1.5M answers
- 450k plausible answers

VQA Challenge 2016 and 2017 (C1)

Two challenges organized these past two years
Models are currently good at yes/no questions, but not so good at free-form and counting questions (e.g. how many yellow bananas are there?)

VQA 2.0

Just guessing without looking at the image leads to ~51% accuracy
- So the V in VQA "only" adds a 14% increase in accuracy
If 80% of the banana images used for training are yellow, the model becomes biased (even if the banana is green, it will answer yellow).
VQA 2.0 tries to solve this bias problem by pairing the same question with several images for which the answer is different.
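
The bias described above can be reproduced with a language-only baseline that never looks at the image. The sketch below is a toy illustration under stated assumptions: the data is invented to match the 80% yellow banana example, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def fit_question_prior(train_pairs):
    """Map each question to its most frequent training answer (no image used)."""
    counts = defaultdict(Counter)
    for question, answer in train_pairs:
        counts[question][answer] += 1
    return {q: c.most_common(1)[0][0] for q, c in counts.items()}


def accuracy(prior, test_pairs, fallback="yes"):
    """Share of test questions the image-blind prior answers correctly."""
    correct = sum(prior.get(q, fallback) == a for q, a in test_pairs)
    return correct / len(test_pairs)


q = "what color is the banana?"
train = [(q, "yellow")] * 8 + [(q, "green")] * 2       # 80% yellow
biased_test = [(q, "yellow")] * 4 + [(q, "green")]     # same skew as training
balanced_test = [(q, "yellow"), (q, "green")]          # same question, different answers

prior = fit_question_prior(train)
biased_acc = accuracy(prior, biased_test)
balanced_acc = accuracy(prior, balanced_test)
print(biased_acc, balanced_acc)
```

On the skewed test set the blind baseline looks strong, while on the balanced set it falls to chance, which is the effect the balanced pairing is designed to expose.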

Multimodal QA - other VQA datasets (C7)

TVQA
- Video QA dataset based on 6 popular TV shows
- 152.5K QA pairs from 21.8k clips
- Compositional questions

Multimodal QA - Visual Reasoning (C8)

VCR : Visual Commonsense Reasoning
- Model must answer challenging visual questions expressed in language
- and provide a rationale explaining why its answer is true.

Social-IQ (A10)

Social-IQ : 1.2K videos, 7.5K questions, 50K answers
Questions and answers centered around social behaviors

Question from student
Q. In VQA, do we manually balance the data so that it does not have any bias?
A. There is no universal answer to this. There are 2 ways of making a non-biased model:
1) Take a closer look at the dataset, balance it, and make sure that it's not biased.
2) Leave the data as it is, and look at how a model can take in biased data but still learn representations that are not biased.

What are the Core Challenges Most Involved in Multimodal QA?
Very similar to media description retrieval.
1) Representation
2) Alignment
3) Translation
4) Fusion

Main Additional Challenge:

You must also comprehend the question and localize which part of the image or video it corresponds to.
- which makes alignment a bigger challenge.

Embodied Assistive Agents
The next generation of AI assistants needs to interact with the real (or virtual) world.
e.g. personal assistants, robots, etc.

Language, Vision and Actions
challenge :

In addition to just understanding language and vision, we also have to take actions in the real-world.
- So, really about multimodal perception of the environment and also linking that to action.

Many Technical Challenges
There is an action, language, and vision loop that you have to execute over possibly many time steps before completing the task.

Navigating in a Virtual House
Visually-grounded natural language navigation in real buildings

Room-2-Room : 21,567 open vocabulary, crowd-sourced navigation instructions
Refer360 Dataset : Multiple Step Instructions

What are the Core Challenges Most Involved in Multimodal Navigation?

1, 2) Representation and Fusion
The challenges are primarily representation and fusion, both for reasoning about language and for reasoning about the environment. The goal is not just to get a representation that is useful for supervised learning and predicts the label at a single time step. It is also to learn a representation that is useful for reinforcement learning, one that supports reasoning about long-term interactions with the world with possibly very sparse rewards.
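
To illustrate why sparse rewards make long-term reasoning hard, the sketch below computes a discounted return for an episode where the only reward arrives at the final step. The discount factor and the episode are assumptions chosen for illustration, not values from the lecture.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, each discounted by gamma per step of delay."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical navigation episode: nine steps with no feedback,
# then a reward of 1.0 for reaching the goal.
sparse = [0.0] * 9 + [1.0]
g0 = discounted_return(sparse, gamma=0.9)
print(round(g0, 4))
```

From the first step, the agent only sees a faint, delayed signal of success, so its representation of language and vision has to carry enough information to connect early actions to that distant outcome.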
