CMU MMML - Lecture 1.2 Datasets

aerojohn1223 · April 22, 2023

[Study] CMU MMML Lecture

Multimodal Research Tasks

1980 ~ 1990 : Audio-visual speech recognition
1990 ~ 2000 :
1) Content-based video retrieval. Far more digital video was becoming available at this time.
2) Affect and emotion recognition. "Affective Computing" was born.
2000 ~ 2010 :
1) Video event recognition (TRECVID)
2) Multimodal sentiment analysis
2010 ~ 2015 : Image captioning. "Language and Vision" research was born.
2015 ~ 2016 :
1) Video captioning & "grounding"
2) Visual question answering (image-based). For example, you ask about a specific area in the image.
2016 ~ 2017 : Video QA & referring expressions
2017 ~ 2018 :
1) Multimodal Dialogue
2) Large-scale video event retrieval (e.g. YouTube8M)
2018 ~ 2019 :
1) Language, vision and Navigation
2) Self-driving multimodal navigation

More recently, multimodal research has been applied to robotics, healthcare, education, and so on.

Affective Computing?
Affective computing is the study and development of systems and devices that can recognize, interpret, process, and simulate human affects. It is an interdisciplinary field spanning computer science, psychology, and cognitive science. Affective computing technologies sense the emotional state of a user (via sensors, microphone, cameras and/or software logic).

Grounding?
Marking, with a bounding box, the specific part of the image that a user refers to.

Real world tasks tackled by MMML

A. Affect Recognition

Helps computers understand emotion, personality, sentiment, etc.

  • Emotion
  • Personalities
  • Sentiment

B. Media Description

Describes the media that is given.

  • Image and video captioning

C. Multimodal QA

Answers a question about an image or video.

  • Image and video QA
  • Visual reasoning

D. Multimodal Navigation

Combines reinforcement learning and robotics with language and vision understanding.

  • Language guided navigation
  • Autonomous driving

E. Multimodal Dialog

  • Grounded Dialog

F. Event Recognition

Recognizes the activity in a video and segments where that activity occurs.

  • Action recognition
  • Segmentation

G. Multimedia Information Retrieval
Retrieves similar images or videos.

  • Content-based/Cross-media

Question from Student
Q. Is it possible to generate an image from a caption?
A. It is a very difficult problem. You have to generate high-dimensional images, and it is also very hard to evaluate what makes a good image.

Affective Computing

Affective states :
emotions, moods, and feelings.
Common Topics : recognizes emotions. Anger, Fear, Shame, Positivity, etc.

Cognitive states :
thinking and information processing.
Can the computer tell whether a person is thinking, concentrating, curious, etc.?
Common Topics : Engagement, Interest, Surprise, Agreement, Doubt, etc.

Personality :
patterns of acting, feeling, and thinking.
Requires long-term reasoning and long-term judgment about the personality a person might have.
Common Topics : Reasoning for Long-Term. Outgoing, Moody, Artistic, etc.

Pathology :
health, functioning, and disorders.
Looks at the human and decides whether they are prone to various mental health disorders.
Common Topics : Depression, Trauma, Antagonism, Detachment, etc.

Social processes :
groups, cultures, and perception.
Looks at how humans interact with each other in order to reason about these social processes.
Common Topics : Understands social relations. Rapport, Cooperation, etc.

Audio-visual Emotion Challenge 2011/2012
Goal : take audio-visual content and predict both discrete and continuous emotions.

  • Part of larger SEMAINE corpus
  • Sensitive Artificial Listener paradigm
  • Labeled for four dimensional emotions (per frame)
    • Arousal, expectancy, power, valence
    • These are annotated on a continuous spectrum, which makes it a regression problem.
  • Has transcripts
    • Recent datasets generally have transcripts.
    • The text can be processed without looking at the audio features.
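
Because the emotion dimensions are annotated on a spectrum, predictions are scored with regression measures rather than classification accuracy. The following is a minimal sketch of two common choices, mean squared error and Pearson correlation, computed over per-frame values. The valence numbers are made up for illustration and are not taken from the challenge data, and the notes do not say which measure the challenge itself used.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def mse(pred, gold):
    """Mean squared error between per-frame predictions and annotations."""
    return mean([(p - g) ** 2 for p, g in zip(pred, gold)])


def pearson(pred, gold):
    """Pearson correlation between per-frame predictions and annotations."""
    mean_p, mean_g = mean(pred), mean(gold)
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_g = sum((g - mean_g) ** 2 for g in gold)
    return cov / sqrt(var_p * var_g)


# Hypothetical per-frame valence values (illustrative only).
gold_valence = [0.1, 0.2, 0.4, 0.3, -0.1]
pred_valence = [0.0, 0.3, 0.5, 0.2, -0.2]

print("MSE:", mse(pred_valence, gold_valence))
print("Pearson r:", pearson(pred_valence, gold_valence))
```

A low error and a high correlation together suggest that the model tracks both the level and the trend of the annotated emotion over time.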

Audio-visual Emotion Challenge 2013/2014

  • Reading specific text in a subset of videos

  • Labeled for emotion per frame (valence, arousal, dominance)

  • Performing HCI task

    • Reading aloud a text in German
    • Responding to a number of questions
  • 100 audio-visual sessions

  • Provide extracted audio-visual features

Audio-visual Emotion Challenge 2015/2016

  • RECOLA dataset
  • Audio-Visual emotion recognition
  • Labeled for dimensional emotion per frame(arousal, valence)
  • Includes physiological data
  • 27 participants
  • French, audio, video, ECG, and EDA
  • Collaboration task in video conference
  • Broader range of emotive expressions

It is a very popular dataset for two reasons.
1) It contains fine-grained annotations at the frame level.

  • In many other datasets, by contrast, there is one single annotation for the entire video.

2) Apart from audio and visual features, it also includes physiological data.

  • One of them is electrocardiography (ECG).
    • A graph of voltage vs. time of the electrical activity of the heart.
  • Verbal and non-verbal features can be integrated with these physiological data sources to get better predictions.

Multimodal Sentiment Analysis

  • Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos (MOSI)
  • 89 speakers with 2199 opinion segments
  • Audio-visual data with transcriptions
  • Labels for sentiment/opinion
    • Subjective vs. Objective
    • Positive vs. Negative

Sentiment is a concept rooted in language: whether a person is expressing a positive or negative opinion in a particular video.

People also use non-verbal gestures when they speak.

Multimodal Sentiment Analysis

  • Multimodal sentiment and emotion recognition
  • CMU-MOSEI : 23,453 annotated video segments from 1,000 distinct speakers and 250 topics.
  • Fine-grained five-class sentiment labels and also discrete emotions.

Multi-Party Emotion Recognition

  • MELD : Multi-party dataset for emotion recognition in conversations
  • Dataset collected from the 'Friends' TV Show, annotating each of the characters separately.

Goal : look at overall sentiment, and emotions of the entire conversation.

Question from student
Q. Does it make sense to have emotion labels at a frame level?
A. Features from images, language, and other modalities are not necessarily aligned. Not all modalities line up at the same time scale, so alignment problems have to be dealt with.

What are the Core Challenges Most Involved in Affect Recognition?
1) Representation

  • You need suitable levels of abstraction to get really good multimodal representations.

2) Alignment

  • Most of these datasets are temporal datasets consisting of videos. This is a very big challenge, because you have to align verbal and non-verbal signals (different modalities).

3) Fusion

  • You have multiple data sources, some of which are more useful than others at different time periods.
    Goal : how to best leverage all these data sources to make a prediction.

4) Co-Learning

  • An area that has arisen in affective computing.
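
To make the fusion challenge concrete, here is a minimal sketch of two standard strategies that the notes do not name explicitly: early fusion (concatenating per-modality features before prediction) and late fusion (combining per-modality predictions). The feature values, scores, and weights are hypothetical, and the weights stand in for the idea that some sources are more useful than others.

```python
def early_fusion(audio_feat, visual_feat, text_feat):
    """Concatenate per-modality feature vectors into one joint vector."""
    return list(audio_feat) + list(visual_feat) + list(text_feat)


def late_fusion(scores, weights):
    """Weighted average of per-modality prediction scores."""
    total_weight = sum(weights[m] for m in scores)
    return sum(scores[m] * weights[m] for m in scores) / total_weight


# Hypothetical features for a single time step.
joint = early_fusion([0.1, 0.2], [0.3], [0.4, 0.5])
print(len(joint))  # 5 features passed to one downstream model

# Hypothetical per-modality sentiment scores; text is trusted twice as much.
scores = {"audio": 0.2, "visual": 0.6, "text": 0.8}
weights = {"audio": 1.0, "visual": 1.0, "text": 2.0}
print(late_fusion(scores, weights))  # approximately 0.6
```

In a real system the weights would usually be learned and could change over time, since the most informative modality can differ from one moment to the next.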

Media Description

Media description produces a free-form text description of a given piece of media (an image, a video, or an audio-visual clip).
It is one of the main enablers of multimodal machine learning research, thanks to really large datasets of images paired with captions.

Large-Scale Image Captioning Dataset

  • Microsoft Common Objects in COntext (MS COCO)
  • 120,000 images
  • Each image is accompanied by five free-form sentences describing it (at least 8 words)
  • Sentences collected using crowdsourcing (Mechanical Turk)
  • Also contains object detections, boundaries, and keypoints.
  • It is one of the datasets that led the boom in multimodal research.

Evaluating Image Caption Generations

  • Has an evaluation server

    • Training and validation : 80k images (400k captions)
    • Testing : 40k images (380k captions). A subset contains more captions for better evaluation. These are kept private (to avoid overfitting and cheating).
  • Evaluation is difficult because there is no one 'correct' answer for describing an image in a sentence. This is the main challenge of media description.
  • A candidate sentence is evaluated against a set of "ground truth" sentences, using either human judges or automatic evaluation metrics.

Evaluating Image Captioning Results

  • A challenge was run with actual human evaluations of the captions (CVPR 2015).
  • Main challenge : annotation.
    • You don't want to need human judges every time you build a model. Human labels are very expensive.
  • What about automatic evaluation?
    • There must be a clear correlation between automatic and human evaluations.
    • CIDEr-D, METEOR, ROUGE, BLEU
    • In practice, the automatic metrics above do not correspond well with human evaluations.
    • Under these automatic metrics, models score higher than human-written captions.
      • Clearly, there are issues with automatic metrics. They allow models to overfit to the training data.
  • A lot of research is ongoing on building better models, and also on evaluation metrics that are more reliable.
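
To illustrate how a candidate sentence can be scored against several ground truth sentences, the sketch below computes clipped n-gram precision, which is the core ingredient of BLEU. It is a simplified illustration only: full BLEU also combines several n-gram orders and applies a brevity penalty, and the example sentences are made up.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n=1):
    """Fraction of candidate n-grams that appear in some reference.

    Each n-gram is credited at most as many times as it occurs in the
    single reference where it is most frequent (the "clipping" step).
    """
    cand_counts = Counter(ngrams(candidate.split(), n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.split(), n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


# Hypothetical ground truth captions for one image.
references = ["a dog runs on the grass", "a brown dog is running outside"]

print(clipped_precision("a dog runs outside", references, n=1))  # 1.0
print(clipped_precision("a dog runs outside", references, n=2))  # 2 of 3 bigrams match
print(clipped_precision("the the the", references, n=1))         # 1 of 3 unigrams credited
```

The last call hints at why such metrics can disagree with humans: they only count surface overlap, so a fluent but differently worded caption can score lower than a poor one that reuses reference words.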

Video Captioning

  • Alignment is a challenge, since the description can come after the video segment it describes.
  • Only a single caption per clip, which makes evaluation challenging.

How to Address the Challenge of Evaluation

  • Referring Expressions : Generate / Comprehend a noun phrase which identifies a particular object in an image.

What are the Core Challenges Most Involved in Media Description

1) Representation

  • You need a good representation that is aligned between the two modalities to be able to perform translation accurately.

2) Translation

  • The goal is to map data from one modality to semantically meaningful high-dimensional data in another modality, e.g. image to caption.

3) Alignment

Multimodal QA


  • Addresses some of the challenges of image captioning and video captioning, where generation is really hard to evaluate.


  • Task : Given an image and a question, answer the question.

Multimodal QA dataset1 - VQA(C1)

  • Real Images
    • 200k MS COCO images
    • 600k questions
    • 6M answers
    • 1.8M plausible answers
  • Abstract Images
    • 50k scenes
    • 150k questions
    • 1.5M answers
    • 450k plausible answers

VQA Challenge 2016 and 2017 (C1)

  • Two challenges organized these past two years
  • Models are currently good at yes/no questions, but not so much at free-form and counting questions (e.g. how many yellow bananas are there?)

VQA 2.0

  • Just guessing without an image leads to ~51% accuracy
    • So the V in VQA "only" adds a 14% increase in accuracy
  • If 80% of the banana images used for training are yellow, the model becomes biased (even if the banana is green, it would answer yellow).
  • VQA 2.0 tries to address this bias by pairing the same question with several images for which the answer is different.
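
The bias problem can be shown with a minimal sketch of a "blind" baseline that never looks at the image and simply returns the most frequent training answer for each question. The question and answer pairs below are toy data invented for illustration, not actual VQA statistics.

```python
from collections import Counter, defaultdict


def fit_prior(train_pairs):
    """Map each question to its most frequent training answer."""
    by_question = defaultdict(Counter)
    for question, answer in train_pairs:
        by_question[question][answer] += 1
    return {q: counts.most_common(1)[0][0] for q, counts in by_question.items()}


def blind_accuracy(prior, test_pairs):
    """Accuracy of answering from the question alone, ignoring the image."""
    correct = sum(prior.get(question) == answer for question, answer in test_pairs)
    return correct / len(test_pairs)


question = "what color is the banana?"

# Toy training set: 80% of the bananas are yellow.
train = [(question, "yellow")] * 4 + [(question, "green")]
prior = fit_prior(train)

# A biased test set rewards the blind guess.
biased_test = [(question, "yellow")] * 4 + [(question, "green")]
# A balanced test set pairs the same question with images whose answers differ.
balanced_test = [(question, "yellow"), (question, "green")] * 2

print(prior[question])                       # yellow
print(blind_accuracy(prior, biased_test))    # 0.8
print(blind_accuracy(prior, balanced_test))  # 0.5
```

On the balanced set the blind guess drops to chance for this question, so a model has to use the image to do better.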

Multimodal QA - other VQA datasets (C7)

  • TVQA
    • Video QA dataset based on 6 popular TV shows
    • 152.5K QA pairs from 21.8k clips
    • Compositional questions

Multimodal QA - Visual Reasoning (C8)

  • VCR : Visual Commonsense Reasoning
    • Model must answer challenging visual questions expressed in language
    • and provide a rationale explaining why its answer is true.

Social-IQ (A10)

  • Social-IQ : 1.2K videos, 7.5K questions, 50K answers
  • Questions and answers centered around social behaviors

Question from student
Q. In VQA, do we manually balance the data so that it does not have any bias?
A. There is no universal answer to this. There are two ways to make a non-biased model:
1) Take a closer look at the dataset, balance it, and make sure that it is not biased.
2) Leave the data as it is and look at how a model can take in biased data but still learn representations that are not biased.

What are the Core Challenges Most Involved in Multimodal QA?
Very similar to media description retrieval.
1) Representation
2) Alignment
3) Translation
4) Fusion

Main Additional Challenge:

  • You must also comprehend the question and localize which part of the image or video it corresponds to.
    • This makes alignment a bigger challenge.

Multimodal Navigation

Embodied Assistive Agents
The next generation of AI assistants needs to interact with the real (or virtual) world.
e.g. personal assistants, robots, etc.

Language, Vision and Actions
challenge :

  • In addition to just understanding language and vision, we also have to take actions in the real-world.
    • So, really about multimodal perception of the environment and also linking that to action.

Many Technical Challenges
There is an action, language, and vision loop that has to be executed over possibly many time steps before the task is completed.

Navigating in a Virtual House
Visually-grounded natural language navigation in real buildings

  • Room-2-Room : 21,567 open vocabulary, crowd-sourced navigation instructions
  • Refer360 Dataset : Multiple Step Instructions

What are the Core Challenges Most Involved in Multimodal Navigation?

1, 2) Representation and Fusion
The challenges are primarily representation and fusion, both for reasoning about language and for reasoning about the environment. The goal is not just to get a representation that is useful for supervised learning, i.e. for predicting the label at a single time step. It is also to learn a representation that is useful for reinforcement learning: reasoning about long-term interactions with the world with possibly very sparse rewards.
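
The difference from per-step supervised labels can be seen in a minimal sketch of the discounted return used in reinforcement learning. The episode below is hypothetical: the agent receives a reward only at the final step, when it reaches the goal, and the discount factor of 0.5 is chosen purely to make the numbers easy to read.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every time step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical navigation episode: reward arrives only when the goal is reached.
sparse_rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_returns(sparse_rewards, gamma=0.5))  # [0.125, 0.25, 0.5, 1.0]
```

Early steps receive only a small, delayed signal, which is why the representation has to support reasoning over long horizons rather than a single time step.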
