1980 ~ 1990 : Audio-visual speech recognition
1990 ~ 2000 :
1) Content-based video retrieval. Far more digital video was becoming available at this time.
2) Affect and emotion recognition. "Affective Computing" was born.
2000 ~ 2010 :
1) Video event recognition (TrecVid)
2) Multimodal sentiment analysis
2010 ~ 2015 : Image captioning. "Language and Vision" research was born.
2015 ~ 2016 :
1) Video captioning & "grounding"
2) Visual question answering (image-based). For example, you ask about a specific area in the image.
2016 ~ 2017 : Video QA & referring expressions
2017 ~ 2018 :
1) Multimodal Dialogue
2) Large-scale video event retrieval (e.g. YouTube8M)
2018 ~ 2019 :
1) Language, vision and Navigation
2) Self-driving multimodal navigation
recently applied to robotics, healthcare, education and so on.
Affective computing is the study and development of systems and devices that can recognize, interpret, process, and simulate human affects. It is an interdisciplinary field spanning computer science, psychology, and cognitive science. Affective computing technologies sense the emotional state of a user (via sensors, microphone, cameras and/or software logic).
Marking the specific part of the image that a user describes with a bounding box.
A. Affect Recognition
helps computers to understand emotion, personality, sentiment, etc.
B. Media Description
describes the media that is given.
C. Multimodal QA
answering a question about the image/video.
D. Multimodal Navigation
combines reinforcement learning and robotics with understanding language/vision.
E. Multimodal Dialog
F. Event Recognition
recognizes the activity in the video, and also segments it correctly.
G. Multimedia Information Retrieval
retrieve similar image or video
Question from Student
Q. Is it possible to generate an image from a caption?
A. It is a very difficult problem. You have to generate high-dimensional images, and it is also very hard to evaluate what makes a good image.
Affective states :
emotions, moods, and feelings.
Common Topics : recognizes emotions. Anger, Fear, Shame, Positivity, etc.
Cognitive states :
thinking and information processing.
Can the computer tell whether the human is thinking, concentrating, curious, etc.?
Common Topics : Engagement, Interest, Surprise, Agreement, Doubt, etc.
Personality :
patterns of acting, feeling, and thinking.
Needs long-term reasoning and long-term judgement about the personality a person might have.
Common Topics : Long-term reasoning. Outgoing, Moody, Artistic, etc.
Pathology :
health, functioning, and disorders.
Looks at the human and decides whether they are prone to various mental health disorders.
Common Topics : Depression, Trauma, Antagonism, Detachment, etc.
Social processes :
groups, cultures, and perception.
looks at how humans interact with each other, to be able to reason about these social processes.
Common Topics : Understands social relations. Rapport, Cooperation, etc.
Audio-visual Emotion Challenge 2011/2012
goal : take audio-visual content and be able to predict both discrete and continuous emotions.
Audio-visual Emotion Challenge 2013/2014
Reading specific text in a subset of videos
Labeled for emotion per frame (valence, arousal, dominance)
Performing HCI task
100 audio-visual sessions
Provide extracted audio-visual features
Audio-visual Emotion Challenge 2015/2016
It is a very popular dataset for 2 reasons.
1) contains fine-grained annotations at the frame level.
2) apart from both audio and visual features, it also includes physiological data.
Multimodal Sentiment Analysis
Sentiment is a concept from language: whether a person is reacting positively or negatively to a particular video.
People use non-verbal gestures when they speak.
Multimodal Sentiment Analysis
Multi-Party Emotion Recognition
Goal : look at overall sentiment, and emotions of the entire conversation.
Question from student
Q. Does it make sense to have emotion labels at a frame level?
A. Features from image, language, and other modalities are not necessarily aligned. Not all modalities line up at the same time scale, so you have to deal with alignment problems.
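One common way to handle this kind of mismatch is to pick one modality as the reference clock and pool the other onto it. The following is a minimal sketch, assuming word-level timestamps for language and a fixed frame rate for video. The function name, the frame rate, and the toy feature values are illustrative assumptions, not taken from any of the datasets above.

```python
def align_frames_to_words(word_spans, frame_feats, fps=30.0):
    """Average per-frame features over each word's time span.

    word_spans  : list of (word, start_sec, end_sec)
    frame_feats : non-empty list of per-frame feature vectors (lists of floats)
    fps         : assumed video frame rate
    """
    aligned = []
    dim = len(frame_feats[0])
    for word, start, end in word_spans:
        first = int(start * fps)
        last = max(first + 1, int(end * fps))  # always cover at least one frame
        window = frame_feats[first:last]
        if not window:
            # The word lies past the end of the video: fall back to zeros.
            aligned.append((word, [0.0] * dim))
            continue
        mean = [sum(f[d] for f in window) / len(window) for d in range(dim)]
        aligned.append((word, mean))
    return aligned


# Toy example: 6 frames at 2 fps (3 seconds of video), one feature per frame.
frames = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
words = [("so", 0.0, 1.0), ("good", 1.0, 3.0)]
aligned = align_frames_to_words(words, frames, fps=2.0)
```

After this step each word carries one visual vector, so a word-level label and the visual features share the same time axis. Averaging is only one possible pooling choice; it throws away within-word dynamics.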
What are the Core Challenges Most Involved in Affective Recognition?
Media description provides a free-form text description of a given piece of media (image, video, audio-visual clip).
It is one of the main enablers of multi-modal machine learning research in terms of really large datasets that have images and captions.
Large-Scale Image Captioning Dataset
Evaluating Image Caption Generations
Has an evaluation server
Evaluation is difficult as there is no one 'correct' answer for describing an image in a sentence. This is the main challenge of media description.
A candidate sentence is evaluated against a set of "ground truth" sentences.
goal : score the candidate sentence against the set of ground truth sentences as well as possible, using either human judges or automatic evaluation metrics.
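One family of automatic metrics scores word overlap between the candidate and the reference set. Below is a minimal sketch of clipped unigram precision, the building block of BLEU. It is a simplified illustration (no higher-order n-grams, no brevity penalty), not the metric used by any particular evaluation server, and the example sentences are made up.

```python
from collections import Counter


def clipped_unigram_precision(candidate, references):
    """Fraction of candidate words that also appear in the references.

    Each candidate word is credited at most as many times as it occurs
    in the single reference where it is most frequent (BLEU-style clipping).
    `references` must contain at least one sentence.
    """
    cand_counts = Counter(candidate.lower().split())
    if not cand_counts:
        return 0.0
    ref_counts = [Counter(r.lower().split()) for r in references]
    matched = 0
    for word, count in cand_counts.items():
        max_in_refs = max(rc[word] for rc in ref_counts)
        matched += min(count, max_in_refs)
    return matched / sum(cand_counts.values())


references = [
    "a dog runs on the beach",
    "a brown dog is running along the shore",
]
good = clipped_unigram_precision("a dog running on the beach", references)
bad = clipped_unigram_precision("the the the the", references)
```

Here the first candidate scores 1.0 and the degenerate second one scores 0.25. Note that the first candidate gets a perfect score by mixing words from both references, which hints at why overlap metrics alone do not settle what a 'correct' caption is.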
Evaluating Image Captioning Results
A challenge was run with actual human evaluations of the captions (CVPR 2015)
Main Challenge : annotation.
What about automatic evaluation?
There are automatic ways to evaluate.
Lots of research is ongoing on building better models, and also on evaluation metrics that are more reliable.
How to Address the Challenge of Evaluation
What are the Core Challenges Most Involved in Media Description
Multimodal QA dataset1 - VQA(C1)
VQA Challenge 2016 and 2017 (C1)
Multimodal QA - other VQA datasets (C7)
Multimodal QA - Visual Reasoning (C8)
Question from student
Q. In VQA, do we manually balance the data so that it does not have any bias?
A. No universal answer to this. There are 2 ways to get a non-biased model:
1) take a closer look at the dataset, balance it, make sure that it's not biased.
2) leave the data as it is and look at how a model can take in biased data but still learn representations that are not biased.
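For the first option, a quick check is how well a "blind" baseline does: one that never sees the image and always returns the most frequent answer for a question's opening words. The following is a minimal sketch on made-up question/answer pairs. The two-word question prefix and the function names are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def question_prefix(question, n_words=2):
    """Use the first few words as a crude question type."""
    return " ".join(question.lower().split()[:n_words])


def blind_baseline_accuracy(qa_pairs):
    """Accuracy of always answering with the majority answer per prefix.

    Evaluated on the same pairs it was fit on, so it measures how
    skewed the answer distribution is, not generalization.
    """
    answers_by_prefix = defaultdict(Counter)
    for question, answer in qa_pairs:
        answers_by_prefix[question_prefix(question)][answer] += 1
    # For each prefix, the majority answer is right as often as it occurs.
    correct = sum(c.most_common(1)[0][1] for c in answers_by_prefix.values())
    return correct / len(qa_pairs)


qa_pairs = [
    ("Is there a dog?", "yes"),
    ("Is there a cat?", "yes"),
    ("Is there a car?", "yes"),
    ("Is there a boat?", "no"),
    ("What color is the sky?", "blue"),
    ("What color is the grass?", "green"),
]
accuracy = blind_baseline_accuracy(qa_pairs)
```

On this toy set the blind baseline already gets 4 of 6 answers right. A high score of this kind means a model can do well without looking at the image, which is a sign that the data needs balancing or that the model needs to be made robust to the bias.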
What are the Core Challenges Most Involved in Multimodal QA?
Very similar to media description retrieval.
Main Additional Challenge:
Embedded Assistive Agents
The next generation of AI assistants needs to interact with the real (or virtual) world.
e.g. personal assistants, robots, etc.
Language, Vision and Actions
Many Technical Challenges
An action, language, and vision loop has to be executed over possibly many time steps before the task is completed.
Navigating in a Virtual House
Visually-grounded natural language navigation in real buildings
What are the Core Challenges Most Involved in Multimodal Navigation?
1, 2) Representation and Fusion
Primarily representation and fusion, both to reason about language and about the environment. The goal is not just to get a representation that is useful for supervised learning and for predicting the label at a single time step. It is also to learn a representation that is useful for reinforcement learning, one that can reason about long-term interactions with the world under possibly very sparse rewards.
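To see why sparse rewards make this hard, consider the discounted return at each step of an episode where only the final step is rewarded. This is a minimal sketch; the discount factor of 0.9 and the five-step episode are illustrative assumptions.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t = r_t + gamma * G_{t+1} for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the next step.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Navigation episode: no reward until the goal is reached on the last step.
rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
returns = discounted_returns(rewards, gamma=0.9)
```

The only learning signal for the early actions is the discounted echo of a single final reward, so the representation has to carry enough information to connect early observations and instructions to that eventual outcome.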