1. Introduction
- In this paper, it tried to use CNN for recognizing human action which containing sptial and temporal information
- Architecture is based on two streams(spatial info -> video frames/ temporal info -> optical flow)
2. Two-stream architecture for video recognition
- As shown in Fig1, there are two stream( spatial stream and temporal stream )
- Each stream is implemented using CNN and the softmax scored of last fusion( averaging and SVM )
Spatial stream CNN : there are useful clues in static frames
3. Optical Flow CNN
- Optical CNN is CNN model about temporal recognition
Optical flow stacking
-> displace vector at point (u,v) in frame t.
Trajectory stacking
- (1) : stores displacement
- (2) : stores vector samples along the trajactory
Bi-directional optical flow
- computting additional set of displacement fields in the opposite direction
- construct input by stacking forward L/2 frames and -L/2 frames
Mean flow subtraction
- There can be case like camera movement which can cause dominant displacement
- for simplify, just substracting mean vector
4. Multi-task learning
- it is hard to concatnate two different dataset for video learning
- there are two softmax layer ( one for HMDB-51 and one for UCF-101 )
What's with the recognition in videos? What does it even mean? Does it have something to do with Streamer Technik Tipps? Or am I wrong to suggest this?