https://dl.acm.org/doi/pdf/10.1145/3290605.3300719
Found:
1) People reference intervals of video more frequently than time points
2) Visual entities are referenced more often than sounds