VideoStory: A New Multimedia Embedding for Few-Example Recognition and Translation of Events
Amirhossein Habibian, Thomas Mensink, Cees Snoek
ISLA, University of Amsterdam

Problem statement
Recognize and translate video events
•  Learning from few examples
•  Provide semantic interpretation of videos
Example (figure labels): Event: Attempting bike trick; Video; Description

Video events in ACM Multimedia
•  News events: earthquake, abdication, product launch
•  Sport events: scoring goal, ace serve, slam dunk
•  Social events: concert, debates, exhibitions
•  Everyday events: interactions of people and objects
Examples: Repairing an appliance, Working on sewing project, Grooming an animal, Birthday party

Recognizing events
Representing videos as histograms of low-level features
Local descriptors
•  Visual descriptors: SIFT, HOG, GIST, …
•  Video descriptors: MBH, STIP, …
•  Audio descriptors: MFCC, AIM, …
Feature embedding
•  Bag-of-words
•  VLAD
•  Fisher vector
•  Audio-visual BoW
Problem: very high-dimensional and non-semantic
[Jiang et al., TRECVID 2010] [Natarajan et al., CVPR 2012] [Chen et al., MM 2013]

Recognizing and translating events
Representing videos as histograms of concept scores
Deep convolutional neural network
Local descriptors
•  Visual descriptors: SIFT, HOG, GIST, …
•  Video descriptors: MBH, STIP, …
•  Audio descriptors: MFCC, AIM, …
Feature embedding
•  Bag-of-words
•  VLAD
•  Fisher vector
•  Audio-visual BoW
Classification
•  Attribute detection
•  Concept detection
Problem: define, annotate and train concept classifiers
[Smith et al., ICME 2003] [Hauptmann et al., TMM 2007] [Merler et al., TMM 2012] [Ma et al., MM 2012]

Recognition and translation by embedding
Diagram labels: video features xi, W, embedding, A, term vector yi (Stunt, Bike, Motorcycle)
Joint space where xi W ≈ yi A
Explicitly relate training W and A from multimedia
•  A = Identity matrix: individual term classifiers
•  A = Projection matrix: select/group terms
[Rasiwasia et al., MM 2010] [Weston et al., IJCAI 2011] [Akata et al., CVPR 2013] [Das et al., WSDM 2013]

VideoStory: Embed the story of a video
Diagram labels: xi, W, embedding si, A, yi (Stunt, Bike, Motorcycle)
Design criteria: learn W and A such that
•  Descriptiveness: preserve video descriptions
•  Predictability: recognize terms from video content

Key observation: Competing forces
Descriptiveness and predictability are competing
•  All grouped: Stunt/Bike/Motorcycle/…
•  VideoStory: Stunt, Bike/Motorcycle
•  No grouping: Stunt, Bike, Motorcycle, …

Why is this important?
•  Grouping terms: number of classes is reduced
•  Training classifiers per group: more positive examples available per group
•  We can train from freely available web data

Key contribution: Joint optimization
Jointly optimize for descriptiveness and predictability
•  Video classifiers W
•  Textual projection matrix A
•  VideoStory embedding S
VideoStory connects the two loss functions

VideoStory: Descriptiveness
Reconstruct term vectors from VideoStory
•  Textual projection matrix A
•  VideoStory embedding S
•  Term vectors yi
Relates to: regularized latent semantic indexing

VideoStory: Predictability
Predict VideoStory from video features
•  Video classifiers W
•  VideoStory embedding S
•  Video features xi
Relates to: ridge regression

VideoStory: Training (1)
Diagram labels: VideoStory training; video and descriptions; VideoStory algorithm; A; W
Set of videos and their captions
•  Encode video features xi: Fisher Vectors of MBH [Wang ICCV'13]
•  Encode video descriptions yi: Bag-of-words of terms

VideoStory: Training (2)
Diagram labels: VideoStory training; video and descriptions; VideoStory algorithm
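A minimal sketch of what such a training algorithm could look like. It assumes a per-sample objective ||yi - A si||^2 + ||si - W^T xi||^2 with squared-norm penalties on A, W and si. That form is chosen only because it is consistent with the relations to regularized latent semantic indexing and ridge regression noted on the descriptiveness and predictability slides; the slides do not spell the objective out. The function names (`sample_loss`, `sgd_step`, `train_videostory`), the single penalty weight `lam`, and the initialization scale are all hypothetical.

```python
import numpy as np

def sample_loss(x, y, s, W, A, lam):
    """Assumed per-sample joint loss: descriptiveness + predictability + L2 penalties."""
    descriptiveness = np.sum((y - A @ s) ** 2)   # reconstruct term vector y from s
    predictability = np.sum((s - W.T @ x) ** 2)  # predict s from video feature x
    penalty = lam * (np.sum(A ** 2) + np.sum(W ** 2) + np.sum(s ** 2))
    return descriptiveness + predictability + penalty

def sgd_step(x, y, s, W, A, eta, lam):
    """One stochastic gradient step on the assumed loss; returns updated (s, W, A)."""
    r_d = y - A @ s        # descriptiveness residual
    r_p = s - W.T @ x      # predictability residual
    grad_A = -2.0 * np.outer(r_d, s) + 2.0 * lam * A
    grad_W = -2.0 * np.outer(x, r_p) + 2.0 * lam * W
    grad_s = -2.0 * A.T @ r_d + 2.0 * r_p + 2.0 * lam * s
    return s - eta * grad_s, W - eta * grad_W, A - eta * grad_A

def train_videostory(X, Y, k=2, epochs=50, eta=0.01, lam=1e-3, seed=0):
    """X: (n, d) video features, Y: (n, m) term vectors, k: embedding size."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = Y.shape[1]
    # Small random initialization (an assumption, not taken from the slides).
    W = 0.1 * rng.standard_normal((d, k))
    A = 0.1 * rng.standard_normal((m, k))
    S = 0.1 * rng.standard_normal((n, k))
    for _ in range(epochs):
        for i in rng.permutation(n):  # visit samples in random order
            S[i], W, A = sgd_step(X[i], Y[i], S[i], W, A, eta, lam)
    return W, A, S
```

Because the embedding si appears in both loss terms, each update to si pulls it toward something that can both reconstruct the description and be predicted from the video, which is one way to read "VideoStory connects the two loss functions".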
Using stochastic gradient descent to learn A and W [Bottou ICCS 2010]:
•  Choose random sample
•  Compute sample gradient w.r.t. objective
•  Update parameters with step-size η

YouTube46K dataset
Videos and title descriptions from YouTube
•  46K videos, 19K unique terms in descriptions
•  Seeded from video event descriptions
•  Filters to remove low quality videos
Available for download: www.mediamill.nl

VideoStory: Event classifier training
Diagram labels: VideoStory training (video and descriptions; VideoStory algorithm; W; A); event classifier training (video; VideoStory construction; S; event labels; training; model)
Event classifiers: SVM with RBF kernel

Datasets for evaluation
•  TRECVID Multimedia Event Detection 2013: 56K videos - 20 events - 10 positive training videos
•  Columbia Consumer Video: 9K videos - 15 events - 10 positive training videos
[Jiang et al. ICMR 2011] [Strassel et al. LREC 2012]

VideoStory: Recognition and translation
Diagram labels: VideoStory training; event classifier training; recognition and translation (video; VideoStory construction; S; A; event recognition; event translation)
Evaluation:
•  Event score: Mean Average Precision
•  Description: ROUGE-1

Experiment 1: Effect of Embedding
•  Frequent terms: train classifier for most frequent terms
•  Grouping first: first descriptiveness, then predictability
•  VideoStory: joint descriptiveness and predictability
VideoStory outperforms other embeddings

Experiment 2: Story Quality vs. Quantity
•  Expert10K: 10K TRECVID videos with expert descriptions
•  YouTube10K: 10K random subset of YouTube46K dataset
•  YouTube46K: 46K YouTube videos and descriptions
Web supervision on par with expert-provided descriptions

Experiment 3: VideoStory vs Others
Method                                   TRECVID MED   Columbia CV
Attributes [Habibian & Snoek CVIU'14]    .135          .314
Low-Level MBH [Wang & Schmid ICCV'13]    .174          .409
VideoStory - MBH                         .196          .432

New Experiment: VideoStory with DeepNet
Method                                   TRECVID MED
Attributes [Habibian & Snoek CVIU'14]    .135
Low-Level MBH [Wang & Schmid ICCV'13]    .174
VideoStory - MBH                         .196
CNN features [Zeiler & Fergus ECCV'14]   .198
VideoStory - CNN                         .243

Experiment 4: VideoStory translation
[Figure 8: VideoStory event recognition and translation results on … Two panels plot term predictions (axes: Terms, Predictions). Panel "Getting a vehicle unstuck": water, people, truck, dump, dog, snow, drive, mud, car. Panel "Rock climbing": climb, hang, dog, fail, rock, boy, wall, indoor.]
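The translation step behind the figure can be sketched as follows, assuming a video feature x is first mapped to its VideoStory representation s = W^T x and the term scores are then read out as A s. The function name `translate`, the toy vocabulary, and the hand-set matrices are hypothetical; they illustrate only the data flow, not learned values.

```python
import numpy as np

def translate(x, W, A, vocab, top_k=3):
    """Rank vocabulary terms for one video feature vector x."""
    s = W.T @ x                          # VideoStory representation of the video
    scores = A @ s                       # one real-valued score per term
    order = np.argsort(-scores)[:top_k]  # highest-scoring terms first
    return [(vocab[j], float(scores[j])) for j in order]

# Toy example with hand-set matrices (illustrative values only).
vocab = ["climb", "rock", "dog"]
W = np.array([[1.0, 0.0],
              [0.0, 1.0]])               # (d=2, k=2)
A = np.array([[0.5, 0.0],
              [0.4, 0.0],
              [0.0, 0.3]])               # (m=3, k=2)
x = np.array([1.0, 0.0])
top_terms = translate(x, W, A, vocab, top_k=2)
```

Term scores are real-valued and can be negative, which matches the prediction axes in the figure.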
Experiment 4: VideoStory translation
•  Evaluate on TRECVID MED
•  Ground-truth: provided descriptions
•  Measure with ROUGE-1
VideoStory outperforms predefined attributes

Conclusions
VideoStory: a semantic multimedia embedding
–  Jointly optimizes descriptiveness & predictability
–  Training event classifiers from few examples
–  Translate videos to textual description

Thank you!