Tagging Video Contents with Positive/Negative Interest Based on User's Facial Expression
The 14th International MultiMedia Modeling Conference (MMM2008), 9-11 January 2008, Kyoto University, Japan
Masanori Miyahara, Masaki Aoki, Tetsuya Takiguchi, Yasuo Ariki
Graduate School of Engineering, Kobe University, Japan

Table of Contents
1. Motivation
2. Proposed System
3. Experiments
4. Conclusion and Future Work

Introduction
- Multichannel digital broadcasting on TV
- Video-sharing sites on the Internet
⇒ There are too many videos for viewers to select from.
⇒ A video content recommendation system is needed; tagging video contents is its basis.

Video content recommendation system
- User analysis: remote control operation histories [1], favorite keywords [2], facial expressions [3]
- Content analysis: motion vectors, color distribution, object recognition
- Content recommendation: collaborative filtering [4] over a tagged contents database
[1] 2001, Taka   [2] 2001, Masumitsu   [3] 2006, Yamamoto   [4] 1994, Resnick

Main contribution
Improvements over the conventional tagging system based on a viewer's facial expression [2006, Yamamoto]:
- Estimate Interest or Neutral ⇒ estimate Neutral, Positive, or Negative
- Simple facial feature points ⇒ allocate more feature points, extracted by EBGM
- Noisy frames (tilted face, occlusion) ⇒ rejected automatically

Experimental environment
The viewer watches the display alone. The viewer's face is recorded by a webcam while a PC plays back the video content and analyzes the viewer's face. (Figure: top view of the experimental environment — display, webcam, PC, user.)

Overview of the proposed system
1. Face region extraction by AdaBoost
2. Facial feature point extraction and person recognition by EBGM
3. Facial expression recognition by SVM, using the viewer's personal model (a neutral image and a personal facial expression classifier)
⇒ Each frame is tagged Neutral, Positive, Negative, or Rejective.

Face region extraction using AdaBoost
Face regions are extracted by AdaBoost based on Haar-like features [2001, Viola], and the face size is normalized.
Merit: the computation time of the subsequent processes is reduced.

Facial feature point extraction and person recognition using EBGM [1997, Wiskott]
- A jet is a set of convolution coefficients obtained by applying Gabor kernels with different frequencies and orientations to a point in an image.
- A set of jets extracted at all facial feature points is called a face graph.
- A set of face graphs extracted from many people is called a bunch graph.
- The similarity between the bunch graph and a face graph is computed both to search for the facial feature points and to recognize the person.

Facial expression recognition using SVM
A viewer registers a personal model (a neutral image and a personal facial expression SVM classifier) in advance. After EBGM recognizes the viewer, the system retrieves the personal model and builds the feature vector for the SVM from the extracted facial feature points, with the neutral image as reference. (Example images: Neutral (A), Positive (B).)

Definition of facial expression classes
Class            Meaning
Neutral (Neu)    Expressionless
Positive (Pos)   Happiness, laughter, pleasure, etc.
Negative (Neg)   Anger, disgust, displeasure, etc.
Rejective (Rej)  Not watching the display in the front direction, occluding part of the face, tilting the face, etc.
(Example face images of "Neutral", "Positive", "Negative", and "Rejective" follow.)

Automatic rejection
If no face region is extracted in a frame, the frame is tagged "Rejective". If a face region is extracted, the frame is used as training or testing data for the SVM, which classifies it as Neutral, Positive, Negative, or Rejective. (A minimal sketch of this per-frame decision follows below.)
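Below is a minimal sketch of that decision, assuming OpenCV's stock Haar cascade as a stand-in for the AdaBoost detector [2001, Viola]; the cascade file, the 96×96 normalized size, and the function name are illustrative assumptions, not details given in the talk.

```python
import cv2

# AdaBoost face detector on Haar-like features (Viola-Jones); OpenCV's
# bundled frontal-face cascade is an assumed stand-in for the paper's detector.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def preprocess_frame(gray_frame):
    """Return a size-normalized face crop, or None to tag the frame Rejective."""
    faces = detector.detectMultiScale(gray_frame, scaleFactor=1.1,
                                      minNeighbors=5)
    if len(faces) == 0:
        return None                          # no face region -> "Rejective"
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # keep largest face
    face = gray_frame[y:y + h, x:x + w]
    return cv2.resize(face, (96, 96))        # normalize the face size
```

Frames for which this returns None are tagged Rejective immediately; all other frames proceed to EBGM feature point extraction and the SVM.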
Experimental conditions
Two subjects (A and B) watched four videos. The videos averaged 17 minutes in length, and their category was "variety shows". The system recorded each subject's facial video at 15 fps, synchronized with the video content. Afterwards, the subjects tagged the videos with the four labels.

Tagged results (manual)
            Neu     Pos    Neg    Rej    Total (frames)
Subject A   49865   7665   3719   1466   62715
Subject B   56531   2347   3105    775   62758
These videos and tags were used as the training and testing data in the SVM experiments below.

Preliminary experiment 1: face region extraction using AdaBoost
This experiment evaluates the face region extraction stage of the pipeline.
Subject A          Neu     Pos    Neg
False extractions  20      3      1
Total frames       49865   7665   3719
Rate (%)           0.040   0.039  0.027

Subject B          Neu     Pos    Neg
False extractions  132     106    9
Total frames       56531   2347   3105
Rate (%)           0.234   4.516  0.290

Preliminary experiment 2: person recognition using EBGM
This experiment evaluates the person recognition stage of the pipeline (frames with false face extraction excluded).
Subject A           Neu     Pos    Neg
False recognitions  2       0      0
Total frames        49845   7662   3718
Rate (%)            0.004   0.000  0.000

Subject B           Neu     Pos    Neg
False recognitions  2       20     0
Total frames        56399   2241   3096
Rate (%)            0.004   0.893  0.000

Experiment: facial expression recognition using SVM
Three of the four experimental videos (roughly 50,000 frames at 15 fps) were used as training data and the remaining one as testing data, rotating the held-out video (cross validation; a sketch of this evaluation loop follows the Demo slide). Precision and recall were computed per class as:
recall(c) = (frames of class c tagged c by the system) / (frames manually labeled c)
precision(c) = (frames of class c tagged c by the system) / (frames the system tagged c)

Experimental results: facial expression recognition using SVM
(averaged over the two subjects; reconstructed from the per-subject confusion matrices in the appendix)
Class  Precision  Recall
Neu    0.98       0.98
Pos    0.93       0.89
Neg    0.88       0.82
Rej    0.74       0.81

Discussion
The averaged recall rate was 87% and the averaged precision rate was 88%. When the subjects expressed their emotions only modestly, the system often mistook the facial expression for Neutral. When a subject showed an intermediate facial expression, the system often erred, because only one expression class was assumed per frame.

Demo
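As referenced in the Experiment slide above, here is a minimal sketch of the leave-one-video-out evaluation, assuming scikit-learn's SVC as the SVM stage and per-frame feature vectors already produced by the EBGM stage; the RBF kernel and every name here are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import classification_report

CLASSES = ["Neutral", "Positive", "Negative", "Rejective"]

def leave_one_video_out(features, labels, video_ids):
    """Train on three videos, test on the held-out one, for each video.

    features:  (n_frames, d) per-frame feature vectors from the EBGM stage
    labels:    per-frame class indices 0..3 (the subject's manual tags)
    video_ids: which of the four videos each frame came from
    """
    for held_out in np.unique(video_ids):
        train = video_ids != held_out
        clf = SVC(kernel="rbf")              # kernel choice is an assumption
        clf.fit(features[train], labels[train])
        pred = clf.predict(features[~train])
        print(f"held-out video: {held_out}")
        print(classification_report(labels[~train], pred,
                                    labels=[0, 1, 2, 3],
                                    target_names=CLASSES, zero_division=0))
```

Averaging the per-class precision and recall over the four folds and the two subjects yields the figures reported in the results slide.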
Conclusion and future work
We proposed a system that tags video contents with Neutral, Positive, Negative, and Rejective labels based on the viewer's facial expression.
Future work:
- Evaluation with various video categories and various subjects
- Combination of user analysis and content analysis
- Construction of an automatic video content recommendation system

Appendix

Experimental result: confusion matrix for subject A
(rows: manual tag, columns: system output; a small verification sketch follows the Q&A list at the end)
Subject A      Neu     Pos    Neg    Rej    Sum     Recall (%)
Neu            48275   443    525    622    49865   96.81
Pos            743     6907   1      14     7665    90.11
Neg            356     107    3250   6      3719    87.39
Rej            135     0      5      1326   1466    90.45
Sum            49509   7457   3781   1968   62715   91.19 (avg.)
Precision (%)  97.51   92.62  85.96  67.38          85.87 (avg.)

Experimental result: confusion matrix for subject B
Subject B      Neu     Pos    Neg    Rej    Sum     Recall (%)
Neu            56068   138    264    61     56531   99.18
Pos            231     2076   8      32     2347    88.45
Neg            641     24     2402   38     3105    77.36
Rej            203     0      21     551    775     71.10
Sum            57143   2238   2695   682    62758   84.02 (avg.)
Precision (%)  98.12   92.76  89.13  80.79          90.20 (avg.)

Computation time
The facial video is recorded at 15 fps. Processing runs at about 1 fps on a Pentium 4 (3 GHz) and about 4 fps on a Core 2 Quad (2.4 GHz). If a user watches video content for up to about 6 hours per day, the Core 2 Quad system can complete the processing within the day.

Conventional system vs. proposed system
The video data used to evaluate the conventional system has not been released, so a direct comparison is not easy.

Learning in advance
In the current system, the viewer registers a neutral image and a personal facial expression classifier in advance: he watches video content (for 50 minutes) and then tags it himself. Future work: saving the user this trouble through semi-supervised or unsupervised learning.

Video categories
This study experimented only with "variety shows". Future work will cover "dramas", "news", etc., although we expect the system not to work as well there, because a viewer's facial expression changes little while watching such videos.

Conventional facial expression recognition
Many studies on facial expression recognition have been carried out, so we also have to consider other methods.

Evaluation
The reported overall values are averages over both subjects:
(averaged recall for subject A + averaged recall for subject B) / 2
(averaged precision for subject A + averaged precision for subject B) / 2

Q and A
- About real-time performance
- What is gained by increasing the number of feature points?
- About the learning in advance
- Is testing only variety shows sufficient?
- How would the content recommendation be carried out?
- Why EBGM? What about AAM?
- What about horror movies?
- What about Ekman's six basic expressions?
- When is the tagging performed?
- TVs are not equipped with cameras, though.
- Wouldn't it be enough for the viewer to just press a button?
- Processing time could be shortened by skipping frames in which the viewer does not move.
- Which is better, per-frame tagging or per-interval tagging?
- Aren't keywords or the EPG alone sufficient?
- Having one's face filmed meets psychological resistance.
- Where does being moved by a drama fit in the classes?
- The manual tagging was done by the subjects themselves; what if someone else did it?
- Does watching a video twice have an effect?
- How does the Rejective mechanism work?
- How does this compare with other methods in the facial expression recognition field?
- Automatic program recommendation would require a large number of subjects; is that feasible?
- Are two subjects enough?
- How do you guarantee that the subjects neither exaggerated nor suppressed their expressions?
- When tagging, do the subjects watch the content or their own face?
- When no expression can be recognized, how do you recommend programs? Is there a plan?
- What is the technical novelty? Isn't this just a combination of existing techniques?
- Won't registering a neutral image and facial expressions in advance become a barrier when a large number of users adopt the system?
- What about sports, movies, and so on?
- In a real environment, how do you cope with illumination changes and face motion?
- What about TRECVID and the like?
- The main contribution is not stated in the conclusion.
- What is collaborative filtering?
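As a check on the confusion matrices above, the sketch below recomputes the per-class and averaged rates from subject A's table; the matrix entries are taken from the slides, and the code is only an illustration of the precision/recall definitions.

```python
import numpy as np

# Rows: manual tag, columns: system output, order Neu, Pos, Neg, Rej
# (subject A's confusion matrix from the appendix above).
M = np.array([[48275,  443,  525,  622],
              [  743, 6907,    1,   14],
              [  356,  107, 3250,    6],
              [  135,    0,    5, 1326]])

recall    = np.diag(M) / M.sum(axis=1)   # e.g. Neu: 48275 / 49865
precision = np.diag(M) / M.sum(axis=0)   # e.g. Neu: 48275 / 49509

for name, r, p in zip(["Neu", "Pos", "Neg", "Rej"], recall, precision):
    print(f"{name}: recall {r:.2%}, precision {p:.2%}")
print(f"averaged recall {recall.mean():.2%}, "
      f"averaged precision {precision.mean():.2%}")
# -> averaged recall 91.19%, averaged precision 85.87%, matching the table.
```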