
Tagging Video Contents with
Positive/Negative Interest
Based on User’s Facial Expression
The 14th International MultiMedia Modeling Conference (MMM2008)
9-11 January, 2008 Kyoto University, Japan
Masanori Miyahara, Masaki Aoki, Tetsuya Takiguchi, Yasuo Ariki
Graduate School of Engineering, Kobe University, Japan
Table of Contents
1. Motivation
2. Proposed System
3. Experiments
4. Conclusion and Future Work
2
Introduction
- Multichannel digital broadcasting on TV and video-sharing sites on the Internet offer far more videos than viewers can select from.
- Video contents recommendation system
  ⇒ tagging video contents
3
Video contents recommendation system
- User analysis: remote control operation histories [1], favorite keywords [2], facial expression [3]
- Content analysis: motion vector, color distribution, object recognition
- Contents recommendation: collaborative filtering [4] over a tagged contents database
[1] Taka, 2001  [2] Masumitsu, 2001  [3] Yamamoto, 2006  [4] Resnick, 1994
4
Main contribution
Compared with the conventional tagging system based on a viewer's facial expression [2006, Yamamoto]:
- Estimate Interest or Neutral
  ⇒ estimate Neutral, Positive, or Negative
- Simple facial feature points
  ⇒ allocate more feature points, extracted by EBGM
- Noisy frames (tilted face, occlusion)
  ⇒ reject noisy frames automatically
5
Experimental environment
- The viewer watches the display alone.
- The viewer's face is recorded on video by a webcam.
- A PC plays back the video content and also analyzes the viewer's face.
(Figure: top view of the experimental environment, showing the display, webcam, PC, and user)
6
Overview of proposed system
(Figure: processing pipeline)
- Face region extraction: AdaBoost
- Facial feature point extraction and person recognition: EBGM
- Personal model (neutral image and personal facial expression classifier) retrieved for the recognized viewer
- Facial expression recognition: SVM
- Tag: Neutral / Positive / Negative / Rejective
7
Face region extraction using AdaBoost
- Extract face regions using AdaBoost based on Haar-like features [2001, Viola].
- The face size is normalized.
- Merit: reduces computation time in the following processes.
8
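As a rough sketch of this step, assuming OpenCV's bundled Haar cascade (a standard Viola-Jones style AdaBoost detector); the cascade file, detection parameters, and normalized face size are illustrative choices, not taken from the paper:

```python
import cv2

# Assumed: OpenCV's bundled frontal-face Haar cascade (Viola-Jones / AdaBoost detector).
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_face(frame_bgr, size=(128, 128)):
    """Return the largest detected face region, resized to a fixed size,
    or None if no face is found (such frames are later tagged Rejective)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda r: r[2] * r[3])   # keep the largest detection
    return cv2.resize(gray[y:y + h, x:x + w], size)      # face size normalization
```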
Facial feature point extraction and person recognition using EBGM [1997, Wiskott]
- A jet is a set of convolution coefficients obtained by applying Gabor kernels with different frequencies and orientations to a point in an image.
- The set of jets extracted at all facial feature points is called a face graph. A bunch of face graphs extracted from many people is called a bunch graph.
- The similarity between the bunch graph and a face graph is computed to search for facial feature points and to recognize the person.
(Figure: Gabor wavelets, a jet, and a bunch graph)
9
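As a rough illustration of what a jet is, the sketch below convolves an image with a small bank of Gabor kernels and samples the responses at one feature point. The kernel sizes and parameter values are arbitrary assumptions; the actual EBGM of [1997, Wiskott] uses complex Gabor wavelets and also exploits phase information.

```python
import cv2
import numpy as np

def gabor_jet(gray, point, n_orient=8, wavelengths=(4, 8, 16, 32)):
    """Sample Gabor filter responses at one facial feature point.
    The vector of responses over all frequencies/orientations is the 'jet'."""
    x, y = point
    img = gray.astype(np.float32)
    jet = []
    for lambd in wavelengths:                    # different spatial frequencies
        for k in range(n_orient):                # different orientations
            theta = k * np.pi / n_orient
            # getGaborKernel(ksize, sigma, theta, lambd, gamma, psi)
            kernel = cv2.getGaborKernel((31, 31), lambd / 2.0, theta, lambd, 1.0, 0)
            response = cv2.filter2D(img, cv2.CV_32F, kernel)
            jet.append(response[y, x])
    return np.array(jet)

def jet_similarity(j1, j2):
    """Normalized dot product: a simple magnitude-based jet similarity."""
    return float(np.dot(j1, j2) / (np.linalg.norm(j1) * np.linalg.norm(j2) + 1e-9))
```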
Facial expression recognition using SVM
- A viewer registers a personal model in advance:
  - a neutral image
  - a personal facial expression SVM classifier
- After EBGM recognizes the viewer, the system retrieves that viewer's personal model.
(Figure: feature vectors for the SVM, illustrated with Neutral (A) and Positive (B) face images)
10
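A minimal sketch of the per-person classifier, assuming scikit-learn's SVC, and assuming as one plausible choice (not necessarily the paper's exact feature) that the feature vector is the displacement of the EBGM feature points from the registered neutral image:

```python
import numpy as np
from sklearn.svm import SVC

CLASSES = ("Neutral", "Positive", "Negative")   # Rejective is decided before the SVM

def make_feature(points, neutral_points):
    """Illustrative feature: displacement of each facial feature point
    from its position in the registered neutral image."""
    return (np.asarray(points, dtype=float) -
            np.asarray(neutral_points, dtype=float)).ravel()

def train_personal_classifier(point_sets, labels, neutral_points):
    """Train one multi-class SVM per registered viewer (the 'personal model')."""
    X = np.array([make_feature(p, neutral_points) for p in point_sets])
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    clf.fit(X, labels)                           # labels drawn from CLASSES
    return clf
```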
Definition of Facial Expression Classes

Class            Meaning
Neutral (Neu)    Expressionless
Positive (Pos)   Happiness, laughter, pleasure, etc.
Negative (Neg)   Anger, disgust, displeasure, etc.
Rejective (Rej)  Not watching the display from the front, occluding part of the face, tilting the face, etc.
11
Example of “Neutral”
12
Example of “Positive”
13
Example of “Negative”
14
Example of “Rejective”
15
Automatic rejection
- If no face region is extracted in a frame, the frame is tagged as "Rejective".
- If a face region is extracted, the frame is used as training and testing data for the SVM.
(Flow: face region extraction → No → Rejective; Yes → SVM → Neutral / Positive / Negative)
16
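Putting the rejection rule together with the earlier steps, a sketch of the per-frame decision; extract_face and make_feature refer to the sketches above, while locate_feature_points and the personal_model fields are hypothetical placeholders:

```python
def tag_frame(frame, personal_model):
    """Per-frame decision: no detected face -> Rejective, otherwise SVM output."""
    face = extract_face(frame)                     # AdaBoost face detection (sketched earlier)
    if face is None:
        return "Rejective"                         # occlusion, tilted face, looking away, ...
    points = locate_feature_points(face)           # hypothetical EBGM feature-point step
    feature = make_feature(points, personal_model.neutral_points)
    return personal_model.svm.predict([feature])[0]   # Neutral / Positive / Negative
```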
Experimental conditions
- Two subjects (A, B) watched four videos.
- The videos were about 17 minutes long on average.
- The category was "variety shows".
- The system recorded each subject's facial video at 15 fps, synchronized with the video content.
- Afterwards, the subjects tagged the videos with the four labels.
17
Tagged results (manual)

            Neu      Pos     Neg     Rej     Total (frames)
Subject A   49865    7665    3719    1466    62715
Subject B   56531    2347    3105     775    62758
These experimental videos and the manual tags were used as training and testing data in the following SVM experiments.
18
Preliminary experiment 1
Experiment on the performance of face region extraction using AdaBoost
(System overview diagram, with the AdaBoost face region extraction stage highlighted)
19
Preliminary experiment 1
Face region extraction using AdaBoost

Subject A           Neu       Pos      Neg
False extraction    20        3        1
Total frames        49865     7665     3719
Rate (%)            0.040     0.039    0.027

Subject B           Neu       Pos      Neg
False extraction    132       106      9
Total frames        56531     2347     3105
Rate (%)            0.234     4.516    0.290
20
Preliminary experiment 2
Experiment on the performance of person recognition using EBGM
(System overview diagram, with the EBGM person recognition stage highlighted)
21
Preliminary experiment 2
Person recognition using EBGM
(Frames with false face extraction in preliminary experiment 1 are excluded from the totals.)

Subject A            Neu       Pos      Neg
False recognition    2         0        0
Total frames         49845     7662     3718
Rate (%)             0.004     0.000    0.000

Subject B            Neu       Pos      Neg
False recognition    2         20       0
Total frames         56399     2241     3096
Rate (%)             0.004     0.893    0.000
22
Experiment
Experiment on the performance of facial expression recognition using SVM
(System overview diagram, with the SVM facial expression recognition stage highlighted)
23
Experiment
Facial expression recognition using SVM
- Three of the four experimental videos (11,000 sec × 15 fps × 3 videos ≈ 500,000 frames) were used as training data, and the remaining video as testing data.
- This was repeated as cross validation, with each video used once as the test data.
- Precision and recall rates were computed as follows.
24
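The formulas themselves are not legible in this copy of the slides; the standard per-class definitions below are consistent with the confusion matrices in the appendix (recall computed along rows, precision along columns):

```latex
\mathrm{Recall}(c) =
  \frac{\#\{\text{frames manually tagged } c \text{ and classified as } c\}}
       {\#\{\text{frames manually tagged } c\}}
\qquad
\mathrm{Precision}(c) =
  \frac{\#\{\text{frames manually tagged } c \text{ and classified as } c\}}
       {\#\{\text{frames classified as } c\}}
```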
Experimental results
Facial expression recognition using SVM
(Bar chart of precision and recall per class, averaged over subjects A and B)

        Precision   Recall
Neu     0.98        0.98
Pos     0.93        0.89
Neg     0.88        0.82
Rej     0.74        0.81
25
Discussion
- The averaged recall rate was 87% and the averaged precision rate was 88%.
- When a subject expressed an emotion only modestly, the system often mistook the facial expression for Neutral.
- When a subject showed an intermediate facial expression, the system often made mistakes, because only one expression class was assumed per frame.
26
Demo
27
Conclusion and future work
- We proposed a system that tags video contents with Neutral, Positive, Negative, and Rejective labels based on the viewer's facial expression.
- Future work:
  - evaluation on various categories and with various subjects
  - combination of user analysis and content analysis
  - construction of an automatic video content recommendation system
28
Experimental Result
Confusion matrix - subject A (rows: manual tag, columns: system output)

Subject A       Neu      Pos     Neg     Rej     Sum      Recall (%)
Neu             48275    443     525     622     49865    96.81
Pos             743      6907    1       14      7665     90.11
Neg             356      107     3250    6       3719     87.39
Rej             135      0       5       1326    1466     90.45
Sum             49509    7457    3781    1968    62715    91.19 (avg.)
Precision (%)   97.51    92.62   85.96   67.38   85.87 (avg.)
30
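As a worked check of how the recall, precision, and averaged values in this table are obtained (counts copied from the table above):

```python
import numpy as np

# Rows: manual tag, columns: system output (Neu, Pos, Neg, Rej), subject A.
cm = np.array([[48275,  443,  525,  622],
               [  743, 6907,    1,   14],
               [  356,  107, 3250,    6],
               [  135,    0,    5, 1326]])

recall = np.diag(cm) / cm.sum(axis=1)      # per-class recall    -> 96.81, 90.11, 87.39, 90.45 (%)
precision = np.diag(cm) / cm.sum(axis=0)   # per-class precision -> 97.51, 92.62, 85.96, 67.38 (%)
print((100 * recall).round(2), (100 * precision).round(2))
print(round(100 * recall.mean(), 2), round(100 * precision.mean(), 2))  # 91.19, 85.87
```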
Experimental Result
Confusion matrix - subject B (rows: manual tag, columns: system output)

Subject B       Neu      Pos     Neg     Rej     Sum      Recall (%)
Neu             56068    138     264     61      56531    99.18
Pos             231      2076    8       32      2347     88.45
Neg             641      24      2402    38      3105     77.36
Rej             203      0       21      551     775      71.10
Sum             57143    2238    2695    682     62758    84.02 (avg.)
Precision (%)   98.12    92.76   89.13   80.79   90.20 (avg.)
90.20
31
Computation time

Recording facial video (15fps)

Processing facial video
 Pentium4 3GHz (1fps)
 Core2 Quad 2.4GHz (4fps)

If a user watches video content within about 6 hours
per day, a system(Core2 Quad) can complete the
process
32
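A back-of-the-envelope check of this claim, assuming the processing can run around the clock:

```python
record_fps, process_fps = 15, 4        # webcam recording rate vs. Core 2 Quad processing rate
processing_hours_per_day = 24          # assume processing runs all day, including while recording
viewable_hours = processing_hours_per_day * process_fps / record_fps
print(viewable_hours)                  # about 6.4 hours of 15 fps video can be processed per day
```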
Conventional system vs. proposed system
- The video data used to evaluate the conventional system is not publicly released, so a direct comparison is not easy.
33
Learning in advance
- Current proposed system: the viewer registers in advance
  - the viewer's neutral image
  - a personal facial expression classifier: the viewer watches video content (for about 50 minutes) and then tags it
- Future work (to save the user's trouble): semi-supervised or unsupervised learning
34
The video categories
- Experiment in this study: only "variety shows"
- Future work: "dramas", "news", etc.
  However, we expect the system may not work as well on such videos, because a viewer's facial expression changes little while watching them.
35
Conventional facial expression recognition
- Many studies on facial expression recognition have already been carried out.
- So we also need to consider other methods.
36
Evaluation
- The reported values were computed as:
  - (averaged recall for subject A + averaged recall for subject B) / 2
  - (averaged precision for subject A + averaged precision for subject B) / 2
37
Q and A
- About real-time performance
- What is the point of increasing the number of feature points?
- Learning in advance
- Is it enough to use only variety shows?
- How is content recommendation actually performed?
- Why EBGM? What about AAM?
- Horror movies
- Ekman's six basic facial expressions
- When is the tagging done?
- TVs do not have built-in cameras
- Wouldn't it be enough for the viewer to just press a button?
- Processing time could be shortened by skipping frames in which the face is not moving
- Which is better, tagging every frame or tagging intervals?
- Aren't keywords or EPG data sufficient by themselves?
- Being filmed causes psychological resistance for some viewers
- Where does "being moved by a drama" fall (in which class)?
- Manual tagging is done by the subjects themselves; what if someone else did it?
- Is there any effect from watching a video twice?
- How does the Rejective mechanism work?
- How does it compare with other methods in the facial expression recognition field?
- Automatic program recommendation would require a large number of subjects; is that feasible?
- Are two subjects enough?
- How do you guarantee that subjects neither exaggerate nor suppress their facial expressions?
- When tagging, do subjects watch the content or their own faces?
- How do you recommend a program when the facial expression cannot be recognized? Is there a plan?
- What is the technical novelty? Isn't it just a combination of existing techniques?
- Doesn't registering a neutral image and facial expressions in advance become a barrier when a large number of users use the system?
- What about sports, movies, and so on?
- In a real environment, how do you handle illumination changes and face motion?
- What about TRECVID and similar benchmarks?
- The main contribution is not stated in the conclusion
- What is collaborative filtering?
38