Slides

PANDA: Pose Aligned Networks for Deep Attribute Modeling
Ning Zhang (1,2), Manohar Paluri (1), Marc'Aurelio Ranzato (1), Trevor Darrell (2), Lubomir Bourdev (1)
(1) Facebook AI Research  (2) EECS, UC Berkeley

Why is attribute classification challenging?
•  Low resolution
•  Pose variations
•  Occlusion

Toward attribute classification
•  Transfer knowledge [Lampert et al. (CVPR 09), Farhadi et al. (CVPR 09)]
•  Facial attributes [Kumar et al. (ICCV 09)]
•  Part-based approach [Bourdev et al. (ICCV 11), Zhang et al. (ICCV 13), Joo et al. (ICCV 13)]

Progress in deep learning
•  Image classification [Krizhevsky et al. NIPS 12, Zeiler et al. ICLR 14]
•  Object detection [Girshick et al. CVPR 14]
•  Human pose estimation [Toshev et al. CVPR 14]
•  Face verification [Taigman et al. CVPR 14]

Can we train a CNN from scratch?
[Diagram: bounding box images -> CNN -> attribute]
Lack of training data!

method                  mean AP
Joo et al. ICCV 2013    70.7
CNN from scratch        58.11

What if we finetune from ImageNet?
[Diagram: DeCAF [Donahue et al. ICML 2014] -> classifier -> gender, short sleeves, wear hat, wear shorts, long hair]

method                  mean AP
Joo et al.              70.7
from scratch            58.11
from ImageNet           67.49

How can we simplify the task?
Decompose the image into parts
Part-based approach [Bourdev et al. (ICCV 11), Zhang et al. (ICCV 13), Joo et al. (ICCV 13)]
Example attributes: gender, long hair, wear long pants, wear jeans, wear hat, wear glasses, is sitting, is playing tennis, is dancing, …

Our approach
•  Part-based models: pose normalization
•  Deep convolutional networks: discriminative feature representation
Pose Aligned Networks for Deep Attribute modeling (PANDA)

Poselets capture part of the pose from a given viewpoint [Bourdev & Malik, ICCV 2009]

PANDA
[Diagram: the whole person region, poselet 1, poselet 2, ... each pass through a CNN; the outputs form the final representation, which feeds an SVM classifier for gender, short sleeves, wear hat, wear shorts, long hair]

Part-Level CNN
each poselet -> CNN
Input: Poselet RGB patches
[Figure: part-level CNN architecture, built up over several slides. Recoverable labels: input of 56x56x3 poselet RGB patches; conv stages with 5x5 and 3x3 filters and norm+pool; feature maps of 28x28x64, 12x12x64, 6x6x64 and 3x3x64; a 576-unit layer; one 128-unit fc_attr branch per attribute (attribute 1, attribute 2, ..., attribute N), labelled "Generic attribute layer".]

Final Representation
[Figure: the part-level CNNs on the poselet RGB patches and a holistic CNN are combined into the final representation, which feeds a linear SVM for each attribute (attribute 1, attribute 2, ..., attribute N).]
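The feature-map sizes labelled in the part-level CNN diagram (56x56x3 input, then 28x28x64, 12x12x64, 6x6x64, 3x3x64 and a 576-unit layer) can be checked with simple shape arithmetic. The sketch below is only a consistency check, not the authors' implementation: the stride-1 convolutions, the per-stage padding values and the 2x2 non-overlapping pooling are assumptions chosen because they reproduce the sizes on the slides, and the helper names are hypothetical.

```python
# Shape arithmetic for the part-level CNN, under assumed padding/pooling.

def conv_out(size, kernel, pad):
    """Spatial size after a stride-1 convolution with the given padding."""
    return size + 2 * pad - kernel + 1


def pool_out(size, window=2):
    """Spatial size after non-overlapping pooling (assumed 2x2)."""
    return size // window


# (kernel size, padding) for each conv stage. The kernel sizes 5, 5, 3, 3
# appear in the diagram; the padding values are ASSUMPTIONS.
STAGES = [(5, 2), (5, 0), (3, 1), (3, 1)]
CHANNELS = 64  # every feature map in the diagram has 64 channels


def trace_shapes(input_size=56):
    """Return the (height, width, channels) of each stage's output."""
    shapes = []
    size = input_size
    for kernel, pad in STAGES:
        size = pool_out(conv_out(size, kernel, pad))
        shapes.append((size, size, CHANNELS))
    return shapes


shapes = trace_shapes()
# Flattening the last feature map gives the size of the 576-unit layer.
flat_units = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]

for shape in shapes:
    print("x".join(str(d) for d in shape))
print("flattened:", flat_units)
```

Under these assumptions the last 3x3x64 map flattens to exactly 576 values, which matches the 576 label that precedes the 128-unit fc_attr branches.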
Dataset: Attribute 25k
[Chart: distribution of ground truth labels (Positive, Negative, Unspecified) for baby, short sleeves, sunglasses, dress, glasses, hat, long hair, male]
2061 training examples per poselet on average

RESULTS
[Bar chart: Average Precision (AP) on Attribute 25k, axis 0 to 100, for male, long hair, hat, glasses, dress, sunglasses, short sleeves, baby. Methods: Bourdev et al. ICCV 11, DPD (Zhang et al. ICCV 13) and, on the second slide, PANDA. Annotation: "100% improvement".]

Component Evaluation

method                         mean AP
PANDA (Holistic + Poselets)    70.74
Holistic only                  44.97
Poselets only                  64.72
Holistic + DPM                 61.20

Poselets vs DPM
[Figure: frontal face poselet compared with head DPM]
•  Forced to fire no matter what
•  Mixes different poses
•  Alignment noise

Transfer learning
Input: Poselet RGB patches
What about new attributes?
[Figure: the part-level CNN diagram repeated: 56x56x3 input, conv and norm+pool stages, feature maps of 28x28x64, 12x12x64, 6x6x64 and 3x3x64, a 576-unit layer, and 128-unit fc_attr branches for attribute 1, attribute 2, ..., attribute N.]
Options:
•  Adding new attributes and retrain CNNs
•  Use the same CNNs, only retrain SVM classifier

Results on new attributes:
•  smiling: AP 84.7% (frequency baseline 40.67%)
•  walking: AP 26.0% (frequency baseline 4.34%)
•  sitting: AP 25.70% (frequency baseline 7.65%)

AP on Berkeley Attributes of People Dataset
[Bar chart, axis 50 to 85]

method               mean AP
Bourdev et al.       65.18
DPD                  69.88
Joo et al.           70.7
DL-Pure              58.11
Holistic (DeCAF)     67.49
PANDA                78.98
PANDA-parts-only     69.15

The part-level CNNs are trained using Attribute 25k data.

Top scoring examples
wear glasses, short hair, female, wear hat, wear shorts, wear jeans

Failure Cases
•  Hard to see skin
•  Unusual pose
•  Predicted: long sleeves, ground truth: short sleeves
•  Predicted: short pants, ground truth: long pants
•  Annotation errors
•  Ambiguous

Gender Recognition on Labeled Faces in the Wild
Much easier dataset: no occlusion, high resolution, centered frontal faces

Method                  Gender AP
Kumar et al.            95.52
Frontal face poselet    96.43
PANDA                   99.54

Male or female?
[Kumar et al., ICCV 2009]
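
The tables and charts in these slides report average precision (AP), and the new-attribute results compare AP against a frequency baseline. The sketch below shows one common way to compute both. It rests on two assumptions the slides do not confirm: that AP is the mean of the precision values at the rank of each positive example (other variants, such as interpolated AP, give slightly different numbers), and that the frequency baseline is the fraction of positive examples. The function names and the toy scores are hypothetical.

```python
# Average precision and a frequency baseline for one binary attribute.

def average_precision(scores, labels):
    """Mean of precision@k over the ranks k at which a positive appears.

    scores: classifier scores, higher means more confident positive.
    labels: 1 for a positive example, 0 for a negative one.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    # With no positives the metric is undefined; return 0.0 by convention.
    return sum(precisions) / len(precisions) if precisions else 0.0


def frequency_baseline(labels):
    """Fraction of positive examples (ASSUMED meaning of the baseline)."""
    return sum(1 for label in labels if label) / len(labels)


# Toy example: five images scored for a single attribute.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 0, 1, 0, 0]

ap = average_precision(toy_scores, toy_labels)
baseline = frequency_baseline(toy_labels)
print(f"AP = {ap:.4f}, frequency baseline = {baseline:.2f}")
```

In the toy example the positives sit at ranks 1 and 3, so AP is the mean of 1/1 and 2/3. Comparing AP with the positive frequency matters for rare attributes: an AP of 26.0% looks low in isolation but is well above a 4.34% baseline.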

Does more data help?
[Plot: Average Precision (AP), axis 89 to 97, against number of training examples in k, axis 0 to 100, for DL-Pure and DL-Poselets]

Comparison
Bourdev et al. ICCV 11
•  Uses poselets as part-based model
•  Has context-level attribute classifier
•  Uses HOG + color + skin + part masks
PANDA
•  Uses poselets as part-based model
•  Attributes are jointly trained
•  Trains part-level CNNs for powerful discriminative features
•  Generalizes much better to new attributes

Conclusion
Part-based models
Pose normalization
Deep convolutional networks
Discriminative feature representation
•  Pose normalization significantly helps deep convolutional networks in the task of attribute classification.
•  Mid-level parts remain important in the context of CNNs.

Thanks!
•  Code and pre-trained models will be released soon.
*None of the images in these slides are taken from Facebook.

Running time
•  Single CPU
•  13s (poselet detection) + 2s (feature extraction)