PANDA: Pose Aligned Networks for Deep Attribute Modeling
Ning Zhang1,2, Manohar Paluri1, Marc’Aurelio Ranzato1, Trevor Darrell2, Lubomir Bourdev1
1 Facebook AI Research, 2 EECS, UC Berkeley

Why is attribute classification challenging?
• Low resolution
• Pose variations
• Occlusion

Toward attribute classification
• Transfer knowledge [Lampert et al. (CVPR 09), Farhadi et al. (CVPR 09)]
• Facial attribute [Kumar et al. (ICCV 09)]
• Part-based approach [Bourdev et al. (ICCV 11), Zhang et al. (ICCV 13), Joo et al. (ICCV 13)]

Progress in deep learning
• image classification [Krizhevsky et al. NIPS 12, Zeiler et al. ICLR 14]
• object detection [Girshick et al. CVPR 14]
• human pose estimation [Toshev et al. CVPR 14]
• face verification [Taigman et al. CVPR 14]

Can we train CNN from scratch?
[Diagram labels: bounding box images, attribute.]
Lack of training data!

method                 mean AP
Joo et al. ICCV 2013   70.7
CNN from scratch       58.11

What if we finetune from ImageNet?
DeCAF [Donahue et al. ICML 2014] + classifier: gender, short sleeves, wear hat, wear shorts, long hair

method          mean AP
Joo et al.      70.7
from scratch    58.11
from ImageNet   67.49

How can we simplify the task?
Decompose the image into parts
Part-based approach [Bourdev et al. (ICCV 11), Zhang et al. (ICCV 13), Joo et al.
(ICCV 13)]

Decompose the image into parts
gender, long hair, wear long pants, wear jeans, wear hat, wear glasses, is sitting, is playing tennis, is dancing, …

Our approach
• Part-based models
• Pose normalization
• Deep convolutional networks
• Discriminative feature representation
Pose Aligned Networks for Deep Attribute modeling (PANDA)

Poselets capture part of the pose from a given viewpoint [Bourdev & Malik, ICCV 2009]

PANDA
[Diagram: the whole person region, poselet 1, poselet 2, and so on each pass through a CNN; the CNN outputs form the final representation, which feeds an SVM classifier for each attribute: gender, short sleeves, wear hat, wear shorts, long hair.]

Part-Level CNN (one CNN for each poselet)
[Diagram: input poselet RGB patch, 56x56x3; conv and norm+pool stages with 5x5 and 3x3 filters and 64 channels; feature maps 28x28x64, 12x12x64, 6x6x64, 3x3x64; 576-dimensional layer labeled "generic attribute layer"; fc_attr with one 128-unit branch per attribute (attribute 1, attribute 2, ..., attribute N).]

Final Representation
[Diagram: the holistic input and the poselet RGB patches each pass through a CNN of the form shown for the part-level CNN; the resulting features form the final representation, which feeds a linear SVM per attribute: gender, short sleeves, wear hat, wear shorts, long hair.]

Dataset: Attribute 25k
[Chart: distribution of ground truth labels (positive, negative, unspecified) for baby, short sleeves, sunglasses, dress, glasses, hat, long hair, male.]
2061 training examples per poselet on average

RESULTS
[Bar chart: Average Precision (AP) on Attribute 25k, AP axis 0 to 100; attributes: male, long hair, hat, glasses, dress, sunglasses, short sleeves, baby; legend: Bourdev et al. ICCV 11; DPD, Zhang et al. ICCV 13.]

Average Precision (AP) on Attribute 25k: 100% improvement
Bar chart, AP axis 0 to 100; attributes: male, long hair, hat, glasses, dress, sunglasses, short sleeves, baby.
Legend: Bourdev et al. ICCV 11; DPD, Zhang et al.
ICCV 13; PANDA

Component Evaluation

method                        mean AP
PANDA (Holistic + Poselets)   70.74
Holistic only                 44.97
Poselets only                 64.72
Holistic + DPM                61.20

Poselets vs DPM
Frontal face poselet vs. head DPM
• Forced to fire no matter what
• Mixes different poses
• Alignment noise

Transfer learning
What about new attributes?
[Diagram: the part-level CNN, from poselet RGB patch input (56x56x3) to the 576-dimensional layer and the 128-unit fc_attr branches for attribute 1, attribute 2, ..., attribute N.]
• Add new attributes and retrain the CNNs
• Use the same CNNs, only retrain the SVM classifier

attribute   AP       frequency baseline
smiling     84.7%    40.67%
walking     26.0%    4.34%
sitting     25.70%   7.65%

AP on Berkeley Attributes of People Dataset
[Bar chart, AP axis 50 to 85. Methods: Bourdev et al., DPD, Joo et al., DL-Pure, Holistic (DeCAF), PANDA, PANDA-parts-only. Value labels on the bars (order not preserved): 78.98, 69.88, 70.7, 69.15, 67.49, 65.18, 58.11.]
The part-level CNNs are trained using Attribute 25k data.
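
The second transfer-learning option on the slides, keeping the CNNs fixed and retraining only the SVM classifier for a new attribute, can be sketched as follows. This is a minimal illustration and not the authors' code: the feature vectors are made-up toy stand-ins for frozen CNN features, the names `train_linear_svm` and `svm_score` are hypothetical, and the solver (stochastic subgradient descent on the hinge loss) is an assumption, since the slides do not say how the linear SVM is trained.

```python
import random


def train_linear_svm(features, labels, epochs=200, lr=0.1, reg=0.01, seed=0):
    """Train a linear SVM (hinge loss + L2 penalty) by stochastic subgradient descent.

    features: list of equal-length float lists (stand-ins for frozen CNN features)
    labels:   list of +1 / -1 labels for the new attribute
    """
    rng = random.Random(seed)
    w = [0.0] * len(features[0])
    b = 0.0
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = features[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # The L2 penalty shrinks the weights on every step.
            w = [wj * (1.0 - lr * reg) for wj in w]
            # The hinge term contributes only when the margin is violated.
            if margin < 1.0:
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b


def svm_score(w, b, x):
    """Signed score; positive means the attribute is predicted present."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


# Toy stand-in data (assumption): in the transfer setting these would be
# features from the already-trained, unchanged CNNs.
toy_features = [[2.0, 0.1], [1.5, -0.2], [1.8, 0.3],
                [-2.0, 0.2], [-1.6, -0.1], [-1.9, 0.0]]
toy_labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(toy_features, toy_labels)
predictions = [1 if svm_score(w, b, x) > 0 else -1 for x in toy_features]
```

Only `w` and `b` are learned here; nothing upstream of the features changes, which is what makes this option cheap compared with retraining the CNNs.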
Top scoring examples
wear glasses, short hair, female

Top scoring examples
wear hat, wear shorts, wear jeans

Failure Cases
• Hard to see skin
• Unusual pose
• Predicted: long sleeves, ground truth: short sleeves
• Predicted: short pants, ground truth: long pants
• Annotation errors
• Ambiguous

Gender Recognition on Labeled Faces in the Wild
Much easier dataset: no occlusion, high resolution, centered frontal faces
Male or female?

Method                 Gender AP
Kumar et al            95.52
Frontal face poselet   96.43
PANDA                  99.54
[Kumar et al, ICCV 2009]

Does more data help?
[Line chart: Average Precision (AP), axis 89 to 97, against number of training examples in k, axis 0 to 100; curves: DL-Pure, DL-Poselets.]

Comparison
Bourdev et al. ICCV 11
• Uses poselets as part-based model
• Has context-level attribute classifier
• Uses HOG + color + skin + part masks
PANDA
• Uses poselets as part-based model
• Attributes are jointly trained
• Trains part-level CNNs for powerful discriminative features
• Generalizes much better to new attributes

Conclusion
• Part-based models
• Pose normalization
• Deep convolutional networks
• Discriminative feature representation
• Pose normalization significantly helps deep convolutional networks in the task of attribute classification.
• Mid-level parts remain important in the context of CNNs.

Thanks!
• Code and pre-trained models will be released soon.
*None of the images in these slides are taken from Facebook.

Running time
• Single CPU
• 13s (poselet detection) + 2s (feature extraction)
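
The results in these slides are reported as average precision (AP), next to a "frequency baseline" for the new attributes. The sketch below shows one common way to compute AP from ranked classifier scores. It rests on assumptions: the slides do not state the exact evaluation protocol, the function names are hypothetical, and reading the frequency baseline as the fraction of positive labels is an interpretation, not something the slides confirm.

```python
def average_precision(scores, labels):
    """Mean of the precision values measured at the rank of each positive.

    scores: classifier scores, higher means more confident positive
    labels: 1 for positive, 0 for negative (examples with unspecified
            labels are assumed to have been removed beforehand)
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def frequency_baseline(labels):
    """Fraction of positive labels (assumed meaning of 'frequency baseline')."""
    return sum(labels) / len(labels)


# Toy example with made-up scores and labels.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 0, 1, 1, 0]

ap = average_precision(toy_scores, toy_labels)
baseline = frequency_baseline(toy_labels)
```

In the toy example the positives sit at ranks 1, 3 and 4, so AP is the mean of 1/1, 2/3 and 3/4, while the baseline is 3/5. Comparing the two numbers is the same kind of comparison the slides make for smiling, walking and sitting.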