Perception, Prediction, and Control for Social Machines

Research Statement
Hyun Soo Park
www.cs.cmu.edu/~hyunsoop
[email protected]
Carnegie Mellon University
Humans interact by sending visible social signals, such as facial expressions, body gestures, and gaze
movements. Such scenes are common in our daily lives and, increasingly, artificial agents are entering
these spaces. For artificial agents to cohabit these scenes with humans as collaborating team members,
they must be equipped with social intelligence1: the ability to understand social signals and to
behave in a socially acceptable manner. However, designing these machines is challenging because (1) social
signals are often too subtle for current action recognition systems to detect; (2) social signals are related to
each other in complicated ways; and (3) immediate responses to these signals are a key factor in producing socially
acceptable behaviors in dynamic social scenes. The overarching goal of my research is to address these key
challenges: to design machines that integrate social intelligence into their functions.
Designing such machines requires three fundamental research thrusts: (1) Subtle motion perception: the
machines must be able to measure subtle social signals from visual input and transform them into computational representations; (2) Predictive social modeling: the machines must be able to build a predictive
model of social behaviors by reasoning about the relationships between social signals; (3) Control and design
with social intelligence: the machines must be able to produce the desired output from the received social
signals by incorporating the predictive social model. These three thrusts are highly interrelated: adaptive
behavior in dynamic social scenes requires a closed-loop system that integrates perception, prediction, and
control for social intelligence. In my prior work, I have studied these thrusts:
• Subtle motion perception: 3D reconstruction of social signals provides a computational representation that allows the machines to analyze social scenes, e.g., to build a model, reason about
relationships, and predict social behaviors. Because social signals evolve in the form of motion, such as
body gestures, my colleagues and I have presented a method to reconstruct 3D scene motion from a
collection of images [1]2 and extended this work to reconstruct human body motion [2]3 . To capture
detailed subtle motion, we designed the Panoptic Studio, a markerless motion capture system with 480
synchronized cameras, which enables us to reconstruct subtle motion in 3D at high resolution [3]4 , as
shown in Figure 1(a).
• Predictive social modeling: The ability to predict social behaviors is a fundamental measure of
social intelligence because it enables the machines to plan their functions with respect to the predicted
behaviors. I am particularly interested in gaze directions, which reflect attentive behaviors. We have
presented a predictive field model that allows a machine to predict gaze directions at any location
in a social scene [4]5 . The field model encodes the relationship between gaze directions and sources of
attention, and we estimated this field from the gaze directions of a few observed social members [5]6 , as
shown in Figure 1(b). We exploited head-mounted cameras to estimate gaze directions, as the ego-motion
of body-mounted cameras reflects 3D body motion [6]7 .
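To make the gaze-to-attention relationship concrete, the following is a minimal, illustrative sketch of one common geometric sub-step: recovering a single 3D source of attention as the least-squares intersection of several observed gaze rays. This is a textbook construction, not the estimator of [4, 5], and all names and parameters are hypothetical.

```python
import numpy as np

def attention_source(origins, directions):
    """Least-squares 3D intersection of gaze rays (illustrative sketch).

    Each gaze is modeled as a ray p(t) = o_i + t * d_i with unit direction
    d_i. The point nearest to all rays solves the normal equations
        sum_i (I - d_i d_i^T) p = sum_i (I - d_i d_i^T) o_i,
    where (I - d d^T) projects onto the plane normal to the ray.
    """
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, directions):
        d = np.asarray(d, dtype=float)
        d = d / np.linalg.norm(d)          # normalize the gaze direction
        M = np.eye(3) - np.outer(d, d)     # projector onto the ray's normal plane
        A += M
        b += M @ np.asarray(o, dtype=float)
    return np.linalg.solve(A, b)           # fails only if all rays are parallel
```

In practice, a social scene contains multiple simultaneous sources and noisy gaze estimates; this closed-form intersection only illustrates the geometric core of relating gaze directions to a source of attention.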
1 E. L. Thorndike, “Intelligence and its use”, Harper’s Magazine, 1920.
2 http://www.cs.cmu.edu/~hyunsoop/trajectory_reconstruction.html
3 http://www.cs.cmu.edu/~hyunsoop/articulated_trajectory.html
4 The password protected video can be found at https://vimeo.com/82754209 (pw: dome).
5 http://www.cs.cmu.edu/~hyunsoop/socialcharge.html
6 http://www.cs.cmu.edu/~hyunsoop/gaze_concurrence.html
7 https://www.youtube.com/watch?v=xbI-NWMfGPs
[Figure 1 image panels: (a) Motion reconstruction; (b) Predictive social model (labels: sources of attention, gaze directions); (c) Water running robot (labels: motors and encoders, tail, compliant feet)]
Figure 1: (a) We reconstructed subtle motion in 3D using a markerless motion capture system. Top row: sampled
image input; bottom row: reconstructed trajectories. (b) We estimated a predictive model of the social scene
that encodes the relationship between gaze directions and sources of attention. Top row: sampled video input from
head-mounted cameras; bottom row: estimated sources of attention. (c) We studied the control and design of a
quadrupedal water running robot. Top row: robot design; bottom row: high-speed images of the robot running
on the water surface.
• Control and design with social intelligence: Given this computational basis, a fundamental
question in my research is “how do we integrate social signals into a control loop to produce the
desired function?” This question opens a new domain of design problems that address the relationship
between social interactions and system control. I plan to tackle this question by leveraging my
previous research on robot design and control [7–10]. My colleagues and I have studied the dynamics of
a quadrupedal water running robot via mathematical modeling of the interactions between the water
surface and its footpads [8, 10], as shown in Figure 1(c). This model was used to design a control loop
that balances the robot's locomotion on the water surface [7, 9].
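The shape of such a control loop can be sketched with a generic linearized balance model under proportional-derivative (PD) feedback. This is not the published model of [7–10]; the dynamics and all coefficients below are assumed purely for illustration.

```python
def simulate_pitch(theta0=0.3, kp=20.0, kd=6.0, dt=0.001, steps=5000):
    """Euler simulation of an assumed linearized pitch model under PD control.

    Open-loop dynamics (illustrative, not the published model):
        theta'' = g_eff * theta + u
    where g_eff > 0 is a destabilizing term (an unbalanced robot tips over)
    and u is the control torque. PD feedback u = -kp*theta - kd*theta'
    makes the closed loop stable for kp > g_eff, kd > 0.
    """
    g_eff = 5.0                      # assumed destabilizing coefficient
    theta, omega = theta0, 0.0       # pitch angle (rad) and pitch rate (rad/s)
    for _ in range(steps):
        u = -kp * theta - kd * omega     # PD feedback law
        alpha = g_eff * theta + u        # angular acceleration
        omega += alpha * dt              # forward-Euler integration
        theta += omega * dt
    return theta
```

With the assumed gains, the closed-loop pitch decays toward zero, whereas the uncontrolled model (kp = kd = 0) diverges; the point is only the structure of the loop, i.e., measured state feeding a corrective control input.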
My prior work provides a computational foundation for social scene understanding and robot control.
By leveraging this work, I will address the design of machines with social intelligence. In particular,
I plan to pursue three goals that align with my research direction:
A) To reconstruct subtle social signals or the intent of interactions: Subtle social signals, such as
a fleeting cynical smile, a small nod, or a finger gesture, often convey an important intent or message during
human interactions, yet machines are blind to them. I aim to reconstruct such signals in detail from
cameras worn or held by members of a social group, as demonstrated by dense motion capture in the
Panoptic Studio [3]. The main challenge of subtle motion capture is establishing correspondences of moving
points across cameras, because the baselines between these cameras are usually larger than those between
cameras in the Panoptic Studio. I am interested in studying the relationship between reconstruction quality,
or ambiguity, and camera motion, which will characterize desirable camera motion [1]. When such camera
motion is not available, reconstructing subtle motion can benefit from data captured in the Panoptic Studio.
I also plan to investigate spatiotemporal features tailored to human interactions.
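Once correspondences across cameras are established, the geometric core of multi-camera reconstruction is triangulating each point from synchronized views with known projection matrices. The sketch below shows standard linear (DLT) triangulation, a textbook baseline rather than the specific methods of [1] or [3]; applied per frame to corresponding points, it recovers a 3D motion trajectory.

```python
import numpy as np

def triangulate_point(projections, points_2d):
    """Linear (DLT) triangulation of one 3D point from N calibrated views.

    projections: list of 3x4 camera projection matrices P_i.
    points_2d:   list of (x, y) observations in normalized image
                 coordinates, one per camera.
    Each view contributes two linear constraints from x_i ~ P_i X; the
    homogeneous least-squares solution is the right singular vector of
    the stacked system with smallest singular value.
    """
    A = []
    for P, (x, y) in zip(projections, points_2d):
        A.append(x * P[2] - P[0])  # x * (row 3) - (row 1) = 0
        A.append(y * P[2] - P[1])  # y * (row 3) - (row 2) = 0
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]                     # null-space direction of the stacked system
    return X[:3] / X[3]            # dehomogenize to a 3D point
```

The challenge noted above, wide baselines, shows up here only indirectly: triangulation itself is easy, but the correspondences it consumes become much harder to establish as the viewpoints spread apart.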
B) To reason about causal/reciprocal relationships between social interactions and to predict
social affordance: Social signals are multi-directional. A signal transmitted by a social member
may trigger responses from other members, and those responses can, in turn, affect the behavior of the
original member. For instance, a lecturer's pointing gesture (one social signal) triggers a change in the
gaze directions of students (responses), and the lecturer can identify the students' attention from the
alignment of their gaze directions. I have been involved in the Computational Behavioral Science program,
an NSF Expedition8 , where we extract social signals and find correlations between them. Based on this
experience and the 3D representation of social signals, my research will aim to discover causal and reciprocal
relationships between social signals to build a predictive model of social scenes.
8 http://www.cbs.gatech.edu/
The social affordance of space is another relationship, between social signals and the environment. Some
scene structures, such as chairs, tables, or sofas, are strongly related to human interactions, and the spatial
arrangement of such structures characterizes social space. For example, human interactions are more likely
to take place at a sofa in a lobby than at the corner of a wall or under a table. By leveraging the social field
model based on sources of attention [4, 5], I will study how social interactions are spatially related to
3D scene structure and measure its social affordance. This will provide a richer predictive model of
human interactions. Architects can benefit from this social affordance because it will quantify how socially
friendly a space is.
C) To build a multi-agent system that reacts according to social interactions: Social intelligence
enables machines to use social signals in their tasks. As a first step, I will design a multi-agent system
that fully exploits the social intelligence developed in the previous two tasks. A predictive model [4] indicates
how humans will behave at a given location and time. Using this predictive model, I plan to create an automatic broadcasting/capturing system built on small navigational robot platforms, such as manipulators, wheeled mobile
robots, or quadrotors, for sports, search and rescue, and medical scenes. Sources of attention will be estimated by measuring the gaze directions of social members [5], and the paths of the robots will be planned based
on the predictive model.
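A minimal sketch of how a predictive model could drive robot paths (purely illustrative; the planner, names, and parameters below are assumptions, not the proposed system): a camera robot greedily tracks a sequence of predicted attention points while keeping a fixed filming standoff distance.

```python
import numpy as np

def camera_waypoints(predicted_attention, robot_start, standoff=2.0, step=0.5):
    """Greedy 2D waypoint planner for a camera robot (illustrative sketch).

    predicted_attention: sequence of predicted (x, y) attention points,
                         one per planning tick.
    For each predicted point, the robot moves at most `step` per tick
    toward the circle of radius `standoff` around it, along its current
    bearing, maintaining a stable filming distance.
    """
    pos = np.asarray(robot_start, dtype=float)
    path = [pos.copy()]
    for target in predicted_attention:
        target = np.asarray(target, dtype=float)
        offset = pos - target
        dist = np.linalg.norm(offset)
        # Desired viewpoint: on the standoff circle, along the current bearing
        # (arbitrary default bearing if the robot sits exactly on the target).
        bearing = offset / dist if dist > 1e-9 else np.array([1.0, 0.0])
        goal = target + bearing * standoff
        move = goal - pos
        if np.linalg.norm(move) > step:          # clip to the per-tick speed limit
            move = move / np.linalg.norm(move) * step
        pos = pos + move
        path.append(pos.copy())
    return path
```

A real system would replace this greedy rule with path planning under the full predictive model, and would add collision avoidance and visibility constraints; the sketch only shows the coupling of prediction to motion.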
Broad Impact
This work on social intelligence will impact various domains of research, including computer vision, graphics,
architecture, and sociology. In computer vision, classic scene understanding has focused on understanding
the structure of scenes, e.g., where a sofa is, how to navigate a room, or what an object is useful for. My
research will provide a new interpretation of scene understanding in terms of social intelligence, e.g., what
people attend to, what people want to accomplish, and with whom people are interacting. Reconstruction
of subtle motion will enable highly accurate reconstruction and recognition systems, whereas existing
frameworks primarily address gross motion. In graphics, reconstructing intent from these social signals
will introduce a new axis of motion capture, i.e., capturing human intent, and the predictive social model
will allow us to animate virtual characters in massively multiplayer online games (MMOGs) or movies in
a socially engaging way. In architecture, social affordance will define the relationship between social
interactions and space, which can provide architectural criteria for designing social space. The study of human
social behavior can also benefit from social scene understanding. Previous behavioral studies have relied
heavily on manually annotated action predicates. My approach will provide automatic action annotations
and learned relationships between actions, which will enable a machine to detect early stages of behavioral
disorders, such as autism, and provide an empirical foundation for behavioral research.
References
[1] H. S. Park, T. Shiratori, I. Matthews, and Y. Sheikh, “3D reconstruction of a moving point from a series of 2D projections,”
in European Conference on Computer Vision, 2010.
[2] H. S. Park and Y. Sheikh, “3D reconstruction of a smooth articulated trajectory from a monocular image sequence,” in
International Conference on Computer Vision, 2011.
[3] H. Joo, H. S. Park, and Y. Sheikh, “Optimal visibility estimation for large-scale dynamic 3D reconstruction,” in IEEE
Conference on Computer Vision and Pattern Recognition (submitted), 2014.
[4] H. S. Park, E. Jain, and Y. Sheikh, “Predicting gaze behavior using social saliency fields,” in International Conference on
Computer Vision, 2013.
[5] ——, “3D social saliency from head-mounted cameras,” in Advances in Neural Information Processing Systems, 2012.
[6] T. Shiratori, H. S. Park, L. Sigal, Y. Sheikh, and J. Hodgins, “Motion capture from body-mounted cameras,” Transactions
on Graphics (SIGGRAPH), 2012.
[7] H. S. Park, S. Floyd, and M. Sitti, “Roll and pitch motion analysis of a biologically inspired quadruped water runner
robot,” International Journal of Robotics Research, 2010.
[8] H. S. Park and M. Sitti, “Compliant footpad design analysis for a bio-inspired quadruped amphibious robot,” in IEEE/RSJ
International Conference on Intelligent Robots and Systems, 2009.
[9] H. S. Park, S. Floyd, and M. Sitti, “Dynamic modeling and analysis of pitch motion of a basilisk lizard inspired quadruped
robot running on water,” in International Conference on Robotics and Automation, 2009.
[10] ——, “Dynamic modeling of a basilisk lizard inspired quadruped robot running on water,” in IEEE/RSJ International
Conference on Intelligent Robots and Systems, 2008.