Perception, Prediction, and Control for Social Machines

Research Statement
Hyun Soo Park
www.cs.cmu.edu/~hyunsoop
[email protected]
Carnegie Mellon University
Humans interact by sending visible social signals, such as facial expressions, body gestures, and gaze
movements. Such scenes are common in our daily lives and, increasingly, artificial agents are entering
these spaces. For artificial agents to cohabit these scenes with humans as collaborating team members,
they must be equipped with social intelligence1: the ability to understand social signals and to
behave in a socially acceptable manner. However, designing these machines is challenging because (1) social
signals are often too subtle for current action recognition systems to detect; (2) social signals are related to
each other in complicated ways; and (3) immediate responses to these signals are a key factor in producing socially
acceptable behaviors in dynamic social scenes. The overarching goal of my research is to address these key
challenges: to design machines that integrate social intelligence into their functions.
Designing such machines requires three fundamental research thrusts: (1) Subtle motion perception: the
machines must be able to measure subtle social signals from visual input and transform them into computational representations; (2) Predictive social modeling: the machines must be able to build a predictive
model of social behaviors by reasoning about the relationships between social signals; (3) Control and design
with social intelligence: the machines must be able to produce the desired output from the received social
signals by incorporating the predictive social model. These three thrusts are highly interrelated: adaptive
behavior in dynamic social scenes requires a closed-loop system that integrates perception, prediction, and
control for social intelligence. In my prior work, I have studied these thrusts:
• Subtle motion perception: 3D reconstruction of social signals provides a computational representation that allows the machines to analyze social scenes, e.g., to build a model, reason about
relationships, and predict social behaviors. Because social signals evolve in the form of motion, such as
body gestures, my colleagues and I have presented a method to reconstruct 3D scene motion from a
collection of images [1]2 and extended this work to reconstruct human body motion [2]3 . To capture
detailed subtle motion, we designed the Panoptic Studio, a markerless motion capture system with 480
synchronized cameras, which enables us to reconstruct subtle motion in 3D at high resolution [3]4 , as
shown in Figure 1(a).
• Predictive social modeling: The ability to predict social behaviors is a fundamental measure of
social intelligence because it enables the machines to plan their functions with respect to the predicted
behaviors. I am particularly interested in gaze directions, which reflect attentive behaviors. We have
presented a predictive field model that allows a machine to predict gaze directions at any location
in a social scene [4]5 . The field model encodes the relationship between gaze directions and sources of
attention, and we estimated this field from the gaze directions of a few observed social members [5]6 , as
shown in Figure 1(b). We exploited head-mounted cameras to estimate gaze directions, as the ego-motion
of body-mounted cameras reflects 3D body motion [6]7 .
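To make the gaze-to-attention relationship concrete, the following is a minimal, illustrative sketch of one common geometric sub-step: recovering a single 3D source of attention as the least-squares intersection of several observed gaze rays. This is a textbook construction, not the estimator of [4, 5], and all names and parameters are hypothetical.

```python
import numpy as np

def attention_source(origins, directions):
    """Least-squares 3D intersection of gaze rays (illustrative sketch).

    Each gaze is modeled as a ray p(t) = o_i + t * d_i with unit direction
    d_i. The point nearest to all rays solves the normal equations
        sum_i (I - d_i d_i^T) p = sum_i (I - d_i d_i^T) o_i,
    where (I - d d^T) projects onto the plane normal to the ray.
    """
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, directions):
        d = np.asarray(d, dtype=float)
        d = d / np.linalg.norm(d)          # normalize the gaze direction
        M = np.eye(3) - np.outer(d, d)     # projector onto the ray's normal plane
        A += M
        b += M @ np.asarray(o, dtype=float)
    return np.linalg.solve(A, b)           # fails only if all rays are parallel
```

In practice, a social scene contains multiple simultaneous sources and noisy gaze estimates; this closed-form intersection only illustrates the geometric core of relating gaze directions to a source of attention.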
1 E. L. Thorndike, “Intelligence and its use”, Harper’s Magazine, 1920.
2 http://www.cs.cmu.edu/~hyunsoop/trajectory_reconstruction.html
3 http://www.cs.cmu.edu/~hyunsoop/articulated_trajectory.html
4 The password protected video can be found at https://vimeo.com/82754209 (pw: dome).
5 http://www.cs.cmu.edu/~hyunsoop/socialcharge.html
6 http://www.cs.cmu.edu/~hyunsoop/gaze_concurrence.html
7 https://www.youtube.com/watch?v=xbI-NWMfGPs
[Figure 1 image panels: (a) Motion reconstruction; (b) Predictive social model (labels: sources of attention, gaze directions); (c) Water running robot (labels: motors and encoders, tail, compliant feet)]
Figure 1: (a) We reconstructed subtle motion in 3D using a markerless motion capture system. Top row: sampled
image input; bottom row: reconstructed trajectories. (b) We estimated a predictive model of the social scene
that encodes the relationship between gaze directions and sources of attention. Top row: sampled video input from
head-mounted cameras; bottom row: estimated sources of attention. (c) We studied the control and design of a
quadrupedal water running robot. Top row: robot design; bottom row: high-speed images of the robot running
on the water surface.
• Control and design with social intelligence: Given this computational basis, a fundamental
question in my research is “how do we integrate social signals into a control loop to produce the
desired function?” This question opens a new domain of design problems that address the relationship
between social interactions and system control. I plan to tackle this question by leveraging my
previous research on robot design and control [7–10]. My colleagues and I have studied the dynamics of
a quadrupedal water running robot via mathematical modeling of the interactions between the water
surface and its footpads [8, 10], as shown in Figure 1(c). This model was used to design a control loop
that balances the robot's locomotion on the water surface [7, 9].
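The shape of such a control loop can be sketched with a generic linearized balance model under proportional-derivative (PD) feedback. This is not the published model of [7–10]; the dynamics and all coefficients below are assumed purely for illustration.

```python
def simulate_pitch(theta0=0.3, kp=20.0, kd=6.0, dt=0.001, steps=5000):
    """Euler simulation of an assumed linearized pitch model under PD control.

    Open-loop dynamics (illustrative, not the published model):
        theta'' = g_eff * theta + u
    where g_eff > 0 is a destabilizing term (an unbalanced robot tips over)
    and u is the control torque. PD feedback u = -kp*theta - kd*theta'
    makes the closed loop stable for kp > g_eff, kd > 0.
    """
    g_eff = 5.0                      # assumed destabilizing coefficient
    theta, omega = theta0, 0.0       # pitch angle (rad) and pitch rate (rad/s)
    for _ in range(steps):
        u = -kp * theta - kd * omega     # PD feedback law
        alpha = g_eff * theta + u        # angular acceleration
        omega += alpha * dt              # forward-Euler integration
        theta += omega * dt
    return theta
```

With the assumed gains, the closed-loop pitch decays toward zero, whereas the uncontrolled model (kp = kd = 0) diverges; the point is only the structure of the loop, i.e., measured state feeding a corrective control input.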
My prior work provides a computational foundation for social scene understanding and robot control.
By leveraging this work, I will address the design of machines with social intelligence. In particular,
I plan to pursue three goals that align with my research direction:
A) To reconstruct subtle social signals or the intent of interactions: Subtle social signals, such as
a fleeting cynical smile, a small nod, or a finger gesture, often convey an important intent or message during
human interactions, yet machines are blind to them. I aim to reconstruct such signals in detail from
cameras worn or held by members of a social group, as demonstrated by dense motion capture in the
Panoptic Studio [3]. The main challenge of subtle motion capture is establishing correspondences of moving
points across cameras, because the baselines between these cameras are usually larger than those between
cameras in the Panoptic Studio. I am interested in studying the relationship between reconstruction quality,
or ambiguity, and camera motion, which will characterize desirable camera motion [1]. When such camera
motion is not available, reconstructing subtle motion can benefit from data captured in the Panoptic Studio.
I also plan to investigate spatiotemporal features tailored to human interactions.
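Once correspondences across cameras are established, the geometric core of multi-camera reconstruction is triangulating each point from synchronized views with known projection matrices. The sketch below shows standard linear (DLT) triangulation, a textbook baseline rather than the specific methods of [1] or [3]; applied per frame to corresponding points, it recovers a 3D motion trajectory.

```python
import numpy as np

def triangulate_point(projections, points_2d):
    """Linear (DLT) triangulation of one 3D point from N calibrated views.

    projections: list of 3x4 camera projection matrices P_i.
    points_2d:   list of (x, y) observations in normalized image
                 coordinates, one per camera.
    Each view contributes two linear constraints from x_i ~ P_i X; the
    homogeneous least-squares solution is the right singular vector of
    the stacked system with smallest singular value.
    """
    A = []
    for P, (x, y) in zip(projections, points_2d):
        A.append(x * P[2] - P[0])  # x * (row 3) - (row 1) = 0
        A.append(y * P[2] - P[1])  # y * (row 3) - (row 2) = 0
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]                     # null-space direction of the stacked system
    return X[:3] / X[3]            # dehomogenize to a 3D point
```

The challenge noted above, wide baselines, shows up here only indirectly: triangulation itself is easy, but the correspondences it consumes become much harder to establish as the viewpoints spread apart.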
B) To reason about causal/reciprocal relationships between social interactions and to predict
social affordance: Social signals are multi-directional. A signal transmitted by a social member
may trigger responses from other members, and those responses can, in turn, affect the behavior of the
original member. For instance, a lecturer's pointing gesture (one social signal) triggers a change in the
gaze directions of students (responses), and the lecturer can identify the students' attention from the
alignment of their gaze directions. I have been involved in the Computational Behavioral Science program,
an NSF Expedition8 , where we extract social signals and find correlations between them. Based on this
experience and the 3D representation of social signals, my research will aim to discover causal and reciprocal
relationships between social signals to build a predictive model of social scenes.
8 http://www.cbs.gatech.edu/
The social affordance of space is another relationship, between social signals and the environment. Some
scene structures, such as chairs, tables, or sofas, are strongly related to human interactions, and the spatial
arrangement of such structures characterizes social space. For example, human interactions are more likely
to take place at a sofa in a lobby than at the corner of a wall or under a table. By leveraging the social field
model based on sources of attention [4, 5], I will study how social interactions are spatially related to
3D scene structure and measure its social affordance. This will provide a richer predictive model of
human interactions. Architects can benefit from this social affordance because it will quantify how socially
friendly a space is.
C) To build a multi-agent system that reacts according to social interactions: Social intelligence
enables machines to use social signals in their tasks. As a first step, I will design a multi-agent system
that fully exploits the social intelligence developed in the previous two tasks. A predictive model [4] indicates
how humans will behave at a given location and time. Using this predictive model, I plan to create an automatic broadcasting/capturing system built on small navigational robot platforms, such as manipulators, wheeled mobile
robots, or quadrotors, for sports, search and rescue, and medical scenes. Sources of attention will be estimated by measuring the gaze directions of social members [5], and the paths of the robots will be planned based
on the predictive model.
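A minimal sketch of how a predictive model could drive robot paths (purely illustrative; the planner, names, and parameters below are assumptions, not the proposed system): a camera robot greedily tracks a sequence of predicted attention points while keeping a fixed filming standoff distance.

```python
import numpy as np

def camera_waypoints(predicted_attention, robot_start, standoff=2.0, step=0.5):
    """Greedy 2D waypoint planner for a camera robot (illustrative sketch).

    predicted_attention: sequence of predicted (x, y) attention points,
                         one per planning tick.
    For each predicted point, the robot moves at most `step` per tick
    toward the circle of radius `standoff` around it, along its current
    bearing, maintaining a stable filming distance.
    """
    pos = np.asarray(robot_start, dtype=float)
    path = [pos.copy()]
    for target in predicted_attention:
        target = np.asarray(target, dtype=float)
        offset = pos - target
        dist = np.linalg.norm(offset)
        # Desired viewpoint: on the standoff circle, along the current bearing
        # (arbitrary default bearing if the robot sits exactly on the target).
        bearing = offset / dist if dist > 1e-9 else np.array([1.0, 0.0])
        goal = target + bearing * standoff
        move = goal - pos
        if np.linalg.norm(move) > step:          # clip to the per-tick speed limit
            move = move / np.linalg.norm(move) * step
        pos = pos + move
        path.append(pos.copy())
    return path
```

A real system would replace this greedy rule with path planning under the full predictive model, and would add collision avoidance and visibility constraints; the sketch only shows the coupling of prediction to motion.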
Broad Impact
This work on social intelligence will impact various domains of research, including computer vision, graphics,
architecture, and sociology. In computer vision, classic scene understanding has focused on understanding
the structure of scenes, e.g., where a sofa is, how to navigate a room, or what an object is useful for. My
research will provide a new interpretation of scene understanding in terms of social intelligence, e.g., what
people attend to, what people want to accomplish, and with whom people are interacting. Reconstruction
of subtle motion will enable highly accurate reconstruction and recognition systems, whereas existing
frameworks primarily address gross motion. In graphics, reconstructing intent from these social signals
will introduce a new axis of motion capture, i.e., capturing human intent, and the predictive social model
will allow us to animate virtual characters in massively multiplayer online games (MMOGs) or movies in
a socially engaging way. In architecture, social affordance will define the relationship between social
interactions and space, which can provide architectural criteria for designing social space. The study of human
social behavior can also benefit from social scene understanding. Previous behavioral studies have relied
heavily on manually annotated action predicates. My approach will provide automatic action annotations
and learned relationships between actions, which will enable a machine to detect early stages of behavioral
disorders, such as autism, and provide an empirical foundation for behavioral research.
References
[1] H. S. Park, T. Shiratori, I. Matthews, and Y. Sheikh, “3D reconstruction of a moving point from a series of 2D projections,”
in European Conference on Computer Vision, 2010.
[2] H. S. Park and Y. Sheikh, “3D reconstruction of a smooth articulated trajectory from a monocular image sequence,” in
International Conference on Computer Vision, 2011.
[3] H. Joo, H. S. Park, and Y. Sheikh, “Optimal visibility estimation for large-scale dynamic 3D reconstruction,” in IEEE
Conference on Computer Vision and Pattern Recognition (submitted), 2014.
[4] H. S. Park, E. Jain, and Y. Sheikh, “Predicting gaze behavior using social saliency fields,” in International Conference on
Computer Vision, 2013.
[5] ——, “3D social saliency from head-mounted cameras,” in Advances in Neural Information Processing Systems, 2012.
[6] T. Shiratori, H. S. Park, L. Sigal, Y. Sheikh, and J. Hodgins, “Motion capture from body-mounted cameras,” Transactions
on Graphics (SIGGRAPH), 2012.
[7] H. S. Park, S. Floyd, and M. Sitti, “Roll and pitch motion analysis of a biologically inspired quadruped water runner
robot,” International Journal of Robotics Research, 2010.
[8] H. S. Park and M. Sitti, “Compliant footpad design analysis for a bio-inspired quadruped amphibious robot,” in IEEE/RSJ
International Conference on Intelligent Robots and Systems, 2009.
[9] H. S. Park, S. Floyd, and M. Sitti, “Dynamic modeling and analysis of pitch motion of a basilisk lizard inspired quadruped
robot running on water,” in International Conference on Robotics and Automation, 2009.
[10] ——, “Dynamic modeling of a basilisk lizard inspired quadruped robot running on water,” in IEEE/RSJ International
Conference on Intelligent Robots and Systems, 2008.