
We present a novel approach to computational modeling of social interactions, based on essential social interaction predicates (ESIPs) such as joint attention and entrainment. Drawing on sound social psychological theory and methodology, we collect a new “Tower Game” dataset consisting of audio-visual capture of dyadic interactions labeled with the ESIPs. We expect this dataset to provide a new avenue for research in computational social interaction modeling. We propose a novel joint Discriminative Conditional Restricted Boltzmann Machine (DCRBM) model that combines a discriminative component with the generative power of CRBMs. Such a combination enables us to uncover actionable constituents of the ESIPs in two steps. First, we train the DCRBM model on the labeled data and obtain accurate detection of the predicates (49%-76% across the various ESIPs). Second, we exploit the generative capability of DCRBMs to activate the trained model so as to generate the lower-level data corresponding to a specific ESIP; the generated data closely match the actual training data (mean square error 0.01-0.1 for generating 100 frames). We are thus able to decompose the ESIPs into their constituent actionable behaviors. Such a purely computational determination of how to establish an ESIP such as engagement is unprecedented. Preprint
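
As a rough illustration of the two-step idea, the sketch below trains a plain classification RBM (a joint RBM over features and a one-hot label) with CD-1, detects a label via free energy, and then clamps a label to Gibbs-sample features. It is a simplified stand-in for the paper's DCRBM, which additionally conditions on past frames; the dimensions, hyperparameters, and toy data are hypothetical.

```python
# Minimal sketch (not the paper's DCRBM): a joint RBM over binary features x and a
# one-hot label y, trained with CD-1. It illustrates the two-step idea from the
# abstract: (1) detect the label discriminatively, (2) clamp a label and sample x
# to "activate" the generative side. All dimensions and data are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_lab, n_hid, lr = 20, 4, 30, 0.05

W = 0.01 * rng.standard_normal((n_vis, n_hid))   # feature-hidden weights
U = 0.01 * rng.standard_normal((n_lab, n_hid))   # label-hidden weights
b, c, d = np.zeros(n_vis), np.zeros(n_hid), np.zeros(n_lab)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

def cd1_step(x, y):
    """One contrastive-divergence update on a single (features, one-hot label) pair."""
    global W, U, b, c, d
    h0 = sigmoid(x @ W + y @ U + c)
    h_s = (rng.random(n_hid) < h0).astype(float)
    x1 = sigmoid(W @ h_s + b)                     # mean-field feature reconstruction
    y1 = softmax(U @ h_s + d)                     # label reconstruction
    h1 = sigmoid(x1 @ W + y1 @ U + c)
    W += lr * (np.outer(x, h0) - np.outer(x1, h1))
    U += lr * (np.outer(y, h0) - np.outer(y1, h1))
    b += lr * (x - x1); c += lr * (h0 - h1); d += lr * (y - y1)

def free_energy(x, y):
    return -(x @ b + y @ d + np.logaddexp(0, x @ W + y @ U + c).sum())

def predict(x):
    """Step 1: detection -- pick the label with lowest free energy."""
    return int(np.argmin([free_energy(x, np.eye(n_lab)[k]) for k in range(n_lab)]))

def generate(label, steps=100):
    """Step 2: clamp a label and Gibbs-sample features for it."""
    y, x = np.eye(n_lab)[label], rng.random(n_vis)
    for _ in range(steps):
        h = (rng.random(n_hid) < sigmoid(x @ W + y @ U + c)).astype(float)
        x = sigmoid(W @ h + b)
    return x

# Toy training data: each label prefers a different block of features.
for epoch in range(50):
    for k in range(n_lab):
        x = (rng.random(n_vis) < 0.1).astype(float)
        x[k * 5:(k + 1) * 5] = 1.0
        cd1_step(x, np.eye(n_lab)[k])

probe = np.zeros(n_vis); probe[5:10] = 1.0
print("predicted label:", predict(probe))          # should typically be 1 here
print("generated for label 2:", generate(2).round(2))
```

The same clamp-and-sample pattern is what lets a trained joint model expose which lower-level behaviors a given ESIP label is built from.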

under: Journals, Publications

Communicating ideas and information to and from humans is a very important subject. In our daily lives, humans interact with a variety of entities, such as other humans, machines, and media. Constructive interactions are needed for good communication, which results in successful outcomes such as answering a query, learning a new skill, getting a service done, and communicating emotions. Each of these entities produces a set of signals. Current research has focused on analyzing one entity's signals without regard to the other entities, in a unidirectional manner. The computer vision community has focused on detection, classification, and recognition of humans and their poses and gestures, progressing to actions, activities, and events, but it rarely goes beyond that. The signal processing community has focused on emotion recognition from facial expressions, audio, or both combined. The HCI community has focused on making machine interfaces easier to use. The goal of this workshop is to bring multiple disciplines together to process human-directed signals holistically and bidirectionally, rather than in isolation. This workshop is positioned to showcase this rich domain of applications, which will provide the necessary next boost for these technologies. At the same time, it seeks to ground computational models in theory that would help achieve the technology goals. This would allow us to leverage decades of research in different fields and to spur interdisciplinary research, thereby opening up new problem domains for the multimedia community.  Call for papers

Organized by:

Dr. Mohamed R. Amer (SRI International)
Dr. Ajay Divakaran (SRI International)
Prof. Shih-Fu Chang (Columbia University)
Prof. Nicu Sebe (University of Trento)

under: Workshops

This paper presents an approach to estimating the 2.1D sketch from monocular, low-level visual cues. We use a low-level segmenter to partition the image into regions and then estimate their 2.1D sketch, subject to figure-ground and similarity constraints between neighboring regions. The 2.1D sketch assigns a depth ordering to image regions, which are expected to correspond to objects and surfaces in the scene. This is cast as a constrained convex optimization problem and solved within the optimization transfer framework. The optimization objective takes into account the curvature and convexity of parts of region boundaries, appearance, and spatial layout properties of regions. Our new optimization transfer algorithm admits a closed-form expression of the duality gap, and thus allows explicit computation of the achieved accuracy. The algorithm is efficient, with quadratic complexity in the number of constraints between image regions. Quantitative and qualitative results on challenging, real-world images from the Berkeley segmentation, Geometric Context, and Stanford Make3D datasets demonstrate our high accuracy, efficiency, and robustness. Preprint  Supplement  Code
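
To make the depth-ordering formulation concrete, here is a toy convex surrogate: assign a scalar depth to each region, stay close to unary depth evidence, and enforce pairwise figure-ground constraints with a margin. The regions, scores, and margin are made up, and off-the-shelf SLSQP stands in for the paper's optimization-transfer algorithm.

```python
# Toy convex surrogate for 2.1D depth ordering (not the paper's objective):
# choose region depths d that stay close to unary evidence d0 while honoring
# pairwise "i is in front of j" constraints with a unit margin. Values hypothetical.
import numpy as np
from scipy.optimize import minimize

d0 = np.array([0.2, 0.9, 0.5, 0.1])        # unary depth evidence per region
front_of = [(1, 0), (1, 2), (2, 3)]        # (i, j): region i is closer than region j

def objective(d):
    return float(np.sum((d - d0) ** 2))    # quadratic fidelity term (convex)

# Inequality constraints d[i] - d[j] >= 1 (SLSQP expects fun(d) >= 0).
cons = [{"type": "ineq", "fun": (lambda d, i=i, j=j: d[i] - d[j] - 1.0)}
        for i, j in front_of]

res = minimize(objective, x0=d0, constraints=cons, method="SLSQP")
order = np.argsort(-res.x)                  # larger depth value = closer region
print("depths:", res.x.round(3))
print("front-to-back ordering of regions:", order.tolist())
```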

under: Journals, Publications

This dissertation addresses the problem of recognizing human activities in videos. Our focus is on activities with stochastic structure, where the activities are characterized by variable space-time arrangements of actions, and conducted by a variable number of actors. These activities occur frequently in sports and surveillance videos. They may appear jointly in multiple instances, at different spatial and temporal scales, under occlusion, and amidst background clutter. These challenges have never been addressed in the literature. Our hypothesis is that these challenges can be successfully addressed using expressive, hierarchical models explicitly encoding activity parts and their spatio-temporal relations. Our hypothesis is formalized using two novel paradigms. One specifies a new constrained hierarchical model of activities allowing efficient activity recognition. Specifically, we formulate Sum-Product Networks (SPNs) for modeling activities, and develop two new learning algorithms using variational learning. The other paradigm considers a more expressive (unconstrained) hierarchical model, And-Or Graphs (AOGs), requiring cost-efficient algorithms for activity recognition. In particular, we develop a new, Monte Carlo Tree Search based inference of AOGs. Our theoretical and empirical studies advance computer vision through demonstrated advantages of each paradigm, compared to the state-of-the-art. Dissertation
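
As a small illustration of the first paradigm, the sketch below evaluates a tiny sum-product network bottom-up: leaves are indicators over binary variables, product nodes multiply children with disjoint scopes, and sum nodes form weighted mixtures. The structure and weights are hypothetical and unrelated to the dissertation's learned activity SPNs.

```python
# Minimal bottom-up evaluation of a tiny sum-product network (SPN).
# Leaves are indicators over binary variables; products assume disjoint scopes.
# Structure and weights are hypothetical, purely to illustrate the computation.

def leaf(var, value):
    return ("leaf", var, value)

def product(*children):
    return ("prod", children)

def weighted_sum(weights, children):
    return ("sum", weights, children)

def evaluate(node, assignment):
    kind = node[0]
    if kind == "leaf":
        _, var, value = node
        return 1.0 if assignment[var] == value else 0.0
    if kind == "prod":
        out = 1.0
        for child in node[1]:
            out *= evaluate(child, assignment)
        return out
    _, weights, children = node              # sum node: weighted mixture
    return sum(w * evaluate(c, assignment) for w, c in zip(weights, children))

# Two binary variables A, B; the SPN mixes "A and B co-occur" with "independent".
spn = weighted_sum([0.7, 0.3],
                   [product(leaf("A", 1), leaf("B", 1)),
                    product(weighted_sum([0.5, 0.5], [leaf("A", 0), leaf("A", 1)]),
                            weighted_sum([0.8, 0.2], [leaf("B", 0), leaf("B", 1)]))])

print(evaluate(spn, {"A": 1, "B": 1}))   # 0.7*1 + 0.3*(0.5*0.2) = 0.73
```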

under: Dissertation
This paper addresses the problem of recognizing and localizing coherent activities of a group of people, called collective activities, in video. Related work has argued the benefits of capturing long-range and higher-order dependencies among video features for robust recognition. To this end, we formulate a new deep model, called Hierarchical Random Field (HiRF). HiRF models only hierarchical dependencies between model variables, which effectively amounts to modeling higher-order temporal dependencies of video features. We specify an efficient inference algorithm for HiRF, each iteration of which solves a linear program to estimate the latent variables. Learning of HiRF parameters is specified within the max-margin framework. Our evaluation on the benchmark New Collective Activity and Collective Activity datasets demonstrates that HiRF yields superior recognition and localization compared to the state of the art. Paper
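
The toy sketch below illustrates what one such linear-programming step can look like: with everything else held fixed, latent frame labels on a temporal chain are estimated by solving the standard local-consistency LP relaxation. The scores and problem size are hypothetical; this is only a simplified analogue of HiRF's inference, not the model itself.

```python
# Toy MAP-LP relaxation for a chain of latent frame labels (a simplified analogue
# of an inference step that solves a linear program; all scores are hypothetical).
import numpy as np
from scipy.optimize import linprog

T, K = 4, 3                                   # frames, candidate latent labels
rng = np.random.default_rng(1)
unary = rng.random((T, K))                    # per-frame label scores
pair = rng.random((T - 1, K, K))              # temporal agreement scores

# Variable layout: mu_t(k) for all t,k followed by mu_{t,t+1}(k,k') for all edges.
n_un, n_pw = T * K, (T - 1) * K * K
def u_idx(t, k):        return t * K + k
def p_idx(t, k, kp):    return n_un + t * K * K + k * K + kp

c = -np.concatenate([unary.ravel(), pair.ravel()])   # linprog minimizes

A_eq, b_eq = [], []
for t in range(T):                            # normalization: sum_k mu_t(k) = 1
    row = np.zeros(n_un + n_pw); row[[u_idx(t, k) for k in range(K)]] = 1
    A_eq.append(row); b_eq.append(1.0)
for t in range(T - 1):                        # marginalization consistency
    for k in range(K):                        # sum_kp mu_{t,t+1}(k,kp) = mu_t(k)
        row = np.zeros(n_un + n_pw)
        row[[p_idx(t, k, kp) for kp in range(K)]] = 1; row[u_idx(t, k)] = -1
        A_eq.append(row); b_eq.append(0.0)
    for kp in range(K):                       # sum_k mu_{t,t+1}(k,kp) = mu_{t+1}(kp)
        row = np.zeros(n_un + n_pw)
        row[[p_idx(t, k, kp) for k in range(K)]] = 1; row[u_idx(t + 1, kp)] = -1
        A_eq.append(row); b_eq.append(0.0)

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=(0, 1), method="highs")
labels = res.x[:n_un].reshape(T, K).argmax(axis=1)
print("relaxed frame labels:", labels.tolist())
```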
under: Main Conference, Publications

We propose a novel staged hybrid model for emotion detection in speech. Hybrid models exploit the strength of discriminative classifiers along with the representational power of generative models. Discriminative classifiers have been shown to achieve higher performance than the corresponding generative likelihood-based classifiers. On the other hand, generative models learn rich, informative representations. Our proposed hybrid model consists of a generative model, used for unsupervised representation learning of short-term temporal phenomena, and a discriminative model, used for event detection and classification of long-range temporal dynamics. We evaluate our approach on multiple audio datasets (AVEC, VAM, and SPD) and demonstrate its superiority compared to the state-of-the-art. Paper
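
Below is a minimal sketch of the staging, with PCA and logistic regression as deliberately simple stand-ins for the generative representation learner and the discriminative classifier; the "speech" features are synthetic and the dimensions hypothetical.

```python
# Sketch of a two-stage (unsupervised-then-discriminative) pipeline. PCA and
# logistic regression are simple stand-ins for the paper's generative and
# discriminative components; data and dimensions are synthetic/hypothetical.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_utt, n_frames, n_feat = 200, 50, 24

# Synthetic "speech" features: the per-utterance emotion label shifts the frame mean.
labels = rng.integers(0, 2, size=n_utt)
frames = rng.standard_normal((n_utt, n_frames, n_feat)) + labels[:, None, None] * 0.4

# Stage 1 (unsupervised, short-term): learn a frame-level representation, ignoring labels.
all_frames = frames.reshape(-1, n_feat)
stage1 = PCA(n_components=8).fit(all_frames)
frame_codes = stage1.transform(all_frames).reshape(n_utt, n_frames, -1)

# Stage 2 (discriminative, long-range): pool frame codes over the utterance
# (mean + std) and classify the emotion label.
pooled = np.concatenate([frame_codes.mean(axis=1), frame_codes.std(axis=1)], axis=1)
split = n_utt // 2
clf = LogisticRegression(max_iter=1000).fit(pooled[:split], labels[:split])
print("held-out accuracy:", clf.score(pooled[split:], labels[split:]))
```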

under: Main Conference, Publications

Humans form a multitude of social groups throughout their lives and regularly interact with other humans in these groups, producing social behavior. Social behavior is behavior that is socially relevant or is situated in an identifiable social context. Interacting or observant humans sense, interpret, and understand these behaviors mostly through aural and visual sensory stimuli. Most previous research has focused on detection, classification, and recognition of humans and their poses, progressing to actions, activities, and events, but it mostly lacks grounding in socially relevant contexts. Moreover, this research is largely driven by applications in security & surveillance or in search & retrieval. The time is ripe to ground these technologies in richer social contexts and milieus. This workshop is positioned to showcase this rich domain of applications, which will provide the necessary next boost for these technologies. At the same time, it seeks to ground computational models of social behavior in sociopsychological and neuroscientific theories of human action and behavior. This would allow us to leverage decades of research in these theoretically and empirically rich fields and to spur interdisciplinary research, thereby opening up new problem domains for the vision community. Call For Papers

Organized by:

Ajay Divakaran (SRI International)
Maneesh Singh (SRI International)
Mohamed R. Amer (Oregon State University)
Behjat Siddiquie (SRI International)
Saad Khan (SRI International)

under: Workshops

We propose a novel staged hybrid model that exploits the strength of discriminative classifiers along with the representational power of generative models. Our focus is on detecting multimodal events in time-varying data sequences. Discriminative classifiers have been shown to achieve higher performance than the corresponding generative likelihood-based classifiers. On the other hand, generative models learn a rich, informative space that allows for data generation and joint feature representation, which discriminative models lack. We employ a deep temporal generative model for unsupervised learning of a shared representation across multiple modalities with time-varying data. The temporal generative model takes into account short-term temporal phenomena and allows for filling in missing data by generating data within or across modalities. The hybrid model augments the temporal generative model with a Conditional Random Field based temporal discriminative model for event detection, classification, and generation, which enables modeling long-range temporal dynamics. We evaluate our approach on multiple audio-visual datasets (AVEC, AVLetters, and CUAVE) and demonstrate its superiority compared to the state-of-the-art. Paper
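
The sketch below illustrates the shared-representation and fill-in ideas with linear CCA as a stand-in for the deep temporal generative model: two synthetic "modalities" are mapped into a joint space, and the missing modality is reconstructed from the observed one. None of this reflects the actual architecture or datasets.

```python
# Sketch of a shared multimodal representation with cross-modal "filling in",
# using linear CCA as a simple stand-in for the paper's deep temporal generative
# model. Both "modalities" are synthetic; only the joint-space and fill-in ideas
# are illustrated.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n, d_audio, d_video, d_shared = 500, 12, 20, 4

# Both modalities are noisy linear views of the same latent signal.
latent = rng.standard_normal((n, d_shared))
audio = latent @ rng.standard_normal((d_shared, d_audio)) + 0.1 * rng.standard_normal((n, d_audio))
video = latent @ rng.standard_normal((d_shared, d_video)) + 0.1 * rng.standard_normal((n, d_video))

cca = CCA(n_components=d_shared).fit(audio[:400], video[:400])

# Shared representation for downstream event detection.
audio_code, video_code = cca.transform(audio[400:], video[400:])
print("shared code shape:", audio_code.shape)

# "Fill in" the missing video modality from audio alone.
video_hat = cca.predict(audio[400:])
err = np.mean((video_hat - video[400:]) ** 2) / np.mean(video[400:] ** 2)
print("relative reconstruction error for the missing modality:", round(float(err), 3))
```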

under: Main Conference, Publications
This paper presents an efficient approach to video parsing. Our videos show a number of co-occurring individual and group activities. To address the challenges of the domain, we use an expressive spatiotemporal AND-OR graph (ST-AOG) that jointly models activity parts, their spatiotemporal relations, and context, and also enables multitarget tracking. Standard ST-AOG inference is prohibitively expensive in our setting, since it would require running a multitude of detectors and tracking their detections over long video footage. We address this problem by formulating cost-sensitive inference of ST-AOG as Monte Carlo Tree Search (MCTS). For querying an activity in the video, MCTS optimally schedules a sequence of detectors and trackers to be run, and where they should be applied in the space-time volume. Evaluation on the benchmark datasets demonstrates that MCTS enables speed-ups of two orders of magnitude without compromising accuracy relative to the standard cost-insensitive inference. Paper Poster Code
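
As a loose illustration of cost-sensitive scheduling, the toy sketch below allocates a compute budget across detectors using a UCB1-style score per unit cost. This bandit formulation is a deliberate simplification of the paper's MCTS-based scheduling; the detectors, costs, and reward probabilities are made up.

```python
# Toy cost-sensitive scheduler: under a compute budget, repeatedly decide which
# detector to run next using a UCB1-style score divided by detector cost. A
# deliberately simplified bandit stand-in for MCTS-based scheduling, with
# hypothetical detectors, costs, and reward probabilities.
import math
import random

random.seed(0)

# (name, cost, probability that a run yields useful evidence) -- all hypothetical.
detectors = [("person", 1.0, 0.6), ("ball", 0.5, 0.3), ("group-pose", 4.0, 0.8)]
budget = 30.0

runs = [0] * len(detectors)
gains = [0.0] * len(detectors)
spent, t = 0.0, 0

while spent < budget:
    t += 1

    def score(i):
        if runs[i] == 0:
            return float("inf")                       # try each detector once
        mean = gains[i] / runs[i]
        bonus = math.sqrt(2.0 * math.log(t) / runs[i])
        return (mean + bonus) / detectors[i][1]       # value per unit cost

    i = max(range(len(detectors)), key=score)
    name, cost, p = detectors[i]
    if spent + cost > budget:
        break
    reward = 1.0 if random.random() < p else 0.0      # simulated evidence found
    runs[i] += 1; gains[i] += reward; spent += cost

for (name, cost, _), r, g in zip(detectors, runs, gains):
    print(f"{name:>10}: ran {r} times, total gain {g:.0f}, cost spent {r * cost:.1f}")
```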
under: Main Conference, Publications

1st Workshop on Understanding Human Activities: Context and Interactions (ICCV2013)

Activity recognition is one of the core problems in computer vision and has recently attracted the attention of many researchers in the field. It is significant to many vision-related applications such as surveillance, video search, human-computer interaction, and human-human (social) interaction. Recent advances in feature representations, modeling, and inference techniques have led to significant progress in the field.

Motivated by the rich and complex temporal, spatial, and social structure of human activities, activity recognition today features several new challenges, including modeling group activities, complex temporal reasoning, activity hierarchies, human-object interactions, and human-scene interactions. These new challenges aim to answer questions regarding semantic understanding and high-level reasoning about image and video content. At this level, other classical problems in computer vision, such as object detection and tracking, not only impact but are often intertwined with activity recognition. This inherent complexity calls for more time and thought to be devoted to problems auxiliary to human activity recognition itself. Call for papers

Organized by:

Sameh Khamis (University of Maryland)
Mohamed R. Amer (Oregon State University)
Wongun Choi (NEC-Labs)
Tian Lan (Stanford University)

under: Workshops

