We propose a novel staged hybrid model for emotion detection in speech. Hybrid models exploit the strength of discriminative classifiers along with the representational power of generative models. Discriminative classifiers have been shown to achieve higher performance than the corresponding generative likelihood-based classifiers, while generative models learn rich, informative representations. Our proposed hybrid model consists of a generative model, used for unsupervised representation learning of short-term temporal phenomena, and a discriminative model, used for event detection and classification of long-range temporal dynamics. We evaluate our approach on multiple audio datasets (AVEC, VAM, and SPD) and demonstrate its superiority compared to the state of the art. Paper
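
As a rough illustration of the staged idea (a toy stand-in, not the paper's actual model), the sketch below uses scikit-learn: an unsupervised Gaussian mixture provides a short-term frame-level representation, and a logistic-regression classifier plays the discriminative role on pooled sequence descriptors. All data, names, and hyperparameters here are invented.

    # Toy sketch of the staged hybrid idea: an unsupervised generative
    # stage learns a short-term representation; a discriminative stage
    # classifies pooled sequence-level descriptors. Not the paper's model.
    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Stand-in for short-term acoustic features: 200 utterances, each a
    # (frames x dims) array, with toy binary emotion labels.
    utterances = [rng.normal(size=(int(rng.integers(40, 80)), 12)) for _ in range(200)]
    labels = rng.integers(0, 2, size=200)

    # Stage 1 (generative, unsupervised): fit a GMM on all frames; use
    # per-frame component posteriors as the learned representation.
    gmm = GaussianMixture(n_components=8, random_state=0).fit(np.vstack(utterances))

    def represent(utt):
        # Average frame posteriors over time -> fixed-length descriptor.
        return gmm.predict_proba(utt).mean(axis=0)

    X = np.array([represent(u) for u in utterances])

    # Stage 2 (discriminative): classify sequence-level descriptors.
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    print("train accuracy:", clf.score(X, labels))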

under: Main Conference, Publications

1st Workshop on Computational Models of Social Interactions and Behavior: Scientific Grounding, Sensing, and Applications (CVPR 2014)

Humans form a multitude of social groups throughout their lives and regularly interact with other humans in these groups, producing social behavior. Social behavior is behavior that is socially relevant or is situated in an identifiable social context. Interacting or observing humans sense, interpret, and understand these behaviors mostly through aural and visual sensory stimuli. Most previous research has focused on the detection, classification, and recognition of humans and their poses, progressing on to actions, activities, and events, but it largely lacks grounding in socially relevant contexts. Moreover, this research is mostly driven by applications in security and surveillance or in search and retrieval. The time is ripe to ground these technologies in richer social contexts and milieus. This workshop is positioned to showcase this rich domain of applications, which will provide the necessary next boost for these technologies. At the same time, it seeks to ground computational models of social behavior in sociopsychological and neuroscientific theories of human action and behavior. This would allow us to leverage decades of research in these theoretically and empirically rich fields and to spur interdisciplinary research, thereby opening up new problem domains for the vision community. Call For Papers

Organized by:

Ajay Divakaran (SRI International)
Maneesh Singh (SRI International)
Mohamed R. Amer (Oregon State University)
Behjat Siddiquie (SRI International)
Saad Khan (SRI International)

under: Workshops

We propose a novel staged hybrid model that exploits the strength of discriminative classifiers along with the representational power of generative models. Our focus is on detecting multimodal events in time-varying data sequences. Discriminative classifiers have been shown to achieve higher performance than the corresponding generative likelihood-based classifiers. On the other hand, generative models learn a rich, informative space that allows for data generation and joint feature representation, which discriminative models lack. We employ a deep temporal generative model for unsupervised learning of a representation shared across multiple modalities of time-varying data. The temporal generative model accounts for short-term temporal phenomena and allows for filling in missing data by generating data within or across modalities. The hybrid model augments the temporal generative model with a Conditional Random Field-based temporal discriminative model for event detection, classification, and generation, which enables modeling long-range temporal dynamics. We evaluate our approach on multiple audio-visual datasets (AVEC, AVLetters, and CUAVE) and demonstrate its superiority compared to the state of the art. Paper
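
As a minimal sketch of the discriminative stage alone, the snippet below runs Viterbi decoding over a linear-chain model, a common stand-in for linear-chain CRF inference; the per-frame emission scores are assumed to come from the (omitted) generative representation. This is illustrative only, not the paper's implementation.

    # Viterbi decoding over a linear chain: a stand-in for CRF inference
    # on top of generative per-frame representations.
    import numpy as np

    def viterbi(emission, transition):
        """emission: (T, K) per-frame log-scores; transition: (K, K) log-scores."""
        T, K = emission.shape
        score = emission[0].copy()
        back = np.zeros((T, K), dtype=int)
        for t in range(1, T):
            cand = score[:, None] + transition      # (K, K): prev -> cur
            back[t] = cand.argmax(axis=0)
            score = cand.max(axis=0) + emission[t]
        path = [int(score.argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]

    rng = np.random.default_rng(0)
    emission = rng.normal(size=(10, 3))             # toy per-frame event scores
    transition = np.log(np.full((3, 3), 0.1) + 0.7 * np.eye(3))  # sticky labels
    print(viterbi(emission, transition))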

under: Main Conference, Publications

This paper presents an efficient approach to video parsing. Our videos show a number of co-occurring individual and group activities. To address the challenges of the domain, we use an expressive spatiotemporal AND-OR graph (ST-AOG) that jointly models activity parts, their spatiotemporal relations, and context, and also enables multitarget tracking. Standard ST-AOG inference is prohibitively expensive in our setting, since it would require running a multitude of detectors and tracking their detections across long video footage. We address this problem by formulating cost-sensitive inference of the ST-AOG as Monte Carlo Tree Search (MCTS). For querying an activity in the video, MCTS optimally schedules a sequence of detectors and trackers to be run, and where they should be applied in the space-time volume. Evaluation on benchmark datasets demonstrates that MCTS enables speedups of two orders of magnitude without compromising accuracy relative to standard cost-insensitive inference. Paper Poster Code
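
To convey the explore-exploit flavor of cost-sensitive scheduling in a far simpler setting than the paper's ST-AOG MCTS, the toy sketch below uses a UCB1 bandit to decide which hypothetical detector or tracker to run next; the per-run rewards are simulated, not computed from any model.

    # Toy explore-exploit scheduler: pick the (detector, cost) arm with the
    # best reward-per-cost plus exploration bonus. Not the paper's MCTS.
    import math, random

    random.seed(0)
    arms = [("person_detector", 1.0), ("group_detector", 3.0), ("tracker", 2.0)]
    true_gain = [0.3, 0.8, 0.5]        # hidden per-run information gain (toy)
    counts, totals = [0] * 3, [0.0] * 3

    for step in range(1, 201):
        def ucb(i):
            if counts[i] == 0:
                return float("inf")    # try every arm at least once
            mean = totals[i] / counts[i]
            return mean / arms[i][1] + math.sqrt(2 * math.log(step) / counts[i])
        i = max(range(3), key=ucb)
        reward = true_gain[i] + random.gauss(0, 0.1)   # simulate running it
        counts[i] += 1
        totals[i] += reward

    for (name, cost), c in zip(arms, counts):
        print(f"{name}: scheduled {c} times (cost {cost})")
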
under: Main Conference, Publications

1st Workshop on Understanding Human Activities: Context and Interactions (ICCV 2013)

Activity recognition is one of the core problems in computer vision, and it has recently attracted the attention of many researchers in the field. It is significant to many vision-related applications such as surveillance, video search, human-computer interaction, and human-human, or social, interactions. Recent advances in feature representations, modeling, and inference techniques have led to significant progress in the field.

Motivated by the rich and complex temporal, spatial, and social structure of human activities, activity recognition today features several new challenges, including modeling group activities, complex temporal reasoning, activity hierarchies, human-object interactions, and human-scene interactions. These new challenges raise questions regarding the semantic understanding and high-level reasoning of image and video content. At this level, other classical problems in computer vision, such as object detection and tracking, not only impact activity recognition but are often intertwined with it. This inherent complexity calls for more time and thought to be spent on solving problems auxiliary to human activity recognition. Call for papers

Organized by:

Sameh Khamis (University of Maryland)
Mohamed R. Amer (Oregon State University)
Wongun Choi (NEC-Labs)
Tian Lan (Stanford University)

under: Workshops

This paper addresses a new problem, that of multiscale activity recognition. Our goal is to detect and localize a wide range of activities, including individual actions and group activities, which may simultaneously co-occur in high-resolution video. The video resolution allows for digital zoom-in (or zoom-out) for examining fine details (or coarser scales), as needed for recognition. The key challenge is how to avoid running a multitude of detectors at all spatiotemporal scales and yet arrive at a holistically consistent video interpretation. To this end, we use a three-layered AND-OR graph to jointly model group activities, individual actions, and participating objects. The AND-OR graph allows a principled formulation of efficient, cost-sensitive inference via an explore-exploit strategy. Our inference optimally schedules the following computational processes: 1) direct application of activity detectors (the α process); 2) bottom-up inference based on detecting activity parts (the β process); and 3) top-down inference based on detecting activity context (the γ process). The scheduling iteratively maximizes the log-posteriors of the resulting parse graphs. For evaluation, we have compiled and benchmarked a new dataset of high-resolution videos of group and individual activities co-occurring in a courtyard of the UCLA campus. Paper Presentation Code Dataset
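
A toy caricature of the scheduling loop follows, with simulated posterior gains standing in for the quantities the paper derives from the AND-OR graph; only the pick-the-most-promising-process control flow is illustrated.

    # Toy sketch: at each step, run whichever of the alpha (direct detection),
    # beta (bottom-up from parts), or gamma (top-down from context) processes
    # currently promises the largest gain in parse-graph log-posterior.
    import random

    random.seed(0)
    processes = ["alpha", "beta", "gamma"]
    expected_gain = {"alpha": 1.0, "beta": 0.6, "gamma": 0.4}  # invented
    log_posterior, budget = -50.0, 10

    for step in range(budget):
        p = max(processes, key=lambda q: expected_gain[q])
        gain = max(0.0, random.gauss(expected_gain[p], 0.2))   # simulated
        log_posterior += gain
        expected_gain[p] *= 0.7      # diminishing returns after each run
        print(f"step {step}: ran {p}, log-posterior = {log_posterior:.2f}")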

under: Main Conference, Publications

This paper addresses recognition of human activities with stochastic structure, characterized by variable space-time arrangements of primitive actions, and conducted by a variable number of actors. We demonstrate that modeling aggregate counts of visual words is surprisingly expressive enough for such a challenging recognition task. An activity is represented by a sum-product network (SPN). The SPN is a mixture of bags-of-words (BoWs) with exponentially many mixture components, where subcomponents are reused by larger ones. The SPN consists of terminal nodes representing BoWs, and product and sum nodes organized in a number of layers. The products are aimed at encoding particular configurations of primitive actions, and the sums serve to capture their alternative configurations. The connectivity of the SPN and the parameters of the BoW distributions are learned under weak supervision using the EM algorithm. SPN inference amounts to parsing the SPN graph, which yields the most probable explanation (MPE) of the video in terms of activity detection and localization. SPN inference has linear complexity in the number of nodes, under fairly general conditions, enabling fast and scalable recognition. A new Volleyball dataset is compiled and annotated for evaluation. Our classification accuracy and localization precision and recall are superior to those of the state of the art on the benchmark and our Volleyball datasets. Paper Poster Code Dataset
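
To make the node types concrete, here is a minimal, made-up SPN evaluated in log-space: leaves score a bag-of-words histogram under multinomial parameters, a product node conjoins two leaves, and a sum node mixes alternative configurations. The structure and weights are invented; the paper learns them with EM under weak supervision.

    # Tiny SPN over bag-of-words leaves, evaluated in log-space.
    import numpy as np

    def leaf(theta):
        # Multinomial log-likelihood of a BoW count histogram (up to a constant).
        return lambda h: float(h @ np.log(theta))

    def product(children):
        return lambda h: sum(c(h) for c in children)       # product = sum of logs

    def sum_node(weights, children):
        return lambda h: float(np.logaddexp.reduce(
            [np.log(w) + c(h) for w, c in zip(weights, children)]))

    l1 = leaf(np.array([0.7, 0.1, 0.1, 0.1]))      # one toy word profile
    l2 = leaf(np.array([0.1, 0.7, 0.1, 0.1]))      # another toy word profile
    l3 = leaf(np.array([0.25, 0.25, 0.25, 0.25]))  # uninformative alternative
    root = sum_node([0.6, 0.4], [product([l1, l2]), l3])

    hist = np.array([3, 2, 1, 0])                  # toy word counts for one video
    print("root log-score:", root(hist))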

under: Main Conference, Publications

Marine biologists commonly use underwater videos in their research on the behaviors of sea organisms. Their video analysis, however, is typically based on visual inspection. This incurs prohibitively large user costs and severely limits the scope of biological studies. There is a need for vision algorithms that address the specific needs of marine biologists, such as fine-grained categorization of fish motion patterns. This is a difficult problem because of the very small inter-class and large intra-class differences between fish motion patterns. Our approach consists of three steps. First, we apply our new fish detector to identify and localize fish occurrences in each frame, under partial occlusion and amidst dynamic texture patterns formed by whirls of sand on the sea bed. Then, we conduct tracking-by-detection. Given the similarity between fish detections, defined in terms of fish appearance and motion properties, we formulate fish tracking as transitively linking similar detections between every two consecutive frames, so as to maintain their unique track IDs. Finally, we extract histograms of fish displacements along the estimated tracks. The histograms are classified by the Random Forest technique to recognize distinct classes of fish motion patterns. Evaluation on challenging underwater videos demonstrates that our approach outperforms the state of the art. Paper Poster
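
The linking step can be pictured with a small, self-contained sketch: detections in consecutive frames are matched by Hungarian assignment on a position-distance cost (the appearance terms are omitted), matched detections inherit track IDs, and unmatched ones start new tracks. The threshold and data are assumptions for illustration.

    # Toy tracking-by-detection linking between two consecutive frames.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def link(prev, cur, prev_ids, next_id, max_dist=30.0):
        """prev, cur: (N, 2)/(M, 2) detection centers; returns cur track IDs."""
        cost = np.linalg.norm(prev[:, None, :] - cur[None, :, :], axis=2)
        rows, cols = linear_sum_assignment(cost)
        ids = [-1] * len(cur)
        for r, c in zip(rows, cols):
            if cost[r, c] < max_dist:            # accept only close matches
                ids[c] = prev_ids[r]
        for j in range(len(cur)):                # unmatched -> new track
            if ids[j] == -1:
                ids[j], next_id = next_id, next_id + 1
        return ids, next_id

    frame0 = np.array([[10.0, 10.0], [50.0, 50.0]])
    frame1 = np.array([[12.0, 11.0], [80.0, 90.0], [49.0, 52.0]])
    ids1, nid = link(frame0, frame1, [0, 1], next_id=2)
    print(ids1)   # [0, 2, 1]: two continued tracks, one new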

under: Publications, Workshop

This is a theoretical paper that proves that probabilistic event logic (PEL) is MAP-equivalent to its conjunctive normal form (PEL-CNF). This allows us to address the NP-hard MAP inference for PEL in a principled manner. We first map the confidence-weighted formulas from a PEL knowledge base to PEL-CNF, and then conduct MAP inference for PEL-CNF using stochastic local search. Our MAP inference leverages the spanning-interval data structure for compactly representing and manipulating entire sets of time intervals without enumerating them. For experimental evaluation, we use the specific domain of volleyball videos. Our experiments demonstrate that the MAP inference for PEL-CNF successfully detects and localizes volleyball events in the face of different types of synthetic noise introduced in the ground-truth video annotations. Paper
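
To illustrate the spanning-interval idea in isolation: a single object [L, U] stands for every interval whose left endpoint lies in the range L and whose right endpoint lies in the range U, so entire sets of time intervals can be intersected without enumerating them. The sketch follows the general spanning-interval literature and is not necessarily the paper's exact variant.

    # Minimal spanning-interval sketch: one object represents a whole set
    # of intervals, and set intersection stays in the compact form.
    from dataclasses import dataclass

    @dataclass
    class SpanningInterval:
        lo: tuple  # (min, max) range of allowed left endpoints
        hi: tuple  # (min, max) range of allowed right endpoints

        def intersect(self, other):
            lo = (max(self.lo[0], other.lo[0]), min(self.lo[1], other.lo[1]))
            hi = (max(self.hi[0], other.hi[0]), min(self.hi[1], other.hi[1]))
            if lo[0] > lo[1] or hi[0] > hi[1]:
                return None                    # no interval satisfies both
            return SpanningInterval(lo, hi)

    a = SpanningInterval(lo=(0, 5), hi=(10, 20))   # starts in [0,5], ends in [10,20]
    b = SpanningInterval(lo=(3, 8), hi=(15, 25))
    print(a.intersect(b))                          # lo=(3, 5), hi=(15, 20)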

under: Publications, Workshop

Given a video, we would like to recognize group activities, localize video parts where these activities occur, and detect actors involved in them. This advances prior work that typically focuses only on video classification. We make a number of contributions. First, we specify a new, mid-level video feature aimed at summarizing local visual cues into bags of the right detections (BORDs). BORDs seek to identify the right people who participate in a target group activity among many noisy people detections. Second, we formulate a new, generative, chains model of group activities. Inference of the chains model identifies a subset of BORDs in the video that belong to occurrences of the activity, and organizes them in an ensemble of temporal chains. The chains extend over, and thus localize, the time intervals occupied by the activity. We formulate a new MAP inference algorithm that iterates two steps: i) warps the chains of BORDs in space and time to their expected locations, so the transformed BORDs can better summarize local visual cues; and ii) maximizes the posterior probability of the chains. We outperform the state of the art on the benchmark UT-Human Interaction and Collective Activities datasets, under reasonable running times. Paper Poster Code
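
The two-step iteration can be caricatured in one dimension: step (i) warps a chain of anchors toward its nearest detections, and step (ii) scores how well the warped chain explains them. Everything below is a simplified stand-in for the paper's space-time procedure.

    # Toy alternating MAP iteration: warp a 1-D chain, then rescore the fit.
    import numpy as np

    rng = np.random.default_rng(0)
    detections = np.sort(rng.uniform(0, 100, size=20))   # toy detection times
    chain = np.linspace(10, 60, 6)                       # expected chain anchors
    offset = 0.0

    for it in range(5):
        # (i) warp: shift the chain to best align with its nearest detections.
        idx = np.abs(detections[:, None] - (chain + offset)).argmin(axis=0)
        nearest = detections[idx]
        offset = float((nearest - chain).mean())
        # (ii) maximize: score how well the warped chain explains the
        # detections it claims; a higher score is a better explanation.
        score = -np.abs(nearest - (chain + offset)).sum()
        print(f"iter {it}: offset={offset:.2f}, score={score:.2f}")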

under: Main Conference, Publications
