
Archive for Main Conference

This paper addresses the problem of recognizing and localizing coherent activities of a group of people, called collective activities, in video. Related work has argued the benefits of capturing long-range and higher-order dependencies among video features for robust recognition. To this end, we formulate a new deep model, called Hierarchical Random Field (HiRF). HiRF models only hierarchical dependencies between model variables, which effectively amounts to modeling higher-order temporal dependencies of video features. We specify an efficient inference procedure for HiRF that, in each step, solves a linear program to estimate the latent variables. Learning of HiRF parameters is specified within the max-margin framework. Our evaluation on the benchmark New Collective Activity and Collective Activity datasets demonstrates that HiRF yields superior recognition and localization as compared to the state of the art. Paper
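
For readers curious what such an iterative, LP-based inference loop might look like, below is a minimal sketch (not the authors' implementation): latent temporal-group labels are re-estimated one at a time by relaxing each assignment to the probability simplex and solving a small linear program. The `unary` and `pairwise` potentials are hypothetical placeholders for HiRF's learned parameters.

```python
# Minimal sketch of LP-based latent-variable estimation in the spirit of
# HiRF inference (hypothetical potentials; not the authors' code).
import numpy as np
from scipy.optimize import linprog

def infer_latent_labels(unary, pairwise, n_iters=10):
    """unary: (T, K) scores of assigning each of T temporal groups to one of
    K latent states; pairwise: (K, K) scores for states of adjacent groups."""
    T, K = unary.shape
    labels = unary.argmax(axis=1)                    # greedy initialization
    for _ in range(n_iters):
        for t in range(T):
            score = unary[t].copy()
            if t > 0:
                score += pairwise[labels[t - 1]]     # link to previous group
            if t < T - 1:
                score += pairwise[:, labels[t + 1]]  # link to next group
            # Relax the assignment of group t to the simplex and solve the
            # resulting linear program; its optimum is a vertex, i.e. a
            # hard assignment.
            res = linprog(c=-score, A_eq=np.ones((1, K)), b_eq=[1.0],
                          bounds=[(0, 1)] * K, method="highs")
            labels[t] = int(np.argmax(res.x))
    return labels
```
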
under: Main Conference, Publications

We propose a novel staged hybrid model for emotion detection in speech. Hybrid models exploit the strength of discriminative classifiers along with the representational power of generative models. Discriminative classifiers have been shown to achieve higher performance than the corresponding generative likelihood-based classifiers, while generative models learn a rich, informative representation. Our proposed hybrid model consists of a generative model, used for unsupervised representation learning of short-term temporal phenomena, and a discriminative model, used for event detection and classification of long-range temporal dynamics. We evaluate our approach on multiple audio datasets (AVEC, VAM, and SPD) and demonstrate its superiority compared to the state of the art. Paper
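
As an illustration of the staged idea only (not the model from the paper), the sketch below substitutes PCA for the generative representation stage and a linear SVM for the discriminative stage; `X_unlabeled`, `X_labeled`, and `y_labeled` are assumed to be pre-extracted acoustic feature matrices and labels.

```python
# Minimal sketch of a staged hybrid pipeline: unsupervised representation
# learning followed by a discriminative classifier. PCA and a linear SVM
# stand in for the paper's generative and discriminative components.
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

def train_staged_hybrid(X_unlabeled, X_labeled, y_labeled, n_components=32):
    rep = PCA(n_components=n_components).fit(X_unlabeled)       # stage 1: representation
    clf = LinearSVC().fit(rep.transform(X_labeled), y_labeled)  # stage 2: classifier
    return rep, clf

def predict_emotion(rep, clf, X):
    return clf.predict(rep.transform(X))
```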

under: Main Conference, Publications

We propose a novel staged hybrid model that exploits the strength of discriminative classifiers along with the representational power of generative models. Our focus is on detecting multimodal events in time-varying data sequences. Discriminative classifiers have been shown to achieve higher performance than the corresponding generative likelihood-based classifiers. On the other hand, generative models learn a rich informative space which allows for data generation and joint feature representation that discriminative models lack. We employ a deep temporal generative model for unsupervised learning of a shared representation across multiple modalities with time-varying data. The temporal generative model takes into account short-term temporal phenomena and allows for filling in missing data by generating data within or across modalities. The hybrid model augments the temporal generative model with a Conditional Random Field (CRF)-based temporal discriminative model for event detection, classification, and generation, which enables modeling long-range temporal dynamics. We evaluate our approach on multiple audio-visual datasets (AVEC, AVLetters, and CUAVE) and demonstrate its superiority compared to the state of the art. Paper
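
The discriminative stage operates on sequences, so its inference resembles decoding the best label sequence under per-frame scores and pairwise transition scores. The sketch below shows plain Viterbi decoding as a stand-in for the CRF-based temporal model; the score matrices are hypothetical inputs, not part of the paper's code.

```python
# Minimal sketch of sequence decoding for the temporal discriminative stage:
# standard Viterbi over per-frame class scores and a transition matrix.
import numpy as np

def viterbi_decode(frame_scores, transition):
    """frame_scores: (T, K) log-scores; transition: (K, K) log-scores for
    consecutive class pairs. Returns the highest-scoring class sequence."""
    T, K = frame_scores.shape
    dp = np.empty((T, K))
    back = np.zeros((T, K), dtype=int)
    dp[0] = frame_scores[0]
    for t in range(1, T):
        cand = dp[t - 1][:, None] + transition      # (prev state, current state)
        back[t] = cand.argmax(axis=0)
        dp[t] = cand.max(axis=0) + frame_scores[t]
    path = [int(dp[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```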

under: Main Conference, Publications

This paper presents an efficient approach to video parsing. Our videos show a number of co-occurring individual and group activities. To address the challenges of this domain, we use an expressive spatiotemporal AND-OR graph (ST-AOG) that jointly models activity parts, their spatiotemporal relations, and context, and also enables multitarget tracking. Standard ST-AOG inference is prohibitively expensive in our setting, since it would require running a multitude of detectors, and tracking their detections, over long video footage. This problem is addressed by formulating a cost-sensitive inference of the ST-AOG as Monte Carlo Tree Search (MCTS). For querying an activity in the video, MCTS optimally schedules a sequence of detectors and trackers to be run, and where they should be applied in the space-time volume. Evaluation on the benchmark datasets demonstrates that MCTS enables speed-ups of two orders of magnitude without compromising accuracy relative to the standard cost-insensitive inference. Paper Poster Code
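
The core of the scheduling idea is the explore-exploit rule used to pick the next detector to run. The sketch below reduces that rule to a UCB1-style selection over a flat set of detectors, a simplification of the tree search in the paper; the detector callables and their observed gains are hypothetical.

```python
# Minimal sketch of UCB-style scheduling of detectors, a flat simplification
# of the explore-exploit rule used inside MCTS (hypothetical interfaces).
import math
import random

def ucb_schedule(detectors, budget, c=1.4):
    """detectors: list of callables, each returning an observed gain in [0, 1]
    (e.g., improvement of the parse score after running that detector)."""
    counts = [0] * len(detectors)
    gains = [0.0] * len(detectors)
    history = []
    for step in range(budget):
        untried = [i for i, n in enumerate(counts) if n == 0]
        if untried:                     # run every detector once first
            i = random.choice(untried)
        else:                           # then pick by mean gain + exploration bonus
            i = max(range(len(detectors)),
                    key=lambda j: gains[j] / counts[j]
                    + c * math.sqrt(math.log(step + 1) / counts[j]))
        g = detectors[i]()              # run the chosen detector
        counts[i] += 1
        gains[i] += g
        history.append((i, g))
    return history
```
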
under: Main Conference, Publications

This paper addresses a new problem, that of multiscale activity recognition. Our goal is to detect and localize a wide range of activities, including individual actions and group activities, which may simultaneously co-occur in high-resolution video. The video resolution allows for digital zoom-in (or zoom-out) for examining fine details (or coarser scales), as needed for recognition. The key challenge is how to avoid running a multitude of detectors at all spatiotemporal scales, and yet arrive at a holistically consistent video interpretation. To this end, we use a three-layered AND-OR graph to jointly model group activities, individual actions, and participating objects. The AND-OR graph allows a principled formulation of efficient, cost-sensitive inference via an explore-exploit strategy. Our inference optimally schedules the following computational processes: 1) direct application of activity detectors, called the α process; 2) bottom-up inference based on detecting activity parts, called the β process; and 3) top-down inference based on detecting activity context, called the γ process. The scheduling iteratively maximizes the log-posteriors of the resulting parse graphs. For evaluation, we have compiled and benchmarked a new dataset of high-resolution videos of group and individual activities co-occurring in a courtyard of the UCLA campus. Paper Presentation Code Dataset
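
To make the scheduling concrete, the sketch below greedily chooses, at every iteration, whichever of the three processes (α: direct detection, β: bottom-up from parts, γ: top-down from context) yields the largest increase in the log-posterior of the current parse graph. This greedy loop is a simplification of the paper's explore-exploit scheduler, and the process callables and `log_posterior` function are hypothetical.

```python
# Minimal sketch of greedy scheduling over the alpha, beta, and gamma
# processes; each process is assumed to return a new parse graph without
# modifying its input (hypothetical interfaces, not the paper's code).
def schedule_inference(parse_graph, processes, log_posterior, n_steps=20):
    """processes: dict mapping 'alpha' / 'beta' / 'gamma' to callables that
    take a parse graph and return an updated copy."""
    current = log_posterior(parse_graph)
    for _ in range(n_steps):
        best_graph, best_score = None, current
        for step in processes.values():
            candidate = step(parse_graph)
            score = log_posterior(candidate)
            if score > best_score:
                best_graph, best_score = candidate, score
        if best_graph is None:          # no process improves the posterior
            break
        parse_graph, current = best_graph, best_score
    return parse_graph, current
```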

under: Main Conference, Publications

This paper addresses recognition of human activities with stochastic structure, characterized by variable space-time arrangements of primitive actions, and conducted by a variable number of actors. We demonstrate that modeling aggregate counts of visual words is surprisingly expressive enough for such a challenging recognition task. An activity is represented by a sum-product network (SPN). The SPN is a mixture of bags-of-words (BoWs) with exponentially many mixture components, where subcomponents are reused by larger ones. The SPN consists of terminal nodes representing BoWs, and of product and sum nodes organized in a number of layers. The products are aimed at encoding particular configurations of primitive actions, and the sums serve to capture their alternative configurations. The connectivity of the SPN and the parameters of the BoW distributions are learned under weak supervision using the EM algorithm. SPN inference amounts to parsing the SPN graph, which yields the most probable explanation (MPE) of the video in terms of activity detection and localization. SPN inference has linear complexity in the number of nodes, under fairly general conditions, enabling fast and scalable recognition. A new Volleyball dataset is compiled and annotated for evaluation. Our classification accuracy and localization precision and recall are superior to those of the state of the art on the benchmark and our Volleyball datasets. Paper Poster Code Dataset
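
The sketch below shows what bottom-up evaluation of such a network looks like in log space: product nodes add their children's log-values and sum nodes combine weighted children with log-sum-exp. The tiny example network and its leaf log-likelihoods are made up for illustration; they are not the learned model from the paper.

```python
# Minimal sketch of bottom-up SPN evaluation in log space (hypothetical
# network and leaf values; linear in the number of nodes).
import math

class Leaf:                              # terminal node: log-likelihood of a BoW
    def __init__(self, loglik):
        self.loglik = loglik
    def value(self):
        return self.loglik

class Product:                           # encodes one configuration of parts
    def __init__(self, children):
        self.children = children
    def value(self):
        return sum(c.value() for c in self.children)

class Sum:                               # mixes alternative configurations
    def __init__(self, children, weights):
        self.children, self.weights = children, weights
    def value(self):
        terms = [math.log(w) + c.value()
                 for w, c in zip(self.weights, self.children)]
        m = max(terms)
        return m + math.log(sum(math.exp(t - m) for t in terms))

# Example: two alternative configurations of two primitive actions each.
spn = Sum([Product([Leaf(-2.0), Leaf(-1.5)]),
           Product([Leaf(-0.7), Leaf(-3.1)])], weights=[0.6, 0.4])
print(spn.value())                       # log-probability of the evidence
```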

under: Main Conference, Publications

Given a video, we would like to recognize group activities, localize video parts where these activities occur, and detect actors involved in them. This advances prior work that typically focuses only on video classification. We make a number of contributions. First, we specify a new, mid-level video feature aimed at summarizing local visual cues into bags of the right detections (BORDs). BORDs seek to identify the right people who participate in a target group activity among many noisy people detections. Second, we formulate a new, generative, chains model of group activities. Inference of the chains model identifies a subset of BORDs in the video that belong to occurrences of the activity, and organizes them in an ensemble of temporal chains. The chains extend over, and thus localize, the time intervals occupied by the activity. We formulate a new MAP inference algorithm that iterates two steps: i) warps the chains of BORDs in space and time to their expected locations, so the transformed BORDs can better summarize local visual cues; and ii) maximizes the posterior probability of the chains. We outperform the state of the art on the benchmark UT-Human Interaction and Collective Activities datasets, under reasonable running times. Paper Poster Code
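
As a rough illustration of what a temporal chain over noisy person detections is, the sketch below links, frame by frame, the detection closest to the chain's current position. This greedy linking is only a stand-in for the paper's MAP inference over chains of BORDs, and the data layout is hypothetical.

```python
# Minimal sketch of greedily linking per-frame person detections into a
# temporal chain (a stand-in for chain inference; hypothetical data layout).
import numpy as np

def link_chain(detections_per_frame, start):
    """detections_per_frame: list over frames of (N_t, 2) arrays of detection
    centers (x, y); start: (x, y) where the chain begins."""
    chain, pos = [], np.asarray(start, dtype=float)
    for dets in detections_per_frame:
        if len(dets) == 0:
            chain.append(None)           # no detection linked in this frame
            continue
        dists = np.linalg.norm(np.asarray(dets, dtype=float) - pos, axis=1)
        j = int(dists.argmin())
        chain.append(j)
        pos = np.asarray(dets[j], dtype=float)
    return chain
```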

under: Main Conference, Publications

Multiobject Tracking as Maximum-Weight Independent Set (CVPR 2011)

Posted: March 24, 2011

This paper addresses the problem of simultaneous tracking of multiple targets representing occurrences of distinct object classes in complex scenes. We apply object detectors to every frame, and build a graph of tracklets, defined as pairs of detection responses from every two consecutive frames. The graph helps transitively link the best-matching detections that do not violate hard and soft contextual constraints between the resulting tracks. We prove that this data association problem can be formulated as finding the heaviest subset of non-adjacent tracklets in the graph, called the maximum-weight independent set (MWIS). We present a new, polynomial-time MWIS algorithm, and prove that it converges to an optimum. The similarity between object detections and the contextual constraints between tracks, both used for data association, are learned online from object appearance and motion properties. Long-term occlusions are addressed by iteratively repeating MWIS to hierarchically merge smaller tracks into longer ones. We outperform the state of the art on the benchmark datasets, and show the advantages of simultaneously accounting for soft and hard constraints in multitarget tracking. Paper Presentation Code
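
To illustrate the MWIS formulation of data association (not the paper's polynomial-time algorithm), the sketch below runs the standard greedy approximation on a small conflict graph of tracklets: repeatedly pick the heaviest remaining tracklet and discard all tracklets that conflict with it. The weights and conflicts are made-up examples.

```python
# Minimal sketch of data association as maximum-weight independent set,
# using the standard greedy approximation (illustrative only).
def greedy_mwis(weights, adjacency):
    """weights: dict tracklet -> weight; adjacency: dict tracklet -> set of
    conflicting tracklets (e.g., tracklets sharing a detection)."""
    selected, excluded = set(), set()
    for node in sorted(weights, key=weights.get, reverse=True):
        if node in excluded:
            continue
        selected.add(node)
        excluded |= adjacency.get(node, set())
    return selected

# Example: tracklets 'a' and 'b' conflict, 'c' is independent of both.
print(greedy_mwis({"a": 2.0, "b": 1.5, "c": 1.0},
                  {"a": {"b"}, "b": {"a"}, "c": set()}))   # -> {'a', 'c'}
```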

under: Main Conference, Publications

Monocular Estimation of 2.1D Sketch (ICIP 2010)

Posted: March 24, 2011

The 2.1D sketch is a layered representation of occluding and occluded surfaces of the scene. Extracting the 2.1D sketch from a single image is a difficult and important problem arising in many applications. We present a fast and robust algorithm that uses boundaries of image regions and T-junctions, as important visual cues about the scene structure, to estimate the scene layers. The estimation is formulated as a quadratic optimization with hinge-loss based constraints, so that the 2.1D sketch is smooth in all image areas except across image contours, and so that image regions forming the “stems” of T-junctions correspond to occluded surfaces in the scene. Quantitative and qualitative results on challenging, real-world images from the Stanford depth-map and Berkeley segmentation datasets demonstrate the high accuracy, efficiency, and robustness of our approach. Paper Poster Code
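
As an illustration of the kind of objective involved (not the paper's solver), the sketch below assigns a relative depth to each region by minimizing a smoothness term over neighbouring regions plus squared-hinge penalties that push the "stem" region of each T-junction behind its occluders; the region graph and T-junction list are hypothetical inputs.

```python
# Minimal sketch of layer estimation: smoothness between neighbouring regions
# plus squared-hinge penalties enforcing that the stem of each T-junction
# lies behind its occluding regions (hypothetical inputs, illustrative only).
import numpy as np
from scipy.optimize import minimize

def estimate_layers(n_regions, neighbours, tjunctions, margin=1.0, lam=0.1):
    """neighbours: list of (i, j) adjacent region pairs;
    tjunctions: list of (occluded, occluder) region pairs from T-junctions."""
    def objective(d):
        smooth = sum((d[i] - d[j]) ** 2 for i, j in neighbours)
        hinge = sum(max(0.0, margin - (d[i] - d[j])) ** 2
                    for i, j in tjunctions)    # occluded region should lie deeper
        return lam * smooth + hinge
    res = minimize(objective, x0=np.zeros(n_regions), method="Powell")
    return res.x                               # relative depth per region

# Example: region 2 is the stem of a T-junction formed by regions 0 and 1.
print(estimate_layers(3, neighbours=[(0, 1), (1, 2)],
                      tjunctions=[(2, 0), (2, 1)]))
```
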
under: Main Conference, Publications
