Our inference begins with noisy detectors that attempt to localize the time intervals occupied by primitive events in the video. We first apply a semi-supervised tracker to obtain the spatiotemporal tracks of the players and the ball. A pose-based detector then detects and localizes primitive events in each extracted track. These detections are combined with the PEL domain knowledge, which includes hard and soft constraints, and MAP inference is applied to produce a video interpretation in terms of the occurrence intervals of all observable and hidden event types of interest. An interpretation is characterized by a score, defined as the sum of the weights of the formulas that are valid in the interpretation.
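To make the scoring rule concrete, the following is a minimal Python sketch of how an interpretation score could be computed as the sum of the weights of its valid formulas. The event names (`dribble`, `shot`), the three toy formulas, their weights, and the convention of marking a hard constraint with an infinite weight are all illustrative assumptions, not the formulas or the representation used in our KB.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Interval = Tuple[int, int]                  # inclusive (start_frame, end_frame)
Interpretation = Dict[str, List[Interval]]  # event type -> occurrence intervals


@dataclass
class Formula:
    name: str
    weight: float                            # math.inf marks a hard constraint (assumption)
    holds: Callable[[Interpretation], bool]  # True if the formula is valid in the interpretation


def overlaps(a: Interval, b: Interval) -> bool:
    """Two inclusive intervals overlap if each starts before the other ends."""
    return a[0] <= b[1] and b[0] <= a[1]


def score(interp: Interpretation, kb: List[Formula]) -> float:
    """Sum the weights of the valid soft formulas.

    Assumption: an interpretation that violates a hard constraint is
    inadmissible, which is encoded here as a score of -inf.
    """
    total = 0.0
    for f in kb:
        valid = f.holds(interp)
        if math.isinf(f.weight):
            if not valid:
                return -math.inf
        elif valid:
            total += f.weight
    return total


# Hypothetical toy formulas over two hypothetical event types.
def no_overlap(i: Interpretation) -> bool:
    return not any(overlaps(d, s)
                   for d in i.get("dribble", []) for s in i.get("shot", []))


def dribble_before_shot(i: Interpretation) -> bool:
    return all(any(d[1] < s[0] for d in i.get("dribble", []))
               for s in i.get("shot", []))


def has_shot(i: Interpretation) -> bool:
    return len(i.get("shot", [])) > 0


KB = [
    Formula("no_overlap", math.inf, no_overlap),               # hard
    Formula("dribble_before_shot", 1.5, dribble_before_shot),  # soft
    Formula("has_shot", 0.5, has_shot),                        # soft
]

consistent = {"dribble": [(0, 40)], "shot": [(41, 60)]}
inconsistent = {"dribble": [(0, 50)], "shot": [(41, 60)]}

print(score(consistent, KB))    # 2.0
print(score(inconsistent, KB))  # -inf
```

In this toy example the first interpretation satisfies the hard constraint and both soft formulas, so its score is 1.5 + 0.5; the second violates the hard constraint and is rejected regardless of its soft formulas.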
Our inference seeks to iteratively maximize the interpretation score by probabilistically adding and deleting formulas, or by adding and deleting time intervals for which a particular formula is valid in the current interpretation. This probabilistic editing of the interpretation takes into account all soft and hard constraints in the KB, and thus generates a holistic, consistent video interpretation. The weights of the PEL formulas are learned iteratively on labeled training videos by comparing the PEL’s interpretation with the ground truth and adjusting the weights so as to reduce the score of wrong interpretations and increase the score of correct ones.
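The weight-learning step can be illustrated with a perceptron-style update, sketched below in Python. This is a stand-in chosen for simplicity and is not necessarily the exact update rule used in our learning; the formula names, the initial weights, and the learning rate are hypothetical. Each interpretation is summarized by the set of soft formulas that are valid in it.

```python
from typing import Dict, Set


def score(weights: Dict[str, float], valid: Set[str]) -> float:
    """Interpretation score: sum of the weights of its valid formulas."""
    return sum(weights[name] for name in valid)


def perceptron_update(weights: Dict[str, float],
                      truth_valid: Set[str],
                      predicted_valid: Set[str],
                      learning_rate: float = 0.5) -> Dict[str, float]:
    """Return new weights after comparing a prediction with the ground truth.

    Formulas valid only in the ground truth gain weight; formulas valid only
    in the (wrong) predicted interpretation lose weight. Formulas valid in
    both, or in neither, are left unchanged.
    """
    updated = dict(weights)
    for name in truth_valid - predicted_valid:
        updated[name] += learning_rate
    for name in predicted_valid - truth_valid:
        updated[name] -= learning_rate
    return updated


# Hypothetical formulas and weights, for illustration only.
weights = {"dribble_before_shot": 0.0, "shot_after_jump": 1.0}
truth = {"dribble_before_shot"}    # formulas valid in the ground truth
predicted = {"shot_after_jump"}    # formulas valid in the inferred interpretation

new_weights = perceptron_update(weights, truth, predicted)

margin_before = score(weights, truth) - score(weights, predicted)
margin_after = score(new_weights, truth) - score(new_weights, predicted)
print(margin_before, margin_after)  # -1.0 0.0
```

After one update the score gap between the ground truth and the wrong interpretation shrinks from -1.0 to 0.0; repeating the infer-compare-update loop over the training videos would continue to move the weights in favor of the correct interpretations.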