We investigate the tracking of 2-D human poses in a video stream to determine the spatial configuration of body parts in each frame. This task is nontrivial because people may wear different kinds of clothing and may move quickly and unpredictably. Pose estimation is typically applied to this task, but it ignores temporal context and cannot provide smooth, reliable tracking results. Therefore, we develop a tracking and estimation integrated model (TEIM) that fully exploits temporal information by integrating pose estimation with visual tracking. However, joint parsing of multiple articulated parts over time is difficult, because a full model with edges capturing all pairwise relationships within and between frames is loopy and intractable. Previous models usually resorted to approximate inference, which cannot guarantee good results and is computationally expensive.
We overcome these problems with a divide-and-conquer approach that decomposes the full model into two much simpler, tractable submodels. In addition, we propose a novel two-step iteration strategy to solve the joint parsing problem efficiently. Algorithmically, TEIM is designed so that: 1) pose estimation and visual tracking compensate for each other to achieve desirable tracking results; 2) it can deal with the problem of tracking loss; and 3) it needs only past information and is therefore capable of online tracking. Experiments are conducted on two public in-the-wild data sets with ground-truth layout annotations, and the results indicate the effectiveness of the proposed TEIM framework.