Autonomous robotic systems operating in human environments must understand their surroundings to make accurate and safe decisions. In crowded human scenes with close-up human-robot interaction and robot navigation, a deep understanding requires reasoning about human motion and body dynamics over time with human body pose estimation and tracking. However, existing datasets either do not provide pose annotations or include scene types unrelated to robotic applications. Many datasets also lack the diversity of poses and occlusions found in crowded human scenes. To address these limitations we introduce JRDB-Pose, a large-scale dataset and benchmark for multi-person pose estimation and tracking using videos captured from a social navigation robot. The dataset contains challenging scenes with crowded indoor and outdoor locations and a diverse range of scales and occlusion types. JRDB-Pose provides human pose annotations with per-keypoint occlusion labels and track IDs consistent across the scene. A public evaluation server is made available for fair evaluation on a held-out test set. JRDB-Pose is available at https://jrdb.erc.monash.edu/ .
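To make the annotation structure concrete, the following is a minimal Python sketch of how pose annotations with per-keypoint occlusion labels and track IDs could be represented in code. The class names, fields, and occlusion categories are illustrative assumptions, not the dataset's actual schema or loading API.

```python
# A hypothetical representation of a JRDB-Pose-style annotation record.
# Field names and occlusion categories are assumptions for illustration only.
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Tuple

class Occlusion(IntEnum):
    """Hypothetical per-keypoint occlusion states."""
    VISIBLE = 0
    OCCLUDED = 1   # keypoint hidden by another object or person
    INVISIBLE = 2  # e.g. outside the camera frame

@dataclass
class PoseAnnotation:
    track_id: int                          # identity, consistent across frames
    keypoints: List[Tuple[float, float]]   # (x, y) pixel coordinates per joint
    occlusion: List[Occlusion]             # one occlusion label per keypoint

def visible_keypoints(pose: PoseAnnotation) -> List[Tuple[float, float]]:
    """Return only the keypoints labelled as visible."""
    return [kp for kp, occ in zip(pose.keypoints, pose.occlusion)
            if occ == Occlusion.VISIBLE]
```

Keeping the occlusion label per keypoint, rather than per person, is what lets a benchmark score partially occluded poses fairly: metrics can be restricted to joints that were actually observable.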
Multi-view approaches to people-tracking have the potential to better handle occlusions than single-view ones in crowded scenes. They often rely on the tracking-by-detection paradigm, which involves detecting people first and then connecting the detections. In this paper, we argue that an even more effective approach is to predict people's motion over time and to infer people's presence in individual frames from these predictions. This makes it possible to enforce consistency both over time and across views within a single temporal frame. We validate our approach on the PETS2009 and WILDTRACK datasets and demonstrate that it outperforms state-of-the-art methods.
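To illustrate the general principle of motion prediction filling in for missed detections, here is a toy Python sketch using a constant-velocity model and greedy nearest-neighbour association. This is a minimal stand-in for the idea, not the paper's model: all names, the gating threshold, and the constant-velocity assumption are illustrative choices.

```python
# Toy sketch: predict each person's ground-plane position with a
# constant-velocity model; when the detector misses someone, the track
# "coasts" on its prediction instead of disappearing. Working on a common
# ground plane is what keeps the state consistent across camera views.
import numpy as np

class Track:
    def __init__(self, position: np.ndarray):
        self.position = np.asarray(position, dtype=float)  # ground-plane (x, y)
        self.velocity = np.zeros(2)

    def predict(self) -> np.ndarray:
        """Predicted position at the next frame."""
        return self.position + self.velocity

    def update(self, detection: np.ndarray) -> None:
        """Update state from an associated detection."""
        detection = np.asarray(detection, dtype=float)
        self.velocity = detection - self.position
        self.position = detection

def step(tracks, detections, gate=1.0):
    """One frame of greedy prediction-to-detection association."""
    assigned = set()
    for track in tracks:
        pred = track.predict()
        dists = [np.linalg.norm(pred - np.asarray(d)) for d in detections]
        j = int(np.argmin(dists)) if dists else -1
        if j >= 0 and dists[j] < gate and j not in assigned:
            track.update(detections[j])
            assigned.add(j)
        else:
            track.position = pred  # no match: trust the motion prediction
```

In contrast, a pure tracking-by-detection pipeline would only link detections that actually fired, so an occluded person drops out of the track; predicting motion first is what allows presence to be inferred in frames where the detector fails.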