Insights into Top Paper Nominee, “Ego-Body Pose Estimation via Ego-Head Pose Estimation”

A Q&A with the Authors

Jiaman Li, Karen Liu, Jiajun Wu

Paper Presentation: Tuesday, 20 June, 3:00 p.m. PDT, East Exhibit Halls A-B

In the CVPR 2023 paper, “Ego-Body Pose Estimation via Ego-Head Pose Estimation,” the authors aim to enable more dynamic and realistic human motion synthesis compared to prior work. The following Q&A interview details how they meet this objective.

CVPR: Will you please share a little more about your work and results? How is it different from the standard approaches to date?

Our work enables us to reconstruct human motion anywhere with a portable and lightweight device (an RGB camera). This is critical to VR/AR applications where motion reconstruction from sparse signals is needed. Also, this pipeline can be leveraged to facilitate the understanding of human behaviors from egocentric videos. For example, estimated human poses can be used as another modality to improve action recognition for egocentric videos.

Specifically, our work aims to address the problem of full-body human motion estimation from egocentric video. This is a challenging problem. By decomposing the problem into two stages with the head pose as an intermediate representation, we can learn to generate realistic human motions from the given egocentric video without training on paired egocentric video and ground truth full-body human motion. This decomposition enables us to learn each stage separately, effectively leveraging datasets with different modalities.
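The two-stage decomposition described above can be sketched as an interface. This is only an illustration and not the authors' implementation: the names `HeadPose`, `run_two_stage`, `toy_head_estimator`, and `toy_body_generator`, as well as the joint count of 22, are hypothetical, and the two toy stages return placeholder values rather than real estimates. The point is that the stages communicate only through head poses, so each one could be trained on its own dataset.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class HeadPose:
    """Hypothetical intermediate representation: one head pose per frame."""
    position: Tuple[float, float, float]         # (x, y, z)
    rotation: Tuple[float, float, float, float]  # quaternion (w, x, y, z)


BodyPose = List[float]  # hypothetical: one flat vector of joint parameters per frame


def run_two_stage(
    frames: Sequence[object],
    head_estimator: Callable[[Sequence[object]], List[HeadPose]],
    body_generator: Callable[[List[HeadPose], random.Random], List[BodyPose]],
    rng: random.Random,
) -> List[BodyPose]:
    """Stage 1 maps video to head poses; stage 2 maps head poses to body motion."""
    head_poses = head_estimator(frames)
    if len(head_poses) != len(frames):
        raise ValueError("expected one head pose per frame")
    return body_generator(head_poses, rng)


def toy_head_estimator(frames: Sequence[object]) -> List[HeadPose]:
    # Placeholder for stage 1: a constant pose for every frame.
    return [HeadPose((0.0, 0.0, 1.6), (1.0, 0.0, 0.0, 0.0)) for _ in frames]


def toy_body_generator(
    head_poses: List[HeadPose], rng: random.Random, num_joints: int = 22
) -> List[BodyPose]:
    # Placeholder for stage 2: random joint values, one vector per head pose.
    return [[rng.gauss(0.0, 1.0) for _ in range(num_joints)] for _ in head_poses]


motion = run_two_stage(
    frames=[None] * 4,  # stand-in for four video frames
    head_estimator=toy_head_estimator,
    body_generator=toy_body_generator,
    rng=random.Random(0),
)
```

Because `run_two_stage` only passes head poses between the stages, either placeholder could be swapped for a learned model without touching the other.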

Standard approaches to human motion estimation usually rely on third-person-view videos. However, capturing a suitable monocular video for motion estimation requires choosing the camera position carefully so that the person stays in view and is not occluded. The camera position also needs to be adjusted frequently as the person moves around. Compared with third-person-view setups, an egocentric device is more portable and more convenient: it only needs to be mounted on the wearer’s forehead and requires no further adjustment.

CVPR: How did your model outperform state-of-the-art methods? What was the key factor in these results?

Prior works usually address the problem by first collecting a paired dataset and then training a regression model on it. Since the data collection requires motion capture devices, the collected dataset is hard to scale and is constrained to a lab-like environment, lacking motion and scene diversity. In our work, we do not rely on such data and thus are not constrained by the scale and diversity of the paired dataset. In addition, the mapping from egocentric videos to full-body poses is complex, as the human body is usually unobserved in the egocentric video. Directly applying a regression model is insufficient to capture this one-to-many mapping. In our work, we propose to use the head pose as an intermediate representation and employ a conditional diffusion model to synthesize multiple plausible full-body poses given the same head poses extracted from the egocentric video input.
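A toy example may help show why plain regression struggles with a one-to-many mapping. It is a minimal sketch with made-up numbers and has nothing to do with the actual model or data: `plausible_arm_angles`, `mse_optimal_prediction`, and `sample_pose` are hypothetical names, and `sample_pose` is a trivial stand-in for a generative sampler, not a diffusion model. Assume that for one fixed head pose, two arm configurations are equally plausible.

```python
import random
import statistics

# Made-up data: for the same head pose, two equally plausible arm angles.
plausible_arm_angles = [-1.0, 1.0]


def mse_optimal_prediction(targets):
    """The prediction that minimizes squared error over the targets is their mean."""
    return statistics.fmean(targets)


def sample_pose(targets, rng):
    """Stand-in for a generative sampler: returns one of the plausible modes."""
    return rng.choice(targets)


rng = random.Random(0)
regressed = mse_optimal_prediction(plausible_arm_angles)
samples = [sample_pose(plausible_arm_angles, rng) for _ in range(5)]

print("regression output:", regressed)
print("sampled outputs:  ", samples)
```

The regression output is the average of the two modes, which matches neither plausible pose, while every sampled output is one of the plausible poses. A conditional generative model aims for the second behavior.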

A key factor in our work is that we decompose the problem into two sub-problems, head pose estimation from egocentric video and full-body pose estimation from the predicted head pose, which eliminates the need for a paired dataset. At the same time, we formulate the problem using an advanced generative model, conditional diffusion, enabling more dynamic and realistic human motion synthesis compared to prior work.

CVPR: So, what’s next? What do you see as the future of your research?

Currently, our motion estimation does not consider interaction with the environment. A promising direction is to include environmental constraints in our current framework. For example, we can leverage reconstructed scenes from egocentric videos and enforce penetration constraints or contact constraints on our estimated human motion. In addition, our current approach does not leverage hand information. Another interesting direction is to extract hand trajectories from the egocentric video and add hand information to produce more accurate full-body poses.

CVPR: What more would you like to add?

Please check our project webpage (https://lijiaman.github.io/projects/egoego/) for more visualization results.

Annually, CVPR recognizes top research in the field through its prestigious “Best Paper Awards.” This year, from more than 9,000 paper submissions, the CVPR 2023 Paper Awards Committee selected 12 candidates for the coveted honor of Best Paper. Join us for the Award Session on Wednesday, 21 June at 8:30 a.m. to find out which nominees take home the distinction of “Best Paper” at CVPR 2023.