Novel View Synthesis of Human Interactions From Sparse Multi-view Videos

Paper | Code | Dataset


Given sparse multi-view videos of human performers, our approach generates high-fidelity novel views and accurate instance masks even for crowded scenes. This scene was captured by 8 GoPro cameras.

Results on ZJUMoCap

Results in the wild



The video is captured by 8 GoPro cameras.

Download the example data. To keep the downloads small, we only provide the compressed videos, so you should first extract images from them:

data=<path/to/example/data>
# extract the images
python3 apps/preprocess/extract_image.py ${data}

Then extract the vertices from the SMPL parameters:

python3 apps/postprocess/write_vertices.py ${data}/output-smpl-3d/smpl ${data}/output-smpl-3d/vertices --cfg_model ${data}/output-smpl-3d/cfg_model.yml --mode vertices
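The export step above recovers mesh vertices from SMPL body-model parameters. At its core, SMPL poses a template mesh with linear blend skinning (LBS). The following is a minimal, self-contained sketch of LBS on toy data — it is illustrative only and is not the project's `write_vertices.py`; all names and shapes here are assumptions:

```python
import numpy as np

def linear_blend_skinning(vertices, weights, transforms):
    """Deform rest-pose vertices by blending per-joint rigid transforms.

    vertices:   (V, 3) rest-pose vertex positions
    weights:    (V, J) skinning weights; each row sums to 1
    transforms: (J, 4, 4) world transform of each joint
    """
    V = vertices.shape[0]
    # Homogeneous coordinates: (V, 4)
    homo = np.concatenate([vertices, np.ones((V, 1))], axis=1)
    # Per-vertex blended transform: (V, 4, 4)
    blended = np.einsum("vj,jab->vab", weights, transforms)
    # Apply each vertex's blended transform
    posed = np.einsum("vab,vb->va", blended, homo)
    return posed[:, :3]

# Toy example: two vertices, each bound fully to one of two joints
verts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
w = np.array([[1.0, 0.0], [0.0, 1.0]])
T = np.stack([np.eye(4), np.eye(4)])
T[1, 0, 3] = 0.5  # translate joint 1 by +0.5 along x
out = linear_blend_skinning(verts, w, T)
# vertex 0 stays put; vertex 1 moves to x = 1.5
```

The real SMPL model additionally applies shape and pose blend-shape offsets to the template before skinning; this sketch covers only the skinning step.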

Install

First, install the EasyMocap environment. This project additionally depends on pytorch-lightning and spconv; see requirements_neuralbody.txt for more details.

pip install -r requirements_neuralbody.txt
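If the pip step succeeds but training later fails on imports, a quick sanity check can confirm that the key dependencies resolve. This is a generic sketch; the module names are assumed from the dependency list above:

```python
import importlib.util

# Import names this project expects (may differ from the pip package names)
required = ["torch", "pytorch_lightning", "spconv"]

# find_spec returns None for modules that cannot be imported
missing = [m for m in required if importlib.util.find_spec(m) is None]
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All key dependencies found.")
```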

Train

data=/path/to/dataset
# Training with 4x RTX 3090 GPUs is recommended
python3 apps/neuralbody/demo.py --mode soccer1_6 ${data} --gpus 0,1,2,3
# Reduce the number of rays if you train with a GTX 1080 Ti or RTX 3060
python3 apps/neuralbody/demo.py --mode soccer1_6 ${data} --gpus 0, data_share_args.sample_args.nrays 1024
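The trailing `data_share_args.sample_args.nrays 1024` is a dot-path override that patches a nested field of the experiment config from the command line. A generic sketch of how such an override can be applied — this is hypothetical and not the project's actual parser:

```python
def apply_override(cfg, dotted_key, value):
    """Set cfg['a']['b']['c'] = value given the dotted key 'a.b.c'."""
    keys = dotted_key.split(".")
    node = cfg
    # Walk (and create, if needed) the intermediate dicts
    for k in keys[:-1]:
        node = node.setdefault(k, {})
    node[keys[-1]] = value
    return cfg

# Mimic overriding the number of sampled rays per batch
cfg = {"data_share_args": {"sample_args": {"nrays": 4096}}}
apply_override(cfg, "data_share_args.sample_args.nrays", 1024)
# cfg["data_share_args"]["sample_args"]["nrays"] is now 1024
```

Fewer rays per batch lowers GPU memory use at the cost of slower convergence, which is why the single-GPU command above reduces `nrays`.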

Demo

# Render with 4x RTX 3090 GPUs
python3 apps/neuralbody/demo.py --mode soccer1_6 ${data} --gpus 0,1,2,3 --demo
# Render on a single GPU with fewer rays (not recommended; much slower)
python3 apps/neuralbody/demo.py --mode soccer1_6 ${data} --gpus 0, data_share_args.sample_args.nrays 1024 --demo

Limitations and future work

Currently, the proposed approach is limited to settings with multiple human performers, balls as the only objects, a simple background, and a calibrated camera array. As future work, the system could be enhanced in several ways to handle more general settings.

  1. Recovering human interactions from moving cameras, or even a monocular video, could be investigated further.
  2. More general objects can be handled by tracking the 6DoF poses with object pose trackers.
  3. If offline scanning of the background is available, the rendering quality of the background can be further improved.

Many wonderful prior works inspired this project.

Bibtex

@inproceedings{shuai2022multinb,
  title={Novel View Synthesis of Human Interactions from Sparse Multi-view Videos},
  author={Shuai, Qing and Geng, Chen and Fang, Qi and Peng, Sida and Shen, Wenhao and Zhou, Xiaowei and Bao, Hujun},
  booktitle={SIGGRAPH Conference Proceedings},
  year={2022}
}

Acknowledgement

The authors would like to acknowledge support from NSFC (No. 62172364).

We would like to thank Haian Jin for processing the instance segmentation.

We thank Zhengdong Hong for advice on generating the visualizations.

Special thanks to the Women’s campus football team of Zhejiang University and Beijia Chen.
