High-Fidelity 3D Head Avatars Reconstruction through Spatially-Varying Expression Conditioned Neural Radiance Field

Minghan Qin*, Yifan Liu*, Yuelang Xu, Xiaochen Zhao, Yebin Liu#, Haoqian Wang#
* indicates equal contribution, # indicates co-corresponding authors
Tsinghua University

Abstract

One crucial aspect of 3D head avatar reconstruction lies in the details of facial expressions. Although recent NeRF-based photo-realistic 3D head avatar methods achieve high-quality avatar rendering, they still encounter challenges in retaining intricate facial expression details, because they overlook the potential of expression variations at different spatial positions when conditioning the radiance field. Motivated by this observation, we introduce a novel Spatially-Varying Expression (SVE) conditioning. The SVE can be obtained by a simple MLP-based generation network, and it encompasses both spatial positional features and global expression information. Benefiting from the rich and diverse information the SVE carries at different positions, the proposed SVE-conditioned neural radiance field can handle intricate facial expressions and achieve realistic rendering and geometric detail for high-fidelity 3D head avatars. Additionally, to further elevate the geometric and rendering quality, we introduce a new coarse-to-fine training strategy, including a geometry initialization strategy at the coarse stage and an adaptive importance sampling strategy at the fine stage. Extensive experiments indicate that our method outperforms other state-of-the-art (SOTA) methods in rendering and geometry quality on both mobile phone-collected and public datasets.
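To make the SVE construction concrete, the snippet below sketches one plausible MLP-based generation network in PyTorch: a sinusoidal positional encoding of each sampled 3D point is fused with the tracked global expression code. The module name SVEGenerator, the dimensions, and the encoding choice are illustrative assumptions, not the paper's released implementation.

# Hypothetical sketch of an MLP-based SVE generation network.
# Names and dimensions are assumptions, not the authors' code.
import torch
import torch.nn as nn

def positional_encoding(x: torch.Tensor, num_freqs: int = 6) -> torch.Tensor:
    # Standard NeRF-style sinusoidal encoding of 3D points, shape (N, 3) -> (N, 6 * num_freqs).
    freqs = 2.0 ** torch.arange(num_freqs, device=x.device)   # (F,)
    angles = x[..., None] * freqs                              # (N, 3, F)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)                           # (N, 3 * 2F)

class SVEGenerator(nn.Module):
    # Maps (point positional features, global expression) -> spatially-varying expression.
    def __init__(self, expr_dim: int = 64, num_freqs: int = 6, hidden: int = 128):
        super().__init__()
        in_dim = 3 * 2 * num_freqs + expr_dim
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, expr_dim),
        )
        self.num_freqs = num_freqs

    def forward(self, points: torch.Tensor, expr: torch.Tensor) -> torch.Tensor:
        # points: (N, 3) sampled positions; expr: (expr_dim,) global 3DMM expression code.
        feats = positional_encoding(points, self.num_freqs)
        expr_tiled = expr.expand(points.shape[0], -1)
        return self.mlp(torch.cat([feats, expr_tiled], dim=-1))  # (N, expr_dim) per-point SVE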

Video


Method

Given a portrait video, we first track the global expression parameters ϵ using a 3DMM. After this pre-processing, for each sampled 3D point p_o in observation space, we apply the generation network G to extend the global expression parameters ϵ with the spatial positional features of p_o, yielding the spatially-varying expression ϵ'. Then, through a deformation network D, we transform p_o from the observation space to p_c in the canonical space, conditioned on ϵ'. Subsequently, we use an ϵ'-conditioned NeuS to predict the SDF value and color c corresponding to p_c. Finally, we obtain the rendered RGB image and normal map via volumetric rendering.
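The sketch below follows this forward pass step by step: generate the SVE ϵ' for each sample, warp p_o to p_c with a deformation network, query an ϵ'-conditioned SDF/color field, and composite along the ray. DeformNet, SDFNet, render_ray, and the SDF-to-opacity mapping are simplified stand-ins (the exact NeuS weighting and the paper's architectures differ); the gen argument can be the SVEGenerator from the previous snippet.

# Hypothetical end-to-end forward pass for one ray, following the steps above.
import torch
import torch.nn as nn

class DeformNet(nn.Module):
    # Stand-in for D: warps observation-space points p_o to canonical p_c, conditioned on the SVE.
    def __init__(self, expr_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + expr_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, p_o, sve):
        # Predict a per-point offset rather than absolute coordinates.
        return p_o + self.mlp(torch.cat([p_o, sve], dim=-1))

class SDFNet(nn.Module):
    # Stand-in for the SVE-conditioned NeuS field: (canonical point, SVE) -> (SDF, RGB).
    def __init__(self, expr_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + expr_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 1 SDF channel + 3 color channels
        )

    def forward(self, p_c, sve):
        out = self.mlp(torch.cat([p_c, sve], dim=-1))
        return out[..., :1], torch.sigmoid(out[..., 1:])

def render_ray(p_o, deltas, expr, gen, deform, field, s: float = 64.0):
    # p_o: (N, 3) samples along one ray; deltas: (N,) inter-sample distances;
    # expr: global 3DMM expression code; gen/deform/field play the roles of G, D, and NeuS.
    sve = gen(p_o, expr)               # spatially-varying expression eps'
    p_c = deform(p_o, sve)             # observation space -> canonical space
    sdf, rgb = field(p_c, sve)
    # Simplified opacity from SDF (logistic squashing; exact NeuS weighting differs).
    density = s * torch.sigmoid(-s * sdf.squeeze(-1))
    alpha = 1.0 - torch.exp(-density * deltas)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:1]), 1.0 - alpha + 1e-7])[:-1], dim=0)
    weights = trans * alpha            # standard volume-rendering weights
    return (weights[:, None] * rgb).sum(dim=0)  # composited pixel color

Predicting a per-point offset in DeformNet, rather than absolute canonical coordinates, keeps the warp close to identity at initialization, which is a common stabilization choice in deformable NeRF pipelines.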

Results

Self-Reenactment

Cross-Identity Reenactment

BibTeX

@article{qin2023high,
  title={High-Fidelity 3D Head Avatars Reconstruction through Spatially-Varying Expression Conditioned Neural Radiance Field},
  author={Qin, Minghan and Liu, Yifan and Xu, Yuelang and Zhao, Xiaochen and Liu, Yebin and Wang, Haoqian},
  journal={arXiv preprint arXiv:2310.06275},
  year={2023}
}