From Blurry to Believable

Enhancing Low-quality Talking Heads with 3D Generative Priors


1Carnegie Mellon University     2Texas A&M University    

TL;DR: Given a low-resolution 3D head avatar reconstructed from low-quality captures, SuperHead super-resolves high-fidelity geometry and detailed textures while ensuring multiview and temporal consistency under diverse facial expressions.

Abstract

Creating high-fidelity, animatable 3D talking heads is crucial for immersive applications, yet it is often hindered by the prevalence of low-quality image or video sources, which yield poor 3D reconstructions. In this paper, we introduce SuperHead, a novel framework for enhancing low-resolution, animatable 3D head avatars. The core challenge lies in synthesizing high-quality geometry and textures while ensuring both 3D and temporal consistency during animation and preserving subject identity. Despite recent progress in image-, video-, and 3D-based super-resolution (SR), existing SR techniques are ill-equipped to handle dynamic 3D inputs. To address this, SuperHead leverages the rich priors of pre-trained 3D generative models via a novel dynamics-aware 3D inversion scheme. This process optimizes the latent representation of the generative model to produce a super-resolved 3D Gaussian Splatting (3DGS) head model, which is subsequently bound to an underlying parametric head model for animation. The inversion is jointly supervised by a sparse collection of upscaled 2D face renderings and corresponding depth maps, captured from diverse facial expressions and camera viewpoints, to ensure realism under dynamic facial motions. Experiments demonstrate that SuperHead generates avatars with fine-grained facial details under dynamic motions, significantly outperforming baseline methods in visual quality.

Method

Overview of SuperHead. Given a low-resolution 3D head avatar driven by a morphable model, we first reconstruct a static 3D head in canonical space via multi-view 3D GAN inversion. We then refine the mesh geometry and rig 3D Gaussians onto the mesh surface to enable animation. Finally, we include anchor images with diverse camera poses and expressions for dynamics-aware 3D refinement, ensuring that the 3D head model remains robust across viewing angles and complex facial motions.

Comparisons to Baselines

We compare our approach with other enhancement methods. SuperHead synthesizes high-quality facial details across diverse expressions, clearly outperforming baselines and in some cases approaching the pseudo ground-truth head avatar. All methods are driven and rendered with novel camera poses and expressions.

Video Comparisons

Our method recovers high-quality facial details across diverse expressions and head poses, while preserving the subject identity and multi-expression consistency.
