WiLoR: End-to-end 3D hand localization and reconstruction in-the-wild

¹Imperial College London, United Kingdom  ²Shanghai Jiao Tong University, China


Abstract

In recent years, 3D hand pose estimation methods have garnered significant attention due to their extensive applications in human-computer interaction, virtual reality, and robotics. In contrast, there has been a notable gap in hand detection pipelines, which makes it difficult to build effective real-world multi-hand reconstruction systems. In this work, we present a data-driven pipeline for efficient multi-hand reconstruction in the wild. The proposed pipeline is composed of two components: a real-time, fully convolutional hand localization model and a high-fidelity, transformer-based 3D hand reconstruction model. To tackle the limitations of previous methods and build a robust and stable detection network, we introduce a large-scale dataset with more than 2M in-the-wild hand images with diverse lighting, illumination and occlusion conditions. Our approach outperforms previous methods in both efficiency and accuracy on popular 2D and 3D benchmarks. Finally, we showcase the effectiveness of our pipeline in achieving smooth 3D hand tracking from monocular videos, without modeling any temporal components.

Overview

WiLoR is a multi-hand localization and reconstruction framework. Using a real-time detection and localization model, WiLoR can reconstruct multiple hands with high fidelity.
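The two-stage flow described above (localize every hand, then reconstruct each one from its crop) can be sketched as follows. This is a minimal illustration and not WiLoR's actual code: `HandDetection`, `HandMesh`, `crop`, and `reconstruct_hands` are hypothetical names, the detector and reconstructor are passed in as plain callables, and the image is a nested list purely to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]      # (x0, y0, x1, y1) in pixels, x1/y1 exclusive
Image = Sequence[Sequence[float]]    # row-major grayscale image (illustration only)


@dataclass
class HandDetection:
    """One localized hand (hypothetical container)."""
    box: Box


@dataclass
class HandMesh:
    """Reconstruction result paired with the detection it came from."""
    detection: HandDetection
    vertices: list  # stand-in for the reconstructed 3D hand


def crop(image: Image, box: Box) -> List[List[float]]:
    """Cut the box out of the image, clamping it to the image bounds."""
    h, w = len(image), len(image[0])
    x0, y0, x1, y1 = box
    x0, y0 = max(0, x0), max(0, y0)
    x1, y1 = min(w, x1), min(h, y1)
    return [list(row[x0:x1]) for row in image[y0:y1]]


def reconstruct_hands(
    image: Image,
    detect: Callable[[Image], List[HandDetection]],
    reconstruct: Callable[[List[List[float]]], list],
) -> List[HandMesh]:
    """Stage 1: localize all hands. Stage 2: reconstruct each crop independently."""
    meshes = []
    for det in detect(image):
        patch = crop(image, det.box)
        if not patch or not patch[0]:
            continue  # box fell entirely outside the image
        meshes.append(HandMesh(det, reconstruct(patch)))
    return meshes
```

Because each crop is processed independently, the same loop can be run frame by frame on a video without any temporal state.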

[Figure: overview of the WiLoR method]

Given an image \( \mathbf{I}_h \), represented as a sequence of feature tokens \( \mathbf{T}_{img} \), together with a set of learnable camera tokens \( \mathbf{T}_{cam} \), pose tokens \( \mathbf{T}_{pose} \), and shape tokens \( \mathbf{T}_{shape} \), a ViT backbone (light blue) first predicts a rough estimate of the MANO \cite{mano} parameters and the camera parameters \( \mathbf{K}_{cam} \). The updated image tokens are then reshaped and upsampled through a series of deconvolutional layers to form a set of multi-resolution feature maps \( \{ \mathbf{F}_0, \dots, \mathbf{F}_n \} \). Next, we project the estimated 3D hand onto these feature maps and sample image-aligned multi-scale features, which a novel refinement module (purple) uses to predict pose and shape residuals \( \Delta\theta, \Delta\beta \). This coarse-to-fine pose estimation strategy improves image alignment and yields better reconstruction performance.
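The refinement step described above (project the coarse 3D hand, sample features at the projected locations from every resolution, then apply predicted residuals) can be illustrated with a small standard-library sketch. Several details here are assumptions rather than facts from this page: the camera is treated as weak-perspective with a scale and 2D translation, features are gathered by bilinear interpolation, and the residuals are applied additively. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

FeatureMap = List[List[List[float]]]  # indexed [channel][row][col]
Point3D = Tuple[float, float, float]


def project_weak_perspective(
    points: Sequence[Point3D], scale: float, tx: float, ty: float
) -> List[Tuple[float, float]]:
    """Project 3D points to normalized image coordinates (assumed camera model)."""
    return [(scale * x + tx, scale * y + ty) for x, y, _z in points]


def bilinear_sample(fmap: FeatureMap, u: float, v: float) -> List[float]:
    """Sample every channel of fmap at normalized (u, v) in [-1, 1]."""
    h, w = len(fmap[0]), len(fmap[0][0])
    # Normalized coordinates -> pixel coordinates, clamped to the map.
    px = min(max((u + 1.0) * 0.5 * (w - 1), 0.0), w - 1.0)
    py = min(max((v + 1.0) * 0.5 * (h - 1), 0.0), h - 1.0)
    x0, y0 = int(math.floor(px)), int(math.floor(py))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = px - x0, py - y0
    out = []
    for ch in fmap:
        top = ch[y0][x0] * (1 - ax) + ch[y0][x1] * ax
        bottom = ch[y1][x0] * (1 - ax) + ch[y1][x1] * ax
        out.append(top * (1 - ay) + bottom * ay)
    return out


def sample_multiscale(
    points_2d: Sequence[Tuple[float, float]], feature_maps: Sequence[FeatureMap]
) -> List[List[float]]:
    """For each point, concatenate the features sampled from F_0 ... F_n."""
    return [
        [f for fmap in feature_maps for f in bilinear_sample(fmap, u, v)]
        for u, v in points_2d
    ]


def apply_residuals(theta, beta, d_theta, d_beta):
    """Refined estimate = coarse estimate + residual (additive update assumed)."""
    return (
        [t + d for t, d in zip(theta, d_theta)],
        [b + d for b, d in zip(beta, d_beta)],
    )
```

In a full model, the concatenated per-point features would be fed to a learned module that outputs \( \Delta\theta, \Delta\beta \); that network is omitted here because its architecture is not described on this page.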

Reconstruction Performance

WiLoR advances the state of the art in 3D hand pose estimation, outperforming previous methods on the FreiHAND and HO3D benchmarks.

BibTeX

@misc{potamias2024wilor,
      title={WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild},
      author={Rolandos Alexandros Potamias and Jinglei Zhang and Jiankang Deng and Stefanos Zafeiriou},
      year={2024},
      eprint={2409.12259},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
  }

Acknowledgements

S. Zafeiriou was supported by EPSRC Project DEFORM (EP/S010203/1) and GNOMON (EP/X011364). R.A. Potamias was supported by EPSRC Project GNOMON (EP/X011364).