EgoPressDiff: Multimodal Video Diffusion for Egocentric UV-Domain Hand-Pressure Estimation

Anonymous Submission

Abstract

Estimating hand-surface contact pressure from an egocentric view is crucial for AR/VR devices, robotic imitation, and ergonomic analysis. Existing methods often discretize the pressure signal and process frames independently, leading to quantization errors and temporal inconsistencies. We present EgoPressDiff, a conditional video diffusion framework that generates UV-pressure maps from visual input. The core of our approach is a multi-modal conditioning strategy that introduces a PoseNet and a Vertex Encoder to efficiently extract features from hand pose and 3D mesh vertices. These signals, along with depth information, guide the generative process so that the pressure fields are physically grounded. To effectively fuse these heterogeneous features, we further propose a Distribution-Calibrated Spatial Layer, which aligns their statistical properties before combination. Evaluated on the EgoPressure ego-view setting, EgoPressDiff achieves state-of-the-art results, improving Volumetric IoU by over 34% relative to the prior baseline while reducing MAE and maintaining high temporal accuracy.


Figure 1: EgoPressDiff generates dynamic hand pressure from a single egocentric RGB video. It outputs a sequence of UV-pressure maps that can be warped onto the 3D MANO model for intuitive 3D visualization of contact forces. The "Side View" is provided to better illustrate the hand’s contact with the touchpad.

A. Network Architecture

The training pipeline of our method is illustrated in Figure 2 (a). The network consists of several key components, including the PoseNet, the Vertex Encoder, and the Distribution-Calibrated (DC) Spatial Layer. These modules work together to extract, fuse, and align multi-modal features. First, PoseNet extracts hand pose features, which are added to the input latent to explicitly enhance the model's awareness of the hand's posture. Second, since depth information is critical for inferring hand contact with the environment, we use a VAE to encode the depth maps, projecting their features into the same feature space as the input latents. To further capture the hand's spatial structure, we propose a Vertex Encoder that uses the 3D vertex coordinates of the MANO hand model as a geometric prior, enabling the model to learn the correlation between 3D hand geometry and pressure. Furthermore, a CLIP image encoder processes the egocentric RGB frames. The resulting image embeddings serve a dual purpose: they provide global visual context and, within the DC Spatial Layer, calibrate the distribution of hand-vertex embeddings, ensuring distributional consistency across modalities. The following subsections present the architectural details of these key components.



Figure 2: Overview of EgoPressDiff. (a) The training pipeline of EgoPressDiff. The model processes five input streams through dedicated encoders: a PoseNet for hand pose, a VAE for depth and UV-pressure, a CLIP image encoder for RGB frames, and a Vertex Encoder for hand vertices. To align features from the vertex and image embeddings, we introduce a DC Spatial Layer, which replaces the original spatial layer in the U-Net. The pipeline is trained end-to-end using a reconstruction loss, with a UV mask employed to up-weight the hand UV pixels. (b) The architecture of Vertex Encoder. (c) The architecture of PoseNet. (d) Visualization of the reconstruction target, from top to bottom: The input egocentric RGB image with 3D MANO hand mesh; A 3D visualization of the ground-truth pressure on the hand surface; The pressure represented as a texture on the unwrapped UV layout; The final 2D UV-pressure map, which serves as the reconstruction target.


(1) Vertex Encoder.

To extract a compact and coherent representation from MANO hand vertices, we propose a Vertex Encoder. The input is a sequence of hand-mesh vertices with shape (B, N, 778, 3), where B, N, 778, and 3 correspond to batch size, number of frames, number of vertices, and (x,y,z) coordinates. The encoder outputs a feature sequence of shape (B, N, 1024). As shown in Figure 2 (b), the process has three stages:

  1. Spatial feature extraction: Each frame is normalized to remove global translation and scale. A shared MLP maps the 3D coordinates to a higher-dimensional space, followed by symmetric max-pooling to form a frame-level embedding.
  2. Temporal feature mixing: The embeddings are processed by depthwise-separable convolution blocks, capturing local temporal dynamics efficiently.
  3. Projection: A linear layer with LayerNorm projects the features to the final 1024-dimensional representation.

This hierarchical design enables efficient distillation of MANO vertex data into structured features suitable for denoising models.
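The three stages above can be traced as a shape-level sketch in NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the hidden widths (64, 256), the temporal kernel size (3), the ReLU activations, the random weights, the use of a single convolution block, and the choice to apply LayerNorm after the linear projection are all hypothetical, since the text specifies only the input shape (B, N, 778, 3) and the output shape (B, N, 1024).

```python
import numpy as np

def normalize_frames(verts):
    """Remove per-frame global translation and scale. verts: (B, N, V, 3)."""
    centered = verts - verts.mean(axis=2, keepdims=True)
    # Scale by the largest vertex distance from the centroid in each frame.
    scale = np.linalg.norm(centered, axis=-1).max(axis=2, keepdims=True)[..., None]
    return centered / (scale + 1e-8)

def shared_mlp_maxpool(verts, w1, w2):
    """Apply the same MLP to every vertex, then max-pool over vertices."""
    h = np.maximum(verts @ w1, 0.0)    # (B, N, V, hidden)
    h = h @ w2                         # (B, N, V, D)
    return h.max(axis=2)               # (B, N, D), order-invariant in V

def depthwise_separable_conv1d(x, dw, pw):
    """x: (B, N, D); dw: (k, D) depthwise kernel; pw: (D, D) pointwise weights."""
    k, n = dw.shape[0], x.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (0, 0)))   # zero-pad the time axis
    # Depthwise: each channel is filtered along time with its own kernel.
    out = sum(xp[:, i:i + n, :] * dw[i] for i in range(k))
    # Pointwise: a 1x1 convolution mixes channels at each time step.
    return np.maximum(out @ pw, 0.0)

def project(x, w, eps=1e-5):
    """Linear projection followed by LayerNorm (no learned affine terms)."""
    y = x @ w
    mu = y.mean(axis=-1, keepdims=True)
    var = y.var(axis=-1, keepdims=True)
    return (y - mu) / np.sqrt(var + eps)

def init_params(rng, hidden=64, dim=256, out_dim=1024, k=3):
    """Random weights with hypothetical sizes, for shape checking only."""
    return {
        "w1": rng.normal(size=(3, hidden)) / np.sqrt(3),
        "w2": rng.normal(size=(hidden, dim)) / np.sqrt(hidden),
        "dw": rng.normal(size=(k, dim)) / np.sqrt(k),
        "pw": rng.normal(size=(dim, dim)) / np.sqrt(dim),
        "proj": rng.normal(size=(dim, out_dim)) / np.sqrt(dim),
    }

def vertex_encoder(verts, params):
    x = normalize_frames(verts)                                   # stage 1
    x = shared_mlp_maxpool(x, params["w1"], params["w2"])         # stage 1
    x = depthwise_separable_conv1d(x, params["dw"], params["pw"]) # stage 2
    return project(x, params["proj"])                             # stage 3

rng = np.random.default_rng(0)
params = init_params(rng)
verts = rng.normal(size=(2, 8, 778, 3))   # B=2 clips, N=8 frames
feats = vertex_encoder(verts, params)
print(feats.shape)                         # (2, 8, 1024)
```

Because of the normalization in the first stage, shifting or uniformly rescaling all vertices of a clip leaves the output features essentially unchanged, which is the property the text attributes to removing global translation and scale.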

(2) PoseNet.

Many generative models integrate human pose features via ControlNet, but this adds significant computational cost. We propose a lightweight PoseNet for extracting hand pose map features. As shown in Figure 2 (c), PoseNet consists only of convolutional and SiLU layers. For stable training, the network uses Gaussian weight initialization, and its final projection layer is a zero-initialized convolution.
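The effect of the zero-initialized final convolution can be checked with a small NumPy sketch. It is an illustration under assumptions rather than the actual PoseNet: the channel widths (3, 16, 32), the 3x3 kernels, the stride of 1, the Gaussian standard deviation of 0.02, and the 4-channel latent are hypothetical, and a real network would also need to downsample the pose map to the latent resolution.

```python
import numpy as np

def conv2d(x, w, b):
    """Same-padded, stride-1 2D convolution. x: (C_in, H, W); w: (C_out, C_in, k, k)."""
    k = w.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, wd = x.shape[1:]
    out = np.zeros((w.shape[0], h, wd))
    for i in range(k):
        for j in range(k):
            # Contract input channels for this kernel offset.
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, i:i + h, j:j + wd])
    return out + b[:, None, None]

def silu(x):
    return x / (1.0 + np.exp(-x))

def init_posenet(rng, channels=(3, 16, 32), out_channels=4, k=3, std=0.02):
    """Gaussian-initialized hidden convolutions and a zero-initialized final projection."""
    hidden = [
        (rng.normal(0.0, std, size=(c_out, c_in, k, k)), np.zeros(c_out))
        for c_in, c_out in zip(channels[:-1], channels[1:])
    ]
    final = (np.zeros((out_channels, channels[-1], k, k)), np.zeros(out_channels))
    return hidden, final

def posenet(pose_map, hidden, final):
    x = pose_map
    for w, b in hidden:
        x = silu(conv2d(x, w, b))
    return conv2d(x, *final)

rng = np.random.default_rng(0)
hidden, final = init_posenet(rng)
pose_map = rng.random((3, 16, 16))        # hypothetical rendered pose map
latent = rng.normal(size=(4, 16, 16))     # hypothetical input latent
pose_feat = posenet(pose_map, hidden, final)
conditioned = latent + pose_feat          # pose features are added to the latent
```

At initialization `pose_feat` is exactly zero, so `conditioned` equals `latent` and the denoiser starts from its unmodified behavior; the pose branch only begins to contribute once training moves the final projection away from zero.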

(3) Distribution-Calibrated Spatial Layer.

To integrate multimodal control signals, we introduce the DC Spatial Layer. A standard diffusion U-Net spatial layer (Figure 3 (a)) conditions features on text; our design (Figure 3 (b)) instead uses dual branches for image and vertex embeddings. Because these embeddings lie in different feature spaces, we add a calibration block to align them before fusion. Let z be the latent input. After self-attention, z passes through two cross-attention blocks, yielding z^img and z^vtx. We compute their channel-wise means and standard deviations, (μ_img, σ_img) and (μ_vtx, σ_vtx), and align the two latents by requiring their normalized forms to match: \[ \frac{z^{img} - \mu_{img}}{\sigma_{img}} = \frac{z^{vtx} - \mu_{vtx}}{\sigma_{vtx}} \] Solving for the image-side term gives the calibrated vertex latent: \[ \bar{z}^{vtx} = \frac{z^{vtx} - \mu_{vtx}}{\sigma_{vtx}} \times \sigma_{img} + \mu_{img} \] Finally, the calibrated vertex latent is fused with z^img via element-wise addition and passed to the next temporal layer.
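The calibration step can be written as a minimal NumPy sketch. The tensor layout (batch, tokens, channels), the reduction over the token axis for the "channel-wise" statistics, and the small `eps` added for numerical stability are assumptions made for illustration; the text does not specify them.

```python
import numpy as np

def channel_stats(z, eps=1e-5):
    """Per-channel mean and standard deviation over the token axis. z: (B, T, C)."""
    mu = z.mean(axis=1, keepdims=True)
    sigma = np.sqrt(z.var(axis=1, keepdims=True) + eps)
    return mu, sigma

def calibrate_and_fuse(z_img, z_vtx):
    """Re-express z_vtx in the statistics of z_img, then fuse by addition."""
    mu_img, sigma_img = channel_stats(z_img)
    mu_vtx, sigma_vtx = channel_stats(z_vtx)
    z_vtx_bar = (z_vtx - mu_vtx) / sigma_vtx * sigma_img + mu_img
    return z_vtx_bar, z_img + z_vtx_bar

rng = np.random.default_rng(0)
z_img = rng.normal(2.0, 3.0, size=(2, 64, 8))    # stand-in for the image branch output
z_vtx = rng.normal(-1.0, 0.5, size=(2, 64, 8))   # stand-in for the vertex branch output
z_bar, fused = calibrate_and_fuse(z_img, z_vtx)
```

After calibration, each channel of `z_bar` has the same mean and (up to `eps`) the same standard deviation as the corresponding channel of `z_img`, so the element-wise sum combines two terms of comparable scale.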



Figure 3: (a) The original U-Net block. (b) Our proposed Distribution-Calibrated (DC) Spatial Layer integrated into the U-Net block. Here, μ and σ denote the mean and standard deviation, respectively.


B. Training Loss

As illustrated in Figure 2 (a), the model is trained end-to-end with a reconstruction loss. The loss is optimized over the trainable parameters of the U-Net, Vertex Encoder, and PoseNet. To enhance the physical plausibility of the output, we introduce a UV mask as a spatial prior. This mask compels the model to prioritize reconstruction accuracy within the valid hand region defined by the UV map. The training objective is formulated as a weighted mean squared error: \[ \mathcal{L} = \mathbb{E}_{\varepsilon}\left(\left\|\left(z_{gt}-z_{\varepsilon}\right)\odot\left(1+\mathbf{M}_{uv}\right)\right\|^2\right), \] where z_gt is the ground-truth latent, z_ε is the denoised latent, and M_uv is the binary UV mask. Because M_uv is binary, residuals inside the hand region are scaled by 2 and all other residuals by 1.
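A small worked example makes the weighting concrete. The sketch below assumes the mask has already been resized to the latent resolution and that the squared norm is averaged over elements (in line with the "mean squared error" wording); both are assumptions, and the function name is hypothetical.

```python
import numpy as np

def masked_latent_mse(z_gt, z_pred, m_uv):
    """Mean of squared residuals after scaling by (1 + M_uv)."""
    weighted = (z_gt - z_pred) * (1.0 + m_uv)
    return float(np.mean(weighted ** 2))

z_gt = np.ones((2, 2))
z_pred = np.zeros((2, 2))            # residual of 1 at every position
m_uv = np.array([[1.0, 1.0],
                 [0.0, 0.0]])        # top row lies inside the hand UV region

loss = masked_latent_mse(z_gt, z_pred, m_uv)
plain = masked_latent_mse(z_gt, z_pred, np.zeros((2, 2)))
print(loss, plain)                   # 2.5 1.0
```

The two masked positions each contribute (1 x 2)^2 = 4 and the two unmasked positions contribute 1, giving a mean of 10 / 4 = 2.5. Since the weight sits inside the squared norm, an error inside the hand region costs four times as much as the same error outside it.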

C. Qualitative Comparison

Figure 4 presents a qualitative comparison across different methods. The first row illustrates a scenario where the middle finger makes contact with the surface while the index finger does not, as confirmed by the side view. Baseline methods erroneously predict pressure on the non-contacting index finger. In contrast, our method, which incorporates depth, hand geometry and temporal information, yields the correct prediction. As highlighted in the red box in the second row, our diffusion-based model generates a UV-pressure map with smoother pressure transitions, closely aligning with the ground truth. This is because baseline methods typically quantize pressure into a fixed number of classes (e.g., nine) and treat the problem as a pixel-wise classification task, resulting in coarse pressure gradients. Overall, our method demonstrates superior performance in terms of both the accuracy of the contact regions and the fidelity of the pressure distribution.
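The coarseness introduced by class-based prediction can be illustrated numerically. The sketch below assumes pressure normalized to [0, 1] and nine uniform bins represented by their centers; the binning scheme actually used by the baselines is not specified here, so this is a hypothetical setup.

```python
import numpy as np

def quantize_uniform(p, num_classes=9, p_max=1.0):
    """Map continuous pressure to the center of one of num_classes uniform bins."""
    width = p_max / num_classes
    idx = np.clip(np.floor(p / width), 0, num_classes - 1)
    return (idx + 0.5) * width

ramp = np.linspace(0.0, 1.0, 1001)   # a smooth pressure ramp
quantized = quantize_uniform(ramp)

num_levels = np.unique(quantized).size
max_err = np.abs(ramp - quantized).max()
```

Under these assumptions the smooth ramp collapses to 9 distinct levels, and the worst-case error is half a bin width, 1/18 of the pressure range. A regression-style output such as a generated UV-pressure map is not restricted to such a staircase.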



Figure 4: Qualitative comparison on the EgoPressure dataset. The "Ego View" serves as the input, while the "Side View" is provided to better illustrate the hand's contact with the touchpad. The first row shows a case where only the middle finger is in contact and only our method predicts this correctly. In the second row, magnified results of the palm (red box) demonstrate that our method produces smoother pressure transitions, more closely aligning with the ground truth (GT).


D. Video Results

We present the raw UV-pressure maps generated by EgoPressDiff, along with their corresponding visualizations projected onto the 3D MANO hand model.

Case 1:

Example results of our method for generating pressure maps on different regions of the hand. The generated pressure maps maintain good temporal continuity and accurately capture the contact areas and pressure distribution between the hand and the touchpad.

Case 2:

Example results under the same gesture with varying contact regions and pressure levels. As the applied pressure changes, the generated pressure maps reflect corresponding variations in color and distribution, indicating different pressure values and contact patterns.

Case 3:

Example results of long video generation with our method. During a complete gesture, the contact areas and pressure distribution between the hand and the touchpad change over time. The generated pressure maps maintain good temporal continuity and accurately capture these dynamic variations.

Acknowledgements

The website template was borrowed from Michaël Gharbi and Mip-NeRF.