TAPIP3D

Tracking Any Point in Persistent 3D Geometry

1 Carnegie Mellon University
2 Peking University
3 Stanford University
* Equal Contribution
Input: 2D Monocular Video
Output: 3D Point Tracking in 3D Reconstructed World Space

TL;DR: Our method TAPIP3D enables robust feed-forward 3D point tracking for monocular and RGB-D videos by representing them as camera-stabilized spatio-temporal 3D feature clouds. We propose Local Pair Attention to iteratively refine multi-frame 3D motion estimates.

Abstract

We introduce TAPIP3D, a novel approach for long-term 3D point tracking in monocular RGB and RGB-D videos. TAPIP3D represents videos as camera-stabilized spatio-temporal feature clouds, leveraging depth and camera motion information to lift 2D video features into a 3D world space where camera movement is effectively canceled out. Within this stabilized 3D representation, TAPIP3D iteratively refines multi-frame motion estimates, enabling robust point tracking over long time horizons. To handle the irregular structure of 3D point distributions, we propose a 3D Neighborhood-to-Neighborhood (N2N) attention mechanism - a 3D-aware contextualization strategy that builds informative, spatially coherent feature neighborhoods to support precise trajectory estimation. Our 3D-centric formulation significantly improves performance over existing 3D point tracking methods and even surpasses state-of-the-art 2D pixel trackers in accuracy when reliable depth is available. The model supports inference in both camera-centric (unstabilized) and world-centric (stabilized) coordinates, with experiments showing that compensating for camera motion leads to substantial gains in tracking robustness. By replacing the conventional 2D square correlation windows used in prior 2D and 3D trackers with a spatially grounded 3D attention mechanism, TAPIP3D achieves strong and consistent results across multiple 3D point tracking benchmarks.
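To make the representation concrete, below is a minimal numpy sketch of the two ideas described above: unprojecting per-pixel features into a camera-stabilized world frame using depth and camera poses, and gathering a spatially grounded k-nearest-neighbor feature neighborhood around a query point. Shapes, variable names, and the KD-tree lookup are illustrative assumptions, not the released implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def lift_to_world(depth, feats, K, T_wc):
    """Unproject a feature map into a camera-stabilized world-space feature cloud.

    depth : (H, W)    per-pixel depth in meters
    feats : (H, W, C) per-pixel features from a 2D backbone
    K     : (3, 3)    camera intrinsics
    T_wc  : (4, 4)    camera-to-world pose for this frame
    Returns (H*W, 3) world-space points and (H*W, C) features.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Back-project pixels to camera coordinates: X_cam = depth * K^-1 [u, v, 1]^T
    rays = pix @ np.linalg.inv(K).T
    pts_cam = rays * depth.reshape(-1, 1)

    # Apply the camera-to-world transform so camera motion is cancelled out
    # across frames: a static scene point lands at (nearly) the same coordinates
    # in every frame of the stabilized cloud.
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    pts_world = (pts_h @ T_wc.T)[:, :3]
    return pts_world, feats.reshape(-1, feats.shape[-1])

def gather_neighborhood(query_xyz, pts_world, feats, k=16):
    """Collect a k-NN feature neighborhood around a query point, standing in
    for the fixed 2D square correlation window used by prior trackers."""
    tree = cKDTree(pts_world)
    _, idx = tree.query(query_xyz, k=k)
    return pts_world[idx] - query_xyz, feats[idx]  # relative offsets + neighbor features
```

A track query then attends over such neighborhoods across frames; because the cloud is expressed in world coordinates, a static point's neighborhood stays put even under large camera motion.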

Interactive Demos

With Sensor Depth

Results on DexYCB dataset using accurate depth information


With Monocular Videos

For monocular videos, we adapt MegaSaM, a powerful 3D reconstruction method, to use the monocular depth prior from MoGe, obtaining consistent depth maps and camera parameters.
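As background for what "consistent depth maps" involves, one common generic way to reconcile an up-to-scale monocular depth prior with a reconstruction's depth is a per-frame least-squares scale-and-shift fit. The sketch below shows that standard alignment step purely as an illustration; it is not MegaSaM's actual optimization, and the array names are assumptions.

```python
import numpy as np

def align_scale_shift(prior_depth, recon_depth, valid):
    """Least-squares fit of (s, t) minimizing ||s * prior + t - recon||^2 over
    valid pixels, bringing an up-to-scale monocular depth prior into the
    reconstruction's depth range."""
    x = prior_depth[valid].astype(np.float64)
    y = recon_depth[valid].astype(np.float64)
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * prior_depth + t, s, t

# e.g. aligned, s, t = align_scale_shift(moge_depth, sfm_depth, sfm_depth > 0)
```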


DexYCB‑Pt Results

3D point-tracking comparison on the DexYCB-Pt benchmark using RGB-D videos with sensor depth (SD).

Method             AJ3D ↑   APD3D ↑   OA ↑
CoTracker3 + SD      14.9      26.1   70.9
SpatialTracker        5.5      11.4   66.8
DELTA                26.4      43.3   72.8
TAPIP3D (Ours)       30.3      52.4   71.3

TAPVid‑3D Results

Long-term 3D point-tracking comparison on the full split of the TAPVid-3D benchmark with monocular videos.
For TAPIP3D-world, we use the camera poses estimated by MegaSaM.
The ADT dataset has significant camera motion, while PStudio uses a static camera (see the coordinate-conversion sketch after the table).

Method                            ADT               DriveTrack        PStudio
                                  AJ3D ↑  APD3D ↑   AJ3D ↑  APD3D ↑   AJ3D ↑  APD3D ↑
CoTracker3 + MegaSaM              20.4    30.1      14.1    20.3      17.4    27.2
SpatialTracker + MegaSaM          15.9    23.8       7.7    13.5      15.3    25.2
DELTA + MegaSaM                   21.0    29.3      14.6    22.2      17.7    27.3
TAPIP3D-camera + MegaSaM (Ours)   21.6    31.0      14.6    21.3      18.1    27.7
TAPIP3D-world + MegaSaM (Ours)    23.5    32.8      14.9    21.8      18.1    27.7
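To spell out the difference between the camera and world variants compared above, the sketch below converts per-frame camera-coordinate track points into a shared world frame using estimated camera-to-world poses (e.g., from MegaSaM), and back again. Shapes and variable names are assumptions for illustration.

```python
import numpy as np

def tracks_cam_to_world(tracks_cam, T_wc):
    """tracks_cam : (T, N, 3) per-frame 3D track points in camera coordinates
    T_wc         : (T, 4, 4) per-frame camera-to-world poses
    Returns (T, N, 3) tracks in a shared world frame, where camera motion no
    longer shows up as apparent point motion."""
    ones = np.ones((*tracks_cam.shape[:2], 1))
    pts_h = np.concatenate([tracks_cam, ones], axis=-1)   # (T, N, 4)
    return np.einsum('tij,tnj->tni', T_wc, pts_h)[..., :3]

def tracks_world_to_cam(tracks_world, T_wc):
    """Inverse mapping, e.g. for evaluating in camera-centric coordinates."""
    T_cw = np.linalg.inv(T_wc)                             # (T, 4, 4)
    ones = np.ones((*tracks_world.shape[:2], 1))
    pts_h = np.concatenate([tracks_world, ones], axis=-1)
    return np.einsum('tij,tnj->tni', T_cw, pts_h)[..., :3]
```

With a static camera, as in PStudio, the two frames differ only by a constant transform, consistent with the identical camera and world scores there; under significant camera motion, as in ADT, the world frame removes apparent motion induced by the camera, which is where the stabilized variant gains the most.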

Video Results Comparison

Citation

@article{tapip3d,
  title={TAPIP3D: Tracking Any Point in Persistent 3D Geometry},
  author={Zhang, Bowei and Ke, Lei and Harley, Adam W and Fragkiadaki, Katerina},
  journal={arXiv preprint arXiv:2504.14717},
  year={2025}
}