¹CUHK MMLab, ²Centre for Perceptual and Interactive Intelligence, ³Zhejiang University
* denotes equal contributions
We tackle the problem of Persistent Independent Particles (PIPs), also called Tracking Any Point (TAP), which aims to estimate persistent long-term trajectories of query points in videos. Previous methods estimate these trajectories independently so that they can process longer image sequences, and therefore ignore the potential benefits of spatial context features. We argue that independent video point tracking also demands spatial context features. To this end, we propose a novel framework, Context-PIPs, which effectively improves point trajectory accuracy by aggregating spatial context features in videos. Context-PIPs contains two main modules: 1) a SOurce Feature Enhancement (SOFE) module, and 2) a TArget Feature Aggregation (TAFA) module. Context-PIPs improves on PIPs across the board, reducing the Average Trajectory Error of Occluded Points (ATE-Occ) on CroHD by 11.4% and increasing the Average Percentage of Correct Keypoints (A-PCK) on TAP-Vid-Kinetics by 11.8%.
Results on FlyingThings++ and CroHD. Context-TAP ranks 1st on all metrics and clearly outperforms previous methods. Specifically, Context-TAP achieves 7.06 ATE-Occ and 4.28 ATE-Vis on the CroHD dataset, error reductions of 11.4% and 9.5% relative to PIPs, the runner-up. On the FlyingThings++ dataset, Context-TAP decreases ATE-Vis and ATE-Occ by 0.96 and 2.18, respectively.
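As a rough illustration of how ATE-Vis and ATE-Occ style numbers can be computed, the sketch below assumes that the trajectory error is the mean Euclidean distance between predicted and ground-truth point positions, reported separately for visible and occluded points. The function names `average_trajectory_error` and `relative_reduction` are hypothetical and the data is a toy example, not taken from the benchmarks.

```python
import math


def average_trajectory_error(pred, gt, visible):
    """Mean L2 distance between predicted and ground-truth points,
    split by ground-truth visibility.

    pred, gt: lists of (x, y) tuples, one entry per point per frame.
    visible:  list of bools, True where the ground-truth point is visible.
    Returns (ate_vis, ate_occ); a group with no points yields NaN.
    """
    vis_errs, occ_errs = [], []
    for (px, py), (gx, gy), is_visible in zip(pred, gt, visible):
        err = math.hypot(px - gx, py - gy)
        (vis_errs if is_visible else occ_errs).append(err)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(vis_errs), mean(occ_errs)


def relative_reduction(baseline_error, new_error):
    """Fraction by which new_error is lower than baseline_error."""
    return (baseline_error - new_error) / baseline_error


# Toy data (assumed for illustration only).
pred = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
gt = [(0.0, 0.0), (0.0, 0.0), (1.0, 2.0)]
visible = [True, False, True]

ate_vis, ate_occ = average_trajectory_error(pred, gt, visible)
reduction = relative_reduction(8.0, 7.0)
```

Under this reading, a percentage such as "11.4% error reduction" is the value of `relative_reduction` with the PIPs error as the baseline.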
TAP-Vid-DAVIS (first) and TAP-Vid-Kinetics (first). A-PCK, the average percentage of correct keypoints, is the core metric. Context-TAP ranks 1st in terms of A-PCK on both benchmarks. Specifically, Context-TAP outperforms TAP-Net by 24.1% on the TAP-Vid-DAVIS benchmark and improves on PIPs by 11.8% on the TAP-Vid-Kinetics benchmark.
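A minimal sketch of an A-PCK style computation follows. It assumes that a point counts as correct when its prediction lies within a pixel threshold of the ground truth, and that the score is averaged over the thresholds 1, 2, 4, 8 and 16 pixels. The thresholds, the function name `average_pck`, and the toy data are assumptions for illustration. A real evaluation would typically restrict the computation to visible points, which is omitted here.

```python
import math


def average_pck(pred, gt, thresholds=(1, 2, 4, 8, 16)):
    """Fraction of points within each pixel threshold, averaged over thresholds.

    pred, gt: lists of (x, y) tuples for the evaluated points.
    """
    dists = [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]
    per_threshold = [
        sum(1 for d in dists if d < t) / len(dists) for t in thresholds
    ]
    return sum(per_threshold) / len(per_threshold)


# Toy data (assumed): distances to ground truth are 0, 5 and 1 pixels.
pred = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
gt = [(0.0, 0.0), (0.0, 0.0), (1.0, 2.0)]

a_pck = average_pck(pred, gt)
```

Averaging over several thresholds rewards both coarse and fine localization, so a tracker cannot score well by being only roughly right.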