Context-PIPs: Persistent Independent Particles Demands Spatial Context Features


Weikang Bian1,2*, Zhaoyang Huang1*, Xiaoyu Shi1, Yitong Dong3, Yijin Li3, Hongsheng Li1,2

1CUHK MMLab 2Centre for Perceptual and Interactive Intelligence 3Zhejiang University

* denotes equal contributions


Abstract


We tackle the problem of Persistent Independent Particles (PIPs), also called Tracking Any Point (TAP), in videos, which specifically aims at estimating persistent long-term trajectories of query points in videos. Previous methods attempted to estimate these trajectories independently to incorporate longer image sequences, therefore ignoring the potential benefits of incorporating spatial context features. We argue that independent video point tracking also demands spatial context features. To this end, we propose a novel framework, Context-PIPs, which effectively improves point trajectory accuracy by aggregating spatial context features in videos. Context-PIPs contains two main modules: 1) a SOurce Feature Enhancement (SOFE) module, and 2) a TArget Feature Aggregation (TAFA) module. Context-PIPs significantly improves PIPs across the board, reducing the Average Trajectory Error of Occluded Points (ATE-Occ) on CroHD by 11.4% and increasing the Average Percentage of Correct Keypoints (A-PCK) on TAP-Vid-Kinetics by 11.8%.

Qualitative Comparison




Quantitative Comparison



FlyingThings++ and CroHD. Context-PIPs ranks 1st on all metrics and shows significant gains over previous methods. Specifically, Context-PIPs achieves 7.06 ATE-Occ and 4.28 ATE-Vis on the CroHD dataset, 11.4% and 9.5% error reductions from PIPs, the runner-up. On the FlyingThings++ dataset, our Context-PIPs decreases the ATE-Vis and ATE-Occ by 0.96 and 2.18, respectively.
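For reference, ATE-Vis and ATE-Occ measure the mean Euclidean distance between predicted and ground-truth point positions, restricted to visible and occluded frames respectively. A minimal sketch of this metric (the function name and toy data are illustrative, not from the paper's codebase):

```python
import numpy as np

def average_trajectory_error(pred, gt, mask):
    """Mean L2 distance between predicted and ground-truth point
    positions, averaged over the (frame, point) pairs selected by
    `mask` (visible-only for ATE-Vis, occluded-only for ATE-Occ).

    pred, gt: (T, N, 2) arrays of (x, y) positions over T frames.
    mask:     (T, N) boolean array selecting which samples count.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)  # (T, N) per-sample error
    return dist[mask].mean()

# Toy example: one point over 3 frames with errors 5, 0, and 10 pixels.
gt = np.zeros((3, 1, 2))
pred = np.array([[[3.0, 4.0]], [[0.0, 0.0]], [[6.0, 8.0]]])
vis = np.array([[True], [True], [True]])
print(average_trajectory_error(pred, gt, vis))  # 5.0
```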

TAP-Vid-DAVIS (first) and TAP-Vid-Kinetics (first). A-PCK, the Average Percentage of Correct Keypoints, is the core metric. Context-PIPs ranks 1st in terms of A-PCK on both benchmarks. Specifically, Context-PIPs outperforms TAP-Net by 24.1% on the TAP-Vid-DAVIS benchmark and improves PIPs by 11.8% on the TAP-Vid-Kinetics benchmark.
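A-PCK counts a predicted point as correct when it falls within a pixel threshold of the ground truth, then averages this percentage over several thresholds. A minimal sketch, assuming the common TAP-Vid threshold set {1, 2, 4, 8, 16} pixels (the function name and toy data are illustrative):

```python
import numpy as np

def a_pck(pred, gt, thresholds=(1, 2, 4, 8, 16)):
    """Average Percentage of Correct Keypoints.

    For each pixel threshold, compute the fraction of predicted
    points within that distance of the ground truth, then average
    the fractions over all thresholds and report a percentage.

    pred, gt: (N, 2) arrays of (x, y) positions.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)          # (N,) errors
    pcks = [(dist <= t).mean() for t in thresholds]    # PCK per threshold
    return 100.0 * float(np.mean(pcks))

# Toy example: four points with errors 0.5, 3, 10, and 100 pixels.
gt = np.zeros((4, 2))
pred = np.array([[0.5, 0.0], [3.0, 0.0], [10.0, 0.0], [100.0, 0.0]])
print(a_pck(pred, gt))  # 45.0
```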
