Accurate tissue point tracking in endoscopic videos is crucial for robotic-assisted surgical navigation and scene understanding, yet remains challenging due to complex tissue deformations, instrument occlusions, and the scarcity of dense trajectory annotations. Existing methods struggle with robustness in long-term tracking under these conditions due to the insufficient utilization of multimodal features and their heavy dependence on sparse annotations. We present Endo-TTAP, addressing these challenges through a two-stage hybrid supervision approach. Our method introduces: (1) A Multi-Facet Guided Attention (MFGA) module that fuses multi-scale optical flow features, semantic embeddings, and motion patterns via guided attention to jointly predict point positions, occlusion states, and tracking uncertainty; (2) An Auxiliary Curriculum Adapter (ACA) enabling progressive domain adaptation from synthetic to surgical data through exponential coefficient scheduling; (3) A Pseudo Label Generator (PLG) that creates high-quality dense annotations from sparse surgical data. Our two-stage training strategy first initializes components using synthetic datasets with optical flow ground truth, then transitions to real surgical data through unsupervised flow consistency and semi-supervised pseudo-label learning. We further contribute the Endo-TTAPC5 dataset, comprising 250 video segments across five clinically meaningful challenges: tissue deformation, instrument occlusion, camera jitter, surface reflection, and cauterization smoke. Extensive validation on two public datasets (SurgT, STIR) and our newly curated Endo-TTAPC5 dataset demonstrates that Endo-TTAP achieves state-of-the-art performance in tissue point tracking, particularly in scenarios characterized by complex endoscopic conditions.
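To illustrate the exponential coefficient scheduling mentioned for the ACA, the following minimal Python sketch ramps a blending coefficient from 0 to 1 over training. The functional form, the `rate` parameter, and the name `exponential_coefficient` are assumptions made for illustration; the text above does not specify the actual schedule.

```python
import math


def exponential_coefficient(step: int, total_steps: int, rate: float = 5.0) -> float:
    """Hypothetical adaptation coefficient that rises from 0 to 1.

    Assumed form: (1 - exp(-rate * p)) / (1 - exp(-rate)), with p the
    training progress in [0, 1]. This is an illustrative choice only.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if rate <= 0:
        raise ValueError("rate must be positive")
    # Clamp progress so steps outside the schedule stay within [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)
    # Normalise so the coefficient is exactly 0 at the start and 1 at the end.
    return (1.0 - math.exp(-rate * progress)) / (1.0 - math.exp(-rate))


if __name__ == "__main__":
    for s in (0, 25, 50, 75, 100):
        print(s, round(exponential_coefficient(s, 100), 3))
```

Under this assumed form the coefficient grows quickly early in training and saturates later; a larger `rate` makes the transition sharper.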
Overview of our Endo-TTAP framework for endoscopic tissue tracking. (a) Two-stage Hybrid Dataset: Stage I employs synthetic data with optical flow ground truth for supervised initialization; Stage II transitions to real surgical videos with sparse annotations, enhanced by PLG-generated pseudo-labels to enable semi-supervised learning. (b) Main Framework: Built on a frozen SEA-RAFT backbone, with the Auxiliary Curriculum Adapter (ACA) for progressive domain adaptation and Multi-Facet Guided Attention (MFGA) for heterogeneous feature fusion. DINOv2 provides semantic embeddings that complement motion information. Training uses stage-specific losses (L1, L2), with the Point Head active only in Stage II. (c) Multi-Facet Guided Attention (MFGA): Fuses multi-scale flow features and semantic embeddings via guided attention, where hybrid features serve as queries to aggregate intermediate representations for uncertainty and occlusion prediction. (d) Pseudo Label Generator (PLG): Four-step process for creating dense annotations from sparse surgical data: 1. SAM2 tissue segmentation with sparse point prompts; 2. XFeat anchor point matching with confidence filtering; 3. Dual-tracker (MFT + CoTracker3) trajectory propagation; 4. Distance-based Trajectory Filtering (DTF), which removes unreliable trajectories by comparing endpoints, ensuring high-quality pseudo-supervision.
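The Distance-based Trajectory Filtering step in (d) can be sketched as follows. This is a minimal illustration under one plausible reading of "comparing endpoints": a candidate point is kept only if the final positions predicted by the two trackers agree within a pixel threshold. The function name `filter_by_endpoint_distance`, the `max_dist` threshold, and the list-of-tuples trajectory format are all hypothetical and not taken from the paper.

```python
import math


def filter_by_endpoint_distance(tracks_a, tracks_b, max_dist=4.0):
    """Return indices of points whose two trajectories end close together.

    tracks_a, tracks_b: lists of trajectories for the same points from two
    trackers (for example MFT and CoTracker3); each trajectory is a
    non-empty list of (x, y) positions. The threshold is an assumed value.
    """
    if len(tracks_a) != len(tracks_b):
        raise ValueError("both trackers must cover the same points")
    kept = []
    for idx, (traj_a, traj_b) in enumerate(zip(tracks_a, tracks_b)):
        (xa, ya), (xb, yb) = traj_a[-1], traj_b[-1]
        # Euclidean distance between the two predicted endpoints.
        if math.hypot(xa - xb, ya - yb) <= max_dist:
            kept.append(idx)
    return kept


if __name__ == "__main__":
    tracker_a = [[(0, 0), (10, 10)], [(5, 5), (20, 20)]]
    tracker_b = [[(0, 0), (11, 10)], [(5, 5), (40, 20)]]
    print(filter_by_endpoint_distance(tracker_a, tracker_b))  # [0]
```

In this toy input the first point's endpoints differ by 1 pixel and are kept, while the second point's endpoints differ by 20 pixels and are discarded as unreliable.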
The left side shows predictions from MFT and the right side shows predictions from Endo-TTAP, evaluated across five challenging scenarios.
Endo-TTAP substantially improves tracking accuracy, maintaining consistent performance despite frequent occlusions. This robustness is mainly due to uncertainty-aware prediction and semantic guidance, which help preserve point identities during temporary obstructions.
Endo-TTAP demonstrates improved stability and robustness in tracking tissue boundary points over extended sequences, especially under non-rigid motion. Even when the tracked points are partially occluded by surgical instruments, Endo-TTAP improves tracking accuracy and consistency. This improvement is primarily attributed to uncertainty-aware prediction and semantic guidance, which help maintain point identities during temporary occlusions.
Endo-TTAP demonstrates a clear improvement over MFT, despite the overall error remaining elevated due to sudden endoscope movements. These abrupt shifts introduce jitter that leads to temporary misalignments, making long-term tracking more challenging.
Endo-TTAP performs robustly under reflection conditions, showing improved tracking accuracy and demonstrating resilience to illumination artifacts. This robustness is largely attributed to the semantic embeddings from DINOv2, which provide texture-independent visual representations that help maintain tracking performance.
Endo-TTAP similarly achieves superior accuracy, indicating its effectiveness in low-visibility regions. The method’s ability to handle these challenging scenarios again benefits from the semantic guidance, ensuring reliable tracking despite visual degradation.
Full trajectory annotation results for a single point on the SurgT and Endo-TTAPC5 datasets.