Endo-TTAP: Robust Endoscopic Tissue Tracking via Multi-Facet Guided Attention and Hybrid Flow-point Supervision

Abstract

Accurate tissue point tracking in endoscopic videos is crucial for robotic-assisted surgical navigation and scene understanding, yet remains challenging due to complex tissue deformations, instrument occlusions, and the scarcity of dense trajectory annotations. Existing methods struggle with robustness in long-term tracking under these conditions due to the insufficient utilization of multimodal features and their heavy dependence on sparse annotations. We present Endo-TTAP, addressing these challenges through a two-stage hybrid supervision approach. Our method introduces: (1) A Multi-Facet Guided Attention (MFGA) module that fuses multi-scale optical flow features, semantic embeddings, and motion patterns via guided attention to jointly predict point positions, occlusion states, and tracking uncertainty; (2) An Auxiliary Curriculum Adapter (ACA) enabling progressive domain adaptation from synthetic to surgical data through exponential coefficient scheduling; (3) A Pseudo Label Generator (PLG) that creates high-quality dense annotations from sparse surgical data. Our two-stage training strategy first initializes components using synthetic datasets with optical flow ground truth, then transitions to real surgical data through unsupervised flow consistency and semi-supervised pseudo-label learning. We further contribute the Endo-TTAPC5 dataset, comprising 250 video segments across five clinically meaningful challenges: tissue deformation, instrument occlusion, camera jitter, surface reflection, and cauterization smoke. Extensive validation on two public datasets (SurgT, STIR) and our newly curated Endo-TTAPC5 dataset demonstrates that Endo-TTAP achieves state-of-the-art performance in tissue point tracking, particularly in scenarios characterized by complex endoscopic conditions.
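The abstract does not give the functional form of the ACA's exponential coefficient schedule. As a minimal sketch only, assuming a saturating exponential ramp that moves a coefficient from 0 to 1 over training, the idea could look like the following. The function name `exponential_coefficient`, the `rate` parameter, and the normalization are all hypothetical illustrations, not the authors' implementation.

```python
import math


def exponential_coefficient(step, total_steps, rate=5.0):
    """Hypothetical curriculum coefficient rising from 0 to 1.

    Assumed form: (1 - exp(-rate * p)) / (1 - exp(-rate)), where p is the
    training progress in [0, 1]. The normalization makes the value exactly
    1.0 at the end of the schedule.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if rate <= 0:
        raise ValueError("rate must be positive")
    # Clamp progress so steps outside [0, total_steps] stay in range.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return (1.0 - math.exp(-rate * progress)) / (1.0 - math.exp(-rate))


if __name__ == "__main__":
    # Print the coefficient at a few points of an assumed 1000-step schedule.
    for step in (0, 250, 500, 750, 1000):
        print(step, round(exponential_coefficient(step, 1000), 4))
```

Under this assumed form the coefficient rises quickly early in training and then flattens, so the surgical-domain contribution would be phased in progressively rather than switched on at once.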

Method

Overview of our Endo-TTAP framework for endoscopic tissue tracking.

(a) Two-stage Hybrid Dataset: Stage I employs synthetic data with optical flow ground truth for supervised initialization; Stage II transitions to real surgical videos with sparse annotations, augmented by PLG-generated pseudo-labels to enable semi-supervised learning.

(b) Main Framework: Built on a frozen SEA-RAFT backbone, with an Auxiliary Curriculum Adapter (ACA) for progressive domain adaptation and Multi-Facet Guided Attention (MFGA) for heterogeneous feature fusion. DINOv2 provides semantic embeddings that complement motion information. Training uses stage-specific losses (L1, L2), with the Point Head active only in Stage II.

(c) Multi-Facet Guided Attention (MFGA): Fuses multi-scale flow features and semantic embeddings via guided attention, where hybrid features serve as queries to aggregate intermediate representations for uncertainty and occlusion prediction.

(d) Pseudo Label Generator (PLG): A four-step process for creating dense annotations from sparse surgical data: 1. SAM2 tissue segmentation with sparse point prompts; 2. XFeat anchor point matching with confidence filtering; 3. Dual-tracker (MFT + CoTracker3) trajectory propagation; 4. Distance-based Trajectory Filtering (DTF), which removes unreliable trajectories by comparing endpoints, ensuring high-quality pseudo-supervision.
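The endpoint comparison in the DTF step can be illustrated with a short sketch. The caption does not say which endpoints are compared or what threshold is used, so this example assumes the final-frame positions of the two trackers are compared point by point and that a trajectory is kept when they agree within a pixel threshold. The function name `endpoint_agreement_mask` and the default `max_dist` are hypothetical.

```python
import numpy as np


def endpoint_agreement_mask(traj_a, traj_b, max_dist=4.0):
    """Flag trajectories whose endpoints agree across two trackers.

    traj_a, traj_b: arrays of shape (num_points, num_frames, 2) holding
    (x, y) pixel positions from two independent trackers (assumed here to
    play the roles of MFT and CoTracker3).
    Returns (keep, end_dist): a boolean mask over points and the Euclidean
    distance between the two final-frame positions of each point.
    """
    traj_a = np.asarray(traj_a, dtype=float)
    traj_b = np.asarray(traj_b, dtype=float)
    if traj_a.shape != traj_b.shape:
        raise ValueError("both trackers must output the same shape")
    if traj_a.ndim != 3 or traj_a.shape[-1] != 2:
        raise ValueError("expected shape (num_points, num_frames, 2)")
    # Distance between the last-frame positions predicted by each tracker.
    end_dist = np.linalg.norm(traj_a[:, -1] - traj_b[:, -1], axis=-1)
    # Assumption: a large endpoint disagreement marks the track unreliable.
    keep = end_dist <= max_dist
    return keep, end_dist
```

With this mask, `traj_a[keep]` would give the subset of trajectories retained as pseudo-labels; whether the retained tracks come from one tracker or a fusion of both is not stated in the caption and is left open here.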

Robustness in Challenging Scenarios

We evaluate Endo-TTAP against MFT across five challenging scenarios. In each comparison, the left side shows predictions from MFT and the right side shows predictions from Endo-TTAP.

Tissue Deformation

Endo-TTAP substantially improves tracking accuracy, maintaining consistent performance despite frequent occlusions. This robustness is mainly due to uncertainty-aware prediction and semantic guidance, which help preserve point identities during temporary obstructions.

Instrument Occlusion

Endo-TTAP demonstrates improved stability and robustness in tracking tissue boundary points over extended sequences, especially under non-rigid motion. Even when partially occluded by surgical instruments, Endo-TTAP significantly enhances tracking accuracy and consistency. This improvement is primarily attributed to uncertainty-aware prediction and semantic guidance, which help maintain point identities during temporary occlusions.

Camera Jitter

Endo-TTAP shows a clear improvement over MFT, although the overall error remains elevated because of sudden endoscope movements. These abrupt shifts introduce jitter that causes temporary misalignments, making long-term tracking more challenging.

Surface Reflection

Endo-TTAP performs robustly under reflection conditions, showing improved tracking accuracy and demonstrating resilience to illumination artifacts. This robustness is largely attributed to the semantic embeddings from DINOv2, which provide texture-independent visual representations that help maintain tracking performance.

Cauterization Smoke

Endo-TTAP similarly achieves superior accuracy, indicating its effectiveness in low-visibility regions. The method’s ability to handle these challenging scenarios again benefits from the semantic guidance, ensuring reliable tracking despite visual degradation.

Full Trajectory Tracking Visualization

Full-trajectory tracking results for a single annotated point on the SurgT and Endo-TTAPC5 datasets.