Surgical tool tracking by on-line selection of structural correlation filters

In visual tracking of surgical instruments, correlation filtering finds the best candidate with the maximal correlation peak. However, most trackers capture only the appearance of the target and not its structure. In this paper we propose a surgical instrument tracking approach that integrates prior knowledge related to the rotation of both the shaft and the tool tips. To this end, we employ a rigid-parts mixture model of an instrument. The rigidly composed parts encode diverse, pose-specific appearance mixtures of the tool. The tracking search space is confined to the neighbourhood of the tool position, scale, and rotation with respect to the previous best estimate, such that the rotation constraint translates into querying a subset of templates. Qualitative and quantitative evaluations on challenging benchmarks demonstrate state-of-the-art results.


I. INTRODUCTION
Visual tracking of surgical instruments is an active research topic. It is a promising technology for virtual measurements and overlays [1], keeping the operated tissue areas in view [2], guidance and recognition of risk situations [3], surgical skills assessment [4], and better understanding of surgical workflow [5]. Many methods have been proposed to deal with this challenging problem, but most of them concentrate on an appearance term representing the whole tool. For a review, we refer to [6].
Correlation-filter-based approaches aim at representing the appearance of the target object using learned templates. They are becoming popular due to their low computational cost and robustness. Additionally, to improve tracking performance, recent approaches focus on exploiting structural information, e.g. relations among target parts. The structural information is usually used to improve robustness to object deformation and partial occlusion.
For instance, in [7], [8] the authors promote an object structure by introducing a smoothness constraint on confidence maps. In [9] the authors proposed to account for object transformation, translation change, and scale variation by performing scale-spatial correlation jointly, using a novel block-circulant structure for the object template and modeling target rotation in a polar coordinate system. In [10] a structural correlation filter model is proposed that promotes similar local changes of scale, rotation, and translation across all parts, while accounting for outlier parts with different motion behaviour by introducing a sparsity assumption. In [11] the authors proposed deformable part-based correlation filter tracking. In this collaborative framework, the structure is represented by decomposing an object into several local parts, while the deformation model is encoded by a global filter. In [12] the authors proposed not only to consider pairwise geometric relations between local parts, but also to exploit high-order relations using a geometric hypergraph. Finally, in [13] the authors proposed to combine the correlation filter framework with a spatial-temporal angle matrix to account for global object rotation and deformation in visual tracking.
All the above-mentioned methods are designed for general-purpose tracking. However, in some scenarios the structure of the object to be tracked is known in advance, and it is possible to incorporate this specific prior knowledge when designing the tracker. A well-studied example of structure known in advance is the face recognition problem [14]. In this paper, we investigate a surgical tool. We make use of its particular structure by decoding the change of its pose between two subsequent frames into three relative rotations. To this end, we employ the rigidly structured model of instrument parts proposed in [15], presented within the adaptive tracking-by-detection framework [16]. The appearance is modeled within a template space representing HOG features. We introduce an injective function mapping each shaft template to two values coding the rotation and scale of the target object, and each end-effector template to a vector of four values coding the relative rotations of the tips and its scale. Using our model, the tracking search space is reduced to a local neighborhood of the tool position, scale, and rotation with respect to the current state. The rotation constraint translates in the model into a limited number of templates to be queried. We sample the resulting search space with a pixel-wise dense patch sampling technique [17]. At each time step, the position, scale, and orientation of both the shaft and the end-effector are updated. We propose a unified objective function to integrate these two sparse representation problems. The function combines spatio-temporal constraints related to the shaft and end-effector [18], [19], [20] with confidence maps of the co-appearance of different parts. The resulting optimization problem can be solved efficiently by dynamic programming.
The remainder of the paper is organized as follows. In Section II, we present the general form of the addressed problem and introduce the notation used in this work. We then present the proposed method in Section III. Experiments are reported in Section IV, showing the good performance of the proposed approach. Finally, Section V concludes the paper.

A. Structural Maximum Margin Correlation Filters
At frame $t$, given the target template sets (dictionaries) $D^S$ and $D^E$ related to shaft parts and end-effector poses, respectively, let $z_t = \{z_{1,t}, z_{2,t}, \ldots, z_{o,t}\}$ denote the observed candidate target patches represented in the target template space (e.g. HOG). In a correlation filter framework, filtering/classification is based on searching for a maximal correlation peak between the observed data $z_{i,t}$ and a template $d_i^S$. In the case of a dictionary of templates, the problem is to find a maximal correlation peak over all considered $z_{i,t}$ and all templates in the dictionaries $D^S$ and $D^E$.
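As a minimal illustration of the search described above, the sketch below scores every candidate patch against every template and returns the pair with the highest response. It assumes that each patch and template is already a flattened feature vector of equal length and uses a plain inner product as the correlation at zero shift. The names `correlation` and `best_match` are hypothetical and are not taken from the paper or from [15].

```python
from typing import Sequence, Tuple

Vector = Sequence[float]


def correlation(z: Vector, d: Vector) -> float:
    """Inner-product response between a patch feature vector and a template."""
    if len(z) != len(d):
        raise ValueError("patch and template must have the same length")
    return sum(a * b for a, b in zip(z, d))


def best_match(patches: Sequence[Vector],
               dictionary: Sequence[Vector]) -> Tuple[int, int, float]:
    """Return (patch index, template index, score) of the maximal response."""
    best = (-1, -1, float("-inf"))
    for i, z in enumerate(patches):
        for k, d in enumerate(dictionary):
            score = correlation(z, d)
            if score > best[2]:
                best = (i, k, score)
    return best


# Toy data (assumed, for illustration only): two patches, two templates.
patches = [[1.0, 0.0, 0.0], [0.0, 2.0, 1.0]]
templates = [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
result = best_match(patches, templates)
```

A practical correlation filter would evaluate all spatial shifts at once, typically in the Fourier domain; the exhaustive loop here only makes the "maximum over patches and templates" structure explicit.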
Let $E_t = \{E_{1,t}, \ldots, E_{O,t}\}$ denote a set of blocks $E_{i,t}$, each combining $N + 1$ observed patches from $z_t$ such that the first $N$ patches together represent the complete structure of the shaft and the last one represents the end-effector. For all blocks, additional constraints are imposed to enforce that all shaft-related parts within one block are encoded with the same single template from the dictionary $D^S$. Note that the requirement that the templates in the part-related template space be the same was enforced in [21], where the problem was defined within a sparsity tracking framework. In the following we limit our investigation to the correlation filter framework. In the proposed approach the structure of a tool is controlled by: (a) linear coincidence of the positions of the $N + 1$ patches forming each block $E_{i,t}$ and (b) representation of the shaft by the same template. In addition, we introduce in our model a learned bias matrix $B \in \mathbb{R}^{n_S \times n_E}$ favouring solutions that combine reliable co-appearance of end-effector and shaft poses. For joint learning of dictionaries and biases we adopt maximum margin correlation filters, which combine the good localization properties of correlation filters with the very good generalization abilities of support vector machines [22]. For a detailed description of the learning procedure we refer to [15].
With the selective assumption, the local blocks of patches within the considered set of blocks $E_t$ can be represented through elements of the dictionaries by solving the multi-criteria optimization problem (1), where $x_{i,t} \in \mathbb{R}^{n_S}$ and $y_{i,t} \in \mathbb{R}^{n_E}$ are selection vectors with one non-zero coefficient equal to 1, related to the pose of a single shaft part and of the end-effector part, respectively.
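To make the role of the selection vectors and of the bias matrix concrete, the following sketch picks one shaft template and one end-effector template for a single block. It assumes, purely for illustration, that the score of a pair is the sum of a shaft response, an end-effector response, and the corresponding bias entry; this is not the formulation (1) itself, and the exhaustive search stands in for the dynamic-programming solver mentioned in the text. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def one_hot(index: int, length: int) -> List[int]:
    """Selection vector with a single coefficient equal to 1."""
    return [1 if k == index else 0 for k in range(length)]


def select_templates(shaft_scores: Sequence[float],
                     effector_scores: Sequence[float],
                     bias: Sequence[Sequence[float]]
                     ) -> Tuple[List[int], List[int], float]:
    """Exhaustively pick the (shaft, end-effector) template pair that
    maximizes shaft response + end-effector response + bias[i][j]."""
    best_i, best_j, best_total = 0, 0, float("-inf")
    for i, s in enumerate(shaft_scores):
        for j, e in enumerate(effector_scores):
            total = s + e + bias[i][j]
            if total > best_total:
                best_i, best_j, best_total = i, j, total
    x = one_hot(best_i, len(shaft_scores))
    y = one_hot(best_j, len(effector_scores))
    return x, y, best_total


# Toy responses (assumed values). Shaft template 1 and end-effector
# template 2 have the highest individual responses, but the bias
# penalizes their co-appearance.
shaft_scores = [1.0, 1.5]
effector_scores = [0.5, 0.25, 1.0]
bias = [[0.0, 0.0, 0.25],
        [0.0, 0.0, -1.0]]
x, y, total = select_templates(shaft_scores, effector_scores, bias)
```

In this toy case the bias changes the winner from the pair (1, 2) to the pair (0, 2), which is the intended effect of favouring reliable co-appearance of shaft and end-effector poses.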
Note that in the problem defined above one variable $x_t$ is assigned to all parts related to the shaft in a given $E_{i,t}$, and hence the shaft is enforced to be represented by one template. One of the drawbacks of the problem formulation proposed in (1)

We assume that the following properties hold:
• The semantic structure of dictionaries: the mappings from $D^S$ to $I^S$ and from $D^E$ to $I^E$ are injective, i.e. at a given scale there is only one appearance template assigned to a given rotation of the shaft and only one appearance template assigned to a given pair of rotations of the end-effector.
• The consistent motion property: the displacements of the target object's parts are close, within some distance measure, in the space of: the position of the rotation center of the tool motion $p \in \mathbb{R}^2$ (for the 2D case), the considered rotation of the shaft $r_1$, the rotations of the scissor tips $r_2, r_3$, and the scales of the shaft $a_1 \in A$ and of the end-effector $a_2 \in A$.

The relation between the motion parameters and the variables $x, y$ can be expressed by the function $\phi : (\mathbb{R}^{n_S}, \mathbb{R}^{n_E}) \to \mathbb{R}^2 \times (0, 2\pi) \times A \times (0, 2\pi)^2 \times A$ defined in (2).

A dynamic state of a surgical tool defined by its position and rotation was proposed in the tracking approach by Zhou and Payandeh [23]. However, in that work the authors considered a much simpler appearance model, known to be less robust, grounded only on edge extraction. Additionally, the authors did not use information about the structure of the tool, and hence only a global orientation parameter was used. In some other works the orientation of the surgical tool was taken into account implicitly, for instance by examining various orientations of the evaluation window [24].
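The mapping $\phi$ from selection vectors to motion parameters can be pictured as a pair of lookup tables indexed by the selected templates. The sketch below is a hedged illustration: the tables, their values, and the function names are hypothetical, and the position $p$ is passed in separately on the assumption that it comes from the location of the selected block.

```python
import math
from typing import List, Sequence, Tuple


def selected_index(v: Sequence[int]) -> int:
    """Position of the single coefficient equal to 1 in a selection vector."""
    nonzero = [k for k, value in enumerate(v) if value != 0]
    if len(nonzero) != 1 or v[nonzero[0]] != 1:
        raise ValueError("selection vector must have exactly one entry equal to 1")
    return nonzero[0]


def is_injective(table: Sequence[tuple]) -> bool:
    """True when no two templates share the same pose parameters."""
    return len(set(table)) == len(table)


def phi(p: Tuple[float, float], x: Sequence[int], y: Sequence[int],
        shaft_poses: Sequence[Tuple[float, float]],
        effector_poses: Sequence[Tuple[float, float, float]]) -> List:
    """Recover the state [p, r1, a1, r2, r3, a2] from the selection vectors."""
    r1, a1 = shaft_poses[selected_index(x)]
    r2, r3, a2 = effector_poses[selected_index(y)]
    return [p, r1, a1, r2, r3, a2]


# Hypothetical tables: shaft template k -> (r1, a1),
# end-effector template k -> (r2, r3, a2). Angles are in radians.
SHAFT_POSES = [(math.pi / 4, 1.0), (math.pi / 2, 1.0), (math.pi, 1.0)]
EFFECTOR_POSES = [(0.25, 0.25, 1.0), (0.25, 0.5, 1.0)]

u_t = phi((120.0, 80.0), [0, 1, 0], [0, 1], SHAFT_POSES, EFFECTOR_POSES)
```

Checking `is_injective` on each table corresponds to the semantic-structure property: every template maps to a distinct combination of rotation and scale.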

C. Tracking framework
Our visual tracking problem is formulated within the Bayesian inference framework [7], [20] with spatio-temporal constraints [18], [19]. Similarly to [23], we use the affine motion model with six parameters to describe the object's state $u_t = [p, r_1, a_1, r_2, r_3, a_2]$. In the following, we define our reconstruction problem in terms of the variables $x_t, y_t$ and then recover the signal $u_t$ from $x_t, y_t$ using the function $\phi$ defined in (2). In the context of inverse problems, the target state variable $u_t$ can be recovered by minimizing a penalized criterion in which $\Phi$ is the so-called data fidelity term and $\Psi$ is a regularization function incorporating a priori information, so as to guarantee the stability of the solution w.r.t. the observation noise. In the Bayesian framework, this allows us to compute the maximum a posteriori (MAP) estimate [25] of the original signal. In the context of our problem, the data fidelity term is given by the observation model, while the regularization term incorporates dynamic and structural constraints, i.e. $\Psi(x_t, y_t) = \Psi_S(x_t, y_t) + \Psi_D(x_t, y_t)$, where $\Omega$ denotes some distance measure underlying the dynamic constraints.
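For orientation, a penalized criterion of this kind is commonly written in the generic form below. This is the standard template from the inverse-problems literature, given here as an assumption about the general shape of the criterion and not as the exact equation of this work; the hats denoting the minimizers are notation introduced for this illustration.

```latex
(\hat{x}_t, \hat{y}_t) \in \operatorname*{arg\,min}_{x_t,\, y_t} \; \Phi(x_t, y_t) + \Psi(x_t, y_t),
\qquad
\Psi(x_t, y_t) = \Psi_S(x_t, y_t) + \Psi_D(x_t, y_t).
```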

III. PROPOSED ALGORITHM
One of the drawbacks of the general problem formulation proposed in (3) is the computational complexity of the related solver. To alleviate this shortcoming, we propose to concentrate on an interesting special case, where $\Omega$ is defined as an indicator function $\iota_C$ and $C$ is a subset of $\mathbb{R}^2 \times (-2\pi, 2\pi) \times \mathbb{R} \times (-2\pi, 2\pi)^2$. Such a choice of distance measure is equivalent to the assumption that, in some neighbourhood defined by $C$ around $u_t$, all solutions are equally probable (from the point of view of the dynamic constraints), while outside this neighbourhood no solution is possible. Hence, the constraint on $p_{t-1} - p_t$ translates into the tracking search space being reduced to some region (a "bounding box") around $p_{t-1}$, i.e. the set of blocks $E_t$ includes only those $E_{i,t}$ combining $N + 1$ observed patches $z_{i,t}$ centred in a bounding box defined by $p_{t-1} - p_t$ and $C$. Similarly, the constraint imposed on $[r_1, a_1, r_2, r_3, a_2]$ translates into a problem over smaller dictionaries $D^S_t$ and $D^E_t$, cut down from $D^S$ and $D^E$ by removing some of their columns, where $C_S \subset C$ and $C_E \subset C$ denote convex sets defining the expected ranges of $[r_1, a_1]$ and $[r_2, r_3, a_2]$, respectively.
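The effect of the indicator-function constraint on the shaft dictionary can be sketched as a simple filter over template indices. The example assumes the usual convex-analysis convention that an indicator function is 0 inside the set and infinite outside it, and it borrows the inter-frame ranges reported in the experiments (shaft rotation within ±20°, scale ratios ×0.9, ×1.0, ×1.1). The 15° rotation grid, the pose table, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def angle_diff_deg(a: float, b: float) -> float:
    """Signed difference a - b in degrees, wrapped to (-180, 180]."""
    d = (a - b) % 360.0
    return d - 360.0 if d > 180.0 else d


def indicator(delta_rot: float, scale_ratio: float,
              max_rot: float, allowed_ratios: Sequence[float]) -> float:
    """0 when the pose change lies in the admissible set, +inf otherwise."""
    inside = abs(delta_rot) <= max_rot and any(
        math.isclose(scale_ratio, s) for s in allowed_ratios)
    return 0.0 if inside else math.inf


def prune_shaft_templates(shaft_poses: Sequence[Tuple[float, float]],
                          prev_rot: float, prev_scale: float,
                          max_rot: float = 20.0,
                          allowed_ratios: Sequence[float] = (0.9, 1.0, 1.1)
                          ) -> List[int]:
    """Indices of templates (dictionary columns) kept for the current frame."""
    return [k for k, (rot, scale) in enumerate(shaft_poses)
            if indicator(angle_diff_deg(rot, prev_rot), scale / prev_scale,
                         max_rot, allowed_ratios) == 0.0]


# Hypothetical dictionary layout: rotations every 15 degrees at two scales.
SHAFT_POSES = [(float(rot), scale)
               for scale in (1.0, 1.3)
               for rot in range(0, 360, 15)]

# Previous estimate: rotation 350 degrees at scale 1.0.
kept = prune_shaft_templates(SHAFT_POSES, prev_rot=350.0, prev_scale=1.0)
# kept == [0, 22, 23], i.e. rotations 0, 330 and 345 degrees at scale 1.0
```

Only 3 of the 48 hypothetical templates survive, and the wrap-around at 360° is handled by `angle_diff_deg`. The end-effector dictionary and the bias matrix could be cut down with the same kind of index filter.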
Fast numerical method for computing state estimates $u_t$.
Find $D^S$, $D^E$, $B$ using learning techniques from [15]
Main loop:
• Find $D^S_t$ and $D^E_t$ as smaller dictionaries cut down from $D^S$ and $D^E$ using (9,10)
• Find $B_t$ as a smaller matrix cut down from $B$ taking into account (9,10)
• Find $E_t$ around the tool motion center $u_{t-1}$
• Find $x_t, y_t$ as a solution of (1) using dynamic programming
• Find $x_t, y_t$ over the full dictionaries as a complement of the solution found over $D^S_t$, $D^E_t$
• Find $u_t$ as a solution of (4)

IV. EXPERIMENTS
We evaluate our tracker on the task of in-vivo tracking of the center of a single instrument in (i) Retinal Microsurgery (dataset with 3 sequences [26]), and (ii) Laparoscopy (dataset with 1 sequence [24]). Both datasets and tool center annotations are publicly available. We train our structural correlation filters on the first halves of the sequences and test them on the remaining halves, as in [27]. We manually initialize the state variable $u^{(0)}$. We set the inter-frame changes of the scales $a_1, a_2$ to ×0.9, ×1.0, and ×1.1, of the shaft rotation $r_1$ to ±20°, and of the end-effector rotations $r_2, r_3$ to ±10°.
Evaluation metric. We compare our tracker to state-of-the-art methods in surgical tool tracking: POSE [27], ITOL [28], DDVT [26], and our previous method from [15]. To this end, we follow the standard evaluation metric of thresholded detections. Namely, a candidate detection $\hat{c}_i$ is a true positive when it falls within the circle of radius $T$ pixels anchored at the ground truth tool center location $c_i$, i.e. $\|\hat{c}_i - c_i\|_2 < T$. We refer to this metric as Keypoint Threshold (KT), after [27].
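The Keypoint Threshold criterion above can be computed in a few lines. The sketch assumes that detections and ground-truth centers are given as per-frame (x, y) pixel pairs and that the score is reported as the fraction of frames counted as true positives; the function name is hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def keypoint_threshold_accuracy(detections: Sequence[Point],
                                ground_truth: Sequence[Point],
                                T: float) -> float:
    """Fraction of detections lying strictly within T pixels of the
    corresponding ground-truth tool center."""
    if len(detections) != len(ground_truth):
        raise ValueError("detections and ground truth must have equal length")
    if not detections:
        raise ValueError("at least one frame is required")
    hits = sum(
        1 for (dx, dy), (gx, gy) in zip(detections, ground_truth)
        if math.hypot(dx - gx, dy - gy) < T
    )
    return hits / len(detections)


# Toy frames (assumed values): distances are 0, 5, about 56.6, and 5 pixels.
detections = [(10.0, 10.0), (13.0, 14.0), (50.0, 50.0), (0.0, 0.0)]
ground_truth = [(10.0, 10.0), (10.0, 10.0), (10.0, 10.0), (3.0, 4.0)]

acc_t5 = keypoint_threshold_accuracy(detections, ground_truth, T=5.0)
acc_t6 = keypoint_threshold_accuracy(detections, ground_truth, T=6.0)
```

Because the inequality is strict, a detection exactly $T$ pixels away is not counted, which is why the two 5-pixel errors only register once the threshold is raised to 6. Sweeping `T` over a range of values yields an accuracy-versus-threshold curve.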
Quantitative and qualitative results. The results presented in Fig. 2 show a significant improvement of the proposed tracker over our previous tracking-by-detection method [15]. Our single-threaded Matlab implementation requires ∼0.2 s to 1.0 s to process an image frame, while [15] requires dozens of seconds. Moreover, it is either on par with or better than the other methods. ITOL is the most competitive with respect to our tracker. Our tracker is better on RM1 but worse on the Lapa sequence. However, our method has richer output than ITOL: apart from the tool center, it outputs end-effector articulation and shaft orientation, as in [27]. Additionally, the results presented in Fig. 3 show that our method can successfully track the tool pose (e.g., RM 1 and RM 2) and recover from a lost track (RM 3), while being disrupted by smoke and strong shaft truncations (Lapa).

Fig. 1. On-line selection of correlation filters for articulated tool pose estimation. At a given time instant and for a given correlation filter, we select a compact subset of correlation filters that are in an a priori defined proximity to that filter. We then match the selected filters to the input image and pick a new filter, at a given location and scale, that best matches the current image evidence. As the filters encode pose-specific tool appearance, repeating the procedure generates a trajectory (solid line) of the articulating and rotating tool (here, from Y-open to I-closed forceps). In practice, our experiments show that working with only a small fraction of the correlation filters out of a pool of hundreds suffices to produce a stable tool track.

V. CONCLUSION
We have described a surgical instrument tracking procedure that achieves state-of-the-art results on two public benchmarks. It combines the strengths of structural, collaborative filtering of dictionaries of discriminative features, the generalization properties of SVMs, and the Bayesian tracking framework. In particular, we have formulated an appearance model that promotes consistent tool structures by: (a) enforcing shaft parts to be represented by the same template; (b) favouring solutions combining consistent co-appearance of end-effector and shaft poses. Next, we have introduced orientation parameters and scale, which jointly control the dynamics of the target surgical instrument. Despite the inevitable appearance changes of the tracked tool, the proposed scheme allows the tracker to output tool pose and position reliably. In future work, we will extend our approach by considering more than two consecutive frames.

Fig. 2. Quantitative results on Retinal Microsurgery and Laparoscopy datasets (best viewed in color) using the KT metric. From left to right: RM1, RM2, RM3, Lapa. Methods: POSE [27], ITOL [28], DDVT [26], Ours15 [15], Ours (black). Our tracker shows a significant improvement over our previous tracking-by-detection method (green). Moreover, it is either on par with or better than the other methods. ITOL is the most competitive with respect to our tracker. Our tracker is better on RM1 but worse on the Lapa sequence. However, our method has richer output than ITOL: apart from the tool center, it outputs end-effector articulation and shaft orientation.

Fig. 3. Qualitative results on Retinal Microsurgery (left) and Laparoscopy (right) datasets. Instead of searching for the tool in the whole image frame, our method reduces the search space over location, scale, and appearance templates based on the tool pose detected in the previous frame. The tool pose serves to initiate a new bounding box around the whole tool as well as new appearance templates that are in proximity to the detected end-effector and shaft templates.