Fabio Cuzzolin; Vivek Singh Bawa; Inna Skarga-Bandurova; Mohamed Mohamed; Jackson Ravindran Charles; Elettra Oleari; Alice Leporini; Carmela Landolfo; Armando Stabile; Francesco Setti; Riccardo Muradore
Minimally Invasive Surgery (MIS) involves very sensitive procedures, whose success depends on the individual competence of the surgeons and on the degree of coordination between them. The SARAS (Smart Autonomous Robotic Assistant Surgeon) EU consortium, www.saras-project.eu, is working on methods to assist surgeons in MIS procedures by devising deep learning models able to automatically detect surgeon actions from streaming endoscopic video. This challenge proposal builds on our previous MIDL 2020 challenge on surgeon action detection (https://saras-esad.grand-challenge.org), and aims to attract attention to this research problem and mobilise the medical computer vision community around it. In particular, informed by the challenges encountered in our SARAS work, we decided to focus this year’s challenge on the issue of learning static action detection models across multiple domains (e.g. types of data, distinct surgical procedures).
Despite its huge success, deep learning suffers from two major limitations. Firstly, addressing a task (e.g., action detection in radical prostatectomy, as in SARAS) requires one to collect and annotate a large, dedicated dataset to achieve an acceptable level of performance. Secondly, and as a consequence, each new task requires a new model to be built, often from scratch, leading to a linear relationship between the number of tasks and the number of models/datasets, with significant resource implications. Collecting large annotated datasets for every single MIS-based procedure is inefficient, time-consuming and expensive.
In our SARAS work, we have captured endoscopic video data during radical prostatectomy under two different settings ('domains'): real procedures on real patients, and simplified procedures on artificial anatomies ('phantoms'). As shown in our MIDL 2020 challenge (over real data only), variations due to patient anatomy, surgeon style and so on dramatically reduce the performance of even state-of-the-art detectors compared to their performance on non-surgical benchmark datasets. Videos captured in an artificial setting can provide more data, but are characterised by significant differences in appearance compared to real videos and are subject to variations in the appearance of the phantoms over time. Inspired by these all-too-real issues, this challenge's goal is to test the possibility of learning more robust models across domains (e.g. across different procedures which, however, share some types of tools or surgeon actions; or, in the SARAS case, learning from both real and artificial settings whose lists of actions overlap but do not coincide).
In particular, this challenge aims to explore the opportunity of utilising cross-domain knowledge to boost model performance on each individual task whenever two or more such tasks share some objectives (e.g., some action categories). This is a common scenario in real-world MIS procedures, as different surgeries often have some core actions in common, or involve variations of the same movement (e.g. 'pulling up the bladder' vs 'pulling up a gland'). Hence, each time a new surgical procedure is considered, only a small percentage of new classes needs to be added to the existing ones.
The challenge provides two datasets for surgeon action detection: the first dataset (Dataset-R) is composed of 4 annotated videos of real surgeries on human patients, while the second dataset (Dataset-A) contains 6 annotated videos of surgical procedures on artificial human anatomies. All videos capture instances of the same procedure, Robotic Assisted Radical Prostatectomy (RARP), but with some differences in the sets of classes. The two datasets share a subset of 10 action classes, while they differ in the remaining classes (because of the requirements of SARAS demonstrators). These two datasets therefore provide an opportunity to explore whether multi-domain datasets designed for similar objectives can be exploited to improve performance on each individual task.
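As an illustration only, the partially overlapping class sets of Dataset-R and Dataset-A could be merged into a single label space, with a per-domain mask marking which classes are annotated in each dataset. The sketch below is not the challenge's official tooling: the function names, the placeholder class names and the numbers of domain-specific classes (2 and 1) are all assumptions made for the example; only the 10 shared classes come from the description above.

```python
from typing import Dict, List, Tuple


def build_joint_label_space(
    classes_r: List[str], classes_a: List[str]
) -> Tuple[Dict[str, int], List[str]]:
    """Merge two per-domain class lists into one index and list the shared classes."""
    in_a = set(classes_a)
    # Classes annotated in both domains, kept in Dataset-R order.
    shared = [c for c in classes_r if c in in_a]
    # Ordered union: dict.fromkeys removes duplicates but keeps first-seen order.
    joint = list(dict.fromkeys(classes_r + classes_a))
    index = {name: i for i, name in enumerate(joint)}
    return index, shared


def domain_mask(index: Dict[str, int], domain_classes: List[str]) -> List[bool]:
    """Return True at position i if joint class i is annotated in the given domain."""
    valid = set(domain_classes)
    # Dict iteration follows insertion order, which matches the joint indices.
    return [name in valid for name in index]


# Hypothetical placeholder names; the real class lists are not given here.
shared_names = [f"shared_action_{i}" for i in range(10)]
classes_r = shared_names + ["real_only_action_0", "real_only_action_1"]
classes_a = shared_names + ["phantom_only_action_0"]

index, shared = build_joint_label_space(classes_r, classes_a)
mask_r = domain_mask(index, classes_r)
mask_a = domain_mask(index, classes_a)
```

With such masks, a single detector trained on both datasets could restrict its classification loss to the classes that are actually annotated in the domain each frame comes from, so that a class absent from one dataset is not treated as a negative there. This is one possible design among several, not a prescribed approach.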