Anisotropic Diffusion-Based Enhancement of Scene Segmentation with Instance Labels

Many visual scene understanding applications, especially in visual servoing settings, may require high-quality object mask predictions for the accurate execution of various robotic tasks. In this work we investigate a setting where separate instance labels for all objects under view are required, but the available instance segmentation methods produce object masks inferior to those of a semantic segmentation algorithm. Motivated by the need to add instance label information to the higher-fidelity semantic segmentation output, we propose an anisotropic label diffusion algorithm that propagates instance labels predicted by an instance segmentation algorithm inside the semantic segmentation masks. Our method leverages local topological and color information to propagate the instance labels, and is guaranteed to preserve the semantic segmentation mask. We evaluate our method on a challenging grape bunch detection dataset, and report experimental results that showcase the applicability of our method.


Introduction
Vision-based understanding of complex natural scenes, e.g., in agricultural settings, can provide invaluable information for in-field robotic applications. With the success of deep learning frameworks for end-to-end object detection during the past years, new opportunities arise for more complete and robust field scene understanding. However, the biological entities that must be considered for robotic applications are often highly complex. Especially in vineyards, the unstructured geometrical configuration of the vine plant parts most relevant to robotic applications, e.g., grape bunches, trunks, cordons and shoots, leads to significant challenges for state-of-the-art object detection algorithms.
We are concerned with the instance segmentation problem, where the masks and instance labels for each countable object in the scene must be predicted.
The caveat that motivates our method is that a semantic segmentation algorithm might predict object masks of higher quality than the corresponding masks from an instance segmentation algorithm. Therefore, our goal is the addition of instance labels to the higher-fidelity semantic segmentation output, with the guarantee that the original segmentation mask will be preserved. We formulate the 2D instance label enhancement of semantic segmentation as an anisotropic label diffusion process, taking inspiration both from the 3D label diffusion process presented in [25] and the generalization of label diffusion to the anisotropic case [13]. More specifically, our method leverages local pixel information to propagate all instance segmentation predictions that intersect our semantic segmentation masks towards the remaining semantic segmentation pixels.
To the best of our knowledge, this is the first work to address the incorporation of instance labels in a deep semantic segmentation framework through the anisotropic diffusion of instance labels generated by a lower fidelity instance segmentation algorithm guided by topological and color information. We experimentally evaluate our results in a vineyard grape bunch setting using two representative state-of-the-art methods, PSPNet [31] for semantic segmentation and Mask R-CNN [11] for instance segmentation. We found PSPNet to achieve higher mask quality than Mask R-CNN in this setting, and thus, we preserved PSPNet predictions while simultaneously enhancing them with instance label information.

Related Work
Although several works combine semantic and instance segmentation for the task of panoptic segmentation [10,19,28,26,7], first described in [14], some works explicitly aim at mask improvement through the combination of the two tasks. The authors of [8] add an Atrous convolution segmentation head to Mask R-CNN to refine the predicted segmentation masks. In [24], a semantic segmentation U-Net [21] head is attached to Mask Scoring R-CNN [12] to facilitate mask prediction performance. The authors of [18] utilize a Bayesian setting to improve [16] with semantic segmentation masks from [23]. The authors of [9] present a novel Instance Mask Projection neural module that enhances segmentation mask prediction by projecting Mask R-CNN masks to the features of the semantic segmentation module. In [5] the authors improve the segmentation results of U-Net in medical images, first by estimating bounding boxes of anatomical structures with a connected components analysis, and afterwards by combining Mask R-CNN and U-Net through a bounding box tracking-based approach; the tracking unfolds across 2D slices of the original Computed Tomography 3D volumes. The authors of [17] enforce consistency between semantic and instance segmentation mask predictions through a specialized loss construction. In [30], the authors add a semantic segmentation head to Mask R-CNN to uncover fine details in a crack detection setting. All the aforementioned works combine semantic and instance segmentation in a fusion setting, where it is assumed that the combination of the two tasks will facilitate the overall mask predictions. We, on the other hand, aim to preserve only the semantic segmentation masks, while simultaneously adding instance label information.
Two works that perform instance segmentation by applying the watershed transform on a semantic segmentation mask, and thus, like our method, retain the initial semantic segmentation mask quality, are [4] and [29]. In the first paper, the authors train a neural network to learn the energy basins of the watershed transform with an intermediate step of distance transform learning, and apply thresholds at predefined energy values to extract the object instances. The authors of the second work directly calculate the Euclidean distance transform of the semantic segmentation output of a U-Net [21] (based on the implementation found in [1]), and apply the watershed transform on it with markers provided by the centers of bounding boxes estimated by a modified Region Proposal Network (from Faster R-CNN [20]). Although these works preserve the semantic segmentation mask, their instance segmentation quality depends on the watershed transform performance, while our work aims at a modular design that permits the selection of an appropriate instance segmentation algorithm and offers controllable quality in instance label detection.
The work closest to ours is [3], where the authors combine the output of a semantic segmentation network with bounding boxes predicted by an object detector to perform instance segmentation. Their method retains semantic segmentation masks and assigns instance labels to the regions not covered by bounding boxes based on the mean field approximation of a CRF [15], which is a non-local [6] process. On the other hand, our method is mainly concerned with the utilization of instance mask predictions that have considerable overlap with the ground truth, i.e., each visible part of an object instance is covered by a coarse instance mask, and therefore does not aim to assign instance information at a global scale, but rather to smoothly propagate coarse predictions towards unlabeled pixels. The addition of anisotropic guidance to the diffusion process prevents possible "leakage" of instance label information between different instances.

Methodology
Let L be the set of object instances occurring in an RGB image I of width W and height H, let M_I^l be the mask estimated by the instance segmentation algorithm for label l ∈ L, M_S the mask predicted by the semantic segmentation algorithm, and M_I the logical OR of all Mask R-CNN masks, i.e., $M_I = \bigvee_{l \in L} M_I^l$. For each pixel p_i ∈ M_S we find the set of its K nearest neighbors under the Euclidean distance between pixel coordinates, denoted KNN(p_i). We search for nearest neighbors only inside M_S, i.e., p_j ∈ KNN(p_i) ⇒ p_j ∈ M_S. We propagate the instance labels predicted by the instance segmentation algorithm via anisotropic diffusion towards the unlabeled pixels belonging to M_S but not to M_I. To this end, we first define the set z = {z_l} of label vectors $z_l \in \{0,1\}^{WH}$ with l ∈ L. We set the vector element z_l(i) corresponding to pixel p_i equal to 1 if p_i ∈ M_I^l, and zero otherwise. Then, we define the weighted graph G = (P, E, W) with the set of nodes P equal to the mask pixels of M_S, edges E ⊂ P × P connecting the nodes, and weights W assigned to every edge. Each z_l defines a function assigning to the i-th pixel/graph node p_i the value z_l(i), indicating whether p_i belongs to instance label l or not. Let ‖·‖_2 denote the Euclidean norm and I(p_i) the RGB color value of pixel p_i. Then, the anisotropic graph Laplacian L_D on graph node i is defined as [13]:

$$L_D z_l(i) = \frac{1}{d_i} \sum_{p_j \in KNN(p_i)} w_{ij} \, q_{ij} \, (z_l(j) - z_l(i)),$$

where w_ij is the edge weight connecting nodes i and j:

$$w_{ij} = \exp\!\left(-\frac{\|p_i - p_j\|_2^2}{\sigma_d^2}\right) \exp\!\left(-\frac{\|I(p_i) - I(p_j)\|_2^2}{\sigma_c^2}\right),$$

d_i is the degree of node i:

$$d_i = \sum_{p_j \in KNN(p_i)} w_{ij},$$

and q_ij is equal to:

$$q_{ij} = \exp\!\left(-\frac{(z_l(i) - z_l(j))^2}{\sigma_a^2}\right),$$

where σ_d, σ_c and σ_a are the scale hyperparameters of the exponents. Assuming a unit discretization interval, the Euler approximation for anisotropic diffusion on G per instance label class is given by the following iterative update from step t to step t + 1:

$$z_l^{t+1}(i) = z_l^t(i) + L_D z_l^t(i).$$

The total number of iterations is a hyperparameter; following [13], we first apply isotropic diffusion (q_ij = 1) for a small number of steps to decrease the number of iterations required for convergence. After the final iteration is completed for every instance l, we select the instance label l_{p_i} of every initially unlabeled pixel p_i to be:

$$l_{p_i} = \arg\max_{l \in L} z_l(i).$$

A schematic overview of our method can be seen in Fig. 1.

Fig. 1. The overall pipeline of our method: The intersection of the semantic segmentation and the instance segmentation outputs provides the initial instance labels, which are afterwards anisotropically diffused towards the remaining unlabeled pixels of the semantic segmentation output. The ∩ symbol signifies mask intersection.
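The iterative update above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the brute-force K-nearest-neighbor search, the clamping of initially labeled (seed) pixels to their starting labels, and the exact forms of the exponents are our assumptions, kept consistent with the definitions in the text.

```python
import numpy as np

def anisotropic_label_diffusion(coords, colors, labels, n_labels,
                                k=8, sigma_d=2.0, sigma_c=500.0, sigma_a=0.1,
                                iso_steps=10, aniso_steps=100):
    """Diffuse one-hot instance labels over the semantic-mask pixels.

    coords : (N, 2) float pixel coordinates of the pixels in M_S
    colors : (N, 3) float RGB values of those pixels
    labels : (N,) int initial instance label per pixel, -1 if unlabeled
    """
    n = coords.shape[0]
    # Brute-force K-nearest neighbors inside M_S (Euclidean distance).
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                 # exclude self-matches
    knn = np.argsort(d2, axis=1)[:, :k]          # (N, k) neighbor indices

    # Edge weights w_ij from spatial distance and color similarity.
    rows = np.repeat(np.arange(n), k)
    cols = knn.ravel()
    w = (np.exp(-d2[rows, cols] / sigma_d**2)
         * np.exp(-((colors[rows] - colors[cols]) ** 2).sum(-1) / sigma_c**2))
    w = w.reshape(n, k)
    deg = w.sum(1, keepdims=True)                # node degrees d_i

    # One indicator vector z_l per instance label.
    z = np.zeros((n, n_labels))
    seeded = labels >= 0
    z[seeded, labels[seeded]] = 1.0

    for t in range(iso_steps + aniso_steps):
        zj = z[knn]                              # (N, k, L) neighbor values
        if t < iso_steps:
            q = 1.0                              # isotropic warm-up steps
        else:                                    # anisotropic guidance q_ij
            q = np.exp(-((z[:, None, :] - zj) ** 2) / sigma_a**2)
        # Euler step: z <- z + L_D z (unit discretization interval).
        z = z + (w[:, :, None] * q * (zj - z[:, None, :])).sum(1) / deg
        z[seeded] = 0.0                          # keep seed pixels fixed
        z[seeded, labels[seeded]] = 1.0

    return z.argmax(1)                           # final label per pixel
```

For realistic image sizes, a KD-tree (e.g. `scipy.spatial.cKDTree`) and a sparse adjacency structure would replace the O(N²) distance matrix used here for brevity.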

Experimental Evaluation
Our method was tested on the publicly available Embrapa WGISD [22] dataset. Embrapa WGISD contains 137 RGB images with manually annotated instance segmentation masks of grape bunches. We follow the 88-22-27 training-validation-test image split used by the authors, with the same image resizing and augmentation strategies described in their paper. In our experiments, we found the Detectron2 [27] implementation of Mask R-CNN to produce superior results to the Matterport [2] implementation used in [22]. We trained both implementations of Mask R-CNN, and the implementation of PSPNet [33], [32], on an Nvidia Tesla K40m GPU, with the maximum available batch size and network backbone. The Detectron2 Mask R-CNN implementation was trained for 6000 iterations, the PSPNet implementation for 10000 iterations, and the Matterport Mask R-CNN implementation for 6000 iterations. For our experiments, the instance segmentation masks from Mask R-CNN were predicted using the confidence value that maximized the (object detection) F-measure at 0.3 IoU on the validation set. In all frameworks, the default training parameters were kept. We evaluate the mask quality of PSPNet and Mask R-CNN with total mask Grape Bunch IoU, total mask Background IoU (BG IoU) and per-pixel F-measure. We also include the F-measure metric reported in [22] for the Matterport Mask R-CNN implementation. The total scene masks for Mask R-CNN were derived by aggregating all predicted instance masks with the logical OR operator. The experimental results of Table 1 highlight the relevance of our method on this dataset, as the semantic segmentation algorithm predicts higher quality masks than the instance segmentation algorithm. Therefore, we apply anisotropic diffusion-based instance label addition to the test set. Based on our experimental hyperparameter search, we used the 8 nearest neighbors and set σ_d = 2, σ_c = 500 and σ_a = 0.1.
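For concreteness, the per-pixel mask metrics and the logical-OR aggregation of instance masks described above can be computed as in the following sketch (the function names are ours, not from the paper's code):

```python
import numpy as np

def total_mask(instance_masks):
    """Aggregate per-instance binary masks into one scene mask (logical OR)."""
    return np.any(np.stack(instance_masks), axis=0)

def pixel_metrics(pred, gt):
    """Per-pixel IoU, precision, recall and F-measure for two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = (pred & gt).sum()                  # pixels correctly labeled positive
    fp = (pred & ~gt).sum()                 # false positive pixels
    fn = (~pred & gt).sum()                 # false negative pixels
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    f = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    return iou, prec, rec, f
```

The Background IoU is obtained by applying `pixel_metrics` to the complements of the masks.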
We also used the Detectron2 Mask R-CNN implementation to provide the initial instance labelling. In Table 2 we compare our method to a naive local algorithm that, for each unlabeled pixel p, assigns an instance label to p via a majority vote over the labels of its K nearest labeled neighbors. We refer to this baseline method as Majority Voting KNN (MV-KNN), and set the number of neighbors to 9. For labeled neighbors lying on Mask R-CNN mask overlaps, we select the contributing label randomly from one of the overlapping masks. We compare the methods using per-unlabeled-pixel average multi-class precision (Prec), recall (Rec) and F-measure. We found that our method produces adequate results even for approximately 100 diffusion iterations, and converges with respect to the evaluation metrics at approximately 1000 iterations. However, to keep the metrics meaningful, we remove the influence of wrong predictions from the semantic and instance segmentation algorithms. We remove the influence of object misclassification from Mask R-CNN by assigning to each pixel of the predicted instance segmentation masks the original ground truth instance labels, i.e., we still propagate from masks predicted by Mask R-CNN, but with instance labels given by the ground truth. Additionally, to disregard the effect of possible PSPNet mask inaccuracies, we calculate the metrics only on correct PSPNet mask estimations. Our method outperforms the baseline in all metrics, and achieves over 6% improvement in precision. Qualitative results that highlight the effectiveness of our method can be seen in Fig. 2. Note that in the fourth column of Fig. 2 several Mask R-CNN instance masks have been improved. Regions from grape bunch masks that were improved by preserving PSPNet mask information are marked with black arrows in the third column.
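The MV-KNN baseline admits a direct implementation; this sketch assumes ties are broken toward the smallest label index (the behavior of `np.bincount(...).argmax()`), which the paper does not specify:

```python
import numpy as np

def mv_knn(unlabeled_xy, labeled_xy, labels, k=9):
    """Majority Voting KNN baseline: each unlabeled pixel takes the
    majority label among its k nearest labeled pixels (Euclidean)."""
    # Squared distances from every unlabeled to every labeled pixel.
    d2 = ((unlabeled_xy[:, None, :] - labeled_xy[None, :, :]) ** 2).sum(-1)
    knn = np.argsort(d2, axis=1)[:, :k]     # k nearest labeled neighbors
    out = np.empty(len(unlabeled_xy), dtype=int)
    for i, nn in enumerate(knn):
        out[i] = np.bincount(labels[nn]).argmax()   # majority vote
    return out
```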
The resolution of conflicts between adjacent grape bunch masks attempting to cover the same unclaimed PSPNet mask region, for example the rightmost grape bunches (yellow and purple) in the second row, is guided by the local topological and color information in the diffusion process. Note that the ground truth coloring of grape bunch masks differs from that of Mask R-CNN and our method, because Mask R-CNN usually predicts a different number of instance masks, and in a different order.

Conclusion
An anisotropic diffusion-based instance labelling algorithm for semantic segmentation masks has been presented. Our method utilizes local distance and color cues to anisotropically propagate instance label information predicted by an instance segmentation algorithm towards mask pixels estimated by a semantic segmentation algorithm. The evaluation of our method on the Embrapa WGISD grape bunch instance segmentation dataset showcases the preservation of semantic segmentation mask quality. Our future work will concentrate on the fusion of RGB and depth information in the diffusion process to alleviate possible ambiguities in scene understanding originating from grape bunch instance segmentation misassignments. We also plan to evaluate our method on datasets of larger size and variability.