A DCNN-based Arbitrarily-Oriented Object Detector for a Quality-Control Application

Following the success of machine vision systems for on-line automated quality and process control, in this paper we describe an object recognition solution aimed at detecting the presence of quality control elements in surgery toolboxes prepared by the Sterilization Unit of a hospital. Our solution consists of a two-stage arbitrarily-oriented object detection method making use of indirect regression of oriented bounding box parameters. The paper describes the design process and reports on the results obtained to date.


I. INTRODUCTION
Machine vision systems are becoming increasingly popular solutions for on-line automated quality and process control applications. Since they enable non-contact, and thus non-destructive, inspection, optical techniques are especially well suited when the correct manipulation of the object under inspection is crucial. This is precisely the inspection problem that we deal with in this paper: the detection of a number of control elements placed in the boxes and bags containing surgical tools with which surgeons and nurses have to be supplied by the sterilization unit of the hospital prior to starting surgery. These elements provide evidence that the tools have been properly submitted to the required cleaning processes. Figure 1 illustrates, from left to right and top to bottom, the four kinds of elements to be detected: the label/bar code used to track a box/bag of tools, the yellowish seal, the paper tape, which changes to the striped appearance when the box/bag has been inside the autoclave, and an internal filter which is placed inside some boxes and creates the white-dotted texture that can be observed (instead of a black-dotted one).
In this work, we adopt Deep Convolutional Neural Network (DCNN)-based methodologies as highly robust machine-learning approaches to cope with different lighting and inspection conditions. DCNNs have already shown good results for object recognition in images [1]; what is most interesting, however, is that they have shown highly promising performance for inspection applications [2]. In contrast to manually designed image processing solutions, DCNNs automatically generate powerful features, i.e. learn the representation, from training data by means of hierarchical learning strategies, with a minimum of human interaction or expert process knowledge.
In more detail, this paper proposes a two-stage arbitrarily-oriented detector based on SSD (Single Shot MultiBox Detector) [3]. The main contributions are as follows: (1) we design a two-stage arbitrarily-oriented multi-category object detector which can successfully operate in dense and complicated scenes; (2) unlike SSD, we select default bounding boxes by means of an automatic clustering procedure to obtain high-quality priors and improve object localization accuracy; and (3) we propose a new method for oriented bounding box regression.

II. DETECTOR OVERVIEW
The detector that we propose in this section comprises two stages. In the first stage, we make use of a fine-tuned version of SSD [3] to regress straight bounding boxes containing the objects of interest; unlike the original algorithm, this version of SSD employs a set of prior box configurations determined by a clustering analysis of the training dataset, in order to focus the search on relevant bounding boxes and improve object localization performance. In the second stage, starting from the output of the first stage, we regress the parameters of a rotated bounding box maximally contained in one of the straight bounding boxes. For a start, we first describe the parameterization we employ for the two kinds of bounding boxes which are handled. The two stages are described next.

A. Bounding Boxes Parameterization
In Fig. 2[top], yellow lines describe a 4-sided polygon minimally enclosing the object, from which the minimal rotated rectangle/bounding box is generated, indicated by the violet lines in the figure. A minimal straight bounding box is finally obtained from the rotated bounding box, depicted through the red lines. The latter is parameterized, as usual, by the anchor point coordinates (c_x, c_y) and the box size (w_b, h_b), while rotated boxes are described by the intersects (d_1, d_2) of the upper side of the rotated rectangle with the sides of the unrotated rectangle. Optionally, we also add a parameter h to choose between the two possible rectangles which may arise from the tuple (d_1, d_2). Besides, we also define a clockwise order on the four corners of the rotated box (Fig. 2[top,right]). This parameterization has been used to generate the ground truth necessary for training.

Additionally, as part of the ground truth, and for testing purposes, we have defined two different datasets regarding the boxes associated with the objects of every image. This is illustrated in Fig. 2[bottom]. On the one hand, dataset A [left] defines one box for every object in the training image. Although this seems natural for relatively square objects, such as the label and the seal, it is not as straightforward for the paper tape, because of its elongated shape; hence, we define dataset B [right], which splits the object into several parts to favor better training and later detection of this kind of object.
B. Straight Bounding Boxes Regression

1) Base method: For regressing the unrotated bounding boxes containing the objects of interest, we make use of SSD. This algorithm predicts category scores and box offsets for a fixed set of default bounding boxes using a set of convolutional filters applied to feature maps. As a base network, it makes use of a standard VGG16 network [1] whose fully connected layers have been replaced by a set of auxiliary convolutional layers progressively decreasing in size, thus enabling predictions at multiple scales.
For a feature layer of size m × n with p channels, the basic element for predicting the parameters of a potential detection is a 3 × 3 × p small kernel that produces either a score for a category or a shape offset relative to the default box coordinates (c_x, c_y, w, h). At each of the m × n locations where the kernel is applied, and for multiple feature maps, SSD predicts offsets relative to the default box shapes in the cell, ∆(c_x, c_y, w, h), as well as the per-class scores that indicate the presence of a class instance in each of those boxes. Specifically, for each box out of k at a given location, SSD computes c class scores and the 4 offsets relative to the original default box shape. This results in a total of (c + 4)k filters that are applied around each location in the feature map, yielding (c + 4)kmn outputs for an m × n feature map. Given the large number of boxes generated during a forward pass of SSD at inference time, most of the bounding boxes are pruned by applying non-maximum suppression to keep only the N top predictions.
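The filter and output counts above can be checked with a short sketch (the 19 × 19 feature map, k = 6 boxes and c = 5 classes in the example are illustrative values, not taken from the paper):

```python
def ssd_output_counts(m, n, k, c):
    """Number of convolutional filters applied at each feature-map
    location, and total outputs over an m x n feature map, when SSD
    predicts c class scores and 4 box offsets for each of k default
    boxes: (c + 4)k filters and (c + 4)kmn outputs."""
    filters_per_location = (c + 4) * k
    total_outputs = filters_per_location * m * n
    return filters_per_location, total_outputs

# Illustrative example: a 19 x 19 feature map, 6 default boxes,
# 5 classes -> (5 + 4) * 6 = 54 filters, 54 * 19 * 19 = 19494 outputs.
print(ssd_output_counts(19, 19, 6, 5))
```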
To finish, the overall loss function is a weighted sum of the localization loss L_loc (implemented as a Smooth L1 loss [4] between the predicted box l and the ground truth box parameters g for offset regression) and the confidence loss L_conf (implemented as a multi-category Softmax loss):

L(x, c, l, g) = (1/N) [ L_conf(x, c) + α L_loc(x, l, g) ]    (1)

where x denotes the matching of the N final predictions through the indicator functions x^p_ij ∈ {0, 1} for matching the i-th default box to the j-th ground truth box of category p. Parameter α is used to balance the impact of the classification and the bounding box regression loss values on the final loss value (α is finally set to 1 by cross-validation). SSD thus starts with the priors as predictions and attempts to regress closer to the ground truth bounding boxes. The reader is referred to [3] for more details.
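As an illustration of how the two terms combine, the following sketch (all names are ours; the Softmax confidence term is taken here as a precomputed scalar) implements the Smooth L1 localization loss and the weighted sum:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss on the offset residuals:
    0.5 * x^2 for |x| < 1, |x| - 0.5 otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

def ssd_loss(l, g, conf_loss, alpha=1.0, n_matched=1):
    """Weighted sum of the confidence loss and the Smooth L1
    localization loss between predicted offsets l and ground truth g,
    normalized by the number of retained predictions."""
    loc_loss = smooth_l1(np.asarray(l) - np.asarray(g)).sum()
    return (conf_loss + alpha * loc_loss) / n_matched
```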
2) Prior boxes selection: SSD predefines a total of 6 default boxes per feature map location, i.e. per scale, by imposing manually picked size combinations (w_k, h_k). Since, on the one hand, the shape of the correct bounding boxes can vary significantly and, on the other hand, SSD regresses the predicted bounding boxes from the prior boxes, a proper selection of default boxes becomes crucial for achieving high detection success; as already noted in [5], such a proper selection contributes to the stability of the underlying optimization process, leads to faster convergence, and effectively improves the Intersection over Union (IoU) between predicted and correct boxes. Hence, our object detector makes use of default boxes selected automatically in accordance with the available data.
In more detail, we run the well-known K-means algorithm over the bounding boxes belonging to the ground truth, using box width and height as the clustering features. Instead of the Euclidean distance typically used by K-means implementations, we adopt an IoU-based distance metric, because the former tends to miss large bounding boxes. The distance between a sample box b_i and the cluster centroid c_j is hence defined as:

d(b_i, c_j) = 1 − IoU(b_i, c_j) = 1 − o(b_i, c_j) / (a(b_i) + a(c_j) − o(b_i, c_j))    (2)

where o(·, ·) denotes area overlap and a(·) denotes area. Table I shows averages of the IoU metric for hand-picked default boxes and for boxes automatically selected by clustering, for both datasets A and B and a varying number of default boxes (for the hand-picked cases, we predefine the boxes similarly to SSD). We can see that 4 automatically selected clusters yield better performance than 10 hand-picked default boxes. This means that we obtain high-quality, better-parameterized default boxes. As could be expected, the more clusters, the better the performance (the trend can be observed to continue for # def. boxes ≥ 7), although the number of clusters should not be too high, in order to keep the running time reasonable.
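A minimal sketch of this clustering procedure, assuming boxes are clustered as (width, height) pairs anchored at a common corner so that only shape matters (all names are ours, not from the paper):

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (width, height) pairs, assuming all boxes share a
    common top-left corner. boxes: (N, 2), centroids: (K, 2) -> (N, K)."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = ((boxes[:, 0] * boxes[:, 1])[:, None] +
             (centroids[:, 0] * centroids[:, 1])[None, :] - inter)
    return inter / union

def kmeans_boxes(boxes, k, iters=100, seed=0):
    """K-means over ground-truth box shapes with d = 1 - IoU as the
    distance metric. Returns the k centroid (w, h) pairs."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the centroid with the smallest 1 - IoU.
        assign = np.argmin(1.0 - iou_wh(boxes, centroids), axis=1)
        new = np.array([boxes[assign == j].mean(axis=0)
                        if np.any(assign == j) else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids
```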

C. Arbitrarily-oriented Bounding Boxes Regression
For the second stage of the detector, i.e. the regression of oriented bounding box parameters, we consider a specifically designed lightweight CNN, since the performance obtained with existing pre-trained networks, e.g. ResNet [6], VGG16 [1], etc., did not reach the desired level.
This network (see Fig. 3), although inspired by the LeNet architecture [7], presents several differences: (1) the input size is 63 × 63 after incorporating an additional convolutional layer at the beginning of the network, in order to avoid reducing the image to LeNet's 28 × 28 pixels, which would mean losing too much information; (2) a combination of batch normalization and scaling layers has been added after each convolutional and inner-product layer, before the ReLU activation layers, in order to decrease the effect of covariate shift from hidden layers [8]; (3) since the bounding box parameters (d_1, d_2, h) are values between 0 and 1, a sigmoid layer lies between the last fully connected layer and the loss layer; and (4) the final layer is a Euclidean loss layer:

L(d, g) = (1/2N) Σ_{i=1}^{N} ||d_i − g_i||²    (3)

where d denotes the predicted offsets and height, g denotes the ground truth, and N is the size of the minibatch.
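The Euclidean loss term can be sketched as follows, assuming the Caffe convention of normalizing by 2N (function name is ours):

```python
import numpy as np

def euclidean_loss(d, g):
    """Caffe-style Euclidean loss over a minibatch of size N:
    L = 1/(2N) * sum_i ||d_i - g_i||^2, with d the predicted
    (d1, d2, h) tuples and g the corresponding ground truth."""
    d, g = np.asarray(d, dtype=float), np.asarray(g, dtype=float)
    n = d.shape[0]
    return np.sum((d - g) ** 2) / (2 * n)
```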

III. EXPERIMENTAL RESULTS
The dataset we have employed comprises 461 original images in total, which have been augmented, as usual, by means of rotation, scaling and mirroring. The dataset has then been split into 2/3 for training and the remaining 1/3 for testing. All experiments have been conducted in Caffe, running on a PC fitted with an NVIDIA GeForce GTX 1080 GPU, a 2.9 GHz 12-core CPU with 32 GB of RAM, and 64-bit Ubuntu. To distinguish true positives from false positives, the threshold for the IoU between ground truth and predicted bounding boxes is set to 0.5, as usual. Standard recall (R), precision (P), mean average precision (mAP) and average IoU (AvIoU) performance metrics [9] are computed for quantitative evaluation and comparison.
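The true-positive criterion above can be sketched as follows (the (x1, y1, x2, y2) box representation and the function names are illustrative choices of ours):

```python
def iou(box_a, box_b):
    """IoU of two straight boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, gt, threshold=0.5):
    """A prediction counts as a true positive when its IoU with the
    ground-truth box reaches the 0.5 threshold used above."""
    return iou(pred, gt) >= threshold
```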
On the one hand, we report detection results for SSD using the default boxes automatically selected by the method described in Section II-B2. For efficiency reasons, we have considered clusterings into 4, 5 and 6 clusters. Table II shows the performance data for datasets A and B, highlighting the best results (A or B) in red. Several observations can be made from this table: (1) for dataset A, the best performance is obtained with 5 default boxes, while for dataset B, performance is better with 6 default boxes; (2) results for dataset B are better not only for paper tape detection, as expected, but in general for all classes (a 10% increase). All experiments have been run using Adam as the network optimizer in Caffe, with a maximum of 2 × 10^5 iterations and a fixed learning rate equal to 10^−5.
On the other hand, Table III shows results for oriented object detection, using dataset B and 6 clusters, in accordance with the previous results. In the comparison, we consider the loss function of (3) with both 2 terms, i.e. (d_1, d_2), and 3 terms, i.e. (d_1, d_2, h). Additionally, we include results for AlexNet as a baseline, properly tuned for the detection problem at hand. Observing the table, one can see that: (1) regression of 2 terms yields better results in general than regression of 3 terms (by around 20%); (2) our network with 2 terms in the loss function outperforms the baseline network significantly (also by around 20%); (3) in general, the paper tape is the class for which the worst detection results are obtained. The Adam optimizer has also been employed in this case, with a learning rate equal to 0.01 and momentum at 0.9 to speed up network convergence; during training, we decreased the learning rate by a factor of 0.8 every 5000 iterations, for a maximum of 40000 iterations.
To finish, Fig. 4 shows, for three images, our method qualitatively outperforming TextBoxes++ [10], an arbitrarily-oriented text detector which we have also fine-tuned for the quality control application. As can be observed, the test images were not taken under controlled conditions. (TextBoxes++ did not reach IoU ≥ 0.5 in our experiments, which is the reason why it is not included in Table III.)

IV. CONCLUSIONS AND FUTURE WORK
A two-stage arbitrarily-oriented object detection method making use of indirect regression of oriented bounding box parameters has been described, and promising results have been reported. Future work will focus on achieving pixel-level detection within a hybrid solution making use of the bounding box concept and semantic segmentation.