Reducing the need for bounding box annotations in Object Detection using Image Classification data

Abstract: We address the problem of training Object Detection models using significantly fewer bounding box annotated images. For that, we take advantage of cheaper and more abundant image classification data. Our proposal consists of automatically generating artificial detection samples, with no need for expensive detection level supervision, using images with classification labels only. We also detail a pretraining initialization strategy for detection architectures using these artificially synthesized samples, before finetuning on real detection data, and experimentally show how this consistently leads to more data efficient models. With the proposed approach, we were able to effectively use only classification data to improve results on the harder and more supervision hungry object detection problem. We achieve results equivalent to those of the full data scenario using only a small fraction of the original detection data for Face, Bird, and Car detection.


I. INTRODUCTION
Training Object Detection architectures requires large amounts of images labeled with bounding boxes. When compared with the more traditional Image Classification task, labeling data for Object Detection is much slower and more expensive. Bounding boxes are also more dependent on human intervention, being more vulnerable to labeling mistakes and biases. Thus, it is of great interest to develop effective ways to train models with fewer labeled samples.
Several techniques can be employed to reduce the dependency on large annotated datasets. Transfer Learning from other tasks [1] (e.g. ImageNet classification) may lead to a good initialization before finetuning on labeled target data. Data Augmentation techniques can be used to generate additional samples by applying random, label preserving transformations to existing ones. Augmentation is effective, but limited in the sense that no object instances beyond those already present in the dataset will ever be seen. Sample Synthesis is a potentially more general approach, in which completely new instances are artificially created. Evidence for the effectiveness of Sample Synthesis has been reported for several Computer Vision tasks [2]-[8]. In this work, we turn our attention to Object Detection Sample Synthesis.
Some works have investigated the use of artificial samples for Object Detection [7], [8]. However, they usually depend on other expensive forms of supervision in order to generate samples. Additionally, they all focus only on how to create samples, while little attention has been given to how best to incorporate these samples during Object Detection training procedures.
In this work, we set out to investigate ways to generate and make use of artificial detection samples that require no expensive supervision. In particular, we propose taking advantage of existing cheaper image classification data, in such a way as to improve data efficiency on the Object Detection problem.
Our method consists in combining classification images with a generative unsupervised technique [9], to build a sample synthesis pipeline, capable of automatically generating an infinite stream of artificial samples with bounding box annotations. In order to avoid expensive supervision, all stages of this synthesis pipeline are trained using classification data only.
On the issue of how best to use these artificial samples, our main finding is that pretraining detection models with artificial samples before finetuning them on real images is very effective. This is in contrast to the simpler approach followed by related works [7], [8], of simply training with mixed real and artificial samples. We thoroughly demonstrate how this simple pretraining approach works as a powerful initialization strategy, resulting in a more data efficient training, which in turn, allows competitive detection results using only a small fraction of the original real labeled detection data.
The contributions of this work are the following: (1) We show that it is possible to automatically generate artificial labeled detection samples using a simple pipeline of already existing techniques, all of which can be trained with only classification level supervision. (2) We show how such artificial samples are a viable way to reduce the dependency of detection models on labeled data. And (3), we propose using these samples as a pretraining initialization strategy for detection models, and experimentally show how this approach leads to more data efficient training.
The remainder of this paper is structured as follows. Section II discusses how existing works deal with the task of sample synthesis, and justifies the choices we made in this project. Section III describes our method. Section IV presents a series of experiments we conducted in order to evaluate and better understand our method. Finally, Sections V and VI present, respectively, discussions and our concluding remarks.

II. BACKGROUND
There are several options to automatically generate artificial labeled samples to train Computer Vision models. A straightforward form of Sample Synthesis is to use Computer Graphics [10] to render instances of objects paired with the respective labels. However, this approach requires heavy human intervention, as one needs access to some graphical model of the objects, which needs to be manually designed most of the time.
Another option is to use some generative model, such as GANs [11], to synthesize image samples. This approach is already well established in the context of Image Classification, as demonstrated for instance in [2], [3]. There are also a few works that attempted GAN based synthesis for Detection and Segmentation on medical [4], [5] and aerial [6] images. However, they all require expensive supervision like bounding boxes or masks for training these generative models, so they are not ideal as a means of reducing the dependency on annotated data.
In the case of Detection, a popular approach is to start from images of the objects of interest, and then crop and paste the object regions on top of random background scenes. For instance, [7] showed improved results on the Pascal VOC Detection dataset [12] simply training with additional artificial images that were created by placing cropped instances of the objects on top of background scenes. Another similar approach [8] was applied to the problem of Instance Detection, a form of Object Detection that involves discriminating between different instances of the same object class. The problem formulation in [8] allowed them to train a segmentation model using masks available for other instances of the same objects.
The major limitation of the above mentioned works is the need for segmentation mask annotations in order to crop the object regions. Although the intention in [7], [8] was not primarily to reduce the need for bounding boxes for general Object Detection, their results suggest that, if the samples could be generated without expensive supervision, it could be possible to perform detection without requiring as many expensive masks or bounding boxes. In this work, we demonstrate how it is possible to use a combination of already existing techniques to generate artificial detection samples, starting from only classification data.
Moreover, a question that is not addressed by existing works is how best to incorporate these artificial samples into the training of regular detection models. In this work, we propose using artificial samples as a form of pretraining initialization, before finetuning the model on real labeled data. We experimentally show how this strategy leads to a more data efficient training.

III. PROPOSED METHOD
This Section presents our proposed method. We first describe our Sample Synthesis pipeline (III-A), which we use to generate artificial samples starting from only classification data. Then, we describe our proposal for using these artificial samples as a pretraining initialization strategy (III-B). Next, we explain how we combine these two ideas to train detection models using fewer bounding box annotations (III-C). Figure 1 illustrates the whole method.

A. Sample Synthesis
Traditionally, artificial detection samples are generated by cropping object instances from existing images, and pasting them on top of background scenes. Existing works [7], [8] do this by using mask annotations to crop the object instances. These masks are obtained from existing segmentation annotations [7], or extracted by a segmentation model previously trained on similar annotations [8]. In order to use only classification supervision, we turned our attention to Unsupervised Segmentation.
Unsupervised Segmentation: Some recent works proposed deep learning based unsupervised segmentation methods that rely only on classification level annotations. For instance, in [13], segmentation is performed through an iterative optimization method. In "Copy-Pasting" GANs [14], unsupervised segmentation could theoretically be achieved as a byproduct of the "object discovery" sub-task, although only results using simplified artificial contexts have been presented there. More notably, Unsupervised Segmentation by Redrawing (ReDO) [9] has demonstrated impressive results on a small set of real world objects (faces, flowers, and birds), using a GAN-inspired adversarial training dynamic which depends only on classification images. We note that, by using any of these segmentation techniques, we could train a segmentation model for the objects of interest, without needing mask or bounding box annotated samples. Then, such a model could be used to automatically segment objects from class annotated images, to be then inserted onto random background images.
We also note that we can use any regular generative image model, such as a GAN [11], to generate the object images, from which the object instances are segmented and then cropped. Standard GAN architectures are already trained on classification style images, so this does not incur any additional supervision penalty on the synthesis pipeline. The advantage of using these "fake" object images, instead of existing real ones is that, in doing so, we can treat the synthesis pipeline as a single component, without having to "carry" a classification dataset around. Additionally, by using a generative model instead of a classification dataset, we can synthesize an infinite stream of artificial detection samples where no object instance will be seen more than once, independent of the size of the original classification dataset.
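The "infinite stream" idea described above can be illustrated with a Python generator. This is only a sketch under stated assumptions: `generate_object`, `segment_object`, and `sample_background` are hypothetical callables standing in for a trained GAN, an unsupervised segmentation model, and a background sampler, and the toy stand-ins below merely produce random arrays so that the example runs on its own.

```python
import itertools
from typing import Callable, Iterator, Tuple

import numpy as np

Array = np.ndarray


def synthesis_stream(
    generate_object: Callable[[np.random.Generator], Array],
    segment_object: Callable[[Array], Array],
    sample_background: Callable[[np.random.Generator], Array],
    seed: int = 0,
) -> Iterator[Tuple[Array, Array, Array]]:
    """Yield (object_image, mask, background) triples without end."""
    rng = np.random.default_rng(seed)
    while True:
        obj = generate_object(rng)        # hypothetical stand-in for GAN sampling
        mask = segment_object(obj)        # hypothetical stand-in for unsupervised segmentation
        background = sample_background(rng)
        yield obj, mask, background


# Toy stand-ins (assumptions for illustration only, not trained models).
def toy_generator(rng: np.random.Generator) -> Array:
    return rng.random((32, 32, 3))


def toy_segmenter(obj: Array) -> Array:
    gray = obj.mean(axis=2)
    return (gray > gray.mean()).astype(np.float32)  # pixels brighter than average


def toy_background(rng: np.random.Generator) -> Array:
    return rng.random((128, 128, 3))


stream = synthesis_stream(toy_generator, toy_segmenter, toy_background)
first_three = list(itertools.islice(stream, 3))  # take a finite slice of the stream
```

Because every object image is freshly sampled, no instance repeats across the stream, which is the property the generative formulation is meant to provide.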
Detection samples synthesized this way will naturally lack real world realism since no coherence between the inserted object and the background image is enforced, as is the case in [7]. Despite the low quality image composition, we experimentally show how pretraining on these "cheap" but abundant samples is very effective for reducing the need for real labeled detection data.

B. Pretraining Initialization
As mentioned above, existing works suggest that artificial samples might help achieve better performing models. In our proposal, we address the question of how best to incorporate these artificial samples into the regular training process of detection models. One could consider a direct approach of simply mixing a certain proportion of these artificial samples with real training images, as done by [7], [8]. Here we investigate a different approach, in which we pretrain detection models on artificial samples before finetuning them on real data. In this regard, one might question whether using exclusively artificial samples to initialize a model could introduce or amplify some bias from the synthesis mechanism. In Section IV-D1, we show how this pretraining initialization strategy works significantly better than the traditional approach of training on mixed real and artificial samples.

C. Complete Method
Figure 1 illustrates our overall strategy. In the top part of the figure, either a real object image or a GAN generated 'fake' image goes through an unsupervised segmentation step, which extracts a segmentation mask of the object. Then, in the bottom part, the segmented objects undergo some simple augmentation operations, and are inserted on randomly chosen background images, at random scales and positions. These augmentations are mirrored on the masks when applicable. The bounding box annotations can be automatically extracted from the masks. The masks are also used to blend the object regions with the background scenes, using straightforward alpha-blending.
The resulting detection samples are used in our proposed pretraining initialization.
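The composition step (pasting a segmented object onto a background, alpha-blending with the mask, and reading the bounding box off the mask) can be sketched as follows. This is a minimal NumPy illustration rather than the implementation used in the experiments: the function names `bbox_from_mask` and `paste_object`, the 0.5 mask threshold, and the inclusive pixel box convention are assumptions, and the random scaling and augmentation steps are omitted. The blend uses the standard alpha-compositing form, output = mask * object + (1 - mask) * background.

```python
import numpy as np


def bbox_from_mask(mask, threshold=0.5):
    """Return (x_min, y_min, x_max, y_max), inclusive, of pixels above threshold."""
    ys, xs = np.nonzero(mask > threshold)
    if ys.size == 0:
        raise ValueError("mask contains no foreground pixels")
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


def paste_object(background, obj, mask, top, left):
    """Alpha-blend obj onto background at (top, left).

    Assumes the object fits entirely inside the background.
    Returns the blended image and the box in background coordinates.
    """
    h, w = mask.shape
    out = background.astype(np.float32).copy()
    alpha = mask.astype(np.float32)[..., None]           # shape (h, w, 1)
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = alpha * obj + (1.0 - alpha) * region
    x0, y0, x1, y1 = bbox_from_mask(mask)
    return out, (left + x0, top + y0, left + x1, top + y1)


# Toy example: a white 4x4 "object" whose mask covers rows 1-2 and columns 1-3.
background = np.zeros((10, 10, 3), dtype=np.float32)
obj = np.ones((4, 4, 3), dtype=np.float32)
mask = np.zeros((4, 4), dtype=np.float32)
mask[1:3, 1:4] = 1.0

image, box = paste_object(background, obj, mask, top=2, left=3)
# box == (4, 3, 6, 4): the mask extent shifted by the paste offset
```

In a full pipeline, `top` and `left` would be drawn at random, and the object and mask would be resized and augmented together before pasting.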
We highlight here that the main goal for this work is to identify how to use classification supervision to reduce the need for the more expensive detection supervision. We opted to pursue this objective through the idea of Sample Synthesis.
We do not claim (or expect) this synthesis pipeline to be the optimal way to generate detection samples. Instead, we propose this pipeline as a way to demonstrate how it is possible to perform an effective sample synthesis using just a simple combination of already existing techniques, followed by a clever use of such samples during training. To the best of our knowledge, Object Detection Sample Synthesis has not been explored with the goal of reducing the need for annotations.

IV. EXPERIMENTS
We evaluate our synthesis pipeline and pretraining initialization strategy by performing detection using three object classes: Faces, Birds and Cars. Our choice of objects and datasets was strongly guided by what we knew recent Unsupervised Segmentation techniques could handle. Nonetheless, the results obtained still provide evidence for the effectiveness of Sample Synthesis based Pretraining using classification data. We expect Sample Synthesis to become more widely applicable as unsupervised techniques naturally improve.

A. Datasets
To synthesize detection samples, we generate object images using GAN [11] models, then segment them using the unsupervised ReDO method [9]. We also apply a small set of random augmentations on these object images. We paste the segmented objects at random on top of background images that were sampled from the Pascal VOC dataset [12], as described in Section III-C.
Faces: The object images were generated using a StyleGAN trained on FFHQ [15]. Unsupervised segmentation was performed using a ReDO model trained on the LFW dataset [16], made available by [9]. We finetune and evaluate our detection models on real samples from the FDDB Faces dataset [17], with train/valid/test partitions equal to 1449/581/815.
Birds: Object images were generated using a DM-GAN [18], trained on the CUB-200-2011 dataset [19]. Unsupervised segmentation was performed using a ReDO model, also trained on CUB, and made available by [9]. We finetune and evaluate our detection models on real samples from the CUB dataset as well, with train/valid/test partitions equal to 10345/1000/443. Note that we are finetuning/evaluating our detection models on the same dataset that was used to train the components of the sample synthesis pipeline. We point out that the DM-GAN [18] and ReDO [9] models were trained using different partitions of the CUB dataset. In order to avoid test leakage through the artificial samples, we use the intersection between the test sets of DM-GAN and Birds ReDO as our test set. Also note that, despite the CUB dataset having bounding box annotations, the ReDO method did not need them for training.
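The leakage precaution for Birds amounts to a set intersection over image identifiers. The snippet below is a small illustration; the identifiers and the function name `leak_free_test_ids` are hypothetical and do not correspond to actual CUB file names or splits.

```python
def leak_free_test_ids(gan_test_ids, redo_test_ids):
    """Keep only images held out by BOTH synthesis components."""
    return sorted(set(gan_test_ids) & set(redo_test_ids))


# Hypothetical held-out splits of the two components.
gan_test = ["img_003", "img_007", "img_010", "img_021"]
redo_test = ["img_007", "img_010", "img_015"]

test_ids = leak_free_test_ids(gan_test, redo_test)
# test_ids == ["img_007", "img_010"]
```

Any image outside this intersection was seen by at least one of the two components during training, so evaluating on it could leak information through the artificial samples.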
Cars: The object images were generated using a StyleGAN [15] trained on LSUN Cars [20]. As [9] did not provide weights for cars, we trained our own ReDO instance with these GAN generated images. We finetuned and evaluated our detection models on the Stanford Cars dataset [21], with train/valid/test partitions equal to 7144/1000/8041.
Figure 2 shows instances of artificial detection images generated by our synthesis pipeline. As expected, these images are far from realistic, but as we demonstrate in Section IV-C, a large number of them can significantly reduce the need for expensive bounding boxes.

B. Training and Evaluation Methodologies
We performed experiments with a Single Shot Detector (SSD) architecture [22], using both MobileNet [23] and ResNet50 [24] networks as backbone CNNs. This detection architecture was trained separately for each object category, both with and without the proposed pretraining initialization. We used the Adam optimizer [25] with a learning rate of 10^-4 and the standard (β1, β2) = (0.9, 0.999).
The models with pretraining initialization were first trained on the stream of artificial samples for 500 iterations for Faces and Cars, and 1000 iterations for Birds, as we noticed the loss stagnates after these quantities. The non-pretrained baseline models received standard initialization instead, with ImageNet weights for the backbone CNN and Glorot/Xavier initialization [26] for the detection heads.
Then, both the baselines and the pretrained models were finetuned on the real data, again for 500 iterations for Faces and Cars, and 1000 iterations for Birds. We evaluated the model on the validation set every 25 iterations, and chose the checkpoint with the best result as the final trained model. The final results are computed on the test subsets, and measured in AP@0.5 IOU, following [12].
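Two ingredients of this protocol are easy to make concrete: the IoU criterion behind AP@0.5, and checkpoint selection from periodic validation. The sketch below is illustrative only. The names `iou` and `finetune_with_selection` are assumptions, `train_step` and `evaluate` are hypothetical callables standing in for an optimizer step and a validation run, the toy validation curve is invented, and the full AP computation is not shown.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def finetune_with_selection(train_step, evaluate, num_iterations, eval_every=25):
    """Train for num_iterations, validating every eval_every steps.

    Returns (best_iteration, best_score) over the validated checkpoints.
    """
    best_iter, best_score = None, float("-inf")
    for it in range(1, num_iterations + 1):
        train_step(it)
        if it % eval_every == 0:
            score = evaluate(it)
            if score > best_score:
                best_iter, best_score = it, score
    return best_iter, best_score


# A detection counts as correct at AP@0.5 only if its IoU with a ground truth box is >= 0.5.
ground_truth = (0, 0, 10, 10)
loose_match = iou(ground_truth, (5, 0, 15, 10))   # 50 / 150 = 1/3, rejected
tight_match = iou(ground_truth, (2, 0, 12, 10))   # 80 / 120 = 2/3, accepted

# Toy validation curve peaking at iteration 300 (invented for illustration).
best_iter, best_score = finetune_with_selection(
    train_step=lambda it: None,
    evaluate=lambda it: 1.0 - abs(it - 300) / 500.0,
    num_iterations=500,
)
# best_iter == 300
```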

C. Main Results
For each model configuration, training was repeated for distinct amounts of real samples, making sure that all model versions were trained for each considered amount with the same subset of real images. We repeated training three times for each model configuration, and report means and standard deviations. Figure 3 shows AP values of the final trained models on the test sets as we increase the size of the subset of real images. Tables I and II detail numerical values for some subset sizes, for the MobileNet and ResNet50 backbones, respectively.
As we can see, the models that were initially pretrained on artificial samples achieved either comparable or superior results for all quantities of real samples and on both architectures. We notice that, depending on the object and architecture, the non-pretrained models can close the gap if enough real samples are used, especially with the ResNet50 backbone. But most importantly for our purposes, this advantage is significantly larger, and always present, when considering very few samples. These results support our hypothesis that initializing the models by pretraining them on artificial samples leads to a more data efficient training.

D. Ablation Experiments
1) Importance of Pretraining Initialization: The first, and most important ablation experiment, compared the proposed strategy of pretraining + finetuning with the common approach of training on real and artificial samples mixed together, as done for instance in [7]. For this, we cannot use an infinite stream of artificial samples, as that would drown out any effect from the finite real data. Therefore, we fixed 100 real samples while varying the proportion of artificial ones, as the influence of pretraining is more noticeable on these low quantities of real samples.
We trained one model using our pretraining + finetuning strategy and another using the whole mixed set, for each proportion of artificial samples. For fairness, and to compensate for a possible "warm-up" effect in the pretraining case, we trained the mixed data models for the sum of the number of iterations in the pretraining and finetuning: 1000 steps for Faces/Cars and 2000 steps for Birds. For each proportion of artificial samples, we used the same artificial and 100 real samples for both pretraining and mixed data cases. We again repeated the training of each model version three times and report means and standard deviations, measured in AP@0.5, following [12]. Results are shown in Tables III and IV. In all cases, pretraining followed by finetuning gives better results than training on real and artificial data mixed together. But regardless of that, both options were always either matched or surpassed by pretraining on the infinite stream of artificial samples.
2) Importance of Unsupervised Segmentation: The next experiment aimed at evaluating the importance of properly segmenting the object instances before pasting them on the background scenes. That is, we tried to analyse the influence of the unsupervised ReDO segmentation step [9]. For this, we created another stream of artificial samples, without using the segmentation step, but instead pasting the whole generated image frames on the background images, and treating the entire pasted frame as the bounding box. We call this style of samples "Naive Pasting". Detection samples generated using this strategy are shown in Figure 4.
We trained a set of models following our pretraining initialization strategy, but using these naive artificial samples, and compared them against the models trained in the main experiments (Tables I and II). We again used 100 real samples for finetuning, and report results on the test sets. Again, we repeated these experiments three times for each model configuration. Results are presented in Table V. As we can see, the naive samples already lead to a significant improvement over the non-pretrained models. However, in all cases, they either match, or are outperformed by the models pretrained on samples generated with segmentation, with the largest gap, around 11%, occurring for the Birds class on the MobileNet backbone.
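For contrast with mask-based composition, "Naive Pasting" can be sketched in a few lines: the whole object frame overwrites a background region, and the frame itself becomes the box. The function name `naive_paste` and the inclusive pixel box convention are assumptions made for this illustration.

```python
import numpy as np


def naive_paste(background, obj, top, left):
    """Paste the full object frame with no mask; the frame is the bounding box.

    Assumes the frame fits entirely inside the background.
    Returns the image and an inclusive (x_min, y_min, x_max, y_max) box.
    """
    h, w = obj.shape[:2]
    out = background.copy()
    out[top:top + h, left:left + w] = obj      # hard overwrite, no blending
    return out, (left, top, left + w - 1, top + h - 1)


background = np.zeros((10, 10, 3), dtype=np.float32)
frame = np.ones((4, 4, 3), dtype=np.float32)

image, box = naive_paste(background, frame, top=2, left=3)
# box == (3, 2, 6, 5): the full 4x4 frame, regardless of where the object sits inside it
```

The box therefore includes whatever background pixels the generated frame contains around the object, which is the looseness that the segmentation step is meant to remove.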
These results show that the unsupervised segmentation step is very important for almost all of our cases, although the exact advantage varies widely across dataset and architecture. Further investigation is needed in order to understand which factors are more significant for the final results.
3) Importance of GAN generation: So far, we have opted to use GAN generated object images at the first stage of our synthesis pipeline instead of real ones. This allowed us to generate an infinite stream of artificial samples, where no object instance appears more than once, while also having the convenience of not requiring us to manipulate a classification dataset during pretraining. However, a natural question is whether we could achieve better results by synthesizing samples starting from real object images instead.
To answer this question, we created another stream of artificial samples, using object images from the datasets that were used to train the GANs used for the main experiments, namely the FFHQ faces [15], non-test samples from the CUB-200-2011 dataset [19] for birds, and LSUN Cars [20]. Next, we trained a set of models following our pretraining strategy over this new stream of artificial samples, and compared them against the results from the main experiments. We again used detection quality by using artificial samples composed from GAN generated images. We expect that, as unsupervised segmentation techniques improve and generative image techniques become more data efficient, the advantage of this infinite stream formulation will become more evident, and therefore, artificial samples pretraining will be able to deal with even more extreme low data situations.

E. Experiments on WIDER Face
Finally, we also evaluated our strategy on a more challenging scenario, using a more advanced model. We performed experiments on the WIDER Face dataset [27], a state of the art benchmark for face detection, using the RetinaFace detector [28], a modern architecture designed specifically for face detection. At the time of this writing, RetinaFace was achieving state of the art results on WIDER. Pretraining was done using artificial Face samples generated as described previously, and finetuning was done with varying quantities of real samples.
Results are shown in Table VII, and as can be seen, the models that were pretrained achieved superior results, with the advantage again being larger on very few real samples (even on the harder detection subset), with a gap of ∼10% for 100 real samples. This further supports our hypothesis that pretraining reduces the need for bounding boxes.
We note, however, a consistent disadvantage of our pretraining initialization approach as we include all the real data available. Further investigation is needed in order to understand what influence and biases are brought in by artificial samples when real data is already abundant.

V. DISCUSSIONS
The experimental results have shown that artificial samples are a promising direction towards more sample efficient detection models. However, in the presented formulation, our pipeline still has some limitations. First of all, our method still requires a significant dataset of classification images from objects of interest, either to be used directly or to be used for training a GAN. We expect that, as generative image models improve in terms of sample efficiency, this requirement will eventually be relaxed. A recent improvement in this regard comes from [29].
Second, the fact that we have a "pipeline" of steps, with GAN based generation, unsupervised segmentation, and random pasting, can be a significant source of noise on the generated samples. Any development which succeeds in combining these steps into an end-to-end architecture can potentially improve the detection sample generation process.
Despite these limitations, the experiments have provided evidence that our pretraining initialization strategy is a promising way of taking advantage of artificial samples. We expect this strategy to benefit from any improvement on the above listed limitations.

VI. CONCLUSION
In this work, we showed how it is possible to generate artificial labeled detection samples starting from only classification level supervision, by using a simple combination of already existing techniques. Additionally, we proposed using these artificial samples to pretrain detection models before finetuning them on real labeled data. With this approach, using only a small fraction of the available real bounding box annotated data for finetuning, we obtained detection performance on par with that achieved by the models trained on the whole real data. Therefore, we effectively managed to take advantage of the cheap and abundant Classification data in order to achieve competitive results on the harder and more supervision hungry Detection problem.
As a final note, the performance gap between pretrained-only (the horizontal dashed lines in Figure 3) and pretrained + finetuned models indicates the importance of model finetuning with real data. We expect that, as generative image models such as GANs continue to improve, and it becomes possible to generate increasingly realistic artificial samples, the need for real data will be further reduced.