PhosopNet: An improved grain localization and classification by image augmentation

Rice is a staple food for around 3.5 billion people in eastern, southern and south-east Asia. Before it becomes rice, the rice-grain (grain) is husked and/or milled by a milling machine. Grain quality depends on the purity of a particular grain species, that is, on the absence of mixing between different species. To meet the demand for image-based grain purity inspection, many researchers have proposed grain classification methods (sometimes with localization) based on convolutional neural networks (CNN). However, those methods require a large number of labels, which are too expensive to collect manually. In this paper, image augmentation (rotation, brightness adjustment and horizontal flipping) is applied to generate more grain images from less data. The results show that image augmentation improves the performance of both the CNN and the bag-of-words model. Going forward, grain recognition can be done with a smaller number of labeled images.


INTRODUCTION
A bowl or dish of cooked rice is a familiar part of the culinary culture in many Asian countries, e.g., Japan, China, India, Bangladesh, Pakistan and the ASEAN countries. According to the oldest historical evidence, rice-grain (or grain) was grown along the Yangtze river in China [1] more than 10,000 years ago. Traditional grain agriculture was originally practiced on volcanic soil in Kyushu, Japan [1] during the Yayoi period. The Mekong river [1] (shared by Vietnam, Laos, Thailand, Myanmar and Cambodia) has also long been one of the most important grain-cultivation regions. It is not surprising that most winners of the world's best rice conferences within the last 4 years came from the region sharing the Mekong river: Jasmine [2] (Thailand, 2016-2017), Malys Angkor [2] (Cambodia, 2018) and ST25 [2] (Vietnam, 2019). Traditionally, rice in Japan was linked to the belief in a goddess [1] who sowed grain in the fields of heaven. In Indian culture, as in Pongal [1], rice was an offering to the god as a thanksgiving. Similarly, Thai, Cambodian and Balinese cultures worship a rice mother [3,4], known as Mae Phosop, Po Ino Nogar and Dewi Sri, respectively.
Economically, rice is not only a staple food but also an important agricultural product for 3.5 billion people in Asia (half of the world population). Before it becomes rice, the grain is husked and/or milled by a milling machine. There are many grain taxonomies, and Asian grain varieties (both paddy and glutinous grain) are among the most diverse in the world. In the real market, grain species differ in physical, genetically determined features (such as size, texture and shape), which leads to different prices. The best-known trick in grain (and rice) trading is to mix a pure grain species product with other species [5], so that a heavier tonnage can be sold at the higher price. Such impurity by mixing violates the product quality.
TAS 4004-2017, one of the Thai agricultural standards [6], defines grain purity inspection by randomly sampling 5% of each ton. In general, the validation of physical grains still relies on human vision as manual labor.
To extend the previous works (both bag of words [9][10][11][12][13][14][15][16][17][18][19][20][21][22][23] and CNN [24][25][26][27][28][29]), this paper proposes PhosopNet to do more with less labeled data by image augmentation, as shown in Figure 1. The augmentation increases the size and variety of the training rice-grain (or grain) data by grain rotation at different angles, brightness adjustment with a power-law distribution and horizontal flipping along the x-axis. For testing, all grains are localized (detected) by a mask region convolutional neural network (Mask R-CNN). Each grain is then classified by a densely connected convolutional neural network (DenseNet). Note that the name "Phosop" is dedicated to the rice mother [3,4] of ancient Thai culture, who produced rain over the land in order to grow the grain seeds. The contributions of PhosopNet can be summarized as follows:
− The proposed augmentation operations can generate on the order of ten thousand or a few thousand training grains from roughly a thousand or a few hundred raw labeled grains, respectively.
− To do more with less data, PhosopNet achieves high localization and classification performance using only a small number of labeled grains.
− For grain recognition, image augmentation improves not only the convolutional neural network but also the bag of words.
This paper is organized as follows. Related works are in section 2. Image augmentation and the learning model are described in sections 3 and 4. Section 5 presents the experimental settings and results. The conclusion is in section 6.

RELATED WORKS
The history began with the "Japanese rice grading problem" [7], in which the authors first introduced computer vision as the main solution in 2002. Later, many papers addressed rice recognition. These papers can be categorized by method [8] into 2 groups: bag of words and convolutional neural network.

Bag of words
For the traditional bag of words, Japanese grain grading was originally introduced with handcrafted features and a neural network [7] as a supervised model. A neural network [19] was the main classifier for shape features [17], and principal component analysis (PCA) was used for dimensionality reduction of the features [9] in both morphology and multiple color channels. The results showed that the neural network with PCA provided a better recognition rate. Beyond grain recognition, the neural network was also found to be highly accurate in germination prediction [18]. However, the complexity of the neural network was a main problem in terms of speed and resource consumption. To avoid the long processing time and resource cost of neural learning, many statistical and image processing techniques were proposed [12][13][15] as alternatives. Zernike polynomials, which are orthogonal, were computed to quickly extract features [10], and threshold-based segmentation [11] was applied to the physical grain images. Nevertheless, the neural network still gave the highest accuracy. This remained the case until 2004, when the support vector machine (SVM) was shown to give higher performance (in speed and time) than the multi-layer perceptron (MLP) [35], especially with a larger number of target classes. Moreover, like the convolutional neural network (CNN), SVM has a transfer adaptation learning (TAL) mechanism [36], called adaptive SVM (Ada-SVM) [37]. SVM for grain recognition was used to learn features from colors, morphology and texture with sparse coding [16]. One highlight in bag of words was based on SVM [14]: the saturation channel of the hue-saturation-value (HSV) model served as a threshold for segmentation; color was described by histograms of the green (−) to red (+) and blue (−) to yellow (+) channels of the International Commission on Illumination Lab (CIELAB) model; shape was described by a histogram of curvature; and texture was described by scale-invariant feature transform (SIFT) [38], speeded-up robust features (SURF) [39] and root-SIFT [40].
In 2012, a convolutional neural network (CNN) with the AlexNet architecture [41] won the ImageNet large scale visual recognition challenge (ILSVRC) and outperformed the bag-of-words models, especially on larger data volumes [32,33]. Many computer vision papers have gradually shifted from the traditional bag of words to the CNN paradigm [8] to solve object localization and classification problems in big data. Arguably, when industrial requirements concern user experience, the deployment environment and ease of software maintenance, it is sometimes better to use histogram of oriented gradients (HoG) [42] with traditional machine learning as a bag-of-words model [20][21][22], as in open-world grain inspection [23].

Convolutional neural network
For the convolutional neural network (CNN), grains were acquired with a calibrated hyperspectral camera and fed to a CNN [24][25], which incurred a considerable data acquisition cost. On those hyperspectral images, CNN was shown to outperform traditional machine learning such as k-NN and SVM [25]. For digital images, GoogLeNet [43] (as Inception v4 [44]) was used as the CNN architecture for germ integrity [26]. Later, CNN architectures were compared under the same environment [34] for grain image classification, and densely connected convolutional networks (DenseNet) [45] showed the highest accuracy, higher than ResNet [46], GoogLeNet [43], neural architecture search network (NasNet) [47] and visual geometry group (VGG) [48]. Moreover, a deeper model did not guarantee more correct grain image classification (for example, VGG-16 was more accurate than VGG-19 [49]). Not only classification but also localization is necessary for grain quality inspection. As a highlight, Mask R-CNN [50] with ResNet [46] was used for grain localization and classification (called MIMR [29]). However, a large amount of manual labeling of many small grains [51] was still necessary. To do more with less data, this paper proposes PhosopNet, whose image augmentation generates thousands of grain samples from hundreds, instead of manually labeling ten thousand small grain images. As an extension of previous works, computer vision applied to rice or grain problems (both bag of words [9][10][11][12][13][14][15][16][17][18][19][20][21][22][23] and CNN [24][25][26][27][28][29]) can achieve high performance when trained on less labeled grain data.

Brightness adjustment
For brightness adjustment, a power-law distribution is used to map each pixel at position (x, y) to an image brightness value, controlled by a gamma threshold (γ) and a constant (c). A lower γ value gives a darker image and a higher one gives a lighter image. The constant c is normally set to 1.
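The power-law adjustment described above can be sketched in a few lines of plain Python. The exact formula of the paper is not recoverable from this text, so the sketch assumes the common form out = c · in^(1/γ) on intensities scaled to [0, 1]; that exponent convention is an assumption, chosen only because it reproduces the stated behavior (lower γ darker, higher γ lighter). The function name adjust_brightness and the nested-list image format are hypothetical.

```python
def adjust_brightness(image, gamma, c=1.0):
    """Power-law brightness adjustment on an 8-bit grayscale image.

    image: list of rows, each a list of ints in [0, 255].
    ASSUMPTION: out = c * in ** (1 / gamma) on intensities scaled to [0, 1];
    the exponent convention used in the paper is not known.
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    adjusted = []
    for row in image:
        new_row = []
        for pixel in row:
            normalized = pixel / 255.0
            value = c * normalized ** (1.0 / gamma)
            # Rescale to 8 bits and clip to the valid range.
            new_row.append(max(0, min(255, round(value * 255.0))))
        adjusted.append(new_row)
    return adjusted
```

With c = 1, black and white pixels are left unchanged and only the mid-tones move, so γ = 1 returns the input image as is.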

Horizontal flipping
For horizontal flipping (

LEARNING MODEL
Convolutional neural networks (CNN) achieve higher performance than the conventional bag of words [32,33], especially on large volumes of data. For bag of words, the positions of all grains are localized and transformed into numerical values by handcrafted feature extraction [52] (e.g., SIFT [38], SURF [39] or HoG [42]); those values are then classified using traditional supervised machine learning (e.g., MLP [7,35] or SVM [14,35]). For the CNN, all grains within an image are localized by CNN detection (e.g., Faster R-CNN or Mask R-CNN); each grain object is directly represented in terms of features and classified by CNN classification (e.g., ResNet [46] or DenseNet [45]). Moreover, CNN supports transfer adaptation learning [36] with weights pre-trained on the COCO dataset, so that the model representation can be retrained many times.
TELKOMNIKA Telecommun Comput El Control

Localization
To localize the grains (or rice-grains) within an image, the proposed PhosopNet uses a mask region convolutional network (Mask R-CNN [50]) based on DenseNet [45] to detect all grains with their positions. Mask R-CNN is a region-proposal-based (two-stage) [53] detection pipeline designed to preserve spatial correspondence down to the instance (or pixel) level. Although two-stage pipelines were shown to reach higher average precision (AP) than one-stage pipelines [33] (e.g., you only look once (YOLO) and single shot multibox detector (SSD)), one-stage detection is faster and is mostly used in real-time applications. For grain recognition, the texture within a grain object is small and very similar between species, so the required localization accuracy calls for two-stage detection. The two-stage detection pipeline originated from R-CNN [54], which first introduced the use of regions as CNN features. However, R-CNN suffered from expensive and slow training of a support vector machine (SVM) for the localization of all grains. As an improvement, Fast R-CNN [55] used region of interest (RoI) pooling instead of unorganized RoIs, and a soft-max loss instead of the full SVM classifier. Later, the region proposal network (RPN) and multi-reference detection were the main contributions of Faster R-CNN [56], which resolved the redundancy and bottleneck of Fast R-CNN. Since plain Faster R-CNN cannot handle pixel-wise instances in grain localization, Mask R-CNN [50] extended Faster R-CNN and Fast R-CNN with a feature pyramid network (FPN) [57] for feature fusion, RoI alignment and bilinear upsampling. To identify the boundary of each grain, Mask R-CNN uses the RPN to generate the bounding box of each object in the first stage and predicts the class in parallel in the second stage.
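RoI alignment, as introduced with Mask R-CNN [50], reads the feature map at fractional positions by bilinear interpolation instead of rounding region coordinates to the grid. The following minimal sketch shows only that interpolation step on a nested-list feature map; it is not the Mask R-CNN implementation, and the function name bilinear_sample is hypothetical.

```python
import math


def bilinear_sample(feature_map, y, x):
    """Sample a 2-D feature map at a fractional (y, x) position."""
    height, width = len(feature_map), len(feature_map[0])
    # Clamp the query point into the valid grid.
    y = min(max(y, 0.0), height - 1.0)
    x = min(max(x, 0.0), width - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighboring rows, then along y.
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy
```

For small objects such as grains, a rounding error of one feature-map cell can cover a large share of the object, which is one plausible reason why sub-cell sampling matters here.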

Classification
For grain (or rice-grain) classification, the densely connected convolutional network (DenseNet) [45] is the main architecture in the proposed PhosopNet, and it enables transfer domain learning. Historically, the visual geometry group network (VGGNet) [48] used only 3x3 convolutional kernels. In contrast to AlexNet [41], it avoided larger kernel sizes (such as 5x5 or 7x7), which lead to a larger model with too many parameters. Moreover, too large a stride makes the network lose useful features from the lower layers. Although VGGNet showed that deeper networks obtain better performance, greater depth was later found to cause vanishing and exploding gradients, which were addressed by skip connections in the residual network (ResNet) [46]. Unfortunately, most architectures are either hierarchical (e.g., AlexNet [41], VGGNet [48], ResNet [46]) or parallel (e.g., GoogLeNet [43]), which prevents the low-level grain features from being reused in the high-level layers. As a solution, DenseNet also sends the feature maps from previous layers to the following convolutional blocks. Moreover, transition layers after the dense layers reduce the number of feature maps, so that the shallow layers focus on low-level features and the deeper layers focus on high-level features. The DenseNet architectures are shown in Table 1.
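The effect of feature reuse and transition layers on the number of feature maps can be checked with simple arithmetic. The sketch below assumes the DenseNet-121 configuration from the DenseNet paper [45] (block sizes 6, 12, 24 and 16, growth rate 32, 64 initial channels, compression 0.5); which variant PhosopNet uses is given in Table 1 and is not assumed here. The function name densenet_channels is hypothetical.

```python
def densenet_channels(block_layers, growth_rate=32, init_channels=64,
                      compression=0.5):
    """Return the number of feature maps at the end of each dense block.

    Every layer in a dense block appends `growth_rate` feature maps to the
    concatenated input; every transition layer between blocks keeps a
    `compression` fraction of the channels.
    """
    channels = init_channels
    per_block = []
    for index, num_layers in enumerate(block_layers):
        channels += num_layers * growth_rate        # dense concatenation
        per_block.append(channels)
        if index < len(block_layers) - 1:           # no transition after the last block
            channels = int(channels * compression)
    return per_block


# ASSUMPTION: DenseNet-121 block sizes, used only for illustration.
print(densenet_channels([6, 12, 24, 16]))  # [256, 512, 1024, 1024]
```

Without the transition layers the channel count would grow monotonically with depth; halving it between blocks keeps the concatenated input of later layers manageable.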

Transfer adaptation learning
Transfer adaptation learning [58] makes it possible to transfer knowledge from one training task to another. In the first training, the pre-trained weights are used to initialize the network. The source domain contains useful grain features that are reused when retraining a second time. Technically, all weights from the source domain can be reused and retrained with new labeled grain data (and sometimes with their target classes). The benefit of transfer adaptation learning in grain recognition is that retraining can be performed many times. This allows small sets of labeled grain data to be trained into the model iteratively, instead of one-time (big-bang) training on large-scale data. Furthermore, transfer adaptation learning [36] can be divided into transfer learning (TL) and domain adaptation (DA), as shown in Figure 3.
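The reuse of weights between training stages can be illustrated with a toy sketch. It follows the distinction drawn in the conclusion of this paper: the next stage counts as domain adaptation when the target classes stay the same and as transfer learning when new target classes appear. The dictionary layout, the function name prepare_next_stage and the small random initialization of new class weights are all hypothetical simplifications, not the authors' implementation.

```python
import random


def prepare_next_stage(pretrained, next_classes, feature_dim, seed=0):
    """Initialize the next training stage from previously trained weights.

    pretrained: {"backbone": <any object>, "head": {class_name: [weights]}}
    Returns (weights, mode).
    """
    rng = random.Random(seed)
    previous_classes = set(pretrained["head"])
    head = {}
    for name in next_classes:
        if name in previous_classes:
            # Known class: reuse the learned classifier weights.
            head[name] = list(pretrained["head"][name])
        else:
            # New class: start from small random weights (an assumption).
            head[name] = [rng.uniform(-0.01, 0.01) for _ in range(feature_dim)]
    if set(next_classes) == previous_classes:
        mode = "domain adaptation"
    else:
        mode = "transfer learning"
    # The feature extractor is always carried over for retraining.
    return {"backbone": pretrained["backbone"], "head": head}, mode
```

In both modes the backbone is carried over unchanged as the starting point, which is what allows retraining to be repeated as more labeled grains arrive.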
For transfer learning ( (•)), the pair of grain feature and class in source

EXPERIMENTAL SETTINGS AND RESULTS
Motivated by the real-world inspection problems in the industry [5,6], this section presents the experimental results and discussion of PhosopNet. Image augmentation with transfer adaptation learning was used to increase the data volume and variety. The details are organized into 6 main issues.

Datasets
The raw grains with their target classes in this experiment were divided into 3 different paradigm settings according to the rice-grain standard inspection [5,6], named "Phosop i-th" (Table 2). The raw samples were classified and sent by a grain inspection laboratory. The physical grains were captured as digital images, which formed the primary dataset for training the supervised model. The grains were placed on a black background. Within an image, each row contained 10 grains of the same target class. The distance between the image and camera positions was 25 cm.

Experimental settings
For the experimental settings, PhosopNet was a supervised learning model. Mask R-CNN [50] and DenseNet [45] were applied for localization and classification, respectively. The supervised model consisted of training and testing.

Training
All grains in each row were laid in the same orientation. The image and camera were aligned vertically, and the distance between them was 25 cm. Each row referred to one target class with 10 grain samples, as shown in Figure 4. For labeling, all cropped grains in each row were labeled one by one in a text file and used to train the CNN-based supervised model, where grains in the same row shared the same target class. To do more with less data, each grain was further augmented by rotation, brightness adjustment and horizontal flipping to increase the dataset size.
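The geometric part of this augmentation can be sketched on a nested-list image. The sketch assumes right-angle rotations only and combines each rotation with an optional horizontal flip, giving eight variants per grain; the paper does not state which angles were used or how the operations were combined, so this combination is an assumption and not the authors' procedure. Brightness adjustment is left out, and all function names are hypothetical.

```python
def flip_horizontal(image):
    """Mirror each row left to right."""
    return [list(reversed(row)) for row in image]


def rotate_90(image):
    """Rotate a 2-D image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment(image):
    """Return rotated and flipped variants of one grain image.

    ASSUMPTION: four right-angle rotations, each with and without a flip.
    """
    variants = []
    current = [list(row) for row in image]
    for _ in range(4):
        variants.append(current)
        variants.append(flip_horizontal(current))
        current = rotate_90(current)
    return variants
```

Every variant inherits the class label of the row its source grain came from, so no additional manual labeling is needed for the generated images.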

Testing
Grains could be laid in any orientation, but they should not overlap one another. The image and camera positions could be vertical or non-vertical, with any background color. The distance between image and camera could vary according to the real-world inspection. All grains in any orientation were detected and regenerated in the same orientation at new positions, as shown in Figure 5. For testing, Mask R-CNN [50] first localized all grains within the image. Each grain was then classified by DenseNet [45]. With the help of image augmentation, the number of manually labeled training grains could be smaller than the number of testing grains. For the localization evaluation, the intersection over union (IoU) threshold between proposed locations and the associated ground-truth labels was set to 50%, and the grain objectness localization performance was evaluated with the mean average precision (mAP) metric [53]. Likewise, the accuracy metric was used to measure classification correctness [33].
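The two evaluation ingredients named above, IoU matching at a 50% threshold and classification accuracy, can be sketched as follows. Boxes are assumed to be (x1, y1, x2, y2) tuples and predictions are assumed to be already sorted by confidence; both are assumptions, and the sketch stops at counting true positives rather than computing the full mAP. The function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_true_positives(predictions, ground_truths, threshold=0.5):
    """Greedily match each prediction to at most one unmatched ground truth."""
    matched = set()
    true_positives = 0
    for pred in predictions:
        best_iou, best_index = 0.0, None
        for index, truth in enumerate(ground_truths):
            if index in matched:
                continue
            overlap = iou(pred, truth)
            if overlap > best_iou:
                best_iou, best_index = overlap, index
        if best_index is not None and best_iou >= threshold:
            matched.add(best_index)
            true_positives += 1
    return true_positives


def accuracy(predicted_labels, true_labels):
    """Fraction of grains whose predicted class equals the true class."""
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)
```

Each ground-truth grain can be matched only once, so duplicate detections of the same grain count as false positives under this scheme.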

Phosop-1: Purity between glutinous and paddy grain
Purity between glutinous and paddy grain is one of the main industrial problems in grain inspection. In physical appearance, the glutinous grains were both fatter and longer than the paddy grains. For the Phosop-1 problem, augmentation improved localization by 32% and classification by 31%, as shown in Table 3. The augmentation operations (rotation, brightness adjustment and horizontal flipping) increased the data size from 300 grains to 2,400 grains, which helped Mask R-CNN [50] localize the grain objects better thanks to the larger size and variety of the training image data. To do more with less data by augmentation, glutinous and paddy grain in 500 testing grains were classified with 100% accuracy, using only 300 manually labeled grains.
Table 3. Phosop-1: improvement by augmentation using 500 testing grains

Phosop-2: Paddy grain grading
According to the paddy grain standard inspection, grain grading was also checked visually by grain size, dividing grains into high quality (long-paddy grain) and low quality (short-paddy grain). Furthermore, glutinous grains, which are fatter than both long-paddy and short-paddy grains, were often mixed into paddy grain products. Regarding the physical difference between long-paddy and glutinous grain, most long-paddy grains were as long as or longer than glutinous grains, but the glutinous grains were clearly fatter. Both long-paddy and glutinous grains were longer than short-paddy grains. As an important limitation, the length of some paddy grain species (like Phitsanulok 80, Chiang Phattalung and Sang Yod Phattalung) occasionally fell halfway between the short-paddy and long-paddy grades, which caused an overfitting error of 2% in the Phosop-2 model. As with Phosop-1, augmentation improved both localization and classification (as shown in Table 4); the data size was increased from 450 to 3,600 training grains.

Phosop-3: Grain species classification
For seed growing, grain species purity is very important to farmers because different grain species lead to different prices and production volumes. From Tables 3-5, not only the augmentation operations but also a larger amount of data improved the Mask R-CNN localization performance to almost 100%. Using only 1,320 manually labeled grains covering 11 species, Phosop-3 achieved an accuracy of 94% with the help of augmentation. However, some grains with very similar appearance, like Dawk Mali 105, Pathum Thani 1 and RD 79, could not be distinguished even by expert inspection and were therefore difficult for the supervised model to classify. Furthermore, PhosopNet is a transfer learning architecture that can serve as a source domain and be transferred to learn more species or samples in the next target domain.
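The dataset sizes reported for the three settings can be cross-checked with a short calculation. The Phosop-1 and Phosop-2 figures come directly from the text; the projected Phosop-3 size assumes that the same growth factor applies there, which the text does not state. The function and variable names are hypothetical.

```python
def augmentation_factor(raw_count, augmented_count):
    """Ratio between the augmented and the manually labeled dataset size."""
    return augmented_count / raw_count


phosop1_factor = augmentation_factor(300, 2400)   # Phosop-1 figures from the text
phosop2_factor = augmentation_factor(450, 3600)   # Phosop-2 figures from the text
grains_per_species = 1320 // 11                   # Phosop-3: 1,320 grains, 11 species

# ASSUMPTION: the Phosop-1 factor carries over to Phosop-3.
phosop3_projected = int(1320 * phosop1_factor)

print(phosop1_factor, phosop2_factor)             # 8.0 8.0
print(grains_per_species, phosop3_projected)      # 120 10560
```

Both reported settings grow by a factor of eight, and under the stated assumption Phosop-3 would reach roughly ten thousand training grains from 120 labeled grains per species.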

Experimental comparisons
The proposed PhosopNet was compared with the previous highlight paper MIMR [29], which was based on ResNet-50 [46] for classification. In contrast, PhosopNet used DenseNet for classification together with image augmentation to increase the number of grains in the training set, instead of full labeling by humans. Both MIMR and PhosopNet localized grains with Mask R-CNN, which had already been shown to give the highest mAP for object localization (compared with other two-stage detectors, e.g., R-CNN [54], Fast R-CNN [55], Faster R-CNN [56]), especially for small objects (like grains) within an image. From Table 6, MIMR [29] had no augmentation to increase the dataset size, which led to lower accuracy when trained on less data. Another reason was that ResNet has only skip connections, while DenseNet [45] has dense blocks between the layers that can easily reuse the feature maps from the low-level layers. For economic reasons, grain inspection based on the bag-of-words model (traditional machine learning with feature extraction) is still required [20][21][22] in open-world industry, such as iRSVPred [23]. Previous papers [29] have already shown high correctness (higher than 0.7) on two problems: purity between glutinous and paddy grain (Phosop-1) and paddy grain grading (Phosop-2). The bag-of-words models still had a problem with many target classes, like the 11 classes in grain species classification (Phosop-3). Furthermore, some grain species (like Dawk Mali 105 and Pathum Thani 1) looked very similar. As a solution, image augmentation operations could improve the classification accuracy with a large number of images and target classes. The most frequently used traditional machine learning algorithms in computer vision were the support vector machine (SVM) and the multi-layer perceptron (MLP). SVM was already shown to be stronger than MLP, especially with a larger number of target classes. SVM also has adaptive SVM (Ada-SVM) [35] as a transfer adaptation learning mechanism like CNN. For localization, most feature extraction algorithms originated from the scale-invariant feature transform (SIFT) [38]. There are many variants of SIFT, e.g., PCA-SIFT (SIFT with dimension reduction by principal component analysis) [8], speeded-up robust features (SURF) [39], root-SIFT (SIFT with ℓ1-normalization and square root) and histogram of oriented gradients (HoG) [42]. From Table 7, not only the convolutional neural network but also the bag-of-words model was improved by image augmentation, where HoG with SVM provided the highest accuracy of 84%.
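The root-SIFT variant mentioned above is a simple post-processing step on a descriptor: ℓ1-normalization followed by an element-wise square root. A minimal sketch, assuming a non-negative descriptor given as a plain list (the function name root_sift and the small eps guard against division by zero are hypothetical):

```python
import math


def root_sift(descriptor, eps=1e-12):
    """L1-normalize a non-negative descriptor, then take the square root."""
    l1_norm = sum(abs(value) for value in descriptor) + eps
    return [math.sqrt(abs(value) / l1_norm) for value in descriptor]
```

After this transform the descriptor has unit Euclidean norm, so it can be passed to a classifier such as SVM in the same way as the original SIFT vector.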

CONCLUSION
To address the expensive labeling of many small grains, the proposed PhosopNet achieves high performance in grain localization and classification while training on less labeled data. Augmentation is the underlying technique, generating more grain data by rotation, brightness adjustment and horizontal flipping. PhosopNet uses Mask R-CNN for grain localization and DenseNet for grain classification. DenseNet is a transfer learning architecture that consists of transfer learning (learning some data with target classes in one stage and more data with new target classes in the next stage) and domain adaptation (learning some data with target classes in one stage and more data with the same classes in the next stage). According to real-world grain standard inspection, the experiments are divided into 3 groups: Phosop-1 for glutinous and paddy grain classification, Phosop-2 for glutinous, long-grain paddy and short-grain paddy classification, and Phosop-3 for classification of 11 grain species. Moreover, augmentation improves not only the convolutional neural network but also the bag of words. The main finding is that less labeled data can achieve high correctness in both localization and classification. Shortcomings such as similar grain appearance may be alleviated by pseudo labeling (or self-supervision), in which some labeled data is used to train the learning model and other unlabeled data is later classified and pseudo-labeled by the model. As an outlook, seed recognition (with both CNN and bag of words) for rice-grains, weeds or beans will no longer need an iterative manual labeling process by human labor for training on large numbers of small seeds.