Nguyen, Vinh Khuong
Vo, Huynh Quang Nguyen
2021-06-07
<p>1. Introduction</p>
<p>These files contain a proposed benchmarking implementation for evaluating whether a hardware setup is suitable for complex deep learning projects.</p>
<p>2. Scope</p>
<ul>
<li>The benchmark evaluates the performance of a setup with a single CPU, a single GPU, RAM and storage. The performance of multi-CPU, multi-GPU or server-based setups is not included in our scope.</li>
<li>The benchmark is built on the <strong>Anaconda</strong> distribution of Python and the <strong>Jupyter Notebook</strong> computational environment. The deep learning models used in this benchmark are implemented using the <strong>Keras</strong> application programming interface (API).</li>
<li>Our goal is to develop a verified approach to hardware benchmarking that is quick and easy to use. To do so, we provide benchmarking programs as well as an installation guide for Anaconda and the packages that support deep learning.</li>
</ul>
<p>3. Evaluation metrics</p>
<p> There are various metrics to benchmark the performance capabilities of a setup for deep learning purposes. Here, the following metrics are used:</p>
<ol>
<li><strong>Total execution time</strong>: the <strong>total execution time</strong> includes both the <strong>total training time</strong> and the <strong>total validation time</strong> of a deep learning model on a dataset after a defined number of epochs. Here, the number of epochs is 100. The lower the <strong>total execution time</strong> the better.</li>
<li><strong>Total inference time</strong>: the <strong>total inference time</strong> includes both the <strong>model loading time</strong> (the time required to fully load a set of pre-trained weights to implement a model) and the <strong>total prediction time</strong> of a deep learning model on a test dataset. Similar to the <strong>total execution time</strong>, the lower the <strong>total inference time</strong> the better.</li>
<li><strong>FLOPS</strong>: the performance capability of a CPU or GPU can be measured by counting the number of floating-point operations it can execute per second (FLOPS). Thus, the higher the <strong>FLOPS</strong>, the better.</li>
<li><strong>Computing resources issues/errors</strong>: ideally, a better-performing setup will not encounter any computing resource issues or errors, including but not limited to the Out-Of-Memory (OOM) error.</li>
<li><strong>Bottlenecking</strong>: bottlenecking is subpar performance caused by the inability of one component to keep up with the others, which slows down the overall ability of a setup to process data. Here, our primary concern is bottlenecking between the CPU and the GPU. The <strong>bottlenecking factor</strong> is measured using an online tool: <a href="https://pc-builds.com/calculator/">Bottleneck Calculator</a>.</li>
</ol>
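<p>As an illustration of the FLOPS metric, the following is a minimal sketch (not the benchmark's own code) that estimates achieved FLOPS on the CPU by timing a matrix product with NumPy. It assumes the usual count of 2·M·N·K floating-point operations for an (MxN) by (NxK) product; the function name <code>estimate_gemm_flops</code> and its parameters are hypothetical.</p>

```python
import time

import numpy as np


def estimate_gemm_flops(m, n, k, repeats=3):
    """Estimate achieved FLOPS for an (m x n) @ (n x k) product on the CPU.

    Hypothetical helper: assumes 2*m*n*k floating-point operations per
    product (one multiply and one add per inner-product term).
    """
    rng = np.random.default_rng(0)
    a = rng.random((m, n), dtype=np.float32)
    b = rng.random((n, k), dtype=np.float32)

    a @ b  # warm-up run, not timed

    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - start)

    best = max(best, 1e-9)  # guard against a zero reading on coarse timers
    flop_count = 2.0 * m * n * k
    return flop_count / best, best
```

<p>For example, <code>estimate_gemm_flops(3072, 128, 1024)</code> returns a pair (estimated FLOPS, best time in seconds). The figure reflects what NumPy's BLAS backend achieves on the CPU, not the theoretical peak of the hardware.</p>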
<p>4. Methods</p>
<ul>
<li>To evaluate the hardware performance, two deep learning models are deployed for benchmarking purposes. The first model is a modified VGG19 based on a study by Deitsch et al. (<strong>Model A</strong>) [1], and the other is a modified concatenated model proposed in a study by Rahimzadeh et al. (<strong>Model B</strong>) [2]. These models were previously implemented in Vo et al. [3]. The model compilation, training and validation practices are similar to those described in Vo et al. [3]. In addition, several optimization practices, such as a mixed precision policy, are applied during model training to make it run faster and consume less memory. The following datasets are used for benchmarking: the <strong>original MNIST dataset</strong> by LeCun et al. [6], and the <strong>Zalando MNIST dataset</strong> by Xiao et al. [7].</li>
<li>We also propose an alternative benchmarking approach that is much simpler and quicker: evaluating the <strong>total execution time</strong> for a combination of basic operations. These basic operations include General Matrix-to-Matrix Multiplication (GEMM), in both dense (DMM) and sparse (SMM) forms, 2D convolution (Convolve2D) and Recurrent Neural Network (RNN) operations, and they exist in almost all deep neural networks today [4]. We implemented our alternative approach based on the DeepBench work by Baidu [5]:
<ul>
<li>In DMM, we defined matrix C as the product of an (MxN) matrix and an (NxK) matrix. For example, (3072,128,1024) means the resulting matrix is the product of a (3072x128) matrix and a (128x1024) matrix. To benchmark, we implemented five different multiplications and measured the overall <strong>total execution time</strong> of these five. These multiplications were (3072,128,1024), (5124,9124,2560), (2560,64,2560), (7860,64,2560), and (1760,128,1760).</li>
<li>In SMM, we defined matrix C as the product of an (MxN) matrix and an (NxK) matrix, where (100 - Dx100)% of the (MxN) matrix is omitted and D is the fourth value in each tuple. For instance, (10752,1,3584,0.9) means the resulting matrix is the product of a (10752x1) matrix and a (1x3584) matrix, while 10% of the (10752x1) matrix is omitted. To benchmark, we implemented four different multiplications and measured the overall <strong>total execution time</strong> of these four. These multiplications were (10752,1,3584,0.9), (7680,1500,2560,0.95), (7680,2,2560,0.95), and (7680,1,2560,0.95).</li>
<li>In Convolve2D, we defined a simple model containing only convolution layers and pooling layers and measured the resulting <strong>total execution time</strong>. The dataset used for training this model is the <strong>Zalando MNIST</strong> by Xiao et al.</li>
<li>We did not implement the <strong>RNN</strong> due to several issues caused by the new version of Keras.</li>
</ul>
</li>
<li>To evaluate the <strong>total inference time</strong>, we loaded the trained weights of our models with the best validation accuracy (denoted as <strong>Model A-benchmarked</strong> and <strong>Model B-benchmarked</strong>, respectively) and conducted a prediction run on the test set from the <strong>Zalando MNIST</strong>. These files are available on Zenodo: <a href="https://zenodo.org/record/4905213#.YL1-P_kzaUk">Inference Models</a>.</li>
</ul>
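<p>The DMM and SMM measurements described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the benchmark's own implementation: the helper names <code>masked_matrix</code> and <code>time_products</code> and the <code>shrink</code> parameter are hypothetical, and the "omitted" entries are modelled by setting them to zero in an otherwise dense array, so the product itself does not exploit sparsity.</p>

```python
import time

import numpy as np

# Shapes taken from the text: (M, N, K) for DMM and (M, N, K, D) for SMM.
DMM_SHAPES = [(3072, 128, 1024), (5124, 9124, 2560), (2560, 64, 2560),
              (7860, 64, 2560), (1760, 128, 1760)]
SMM_SHAPES = [(10752, 1, 3584, 0.9), (7680, 1500, 2560, 0.95),
              (7680, 2, 2560, 0.95), (7680, 1, 2560, 0.95)]


def masked_matrix(m, n, density, rng):
    """Random (m x n) float32 matrix keeping roughly `density` of its entries.

    Assumption: omitted entries are simply zeroed out.
    """
    values = rng.random((m, n), dtype=np.float32)
    keep = rng.random((m, n)) < density
    return values * keep


def time_products(shapes, shrink=1, seed=0):
    """Return the summed wall-clock time of one product per shape.

    `shrink` divides every dimension (hypothetical knob for quick trial runs).
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for shape in shapes:
        m, n, k = (max(1, d // shrink) for d in shape[:3])
        density = shape[3] if len(shape) == 4 else 1.0
        a = masked_matrix(m, n, density, rng)
        b = rng.random((n, k), dtype=np.float32)
        start = time.perf_counter()
        a @ b
        total += time.perf_counter() - start
    return total
```

<p>For a quick trial, <code>time_products(DMM_SHAPES, shrink=8)</code> and <code>time_products(SMM_SHAPES, shrink=8)</code> run reduced versions of each set; with <code>shrink=1</code> the full shapes are used, which requires several hundred megabytes of memory for the largest product.</p>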
<p>5. References</p>
<ul>
<li>[1] S. Deitsch, V. Christlein, S. Berger, C. Buerhop-Lutz, A. Maier, F. Gallwitz, and C. Riess, “Automatic classification of defective photovoltaic module cells in electroluminescence images,” Solar Energy, vol. 185, pp. 455–468, Jun. 2019.</li>
<li>[2] M. Rahimzadeh and A. Attar, “A modified deep convolutional neural network for detecting COVID-19 and pneumonia from chest X-ray images based on the concatenation of Xception and ResNet50V2,” Informatics in Medicine Unlocked, vol. 19, p. 100360, 2020.</li>
<li>[3] H. Vo, “Realization and Verification of Deep Learning Models for Fault Detection and Diagnosis of Photovoltaic Modules,” Master’s Thesis, Aalto University, School of Electrical Engineering, 2021.</li>
<li>[4] P. Warden, "Why GEMM is at the heart of deep learning," Pete Warden's Blog, 2015. Available at: <a href="https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/">https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/</a></li>
<li>[5] Baidu Research, "Benchmarking Deep Learning operations on different hardware". Available at: <a href="https://github.com/baidu-research/DeepBench">https://github.com/baidu-research/DeepBench</a></li>
<li>[6] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, 1998.</li>
<li>[7] H. Xiao, K. Rasul, and R. Vollgraf, “A Novel Image Dataset for Benchmarking Machine Learning Algorithms,” 2017. <a href="https://github.com/zalandoresearch/fashion-mnist">https://github.com/zalandoresearch/fashion-mnist</a></li>
<li>[8] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.</li>
<li>[9] F. Chollet, “Keras,” 2015. Available at: <a href="https://github.com/fchollet/keras">https://github.com/fchollet/keras</a></li>
<li>[10] ML Commons. Available at: <a href="https://mlcommons.org/en/">https://mlcommons.org/en/</a></li>
<li>[11] W. Dai and D. Berleant, “Benchmarking contemporary deep learning hardware and frameworks: A survey of qualitative metrics,” 2019 IEEE First International Conference on Cognitive Machine Intelligence (CogMI), Dec 2019.</li>
</ul>
https://doi.org/10.5281/zenodo.4905213
oai:zenodo.org:4905213
Zenodo
https://doi.org/10.5281/zenodo.4905212
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
deep learning
hardware
benchmarking
Hardware Benchmark for Deep Learning Capability
info:eu-repo/semantics/other