Software Open Access

Hardware Benchmark for Deep Learning Capability

Vo, Huynh Quang Nguyen


Citation Style Language JSON Export

{
  "publisher": "Zenodo", 
  "DOI": "10.5281/zenodo.4905213", 
  "title": "Hardware Benchmark for Deep Learning Capability", 
  "issued": {
    "date-parts": [
      [
        2021, 
        6, 
        7
      ]
    ]
  }, 
  "abstract": "<p>1. Introduction</p>\n\n<p>These files contain a proposed benchmarking implementation for evaluating whether a hardware setup is feasible for complex deep learning projects.</p>\n\n<p>2. Scope</p>\n\n<ul>\n\t<li>The benchmark evaluates the performance of a setup having a single CPU, a single GPU, RAM and memory storage. The performance of multi-CPU/multi-GPU or server-based setups is included in our scope.</li>\n\t<li>The benchmark is built on the&nbsp;<strong>Anaconda</strong>&nbsp;distribution of Python and the&nbsp;<strong>Jupyter Notebook</strong>&nbsp;computational environment. The deep learning models used in this benchmark are implemented using the&nbsp;<strong>Keras</strong>&nbsp;application programming interface (API).</li>\n\t<li>Our goal is to develop a verified approach to hardware benchmarking that is quick and easy to use.&nbsp;To do so, we provide benchmarking programs as well as an installation guide for Anaconda and the supporting deep learning packages.</li>\n</ul>\n\n<p>3. Evaluation metrics</p>\n\n<p>Various metrics can be used to benchmark the performance of a setup for deep learning purposes. Here, the following metrics are used:</p>\n\n<ol>\n\t<li><strong>Total execution time</strong>: the&nbsp;<strong>total execution time</strong>&nbsp;includes both the&nbsp;<strong>total training time</strong>&nbsp;and the&nbsp;<strong>total validation time</strong>&nbsp;of a deep learning model on a dataset after a defined number of epochs. Here, the number of epochs is 100. The lower the&nbsp;<strong>total execution time</strong>, the better.</li>\n\t<li><strong>Total inference time</strong>: the&nbsp;<strong>total inference time</strong>&nbsp;includes both the&nbsp;<strong>model loading time</strong>&nbsp;(the time required to fully load a set of pre-trained weights to implement a model) and the&nbsp;<strong>total prediction time</strong>&nbsp;of a deep learning model on a test dataset. 
Similar to the&nbsp;<strong>total execution time</strong>, the lower the&nbsp;<strong>total inference time</strong>, the better.</li>\n\t<li><strong>FLOPS</strong>: the performance capability of a CPU or GPU can be measured by counting the number of floating-point operations it can execute per second (FLOPS). Thus, the higher the&nbsp;<strong>FLOPS</strong>, the better.</li>\n\t<li><strong>Computing resource issues/errors</strong>: ideally, a better-performing setup will not encounter any computing resource issues or errors, including but not limited to the Out-Of-Memory (OOM) error.</li>\n\t<li><strong>Bottlenecking</strong>: put simply, bottlenecking is subpar performance caused by the inability of one component to keep up with the others, which slows down the overall ability of a setup to process data. Here, our primary concern is bottlenecking between the CPU and the GPU. The&nbsp;<strong>bottlenecking factor</strong>&nbsp;is measured using an online tool:&nbsp;<a href=\"https://pc-builds.com/calculator/\">Bottleneck Calculator</a>.</li>\n</ol>\n\n<p>4. Methods</p>\n\n<ul>\n\t<li>To evaluate hardware performance, two deep learning models are deployed for benchmarking purposes. The first model is a modified VGG19 based on a study by Deitsch et al. (<strong>Model A</strong>) [1], and the other is a modified concatenated model proposed in a study by Rahimzadeh and Attar (<strong>Model B</strong>) [2]. These models were previously implemented in Vo [3]. The model compilation, training and validation practices are similar to those described in Vo [3]. In addition, several optimization practices, such as a mixed precision policy, are applied during model training to make it run faster and consume less memory. 
The following datasets are used&nbsp;for benchmarking: the&nbsp;<strong>original MNIST dataset</strong>&nbsp;by LeCun et al. [6] and the&nbsp;<strong>Zalando MNIST dataset</strong>&nbsp;by Xiao et al. [7].</li>\n\t<li>We also propose another approach to benchmarking that is much simpler and quicker: evaluating the&nbsp;<strong>total execution time</strong>&nbsp;for a combination of basic operations. These basic operations include General Matrix to Matrix Multiplication (GEMM), 2D-Convolution (Convolve2D) and Recurrent Neural Network (RNN) operations, which exist in almost all deep neural networks today [4]. We implemented this alternative approach based on the DeepBench work by Baidu [5]:\n\t<ul>\n\t\t<li>In dense matrix multiplication (DMM), we defined matrix C as the product of&nbsp;(MxN)&nbsp;and&nbsp;(NxK)&nbsp;matrices. For example,&nbsp;(3072,128,1024)&nbsp;means the resulting matrix is the product of&nbsp;(3072x128)&nbsp;and&nbsp;(128x1024)&nbsp;matrices. To benchmark, we implemented five different multiplications and measured the overall&nbsp;<strong>total execution time</strong>&nbsp;of these five. These multiplications were&nbsp;(3072,128,1024),&nbsp;(5124,9124,2560),&nbsp;(2560,64,2560),&nbsp;(7860,64,2560), and&nbsp;(1760,128,1760).</li>\n\t\t<li>In sparse matrix multiplication (SMM), we defined matrix C as the product of&nbsp;(MxN)&nbsp;and&nbsp;(NxK)&nbsp;matrices, where&nbsp;(100 - Dx100)%&nbsp;of the&nbsp;(MxN)&nbsp;matrix is omitted. For instance,&nbsp;(10752,1,3584,0.9)&nbsp;means the resulting matrix is the product of&nbsp;(10752x1)&nbsp;and&nbsp;(1x3584)&nbsp;matrices, with 10% of the&nbsp;(10752x1)&nbsp;matrix omitted. To benchmark, we implemented four different multiplications and measured the overall&nbsp;<strong>total execution time</strong>&nbsp;of these four. 
These multiplications were&nbsp;(10752,1,3584,0.9),&nbsp;(7680,1500,2560,0.95),&nbsp;(7680,2,2560,0.95), and&nbsp;(7680,1,2560,0.95).</li>\n\t\t<li>In Convolve2D, we defined a simple model containing only convolution layers and pooling layers&nbsp;and measured the resulting&nbsp;<strong>total execution time</strong>. The dataset used for training this model is the&nbsp;<strong>Zalando MNIST</strong>&nbsp;by Xiao et al.</li>\n\t\t<li>We did not implement the&nbsp;<strong>RNN</strong>&nbsp;benchmark due to several issues caused by the new version of Keras.</li>\n\t</ul>\n\t</li>\n\t<li>To evaluate&nbsp;<strong>total inference time</strong>, we loaded the trained weights of our models (denoted as&nbsp;<strong>Model A-benchmarked</strong>&nbsp;and&nbsp;<strong>Model B-benchmarked</strong>, respectively) that have the best validation accuracy and conducted a prediction run on the test set from the&nbsp;<strong>Zalando MNIST</strong>. These files are available on Zenodo:&nbsp;<a href=\"https://zenodo.org/record/4905213#.YL1-P_kzaUk\">Inference Models</a></li>\n</ul>\n\n<p>5. References</p>\n\n<ul>\n\t<li>[1]&nbsp;S. Deitsch, V. Christlein, S. Berger, C. Buerhop-Lutz, A. Maier, F. Gallwitz, and C. Riess, &ldquo;Automatic classification of defective photovoltaic module cells in electroluminescence images,&rdquo; Solar Energy, vol. 185, pp. 455&ndash;468, Jun. 2019.</li>\n\t<li>[2]&nbsp;M. Rahimzadeh and A. Attar, &ldquo;A modified deep convolutional neural network for detecting COVID-19 and pneumonia from chest X-ray images based on the concatenation of Xception and ResNet50V2,&rdquo; Informatics in Medicine Unlocked, vol. 19, p. 100360, 2020.</li>\n\t<li>[3]&nbsp;H. Vo, &ldquo;Realization and Verification of Deep Learning Models for Fault Detection and Diagnosis of Photovoltaic Modules,&rdquo; Master&rsquo;s Thesis, Aalto University, School of Electrical Engineering, 2021.</li>\n\t<li>[4]&nbsp;P. 
Warden, &quot;Why GEMM is at the heart of deep learning,&quot; Pete Warden&#39;s Blog, 2015. Available at:&nbsp;<a href=\"https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/\">https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/</a></li>\n\t<li>[5]&nbsp;Baidu Research, &quot;Benchmarking Deep Learning operations on different hardware.&quot; Available at:&nbsp;<a href=\"https://github.com/baidu-research/DeepBench\">https://github.com/baidu-research/DeepBench</a></li>\n\t<li>[6]&nbsp;Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, &quot;Gradient-based learning applied to document recognition,&quot; Proceedings of the IEEE, 1998.</li>\n\t<li>[7]&nbsp;H. Xiao, K. Rasul, and R. Vollgraf, &ldquo;A Novel Image Dataset for Benchmarking Machine Learning Algorithms,&rdquo; 2017. Available at:&nbsp;<a href=\"https://github.com/zalandoresearch/fashion-mnist\">https://github.com/zalandoresearch/fashion-mnist</a></li>\n\t<li>[8]&nbsp;F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, &ldquo;Scikit-learn: Machine learning in Python,&rdquo; Journal of Machine Learning Research, vol. 12, pp. 2825&ndash;2830, 2011.</li>\n\t<li>[9]&nbsp;F. Chollet, &ldquo;Keras,&rdquo; 2015. Available at:&nbsp;<a href=\"https://github.com/fchollet/keras\">https://github.com/fchollet/keras</a></li>\n\t<li>[10]&nbsp;ML Commons. Available at:&nbsp;<a href=\"https://mlcommons.org/en/\">https://mlcommons.org/en/</a></li>\n\t<li>[11]&nbsp;W. Dai and D. Berleant, &ldquo;Benchmarking contemporary deep learning hardware and frameworks: A survey of qualitative metrics,&rdquo; 2019 IEEE First International Conference on Cognitive Machine Intelligence (CogMI), Dec. 2019.</li>\n</ul>", 
  "author": [
    {
      "family": "Vo, Huynh Quang Nguyen"
    }
  ], 
  "version": "1.00.00", 
  "type": "article", 
  "id": "4905213"
}
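
The abstract describes timing dense (M,N,K) multiplications and relating the work done to FLOPS. The following is a minimal sketch of that idea, not the record's actual benchmarking code: the record uses Keras on a GPU, whereas this stand-in uses only the Python standard library and a naive multiplication so that it runs anywhere. The names `gemm_flops`, `naive_matmul`, `time_call` and `achieved_flops` are hypothetical, and the 2*M*N*K count is the usual approximation of one multiply and one add per inner-product term.

```python
import time
from typing import Callable, List

# Dense (M, N, K) shapes quoted in the abstract: C = A (M x N) times B (N x K).
DENSE_SHAPES = [
    (3072, 128, 1024),
    (5124, 9124, 2560),
    (2560, 64, 2560),
    (7860, 64, 2560),
    (1760, 128, 1760),
]

Matrix = List[List[float]]


def gemm_flops(m: int, n: int, k: int) -> int:
    """Approximate FLOP count: one multiply and one add per inner-product term."""
    return 2 * m * n * k


def naive_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Pure-Python stand-in for the framework's matrix multiplication (assumption)."""
    cols_b = list(zip(*b))  # columns of b, so each output entry is a row-column dot product
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def time_call(fn: Callable[[], object], repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` runs of fn."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def achieved_flops(flop_count: int, seconds: float) -> float:
    """Floating-point operations per second for a measured run."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return flop_count / seconds


if __name__ == "__main__":
    # A deliberately small shape so the naive loop finishes quickly.
    m, n, k = 64, 32, 48
    a = [[1.0] * n for _ in range(m)]
    b = [[1.0] * k for _ in range(n)]
    seconds = time_call(lambda: naive_matmul(a, b))
    print(f"({m},{n},{k}): {seconds:.4f} s, "
          f"{achieved_flops(gemm_flops(m, n, k), seconds):.3e} FLOPS")
    # FLOP counts for the shapes listed in the abstract (not timed here).
    for shape in DENSE_SHAPES:
        print(shape, gemm_flops(*shape))
```

To approximate the record's setup more closely, `naive_matmul` could be swapped for the framework's own multiplication and the shapes in `DENSE_SHAPES` timed directly; taking the best of several runs reduces the influence of warm-up effects on the reported time.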