Published June 7, 2021 | Version 1.00.00
Software | Open Access

Hardware Benchmark for Deep Learning Capability

  • Aalto University

Contributors

Project member:

  • RMIT Vietnam

Description

1. Introduction

These files contain the proposed implementation of a benchmark for evaluating whether a given hardware setup is feasible for complex deep learning projects.

2. Scope 

  • The benchmark evaluates the performance of a setup with a single CPU, a single GPU, RAM, and memory storage. The performance of multi-CPU/multi-GPU or server-based setups is not included in our scope.
  • The benchmark is built on the Anaconda distribution of Python and the Jupyter Notebook computational environment. The deep learning models used in this benchmark are implemented with the Keras application programming interface (API).
  • Our goal is to develop a verified approach to hardware benchmarking that is quick and easy to use. To this end, we provide the benchmarking programs as well as an installation guide for Anaconda and the supporting deep learning packages.
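
Before running the benchmark, it is worth confirming that the environment actually exposes the CPU and GPU to the framework. Below is a minimal sanity check, assuming a TensorFlow/Keras installation inside the Anaconda environment (exact package names and versions are left to the installation guide):

    import tensorflow as tf

    # Report the framework version and the devices it can see.
    # An empty GPU list means the benchmark would fall back to CPU only.
    print("TensorFlow version:", tf.__version__)
    print("CPUs:", tf.config.list_physical_devices("CPU"))
    print("GPUs:", tf.config.list_physical_devices("GPU"))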

3. Evaluation metrics

There are various metrics for benchmarking the performance capabilities of a setup for deep learning purposes. Here, the following metrics are used:

  1. Total execution time: the total training time plus the total validation time of a deep learning model on a dataset after a defined number of epochs (here, 100). The lower the total execution time, the better.
  2. Total inference time: the model loading time (the time required to fully load a set of pre-trained weights into a model) plus the total prediction time of the model on a test dataset. As with the total execution time, the lower the total inference time, the better.
  3. FLOPS: the performance capability of a CPU or GPU can be measured by counting the number of floating-point operations (FLOP) it can execute per second. Thus, the higher the FLOPS, the better.
  4. Computing resource issues/errors: ideally, a better-performing setup will not encounter any computing resource issues/errors, including but not limited to the Out-Of-Memory (OOM) error.
  5. Bottlenecking: put simply, bottlenecking is subpar performance caused by the inability of one component to keep up with the others, slowing down the overall ability of the setup to process data. Here, our primary concern is bottlenecking between the CPU and GPU. The bottlenecking factor is measured using an online tool: Bottleneck Calculator.
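
As an illustration of how metrics 1-3 can be collected, the sketch below times a training run and an inference run with Python's time.perf_counter, and estimates achieved FLOPS from a single dense matrix product (roughly 2*M*N*K floating-point operations). The small stand-in model, the saved file name, and the matrix sizes are placeholders rather than the exact ones used in our notebooks:

    import time
    import tensorflow as tf

    # Small stand-in model and dataset; the actual benchmark uses
    # Model A/B and 100 epochs (see Section 4).
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # Metric 1: total execution time (training + validation).
    start = time.perf_counter()
    model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=2)
    print("Total execution time: %.1f s" % (time.perf_counter() - start))

    # Metric 2: total inference time (model loading + prediction).
    model.save("benchmark_model.h5")  # hypothetical file name
    start = time.perf_counter()
    loaded = tf.keras.models.load_model("benchmark_model.h5")
    loaded.predict(x_test)
    print("Total inference time: %.1f s" % (time.perf_counter() - start))

    # Metric 3: achieved FLOPS from one dense matrix product.
    # An (MxN) @ (NxK) product costs roughly 2*M*N*K floating-point operations.
    M, N, K = 3072, 128, 1024
    a = tf.random.normal((M, N))
    b = tf.random.normal((N, K))
    start = time.perf_counter()
    tf.linalg.matmul(a, b).numpy()  # .numpy() forces execution before the clock stops
    elapsed = time.perf_counter() - start
    print("Achieved GFLOPS: %.2f" % (2 * M * N * K / elapsed / 1e9))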

4. Methods

  • To evaluate hardware performance, two deep learning models are deployed for benchmarking purposes. The first is a modified VGG19 based on a study by Deitsch et al. (Model A) [1]; the other is a modified concatenated model proposed in a study by Rahimzadeh et al. (Model B) [2]. These models were previously implemented in Vo [3], and the model compilation, training, and validation practices are similar to those described there. In addition, several optimization techniques, such as the mixed precision policy, are applied during model training to speed it up and reduce its memory consumption. The following datasets are used for benchmarking: the original MNIST dataset by LeCun et al. [6] and the Zalando MNIST (Fashion-MNIST) dataset by Xiao et al. [7].
  • We also propose an alternative benchmarking approach that is much simpler and quicker: evaluating the total execution time of a combination of basic operations. These basic operations include General Matrix to Matrix Multiplication (GEMM), 2D convolution (Convolve2D), and the Recurrent Neural Network (RNN), and they appear in almost all deep neural networks today [4]. We implemented this alternative approach based on the DeepBench work by Baidu [5] (a code sketch follows this list):
    • In dense matrix multiplication (DMM), we defined matrix C as the product of an (MxN) matrix and an (NxK) matrix. For example, (3072,128,1024) means the resulting matrix is the product of a (3072x128) matrix and a (128x1024) matrix. To benchmark, we implemented five different multiplications and measured their overall total execution time. These multiplications were (3072,128,1024), (5124,9124,2560), (2560,64,2560), (7860,64,2560), and (1760,128,1760).
    • In sparse matrix multiplication (SMM), we defined matrix C as the product of an (MxN) matrix and an (NxK) matrix, where (100 - Dx100)% of the (MxN) matrix is omitted. For instance, (10752,1,3584,0.9) means the resulting matrix is the product of a (10752x1) matrix and a (1x3584) matrix, while 10% of the (10752x1) matrix is omitted. To benchmark, we implemented four different multiplications and measured their overall total execution time. These multiplications were (10752,1,3584,0.9), (7680,1500,2560,0.95), (7680,2,2560,0.95), and (7680,1,2560,0.95).
    • In Convolve2D, we defined a simple model containing only convolution and pooling layers and measured the resulting total execution time. The dataset used for training this model is the Zalando MNIST dataset by Xiao et al. [7].
    • We did not implement the RNN benchmark due to several compatibility issues caused by the new version of Keras.
  • To evaluate the total inference time, we loaded the already-trained weights of our models with the best validation accuracy (denoted Model A-benchmarked and Model B-benchmarked, respectively) and ran a prediction pass on the test set of the Zalando MNIST dataset. These files are available on Zenodo: Inference Models
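
Below is a minimal sketch of the basic-operations approach under the definitions above. It times the five DMM products and the four SMM products (sparsity is emulated here by zeroing a fraction of the left matrix; DeepBench uses dedicated sparse kernels instead), then trains a small convolution/pooling model on the Zalando MNIST data. The helper names and the single training epoch are illustrative choices, not the exact benchmark configuration:

    import time
    import numpy as np
    import tensorflow as tf

    # Optional (TF 2.4+): the mixed precision policy mentioned above.
    # tf.keras.mixed_precision.set_global_policy("mixed_float16")

    def time_dmm(cases):
        """Total execution time of a list of (M, N, K) dense products."""
        start = time.perf_counter()
        for M, N, K in cases:
            a = tf.random.normal((M, N))
            b = tf.random.normal((N, K))
            tf.linalg.matmul(a, b).numpy()  # .numpy() forces execution
        return time.perf_counter() - start

    def time_smm(cases):
        """Total execution time of (M, N, K, D) products where
        (100 - D*100)% of the (MxN) matrix is zeroed out."""
        start = time.perf_counter()
        for M, N, K, d in cases:
            a = np.random.randn(M, N).astype("float32")
            a[np.random.rand(M, N) < (1.0 - d)] = 0.0  # omit (100 - D*100)% of entries
            b = tf.random.normal((N, K))
            tf.linalg.matmul(tf.constant(a), b).numpy()
        return time.perf_counter() - start

    dmm_cases = [(3072, 128, 1024), (5124, 9124, 2560), (2560, 64, 2560),
                 (7860, 64, 2560), (1760, 128, 1760)]
    smm_cases = [(10752, 1, 3584, 0.9), (7680, 1500, 2560, 0.95),
                 (7680, 2, 2560, 0.95), (7680, 1, 2560, 0.95)]
    print("DMM total time: %.2f s" % time_dmm(dmm_cases))
    print("SMM total time: %.2f s" % time_smm(smm_cases))

    # Convolve2D: a model containing only convolution and pooling layers
    # (plus a flatten/classifier head), trained on Zalando MNIST.
    (x_train, y_train), _ = tf.keras.datasets.fashion_mnist.load_data()
    x_train = x_train[..., None] / 255.0
    conv_model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    conv_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    start = time.perf_counter()
    conv_model.fit(x_train, y_train, epochs=1)
    print("Convolve2D total time: %.2f s" % (time.perf_counter() - start))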

5. References

  • [1] S. Deitsch, V. Christlein, S. Berger, C. Buerhop-Lutz, A. Maier, F. Gallwitz, and C. Riess, “Automatic classification of defective photovoltaic module cells in electroluminescence images,” Solar Energy, vol. 185, pp. 455–468, Jun. 2019.
  • [2] M. Rahimzadeh and A. Attar, “A modified deep convolutional neural network for detecting COVID-19 and pneumonia from chest X-ray images based on the concatenation of Xception and ResNet50V2,” Informatics in Medicine Unlocked, vol. 19, p. 100360, 2020.
  • [3] H. Vo, “Realization and Verification of Deep Learning Models for Fault Detection and Diagnosis of Photovoltaic Modules,” Master's Thesis, Aalto University School of Electrical Engineering, 2021.
  • [4] P. Warden, “Why GEMM is at the heart of deep learning,” Pete Warden's Blog, 2015. Available at: https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
  • [5] Baidu Research, “Benchmarking Deep Learning operations on different hardware.” Available at: https://github.com/baidu-research/DeepBench
  • [6] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [7] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms,” 2017. Available at: https://github.com/zalandoresearch/fashion-mnist
  • [8] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
  • [9] F. Chollet, “Keras,” 2015. Available at: https://github.com/fchollet/keras
  • [10] MLCommons. Available at: https://mlcommons.org/en/
  • [11] W. Dai and D. Berleant, “Benchmarking contemporary deep learning hardware and frameworks: A survey of qualitative metrics,” 2019 IEEE First International Conference on Cognitive Machine Intelligence (CogMI), Dec. 2019.

 

Files

Benchmark Inference.ipynb

Files (1.6 GB)

