High Performance Correctly Rounded Math Libraries for 32-bit Floating Point Representations

Lim, Jay P.; Nagarakatte, Santosh

doi:10.5281/zenodo.4579410

Published March 3, 2021 | Version v1

Conference paper Open

1. Rutgers University

Abstract:

We present the artifact for the accepted paper, High Performance
Correctly Rounded Math Libraries for 32-bit Floating Point
Representations. The artifact provides all necessary source code as
well as a testing harness to show the correctness and speedup of the
proposed math library, RLibm-32. The readme documentation describes how
to install all necessary tools and explains the steps to run the
experiments. We provide scripts to automatically execute each part of
the experiment. Because fully evaluating the entire artifact requires
at least 4 days of work (even with some parallelism), we also provide
a list of scripts for the "SHORT" version of the artifact
evaluation. The short version requires roughly 2 hours to
complete. We highly recommend using the short version instead of the
full version. However, for completeness, we also describe how to
completely evaluate the artifact towards the bottom of the readme
file.

_________________________________________________________________________________________________

* Important note:

Completely evaluating everything in this artifact will take several
days (4+ days). There are a total of 4 phases in our evaluation:

   (1) Computing reduced intervals
   (2) Generating polynomials
   (3) Checking for correctness
   (4) Testing speedup

Thus, in this artifact, we provide a "short" version of each phase of
the evaluation. The short version takes roughly 2 hours to
complete. We HIGHLY recommend that you evaluate the artifact
using the short version.

We also provide a way to perform the full evaluation of the 3rd and 4th
phases (checking for correctness and testing speedup) without having
to generate the polynomials yourself. However, checking for correctness
requires roughly 24 hours to complete and testing the speedup of
RLibm-32 requires roughly 8-10 hours to complete.

In conclusion, we strongly recommend that you evaluate the artifact
using the short version.

_________________________________________________________________________________________________

* Hardware recommendation:

All of our evaluations were performed on a machine with a 2.10GHz Intel
Xeon Gold 6230R and 187GB of RAM running Ubuntu 18.04. However, the
artifact should run on most modern machines with at least
16GB of memory. To run the "short" version, we recommend at least
40GB of disk space to be safe. To run the "full" version, we recommend
at least 200GB of disk space.
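
The disk space recommendation can be checked before starting. The
following is a minimal sketch, not part of the artifact: the function
name has_free_gb is hypothetical, and it assumes a POSIX "df" and
"awk" are available.

```shell
# Hypothetical helper (not shipped with the artifact).
# Succeeds if the filesystem holding directory $1 has at least $2 GB free.
has_free_gb() {
    dir=$1
    need_gb=$2
    # "df -Pk" prints POSIX-format output in 1 KB blocks;
    # column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}
```

For example, "has_free_gb . 40 || echo 'less than 40GB free'" checks the
short-version figure for the current directory; use 200 for the full
version.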

* Software recommendation:

We tested our artifact on Ubuntu 16.04 and Ubuntu 18.04. We provide
instructions to install all prerequisite software.

_________________________________________________________________________________________________

* INSTALLATION GUIDE:

There are two methods to install the artifact: (1) installing docker
and downloading the docker image for the artifact, or (2) manual
installation. We recommend using the provided docker image to
expedite the installation process.

** DOWNLOADING PREBUILT DOCKER IMAGE

We have prebuilt a docker image and hosted it on Docker Hub.

(1) Install docker if not already installed by following the
installation documentation in this link:
https://docs.docker.com/install/

   We recommend installing docker and evaluating our artifact on
   a machine with Ubuntu. Although docker can be used with
   Windows or MacOS, docker will run on top of a Linux VM,
   significantly slowing down all processes. Docker commands
   may require "sudo" permission. If you get any errors when
   trying to use docker, prefix each docker command with
   "sudo".

(2) Download the docker image

$ docker pull jpl169/pldi2021artifact

The docker image is roughly 5.61GB in size

(3) Run the docker image

$ sudo docker run -it jpl169/pldi2021artifact

   You will be placed in the PLDI2021Artifact directory. You
   are now ready to evaluate the artifact. Go down to the
   EVALUATION GUIDE section for artifact evaluation.

(*) NOTE ON WORKING WITH DOCKER (COPYING FILES)

   Later in the EVALUATION GUIDE, you may want to copy files
   from the docker container to your host machine. This will
   allow you to compare or inspect files more easily. To copy files
   from the container to the host machine, follow the steps below:

(a) Note the <path> and <filename> you wish to copy
(b) Exit from the docker container by using the command

$ exit

(c) Find the ID of the container you were in

$ docker container ls -a

(d) Copy file from the host terminal:

$ docker cp <container id>:<path>/<filename> <destination>

   For example, if you want to copy "Example.pdf" in the "/home/foo"
   directory from the container Id 12345 to your host machine's
   current directory, the command will look like

       $ docker cp 12345:/home/foo/Example.pdf .

(e) If you want to go back into the container, use the command

$ docker start <container id>
$ docker attach <container id>

** MANUAL INSTALLATION

(1) To install and evaluate the artifact manually, first install a list of prerequisites:

   $ sudo apt-get update
   $ sudo apt-get install build-essential cmake git libgmp3-dev libmpfr-dev zlib1g zlib1g-dev wget python python3 python3-pip parallel
   $ sudo python3 -m pip install matplotlib

(1.1) To prevent GNU parallel from printing its citation notice on every run, accept its citation terms once:
$ parallel --citation
-> type 'will cite'

(2) Download and install soplex 4.0.1:

   Please download soplex-4.0.1 from https://soplex.zib.de/index.php#download
   ** Make sure that you're downloading version 4.0.1 **

   $ tar -xvf soplex-4.0.1.tar
   $ cd soplex-4.0.1
   $ mkdir build
   $ cd build
   $ cmake ..
   $ make
   $ cd ../..

(3) Download and compile SoftPosit:
   $ git clone https://gitlab.com/cerlane/SoftPosit.git
   $ cd SoftPosit
   $ git checkout 983e821dd30b9da5467bcbea2895fdbacc1ce264
   $ cd build/Linux-x86_64-GCC/
   $ make
   $ cd ../../..

(4) Download and install the Intel oneAPI HPC Toolkit

* You can access it through this site:
https://software.intel.com/content/www/us/en/develop/tools/oneapi/hpc-toolkit/download.html

   (a) Select the appropriate operating system
   (b) Select "Web & Local" distribution option
   (c) Select Online installer
   (d) On the right hand side (gray background) if you scroll
   down, it will show the steps to install.

If your OS is Linux-based, you can use the commands:

$ wget https://registrationcenter-download.intel.com/akdlm/irc_nas/17427/l_HPCKit_p_2021.1.0.2684.sh

$ bash l_HPCKit_p_2021.1.0.2684.sh

   Follow the instructions. The installer will guide you through
   installing the Intel compiler.

   (a) Make sure to install "Intel® oneAPI DPC++/C++ Compiler &
   Intel® C++ Compiler Classic." Choose to not install any other
   components
   (b) Make sure to remember the Installation directory
   (c) If it shows you any warning about requiring the "Base
   toolkit" you can choose to ignore it.

The installation will take roughly 5~10 minutes.

   Once installation is complete, run the script that sets the environment variables:
   $ cd <path to intel oneAPI directory>
   $ . setvars.sh

(5) Set the following environment variables:
   $ export SOFTPOSITPATH=<path to SoftPosit directory>
   $ export SOPLEXPATH=<path to soplex-4.0.1 directory>
   $ export ICCPATH=<path to Intel oneAPI directory>

   * If you did not change the installation path while installing
   Intel compiler, then the path to Intel oneAPI directory will
   most likely end with "intel/oneapi"

(6.a) Download the PLDI2021 artifact from GitHub
$ git clone https://github.com/jpl169/PLDI2021Artifact
$ cd PLDI2021Artifact

(6.b) Or download the PLDI2021 artifact from Zenodo and untar the file:
   - Zenodo page: https://doi.org/10.5281/zenodo.4579410
   $ tar -xvf PLDI2021Artifact.tar
   $ cd PLDI2021Artifact

(7) Installation is complete. Go to EVALUATION GUIDE section.
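
Before moving on after a manual installation, it may help to confirm
that the three environment variables from step (5) are set. The
function below is a sketch of such a check; its name is hypothetical
and it only verifies that each variable names an existing directory,
not that the directory contains a working build.

```shell
# Hypothetical helper (not shipped with the artifact).
# Reports every variable from step (5) that is unset or is not a directory.
check_artifact_env() {
    status=0
    for name in SOFTPOSITPATH SOPLEXPATH ICCPATH; do
        # Indirect lookup of the variable called $name, in a POSIX sh compatible way.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set" >&2
            status=1
        elif [ ! -d "$value" ]; then
            echo "$name=$value is not a directory" >&2
            status=1
        fi
    done
    return "$status"
}
```

Running "check_artifact_env" prints one line per problem and returns a
non-zero status if any variable needs attention.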

_________________________________________________________________________________________________

* EVALUATION GUIDE FOR THE ARTIFACT

   If you have followed one of the two installation methods in
   the INSTALLATION GUIDE section, you should be in the
   PLDI2021Artifact directory. As mentioned before, we provide
   both a "short" version and a "long" version of the artifact
   evaluation. We highly recommend trying the "short" version, as
   the "long" version will take several days to complete. We
   first describe how to evaluate our artifact using the "short"
   version. Then we explain the "long" version.

_________________________________________________________________________________________________

* EVALUATION GUIDE (SHORT VERSION)

(1) Computing reduced intervals

   In this part, we will compute reduced intervals necessary to
   generate polynomials of sinh(x) and cosh(x) for float
   representation. To execute this step, run the command

$ ./runFloatIntervalGen_Short.sh

   This process should take roughly 30 minutes. Once it
   finishes, you will find a new folder "intervals" containing
   two files, FloatCoshData and FloatSinhData, which are
   needed for the next step.

(2) Generating polynomial (Corresponding to Table 2)

   In this part, we generate the polynomials based on the reduced
   intervals we computed in the previous step. To execute this
   step, run the command

   $ ./runFloatFunctionGen_Short.sh

   This process should take roughly 15-30 minutes. Once this
   process finishes, you will find two files,
   "Sinh.h" and "Cosh.h", in the directory
   include/float_headers. You can check these files against the
   reference header files in
   include-reference/float_headers. Although each value in the
   array may slightly differ, the size of the array should be
   exactly the same.
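
The array-size comparison can be approximated from the shell. The
sketch below ASSUMES that the generated headers store their
coefficients as comma-separated initializers, so that the number of
commas in a file tracks the number of array elements; this layout is
an assumption, not something the guide states, and the function names
are hypothetical. It also assumes the reference directory uses the
same file names.

```shell
# ASSUMPTION: the comma count is a proxy for the array size.
count_commas() {
    tr -cd ',' < "$1" | wc -c | tr -d ' '
}

# Succeeds when the generated header ($1) and the reference header ($2)
# contain the same number of commas, regardless of the values themselves.
same_array_size() {
    [ "$(count_commas "$1")" = "$(count_commas "$2")" ]
}
```

For example, "same_array_size include/float_headers/Sinh.h
include-reference/float_headers/Sinh.h && echo same" reports whether
the two files agree under this proxy. If the headers turn out to be
laid out differently, inspect them by hand instead.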

(3) Checking the correctness of RLibm-32 (Corresponding to Table 1)

   In this part, we check the correctness of RLibm-32 functions. To
   expedite the process, we test only one million uniformly
   distributed input points. Thus, the exact results and counts
   will differ from those in the paper. To check the
   correctness of math library functions for float, run the
   command

$ ./runFloatLibTest_reference_Short.sh

   This process should take roughly 1 minute. Once this process
   finishes, it will print out whether the RLibm-32, Glibc, and Intel
   math libraries produce correct results. For all functions, you
   should see the message: "RLibm-32 returns correct result for all
   inputs." You should also see that for a number of functions,
   Glibc and Intel's float math libraries do not produce correct
   results. Because we only test 1 million inputs, the
   number of inputs that produce wrong results will be different
   from the paper.

To check the correctness of math library functions for
posit32, run the command

$ ./runPosit32LibTest_reference_short.sh

This process should take roughly 1 minute.

   Again, for each function, you should see the message: "RLibm-32
   returns correct result for all inputs." Because we do not
   test all inputs, the output will also show that both Glibc and Intel's
   double libraries produce correct results, unlike what is
   reported in the paper.
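
If the test output is saved to a file, the success messages can be
counted rather than read by eye. This is a sketch only: the function
name is hypothetical, and it assumes the message quoted above is
printed on a single line, once per tested function.

```shell
# Hypothetical helper (not shipped with the artifact).
# Prints how many lines of the log file $1 contain the success message.
count_correct() {
    grep -c -F "RLibm-32 returns correct result for all inputs" "$1"
}
```

For example, save the output with
"./runFloatLibTest_reference_Short.sh 2>&1 | tee float_test.log" and
then run "count_correct float_test.log". Under the assumption above,
the count should equal the number of functions the script tested.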

(4) Speedup of RLibm-32 (Corresponding to Figure 3 and Line 1134)

   * NOTE: Since the submission of the paper, Intel has
   completely changed the toolkit for installing Intel's
   compiler. Thus, the speedup result may differ from what's
   reported in the paper. We will update the result in the final
   version accordingly.

   In this part, we check the speedup of RLibm-32 against the Glibc and
   Intel math libraries. To expedite the process, we again test one
   million uniformly distributed input points. To check the
   speed of RLibm-32's float functions, run the command

$ ./runFloatOverheadTest_reference_Short.sh

This process should take roughly 1 minute

   Once this process finishes, it will create two files,
   (a) floatAgainstGlibcShort.pdf
   (b) floatAgainstIntelShort.pdf

   These two graphs correspond to Figure 3(a) and Figure 3(b),
   respectively. Although the numbers may differ from the paper, the
   bars should still be above 1x speedup in most cases, and the
   average should be above 1x speedup in all cases.

To check the speedup of RLibm-32's posit32 functions, run the
command

$ ./runPosit32OverheadTest_reference_Short.sh

This process should take roughly 1 minute.

   Once this process finishes, it will print the speedup of
   RLibm-32 against Glibc and Intel's double math libraries. The
   speedup will differ from the paper because only 1 million
   points are tested, but it should be roughly 1x to 1.15x.
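
The short-version scripts can also be run back to back, stopping at the
first failure. The function below is a sketch of one way to do this; the
name run_in_order and the LOG variable are hypothetical. It assumes the
scripts are executable, are in the current directory, and signal
failure through a non-zero exit status.

```shell
# Hypothetical helper (not shipped with the artifact).
# Runs each named script from the current directory, in the order given.
# All output is appended to the file named by LOG (default: short_eval.log).
run_in_order() {
    log=${LOG:-short_eval.log}
    for script in "$@"; do
        echo "== $script started: $(date)" >> "$log"
        if ! "./$script" >> "$log" 2>&1; then
            echo "== $script FAILED" >> "$log"
            return 1
        fi
        echo "== $script finished: $(date)" >> "$log"
    done
}
```

For example, from the PLDI2021Artifact directory, steps (1) and (2)
can be chained with "run_in_order runFloatIntervalGen_Short.sh
runFloatFunctionGen_Short.sh"; the remaining scripts can be appended
in the order given above. Because the output goes to the log file,
read the correctness and speedup results from there.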

_________________________________________________________________________________________________

* EVALUATION GUIDE (LONG VERSION)

   * NOTE: Before beginning the long version, PLEASE understand
   that the whole process will take roughly 4 days or longer. If
   you do not wish to spend so much time, consider doing the
   SHORT version of the evaluation

(1) Computing reduced intervals for float functions

   This process computes the reduced intervals for all 10 of
   RLibm-32's float functions. To execute this phase, run the
   command

$ ./runFloatIntervalGen.sh

   This script will automatically run the underlying scripts in
   parallel. By default, the parallelism is 2. If you would like
   to increase the parallelism, use the "-j" flag:

$ ./runFloatIntervalGen.sh -j <# jobs>

However, since this phase is I/O heavy, we recommend limiting
parallelism to at most 4.

   This process will take roughly 12-13 hours. Once this process
   finishes, you will find a new folder "intervals"
   containing 10 files, which are needed to generate the
   polynomials for each float function.
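
A value for the "-j" flag can be derived from the machine's CPU count
while respecting the recommended cap. This is a sketch with a
hypothetical function name; it assumes "getconf _NPROCESSORS_ONLN" is
supported (it is on Linux with glibc) and falls back to 1 otherwise.

```shell
# Hypothetical helper (not shipped with the artifact).
# Prints min(number of online CPUs, $1), and never less than 1.
capped_jobs() {
    cap=$1
    cpus=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    # Guard against a non-numeric or zero answer from getconf.
    [ "$cpus" -ge 1 ] 2>/dev/null || cpus=1
    if [ "$cpus" -gt "$cap" ]; then
        echo "$cap"
    else
        echo "$cpus"
    fi
}
```

For example, "./runFloatIntervalGen.sh -j $(capped_jobs 4)" uses up to 4
jobs, and fewer on machines with fewer cores.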

(2) Generating polynomial for float functions (Corresponding to Table 2)

   In this part, we generate the polynomials based on the reduced
   intervals we computed in the previous step. To execute this
   step, run the command

   $ ./runFloatFunctionGen.sh

   This script will automatically run the underlying scripts in
   parallel. By default, the parallelism is 2. If you would like
   to increase the parallelism, use the "-j" flag:

$ ./runFloatFunctionGen.sh -j <# jobs>

However, since this phase is I/O heavy, we recommend limiting
parallelism to at most 4.

   This process should take roughly 5-6 hours. Once this process
   finishes, you will find header files in
   include/float_headers. You can check these files against the
   reference header files in
   include-reference/float_headers. Although each value in the
   array may slightly differ, the size of the array should be
   exactly the same.

(3) Computing reduced intervals for posit32 functions

This process computes the reduced intervals for all 3 of RLibm-32's
posit32 functions. To execute this phase, run the command

$ ./runPosit32IntervalGen.sh

   This script will automatically run the underlying scripts in
   parallel. By default, the parallelism is 2. If you would like
   to increase the parallelism, use the "-j" flag:

$ ./runPosit32IntervalGen.sh -j <# jobs>

However, since there are only 3 jobs in total, the maximum useful
value is 3.

   This process will take roughly 12-13 hours. Once this process
   finishes, you will find that the folder "intervals"
   contains 3 posit32 data files, which are needed
   to generate the polynomials for each posit32 function.

(4) Generating polynomial for posit32 functions (Corresponding to Table 2)

   In this part, we generate the polynomials based on the reduced
   intervals we computed in the previous step. To execute this
   step, run the command

   $ ./runPosit32FunctionGen.sh

   This script will automatically run the underlying scripts
   in parallel. By default, the parallelism is 3. If you would like
   to change the parallelism, use the "-j" flag:

$ ./runPosit32FunctionGen.sh -j <# jobs>

However, since there are only 3 jobs in total, the maximum useful
value is 3.

   This process should take roughly 1 hour. Once this
   process finishes, you will find that there are header files in
   include/posit32_headers. You can check the file against the
   reference header files in
   include-reference/posit32_headers. Although each value in the
   array may slightly differ, the size of the array should be
   exactly the same.

(5) Checking the correctness of RLibm-32 (Corresponding to Table 1)

   In this phase, we check the correctness of RLibm-32 functions.
   Depending on whether you generated the header files in
   steps (1)-(4), you should use different scripts.

(a) If you did generate the polynomials, you can check
correctness of the generated polynomials.

(b) If you did not generate the polynomials, you can check
correctness of the reference polynomial that we generated.

To check the correctness of the polynomials YOU GENERATED
for float, run the command:

$ ./runFloatLibTest.sh

To check the correctness of the REFERENCE polynomial for
float, run the command:

$ ./runFloatLibTest_reference.sh

   This script will automatically run the underlying scripts in
   parallel. By default, the parallelism is 4. If you would like
   to increase/decrease the parallelism, use the "-j" flag.

   This process should take roughly 12 hours. Once this process
   finishes, it will print out whether the RLibm-32, Glibc, and Intel
   math libraries produce correct results. For all functions, you
   should see the message: "RLibm-32 returns correct result for all
   inputs." You should also see that for a number of functions,
   Glibc and Intel's float math libraries do not produce correct
   results.

   * Note: Because our oracles for Sinpi and Cospi are
   exceptionally slow, we only check the results for inputs x = 0
   to 2.0 for Sinpi and Cospi. This is sufficient to extrapolate
   the results of Sinpi and Cospi to all other inputs, due to the
   properties of Sinpi and Cospi.

To check the correctness of the polynomials YOU GENERATED for
posit32, run the command:

$ ./runPosit32LibTest.sh

To check the correctness of the REFERENCE polynomial for
posit32, run the command:

$ ./runPosit32LibTest_reference.sh

   This script will automatically run the underlying scripts in
   parallel. By default, the parallelism is 3. If you would like
   to change the parallelism, use the "-j" flag. Note
   that the script only runs 3 jobs, so the maximum parallelism
   is 3.

   This process should take roughly 10 hours. Once this process
   finishes, it will print out whether the RLibm-32, Glibc, and Intel
   math libraries produce correct results. For all functions, you
   should see the message: "RLibm-32 returns correct result for all
   inputs." You should also see that for a number of functions,
   Glibc and Intel's double math libraries do not produce correct
   results.

(6) Speedup of RLibm-32 (Corresponding to Figure 3 and Line 1134)

   * NOTE: Since the submission of the paper, Intel has
   completely changed the toolkit for installing Intel's
   compiler. Thus, the speedup result may differ from what's
   reported in the paper. We will update the result in the final
   version accordingly.

   In this part, we check the speedup of RLibm-32 against the Intel and
   Glibc math libraries. Depending on whether you generated the
   header files in steps (1)-(4), you should use
   different scripts.

(a) If you did generate the polynomials, you can check speedup
of the generated polynomials.

(b) If you did not generate the polynomials, you can check
speedup of the reference polynomial that we generated.

To check the speedup of the polynomials YOU GENERATED for
float, run the command:

$ ./runFloatOverheadTest.sh

To check the speedup of the REFERENCE polynomials for float,
run the command:

$ ./runFloatOverheadTest_reference.sh

   This script will automatically run the underlying scripts in
   parallel. By default, the parallelism is 1. If you would like
   to change the parallelism, use the "-j" flag. We
   recommend keeping the parallelism low, at most roughly 2-4.

   This process should take roughly 5-6 hours. Once this process
   finishes, it will create two files,
   (a) floatAgainstGlibc.pdf
   (b) floatAgainstIntel.pdf

   These two graphs correspond to Figure 3(a) and Figure 3(b),
   respectively. Although the numbers may differ from the paper, the
   bars should still be above 1x speedup in most cases, and the
   average should be above 1x speedup in all cases.

To check the speedup of the polynomials YOU GENERATED for
posit32, run the command:

$ ./runPosit32OverheadTest.sh

To check the speedup of the REFERENCE polynomial for posit32,
run the command:

$ ./runPosit32OverheadTest_reference.sh

   This script will automatically run the underlying scripts in
   parallel. By default, the parallelism is 1. If you would like
   to change the parallelism, use the "-j" flag. We
   recommend keeping the parallelism low, at most roughly 2-3.

   This process should take roughly 2 hours. Once this process
   finishes, it will print the speedup of RLibm-32 against Glibc
   and Intel's double math libraries. The speedup may differ
   from what is reported in the paper, but it should be roughly
   1x to 1.15x.

Files (4.5 MB)

   Name: PLDI2021Artifact.tar
   Size: 4.5 MB
   md5:  75d344d386abb64832e7f14694ad5cc0
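
The MD5 digest listed for PLDI2021Artifact.tar can be used to check a
downloaded copy. The function below is a minimal sketch with a
hypothetical name; it assumes the GNU coreutils "md5sum" tool is
installed, as it is by default on Ubuntu.

```shell
# Hypothetical helper (not shipped with the artifact).
# Succeeds when the MD5 digest of file $1 equals the expected hex digest $2.
md5_matches() {
    # md5sum prints "<digest>  <file name>"; keep only the digest.
    actual=$(md5sum "$1" | awk '{ print $1 }')
    [ "$actual" = "$2" ]
}
```

For example: "md5_matches PLDI2021Artifact.tar
75d344d386abb64832e7f14694ad5cc0 && echo 'checksum OK'".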
