Python file loading time comparison across filetypes
Description
A quick test of loading times of various file formats that can be used to store image data in B x N x M shape (B: number of bands, N, M: number of pixels on both axes). The Jupyter Notebook used to obtain these results is provided.
Methods
I used Python 3.6.15 with the following packages:
- astropy 4.1
- numpy 1.19.5
- h5py 3.1.0
- hdf5 1.10.6
- matplotlib 3.3.4
- tqdm 4.62.3
- _pickle (version packaged with Python 3.6.15)
The tests were run on Fedora 35 with an Intel® Xeon® W-1250 CPU @ 3.30 GHz × 6 and 16 GB of RAM, with the data stored on 3 disks (HGST WD Ultrastar HUS726T4TALE6L4) in a RAID 5 configuration.
Images are generated with shape (3, 64, 64), with pixel values drawn from a normal distribution. Two tests are run, each on 1000 images in total. In the first test, the images are saved individually in each of the four file formats in question (.fits, .npy, .h5 and .pkl) and then read back one by one. np.mean() is applied to each loaded array so that the data is actually read, which prevents memory caching or lazy loading from skewing the timing. In the second test, the images are saved in batches of 64, so that each file holds 64 images of shape (3, 64, 64). This yields the second plot.
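The procedure described above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebook: the helper names (`FORMATS`, `benchmark`, the `save_*`/`load_*` functions) and the dataset name `"data"` are assumptions made for this example, and the file counts are kept small so that it runs quickly. The .fits and .h5 cases are only included if astropy and h5py are installed.

```python
import pickle
import tempfile
import time
from pathlib import Path

import numpy as np

# Optional backends: skipped if the package is not installed.
try:
    from astropy.io import fits
except ImportError:
    fits = None
try:
    import h5py
except ImportError:
    h5py = None


def save_npy(path, data):
    np.save(str(path), data)

def load_npy(path):
    return np.load(str(path))

def save_pkl(path, data):
    with open(path, "wb") as f:
        pickle.dump(data, f)

def load_pkl(path):
    with open(path, "rb") as f:
        return pickle.load(f)

def save_fits(path, data):
    fits.PrimaryHDU(data).writeto(str(path), overwrite=True)

def load_fits(path):
    # memmap=False so the array is fully read before the file is closed.
    with fits.open(str(path), memmap=False) as hdul:
        return hdul[0].data

def save_h5(path, data):
    with h5py.File(str(path), "w") as f:
        f.create_dataset("data", data=data)  # "data" is a hypothetical name

def load_h5(path):
    with h5py.File(str(path), "r") as f:
        return f["data"][()]


# Hypothetical registry: extension -> (save function, load function).
FORMATS = {"npy": (save_npy, load_npy), "pkl": (save_pkl, load_pkl)}
if fits is not None:
    FORMATS["fits"] = (save_fits, load_fits)
if h5py is not None:
    FORMATS["h5"] = (save_h5, load_h5)


def benchmark(shape, n_files, seed=0):
    """Return the mean time in ms to load one file and apply np.mean."""
    rng = np.random.default_rng(seed)
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for ext, (save, load) in FORMATS.items():
            paths = [Path(tmp) / f"img_{i}.{ext}" for i in range(n_files)]
            for p in paths:
                save(p, rng.normal(size=shape))
            start = time.perf_counter()
            for p in paths:
                np.mean(load(p))  # forces the data to actually be read
            results[ext] = (time.perf_counter() - start) / n_files * 1e3
    return results


# Test 1: one (3, 64, 64) image per file.
single = benchmark((3, 64, 64), n_files=20)
# Test 2: 64 images per file; divide by the batch size to get time per image.
batch_size = 64
batched = {ext: t / batch_size
           for ext, t in benchmark((batch_size, 3, 64, 64), n_files=4).items()}

for ext in FORMATS:
    print(f".{ext}: {single[ext]:.3f} ms single, {batched[ext]:.3f} ms batched")
```

The timings printed by this sketch depend on the machine and on disk caching, so they are not expected to match the numbers reported below.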
Results
When saving images individually, the .pkl files are loaded the quickest, at 0.077 ms per file for loading + running np.mean. .npy files take 3.0 times longer, .h5 files 6.1 times longer and .fits files 7.7 times longer. This changes when loading the batched files. Loading + applying np.mean is then fastest with .npy files at 0.034 ms per (3, 64, 64) data unit (the loading time of the batched file divided by the batch size), then .fits at 1.4 times longer, .pkl at 2.0 times longer and .h5 at 2.1 times longer.
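To make the relative factors easier to compare, they can be converted into approximate absolute times per image. The short sketch below does this arithmetic, under the assumption that "N times longer" means N times the fastest time. Since the factors are rounded, the resulting values are only approximate. The variable and function names are chosen for this example.

```python
# Fastest time per image in each test (ms), as reported above.
single_base_ms = 0.077   # .pkl, one image per file
batched_base_ms = 0.034  # .npy, 64 images per file

# Reported factors relative to the fastest format in each test.
single_factors = {".pkl": 1.0, ".npy": 3.0, ".h5": 6.1, ".fits": 7.7}
batched_factors = {".npy": 1.0, ".fits": 1.4, ".pkl": 2.0, ".h5": 2.1}


def absolute_times(base_ms, factors):
    """Approximate time per image in ms: base time multiplied by the factor."""
    return {ext: round(base_ms * factor, 3) for ext, factor in factors.items()}


single_ms = absolute_times(single_base_ms, single_factors)
batched_ms = absolute_times(batched_base_ms, batched_factors)

for ext in single_factors:
    gain = single_ms[ext] / batched_ms[ext]
    print(f"{ext}: {single_ms[ext]} ms single, {batched_ms[ext]} ms batched, "
          f"about {gain:.1f}x faster per image when batched")
```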
Conclusion
The .fits format, widely used in astronomy, has a long loading time for individual files, most likely due to the overhead caused by reading the header. It is, however, one of the fastest file formats when loading images that were saved in batches. This should therefore be considered when storing large numbers of images.