Published December 27, 2021 | Version v1
Conference paper Open

Efficient Software for Archiving and Retrieving Results of Massive Bioinformatics Analyses in High-Performance Computing Environments


Abstract—Modern sequencing and computational facilities in
biomedical and agricultural areas generate and analyze hundreds
or thousands of samples every day. At that scale, production
bioinformatics workflows can produce vast amounts of data, which
need to be managed: organized, deleted, moved, stored, and
retrieved. This can create an infrastructure bottleneck, especially
when moving or archiving data. The difficulty stems chiefly from
the structure of the file collections being produced: frequently a
very large number of files, with a highly nested directory structure,
and a heterogeneous distribution of file sizes, with an emphasis on
large numbers of very small files. Parallel file systems, such
as Lustre and GPFS, and tape archives can perform poorly under
these circumstances due to an overabundance of metadata. Packaging
the files into fewer, larger archives can reduce this overhead, but
standard packaging utilities, such as tar and zip, do not scale
well with the size of the data for this particular use case. The
present manuscript reviews several recently developed parallel
alternatives, showcasing their performance on a variety of
high-performance computing systems.



