Published December 14, 2020 | Version 1
Dataset Open

ATLAS Rucio Transfers Dataset

  • 1. UNLP
  • 2. CERN
  • 3. University of Wuppertal

Description

This dataset is released to encourage the study of ATLAS file transfers in the Worldwide LHC Computing Grid environment, to better understand the transfer processes in this particularly heterogeneous environment.

 

Joaquin Bogado (UNLP)

Mario Lassnig (CERN)

Fernando Monticelli (UNLP)

Thomas Beermann (University of Wuppertal)

Javier Díaz (UNLP)

2020-11-27

jbogado @ linti.unlp.edu.ar

Motivation

This dataset is released to encourage the study of ATLAS file transfers in the Worldwide LHC Computing Grid[1] environment, to better understand the transfer processes in this particularly heterogeneous environment.

Rucio[2] is a Distributed Data Management system. Data from the Rucio ATLAS instance from June and July 2019 was retrieved and summarized in the present dataset. The Rucio ATLAS instance is responsible for keeping track of the files of the ATLAS Experiment[3] at CERN. These files are stored around the world in 100+ data centers. To work with the files, physicists need to move them across sites. Rucio delegates the file transfers to another subsystem called FTS[4]. Rules in Rucio are groups of file transfers that are handled as a unit. For example, a physicist who needs a set of files for an analysis creates a rule specifying which files must be moved and where; once the rule is done, the analysis can start.

If the Rule Time To Complete (RTTC) can be predicted with sufficient accuracy, the Rucio system and the ATLAS Experiment could schedule transfers more intelligently, helping the experiment make better use of its resources and do more science, faster.

State of the art

The metric used to calculate the accuracy is the Fraction of Good Predictions (FoGP). Formally, the FoGP is defined as in the equation that follows

$$\mathrm{FoGP}(y, \hat{y}, \tau) = \frac{1}{n} \sum_{i=1}^{n} g(y_i, \hat{y}_i, \tau)$$

Here $y$ is the vector of observations, $\hat{y}$ is the vector of predictions, $n$ is the number of predictions, and $g$ is a function that returns 1 if the relative error $|y_i - \hat{y}_i| / y_i < \tau$ and 0 otherwise. For example, $\mathrm{FoGP}(y, \hat{y}, \tau = 0.1) = 0.5$ for a group of predictions means that 50% of them have a relative error below 10%. This easy-to-understand metric allows models to be compared directly, independently of their implementation, by focusing only on the predictions they make.
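The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `fogp` and the example values are hypothetical, and the handling of a zero observation (counted as not good, since the relative error is undefined) is an assumption that the dataset description does not address.

```python
from typing import Sequence


def fogp(y_true: Sequence[float], y_pred: Sequence[float], tau: float = 0.1) -> float:
    """Fraction of predictions whose relative error is strictly below tau."""
    if len(y_true) != len(y_pred):
        raise ValueError("observations and predictions must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one observation is required")
    good = 0
    for obs, pred in zip(y_true, y_pred):
        # Assumption: a zero (or negative) observation is counted as "not good",
        # because the relative error is undefined in that case.
        if obs > 0 and abs(obs - pred) / obs < tau:
            good += 1
    return good / len(y_true)


# Made-up RTTC values in seconds, for illustration only.
observed = [100.0, 200.0, 50.0, 400.0]
predicted = [105.0, 150.0, 60.0, 410.0]
print(fogp(observed, predicted, tau=0.1))  # 0.5
```

In this example, two of the four predictions (105 vs. 100 and 410 vs. 400) fall within 10% of the observation, so the result is 0.5.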

We estimate that a model needs a FoGP(τ = 0.1) > 0.95 to be useful. However, no known model can predict the RTTC at rule creation time with such high accuracy. Models with FoGP(τ = 0.1) ≈ 0.5 could still be useful for giving users feedback on how long their transfers will take. The best known models have a FoGP(τ = 0.1) = 0.14.

Fields description

account

The hashed account name of the user that issued the transfer. This field has been anonymized and does not reveal the real name of the user in the system.

state

The final state of the transfer. 'D' means the transfer is done, 'F' means the transfer has failed. Other states represent internal states from Rucio and are not important. Very few transfers showed other states than D or F.
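Because the CSV file is large (7.2 GB), it may be more practical to stream it row by row than to load it whole. The sketch below counts final states that way. It assumes the file is comma-delimited with a header row whose column names match the field names described here; the function name `count_states` and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def count_states(lines):
    """Count final transfer states from an iterable of CSV text lines."""
    reader = csv.DictReader(lines)  # assumption: first line is a header row
    counts = Counter()
    for row in reader:
        counts[row["state"]] += 1
    return counts


# Tiny in-memory stand-in for the real file (values are made up).
sample = io.StringIO(
    "id,state,SIZE\n"
    "t1,D,1000\n"
    "t2,F,2000\n"
    "t3,D,500\n"
)
counts = count_states(sample)
print(counts["D"], counts["F"])  # 2 1
```

For the real data, the same function could be given an open file handle, for example one obtained with `open("transfers-20190606-20190731-anonymized.csv", newline="")`.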

activity

The activity of the transfer. It is related to the priority of the transfer inside the system. Priorities are based on shares and related to the 'share' field. Because transfer requests are queued in Rucio and in FTS, each transfer is picked to be served with a probability equal to its share among all the transfers that are in the queue at that time.

SIZE

The size in bytes of the file to be transferred.

src_rse/dst_rse

The source/destination Rucio Storage Element (RSE). An RSE is a logical unit inside Rucio that represents a dedicated storage location at a data center. An RSE usually comprises more than one physical machine, but Rucio doesn't know how many storage nodes compose it, so the RSE is the minimum logical unit of storage for the system. Both fields have been anonymized.

id

The unique identifier of a transfer request. If a transfer needs to be retried, the next attempt will have a different id. 

previous_attempt_id

If the transfer request is a retry, the id of the previous attempt is filled in. Otherwise, this field is empty.

retry_count

This is the number of times a transfer has been retried. If it is the first attempt, the field is 0. 
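The id and previous_attempt_id fields make it possible to reconstruct the sequence of attempts for a transfer. The following is a minimal sketch, assuming the rows have already been indexed by id in a dictionary and that an empty previous_attempt_id is read as an empty string; the function name `attempt_chain` and the ids "a1" to "a3" are hypothetical.

```python
def attempt_chain(transfers_by_id, transfer_id):
    """Return attempt ids ordered from the first attempt up to transfer_id."""
    chain = []
    seen = set()  # guards against accidental cycles in malformed data
    current = transfer_id
    while current and current in transfers_by_id and current not in seen:
        seen.add(current)
        chain.append(current)
        # Assumption: an empty string marks "no previous attempt".
        current = transfers_by_id[current]["previous_attempt_id"]
    chain.reverse()
    return chain


# Made-up rows: a3 retries a2, which retries a1.
transfers_by_id = {
    "a1": {"previous_attempt_id": "", "retry_count": 0},
    "a2": {"previous_attempt_id": "a1", "retry_count": 1},
    "a3": {"previous_attempt_id": "a2", "retry_count": 2},
}
print(attempt_chain(transfers_by_id, "a3"))  # ['a1', 'a2', 'a3']
```

If a previous attempt is missing from the index (for instance because it falls outside the time window of the dataset), the walk simply stops at the earliest attempt that is present.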

rule_id

This is the id of the rule the transfer belongs to. All the transfers in the same rule share the same rule_id.

external_host

This is the hash of the FTS server that triggers the actual transfer of files between sites. There are several FTS servers, and some are shared with other experiments outside ATLAS. The server with hash fe1d4db902b6271 is known to be used exclusively by the ATLAS Experiment, so it can be a good place to start.

RTIME

This is the time in seconds the transfer spends in the Rucio system, from its creation at the created timestamp until it is submitted to FTS at the submitted timestamp. It can be calculated as submitted - created. This value is not available to the system until the transfer is submitted, that is, until the submitted timestamp.

QTIME

This is the time in seconds the transfer spends in the FTS system, from its submission by Rucio at the submitted timestamp until it starts its network time at the started timestamp. It can be calculated as started - submitted. This value is not available to the system until the transfer ends, that is, until the ended timestamp, because FTS does not propagate the started time of a transfer immediately, but only once the transfer ends or fails.

NTIME

This is the actual time in seconds during which the file is being transferred over the network, from the moment FTS starts the transfer at the started timestamp until the transfer ends at the ended timestamp. It can be calculated as ended - started. This value is not available to the system until the transfer ends, that is, until the ended timestamp.

RATE

This is the average rate in bytes per second of each transfer. It is calculated as SIZE/NTIME and is not available till the transfer ends.
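The relations among the four timestamps and the derived fields RTIME, QTIME, NTIME and RATE can be sketched as follows. The timestamp format `%Y-%m-%d %H:%M:%S` is an assumption and should be checked against the actual file; the function name `derive_times` and the example row are hypothetical. Since timestamps have a resolution of 1 second, a computed NTIME of 0 seems possible, in which case this sketch leaves RATE undefined.

```python
from datetime import datetime

# Assumption: timestamps are stored in this textual format.
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def derive_times(row):
    """Recompute RTIME, QTIME, NTIME (seconds) and RATE (bytes/s) for one row."""
    created = datetime.strptime(row["created"], TS_FORMAT)
    submitted = datetime.strptime(row["submitted"], TS_FORMAT)
    started = datetime.strptime(row["started"], TS_FORMAT)
    ended = datetime.strptime(row["ended"], TS_FORMAT)

    rtime = (submitted - created).total_seconds()  # time spent in Rucio
    qtime = (started - submitted).total_seconds()  # time queued in FTS
    ntime = (ended - started).total_seconds()      # time on the network
    # RATE = SIZE / NTIME; undefined when NTIME is 0.
    rate = int(row["SIZE"]) / ntime if ntime > 0 else None
    return {"RTIME": rtime, "QTIME": qtime, "NTIME": ntime, "RATE": rate}


# Made-up row for illustration.
example_row = {
    "SIZE": "2000000000",
    "created": "2019-06-06 10:00:00",
    "submitted": "2019-06-06 10:05:00",
    "started": "2019-06-06 10:07:30",
    "ended": "2019-06-06 10:08:10",
}
derived = derive_times(example_row)
print(derived)
```

For this row the transfer spends 300 s in Rucio, 150 s in the FTS queue and 40 s on the network, giving an average rate of 50,000,000 bytes per second.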

link

This is the hash that represents a source/destination RSE pair. Links have peculiarities that make them unique, and likely affect the RTTC, e.g., some links have higher bandwidth, or the disks of the associated storages in the respective source and destination RSEs are faster than the ones on other links. 

created

This is the time at which a transfer request is created in Rucio. For all the transfers that share the same rule_id, the minimum created timestamp is also the rule creation time, which is the moment at which we want to predict the RTTC. All timestamps have a resolution of 1 second.

submitted

This is the time at which the transfer request is submitted from Rucio to FTS. 

started

This is the time at which the transfer request starts the actual transfer, using the network. This data will not be known until the transfer ends because FTS doesn't publish this data immediately but only once the transfer ends.

ended

This is the time at which the transfer ends. For all the transfers that share the same rule_id, the maximum ended timestamp is also the ending time of the rule. 

share

This is a number between 0 and 1 that represents the weighted probability that a transfer is picked to be served, given its activity.

Target

The target of the study is to predict the Rule Time To Complete (RTTC) at the creation time of the rule. The creation time of the rule is the minimum created timestamp among the transfers that share the same rule_id. The ending time of the rule is the maximum ended timestamp among those same transfers. The RTTC is computed as the ending time of the rule minus its creation time.
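The RTTC computation described above can be sketched with the standard library only. The timestamp format is an assumption to be checked against the actual file, and the function name `rule_time_to_complete` and the sample rows are hypothetical. The sketch does not filter by state, since the description does not say whether failed transfers should be excluded.

```python
from datetime import datetime

# Assumption: timestamps are stored in this textual format.
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def rule_time_to_complete(transfers):
    """Map each rule_id to its RTTC in seconds: max(ended) - min(created)."""
    first_created = {}
    last_ended = {}
    for t in transfers:
        rule_id = t["rule_id"]
        created = datetime.strptime(t["created"], TS_FORMAT)
        ended = datetime.strptime(t["ended"], TS_FORMAT)
        if rule_id not in first_created or created < first_created[rule_id]:
            first_created[rule_id] = created
        if rule_id not in last_ended or ended > last_ended[rule_id]:
            last_ended[rule_id] = ended
    return {
        rule_id: (last_ended[rule_id] - first_created[rule_id]).total_seconds()
        for rule_id in first_created
    }


# Made-up transfers belonging to two rules.
transfers = [
    {"rule_id": "r1", "created": "2019-06-06 10:00:00", "ended": "2019-06-06 10:10:00"},
    {"rule_id": "r1", "created": "2019-06-06 10:00:05", "ended": "2019-06-06 10:20:00"},
    {"rule_id": "r2", "created": "2019-06-06 11:00:00", "ended": "2019-06-06 11:00:30"},
]
rttc = rule_time_to_complete(transfers)
print(rttc)  # {'r1': 1200.0, 'r2': 30.0}
```

Because the function keeps only two timestamps per rule, it can be fed rows streamed from the CSV file without holding all transfers in memory.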

 

 

References

 

  1. Worldwide LHC Computing Grid. https://wlcg.web.cern.ch/ Retrieved 23/11/2020

  2. Rucio Scientific Data Management. https://rucio.cern.ch/ Retrieved 23/11/2020

  3. The ATLAS Experiment. https://atlas.cern/ Retrieved 23/11/2020

  4. File Transfer Service. https://fts.web.cern.ch/fts/ Retrieved 23/11/2020

Notes

The dataset description is available at the following link: https://docs.google.com/document/d/1pZwob0LXwMZGMuiw7-2az0L4ALzD6PgkV9I9OJV20JI/edit?usp=sharing

Files

transfers-20190606-20190731-anonymized.csv

Files (7.2 GB)

md5:75aab02195c1870cb461aa70eea0e742 (626.7 kB)
md5:6404753ef2878c2b555a8b66e04d8abd (7.2 GB)
