ATLAS Rucio Transfers Dataset
- 1. UNLP
- 2. CERN
- 3. University of Wuppertal
Description
This dataset is released to encourage the study of ATLAS file transfers in the Worldwide LHC Computing Grid environment, to better understand the transfer processes in this particularly heterogeneous environment.
Joaquin Bogado (UNLP)
Mario Lassnig (CERN)
Fernando Monticelli (UNLP)
Thomas Beermann (University of Wuppertal)
Javier Díaz (UNLP)
2020-11-27
jbogado @ linti.unlp.edu.ar
Motivation
This dataset is released to encourage the study of ATLAS file transfers in the Worldwide LHC Computing Grid[1] environment, to better understand the transfer processes in this particularly heterogeneous environment.
Rucio[2] is a distributed data management system. Data from the Rucio ATLAS instance from June and July 2019 were retrieved and summarized in the present dataset. The Rucio ATLAS instance is responsible for keeping track of the files of the ATLAS Experiment[3] at CERN. These files are stored around the world in more than 100 data centers. To work with the files, physicists need to move them across sites. Rucio delegates the file transfers to another subsystem called FTS[4]. A rule in Rucio is a group of file transfers that are carried out as a unit. For example, a physicist who needs a set of files for an analysis creates a rule specifying which files need to be moved and where; once the rule is done, the analysis can start.
If the Rule Time To Complete (RTTC) can be predicted with sufficient accuracy, the Rucio system and the ATLAS Experiment will be able to schedule transfers in a smarter way, eventually helping to optimize the experiment's resources and to do more and faster science.
State of the art
The metric used to calculate the accuracy is the Fraction of Good Predictions (FoGP). Formally, the FoGP is defined as
$$\mathrm{FoGP}(y, \hat{y}, \tau) = \frac{1}{n} \sum_{i=1}^{n} g(y_i, \hat{y}_i, \tau)$$
where $y$ is the vector of observations, $\hat{y}$ is the vector of predictions, $n$ is the number of predictions, and $g$ is a function that returns 1 if the relative error $|y_i - \hat{y}_i| / y_i < \tau$, and 0 otherwise. For example, if $\mathrm{FoGP}(y, \hat{y}, \tau = 0.1) = 0.5$ for a group of predictions, then 50% of the predictions have a relative error below 10%. This easy-to-understand metric allows models to be compared directly, independently of their implementation, focusing only on the predictions each model makes.
We estimate that a model needs a FoGP(τ = 0.1) > 0.95 to be useful. However, no known model can predict the RTTC at rule creation time with such high accuracy. Models with FoGP(τ = 0.1) ≈ 0.5 could still be useful to give users feedback about how long their transfers will take. The best known models have a FoGP(τ = 0.1) = 0.14.
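The FoGP definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `fogp` and the sample values are hypothetical, and the sketch assumes every observation is strictly positive, since the relative error is undefined for an observation of 0.

```python
def fogp(observed, predicted, tau=0.1):
    """Fraction of predictions whose relative error is strictly below tau."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("at least one observation is required")
    # Assumption: every observed value is > 0, otherwise the relative
    # error |y - y_hat| / y is undefined.
    good = sum(
        1
        for y, y_hat in zip(observed, predicted)
        if abs(y - y_hat) / y < tau
    )
    return good / len(observed)


# Hypothetical RTTC values in seconds, for illustration only.
observed = [100.0, 200.0, 400.0, 800.0]
predicted = [105.0, 260.0, 390.0, 500.0]

# Relative errors are 0.05, 0.30, 0.025 and 0.375, so two of four are good.
print(fogp(observed, predicted, tau=0.1))  # 0.5
```

Because the function only needs the observations and the predictions, it can be applied to the output of any model without knowing how the model was built.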
Fields description
account
The hashed account name from the user that issued the transfer. This data has been anonymized and does not represent the real name of the user in the system.
state
The final state of the transfer. 'D' means the transfer is done; 'F' means the transfer has failed. Other states are internal to Rucio and are not relevant here. Very few transfers show states other than D or F.
activity
The activity of the transfer, which is related to the priority of the transfer inside the system. Priorities are based on shares and are related to the 'share' field. As transfer requests are queued in Rucio and in FTS, each transfer is picked to be served with a probability equal to its share among all the transfers in the queue at that time.
SIZE
The size in bytes of the file to be transferred.
src_rse/dst_rse
The source/destination Rucio Storage Element (RSE). An RSE is a logical unit inside Rucio that represents a dedicated storage location in a data center. Usually an RSE consists of more than one physical machine. Rucio doesn't know how many storage nodes make up an RSE, so the RSE is the minimum logical unit of storage for the system. Both fields have been anonymized.
id
The unique identifier of a transfer request. If a transfer needs to be retried, the next attempt will have a different id.
previous_attempt_id
If the transfer request is a retry, the id of the previous attempt is filled in. Otherwise, this field is empty.
retry_count
This is the number of times a transfer has been retried. If it is the first attempt, the field is 0.
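The id, previous_attempt_id and retry_count fields together make it possible to reconstruct the sequence of attempts for a transfer. The sketch below follows previous_attempt_id backwards from a given attempt. The function name `attempt_chain`, the dictionary layout and the sample ids are hypothetical, and it is an assumption that an empty previous_attempt_id is read as an empty string.

```python
def attempt_chain(transfers_by_id, transfer_id):
    """Return the ids of all known attempts, oldest first, ending at transfer_id."""
    chain = []
    seen = set()  # guards against accidental cycles in malformed data
    current = transfer_id
    while current and current in transfers_by_id and current not in seen:
        seen.add(current)
        chain.append(current)
        # Assumption: an empty string (or None) marks a first attempt.
        current = transfers_by_id[current].get("previous_attempt_id")
    chain.reverse()
    return chain


# Hypothetical rows keyed by id, for illustration only.
transfers_by_id = {
    "a1": {"previous_attempt_id": "", "retry_count": 0},
    "b2": {"previous_attempt_id": "a1", "retry_count": 1},
    "c3": {"previous_attempt_id": "b2", "retry_count": 2},
}

print(attempt_chain(transfers_by_id, "c3"))  # ['a1', 'b2', 'c3']
```

If an earlier attempt falls outside the time window covered by the dataset, the chain simply stops at the oldest attempt that is present.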
rule_id
This is the id of the rule the transfer belongs to. All the transfers in the same rule share the same rule_id.
external_host
This is the hash of the FTS server that will trigger the actual transfer of files between the RSEs. There are several FTS servers, and some are shared with experiments other than ATLAS. The server with hash fe1d4db902b6271 is known to be used exclusively by the ATLAS Experiment, so it can be a good place to start.
RTIME
This is the time in seconds the transfer spends in the Rucio system, from its creation at the created timestamp until it is submitted to the FTS system at the submitted timestamp. It can be calculated as submitted - created. This value is not available to the system until the transfer is submitted, that is, until the submitted timestamp.
QTIME
This is the time in seconds the transfer spends in the FTS system, from its submission by Rucio at the submitted timestamp until the transfer starts its network time at the started timestamp. It can be calculated as started - submitted. This value is not available to the system until the transfer ends, that is, until the ended timestamp, because FTS does not propagate the started time of a transfer immediately, but only once the transfer ends or fails.
NTIME
This is the actual time in seconds during which the file is being transferred over the network, from the moment FTS starts the transfer at the started timestamp until the transfer ends at the ended timestamp. It can be calculated as ended - started. This value is not available to the system until the transfer ends, that is, until the ended timestamp.
RATE
This is the average rate in bytes per second of each transfer. It is calculated as SIZE/NTIME and is not available till the transfer ends.
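The derived fields RTIME, QTIME, NTIME and RATE can be recomputed from the created, submitted, started and ended timestamps and the file size, as the following sketch shows. The function name `transfer_times` and the sample values are hypothetical, and the timestamps are assumed to be already parsed into `datetime` objects. Returning `None` for the rate when NTIME is 0 is a choice made for this sketch, not a statement about how the dataset encodes that case.

```python
from datetime import datetime


def transfer_times(created, submitted, started, ended, size):
    """Recompute RTIME, QTIME, NTIME (seconds) and RATE (bytes per second)."""
    rtime = (submitted - created).total_seconds()  # time spent in Rucio
    qtime = (started - submitted).total_seconds()  # time queued in FTS
    ntime = (ended - started).total_seconds()      # time on the network
    # Timestamps have a 1-second resolution, so NTIME can be 0 for small
    # files; the rate is left undefined in that case (sketch assumption).
    rate = size / ntime if ntime > 0 else None
    return {"RTIME": rtime, "QTIME": qtime, "NTIME": ntime, "RATE": rate}


# Hypothetical transfer of a 2 GB file, for illustration only.
result = transfer_times(
    created=datetime(2019, 6, 6, 12, 0, 0),
    submitted=datetime(2019, 6, 6, 12, 0, 30),
    started=datetime(2019, 6, 6, 12, 5, 30),
    ended=datetime(2019, 6, 6, 12, 6, 10),
    size=2_000_000_000,
)
print(result)
# {'RTIME': 30.0, 'QTIME': 300.0, 'NTIME': 40.0, 'RATE': 50000000.0}
```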
link
This is the hash that represents a source/destination RSE pair. Links have peculiarities that make them unique, and likely affect the RTTC, e.g., some links have higher bandwidth, or the disks of the associated storages in the respective source and destination RSEs are faster than the ones on other links.
created
This is the time at which a transfer request is created in Rucio. For all the transfers that share the same rule_id, the minimum created timestamp is also the rule creation time, at which we want to know the RTTC. All date timestamps have a resolution of 1 second.
submitted
This is the time at which the transfer request is submitted from Rucio to FTS.
started
This is the time at which the transfer request starts the actual transfer, using the network. This data will not be known until the transfer ends because FTS doesn't publish this data immediately but only once the transfer ends.
ended
This is the time at which the transfer ends. For all the transfers that share the same rule_id, the maximum ended timestamp is also the ending time of the rule.
share
This is a number between 0 and 1 that represents the weighted probability that a transfer is picked to be served, given its activity.
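One way to read the share-based selection described for the activity and share fields is as weighted random sampling over the queue. The sketch below is only an interpretation under that assumption and does not reproduce the actual Rucio or FTS scheduler. The function name `pick_next`, the activity labels and the share values are hypothetical.

```python
import random
from collections import Counter


def pick_next(queue, rng=random):
    """Pick one queued transfer with probability proportional to its share."""
    weights = [transfer["share"] for transfer in queue]
    # random.choices normalizes the weights over the current queue.
    return rng.choices(queue, weights=weights, k=1)[0]


# Hypothetical queue with two activities, for illustration only.
queue = [
    {"id": "t1", "activity": "activity_a", "share": 0.8},
    {"id": "t2", "activity": "activity_b", "share": 0.2},
]

rng = random.Random(0)  # seeded so that the run is repeatable
picks = Counter(pick_next(queue, rng)["id"] for _ in range(10_000))
print(picks)
```

Over many draws, the transfer with share 0.8 should be selected roughly four times as often as the one with share 0.2.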
Target
The target of the study is to predict the Rule Time To Complete (RTTC) at the creation time of the rule. The creation time of the rule is the minimum created timestamp among the transfers that share the same rule_id. The ending time of the rule is the maximum ended timestamp among those same transfers. The RTTC is then computed as the ending time of the rule minus the starting (creation) time of the rule.
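The RTTC computation described above can be sketched as a grouping by rule_id. The function name `rule_time_to_complete` and the sample rows are hypothetical, and the timestamps are assumed to be already parsed into `datetime` objects. The sketch uses every row of a rule, including retries and failed attempts, which is an assumption about how the target should be built.

```python
from datetime import datetime


def rule_time_to_complete(transfers):
    """Return {rule_id: RTTC in seconds} for an iterable of transfer rows."""
    first_created = {}
    last_ended = {}
    for row in transfers:
        rule_id = row["rule_id"]
        created, ended = row["created"], row["ended"]
        # Rule creation time: minimum created timestamp within the rule.
        if rule_id not in first_created or created < first_created[rule_id]:
            first_created[rule_id] = created
        # Rule ending time: maximum ended timestamp within the rule.
        if rule_id not in last_ended or ended > last_ended[rule_id]:
            last_ended[rule_id] = ended
    return {
        rule_id: (last_ended[rule_id] - first_created[rule_id]).total_seconds()
        for rule_id in first_created
    }


# Hypothetical rows, for illustration only.
rows = [
    {"rule_id": "r1", "created": datetime(2019, 6, 6, 12, 0, 0),
     "ended": datetime(2019, 6, 6, 12, 10, 0)},
    {"rule_id": "r1", "created": datetime(2019, 6, 6, 12, 0, 5),
     "ended": datetime(2019, 6, 6, 12, 30, 0)},
    {"rule_id": "r2", "created": datetime(2019, 6, 7, 8, 0, 0),
     "ended": datetime(2019, 6, 7, 8, 0, 45)},
]

print(rule_time_to_complete(rows))  # {'r1': 1800.0, 'r2': 45.0}
```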
References
[1] Worldwide LHC Computing Grid. https://wlcg.web.cern.ch/ Retrieved 23/11/2020
[2] Rucio Scientific Data Management. https://rucio.cern.ch/ Retrieved 23/11/2020
[3] The ATLAS Experiment. https://atlas.cern/ Retrieved 23/11/2020
[4] File Transfer Service. https://fts.web.cern.ch/fts/ Retrieved 23/11/2020
Notes
Files
transfers-20190606-20190731-anonymized.csv
Files (7.2 GB)
Size | MD5 checksum |
---|---|
626.7 kB | 75aab02195c1870cb461aa70eea0e742 |
7.2 GB | 6404753ef2878c2b555a8b66e04d8abd |