ATLAS VRA v1 - Training Data and Code
Description
Quick Summary
The Virtual Research Assistant is a bot (or set of bots) that helps ATLAS eyeballers by ordering and prioritising the alerts in the eyeball list, removing the crappiest objects, and sending automatic triggers for transients within 100 Mpc to be followed up with the Mookodi Telescope.
This is the first public release of the data and code used to train the models that power the VRA.
Who is this repo for and how to use it
General sign-posting
- [D] Devs - for reproducibility and book-keeping.
- [U] Users (eyeballers) who want to understand the models and their limitations.
- [S] Scientists who want to understand the method.
 
For each resource we flag which type of user we think will benefit from it, using the abbreviations [D/U/S].
Figures from the ATLAS VRA Paper
If you are looking for specific figures from the paper here are the notebooks that created them:
- ./Crabby/data/Summary_plots.ipynb: Figures 1, 2, 3, 6
- ./Duck/data/Summary_plots.ipynb: Figures 5, 18, 19, 20, 21, 22, 23, 24
- ./Duck1.1/Overview.ipynb: Figures 7, 8, 9, 10, 11
- ./Duck1.1/Interpreting_AT2024lwd.ipynb: Figures 13, 25
- ./Duck1.1/Key_transients.ipynb: Figure 14
- ./Duck1.1/Policy_evaluation.ipynb: Figures 15, 16
Requirements
matplotlib, numpy, pandas, scikit-learn, joblib, atlasvras, atlasapiclient
Note: All the notebooks call a matplotlib style which is not released here (vra.mplstyle or vra_light.mplstyle). If you want to actually run the notebooks, replace this style choice in the first cell with your own style file, or comment it out.
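For reference, here is a minimal way to deal with the missing style file in the first cell of each notebook (the exact call in the notebooks may differ slightly):

```python
import matplotlib.pyplot as plt

# The notebooks load a custom style that is not shipped with this archive,
# typically something like:
# plt.style.use('vra.mplstyle')
# Either comment that line out or point matplotlib at a style you do have:
plt.style.use('default')              # fall back to the matplotlib default
# plt.style.use('my_style.mplstyle')  # ...or your own style file
```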
[/!\] Notebooks marked with this warning sign cannot be run without access to the raw JSON files, which would require you to download nearly 40 GB of data (once unzipped).
------------------------------------------------------------------------------------------------------------------------
Contents
Raw JSON data
/!\ LARGE SIZE - 40 GB unzipped /!\ Unless you have a very good reason to download this we don't recommend it. Even if you want the data for a specific transient (or group of transients) that is part of this release, you can get the data from the ATLAS Transient Server through the ATLAS API without having to download all of this.
- [D] ``json_files``: contains the JSON files for the ``Crabby`` data
- [D] ``data_objects_with_decision_NEWEYEBALLLIST``: contains the JSON files for the extra data added to ``Duck``
Note: if you are here for raw data, these files do not represent the full range of properties of the data we see in our complete stream. They are data for objects that made all cuts in the data processing and would have been shown to eyeballers. They are solely intended to train a model that works downstream of previous automation steps.
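If you do download the raw JSON files, they can be inspected with the standard library. A minimal sketch (the filename below is a placeholder and the exact structure depends on the ATLAS Transient Server export):

```python
import json
from pathlib import Path

# Placeholder path: point this at one of the downloaded JSON files.
json_path = Path("json_files/example_alert.json")

with json_path.open() as f:
    alert = json.load(f)

# The top-level structure depends on the server export: list the keys first.
if isinstance(alert, dict):
    print(sorted(alert.keys()))
else:
    print(f"List of {len(alert)} records")
```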
----------------------------------------------------------------------------------------------------------------------
Duck and Duck 1.1
Cleans up the training data gathered between 18th August 2024 and 22nd January 2025, and adds it to the previous Crabby data set to make a larger, near complete (in terms of sky coverage) data set to retrain the models with.
There are two "ducks" because the first one relates to the VRA 1.0 release, which contains models trained with a sub-par training set (see Chapter 3 of the Technical Manual); that model was in production from 3rd February to 6th March 2025.
Each directory has its own README.md summarising its contents. Below I describe which content to reference if we focus on the current in-production version of the VRA and the contents of the paper.
Overview and key figures from the paper:  ./Duck1.1
- [D/U/S] Overview.ipynb: A general overview that introduces the VRA, its day 1 and day N models, and the ranking, and looks at feature importance. Many of the plots in the paper come from here.
- [D/U/S] Key_transients.ipynb: Looks at how the day 1 models perform on a number of important transients, to see how they fare against our models and chosen eyeballing policies.
- [D/U] Policy_evaluation.ipynb: Notebook to check how our models and chosen policies work together - how much do we auto-garbage? How much do we eyeball? Do we lose any good transients? Are these acceptable losses? (A minimal sketch of this kind of bookkeeping is given after this list.)
- [D/U/S] Interpreting_AT2024lwd.ipynb
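To give a flavour of the bookkeeping done in Policy_evaluation.ipynb, here is a minimal sketch; the threshold, column names and labels below are made up for illustration and are not the production policy:

```python
import pandas as pd

# Hypothetical validation scores: one row per alert, a 'real_score' in [0, 1]
# and a human label. Column names and values are illustrative only.
scores = pd.DataFrame({
    "real_score": [0.02, 0.10, 0.45, 0.80, 0.95],
    "label":      ["garbage", "garbage", "good", "good", "good"],
})

AUTO_GARBAGE_THRESHOLD = 0.05  # made-up threshold for this sketch

auto_garbaged = scores["real_score"] < AUTO_GARBAGE_THRESHOLD
frac_auto_garbaged = auto_garbaged.mean()
good_lost = ((scores["label"] == "good") & auto_garbaged).sum()

print(f"Auto-garbaged: {frac_auto_garbaged:.0%}, "
      f"eyeballed: {1 - frac_auto_garbaged:.0%}, good transients lost: {good_lost}")
```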
Clean new data ./Duck/data
- Directories
  * ``clean_data_csv``: Clean csv files containing the contextual information, detection and non-detection data, and relevant ``tcs_vra_scores`` data.
- Notebooks
  * [D] [/!\] clean_data.ipynb: Notebook used to extract the csv files in ``clean_data_csv`` from the raw JSON data (a minimal sketch of this step follows below).
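Broadly, the cleaning step flattens the nested JSON alert records into csv tables. A minimal pandas sketch (the paths and the assumption of one dict-like record per file are placeholders, not the actual schema used in clean_data.ipynb):

```python
import json
from pathlib import Path
import pandas as pd

rows = []
for json_file in Path("json_files").glob("*.json"):  # placeholder directory
    with json_file.open() as f:
        alert = json.load(f)
    # Keep only flat scalar fields for this table; nested lists
    # (e.g. individual detections) would go into their own csv.
    rows.append({k: v for k, v in alert.items() if not isinstance(v, (list, dict))})

pd.DataFrame(rows).to_csv("clean_data_csv/contextual_info.csv", index=False)
```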
Features ./Duck1.1/data
- Directories
  * ``clean_data_csv``: Clean csv files containing the contextual information, detection and non-detection data, and relevant ``tcs_vra_scores`` data.
  * ``features_and_labels_csv``: Features and labels for the day 1 and day N models
  * ``figures``: pictures
- Notebooks
  * [D/U] features_day1.ipynb: Notebook used to extract the day 1 features.
  * [D/U] features_day1_train_val_split.ipynb: Notebook used to extract the day 1 features, then split the new data set into training and validation sets, balance the training set with subsampling, and finally combine these new data with the ``Crabby`` training and validation sets (see the sketch after this list).
  * [D/U] features_dayN.ipynb: Notebook used to extract the day N features.
  * [D] add_detmagmedianmin5d_tocrabby.ipynb: Notebook to add ``DET_mag_median_min5d`` to ``Crabby`` samples.
  * [D/U/S] Summary_plots.ipynb: Summary tables and plots of the data set and feature distributions. Many of the plots in the paper come from here.
  * [D] Crabby_Vs_Duck.ipynb: Comparison of Duck 1.1 to Crabby, the previous version of the model (ignoring Duck 1.0). This is really a dev notebook.
  * [D] _animated_score_space.ipynb: Notebook to make a pretty animation of the day 1 real and gal scores for our training+validation set. Creates animation.gif.
  * [D] In_prod_verification.ipynb: Dev notebook to check auto-garbage behaviour and in-prod results from the VRA.
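As promised above, here is a minimal sketch of the split-and-balance step in features_day1_train_val_split.ipynb; the file path, column names and split fraction are assumptions for illustration only:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder path and 'label' column: see the notebook for the real ones.
df = pd.read_csv("features_and_labels_csv/day1_features.csv")

train, val = train_test_split(df, test_size=0.3, random_state=42,
                              stratify=df["label"])

# Balance the training set by subsampling every class down to the rarest one;
# the validation set is left untouched so it reflects the real class balance.
n_min = train["label"].value_counts().min()
train_balanced = (train.groupby("label", group_keys=False)
                       .apply(lambda g: g.sample(n=n_min, random_state=42)))
```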
 ./models
Where we store the models to be used in production - created by the ``Overview.ipynb`` notebook.
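The saved models can be loaded back with joblib; the filename below is a placeholder, check the ``models`` directory and ``Overview.ipynb`` for the actual names and expected feature order:

```python
import joblib

# Placeholder filename: use whatever Overview.ipynb actually wrote to ./models.
model = joblib.load("models/day1_model.joblib")

# Assuming a scikit-learn classifier: score a feature matrix X whose columns
# match the order used at training time (see the features_day1 notebooks).
# real_scores = model.predict_proba(X)[:, 1]
```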
----------------------------------------------------------------------------------------------------------------------
Crabby
Training data gathered between 27th March 2024 and 13th August 2024.
These models were in production from 6th December 2024 until mid February 2025.
 ./
- [D/U/S] Overview.ipynb: A general overview that introduces the VRA, its day 1 and day N models, and the ranking, and looks at feature importance.
- [D/U/S] Key_transients.ipynb: Looks at how the day 1 models perform on a number of important transients, to see how they fare against our models and chosen eyeballing policies.
- [D/U] Policy_evaluation.ipynb: Notebook to check how our models and chosen policies work together - how much do we auto-garbage? How much do we eyeball? Do we lose any good transients? Are these acceptable losses?
- [D] _Gal_candidate_policy.ipynb: The notebook used to test out a new policy to flag objects as Galactic Candidates. This was later implemented when retraining with the ``Duck`` data set for VRA 1.0 in February 2025.
- [D] Policy_evaluation_new_strat.ipynb: Same as the policy evaluation notebook above, with the added step of having a galactic candidate eyeball list as tested in the ``_Gal_candidate_policy.ipynb`` notebook.
./data
- Directories
  * ``clean_data_csv``: Clean csv files containing the contextual information, detection and non-detection data, and relevant ``tcs_vra_scores`` data.
  * ``features_and_labels_csv``: Features and labels for the day 1 and day N models
  * ``figures``: pictures
  * ``to_reeyball``: csv files containing the data of the alerts that were flagged for re-eyeballing AND the csv containing the re-eyeballing decisions.
- Notebooks
  * [D] [/!\] clean_data.ipynb: Notebook used to extract the csv files in ``clean_data_csv`` from the raw JSON data.
  * [D/U/S] features_day1.ipynb: Notebook used to extract the day 1 features, split the training and validation sets, and do the subsampling of the over-represented classes (for the training set only).
  * [D/U/S] features_dayN.ipynb: Notebook used to extract the day N features.
  * [D/U/S] Summary_plots.ipynb: Summary tables and plots of the data set and feature distributions.
 
./old_vs_new
[D] in_prod_comparison.ipynb: Comparison of the models created with the `Crabby` data set to the `BMO` version. The `BMO` models are unpublished and had training data from 27th March until mid July 2024 - the main differences lie in the features used. This notebook is mostly useful to the dev team and to see how model comparisons are done when upgrading to a new VRA.
 ./train_scoring_models/hyperparameter_tuning
[D]
This directory contains all the nitty gritty of the hyperparameter tuning phase.
The details can be found in the hp_tuning_evaluation.ipynb notebook.
If you only want to see the parameters we used for our production models, see the ``Overview.ipynb`` notebook in the top directory.
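If you are not familiar with what such a tuning phase looks like, here is a generic scikit-learn sketch; the estimator, grid and scoring below are illustrative and not the search actually run in hp_tuning_evaluation.ipynb:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid only; see hp_tuning_evaluation.ipynb for the real search.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)
```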
./train_scoring_models/prod_models
[D]
The subdirectories ``day1`` and ``dayN`` contain the production models used in the ATLAS server (before and after re-eyeballing and retraining). 
These are mostly there for dev book-keeping.
Final Note
In addition to the main directories and notebooks listed here, there are csv files in some of the subdirectories. Those are created by the notebooks here and are not described in this description or the READMEs, both to keep the documentation readable and because their names are self-explanatory (their source can be found in the notebooks).
----------------------------------------------------------------------------------------------
Anything Unclear?
Please get in touch if you have any questions about the contents of this data archive. I did my best to document and explain the process, but I did this in hindsight, once I'd been working on the project for a year - what is clear to me may not be clear to a fresh reader.
hfstevance@gmail.com
Files (6.3 GB)
Additional details
Related works
- Cites: Software documentation, 10.5281/zenodo.14944208 (DOI)
- Is supplement to: Software, 10.5281/zenodo.14363396 (DOI)
- Requires: Software, 10.5281/zenodo.14331062 (DOI)