ATLAS VRA v1 - Training Data and Code
Description
Quick Summary
The Virtual Research Assistant is a bot (or set of bots) that helps ATLAS eyeballers by ordering and prioritising the alerts in the eyeball list, removing the crappiest objects, and sending automatic triggers for transients within 100 Mpc to be followed up with the Mookodi Telescope.
This is the first public release of the data and code used to train the models that power the VRA.
Who is this repo for and how to use it
General sign-posting
- [D] Devs - for reproducibility and book-keeping.
- [U] Users (eyeballers) who want to understand the models and their limitations.
- [S] Scientists who want to understand the method.
 
For each resource we flag which type of user we think will benefit from it, using the abbreviations [D/U/S].
Figures from the ATLAS VRA Paper
If you are looking for specific figures from the paper here are the notebooks that created them:
- ./Crabby/data/Summary_plots.ipynb: Figures 1, 2, 3, 6
- ./Duck/data/Summary_plots.ipynb: Figures 5, 18, 19, 20, 21, 22, 23, 24
- ./Duck1.1/Overview.ipynb: Figures 7, 8, 9, 10, 11
- ./Duck1.1/Interpreting_AT2024lwd.ipynb: Figures 13, 25
- ./Duck1.1/Key_transients.ipynb: Figure 14
- ./Duck1.1/Policy_evaluation.ipynb: Figures 15, 16
Requirements
matplotlib, numpy, pandas, scikit-learn, joblib, atlasvras, atlasapiclient
Note: All the notebooks call a matplotlib style which is not released here (vra.mplstyle or vra_light.mplstyle). If you want to actually run the notebooks, replace this style choice in the first cell with your own style file, or comment it out.
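For reference, here is a minimal way to deal with the missing style file in the first cell of each notebook (the exact call in the notebooks may differ slightly):

```python
import matplotlib.pyplot as plt

# The notebooks load a custom style that is not shipped with this archive,
# typically something like:
# plt.style.use('vra.mplstyle')
# Either comment that line out or point matplotlib at a style you do have:
plt.style.use('default')              # fall back to the matplotlib default
# plt.style.use('my_style.mplstyle')  # ...or your own style file
```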
[/!\] Notebooks marked with this warning sign cannot be run without access to the raw JSON files, which would require you to download nearly 40 GB of data (once unzipped).
------------------------------------------------------------------------------------------------------------------------
Contents
Raw JSON data
/!\ LARGE SIZE - 40 GB unzipped /!\ Unless you have a very good reason to download this we don't recommend it. Even if you want the data for a specific transient (or group of transients) that is part of this release, you can get the data from the ATLAS Transient Server through the ATLAS API without having to download all of this.
- [D] ``json_files``: contains the JSON files for the ``Crabby`` data
- [D] ``data_objects_with_decision_NEWEYEBALLLIST``: contains the JSON files for the extra data added to ``Duck``
Note: if you are here for raw data, these files do not represent the full range of properties of the data we see in our complete stream. They are data for objects that made all cuts in the data processing and would have been shown to eyeballers. They are solely intended to train a model that works downstream of previous automation steps.
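If you do download the raw JSON files, they can be inspected with the standard library. A minimal sketch (the filename below is a placeholder and the exact structure depends on the ATLAS Transient Server export):

```python
import json
from pathlib import Path

# Placeholder path: point this at one of the downloaded JSON files.
json_path = Path("json_files/example_alert.json")

with json_path.open() as f:
    alert = json.load(f)

# The top-level structure depends on the server export: list the keys first.
if isinstance(alert, dict):
    print(sorted(alert.keys()))
else:
    print(f"List of {len(alert)} records")
```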
----------------------------------------------------------------------------------------------------------------------
Duck and Duck 1.1
Cleans up the training data gathered between 18th August 2024 and 22nd January 2025, and adds it to the previous Crabby data set to make a larger, near complete (in terms of sky coverage) data set to retrain the models with.
There are two "ducks" because the first one relates to the VRA 1.0 release, which contains models trained with a sub-par training set (see Chapter 3 of the Technical Manual); that model was in production from 3rd February to 6th March 2025.
Each directory has its own README.md summarising its contents. Below I describe which content to reference if we focus on the current in-production version of the VRA and the contents of the paper.
Overview and key figures from the paper:  ./Duck1.1
- [D/U/S] Overview.ipynb: A general overview that introduces the VRA, its day 1 and day N models, and the ranking, and looks at feature importance. Many of the plots in the paper come from here.
- [D/U/S] Key_transients.ipynb: Looks at how the day 1 models perform on a number of important transients, to see how they fare against our models and chosen eyeballing policies.
- [D/U] Policy_evaluation.ipynb: Notebook to check how our models and chosen policies work together - how much do we auto-garbage? How much do we eyeball? Do we lose any good transients? Are these acceptable losses? (A minimal sketch of this kind of bookkeeping is given after this list.)
- [D/U/S] Interpreting_AT2024lwd.ipynb
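To give a flavour of the bookkeeping done in Policy_evaluation.ipynb, here is a minimal sketch; the threshold, column names and labels below are made up for illustration and are not the production policy:

```python
import pandas as pd

# Hypothetical validation scores: one row per alert, a 'real_score' in [0, 1]
# and a human label. Column names and values are illustrative only.
scores = pd.DataFrame({
    "real_score": [0.02, 0.10, 0.45, 0.80, 0.95],
    "label":      ["garbage", "garbage", "good", "good", "good"],
})

AUTO_GARBAGE_THRESHOLD = 0.05  # made-up threshold for this sketch

auto_garbaged = scores["real_score"] < AUTO_GARBAGE_THRESHOLD
frac_auto_garbaged = auto_garbaged.mean()
good_lost = ((scores["label"] == "good") & auto_garbaged).sum()

print(f"Auto-garbaged: {frac_auto_garbaged:.0%}, "
      f"eyeballed: {1 - frac_auto_garbaged:.0%}, good transients lost: {good_lost}")
```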
Clean new data ./Duck/data
- Directories
  * ``clean_data_csv``: Clean csv files containing the contextual information, detection and non-detection data, and relevant ``tcs_vra_scores`` data.
- Notebooks
  * [D] [/!\] clean_data.ipynb: Notebook used to extract the csv files in ``clean_data_csv`` from the raw JSON data (a minimal sketch of this step follows below).
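Broadly, the cleaning step flattens the nested JSON alert records into csv tables. A minimal pandas sketch (the paths and the assumption of one dict-like record per file are placeholders, not the actual schema used in clean_data.ipynb):

```python
import json
from pathlib import Path
import pandas as pd

rows = []
for json_file in Path("json_files").glob("*.json"):  # placeholder directory
    with json_file.open() as f:
        alert = json.load(f)
    # Keep only flat scalar fields for this table; nested lists
    # (e.g. individual detections) would go into their own csv.
    rows.append({k: v for k, v in alert.items() if not isinstance(v, (list, dict))})

pd.DataFrame(rows).to_csv("clean_data_csv/contextual_info.csv", index=False)
```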
Features ./Duck1.1/data
- Directories
  * ``clean_data_csv``: Clean csv files containing the contextual information, detection and non-detection data, and relevant ``tcs_vra_scores`` data.
  * ``features_and_labels_csv``: Features and labels for the day 1 and day N models
  * ``figures``: pictures
- Notebooks
  * [D/U] features_day1.ipynb: Notebook used to extract the day 1 features.
  * [D/U] features_day1_train_val_split.ipynb: Notebook used to extract the day 1 features, then split the new data set into training and validation sets, balance the training set with subsampling, and finally combine these new data with the ``Crabby`` training and validation sets (see the sketch after this list).
  * [D/U] features_dayN.ipynb: Notebook used to extract the day N features.
  * [D] add_detmagmedianmin5d_tocrabby.ipynb: Notebook to add ``DET_mag_median_min5d`` to ``Crabby`` samples.
  * [D/U/S] Summary_plots.ipynb: Summary tables and plots of the data set and feature distributions. Many of the plots in the paper come from here.
  * [D] Crabby_Vs_Duck.ipynb: Comparison of Duck 1.1 to Crabby, the previous version of the model (ignoring Duck 1.0). This is really a dev notebook.
  * [D] _animated_score_space.ipynb: Notebook to make a pretty animation of the day 1 real and gal scores for our training+validation set. Creates animation.gif.
  * [D] In_prod_verification.ipynb: Dev notebook to check auto-garbage behaviour and in-prod results from the VRA.
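As promised above, here is a minimal sketch of the split-and-balance step in features_day1_train_val_split.ipynb; the file path, column names and split fraction are assumptions for illustration only:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder path and 'label' column: see the notebook for the real ones.
df = pd.read_csv("features_and_labels_csv/day1_features.csv")

train, val = train_test_split(df, test_size=0.3, random_state=42,
                              stratify=df["label"])

# Balance the training set by subsampling every class down to the rarest one;
# the validation set is left untouched so it reflects the real class balance.
n_min = train["label"].value_counts().min()
train_balanced = (train.groupby("label", group_keys=False)
                       .apply(lambda g: g.sample(n=n_min, random_state=42)))
```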
 ./models
Where we store the models to be used in production - created by the ``Overview.ipynb`` notebook.
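The saved models can be loaded back with joblib; the filename below is a placeholder, check the ``models`` directory and ``Overview.ipynb`` for the actual names and expected feature order:

```python
import joblib

# Placeholder filename: use whatever Overview.ipynb actually wrote to ./models.
model = joblib.load("models/day1_model.joblib")

# Assuming a scikit-learn classifier: score a feature matrix X whose columns
# match the order used at training time (see the features_day1 notebooks).
# real_scores = model.predict_proba(X)[:, 1]
```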
----------------------------------------------------------------------------------------------------------------------
Crabby
Training data gathered between 27th March 2024 and 13th August 2024.
These models were in production from 6th December 2024 until mid February 2025.
 ./
- [D/U/S] Overview.ipynb: A general overview that introduces the VRA, its day 1 and day N models, and the ranking, and looks at feature importance.
- [D/U/S] Key_transients.ipynb: Looks at how the day 1 models perform on a number of important transients, to see how they fare against our models and chosen eyeballing policies.
- [D/U] Policy_evaluation.ipynb: Notebook to check how our models and chosen policies work together - how much do we auto-garbage? How much do we eyeball? Do we lose any good transients? Are these acceptable losses?
- [D] _Gal_candidate_policy.ipynb: The notebook used to test out a new policy to flag objects as Galactic Candidates. This was later implemented when retraining with the ``Duck`` data set for VRA 1.0 in February 2025.
- [D] Policy_evaluation_new_strat.ipynb: Same as the policy evaluation notebook above, with the added step of having a galactic candidate eyeball list as tested in the ``_Gal_candidate_policy.ipynb`` notebook.
./data
- Directories
  * ``clean_data_csv``: Clean csv files containing the contextual information, detection and non-detection data, and relevant ``tcs_vra_scores`` data.
  * ``features_and_labels_csv``: Features and labels for the day 1 and day N models
  * ``figures``: pictures
  * ``to_reeyball``: csv files containing the data of the alerts that were flagged for re-eyeballing AND the csv containing the re-eyeballing decisions.
- Notebooks
  * [D] [/!\] clean_data.ipynb: Notebook used to extract the csv files in ``clean_data_csv`` from the raw JSON data.
  * [D/U/S] features_day1.ipynb: Notebook used to extract the day 1 features, split the training and validation sets, and do the subsampling of the over-represented classes (for the training set only).
  * [D/U/S] features_dayN.ipynb: Notebook used to extract the day N features.
  * [D/U/S] Summary_plots.ipynb: Summary tables and plots of the data set and feature distributions.
 
./old_vs_new
[D] in_prod_comparison.ipynb: Comparison of the models created with the `Crabby` data set to the `BMO` version. The `BMO` models are unpublished and had training data from 27th March until mid July 2024 - the main differences lie in the features used. This notebook is mostly useful to the dev team and to see how model comparisons are done when upgrading to a new VRA.
 ./train_scoring_models/hyperparameter_tuning
[D]
This directory contains all the nitty gritty of the hyperparameter tuning phase.
The details can be found in the hp_tuning_evaluation.ipynb notebook.
If you only want to see the parameters we used for our production models, see the ``Overview.ipynb`` notebook in the top directory.
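If you are not familiar with what such a tuning phase looks like, here is a generic scikit-learn sketch; the estimator, grid and scoring below are illustrative and not the search actually run in hp_tuning_evaluation.ipynb:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid only; see hp_tuning_evaluation.ipynb for the real search.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)
```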
./train_scoring_models/prod_models
[D]
The subdirectories ``day1`` and ``dayN`` contain the production models used in the ATLAS server (before and after re-eyeballing and retraining). 
These are mostly there for dev book-keeping.
Final Note
In addition to the main directories and notebooks listed here, there are csv files in some of the subdirectories. Those are created by the notebooks here and are not described in this description or the READMEs, both to keep the documentation readable and because their names are self-explanatory (their source can be found in the notebooks).
----------------------------------------------------------------------------------------------
Anything Unclear?
Please get in touch if you have any questions about the contents of this data archive. I did my best to document and explain the process, but I did this in hindsight, once I'd been working on the project for a year - what is clear to me may not be clear to a fresh reader.
hfstevance@gmail.com
Files (6.3 GB)
Additional details
Related works
- Cites: Software documentation, 10.5281/zenodo.14944208 (DOI)
- Is supplement to: Software, 10.5281/zenodo.14363396 (DOI)
- Requires: Software, 10.5281/zenodo.14331062 (DOI)