LSST light curves for constant and variable sources, and for point-like and extended objects microlensing
Description
This repository contains the dataset that accompanies the paper Anomaly Detection to Identify Transients in LSST Time Series Data, which should be consulted for further details, along with the artefacts of the trained machine learning models. The dataset consists of simulated LSST light curves generated for the Vera C. Rubin Observatory cadence and observing conditions via rubin-sim. It comprises approximately 600 000 light curves covering a range of transient and variable signals, including microlensing events and variable stars, as well as non-variable, signal-less sources used to train the anomaly detection model.
The dataset includes six distinct classes: Constant (non-variable signal-less sources), RR Lyrae variables, Point-like Microlensing (ML), Binary Microlensing (Binary ML), Boson Stars (BS), and NFW Subhalos (NFW). The total number of simulated light curves for each class is as follows:
- BS: 320 494
- Binary ML: 84 022
- ML: 53 565
- RR Lyrae: 49 573
- NFW: 47 837
- Constant: 41 522
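As a quick check on the class balance, the counts above can be summed and converted to fractions. The short sketch below uses only the numbers listed in this description; it shows that the classes add up to 597 013 light curves (the "approximately 600 000" quoted above) and that BS accounts for just over half of them.

```python
# Class counts copied from the list above.
counts = {
    "BS": 320_494,
    "Binary ML": 84_022,
    "ML": 53_565,
    "RR Lyrae": 49_573,
    "NFW": 47_837,
    "Constant": 41_522,
}

total = sum(counts.values())
fractions = {name: n / total for name, n in counts.items()}

# Print one row per class: name, count, share of the full dataset.
for name, n in counts.items():
    print(f"{name:<10} {n:>8,d} {fractions[name]:6.1%}")
print(f"{'Total':<10} {total:>8,d}")
```

The imbalance is worth keeping in mind when splitting the data or interpreting per-class metrics.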
The light curves incorporate rubin-sim noise simulation and the LSST 10-year baseline cadence strategy (v2.0). Light curves for Constant, variable, and point-like microlensing events were simulated using MicroLIA, while binary microlensing events were generated using pyLIMA. Light curves for the BS and NFW objects were simulated using the code from this work.
The dataset contains 182 columns covering simulation and generation parameters, observable time series features, the time series itself, and the predictions from the machine learning models used in the paper. The columns are organised by type using prefixes and suffixes:
- 'timestamps', 'mag', 'magerr': Light curve data.
- 'gen': Generation parameters (metadata).
- 'sim': Simulation parameters (metadata).
- 'feature_' prefix: Features extracted from the light curve and from its derivative; the latter are marked with the suffix 'deriv'.
- 'iforest_output': iForest anomaly score.
- 'pred_': Probabilities and class prediction for the multiclass classifier.
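With 182 columns, it can help to sort the column names into the groups listed above before working with them. The following is a minimal sketch of such a grouping. It assumes that 'gen' and 'sim' appear as a leading or trailing underscore-separated token in the column names, which this description does not state explicitly; the example column names are hypothetical, and the real ones should be checked against data_header.txt.

```python
LIGHT_CURVE_COLUMNS = {"timestamps", "mag", "magerr"}


def column_group(name):
    """Assign a column name to one of the groups described above."""
    if name in LIGHT_CURVE_COLUMNS:
        return "light_curve"
    if name == "iforest_output":
        return "iforest"
    if name.startswith("pred_"):
        return "prediction"
    if name.startswith("feature_"):
        # Features computed on the derivative carry the 'deriv' suffix.
        return "feature_deriv" if name.endswith("deriv") else "feature"
    # Assumption: 'gen' / 'sim' is the first or last underscore-separated token.
    tokens = name.split("_")
    if "gen" in (tokens[0], tokens[-1]):
        return "generation"
    if "sim" in (tokens[0], tokens[-1]):
        return "simulation"
    return "other"


def group_columns(columns):
    """Map each group label to the list of column names that fall in it."""
    groups = {}
    for name in columns:
        groups.setdefault(column_group(name), []).append(name)
    return groups


# Hypothetical column names, for illustration only.
example = [
    "timestamps", "mag", "magerr",
    "gen_t0", "sim_seed",
    "feature_amplitude", "feature_amplitude_deriv",
    "iforest_output", "pred_class",
]
print(group_columns(example))
```

In practice, `group_columns` would be called on the column names of the loaded table, and anything that lands in "other" signals that the naming assumption needs adjusting.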
The dataset is provided in 'parquet' format, accessible in Python via 'pandas' by installing the 'parquet' optional dependency (i.e., pip install pandas[parquet]).
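A minimal loading sketch is given below. The file name is a placeholder, since the name of the parquet file is not given in this description, and the helper `load_light_curves` is introduced here purely for illustration. Passing `columns` to `pandas.read_parquet` restricts the read to the requested columns, which may be useful given the size of the dataset.

```python
import pandas as pd

# Placeholder file name: replace with the actual parquet file from the record.
DATA_PATH = "lsst_light_curves.parquet"


def load_light_curves(path, columns=None):
    """Read the dataset, optionally restricting it to a subset of columns."""
    return pd.read_parquet(path, columns=columns)


# Example call (not executed here): load only the light curve columns.
# df = load_light_curves(DATA_PATH, columns=["timestamps", "mag", "magerr"])
```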
The artefacts were generated in Python 3.9.21 using scikit-learn 1.4.1. The imputer_train.pkl file is required to impute missing values before predicting with the iForest model (final_isolation_forest_model.pkl), as the iForest model does not handle missing or nan values. The multiclass classifier (classifier.pck) handles missing and nan values directly and was trained without imputed data.
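The order of operations for the iForest model (impute first, then score) can be sketched as follows. This example does not load the released artefacts: it fits stand-in objects on synthetic data so that it runs on its own. The median `SimpleImputer` is an assumption, as the imputation strategy is not stated here, and whether the 'iforest_output' column corresponds to `score_samples` or to `decision_function` should be checked against the paper.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix (rows: light curves, columns: features).
X_train = rng.normal(size=(200, 5))
X_new = rng.normal(size=(10, 5))
X_new[0, 2] = np.nan  # simulate missing feature values
X_new[3, 0] = np.nan

# Stand-ins for imputer_train.pkl and final_isolation_forest_model.pkl.
# Assumption: a median imputer; the released imputer may use another strategy.
imputer = SimpleImputer(strategy="median").fit(X_train)
iforest = IsolationForest(random_state=0).fit(imputer.transform(X_train))

# Step 1: fill missing values with the fitted imputer.
X_imputed = imputer.transform(X_new)

# Step 2: score the imputed features (lower scores are more anomalous).
scores = iforest.score_samples(X_imputed)
print(scores)
```

With the released files, the two fitted objects would be loaded from disk instead of being fitted, for instance with `pickle.load` if they were saved with pickle (an assumption), ideally in an environment matching the Python and scikit-learn versions quoted above. The feature columns passed to the imputer must be in the same order as during training.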
Please cite the paper alongside the Zenodo entry if you use this dataset:
@article{CrispimRomao:2025pyl,
author = "Crispim Romao, Miguel and Croon, Djuna and Godines, Daniel",
title = "{Anomaly Detection to identify Transients in LSST Time Series Data}",
eprint = "2503.09699",
archivePrefix = "arXiv",
primaryClass = "astro-ph.SR",
reportNumber = "IPPP/25/15",
month = "3",
year = "2025"
}
Files (1.5 GB)
data_header.txt
Additional details
Related works
- Is derived from: Preprint: arXiv:2503.09699 (arXiv)
Funding
- UK Research and Innovation
- Proposal for IPPP (UK National Phenomenology Institute), 2020-2023 ST/T001011/1