Published March 11, 2025 | Version v1
Dataset Open

LSST light curves for constant and variable sources, and for point-like and extended objects microlensing

  • 1. Durham University
  • 2. New Mexico State University

Description

This repository contains the dataset accompanying the paper Anomaly Detection to Identify Transients in LSST Time Series Data, together with the artefacts of the trained machine learning models; consult the paper for further details. The dataset consists of simulated LSST light curves for the Vera C. Rubin Observatory cadence and observational conditions, obtained via rubin-sim. It comprises approximately 600 000 light curves covering various transient and variable signals, including microlensing events and variable stars, as well as non-variable, signal-less sources used to train the anomaly detection model.

The dataset includes six distinct classes: Constant (non-variable signal-less sources), RR Lyrae variables, Point-like Microlensing (ML), Binary Microlensing (Binary ML), Boson Stars (BS), and NFW Subhalos (NFW). The total number of simulated light curves for each class is as follows:

  • BS: 320 494

  • Binary ML: 84 022

  • ML: 53 565

  • RR Lyrae: 49 573

  • NFW: 47 837

  • Constant: 41 522
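As a quick consistency check on the counts above, the following sketch sums the per-class totals and prints each class's share of the dataset. The dictionary simply restates the listed numbers; the variable names are illustrative and not part of the dataset.

```python
# Per-class light-curve counts, copied from the list above.
CLASS_COUNTS = {
    "BS": 320_494,
    "Binary ML": 84_022,
    "ML": 53_565,
    "RR Lyrae": 49_573,
    "NFW": 47_837,
    "Constant": 41_522,
}

# Total number of light curves (597 013, i.e. approximately 600 000).
total = sum(CLASS_COUNTS.values())

# Fraction of the dataset taken up by each class.
fractions = {name: count / total for name, count in CLASS_COUNTS.items()}

print(f"total: {total}")
for name, fraction in fractions.items():
    print(f"{name:>10}: {fraction:.1%}")
```

The classes are far from balanced (BS accounts for a little over half of all light curves), which is worth keeping in mind when subsampling or evaluating a classifier on this data.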

The light curves incorporate rubin-sim noise simulation and the LSST 10-year baseline cadence strategy (v2.0). Light curves for Constant, variable, and point-like microlensing events were simulated using MicroLIA, while binary microlensing events were generated using pyLIMA. Light curves for the BS and NFW objects were simulated using the code from this work.

The dataset contains 182 columns covering simulation and generation parameters, observable time series features, the time series itself, and the predictions from the machine learning models used in the paper. The columns are organised by type using prefixes and suffixes:

  • 'timestamps', 'mag', 'magerr': Light curve data.

  • 'gen': Generation parameters (metadata).

  • 'sim': Simulation parameters (metadata).

  • 'feature_' prefix: Features extracted from the light curve and from its derivative; the latter are marked with the suffix 'deriv'.

  • 'iforest_output': iForest anomaly score.

  • 'pred_': Probabilities and class prediction for the multiclass classifier.
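The naming scheme above can be used to split the 182 columns into groups programmatically. The sketch below is one possible way to do so and rests on assumptions: it treats 'gen' and 'sim' as underscore-separated tokens at the start or end of a column name (the description does not say which), and it reads the 'deriv' suffix as marking derivative-based features. The example column names are hypothetical; check them against data_header.txt before relying on the grouping.

```python
from collections import defaultdict

# Columns that hold the light curve itself, as listed above.
LIGHT_CURVE_COLUMNS = {"timestamps", "mag", "magerr"}


def column_group(name):
    """Return the group a column name belongs to under the scheme above."""
    if name in LIGHT_CURVE_COLUMNS:
        return "light_curve"
    if name == "iforest_output":
        return "iforest"
    if name.startswith("feature_"):
        # Assumption: the 'deriv' suffix marks features of the derivative.
        return "feature_deriv" if name.endswith("deriv") else "feature"
    if name.startswith("pred_"):
        return "prediction"
    # Assumption: 'gen' / 'sim' appear as the first or last '_'-separated token.
    tokens = name.split("_")
    for tag in ("gen", "sim"):
        if tokens[0] == tag or tokens[-1] == tag:
            return tag
    return "other"


def group_columns(columns):
    """Map each group name to the list of columns that fall into it."""
    groups = defaultdict(list)
    for name in columns:
        groups[column_group(name)].append(name)
    return dict(groups)


# Hypothetical column names, for illustration only.
example = ["timestamps", "mag", "gen_t0", "feature_median",
           "feature_median_deriv", "iforest_output", "pred_class"]
print(group_columns(example))
```

Anything that lands in the "other" group signals a column the assumed scheme does not cover.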

The dataset is provided in 'parquet' format and can be read in Python via 'pandas' once the 'parquet' optional dependency is installed (i.e., pip install "pandas[parquet]"; the quotes prevent some shells from interpreting the square brackets).
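A minimal loading sketch follows. It uses pandas.read_parquet with its columns argument, so that only the requested columns are read from the 1.4 GB file. The file name in the usage comment is a placeholder, since the actual name is not given here, and the helper name load_light_curves is illustrative.

```python
from pathlib import Path

# Light-curve columns named in the column description.
LIGHT_CURVE_COLUMNS = ["timestamps", "mag", "magerr"]


def load_light_curves(path, columns=None):
    """Read the parquet file, optionally restricted to a subset of columns."""
    # Imported here so the module can be imported without pandas installed.
    import pandas as pd

    return pd.read_parquet(Path(path), columns=columns)


# Usage (placeholder file name; substitute the downloaded parquet file):
# df = load_light_curves("dataset.parquet", columns=LIGHT_CURVE_COLUMNS)
# print(df.shape)
```

Restricting the read to a few columns keeps memory use down; passing columns=None loads all 182 columns.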

The artefacts were generated in Python 3.9.21 using scikit-learn 1.4.1. The imputer_train.pkl file is required to impute missing values before predicting with the iForest model (final_isolation_forest_model.pkl), because the iForest model does not handle missing or NaN values. The multiclass classifier (classifier.pck) handles missing and NaN values directly and was trained without imputed data.
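The impute-then-score order described above might look like the sketch below. Several points are assumptions rather than facts from this record: that the artefacts were written with the standard pickle module (they may instead need joblib), that the feature matrix has the same columns in the same order as during training, and that score_samples is the relevant output (scikit-learn's IsolationForest also offers decision_function, and the record does not say which one produced 'iforest_output'). The helper names are illustrative.

```python
import pickle
from pathlib import Path


def load_artefact(path):
    """Unpickle one artefact. Only unpickle files from a source you trust."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


def iforest_scores(features, imputer, iforest):
    """Impute missing values first, then score with the isolation forest."""
    # The iForest model cannot handle NaN, so imputation has to come first.
    filled = imputer.transform(features)
    # Assumption: score_samples is the wanted output (lower = more anomalous
    # in scikit-learn's convention).
    return iforest.score_samples(filled)


# Usage, with the file names given above and a feature matrix X that is
# assumed to match the training features:
# imputer = load_artefact("imputer_train.pkl")
# iforest = load_artefact("final_isolation_forest_model.pkl")
# scores = iforest_scores(X, imputer, iforest)
```

Loading the artefacts with the scikit-learn version stated above (1.4.1) is the safest choice, since pickled estimators are not guaranteed to load across versions.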

Please cite the paper alongside the Zenodo entry if you use this dataset:

@article{CrispimRomao:2025pyl,
    author = "Crispim Romao, Miguel and Croon, Djuna and Godines, Daniel",
    title = "{Anomaly Detection to identify Transients in LSST Time Series Data}",
    eprint = "2503.09699",
    archivePrefix = "arXiv",
    primaryClass = "astro-ph.SR",
    reportNumber = "IPPP/25/15",
    month = "3",
    year = "2025"
}

Files

data_header.txt

Files (1.5 GB)

Checksum | Size
md5:5a17bc19f54c957d893e2bd59af22e8c | 4.5 MB
md5:e55b58ba98ea5ab0fc0e222ef8a5e5c2 | 4.3 kB
md5:ee15234217eef8ca6d478689540a8f79 | 14.0 MB
md5:63647f36c8f67d4e4dbf3ac5ec02fa27 | 27.5 MB
md5:643872dcada50354d91c3faed3de886f | 1.4 GB
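The MD5 checksums listed above can be used to confirm that a download completed intact. The sketch below hashes a file in chunks, so the 1.4 GB file never has to fit in memory, and compares the result with a listed value. The helper names are illustrative, and the path in the usage comment is a placeholder.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    # str.removeprefix requires Python 3.9 or later.
    return md5_of_file(path) == expected.strip().lower().removeprefix("md5:")


# Usage (placeholder path; the checksum is copied from the list above):
# ok = matches_checksum("downloaded_file",
#                       "md5:643872dcada50354d91c3faed3de886f")
```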

Additional details

Related works

Is derived from
Preprint: arXiv:2503.09699 (arXiv)

Funding

UK Research and Innovation
Proposal for IPPP (UK National Phenomenology Institute), 2020-2023 ST/T001011/1