Published November 16, 2025 | Version v1
Dataset Open

From Simulations to Surveys: Domain Adaptation for Galaxy Observations

  • 1. Harvard University
  • 2. Ahmedabad University
  • 3. University of Technology Malaysia
  • 4. Université Paris-Saclay

Description

Abstract

Large photometric surveys will image billions of galaxies, but we currently lack fast, reliable, automated ways to infer physical properties such as morphology, stellar mass, and star formation rate. Simulations provide galaxy images with ground-truth physical labels, but domain shifts in PSF, noise, backgrounds, selection, and label priors degrade transfer to real surveys. We present a preliminary domain adaptation pipeline that trains on simulated TNG50 galaxies and evaluates on real SDSS galaxies with morphology labels (elliptical/spiral/irregular). We train three backbones (CNN, $E(2)$-steerable CNN, ResNet-18) with focal loss and effective-number class weighting, together with a feature-level domain loss $\mathcal{L}_D$ built from \texttt{GeomLoss} (entropic Sinkhorn OT, energy distance, Gaussian MMD, and related metrics). We show that combining these losses with an OT-based “top-$k$ soft matching” loss, which focuses $\mathcal{L}_D$ on the worst-matched source–target pairs, can further enhance domain alignment. With Euclidean distance, scheduled alignment weights, and top-$k$ matching, target accuracy (macro F1) rises from $\sim$46% ($\sim$30%) with no adaptation to $\sim$87% ($\sim$62.6%), with a domain AUC near 0.5, indicating strong latent-space mixing.
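The classification loss named in the abstract combines two standard ingredients. The following is a minimal sketch, assuming the usual textbook definitions: effective-number class weights proportional to (1 - beta) / (1 - beta^n_c), and focal loss -w_y (1 - p_y)^gamma log p_y. The values of `beta` and `gamma`, the class counts, and all function names are illustrative assumptions and are not taken from the authors' pipeline.

```python
import math

def effective_number_weights(class_counts, beta=0.999):
    """Weights proportional to (1 - beta) / (1 - beta**n_c), rescaled so
    they sum to the number of classes. Every count must be positive."""
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in class_counts]
    scale = len(class_counts) / sum(raw)
    return [w * scale for w in raw]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def focal_loss(logits, label, weights, gamma=2.0):
    """Class-weighted focal loss for one sample.
    With gamma=0 this reduces to weighted cross-entropy."""
    p_t = softmax(logits)[label]
    return -weights[label] * (1.0 - p_t) ** gamma * math.log(p_t)

# Hypothetical counts for (elliptical, spiral, irregular); not from the dataset.
counts = [5000, 3000, 200]
weights = effective_number_weights(counts)

# A sample whose true class is the rare one (index 2) but is scored low.
loss = focal_loss([2.0, 0.5, -1.0], label=2, weights=weights)
print(weights, loss)
```

Rarer classes receive larger weights, and the (1 - p_y)^gamma factor down-weights samples the model already classifies confidently, so the two terms together push training toward hard, under-represented examples.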

Dataset:

The dataset includes RGB galaxy images and labels for both source and target domains. The galaxy_images_rgb.zip archive contains all galaxy images from the source dataset (simulated IllustrisTNG galaxies) and the target dataset (observed SDSS Galaxy Zoo 2 galaxies). The source_galaxy_labels.csv file contains labels for the simulated galaxies, including image paths, subhalo IDs, stellar masses, star-forming flags, AGN presence flags, compactness flags, metallicity values, morphology classifications (elliptical, spiral, or irregular), and metal-rich indicators.
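As a quick sanity check after downloading, the class balance of a label file can be inspected with the standard library alone. This is a sketch only: the column name `morphology` and the sample rows are assumptions made for illustration, so check the header of the real CSV before relying on them.

```python
import csv
import io
from collections import Counter

def morphology_counts(csv_file, column="morphology"):
    """Count rows per morphology class in an open CSV file.
    `column` is an assumed header name, not confirmed by the record."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in {reader.fieldnames}")
    return Counter(row[column].strip().lower() for row in reader)

# Made-up rows standing in for source_galaxy_labels.csv.
sample = io.StringIO(
    "image_path,subhalo_id,morphology\n"
    "a.png,1,spiral\n"
    "b.png,2,elliptical\n"
    "c.png,3,spiral\n"
)
counts = morphology_counts(sample)
print(counts)
```

With the real file, the same function can be called as `morphology_counts(open("source_galaxy_labels.csv", newline=""))`, adjusting `column` to the actual header.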

The target dataset labels are provided in two files. The gz2_galaxy_labels.csv file contains the top galaxies for morphology classification, selected based on the highest confidence metrics for each class (elliptical, spiral, or irregular), with image identifiers, SDSS object IDs, stellar masses, star-forming flags, AGN flags, and morphology classifications. The gz2_galaxy_labels_master.csv file contains the complete target dataset with the same structure, including all galaxies that passed the classification thresholds. Target galaxy images can be downloaded from the Galaxy Zoo 2 data release, images_gz2.zip.
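The relationship between the two label files (a master list, and a subset of the most confident galaxies per class) can be illustrated with a generic top-N-per-class selection. This is not the authors' processing script: the keys `morphology` and `confidence`, the value of `n`, and the example rows are all hypothetical.

```python
import heapq
from collections import defaultdict

def top_n_per_class(rows, n, label_key="morphology", score_key="confidence"):
    """Keep the n highest-scoring rows within each class.
    Key names are assumptions; the real selection criteria may differ."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    return {
        label: heapq.nlargest(n, members, key=lambda r: r[score_key])
        for label, members in by_class.items()
    }

# Made-up rows standing in for a master label table.
rows = [
    {"id": "g1", "morphology": "spiral", "confidence": 0.95},
    {"id": "g2", "morphology": "spiral", "confidence": 0.71},
    {"id": "g3", "morphology": "elliptical", "confidence": 0.88},
    {"id": "g4", "morphology": "irregular", "confidence": 0.64},
    {"id": "g5", "morphology": "elliptical", "confidence": 0.91},
]
selected = top_n_per_class(rows, n=1)
print({label: [r["id"] for r in picks] for label, picks in selected.items()})
```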

The processing script that generated the target labels, including the classification thresholds and selection criteria, can be found at gz2-processing-script. The file AGN_GZ2_Hart_DR7_final.csv contains the results of crossmatching the SDSS DR7 object IDs of each galaxy with the AGN catalogue, which was used to extend the labels with AGN-related information.

Files

source_galaxy_labels.csv

Files (284.8 MB)

Checksum (MD5)                        Size
md5:e1c4f20aa615eef092363fce57033b99  284.0 MB
md5:1e21ed917146ebabfbfa02e117d7c402  173.0 kB
md5:3f0c159949083f474e402ef985ef0afa  429.0 kB
md5:3389435ab581b220632672f52ce144a9  261.4 kB
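The MD5 checksums listed above can be used to confirm that a download is complete and uncorrupted. A minimal sketch using Python's `hashlib` follows; the helper names are illustrative, and because this listing does not show which checksum belongs to which file name, each downloaded file should be paired with the checksum shown next to it on the record page.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks so that
    large archives do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_checksum(path, expected):
    """Compare a file's MD5 with an expected value such as 'md5:abc...'.
    The optional 'md5:' prefix is stripped before comparing."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[4:]
    return md5_of_file(path) == expected
```

For example, `matches_checksum(local_path, "md5:e1c4f20aa615eef092363fce57033b99")` returns `True` only if `local_path` (a placeholder for wherever the corresponding file was saved) has exactly that digest.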

Additional details

Software

Repository URL
https://github.com/ahmedsalim3/galaxy-da
Programming language
Python