Published June 24, 2025 | Version v1
Dataset Open

Deconstruct To Reconstruct: A Dataset For Parsing Complex CT Assemblies

Description

We have generated a high-quality dataset of simulated, physically accurate CT scans with ground-truth annotations. The dataset comprises seven complex LEGO assemblies (six car models and one articulated vertebrate, a T-Rex) that serve as proxies for real-world industrial assemblies. The number of parts per assembly ranges from 450 to 3600, and the part catalog of each assembly contains between 86 and 205 unique parts.
The task is challenging for the following reasons:

  • The simulated CT scans are physically accurate, and include realistic noise and imaging artifacts.
  • The CT scans are of high resolution ($\sim \! 2000^3$ voxels) and the 3D context is important for the precise localization of boundaries.
  • The CT scans contain complex assemblies with up to $\sim \! 3600$ parts. Adjacent parts fit tightly.
  • Parts vary substantially in size, ranging from small connectors to elongated components.
  • Parts share identical subparts and exhibit symmetries. Two distinct parts may differ in fine details only.

A small example assembly is available as first_assembly in ct_assembly_dataset_small.

Furthermore, we provide a checkpoint of our 3D U-Net, trained for boundary detection on the annotated CT scans.
The folder unet_model contains the checkpoint with the best validation performance and a config file for training and inference.

The dataset for each assembly (e.g. first_assembly) is structured as follows:

  • first_assembly_x10y5z5_dataset.h5: an HDF5 file with the raw CT scan and ground truth instance annotation and semantic labels. 
  • stl_catalog: a folder which contains all the meshes of the part catalog and a `first_assembly_info.json` file which contains all information on how to assemble the parts in the scene (the folder stls_watertight_replacements contains some manually fixed versions of non-watertight catalog parts).
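
The layout above can be turned into paths with a small helper. This is only a sketch: the function name `assembly_paths` is hypothetical, and it assumes that each assembly sits in a folder named after the assembly and that the rotation string in the file name ("x10y5z5" in the example) is passed in by the caller, since it may differ between assemblies.

```python
from pathlib import Path


def assembly_paths(root, name, streak="x10y5z5"):
    """Build the expected paths for one assembly (hypothetical helper).

    Assumes <root>/<name>/ holds the HDF5 file and the stl_catalog folder,
    following the naming pattern of the first_assembly example.
    """
    folder = Path(root) / name
    catalog = folder / "stl_catalog"
    return {
        "h5": folder / f"{name}_{streak}_dataset.h5",
        "stl_catalog": catalog,
        "info_json": catalog / f"{name}_info.json",
    }


paths = assembly_paths("ct_assembly_dataset_small", "first_assembly")
print(paths["h5"])
```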

Details on the HDF5 file structure and contents

Each file contains a raw scan (raw_input_volume) and the corresponding ground-truth segmentation (gt_instance_volume), together with metadata about the scan setup and the part semantics:

Datasets

  • raw_input_volume (uint16):  The raw volumetric scan data.
  • gt_instance_volume (uint16):  Ground-truth instance labels for each voxel.
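
At roughly 2000^3 voxels and 2 bytes per voxel, a single volume occupies about 16 GB uncompressed, so reading it slab by slab is often more practical than loading it whole. The sketch below uses h5py slicing for this. It assumes that both datasets are stored at the root of the file and that slabs are taken along the first axis; the helper names `iter_slabs` and `load_slab` are illustrative and not part of the released code.

```python
def iter_slabs(depth, slab_depth):
    """Yield (start, stop) index pairs that tile range(depth)."""
    for start in range(0, depth, slab_depth):
        yield start, min(start + slab_depth, depth)


def load_slab(h5_path, start, stop):
    """Read one slab of the raw scan and its instance labels.

    Assumes both datasets live at the root of the HDF5 file.
    """
    import h5py  # third-party; imported lazily so iter_slabs works without it

    with h5py.File(h5_path, "r") as f:
        # Slicing an h5py dataset reads only the requested region from disk.
        raw = f["raw_input_volume"][start:stop]
        gt = f["gt_instance_volume"][start:stop]
    return raw, gt


# Example: slab boundaries for a volume with 2000 slices, 256 slices per slab.
print(list(iter_slabs(2000, 256))[:2])
```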

Attributes

  • name (str): Name of the scanned assembly, e.g. "first_assembly".
  • semantic_label_list (np.ndarray[str]): List of semantic class identifiers corresponding to instance IDs in the volume.
  • raw_min / raw_max (np.uint16): Intensity range of the raw input volume.
  • clipping_min_corner (np.ndarray[int]): Origin of the cropped region in the original simulated CT scan volume (which included more air around the assembly), needed for alignment in the 3D scene.
  • relative_scale_to_artist (float): Only needed if a scaling was applied to part meshes before simulating the CT scans (1.0 for all datasets).  
  • shift_to_place_meshes_in_volume (np.ndarray[float]): Offsets for aligning meshes extracted from volumes and meshes from the 3D scene.
  • streak_str_addition (str): String indicating the rotation applied to the assembly to avoid streak artifacts in the CT scan, e.g. "x10y5z5" means a rotation by 10° around the x-axis, 5° around the y-axis and 5° around the z-axis.
  • voxelization_scale (np.float64): Resolution of the voxel grid (5.185185 for all datasets).
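
A few of these attributes can be put to use with short helpers, sketched below under stated assumptions: the attributes are assumed to be attached to the root of the HDF5 file, the rotation string is assumed to follow the axis-letter-plus-number pattern of the "x10y5z5" example, and clipping_min_corner is assumed to use the same axis order as the volume. The order in which the three rotations are applied is not specified here, so the parser only returns the angles. All function names are hypothetical.

```python
import re


def read_attributes(h5_path):
    """Return the file-level attributes as a dict (assumes root-level attrs)."""
    import h5py  # third-party; imported lazily

    with h5py.File(h5_path, "r") as f:
        return dict(f.attrs)


def parse_streak_rotation(streak):
    """Parse e.g. 'x10y5z5' into per-axis angles in degrees."""
    pairs = re.findall(r"([xyz])(-?\d+(?:\.\d+)?)", streak)
    return {axis: float(angle) for axis, angle in pairs}


def to_original_index(cropped_index, clipping_min_corner):
    """Map a voxel index in the cropped volume to the uncropped scan."""
    return tuple(int(i) + int(c) for i, c in zip(cropped_index, clipping_min_corner))


def normalize_intensity(value, raw_min, raw_max):
    """Rescale raw intensities to [0, 1] using raw_min / raw_max."""
    lo, hi = float(raw_min), float(raw_max)
    return (value - lo) / (hi - lo)


print(parse_streak_rotation("x10y5z5"))
```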

Citation

The dataset and ML pipeline were introduced in the following publication:
Lippmann, Peter, Roman Remme, and Fred A. Hamprecht. "Deconstruct to reconstruct: an automated pipeline for parsing complex CT assemblies." Machine Vision and Applications 37.1 (2026): 8. https://link.springer.com/article/10.1007/s00138-025-01717-5.

@article{lippmann2026deconstruct,
  title={Deconstruct to reconstruct: an automated pipeline for parsing complex CT assemblies},
  author={Lippmann, Peter and Remme, Roman and Hamprecht, Fred A},
  journal={Machine Vision and Applications},
  volume={37},
  number={1},
  pages={8},
  year={2026},
  publisher={Springer}
}

Files

ct_assembly_dataset_large.zip

Files (39.6 GB)

md5:80bd41ed187597fcd70c307a9c8870e5
39.4 GB
md5:77dc823252276e22511fe8b385e4c8be
261.3 MB

Additional details

Software

Repository URL
https://github.com/sciai-lab/DeconstruCTscans
Programming language
Python