Processed multi-omic dataset of lung neuroendocrine tumours from the lungNENomics cohort

Alcala, Nicolas; Foll, Matthieu; Fernandez-Cuesta, Lynnette; Sexton-Oates, Alexandra; Kalson, Lipika

doi:10.5281/zenodo.19366762

Published April 14, 2026 | Version 1.0.0

Dataset Open

1. Centre International de Recherche sur le Cancer

This dataset contains processed multi-omic bulk data of lung neuroendocrine tumours (lung NETs), a rare and understudied type of lung cancer. The dataset contains molecular data for 201 novel participants, comprising n=104 grade-1 and n=40 grade-2 tumours, multi-regional sequencing for 41 participants, longitudinal sampling of one participant, and reprocessed and harmonized data from previous studies of large-cell neuroendocrine carcinomas (LCNEC). The dataset consists of somatic variants called from whole-genome sequencing (WGS), gene expression and cell type estimates from RNA-sequencing data (RNA-seq), and M and beta values from DNA methylation arrays, along with metadata including the molecular group (Ca A1, Ca A2, Ca B, and supra-carcinoid enriched). Processed spatial transcriptomics (10x Genomics Visium) datasets of four samples are also included, consisting of gene expression, inferred CNVs and aneuploidy status, and spatial domains and cell proportions for each spot.

Note that in addition to the 201 newly sequenced samples mentioned above, the processed files also include 276 previously published lung neuroendocrine neoplasm samples that were reprocessed following the exact workflows described in the Methods in order to avoid batch effects. See Sexton-Oates et al medrxiv 2025 (https://www.medrxiv.org/content/10.1101/2025.07.18.25331556v1) for details, and the metadata.csv file, where samples sequenced for this study are denoted as “lungNENomics” in the “source” column, and reprocessed previously published samples are denoted according to the study where they were first published (first author + year of publication).
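The split between newly sequenced and reprocessed samples can be recovered from the “source” column of metadata.csv. The following is a minimal Python sketch, assuming the file is a comma-separated table with a header row. The “source” column and the “lungNENomics” label come from the description above; the function name, the sample_id column, and the toy rows are hypothetical.

```python
import csv
import io
from collections import Counter


def count_samples_by_source(handle):
    """Count rows per value of the "source" column of a metadata table."""
    reader = csv.DictReader(handle)
    return Counter(row["source"] for row in reader)


# Toy stand-in for metadata.csv. The sample_id column name and every value
# except "lungNENomics" are made up for illustration.
toy = io.StringIO(
    "sample_id,source\n"
    "S1,lungNENomics\n"
    "S2,lungNENomics\n"
    "S3,AuthorA 2014\n"
)
counts = count_samples_by_source(toy)
new_samples = counts["lungNENomics"]
reprocessed_samples = sum(counts.values()) - new_samples
print(new_samples, reprocessed_samples)
```

With the real file, the same function could be called as count_samples_by_source(open("metadata.csv", newline="")), provided the header contains a column named exactly “source”.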

Also note that the corresponding raw sequencing data (fastq files for RNA-seq, cram files for WGS, and idat files for methylation arrays) are hosted on the European Genome-Phenome Archive (EGA) website, study EGAS00001005979. Raw spatial data (Digital Spatial Profiling spatial proteomics and 10x Visium transcriptomics) are hosted in the same EGA study, and medical imaging data (Hematoxylin & Eosin stained whole-slide images) from the lungNENomics project are hosted in the EBI BioImage Archive (10.6019/S-BIAD3143).

Methods (English)

See associated publication for a full description of the cohort and methods (Sexton-Oates et al medrxiv 2025; https://www.medrxiv.org/content/10.1101/2025.07.18.25331556v1).

Fresh-frozen tumour tissues and adjacent normal lung tissue or whole blood were collected at diagnosis from 12 contributing centres (https://rarecancersgenomics.com/lungnenomics/), and central pathology review by six pathologists was undertaken for 187 of the 201 participants. DNA extraction was performed either using the Gentra Puregene Tissue Kit from Qiagen (between 2018 and 2020), or the DNAdvance Tissue Kit (since 2021), following the manufacturer’s instructions. RNA extraction was performed either using the miRNeasy Mini Kit (between 2018 and 2020), or the RNAdvance Tissue Kit (since 2021), following the manufacturer’s instructions. Data processing was performed using the bioinformatic workflows from the Computational Cancer Genomics team at the International Agency for Research on Cancer/World Health Organization (https://github.com/IARCbioinfo/), except for WGS, whose primary processing was performed at CNRGH, France.

WGS was performed by the Centre National de Recherche en Génomique Humaine (CNRGH) on 106 fresh-frozen tumours (including 34 multi-region tumour samples) from 72 participants, all with matched normal tissue or blood. The Illumina TruSeq DNA PCR-Free library preparation kit (20015963; Illumina) was used, and sequencing was performed on an Illumina HiSeqX5 platform (target depth of 60x for tumour tissue and 30x for normal samples) as paired-end 150 bp reads. Reads were mapped to reference genome GRCh38 (+ALT and decoy) with bwa-mem v0.7.15-r1140. Small variants (single nucleotide variants, multi-nucleotide variants, and indels) were called with Mutect2 (v4.2.0.0). Indels and multi-nucleotide variants were additionally called using Strelka2 (v2.9.10) and were only retained if they were also identified with Mutect2. Resulting Variant Call Format (VCF) files were annotated with ANNOVAR (v2020Jun08). Somatic and germline structural variants were identified using consensus calling combining DELLY (v0.8.7), Manta (v1.6.0), and SvABA (v1.1.0). A panel of germline structural variants was generated from normal samples, and somatic variants whose breakpoints overlapped (within a 100 bp region) those of germline events occurring in more than 1% of samples were removed. Copy-number variants were called using PURPLE (v2.52), using high-quality somatic variant calls to refine estimates.
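The germline-panel filter for structural variants can be illustrated with a small Python sketch. This is not the authors' workflow (which is available from the IARCbioinfo repositories) but a simplified reading of the rule: each variant is reduced to a single breakpoint, germline events are grouped by exact position, and "within a 100bp region" is interpreted as an absolute distance of at most 100 bp. All function names and toy coordinates are hypothetical.

```python
def recurrent_germline_breakpoints(germline_calls, n_normals, min_fraction=0.01):
    """Return breakpoints seen in more than min_fraction of normal samples.

    germline_calls: iterable of (sample_id, chrom, pos) tuples.
    Assumption: events are grouped by exact (chrom, pos).
    """
    samples_per_breakpoint = {}
    for sample_id, chrom, pos in germline_calls:
        samples_per_breakpoint.setdefault((chrom, pos), set()).add(sample_id)
    return [
        breakpoint
        for breakpoint, samples in samples_per_breakpoint.items()
        if len(samples) / n_normals > min_fraction
    ]


def filter_somatic(somatic_calls, recurrent, window=100):
    """Drop somatic breakpoints lying within `window` bp of a recurrent germline one."""
    kept = []
    for chrom, pos in somatic_calls:
        near_germline = any(
            g_chrom == chrom and abs(g_pos - pos) <= window
            for g_chrom, g_pos in recurrent
        )
        if not near_germline:
            kept.append((chrom, pos))
    return kept


# Toy data: 100 normals; the chr1:1000 event is seen in 2 of them (2% > 1%),
# the chr2:500 event in only 1 (1%, not above the threshold).
germline = [("N1", "chr1", 1000), ("N2", "chr1", 1000), ("N1", "chr2", 500)]
panel = recurrent_germline_breakpoints(germline, n_normals=100)
somatic = [("chr1", 1050), ("chr1", 1200), ("chr2", 510)]
filtered = filter_somatic(somatic, panel)
```

In this toy example only the somatic call at chr1:1050 is removed, because it lies within 100 bp of a germline breakpoint present in more than 1% of normals.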

RNA-seq was performed by the Cologne Center for Genomics, Germany, on 239 tumours (including 61 multi-region samples) from 178 participants. After RNA quality control, libraries were prepared using the Illumina TruSeq Stranded mRNA polyA Kit (20020595; Illumina). Libraries were sequenced using an Illumina NovaSeq 6000, as paired-end 100 bp reads. Raw reads were mapped to reference genome GRCh38 with annotation GENCODE v33 (software STAR v2.7.3a), after removing adapter sequences (Trim Galore v0.6.5). Alignments were post-processed with ABRA2 (v2.22; local realignment) and base quality score recalibration was performed (GATK v4.0.5.1). Gene expression quantification was performed using StringTie (v2.1.2).

Spatial transcriptomic sequencing of four FFPE samples was performed at Centre Léon Bérard using the 10x Genomics Visium v1 platform. Each sample was placed on a Visium slide followed by deparaffinisation, H&E staining and decrosslinking steps, according to 10x Genomics guidelines. Human probes targeting approximately 18,000 genes were hybridised overnight on the slides and captured on each spot after ligation between the LHS and RHS probes. Libraries were produced for each sample following 10x Genomics protocols and sequenced on an Illumina NovaSeq 6000 machine with a target depth of 50,000 reads per spot. Samples were processed using SpaceRanger (v1.3.0), which performed demultiplexing, read mapping to reference genome GRCh38, tissue and fiducial detection, and barcode/unique molecular identifier (UMI) counting, generating feature-barcode matrices. These matrices were used to detect spatial domains across all samples simultaneously with cell type location estimation using the IRIS algorithm.

DNA methylation arrays were performed at the International Agency for Research on Cancer, France, on 277 tumours (including multi-region tumour samples) from 191 participants. DNA was bisulphite converted using the Zymo EZ-96 DNA Methylation kit and hybridised to Infinium MethylationEPIC v1.0 BeadChip arrays (WG-317-1003, Illumina). Arrays were scanned using the Illumina iScan, generating raw intensity (IDAT) files. IDAT files were processed using R package minfi (v1.40), following our standard workflow. This workflow uses functional normalisation to produce methylation beta- and M-values and incorporates quality control of both samples and probes.
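Beta- and M-values are two scales for the same methylation measurement, conventionally related by a base-2 logit transform, M = log2(beta / (1 - beta)). The Python sketch below shows that standard relationship. It is not the minfi code used to produce the deposited tables, which may differ slightly (for example if an intensity offset was applied), and the function names are hypothetical.

```python
import math


def beta_to_m(beta):
    """Convert a methylation beta-value (strictly between 0 and 1) to an M-value."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    return math.log2(beta / (1.0 - beta))


def m_to_beta(m):
    """Convert an M-value back to a beta-value (inverse of beta_to_m)."""
    return 2.0 ** m / (2.0 ** m + 1.0)


half_methylated = beta_to_m(0.5)      # a 50% methylated probe maps to M = 0
round_trip = m_to_beta(beta_to_m(0.2))
```

M-values are unbounded and roughly symmetric around zero, which is why they are often preferred for statistical testing, while beta-values are easier to read as a methylation proportion.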

Note that in addition to the samples mentioned above, the processed files also include previously published lung neuroendocrine neoplasm samples that were reprocessed following the exact workflows described above in order to avoid batch effects. See Sexton-Oates et al medrxiv 2025 (https://www.medrxiv.org/content/10.1101/2025.07.18.25331556v1) for details.

Technical info (English)

RData objects were generated using the save function in R v4.3.0.

Table of contents (English)

Each dataset is available in R object format (RData) and text file format (csv or tsv). Names reflect the data type (e.g., dataset_snv for SNVs, gene_count_matrix for gene expression in raw count unit) and samples included (PCA for pulmonary carcinoids, ITH for intra-tumor heterogeneity multi-regional sequencing PCA samples, TR for replicates, LCNEC for large-cell neuroendocrine carcinoma, SCLC for small cell lung cancer).
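The naming convention can be decoded programmatically. The following Python sketch, under the assumption that tags are separated by underscores as in the file names listed on this page, extracts the sample-set tags from a file name. The tag meanings are taken from the paragraph above; the function and dictionary names are hypothetical.

```python
from pathlib import PurePosixPath

# Sample-set tags and their meanings, as given in the table of contents.
COHORT_TAGS = {
    "PCA": "pulmonary carcinoids",
    "ITH": "intra-tumor heterogeneity multi-regional sequencing PCA samples",
    "TR": "replicates",
    "LCNEC": "large-cell neuroendocrine carcinoma",
    "SCLC": "small cell lung cancer",
}


def cohort_tags(filename):
    """Return the known sample-set tags found in a file name, in order."""
    stem = PurePosixPath(filename).stem  # drop the extension (.csv, .RData, ...)
    return [token for token in stem.split("_") if token in COHORT_TAGS]


tags = cohort_tags("gene_count_matrix_PCA_LCNEC_SCLC_ITH_TR.csv")
descriptions = [COHORT_TAGS[tag] for tag in tags]
```

Tokens that are not in the dictionary, such as the data-type prefix or lungNENomicsCombined, are simply ignored by this sketch.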

metadata.csv: metadata from Table S1 of Sexton-Oates et al. medrxiv 2025

WGS data (145 samples in total)

dataset_snv_PCA_LCNEC_ITH: small variants
dataset_purple_PCA_LCNEC_ITH: copy number variants
dataset_sv_dmg_lungNENomicsCombined_LCNEC: structural variants

RNA-seq expression data (497 samples in total, 239 sequenced in our study + 258 public samples)

gene_count_matrix_PCA_LCNEC_SCLC_ITH_TR: raw counts
gene_TPM_matrix_PCA_LCNEC_SCLC_ITH_TR: TPM
quanTIseq_matrix_PCA_LCNEC_SCLC_ITH_TR: estimated cell type proportions

Spatial transcriptomics expression data (4 samples)

dataset_spatial_adata_LNEN071_scanpy.rds, dataset_spatial_adata_LNEN084_scanpy.rds, dataset_spatial_adata_LNEN107_scanpy.rds, dataset_spatial_adata_LNEN206_scanpy.rds: the four spatial transcriptomics datasets generated by scanpy in R object (adata) format
dataset_spatial_IRIS_object_20_domains.Rda: spatial domains and cell type deconvolution for the four samples as an IRIS R object
dataset_spatial_IRIS_annotations.csv: spatial domain for each spot for the four samples, in text format
dataset_spatial_IRIS_proportions.csv: estimated proportions of cells from each type per spot for the four samples, in text format

Methylation array (EPIC 850k)

Normalised_MandBetaTables_lungNENomicsCombined_LCNEC_ITH.RData: M values and beta values in R object format
Normalised_MTable_lungNENomicsCombined_LCNEC_ITH.csv: M values in text format
Normalised_BetaTable_lungNENomicsCombined_LCNEC_ITH.csv: beta values in text format

Files

Files (12.1 GB)

Name	MD5	Size
dataset_purple_PCA_LCNEC_ITH.csv	a0b16a08ddd7284f894ba4d64a770b5a	236.1 MB
dataset_purple_PCA_LCNEC_ITH.RData	108b02e541f2029834dd4b0591262745	65.2 MB
dataset_snv_PCA_LCNEC_ITH.csv	8d05c06de5d38f556ba276fc1555cc00	755.7 MB
dataset_snv_PCA_LCNEC_ITH.RData	cbf77203a6b3395ce9d8437418849117	89.6 MB
dataset_spatial_adata_LNEN071_scanpy.rds	e05bc77f511f502f313142241aad01c6	108.9 MB
dataset_spatial_adata_LNEN084_scanpy.rds	99b9f4d398a78e9fb8e40137934b157c	76.1 MB
dataset_spatial_adata_LNEN107_scanpy.rds	0661a497b8455c1773c6b1b2f35cfbda	68.4 MB
dataset_spatial_adata_LNEN206_scanpy.rds	0e4fee6ecea017c629a2d322020d276d	27.7 MB
dataset_spatial_IRIS_annotations.csv	2a7a9ec1775423b2bce9239db4d8f6dd	670.9 kB
dataset_spatial_IRIS_object_20_domains.Rda	b76c8bd02212cf91f9edc3138bb851e5	139.7 MB
dataset_spatial_IRIS_proportions.csv	c4fe1b653dea4f6ca78ef63e8d9275a8	8.7 MB
dataset_sv_dmg_lungNENomicsCombined_LCNEC.csv	e35e784cd332a1f6efefbcfbd82b1d77	87.9 kB
dataset_sv_dmg_lungNENomicsCombined_LCNEC.RData	2d07c6e6b66457d81985cf6c1d52c056	22.0 kB
gene_count_matrix_PCA_LCNEC_SCLC_ITH_TR.csv	d3484a75b02ee96fc43f1af2e69b1c76	84.5 MB
gene_count_matrix_PCA_LCNEC_SCLC_ITH_TR.RData	5b96adf4e13d2b3e72ec3f0b40fadc4f	36.1 MB
gene_TPM_matrix_PCA_LCNEC_SCLC_ITH_TR.RData	b2cf7235f5d6419548e9ebd673a140f4	132.9 MB
gene_TPM_matrix_PCA_LCNEC_SCLC_ITH_TR.tsv	d3631f61187ecd059261221b54e14cf9	330.5 MB
metadata.csv	75ba6f92be22d14f02666ff6fce687d2	171.1 kB
Normalised_BetaTable_lungNENomicsCombined_LCNEC_ITH.csv	fd3af5702def9f313c334a1a13330929	2.8 GB
Normalised_MandBetaTables_lungNENomicsCombined_LCNEC_ITH.RData	64b2e921be1b31e8a8154a8826dcf734	4.3 GB
Normalised_MTable_lungNENomicsCombined_LCNEC_ITH.csv	3643620a0312d8000011c687e7c9f777	2.8 GB
quanTIseq_matrix_PCA_LCNEC_SCLC_ITH_TR.csv	1e738b47b4cdd65ca004961ff1d520fe	90.2 kB
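
The md5 checksums listed with each file can be used to confirm that a download completed without corruption, which is worth doing for the multi-gigabyte methylation tables. The following is a minimal Python sketch using the standard hashlib module; the function names are hypothetical, and files are read in chunks so that large files do not need to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Example call, assuming metadata.csv has been downloaded to the working directory:
# verify_download("metadata.csv", "75ba6f92be22d14f02666ff6fce687d2")
```

A return value of False indicates a truncated or corrupted download, in which case the file should be fetched again.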

Additional details

Is source of: Preprint: 10.1101/2025.07.18.25331556 (DOI)

Funding

Neuroendocrine Tumor Research Foundation: Characterization of supra-carcinoids cell states to inform interception strategies (Accelerator 2026)
Neuroendocrine Tumor Research Foundation: Reconstructing the evolutionary history of neuroendocrine tumor subtypes (Pilot 2023)
Neuroendocrine Tumor Research Foundation: Reconciling lung carcinoids histopathological and molecular classifications (Investigator 2022)
Neuroendocrine Tumor Research Foundation: Comprehensive molecular characterization of lung supra-carcinoids (Investigator 2019)
Worldwide Cancer Research: Filling the unexpected gap from supra-carcinoids to large-cell neuroendocrine carcinomas: shedding light on the genesis of high-grade lung neuroendocrine neoplasms (26-0008)
Institut National du Cancer: Genomic characterization of broncho-pulmonary carcinoids (INCa-PRT-K-17-047)
Worldwide Cancer Research: Studying the evolution of lung neuroendocrine neoplasms to discover new treatment targets (21-0005)
Institut National du Cancer: LYRICAN+ (INCa-DGOS-INSERM-ITMO cancer_18003)
Institut National du Cancer: COALA: Cure Oncogene-Addicted Lung Adenocarcinoma (LABREXCMP24-001 – Inca_18791)

Programming language: R
