Published April 14, 2026 | Version 1.0.0

Processed multi-omic dataset of lung neuroendocrine tumours from the lungNENomics cohort

Description

This dataset contains processed multi-omic bulk data of lung neuroendocrine tumours (lung NETs), a rare and understudied type of lung cancer. The dataset contains molecular data for 201 novel participants, comprising n=104 grade-1 and n=40 grade-2 tumours, multi-regional sequencing for 41 participants, longitudinal sampling of one participant, and reprocessed and harmonized data from previous studies of large-cell neuroendocrine carcinomas (LCNEC). The dataset consists of somatic variants called from whole-genome sequencing (WGS), gene expression and cell type estimates from RNA-sequencing data (RNA-seq), and M and beta values from DNA methylation arrays, along with metadata including the molecular group (Ca A1, Ca A2, Ca B, and supra-carcinoid enriched). Processed spatial transcriptomics (10x Genomics Visium) datasets of four samples are also included, consisting of gene expression, inferred CNVs and aneuploidy status, and spatial domain and cell proportions for each spot.

Note that in addition to the 201 newly sequenced samples mentioned above, the processed files also include 276 previously published lung neuroendocrine neoplasm samples that were reprocessed with the same workflows, described in the Methods, in order to avoid batch effects. See Sexton-Oates et al. medRxiv 2025 (https://www.medrxiv.org/content/10.1101/2025.07.18.25331556v1) for details, and the metadata.csv file, where samples sequenced for this study are denoted as “lungNENomics” in the “source” column, and reprocessed previously published samples are denoted according to the study where they were first published (first author + year of publication).

Also note that the corresponding raw sequencing data (fastq files for RNA-seq, cram files for WGS, and idat files for methylation arrays) are hosted on the European Genome-phenome Archive (EGA) website, study EGAS00001005979. Raw spatial data (Digital Spatial Profiling spatial proteomics and 10x Visium transcriptomics) are hosted in the same EGA study, and medical imaging data (hematoxylin and eosin-stained whole-slide images) from the lungNENomics project are hosted in the EBI BioImage Archive (10.6019/S-BIAD3143).

Methods (English)

See the associated publication for a full description of the cohort and methods (Sexton-Oates et al. medRxiv 2025; https://www.medrxiv.org/content/10.1101/2025.07.18.25331556v1).

Fresh-frozen tumour tissues and adjacent normal lung tissue or whole blood were collected at diagnosis from 12 contributing centres (https://rarecancersgenomics.com/lungnenomics/), and central pathology review by six pathologists was undertaken for 187 of the 201 participants. DNA extraction was performed either using the Gentra Puregene Tissue Kit from Qiagen (between 2018 and 2020), or the DNAdvance Tissue Kit (since 2021), following the manufacturer’s instructions. RNA extraction was performed either using the miRNeasy Mini Kit (between 2018 and 2020), or the RNAdvance Tissue Kit (since 2021), following the manufacturer’s instructions. Data processing was performed using the bioinformatic workflows from the Computational Cancer Genomics team at the International Agency for Research on Cancer/World Health Organization (https://github.com/IARCbioinfo/), except for WGS, whose primary processing was performed at CNRGH, France.

WGS was performed by the Centre National de Recherche en Génomique Humaine (CNRGH) on 106 fresh-frozen tumours (including 34 multi-region tumour samples) from 72 participants, all with matched normal tissue or blood. The Illumina TruSeq DNA PCR-Free library preparation kit (20015963; Illumina) was used and sequencing was performed on an Illumina HiSeqX5 platform (target depth of 60x for tumour tissue and 30x for normal samples) as paired-end 150 bp reads. Reads were mapped to reference genome GRCh38 (+ALT and decoy) with bwa-mem v0.7.15-r1140. Small variants (single nucleotide variants, multi-nucleotide variants, and indels) were called with Mutect2 (v4.2.0.0). Indels and multi-nucleotide variants were additionally called using Strelka2 (v2.9.10) and were only retained if they were also identified with Mutect2. Resulting variant call format (VCF) files were annotated with ANNOVAR (v2020Jun08). Somatic and germline structural variants were identified using consensus calling combining DELLY (v0.8.7), Manta (v1.6.0), and SvABA (v1.1.0). A panel of germline structural variants was generated from normal samples, and somatic variants whose breakpoints overlapped (within a 100 bp region) those of germline events that occurred in more than 1% of samples were removed. Copy-number variants were called using PURPLE (v2.52), using high-quality somatic variant calls to refine estimates.
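The germline-panel filter for structural variants described above can be illustrated with a minimal sketch. This is not the pipeline's code: the data structures, function names and toy coordinates are hypothetical, "within a 100 bp region" is read as an absolute distance of at most 100 bp on the same chromosome, and it is assumed that a single matching breakpoint is enough to remove a somatic call.

```python
from typing import NamedTuple


class Breakpoint(NamedTuple):
    chrom: str
    pos: int


def recurrent_germline(panel, n_normals, min_fraction=0.01):
    """Germline breakpoints carried by more than min_fraction of normal samples.

    `panel` maps a Breakpoint to the number of normal samples carrying it.
    """
    return [bp for bp, n in panel.items() if n / n_normals > min_fraction]


def near(a, b, window=100):
    """True if two breakpoints lie on the same chromosome within `window` bp."""
    return a.chrom == b.chrom and abs(a.pos - b.pos) <= window


def filter_somatic(somatic_svs, panel, n_normals):
    """Drop somatic SVs that have a breakpoint near a recurrent germline one.

    Assumption: one matching breakpoint is enough to remove the SV.
    """
    recurrent = recurrent_germline(panel, n_normals)
    return [
        sv for sv in somatic_svs
        if not any(near(bp, g) for bp in sv for g in recurrent)
    ]


# Hypothetical toy data: with 200 normals, an event needs more than 2 carriers.
panel = {Breakpoint("chr1", 1_000): 5, Breakpoint("chr2", 5_000): 1}
somatic = [
    (Breakpoint("chr1", 1_050), Breakpoint("chr1", 9_000)),  # near a recurrent germline event
    (Breakpoint("chr2", 5_010), Breakpoint("chr3", 700)),    # nearby germline event is too rare
]
kept = filter_somatic(somatic, panel, n_normals=200)
print(kept)
```

In this toy case only the second call is kept, because its nearby germline event falls below the 1% recurrence threshold.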

RNA-seq was performed by the Cologne Center for Genomics, Germany, on 239 tumours (including 61 multi-region samples) from 178 participants. After RNA quality control, libraries were prepared using the Illumina TruSeq Stranded mRNA polyA Kit (20020595; Illumina). Libraries were sequenced using an Illumina NovaSeq 6000, as paired-end 100 bp reads. Raw reads were mapped to reference genome GRCh38 with annotation GENCODE v33 (software STAR v2.7.3a), after removing adapter sequences (Trim Galore v0.6.5). Alignments were post-processed with ABRA2 (v2.22; local realignment) and base quality score recalibration was performed (GATK v4.0.5.1). Gene expression quantification was performed using StringTie (v2.1.2).

Spatial transcriptomic sequencing of four FFPE samples was performed at Centre Léon Bérard using the 10x Genomics Visium v1 platform. Each sample was placed on a Visium slide followed by deparaffinisation, H&E staining and decrosslinking steps, according to 10x Genomics guidelines. Human probes targeting approximately 18,000 genes were hybridised overnight on the slides and captured on each spot after ligation between the LHS and RHS probes. Libraries were produced for each sample following 10x Genomics protocols, prepared and sequenced on an Illumina NovaSeq 6000 machine with a target depth of 50,000 reads per spot. Samples were processed using SpaceRanger (v1.3.0), which performed demultiplexing, read mapping to reference genome GRCh38, tissue and fiducial detection, and barcode/unique molecular identifier (UMI) counting, generating feature-barcode matrices. These matrices were used to detect spatial domains across all samples simultaneously, together with cell type location estimation, using the IRIS algorithm.

DNA methylation arrays were performed at the International Agency for Research on Cancer, France, on 277 tumours (including multi-region tumour samples) from 191 participants. DNA was bisulphite converted using the Zymo EZ-96 DNA Methylation kit and hybridised to Infinium MethylationEPIC v1.0 BeadChip arrays (WG-317-1003, Illumina). Arrays were scanned using the Illumina iScan, generating raw intensity (IDAT) files. IDAT files were processed using R package minfi (v1.40), following our standard workflow. This workflow uses functional normalisation to produce methylation beta- and M-values and incorporates quality control of both samples and probes.
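For readers moving between the two methylation tables, the usual relationship between beta- and M-values is a base-2 logit. The sketch below shows that standard transform only; it is not taken from the workflow, the function names are illustrative, and values recomputed this way may differ slightly from the published tables if any offset was applied during processing.

```python
import math


def beta_to_m(beta: float) -> float:
    """M-value as the log2 ratio of methylated to unmethylated fraction."""
    return math.log2(beta / (1.0 - beta))


def m_to_beta(m: float) -> float:
    """Inverse transform: beta = 2^M / (2^M + 1)."""
    return 2.0 ** m / (2.0 ** m + 1.0)


# Round-trip a few example beta values through the M scale.
for beta in (0.2, 0.5, 0.8):
    m = beta_to_m(beta)
    print(f"beta={beta:.1f}  M={m:+.3f}  back={m_to_beta(m):.3f}")
```

Beta values of exactly 0 or 1 have no finite M-value, so they would need to be clipped or offset before applying `beta_to_m`.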

Note that in addition to the samples mentioned above, the processed files also include previously published lung neuroendocrine neoplasm samples that were reprocessed following the exact workflows described above in order to avoid batch effects. See Sexton-Oates et al. medRxiv 2025 (https://www.medrxiv.org/content/10.1101/2025.07.18.25331556v1) for details.

Technical info (English)

RData objects were generated using the save function in R v4.3.0.

Table of contents (English)

Each dataset is available in R object format (RData) and text file format (csv or tsv). Names reflect the data type (e.g., dataset_snv for SNVs, gene_count_matrix for gene expression in raw count unit) and samples included (PCA for pulmonary carcinoids, ITH for intra-tumor heterogeneity multi-regional sequencing PCA samples, TR for replicates, LCNEC for large-cell neuroendocrine carcinoma, SCLC for small cell lung cancer).
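The naming convention can be decoded programmatically. The following is a small illustrative sketch, not part of the dataset: the helper name is hypothetical, and the abbreviations are copied from the paragraph above.

```python
from pathlib import PurePath

# Sample-set abbreviations as defined in the naming convention.
SAMPLE_SETS = {
    "PCA": "pulmonary carcinoids",
    "ITH": "intra-tumor heterogeneity multi-regional sequencing PCA samples",
    "TR": "replicates",
    "LCNEC": "large-cell neuroendocrine carcinoma",
    "SCLC": "small cell lung cancer",
}


def sample_sets_in(filename: str) -> list[str]:
    """Return the sample-set abbreviations found in a data file name, in order."""
    stem = PurePath(filename).stem  # drop a trailing extension such as .csv
    return [token for token in stem.split("_") if token in SAMPLE_SETS]


for name in ("dataset_purple_PCA_LCNEC_ITH.csv",
             "gene_count_matrix_PCA_LCNEC_SCLC_ITH_TR"):
    print(name)
    for token in sample_sets_in(name):
        print(f"  {token}: {SAMPLE_SETS[token]}")
```

Tokens that are not in the abbreviation table, such as the data-type prefix, are ignored.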

  • metadata.csv: metadata from Table S1 of Sexton-Oates et al. medRxiv 2025
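As described in the Description, the “source” column of metadata.csv separates samples sequenced for this study (“lungNENomics”) from reprocessed published ones. A minimal sketch of that split is given below on a toy table. Only the "source" column name and the "lungNENomics" label come from this record; the "sample_id" column, the sample IDs and the "Author2020" label are hypothetical placeholders.

```python
import csv
import io

# Toy stand-in for metadata.csv. Only the "source" column and the
# "lungNENomics" label come from the record; everything else is hypothetical.
TOY_METADATA = """sample_id,source
S1,lungNENomics
S2,Author2020
S3,lungNENomics
"""


def split_by_source(handle):
    """Split metadata rows into newly sequenced and reprocessed published samples."""
    new, published = [], []
    for row in csv.DictReader(handle):
        (new if row["source"] == "lungNENomics" else published).append(row)
    return new, published


new_rows, published_rows = split_by_source(io.StringIO(TOY_METADATA))
print(len(new_rows), "newly sequenced;", len(published_rows), "reprocessed")
```

With the real file, the same function could be called on `open("metadata.csv", newline="")` instead of the in-memory toy table.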

WGS data (145 samples in total)

  • dataset_snv_PCA_LCNEC_ITH: small variants
  • dataset_purple_PCA_LCNEC_ITH: copy number variants
  • dataset_sv_dmg_lungNENomicsCombined_LCNEC: structural variants

RNA-seq expression data (497 samples in total, 239 sequenced in our study + 258 public samples)

  • gene_count_matrix_PCA_LCNEC_SCLC_ITH_TR: raw counts 
  • gene_TPM_matrix_PCA_LCNEC_SCLC_ITH_TR: TPM
  • quanTIseq_matrix_PCA_LCNEC_SCLC_ITH_TR: estimated cell type proportions

Spatial transcriptomics expression data (4 samples)

  • dataset_spatial_adata_LNEN071_scanpy.rds, dataset_spatial_adata_LNEN071_scanpy.rds, dataset_spatial_adata_LNEN084_scanpy.rds, dataset_spatial_adata_LNEN206_scanpy.rds: the four spatial transcriptomics datasets generated by scanpy in R object (adata) format
  • dataset_spatial_IRIS_object_20_domains.Rda: spatial domains and cell type deconvolution for the four samples as an IRIS R object
  • dataset_spatial_IRIS_annotations.csv: spatial domain for each spot for the four samples, in text format
  • dataset_spatial_IRIS_proportions.csv: estimated proportions of cells from each type per spot for the four samples, in text format

Methylation array (EPIC 850k)

  • Normalized_MandBetaTables_lungNENomicsCombined_LCNEC_ITH.RData: M values and beta values in R object format
  • Normalized_MTable_lungNENomicsCombined_LCNEC_ITH.csv: M values in text format
  • Normalized_BetaTable_lungNENomicsCombined_LCNEC_ITH.csv: beta values in text format

Files

Files (12.1 GB)

Checksum                                Size
md5:a0b16a08ddd7284f894ba4d64a770b5a    236.1 MB
md5:108b02e541f2029834dd4b0591262745    65.2 MB
md5:8d05c06de5d38f556ba276fc1555cc00    755.7 MB
md5:cbf77203a6b3395ce9d8437418849117    89.6 MB
md5:e05bc77f511f502f313142241aad01c6    108.9 MB
md5:99b9f4d398a78e9fb8e40137934b157c    76.1 MB
md5:0661a497b8455c1773c6b1b2f35cfbda    68.4 MB
md5:0e4fee6ecea017c629a2d322020d276d    27.7 MB
md5:2a7a9ec1775423b2bce9239db4d8f6dd    670.9 kB
md5:b76c8bd02212cf91f9edc3138bb851e5    139.7 MB
md5:c4fe1b653dea4f6ca78ef63e8d9275a8    8.7 MB
md5:e35e784cd332a1f6efefbcfbd82b1d77    87.9 kB
md5:2d07c6e6b66457d81985cf6c1d52c056    22.0 kB
md5:d3484a75b02ee96fc43f1af2e69b1c76    84.5 MB
md5:5b96adf4e13d2b3e72ec3f0b40fadc4f    36.1 MB
md5:b2cf7235f5d6419548e9ebd673a140f4    132.9 MB
md5:d3631f61187ecd059261221b54e14cf9    330.5 MB
md5:75ba6f92be22d14f02666ff6fce687d2    171.1 kB
md5:fd3af5702def9f313c334a1a13330929    2.8 GB
md5:64b2e921be1b31e8a8154a8826dcf734    4.3 GB
md5:3643620a0312d8000011c687e7c9f777    2.8 GB
md5:1e738b47b4cdd65ca004961ff1d520fe    90.2 kB
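The md5 values in the file listing can be used to check that a download completed without corruption. Below is a minimal sketch using Python's standard hashlib module. The helper names are illustrative, and the demonstration runs on a temporary file rather than on a file from this record.

```python
import hashlib
import os
import tempfile


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or plain hex."""
    expected = listed.removeprefix("md5:").lower()
    return md5_of(path) == expected


# Demonstration on a temporary file standing in for a downloaded data file.
payload = b"example bytes"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    demo_path = tmp.name
demo_listing = "md5:" + hashlib.md5(payload).hexdigest()
print(matches_listing(demo_path, demo_listing))
os.remove(demo_path)
```

Reading in chunks keeps memory use low for the multi-gigabyte files in this record. `str.removeprefix` requires Python 3.9 or later.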

Additional details

Related works

Is source of
Preprint: 10.1101/2025.07.18.25331556 (DOI)

Funding

  • Neuroendocrine Tumor Research Foundation: Characterization of supra-carcinoids cell states to inform interception strategies (Accelerator 2026)
  • Neuroendocrine Tumor Research Foundation: Reconstructing the evolutionary history of neuroendocrine tumor subtypes (Pilot 2023)
  • Neuroendocrine Tumor Research Foundation: Reconciling lung carcinoids histopathological and molecular classifications (Investigator 2022)
  • Neuroendocrine Tumor Research Foundation: Comprehensive molecular characterization of lung supra-carcinoids (Investigator 2019)
  • Worldwide Cancer Research: Filling the unexpected gap from supra-carcinoids to large-cell neuroendocrine carcinomas: shedding light on the genesis of high-grade lung neuroendocrine neoplasms (26-0008)
  • Institut National du Cancer: Genomic characterization of broncho-pulmonary carcinoids (INCa-PRT-K-17-047)
  • Worldwide Cancer Research: Studying the evolution of lung neuroendocrine neoplasms to discover new treatment targets (21-0005)
  • Institut National du Cancer: LYRICAN+ (INCa-DGOS-INSERM-ITMO cancer_18003)
  • Institut National du Cancer: COALA: Cure Oncogene-Addicted Lung Adenocarcinoma (LABREXCMP24-001 – Inca_18791)

Software

Programming language
R