Processed multi-omic dataset of lung neuroendocrine tumours from the lungNENomics cohort
Authors/Creators
Description
This dataset contains processed multi-omic bulk data of lung neuroendocrine tumours (lung NETs), a rare and understudied type of lung cancer. The dataset contains molecular data for 201 new participants, comprising n=104 grade-1 and n=40 grade-2 tumours, multi-regional sequencing for 41 participants, longitudinal sampling of one participant, and reprocessed and harmonized data from previous studies of large-cell neuroendocrine carcinoma (LCNEC). The dataset consists of somatic variants called from whole-genome sequencing (WGS), gene expression and cell type estimates from RNA-sequencing data (RNA-seq), and M and beta values from DNA methylation arrays, along with metadata including the molecular group (Ca A1, Ca A2, Ca B, and supra-carcinoid enriched). Processed spatial transcriptomics (10x Genomics Visium) datasets of four samples are also included, consisting of gene expression, inferred CNVs and aneuploidy status, and spatial domain and cell type proportions for each spot.
Note that in addition to the 201 newly sequenced samples mentioned above, the processed files also include 276 previously published lung neuroendocrine neoplasm samples that were reprocessed with exactly the same workflows as the new samples in order to avoid batch effects. See Sexton-Oates et al., medRxiv 2025 (https://www.medrxiv.org/content/10.1101/2025.07.18.25331556v1) for details, and the metadata.csv file, where samples sequenced for this study are denoted as “lungNENomics” in the “source” column, and reprocessed previously published samples are denoted according to the study where they were first published (first author + year of publication).
Also note that the corresponding raw sequencing data (fastq files for RNA-seq, cram files for WGS, and idat files for methylation arrays) are hosted on the European Genome-Phenome Archive (EGA) website, study EGAS00001005979. Raw spatial data (Digital Spatial Profiling spatial proteomics and 10x Visium transcriptomics) are hosted in the same EGA study, and medical imaging data (Hematoxylin & Eosin stained whole-slide images) from the lungNENomics project are hosted in the EBI BioImage Archive (10.6019/S-BIAD3143).
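As an illustration of how the “source” column can be used to separate newly sequenced samples from reprocessed published ones, here is a minimal Python sketch using only the standard library. The “source” column and the “lungNENomics” label come from the description above; the `sample_id` column name, the toy rows, and the `Author2019` label are invented for the example and may not match the real metadata.csv.

```python
import csv
import io

def split_new_and_reprocessed(handle, new_label="lungNENomics"):
    """Split metadata rows by the 'source' column.

    Rows whose source equals new_label are treated as sequenced for this
    study; all other rows are treated as reprocessed published samples.
    """
    new, reprocessed = [], []
    for row in csv.DictReader(handle):
        if row["source"] == new_label:
            new.append(row)
        else:
            reprocessed.append(row)
    return new, reprocessed

# Toy table standing in for metadata.csv (hypothetical IDs and study label).
toy_metadata = (
    "sample_id,source\n"
    "S1,lungNENomics\n"
    "S2,lungNENomics\n"
    "S3,Author2019\n"
)
new_rows, reprocessed_rows = split_new_and_reprocessed(io.StringIO(toy_metadata))
print(len(new_rows), len(reprocessed_rows))
```

With the real file, the same function could be called on `open("metadata.csv", newline="")`, assuming the file is comma-separated with a header row.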
Methods (English)
See the associated publication for a full description of the cohort and methods (Sexton-Oates et al., medRxiv 2025; https://www.medrxiv.org/content/10.1101/2025.07.18.25331556v1).
Fresh-frozen tumour tissues and adjacent normal lung tissue or whole blood were collected at diagnosis from 12 contributing centres (https://rarecancersgenomics.com/lungnenomics/), and central pathology review by six pathologists was undertaken for 187 of the 201 participants. DNA extraction was performed using either the Gentra Puregene Tissue Kit from Qiagen (between 2018 and 2020) or the DNAdvance Tissue Kit (since 2021), following the manufacturer’s instructions. RNA extraction was performed using either the miRNeasy Mini Kit (between 2018 and 2020) or the RNAdvance Tissue Kit (since 2021), following the manufacturer’s instructions. Data processing was performed using the bioinformatic workflows from the Computational Cancer Genomics team at the International Agency for Research on Cancer/World Health Organization (https://github.com/IARCbioinfo/), except for WGS, whose primary processing was performed at CNRGH, France.
WGS was performed by the Centre National de Recherche en Génomique Humaine (CNRGH) on 106 fresh-frozen tumours (including 34 multi-region tumour samples) from 72 participants, all with matched normal tissue or blood. The Illumina TruSeq DNA PCR-Free library preparation kit (20015963; Illumina) was used and sequencing was performed on an Illumina HiSeqX5 platform (target depth of 60x for tumour tissue and 30x for normal samples) as paired-end 150 bp reads. Reads were mapped to reference genome GRCh38 (+ALT and decoy) with bwa-mem v0.7.15-r1140. Small variants (single nucleotide variants, multi-nucleotide variants, and indels) were called with Mutect2 (v4.2.0.0). Indels and multi-nucleotide variants were additionally called using Strelka2 (v2.9.10) and were only retained if they were also identified with Mutect2. Resulting variant call format (VCF) files were annotated with ANNOVAR (v2020Jun08). Somatic and germline structural variants were identified using consensus calling combining DELLY (v0.8.7), Manta (v1.6.0), and SvABA (v1.1.0). A panel of germline structural variants was generated from normal samples, and somatic variants whose breakpoints overlapped (within a 100 bp region) germline events that occurred in more than 1% of samples were removed. Copy-number variants were called using PURPLE (v2.52), using high-quality somatic variant calls to refine estimates.
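The germline-panel filter for structural variants described above can be sketched as follows. This is an illustrative Python sketch, not the workflow actually used: it assumes germline events are matched across normal samples by exact breakpoint position, and that a somatic variant is removed when either of its two breakpoints lies within the window of a recurrent germline breakpoint. All function names, the data layout, and the toy coordinates are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict

def build_panel(germline_calls, n_normals, min_fraction=0.01):
    """Collect germline breakpoints seen in more than min_fraction of normals.

    germline_calls: iterable of (sample_id, chrom, pos).
    Returns {chrom: sorted list of recurrent breakpoint positions}.
    """
    samples_per_site = defaultdict(set)
    for sample, chrom, pos in germline_calls:
        samples_per_site[(chrom, pos)].add(sample)
    panel = defaultdict(list)
    for (chrom, pos), samples in samples_per_site.items():
        if len(samples) / n_normals > min_fraction:
            panel[chrom].append(pos)
    for positions in panel.values():
        positions.sort()
    return panel

def near_panel(panel, chrom, pos, window=100):
    """True if a panel breakpoint lies within +/- window bp of pos."""
    positions = panel.get(chrom, [])
    i = bisect_left(positions, pos - window)
    return i < len(positions) and positions[i] <= pos + window

def filter_somatic(somatic_svs, panel, window=100):
    """Keep SVs (pairs of (chrom, pos) breakpoints) with no breakpoint near the panel."""
    return [sv for sv in somatic_svs
            if not any(near_panel(panel, c, p, window) for c, p in sv)]

# Toy data: 4 normals and a threshold of 0.3 so that the cut-off is visible;
# the text above uses 1% of samples.
germline = [("N1", "chr1", 1000), ("N2", "chr1", 1000), ("N3", "chr2", 5000)]
panel = build_panel(germline, n_normals=4, min_fraction=0.3)
somatic = [
    (("chr1", 1050), ("chr1", 90000)),  # within 100 bp of the recurrent chr1 event
    (("chr2", 5010), ("chr3", 700)),    # chr2 event is not recurrent in the toy panel
    (("chr1", 1200), ("chr1", 3000)),   # more than 100 bp away
]
kept = filter_somatic(somatic, panel)
print(len(kept))
```

Sorting the panel per chromosome lets each lookup use a binary search, which matters when many somatic calls are checked against a large panel.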
RNA-seq was performed by the Cologne Center for Genomics, Germany, on 239 tumours (including 61 multi-region samples) from 178 participants. After RNA quality control, libraries were prepared using the Illumina TruSeq Stranded mRNA polyA Kit (20020595; Illumina). Libraries were sequenced on an Illumina NovaSeq 6000 as paired-end 100 bp reads. After removing adapter sequences (Trim Galore v0.6.5), raw reads were mapped to reference genome GRCh38 with GENCODE v33 annotation using STAR (v2.7.3a). Alignments were post-processed with ABRA2 (v2.22; local realignment) and base quality score recalibration was performed (GATK v4.0.5.1). Gene expression quantification was performed using StringTie (v2.1.2).
Spatial transcriptomic sequencing of four FFPE samples was performed at Centre Léon Bérard using the 10x Genomics Visium v1 platform. Each sample was placed on a Visium slide followed by deparaffinisation, H&E staining and decrosslinking steps, according to 10x Genomics guidelines. Human probes targeting approximately 18,000 genes were hybridised overnight on the slides and captured on each spot after ligation between the LHS and RHS probes. Libraries were prepared for each sample following 10x Genomics protocols and sequenced on an Illumina NovaSeq 6000 with a target depth of 50,000 reads per spot. Samples were processed using SpaceRanger (v1.3.0), which performed demultiplexing, read mapping to reference genome GRCh38, tissue and fiducial detection, and barcode/unique molecular identifier (UMI) counting, generating feature-barcode matrices. These matrices were used to detect spatial domains across all samples, simultaneously with cell type location estimation, using the IRIS algorithm.
DNA methylation arrays were performed at the International Agency for Research on Cancer, France, on 277 tumours (including multi-region tumour samples) from 191 participants. DNA was bisulphite converted using the Zymo EZ-96 DNA Methylation kit and hybridised to Infinium MethylationEPIC v1.0 BeadChip arrays (WG-317-1003, Illumina). Arrays were scanned using the Illumina iScan, generating raw intensity (IDAT) files. IDAT files were processed using the R package minfi (v1.40), following our standard workflow. This workflow uses functional normalisation to produce methylation beta- and M-values and incorporates quality control of both samples and probes.
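For readers unfamiliar with the two methylation scales, the following Python sketch shows the standard logit relation between them, M = log2(beta / (1 - beta)). It is an illustration of the definition only. Whether the distributed M values were derived in exactly this way (for example, without intensity offsets) is an assumption, and the function names are hypothetical.

```python
import math

def beta_to_m(beta):
    """Convert a beta value in the open interval (0, 1) to an M-value."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    return math.log2(beta / (1.0 - beta))

def m_to_beta(m):
    """Convert an M-value back to a beta value."""
    return 2.0 ** m / (2.0 ** m + 1.0)

# A half-methylated probe has M = 0; the conversion is symmetric around it.
print(beta_to_m(0.5))
print(m_to_beta(beta_to_m(0.2)))
```

Beta values are easier to read as a proportion of methylated signal, whereas M-values are unbounded, which is why both tables are commonly provided side by side.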
Note that in addition to the samples mentioned above, the processed files also include previously published lung neuroendocrine neoplasm samples that were reprocessed following the exact workflows described above in order to avoid batch effects. See Sexton-Oates et al., medRxiv 2025 (https://www.medrxiv.org/content/10.1101/2025.07.18.25331556v1) for details.
Technical info (English)
RData objects were generated using the save function in R v4.3.0.
Table of contents (English)
Each dataset is available in R object format (RData) and text file format (csv or tsv). File names reflect the data type (e.g., dataset_snv for SNVs, gene_count_matrix for gene expression as raw counts) and the samples included (PCA for pulmonary carcinoids, ITH for intra-tumor heterogeneity multi-regional sequencing PCA samples, TR for replicates, LCNEC for large-cell neuroendocrine carcinoma, SCLC for small cell lung cancer).
- metadata.csv: metadata from Table S1 of Sexton-Oates et al., medRxiv 2025
WGS data (145 samples in total)
- dataset_snv_PCA_LCNEC_ITH: small variants
- dataset_purple_PCA_LCNEC_ITH: copy number variants
- dataset_sv_dmg_lungNENomicsCombined_LCNEC: structural variants
RNA-seq expression data (497 samples in total, 239 sequenced in our study + 258 public samples)
- gene_count_matrix_PCA_LCNEC_SCLC_ITH_TR: raw counts
- gene_TPM_matrix_PCA_LCNEC_SCLC_ITH_TR: TPM
- quanTIseq_matrix_PCA_LCNEC_SCLC_ITH_TR: estimated cell type proportions
Spatial transcriptomics expression data (4 samples)
- dataset_spatial_adata_LNEN071_scanpy.rds, dataset_spatial_adata_LNEN071_scanpy.rds, dataset_spatial_adata_LNEN084_scanpy.rds, dataset_spatial_adata_LNEN206_scanpy.rds: the four spatial transcriptomics datasets generated by scanpy in R object (adata) format
- dataset_spatial_IRIS_object_20_domains.Rda: spatial domains and cell type deconvolution for the four samples as an IRIS R object
- dataset_spatial_IRIS_annotations.csv: spatial domain for each spot for the four samples, in text format
- dataset_spatial_IRIS_proportions.csv: estimated proportions of cells from each type per spot for the four samples, in text format
Methylation array (EPIC 850k)
- Normalized_MandBetaTables_lungNENomicsCombined_LCNEC_ITH.RData: M values and beta values in R object format
- Normalized_MTable_lungNENomicsCombined_LCNEC_ITH.csv: M values in text format
- Normalized_BetaTable_lungNENomicsCombined_LCNEC_ITH.csv: beta values in text format
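The sample-set tags used in the file names above can be decoded programmatically. The Python sketch below is a hypothetical helper, not part of the dataset. The tag meanings are taken from the table of contents, and any other token in a name (for example, lungNENomicsCombined) is simply ignored because its meaning is not defined there.

```python
# Tag meanings as given in the table of contents.
COHORT_TAGS = {
    "PCA": "pulmonary carcinoids",
    "ITH": "intra-tumor heterogeneity multi-regional sequencing PCA samples",
    "TR": "replicates",
    "LCNEC": "large-cell neuroendocrine carcinoma",
    "SCLC": "small cell lung cancer",
}

def cohorts_in_name(file_name):
    """Return {tag: meaning} for each known sample-set tag in a file name."""
    stem = file_name.rsplit(".", 1)[0]  # drop a trailing extension, if any
    return {tag: COHORT_TAGS[tag] for tag in stem.split("_") if tag in COHORT_TAGS}

for name in ("dataset_snv_PCA_LCNEC_ITH", "gene_TPM_matrix_PCA_LCNEC_SCLC_ITH_TR"):
    print(name, "->", ", ".join(cohorts_in_name(name)))
```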
Files (12.1 GB)
| MD5 checksum | Size |
|---|---|
| a0b16a08ddd7284f894ba4d64a770b5a | 236.1 MB |
| 108b02e541f2029834dd4b0591262745 | 65.2 MB |
| 8d05c06de5d38f556ba276fc1555cc00 | 755.7 MB |
| cbf77203a6b3395ce9d8437418849117 | 89.6 MB |
| e05bc77f511f502f313142241aad01c6 | 108.9 MB |
| 99b9f4d398a78e9fb8e40137934b157c | 76.1 MB |
| 0661a497b8455c1773c6b1b2f35cfbda | 68.4 MB |
| 0e4fee6ecea017c629a2d322020d276d | 27.7 MB |
| 2a7a9ec1775423b2bce9239db4d8f6dd | 670.9 kB |
| b76c8bd02212cf91f9edc3138bb851e5 | 139.7 MB |
| c4fe1b653dea4f6ca78ef63e8d9275a8 | 8.7 MB |
| e35e784cd332a1f6efefbcfbd82b1d77 | 87.9 kB |
| 2d07c6e6b66457d81985cf6c1d52c056 | 22.0 kB |
| d3484a75b02ee96fc43f1af2e69b1c76 | 84.5 MB |
| 5b96adf4e13d2b3e72ec3f0b40fadc4f | 36.1 MB |
| b2cf7235f5d6419548e9ebd673a140f4 | 132.9 MB |
| d3631f61187ecd059261221b54e14cf9 | 330.5 MB |
| 75ba6f92be22d14f02666ff6fce687d2 | 171.1 kB |
| fd3af5702def9f313c334a1a13330929 | 2.8 GB |
| 64b2e921be1b31e8a8154a8826dcf734 | 4.3 GB |
| 3643620a0312d8000011c687e7c9f777 | 2.8 GB |
| 1e738b47b4cdd65ca004961ff1d520fe | 90.2 kB |
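Because several files are multiple gigabytes, it can be worth checking a download against its listed MD5 checksum. The following is a minimal Python sketch using the standard `hashlib` module. The helper names are hypothetical, and the demonstration uses a small temporary file rather than a real file from this record.

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex

# Demonstration on a temporary file; for a real download, pass its path and
# the checksum listed for that file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_ok = matches_checksum(tmp.name, "md5:" + hashlib.md5(b"hello").hexdigest())
demo_bad = matches_checksum(tmp.name, "md5:a0b16a08ddd7284f894ba4d64a770b5a")
os.remove(tmp.name)
print(demo_ok, demo_bad)
```

Reading in chunks keeps memory use constant, so the same function works for the largest files in the record.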
Additional details
Related works
- Is source of
- Preprint: 10.1101/2025.07.18.25331556 (DOI)
Funding
- Neuroendocrine Tumor Research Foundation
- Characterization of supra-carcinoids cell states to inform interception strategies Accelerator 2026
- Neuroendocrine Tumor Research Foundation
- Reconstructing the evolutionary history of neuroendocrine tumor subtypes Pilot 2023
- Neuroendocrine Tumor Research Foundation
- Reconciling lung carcinoids histopathological and molecular classifications Investigator 2022
- Neuroendocrine Tumor Research Foundation
- Comprehensive molecular characterization of lung supra-carcinoids Investigator 2019
- Worldwide Cancer Research
- Filling the unexpected gap from supra-carcinoids to large-cell neuroendocrine carcinomas: shedding light on the genesis of high-grade lung neuroendocrine neoplasms 26-0008
- Institut National du Cancer
- Genomic characterization of broncho-pulmonary carcinoids INCa-PRT-K-17-047
- Worldwide Cancer Research
- Studying the evolution of lung neuroendocrine neoplasms to discover new treatment targets 21-0005
- Institut National du Cancer
- LYRICAN+ INCa-DGOS-INSERM-ITMO cancer_18003
- Institut National du Cancer
- COALA: Cure Oncogene-Addicted Lung Adenocarcinoma LABREXCMP24-001 – Inca_18791
Software
- Programming language
- R