Interpretable Inflammation Landscape of Circulating Immune cells

Jiménez-Gracia, Laura; Maspero, Davide; Aguilar Fernández, Sergio; Craighero, Francesco; Nieto Sáchica, Juan Camilo; Heyn, Holger

doi:10.5281/zenodo.14851902

Published August 25, 2025 | Version v1

Journal article Open

1. Centro Nacional de Análisis Genómico
2. Centre for Genomic Regulation

This repository contains scRNA-seq processed datasets and metadata used in the manuscript entitled "Interpretable Inflammation Landscape of Circulating Immune cells".

Abstract

Inflammation is a biological phenomenon beneficial for homeostasis, but unfavorable if dysregulated. Although major progress has been made in characterizing inflammation in specific diseases, a global, holistic understanding is still elusive. This is particularly intriguing, considering its function for human health and the potential for modern medicine if fully deciphered. Here, we leverage advances in single-cell genomics to delineate inflammatory processes of circulating immune cells during infection, immune-mediated inflammatory diseases and cancer. Our single-cell atlas of >6.5 million peripheral blood mononuclear cells from 1047 patients (56% female, 43% male) and 19 diseases allowed us to learn a comprehensive model of inflammation in circulating immune cells. The atlas expanded our current knowledge of the biology of inflammation of immune-mediated diseases (7), acute (1) and chronic (3) inflammatory diseases, infection (4) and solid tumors (4), and laid the foundation to develop a precision medicine framework using unsupervised as well as explainable machine learning. Beyond a disease-centered analysis, we charted altered activity of inflammatory molecules in peripheral blood cells, depicting discriminative inflammation-related genes to further understand mechanisms of inflammation. Finally, we laid the groundwork for learning a classifier for inflammatory diseases, presenting cells in circulation as a powerful resource for patient classification.

Inflammation atlas cohort description

The project includes single-cell RNA-sequencing data generated in-house from samples shared by our collaborators at several research institutions. Samples were collected with written informed consent obtained from all participants and comply with the ethical guidelines for human samples. Specifically, we generated data from patients suffering from Rheumatoid Arthritis (RA), Psoriatic Arthritis (PSA), Crohn's Disease (CD), Ulcerative Colitis (UC), Psoriasis (PS) and Systemic Lupus Erythematosus (SLE), and from healthy controls, in collaboration with the Vall d’Hebron Research Institute within the DocTIS consortium (SCGT00). Additionally, we processed and obtained data from healthy controls in collaboration with the Institut Hospital del Mar d'Investigacions Mèdiques (SCGT01); Asthma, Chronic Obstructive Pulmonary Disease (COPD) and healthy control samples in collaboration with the University Medical Center Groningen (SCGT02); Breast Cancer (BRCA) samples in collaboration with the Vall d’Hebron Institute of Oncology (SCGT03); cirrhosis samples in collaboration with the Biomedical Research Institut Sant Pau (SCGT04); samples from patients suffering from Colorectal Cancer (CRC) in collaboration with the Katholieke Universiteit Leuven (SCGT05); and, finally, COVID and healthy control samples, also in collaboration with the Biomedical Research Institut Sant Pau (SCGT06).

Moreover, we included publicly available datasets to complete our cohort. Specifically, we considered data from patients suffering from sepsis from Reyes et al. (1) and Jiang et al. (2), Head and Neck Squamous Cell Carcinoma (HNSCC) from Cillo et al. (3), Hepatitis B Virus (HBV) from Zhang et al. (4), Multiple Sclerosis (MS) from Schafflick et al. (5), NasoPharyngeal Cancer (NPC) from Liu et al. (6), Human Immunodeficiency Virus (HIV) from Palshikar et al. (7) and Wang et al. (8), SLE from Perez et al. (9), Savage et al. (10) and Mistry et al. (11), cirrhosis from Ramachandran et al. (12), CD from Martin et al. (13), COVID-Flu-Sepsis from COMBAT, Ahern et al. (14), as well as COVID from Ren et al. (15) and healthy controls from Terekhova et al. (16) and 10X Genomics, together with the available healthy samples from all the cited studies.

NOTE: Further details on the datasets and samples included in the inflammation atlas can be found in Supplementary Table 1, Sheets 1-2.

Raw data (FASTQ)

In-house generated single-cell RNA-sequencing (scRNA-seq) data and the associated count matrices are accessible at the Sequence Read Archive (SRA), NCBI Gene Expression Omnibus (GEO) and European Genome-phenome Archive (EGA) databases. Previously published scRNA-seq data included in this project, either FASTQ files or processed count matrices, were obtained from GEO, BioStudies ArrayExpress, Broad Institute DUOS, Synapse, Genome Sequence Archive (GSA), CellXGene Data Portal, and 10X Genomics.

Inflammation atlas cohort split

We divided our dataset into three distinct groups, each serving different purposes aligned with the paper’s objectives and downstream analysis (see Fig. 1b in the manuscript).

Core: We selected a set of studies to generate the Inflammation reference atlas. These samples were randomly split, considering multiple covariates such as study ID, chemistry, and disease, into two subgroups:
- Main (atlas): Samples used to build the reference annotation, to extract biological findings and to train the patient classifier.
- Validation (unseen patients): Samples used for the first level of validation of the patient classifier. These include Core samples never seen by the classifier.
External (unseen studies): We selected a set of studies to evaluate the performance of the patient classifier. These samples represent the second level of validation using an independent set of samples and studies. External studies include samples profiled with the same and different chemistries as the Core data.

NOTE: Further details on dataset and sample splitting can be found in Supplementary Table 1, Sheet 3.

Additionally, a fourth level of dataset splitting was performed for a centralized, multi-disease scenario (SCGT00 dataset).

SCGT00_CentralizedDataset: We selected a single study that includes data from 6 diseases plus healthy controls, all generated in the same research center, with a single assay chemistry, and by the same technician. These samples were pooled in groups of 8 patients, so we stratified them by sequencing pool and disease, ensuring that reference and query patients belong to distinct cohorts.
- Main (SCGT00 atlas): Samples used to build the reference annotation, to extract biological findings and to train the patient classifier.
- Validation (external): Samples used to evaluate the patient classifier.
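
Because SCGT00 patients were processed in pools, the key property of this split is that a sequencing pool is never divided between the reference and query subsets. A minimal sketch of such a pool-level hold-out follows. It assumes, for illustration only, that each pool is associated with one disease; the record layout, the number of held-out pools per disease and the seed are hypothetical. The actual assignment is the one given in Supplementary Table 1, Sheet 4.

```python
import random
from collections import defaultdict


def split_by_pool(patients, n_query_pools=1, seed=0):
    """Assign whole pools to 'reference' or 'query', separately per disease.

    `patients` is a list of dicts with keys 'patientID', 'patientPool' and
    'disease'. Assumes (for illustration) that a pool holds one disease.
    """
    rng = random.Random(seed)

    # Collect the pools observed for each disease.
    pools_by_disease = defaultdict(set)
    for p in patients:
        pools_by_disease[p["disease"]].add(p["patientPool"])

    query_pools = set()
    for disease in sorted(pools_by_disease):
        pools = sorted(pools_by_disease[disease])
        # Keep at least one pool per disease in the reference subset.
        k = min(n_query_pools, len(pools) - 1)
        query_pools.update(rng.sample(pools, k))

    # Membership is decided per pool, so a pool is never split.
    return {
        p["patientID"]: "query" if p["patientPool"] in query_pools else "reference"
        for p in patients
    }
```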

NOTE: Further details on SCGT00_CentralizedDataset sample splitting can be found in Supplementary Table 1, Sheet 4. To regenerate this object and reproduce the manuscript results, the INFLAMMATION ATLAS data should be regenerated from "core" and then split based on the details provided in Sheet 4.

ZENODO REPOSITORY

Supplementary_Table_1.xlsx: Dataset overview of human PBMC samples.

This file contains general information regarding the datasets and the clinical information of the samples included in the current study.

Sheet 1: byStudyID. Details on the dataset (studyID), where the data has been generated (in-house or public), the 10X Genomics chemistry, the publication and the dataset reference (in case of public data), and if we have remapped the FASTQ files. In all cases, we provide the CellRanger and Reference Genome version used. Additionally, for each disease, we provide the number of donors collected before the quality control.
Sheet 2: byDisease_splitted. Summary of the number of patients per disease and stratified by subsets (Main, unseen patients or unseen studies), considering sex and binned age categories.
Sheet 3: bySampleID_afterQC. Technical and clinical metadata per sample; missing metadata is displayed as NA.
Sheet 4: SCGT00_CentralizedDataset. Details of samples from a unified, centralized study of the patient cohort, processed by sample pools (patientPool) and stratification into Reference and Query subsets.

INFLAMMATION_ATLAS_{group}_afterQC.h5ad: Raw count matrices after QC in h5ad format for each group [main, validation, external].

Here, only samples and cells that passed quality control are included. The "main" and "validation" datasets were additionally filtered to remove non-expressed genes (<1 count in less than 20 cells in less than 5 patients).
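
The gene filter given in parentheses can be read in more than one way. The sketch below implements one possible reading, stated here as an assumption: a gene is kept when at least 5 patients each have at least 20 cells with one or more counts for it. The input layout (an iterable of `(patient, cell, gene, count)` tuples) is hypothetical and chosen only to keep the example self-contained; it is not how the h5ad files are organised.

```python
from collections import defaultdict


def expressed_genes(records, min_cells=20, min_patients=5):
    """Return genes detected in >= min_cells cells in >= min_patients patients.

    `records` yields (patient, cell, gene, count) tuples. This is one
    interpretation of the filter described above, not the authors' code.
    """
    # For every (gene, patient) pair, collect the cells where the gene is detected.
    cells_per_gene_patient = defaultdict(set)
    for patient, cell, gene, count in records:
        if count >= 1:
            cells_per_gene_patient[(gene, patient)].add(cell)

    # Count the patients in which the gene reaches the per-patient cell threshold.
    patients_per_gene = defaultdict(int)
    for (gene, _patient), cells in cells_per_gene_patient.items():
        if len(cells) >= min_cells:
            patients_per_gene[gene] += 1

    return {gene for gene, n in patients_per_gene.items() if n >= min_patients}
```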

Each adata file contains:

adata.obs:
- sampleID: A unique identifier for each biological sample.
- libraryID: A unique identifier for the sequencing library prepared from the sample, used to track the specific library preparation and sequencing run associated with the sample.
- Additional sample metadata can be found in INFLAMMATION_ATLAS_afterQC_sampleMetadata.csv.
- Level1/Level2 and Level1pred/Level2pred: Cell type annotation levels. This hierarchical annotation framework consists of two levels: Level1 provides a broad classification of cell types, while Level2 offers a more detailed and granular classification.
  - Level1 and Level2 annotations are derived from samples included in the MAIN dataset. These annotations were manually curated by an immunologist through sub-clustering of the main lineages and analyzing marker gene expression.
  - Level1pred and Level2pred represent the predicted cell type annotations generated by the scANVI atlas model after projecting the corresponding dataset onto the MAIN integrated data and transferring the Level2 labels. Level1pred categories were derived by grouping the more detailed Level2pred categories based on their established ontology. No manual checks on the consistency of these annotations were performed.
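
Deriving Level1pred from Level2pred amounts to a lookup from each fine-grained label to its parent lineage. A minimal sketch follows; the label names in the mapping are hypothetical placeholders, not the atlas's actual categories, which should be read from the Level1 and Level2 columns of the main dataset.

```python
# Hypothetical Level2 -> Level1 mapping; the real label names live in adata.obs.
LEVEL2_TO_LEVEL1 = {
    "T_CD4_naive": "T_CD4",
    "T_CD4_memory": "T_CD4",
    "Mono_classical": "Mono",
    "Mono_nonclassical": "Mono",
}


def to_level1(level2_labels, mapping=LEVEL2_TO_LEVEL1):
    """Collapse predicted Level2 labels to Level1; raise on unmapped labels."""
    missing = sorted(set(level2_labels) - set(mapping))
    if missing:
        raise KeyError(f"No Level1 parent defined for: {missing}")
    return [mapping[label] for label in level2_labels]
```

Failing loudly on unmapped labels is a deliberate choice in this sketch: since no manual consistency checks were performed on the predicted annotations, a silent fallback could hide a mismatch between the two levels.
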
adata.var: Indexed by ENSEMBL gene IDs, with the following columns:
- hgnc_id: The unique identifier assigned to a gene by the HUGO Gene Nomenclature Committee (HGNC).
- symbol: The official gene symbol or abbreviation used to represent the gene.
- locus_group: The category or classification of the gene locus (e.g., protein-coding gene, non-coding RNA).
- HUGO_status: The status of the gene according to the HUGO Gene Nomenclature Committee, indicating its acceptance or validation (e.g., approved, withdrawn).
- mt: Mitochondrial genes. These genes were used to compute the percentage of mitochondrial genes detected in each cell, used as a quality control metric.
- rb: Ribosomal genes. These genes were used to compute the percentage of ribosomal genes detected in each cell, used as a quality control metric.
- hb: Hemoglobin genes. These genes were used to compute the percentage of hemoglobin genes detected in each cell, used as a quality control metric.
- plt: Platelet genes. These genes were used to compute the percentage of platelet genes detected in each cell, used as a quality control metric.
- n_cells_by_counts: The number of cells in which the gene was detected (non-zero counts).
- total_counts: The total number of counts (reads or UMIs) detected for the gene across all cells.
- gene_universe: The list of genes (8253) considered in the downstream analysis.
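
The mt, rb, hb and plt flags are typically used to compute, for each cell, the share of its counts that comes from the flagged genes. The sketch below shows that calculation on plain Python lists. Treating the metric as a percentage of counts (rather than of detected genes) is an assumption, and the function and argument names are illustrative.

```python
def pct_counts_flagged(cell_counts, flags):
    """Percentage of a cell's total counts that fall on flagged genes.

    `cell_counts` holds one count per gene and `flags` one boolean per gene
    (e.g. the 'mt' column), both in the same gene order.
    """
    if len(cell_counts) != len(flags):
        raise ValueError("cell_counts and flags must have the same length")
    total = sum(cell_counts)
    if total == 0:
        # An empty cell has no counts to attribute; report 0 rather than divide by zero.
        return 0.0
    flagged = sum(c for c, is_flagged in zip(cell_counts, flags) if is_flagged)
    return 100.0 * flagged / total
```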

INFLAMMATION_ATLAS_afterQC_sampleMetadata.csv: Sample metadata from the Inflammation Atlas patient cohort.
This file includes:
- sampleID: A unique identifier for each biological sample.
- chemistry: The specific 10X Genomics chemistry and version used to process the sample, either 3' or 5' GEX capturing [3'v2, 3'v3, 5'v1, 5'v2].
- technology: The strategy and technology used to process the sample, using standard or High-Throughput assays [HT], as well as processing a single sample per library or multiplexing samples using different strategies [GenoHashed -with SNPs-, CellPlex -with CMO-, Hashed -with HTO-].
- patientID: A unique identifier for each patient from whom the sample was taken.
- institute: The institution or research center where the sample was collected.
- disease: The specific disease or condition diagnosed in the patient.
- diseaseGroup: The broader classification of the disease [IMIDs, Solid Tumors, Infectious, Acute Inflammation, Chronic Inflammation, Healthy]
- ethnicity: The ethnic background of the patient [Caucasian, Asian, African American, NA]
- timepoint_replicate: Specific id to distinguish multiple samples from the same patient [0, 1, or 2].
- treatmentStatus: The status of the patient's treatment at the time of sample collection [Naive, Ongoing, NA].
- sex: The biological sex of the patient [male, female].
- age: The age of the patient at the time of sample collection.
- binned_age: The age of the patient categorized into bins [<18, 18-30, 31-40, 41-50, ..., >80].
- BMI: The Body Mass Index (BMI) of the patient at the time of sample collection.
- diseaseStatus: The current status or stage of the disease in the patient (e.g., COVID_severe, COVID_mild).
- smokingStatus: The smoking habits of the patient [smoker, never-smoker, former-smoker, NA]

scANVI_models.zip: This compressed folder must be unzipped before use. It contains two scANVI (single-cell ANnotated Variational Inference) models:
- scANVI_atlas: This folder contains the scANVI model trained on the full dataset, encompassing all identified cell types, including Red Blood Cells and Platelets. This comprehensive model is used to characterize the entire cellular landscape, capturing the diversity of immune and non-immune cells present in the dataset. It was used to project external datasets.
- scANVI_downstream: This folder contains a refined scANVI model used specifically for downstream analyses. It excludes Red Blood Cells and Platelets to remove possible confounding factors that could affect such analyses.

Code availability

The code to reproduce the full analysis presented in this article is hosted in the GitHub repository:

https://github.com/Single-Cell-Genomics-Group-CNAG-CRG/Inflammation-PBMCs-Atlas

Files (22.0 GB)

Name	MD5	Size
INFLAMMATION_ATLAS_afterQC_sampleMetadata.csv	395e6f817fce9d5091f274d07c91359b	149.2 kB
INFLAMMATION_ATLAS_external_afterQC.h5ad	187862d15398187c874f62bd5ebf02ce	2.1 GB
INFLAMMATION_ATLAS_main_afterQC.h5ad	14fa25b5ff8d5703e13838b2f17274c5	17.2 GB
INFLAMMATION_ATLAS_validation_afterQC.h5ad	32fe8866b1de6a2b08ce9740de31c9c2	2.5 GB
scANVI_models.zip	e3c3ec52f9c40cea7070f718d513b14c	123.4 MB
Supplementary_Table_1.xlsx	6160aad1280c8177431afd14df524822	114.5 kB
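
Given the size of the files, it is worth verifying each download against the MD5 checksums listed above. The sketch below uses only the Python standard library; the checksums are copied from the file list, while the helper names (`md5sum`, `verify`) and the assumption that files keep their original names after download are illustrative.

```python
import hashlib
import os

# MD5 checksums as listed on the record page.
EXPECTED_MD5 = {
    "INFLAMMATION_ATLAS_afterQC_sampleMetadata.csv": "395e6f817fce9d5091f274d07c91359b",
    "INFLAMMATION_ATLAS_external_afterQC.h5ad": "187862d15398187c874f62bd5ebf02ce",
    "INFLAMMATION_ATLAS_main_afterQC.h5ad": "14fa25b5ff8d5703e13838b2f17274c5",
    "INFLAMMATION_ATLAS_validation_afterQC.h5ad": "32fe8866b1de6a2b08ce9740de31c9c2",
    "scANVI_models.zip": "e3c3ec52f9c40cea7070f718d513b14c",
    "Supplementary_Table_1.xlsx": "6160aad1280c8177431afd14df524822",
}


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, name=None, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the checksum listed for `name`.

    `name` defaults to the file's base name; an unknown name raises KeyError.
    """
    name = name or os.path.basename(path)
    return md5sum(path) == expected[name]
```

Reading in chunks keeps memory use flat, which matters for the 17.2 GB main dataset.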

Additional details

European Commission
DocTIS - DECISION ON OPTIMAL COMBINATORIAL THERAPIES IN IMIDS USING SYSTEMS APPROACHES 848028

Repository URL: https://github.com/Single-Cell-Genomics-Group-CNAG-CRG/Inflammation-PBMCs-Atlas

References

1. Reyes, M. et al. An immune-cell signature of bacterial sepsis. Nat. Med. 26, 333–340 (2020).
2. Jiang, Y. et al. Single cell RNA sequencing identifies an early monocyte gene signature in acute respiratory distress syndrome. JCI Insight 5, (2020).
3. Cillo, A. R. et al. Immune Landscape of Viral- and Carcinogen-Driven Head and Neck Cancer. Immunity 52, 183-199.e9 (2020).
4. Zhang, C. et al. Single-cell RNA sequencing reveals intrahepatic and peripheral immune characteristics related to disease phases in HBV-infected patients. Gut 72, 153–167 (2023).
5. Schafflick, D. et al. Integrated single cell analysis of blood and cerebrospinal fluid leukocytes in multiple sclerosis. Nat. Commun. 11, 247 (2020).
6. Liu, Y. et al. Tumour heterogeneity and intercellular networks of nasopharyngeal carcinoma at single cell resolution. Nat. Commun. 12, 741 (2021).
7. Palshikar, M. G. et al. Executable models of immune signaling pathways in HIV-associated atherosclerosis. Npj Syst. Biol. Appl. 8, 1–15 (2022).
8. Wang, S. et al. An atlas of immune cell exhaustion in HIV-infected individuals revealed by single-cell transcriptomics. Emerg. Microbes Infect. 9, 2333–2347 (2020).
9. Perez, R. K. et al. Single-cell RNA-seq reveals cell type–specific molecular and genetic associations to lupus. Science 376, eabf1970 (2022).
10. Savage, A. K. et al. Multimodal analysis for human ex vivo studies shows extensive molecular changes from delays in blood processing. iScience 24, (2021).
11. Mistry, P. et al. Transcriptomic, epigenetic, and functional analyses implicate neutrophil diversity in the pathogenesis of systemic lupus erythematosus. Proc. Natl. Acad. Sci. 116, 25222–25228 (2019).
12. Ramachandran, P. et al. Resolving the fibrotic niche of human liver cirrhosis at single-cell level. Nature 575, 512–518 (2019).
13. Martin, J. C. et al. Single-Cell Analysis of Crohn's Disease Lesions Identifies a Pathogenic Cellular Module Associated with Resistance to Anti-TNF Therapy. Cell 178, 1493-1508.e20 (2019).
14. Ahern, D. J. et al. A blood atlas of COVID-19 defines hallmarks of disease severity and specificity. Cell 185, 916-938.e58 (2022).
15. Ren, X. et al. COVID-19 immune features revealed by a large-scale single-cell transcriptome atlas. Cell 184, 1895-1913.e19 (2021).
16. Terekhova, M. et al. Single-cell atlas of healthy human blood unveils age-related loss of NKG2C+GZMB−CD8+ memory T cells and accumulation of type 2 memory T cells. Immunity 56, 2836-2854.e9 (2023).
