This ZENG_2022__DATA_README.txt file was generated on 2022-08-15 by Zexian Zeng

GENERAL INFORMATION

1. Title of Dataset: Data from: Machine learning on syngeneic mouse tumor profiles to model clinical immunotherapy response.

2. Author Information
    Corresponding Investigator
    X. Shirley Liu, PhD
    Professor of Biostatistics and Computational Biology
    Department of Data Science
    Center for Functional Cancer Epigenetics
    Dana-Farber Cancer Institute and Harvard University
    Email: xsliu@ds.dfci.harvard.edu

3. Date of data collection: 2021

4. Geographic location of data collection: Boston, USA

5. Funding sources that supported the collection of the data: NIH

6. Recommended citation for this dataset: Zeng, Zexian (2022), Machine learning on syngeneic mouse tumor profiles to model clinical immunotherapy response, Dryad, Dataset, https://doi.org/10.5061/dryad.b8gtht7g1

DATA & FILE OVERVIEW

1. Description of dataset

We queried datasets deposited in the Gene Expression Omnibus (GEO) that matched a manually curated list of syngeneic mouse models or syngeneic cancer cell lines (Table S10). For studies involving immune checkpoint blockade (ICB) treatment with anti-PD1, anti-PDL1, anti-PDL2, or anti-CTLA4, we manually annotated the experimental characteristics of each sample. The response status of each sample was curated from the original published studies. For samples without annotated response information, we labeled response status based on the change in tumor diameter (size) after treatment. Following a consensus, we used a 30% reduction as the cutoff to call response status. To keep the data consistent between human and mouse and across datasets, we dichotomized responses into a binary label. In total, we collected 761 syngeneic tumor RNA-seq samples from 26 published studies.
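The diameter-based labeling rule can be illustrated with a minimal sketch. This is not the authors' code: the function name `label_response`, the 1/0 encoding, measuring the reduction relative to the pre-treatment diameter, and treating a reduction of exactly 30% as a response are all assumptions made for illustration.

```python
def label_response(baseline_diameter, post_treatment_diameter, cutoff=0.30):
    """Return 1 (responder) or 0 (non-responder) from a diameter change.

    Assumption: the reduction is computed relative to the baseline
    (pre-treatment) diameter, and the cutoff is inclusive.
    """
    if baseline_diameter <= 0:
        raise ValueError("baseline diameter must be positive")
    # Fractional shrinkage; negative values mean the tumor grew.
    reduction = (baseline_diameter - post_treatment_diameter) / baseline_diameter
    return 1 if reduction >= cutoff else 0


# Hypothetical measurements (same units for both values).
shrunk = label_response(10.0, 6.0)   # 40% reduction
stable = label_response(10.0, 9.0)   # 10% reduction
grew = label_response(10.0, 12.0)    # tumor grew by 20%
```

Under these assumptions, only the first hypothetical sample would be labeled a responder.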
To ensure consistency, raw sequencing reads were downloaded from each study and processed through a standardized pipeline, the RNA-seq IMmune Analysis Pipeline (RIMA) (https://liulab-dfci.github.io/RIMA/). RIMA is an automated Snakemake pipeline developed by our group to streamline RNA-seq data processing, including but not limited to read alignment, quality control, expression quantification, and batch effect removal. Reads in FASTQ files were aligned with STAR (v.2.4.2a) against the mm10 reference genome assembly (Genome Reference Consortium Mouse Build 38) from the NCI Genomic Data Commons (GDC). RNA-seq quality control (QC) was performed on the aligned BAM files using RSeQC (v2.4). Expression levels were then quantified from the BAM files with Salmon (v.0.14.0). We normalized and batch-corrected the Transcripts Per Million (TPM) data using quantile normalization and ComBat within each syngeneic mouse model.

2. File List:

File 1 Name: preICB_exprn.csv
File 1 Description: Expression profiles for the syngeneic mouse models that are not treated with ICB.
File 2 Name: postICB_exprn.csv
File 2 Description: Expression profiles for the syngeneic mouse models that are treated with ICB.
File 3 Name: preICB_exprn_external.csv
File 3 Description: Additional samples curated for external validation. This file contains expression profiles for the syngeneic mouse models that are not treated with ICB.
File 4 Name: postICB_exprn_external.csv
File 4 Description: Additional samples curated for external validation. This file contains expression profiles for the syngeneic mouse models that are treated with ICB.
File 5 Name: preICB_phenotype.csv
File 5 Description: Experimental data for the syngeneic mouse models that are not treated with ICB.
File 6 Name: postICB_phenotype.csv
File 6 Description: Experimental data for the syngeneic mouse models that are treated with ICB.
File 7 Name: preICB_phenotype_external.csv
File 7 Description: Additional samples curated for external validation. This file contains experimental data for the syngeneic mouse models that are not treated with ICB.
File 8 Name: postICB_phenotype_external.csv
File 8 Description: Additional samples curated for external validation. This file contains experimental data for the syngeneic mouse models that are treated with ICB.
File 9 Name: preICB_response.csv
File 9 Description: Response labels for the syngeneic mouse models that are not treated with ICB.
File 10 Name: postICB_response.csv
File 10 Description: Response labels for the syngeneic mouse models that are treated with ICB.
File 11 Name: preICB_response_external.csv
File 11 Description: Additional samples curated for external validation. This file contains response labels for the syngeneic mouse models that are not treated with ICB.
File 12 Name: postICB_response_external.csv
File 12 Description: Additional samples curated for external validation. This file contains response labels for the syngeneic mouse models that are treated with ICB.

METHODOLOGICAL INFORMATION

We trained the joint dimension reduction framework separately on the post-ICB and control syngeneic model data. For both training processes, we split the data into training, validation, and testing sets in a ratio of 6:2:2. The training set was used to minimize the matrix reconstruction errors, and the validation set was used to tune the hyperparameters, including the penalty terms for the phenotype and response matrix reconstruction errors and the number of latent factors k (Materials and Methods). In brief, the framework takes the gene expression matrix, mouse experimental variables, and ICB response as inputs and seeks lower-dimensional representations through iterative updates. The model converged quickly and monotonically on the training data for both treatment and control tumors (Supplementary Fig. S1A and S1C).
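As an illustration of the 6:2:2 partition described above, the following is a minimal sketch using only the Python standard library. The function name `split_samples`, the sample identifiers, and the use of a simple shuffled split are assumptions; this README does not state whether the original split was stratified (for example by mouse model or response).

```python
import random


def split_samples(sample_ids, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle sample IDs and split them into train/validation/test lists.

    Assumption: a plain random split; any stratification used in the
    original analysis is not reproduced here.
    """
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    n_train = int(round(ratios[0] * len(ids)))
    n_val = int(round(ratios[1] * len(ids)))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Hypothetical sample identifiers, for illustration only.
samples = [f"sample_{i}" for i in range(100)]
train_ids, val_ids, test_ids = split_samples(samples, seed=42)
```

Assigning the remainder to the test set keeps the three subsets disjoint and exhaustive even when the sample count is not divisible by the ratios.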
We repeated the experiment ten times with different random initial seeds. The penalty terms for the phenotype matrix and the response matrix govern their relative importance in matrix reconstruction and were tuned on the validation set based on prediction accuracy. We noted that this balance between optimizing reconstruction error and prediction accuracy could prevent model overfitting (Supplementary Fig. S1B and S1D) (Materials and Methods). Of the ten replicates, the model with the best performance on the testing set was used for downstream analyses (Table S2, Supplementary Fig. S1B and S1D).
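The replicate-selection step can be sketched as follows. This is only an outline under stated assumptions: `toy_train_and_score` is a hypothetical stand-in that returns a pseudo-random score for a seed, where a real run would fit the joint dimension reduction framework with that seed and return its testing-set performance. The seed values 0 to 9 are also assumed.

```python
import random


def run_replicates(train_and_score, seeds):
    """Run one replicate per seed and return (best_seed, best_score, scores)."""
    scores = {seed: train_and_score(seed) for seed in seeds}
    best_seed = max(scores, key=scores.get)  # highest score wins
    return best_seed, scores[best_seed], scores


def toy_train_and_score(seed):
    """Hypothetical placeholder for model training and evaluation.

    Returns a seed-dependent pseudo-random number in [0.5, 0.9] in place
    of a real testing-set performance metric.
    """
    return random.Random(seed).uniform(0.5, 0.9)


seeds = list(range(10))  # ten replicates with different initial seeds
best_seed, best_score, scores = run_replicates(toy_train_and_score, seeds)
```

Keeping the scores for all ten replicates, rather than only the best one, makes it possible to report the spread across seeds alongside the selected model.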