Computer Code for iGOF-Perturb-seq
Authors/Creators
Description
These notebooks contain code for reproducing the single-cell analyses from:
Liansheng Zhang, Qi Ma, Xiangrui Kong, Weijuan Zou, Bo Wang, Bin Wu, Shicheng Cai, Tao Bai, Runlin Tan, Ziji Dai, Xingyu Liu, Zhiheng Jia, Meimei Zhang, Tianwen Li, Yuanyi Zheng, Xinde Hu, Jianrong Wu, Zhengzheng Xu, and Haibo Zhou, "Mapping transcription factor functions in astrocytes using in vivo gain-of-function Perturb-seq", Science, 2025
The computational code for the in vivo gain-of-function Perturb-seq platform (iGOF-Perturb-seq) is based on the published paper. For detailed theoretical background, we provide relevant references throughout each analysis section, but we strongly recommend consulting the Methods section of the original publication.
Key Information Provided:
How to Define Cells with MOI=1
This component is crucial for our platform. To implement this process, please refer to the code in the 1MOI_cell_calling directory. Here, we demonstrate the computational workflow using one representative reaction as an example.
Barcode Matrix Acquisition
-
First, use
get_barcodes.pyto process the cell barcode file (possorted_genome_bam.bam) generated by Cell Ranger alignment, producing the initial matrix. -
For additional details, refer to: https://github.com/shendurelab/single-cell-ko-screens
-
Then, apply
get_tf10.barcode.reduce3.rto further filter cells, retaining only those with barcodes that exactly match the cell barcodes in the snRNA-seq data for subsequent analysis.
Data Consolidation with getline.r
This script organizes TF assignments for each cell, ensuring each cell is represented by a single data entry.
Basic Usage:
Rscript getline.r -i tf10.barcode.reduce3.txt -o tf10.mouseTF.barcode.final.filter2snRNA-seq.reduce2CellBarcodeLine2 # Or using long option names Rscript getline.r --input tf10.barcode.reduce3.txt --output tf10.mouseTF.barcode.final.filter2snRNA-seq.reduce2CellBarcodeLine2 # View help Rscript getline.r --help
MOI Calculation with pipeline.all.sh
This script performs MOI calculation. The example below demonstrates the process using the counts matrix from reaction 10, where the cells have been filtered to match those in the single-nucleus RNA-seq data.
sh pipeline.all.sh 2 1.5 tf10.mouseTF.barcode.final.filter2snRNA-seq.reduce2CellBarcodeLine
Help Information:
(base) -bash-4.2$ sh pipeline.all.sh --help Usage: pipeline.all.sh <threshold1> <threshold2> <input_file> Example: pipeline.all.sh 2 1.5 Seq6-30bp_2.output.v1.change.cell_summary_baseR.txt
Key References:
-
PMID: 29457792 PMCID: PMC5882576 DOI: 10.1038/nmeth.4604
-
PMID: 36608654 PMCID: PMC10344468 DOI: 10.1016/j.cell.2022.11.026
After defining MOI=1 cells, all subsequent analyses are built upon this foundation.
Application 1: How to predict the impact of TF perturbation on normal astrocyte functions
To predict TFs that could potentially regulate normal astrocyte functions, we first identified marker genes associated with each function: for synaptogenesis - Gpc4, Gpc6, Thbs1, Thbs2, Sparc, and Sparcl1; for glutamate uptake - Grin2b, Gria1, Slc1a3, and Slc1a2; for maintaining the BBB - Aqp4, Slc4a4, and Gja1; and for phagocytosis - Megf10, Mertk, Gas6, and Axl. We then analyzed which TF perturbations significantly altered the expression of these marker genes compared to the control (zsGreen) group. The specific implementation process can be found in the 2024-7-19.update_astrocyteGeneFunction.sh script within the "Candidates_for_Functional_Verification" directory.
Application 2: How to perform similarity analysis of TF perturbation DEGs and corresponding pathway enrichment
This primarily involves DEG identification, pathway enrichment annotation, similarity correlation analysis of pathways, and UMAP visualization of highly similar pathway modules. For more details, refer to the readme.txt file in the "DEGs_calling_pathway_correlation" directory.
Application 3: How to perform integrative analysis of TF perturbation and disease correlation
The integrative analysis of iGOF-perturb-seq DEGs and enriched pathways for 955 TFs and their correlation with public data was conducted as follows:
-
DEGs and enriched pathways were extracted from public datasets comparing disease and control groups in astrocytes.
-
DEGs and enriched pathways were extracted for each TF after iGOF-perturb-seq in astrocytes.
-
Results from steps 1 and 2 were overlapped to obtain the similarity between each TF perturbation and the public data, followed by visualization using ggraph.
This directory only illustrates the visualization workflow for assessing the similarity between TF iGOF-perturb-seq perturbations and mental illness. For detailed procedures of steps 1 and 2, please refer to the methods described in the "DEGs_calling_pathway_correlation" directory. More specific visualization code can be found in the "Disease_correlation" directory. The correlation analysis process for other neurodegenerative diseases is similar and thus omitted here.
Application 4: How to identify gene programs of TF perturbations
The analysis of gene expression programs followed the consensus non-negative matrix factorization (cNMF) algorithm, available on GitHub (https://github.com/dylkot/cNMF). We strongly recommend reading the original paper for detailed algorithmic specifics. Here, we only demonstrate the similarity analysis process after obtaining the gene score matrix and weight matrix of gene expression programs related to TF perturbations. For more details, see 2024-12-27.cnmf.r and module_umapPlot.sh in the "gene_program_calling" directory.
Reference: PMID: 31282856 PMCID: PMC6639075 DOI: 10.7554/eLife.43803
Application 5: How to perform meta and heterogeneity analysis of TF perturbations
We performed a comprehensive meta-analysis to assess the heterogeneity of transcription factor (TF) co-regulation patterns across multiple experimental conditions. Meta-analyses were conducted using the metafor package (v4.8-0) in R. Heterogeneity was quantified using I² statistics, with genes classified as low (I² < 40), moderate (40 ≤ I² < 75), or high (I² ≥ 75) heterogeneity. The 2025-8-15-meta_heterogeneity.co-functionTF.sh script in the "meta_heterogeneity_analysis" directory provides reproducible code.
Application 6: How to calculate co-hub genes for module TFs
The analysis of co-hub genes for module TFs primarily starts with DEGs as the starting point. The specific identification of DEGs is located in the "DEGs_calling_pathway_correlation" directory. The following R scripts can be run sequentially to identify co-hub gene sets for the module TFs: Rscript co_function_tf_pathway_analysis.R and Rscript identify_co_hub_genes.v3.R in the "DEGs_calling_pathway_correlation" directory provide reproducible code.
Application 7: How to calculate motif similarity for module co-function TFs
For this analysis, we first downloaded the MEME file from the HOCOMOCO13 database and then used the Tomtom software to calculate the q-value for comparing TFs. The moduleTFs_motif_similarity_analysis.sh script in the "moduleTFs_motif_similarity_analysis" directory provides reproducible code.
References: PMID: 29140464 PMCID: PMC5753240 DOI: 10.1093/nar/gkx1106; PMID: 41124023 PMCID: PMC12619998 DOI: 10.1093/bioinformatics/btaf577
Application 8: How to calculate patterns of TF perturbation across different brain regions
Note that this analysis indirectly distinguishes brain regions relying only on published markers of astrocyte subtypes across different brain regions. Future studies could directly isolate specific brain regions for investigation. The 2025-6-27.mouse_tf_position.analysis.sh script in the "Perturbation_of_TF_in_different_brain_regions" directory provides reproducible code for brain region analysis.
Reference: PMID: 36378959 PMCID: PMC9873482 DOI: 10.1126/science.adc9020
Application 9: How to calculate expression patterns of candidate TFs across different brain regions (integration of public data)
This analysis explores the specific expression distribution patterns of candidate TFs (such as Ferd3l) across different mouse brain regions. The readme.txt file in the "selected_TFs_expression_across_brain_regions" directory demonstrates the use of reproducible code.
Reference: PMID: 38092916 PMCID: PMC10719114 DOI: 10.1038/s41586-023-06812-z
Application 10: How to analyze perturbation similarity of TFs on specific pathways
This analysis is consistent with the process we used in our previous project; for additional details please refer to https://github.com/xdzperfect/KO-of-MDD-Risk-Genes-in-vivo. The eachTF_perturbation_correlation.given_pathway.analysis.r script in the "Specific_pathway_correlation_analysis" directory provides reproducible code.
Application 11: How to analyze subTFs iGOF-Perturb-seq
This analysis follows the workflow of the pools screening iGOF-Perturb-seq analysis process. For more specific details, please refer back to the previous analysis directories and the Methods section of the paper. The readme.txt file in the "subTFs_Perturb-seq_under_LPS" directory demonstrates the use of reproducible code.
Application 12: How to analyze TF dosage
The principle behind this dosage analysis is that in cells with 1 MOI, if more UMIs are detected for a certain TF, it is interpreted as a higher dosage of that TF. Therefore, this dosage analysis relies on mouse TF NGS deep-seq data. The 2025-8-16.dosage_analysis.home_method.sh script in the "TF_dosages_analysis" directory demonstrates the use of reproducible code.
Application 13: How to identify target genes of TFs
To determine what proportion of differentially expressed genes resulting from a TF perturbation are direct target genes versus indirect targets, we primarily integrate TF perturb-seq DEGs with CUT&Tag peak-enriched genes to define direct and indirect target genes. For more details, refer to the TFs_target_genes_analysis.sh script in the "TFs_target_gene_analysis" directory.
Application 14: How to perform WGCNA analysis of TF perturbations
Our WGCNA analysis utilized the hdWGCNA algorithm. The specific workflow and theoretical details are described in the protocol available at: https://smorabit.github.io/hdWGCNA/. Alternatively, you can consult the computational code from our other project here: https://github.com/xdzperfect/KO-of-MDD-Risk-Genes-in-vivo. It should be noted that here we performed hdWGCNA analysis specifically on 1 MOI astrocytes, with the perturbed gene set consisting of 955 TFs.
Reference: PMID: 37426759 PMCID: PMC10326379 DOI: 10.1016/j.crmeth.2023.100498
Application 15: How to analyze array data of TFs
For this analysis, we selected specific TFs for downstream array experiments, comparing and analyzing the similarity between overexpression of specific TFs and pools screening, as well as perturbation patterns under different disease states. Here we only demonstrate some computational processes not available in existing software packages. Cell-cell communication (CCC) analysis was conducted using CellChat (v2.2.0) with default parameters (details omitted).
References: PMID: 31348891 PMCID: PMC6662628 DOI: 10.1016/j.cell.2019.06.029; PMID: 33597522 PMCID: PMC7889871 DOI: 10.1038/s41467-021-21246-9
Special Notes:
This project did not develop new algorithms but primarily applied published methods. For more detailed research, please refer to the repositories of the published methods:
-
snRNA-seq analysis is based on Hao, et al. 2023 (https://satijalab.org/seurat/)
-
Count tables for gRNA expression libraries are generated as shown in Hill et al 2018 (https://github.com/shendurelab/single-cell-ko-screens.git)
-
MOI calling and DEGs analysis are based on Santinha, A.J, et al. 2023 (https://github.com/plattlab/AAV-Perturb-seq) and Julia Joung, et al. 2023 (fengzhanglab/Joung_TFAtlas_Manuscript: v1.0 (zenodo.org)
-
hdWGCNA analysis is based on Morabito et al. 2023 (https://smorabit.github.io/hdWGCNA/)
The analysis process for other array data is consistent with the pools data analysis process and is not provided here. Please contact us if needed.
Data Availability:
Raw data and intermediate Seurat objects have been uploaded to the NGDC database. You can search for project PRJCA028495 (https://ngdc.cncb.ac.cn/search/all?&q=PRJCA028495) from the NGDC database to obtain download links.
Files
1MOI_cell_calling.zip
Files
(1.2 GB)
| Name | Size | Download all |
|---|---|---|
|
md5:9e3fbfb2e5bdcd3c5e0f176d86d6f59c
|
16.2 MB | Preview Download |
|
md5:84830fc3ad099988e89739c26cf40c6e
|
43.0 MB | Preview Download |
|
md5:e684cb32254a1945bf836931f1016720
|
1.8 kB | Preview Download |
|
md5:880a3b4b34f7ea14fd1651065beb1c52
|
109.7 MB | Preview Download |
|
md5:1cf7c7854a31217f35264954d1284834
|
793.5 kB | Preview Download |
|
md5:9ac05f3fcd5ce5d11bda916cba2178b0
|
799.8 MB | Preview Download |
|
md5:9a7086205fab3505787963a107ba990d
|
11.9 kB | Preview Download |
|
md5:56a94a7ed27dbfc4801a014ff2db98ab
|
11.9 MB | Preview Download |
|
md5:27531fee2232844e6b1f738b9e52e634
|
1.1 kB | Preview Download |
|
md5:b3dc9a55ad6393502897ea786aa641ba
|
137.3 MB | Preview Download |
|
md5:93bea4652a31c29313f2a4814afc52c2
|
46.0 MB | Preview Download |
|
md5:4ea19c5be58746101109389d37a901d6
|
32.2 MB | Preview Download |
|
md5:ca0b6b3360b8e332bba1964de17ee9f2
|
1.1 MB | Preview Download |
|
md5:cf2712bd15bb2cac04563beab74f6b6b
|
3.7 kB | Preview Download |
|
md5:402a129b206e74febbf55e9ec57f64e6
|
1.8 MB | Preview Download |
|
md5:20d54a1344d923521d67fab7864d1ab0
|
453 Bytes | Preview Download |