Computer Code for iGOF-Perturb-seq

Xu, Zhengzheng; Kong, Xiangrui

doi:10.5281/zenodo.17622140

Published November 16, 2025 | Version v1

Computational notebook Open

Computer Code for iGOF-Perturb-seq

These notebooks contain code for reproducing the single-cell analyses from:

Liansheng Zhang, Qi Ma, Xiangrui Kong, Weijuan Zou, Bo Wang, Bin Wu, Shicheng Cai, Tao Bai, Runlin Tan, Ziji Dai, Xingyu Liu, Zhiheng Jia, Meimei Zhang, Tianwen Li, Yuanyi Zheng, Xinde Hu, Jianrong Wu, Zhengzheng Xu, and Haibo Zhou, "Mapping transcription factor functions in astrocytes using in vivo gain-of-function Perturb-seq", Science, 2025

The computational code for the in vivo gain-of-function Perturb-seq platform (iGOF-Perturb-seq) is based on the published paper. For detailed theoretical background, we provide relevant references throughout each analysis section, but we strongly recommend consulting the Methods section of the original publication.

Key Information Provided:

How to Define Cells with MOI=1

This component is crucial for our platform. To implement this process, please refer to the code in the 1MOI_cell_calling directory. Here, we demonstrate the computational workflow using one representative reaction as an example.

Barcode Matrix Acquisition

First, use get_barcodes.py to process the cell barcode file (possorted_genome_bam.bam) generated by Cell Ranger alignment, producing the initial matrix.
For additional details, refer to: https://github.com/shendurelab/single-cell-ko-screens
Then, apply get_tf10.barcode.reduce3.r to further filter cells, retaining only those with barcodes that exactly match the cell barcodes in the snRNA-seq data for subsequent analysis.

Data Consolidation with getline.r

This script organizes TF assignments for each cell, ensuring each cell is represented by a single data entry.

Basic Usage:

bash

Rscript getline.r -i tf10.barcode.reduce3.txt -o tf10.mouseTF.barcode.final.filter2snRNA-seq.reduce2CellBarcodeLine2

# Or using long option names
Rscript getline.r --input tf10.barcode.reduce3.txt --output tf10.mouseTF.barcode.final.filter2snRNA-seq.reduce2CellBarcodeLine2

# View help
Rscript getline.r --help

MOI Calculation with pipeline.all.sh

This script performs MOI calculation. The example below demonstrates the process using the counts matrix from reaction 10, where the cells have been filtered to match those in the single-nucleus RNA-seq data.

bash

sh pipeline.all.sh 2 1.5 tf10.mouseTF.barcode.final.filter2snRNA-seq.reduce2CellBarcodeLine

Help Information:

bash

(base) -bash-4.2$ sh pipeline.all.sh --help
Usage: pipeline.all.sh <threshold1> <threshold2> <input_file>
Example: pipeline.all.sh 2 1.5 Seq6-30bp_2.output.v1.change.cell_summary_baseR.txt

Key References:

PMID: 29457792 PMCID: PMC5882576 DOI: 10.1038/nmeth.4604
PMID: 36608654 PMCID: PMC10344468 DOI: 10.1016/j.cell.2022.11.026

After defining MOI=1 cells, all subsequent analyses are built upon this foundation.

Application 1: How to predict the impact of TF perturbation on normal astrocyte functions

To predict TFs that could potentially regulate normal astrocyte functions, we first identified marker genes associated with each function: for synaptogenesis - Gpc4, Gpc6, Thbs1, Thbs2, Sparc, and Sparcl1; for glutamate uptake - Grin2b, Gria1, Slc1a3, and Slc1a2; for maintaining the BBB - Aqp4, Slc4a4, and Gja1; and for phagocytosis - Megf10, Mertk, Gas6, and Axl. We then analyzed which TF perturbations significantly altered the expression of these marker genes compared to the control (zsGreen) group. The specific implementation process can be found in the 2024-7-19.update_astrocyteGeneFunction.sh script within the "Candidates_for_Functional_Verification" directory.

Application 2: How to perform similarity analysis of TF perturbation DEGs and corresponding pathway enrichment

This primarily involves DEG identification, pathway enrichment annotation, similarity correlation analysis of pathways, and UMAP visualization of highly similar pathway modules. For more details, refer to the readme.txt file in the "DEGs_calling_pathway_correlation" directory.

Application 3: How to perform integrative analysis of TF perturbation and disease correlation

The integrative analysis of iGOF-perturb-seq DEGs and enriched pathways for 955 TFs and their correlation with public data was conducted as follows:

DEGs and enriched pathways were extracted from public datasets comparing disease and control groups in astrocytes.
DEGs and enriched pathways were extracted for each TF after iGOF-perturb-seq in astrocytes.
Results from steps 1 and 2 were overlapped to obtain the similarity between each TF perturbation and the public data, followed by visualization using ggraph.

This directory only illustrates the visualization workflow for assessing the similarity between TF iGOF-perturb-seq perturbations and mental illness. For detailed procedures of steps 1 and 2, please refer to the methods described in the "DEGs_calling_pathway_correlation" directory. More specific visualization code can be found in the "Disease_correlation" directory. The correlation analysis process for other neurodegenerative diseases is similar and thus omitted here.

Application 4: How to identify gene programs of TF perturbations

The analysis of gene expression programs followed the consensus non-negative matrix factorization (cNMF) algorithm, available on GitHub (https://github.com/dylkot/cNMF). We strongly recommend reading the original paper for detailed algorithmic specifics. Here, we only demonstrate the similarity analysis process after obtaining the gene score matrix and weight matrix of gene expression programs related to TF perturbations. For more details, see 2024-12-27.cnmf.r and module_umapPlot.sh in the "gene_program_calling" directory.

Reference: PMID: 31282856 PMCID: PMC6639075 DOI: 10.7554/eLife.43803

Application 5: How to perform meta and heterogeneity analysis of TF perturbations

We performed a comprehensive meta-analysis to assess the heterogeneity of transcription factor (TF) co-regulation patterns across multiple experimental conditions. Meta-analyses were conducted using the metafor package (v4.8-0) in R. Heterogeneity was quantified using I² statistics, with genes classified as low (I² < 40), moderate (40 ≤ I² < 75), or high (I² ≥ 75) heterogeneity. The 2025-8-15-meta_heterogeneity.co-functionTF.sh script in the "meta_heterogeneity_analysis" directory provides reproducible code.

Application 6: How to calculate co-hub genes for module TFs

The analysis of co-hub genes for module TFs primarily starts with DEGs as the starting point. The specific identification of DEGs is located in the "DEGs_calling_pathway_correlation" directory. The following R scripts can be run sequentially to identify co-hub gene sets for the module TFs: Rscript co_function_tf_pathway_analysis.R and Rscript identify_co_hub_genes.v3.R in the "DEGs_calling_pathway_correlation" directory provide reproducible code.

Application 7: How to calculate motif similarity for module co-function TFs

For this analysis, we first downloaded the MEME file from the HOCOMOCO13 database and then used the Tomtom software to calculate the q-value for comparing TFs. The moduleTFs_motif_similarity_analysis.sh script in the "moduleTFs_motif_similarity_analysis" directory provides reproducible code.

References: PMID: 29140464 PMCID: PMC5753240 DOI: 10.1093/nar/gkx1106; PMID: 41124023 PMCID: PMC12619998 DOI: 10.1093/bioinformatics/btaf577

Application 8: How to calculate patterns of TF perturbation across different brain regions

Note that this analysis indirectly distinguishes brain regions relying only on published markers of astrocyte subtypes across different brain regions. Future studies could directly isolate specific brain regions for investigation. The 2025-6-27.mouse_tf_position.analysis.sh script in the "Perturbation_of_TF_in_different_brain_regions" directory provides reproducible code for brain region analysis.

Reference: PMID: 36378959 PMCID: PMC9873482 DOI: 10.1126/science.adc9020

Application 9: How to calculate expression patterns of candidate TFs across different brain regions (integration of public data)

This analysis explores the specific expression distribution patterns of candidate TFs (such as Ferd3l) across different mouse brain regions. The readme.txt file in the "selected_TFs_expression_across_brain_regions" directory demonstrates the use of reproducible code.

Reference: PMID: 38092916 PMCID: PMC10719114 DOI: 10.1038/s41586-023-06812-z

Application 10: How to analyze perturbation similarity of TFs on specific pathways

This analysis is consistent with the process we used in our previous project; for additional details please refer to https://github.com/xdzperfect/KO-of-MDD-Risk-Genes-in-vivo. The eachTF_perturbation_correlation.given_pathway.analysis.r script in the "Specific_pathway_correlation_analysis" directory provides reproducible code.

Application 11: How to analyze subTFs iGOF-Perturb-seq

This analysis follows the workflow of the pools screening iGOF-Perturb-seq analysis process. For more specific details, please refer back to the previous analysis directories and the Methods section of the paper. The readme.txt file in the "subTFs_Perturb-seq_under_LPS" directory demonstrates the use of reproducible code.

Application 12: How to analyze TF dosage

The principle behind this dosage analysis is that in cells with 1 MOI, if more UMIs are detected for a certain TF, it is interpreted as a higher dosage of that TF. Therefore, this dosage analysis relies on mouse TF NGS deep-seq data. The 2025-8-16.dosage_analysis.home_method.sh script in the "TF_dosages_analysis" directory demonstrates the use of reproducible code.

Application 13: How to identify target genes of TFs

To determine what proportion of differentially expressed genes resulting from a TF perturbation are direct target genes versus indirect targets, we primarily integrate TF perturb-seq DEGs with CUT&Tag peak-enriched genes to define direct and indirect target genes. For more details, refer to the TFs_target_genes_analysis.sh script in the "TFs_target_gene_analysis" directory.

Application 14: How to perform WGCNA analysis of TF perturbations

Our WGCNA analysis utilized the hdWGCNA algorithm. The specific workflow and theoretical details are described in the protocol available at: https://smorabit.github.io/hdWGCNA/. Alternatively, you can consult the computational code from our other project here: https://github.com/xdzperfect/KO-of-MDD-Risk-Genes-in-vivo. It should be noted that here we performed hdWGCNA analysis specifically on 1 MOI astrocytes, with the perturbed gene set consisting of 955 TFs.

Reference: PMID: 37426759 PMCID: PMC10326379 DOI: 10.1016/j.crmeth.2023.100498

Application 15: How to analyze array data of TFs

For this analysis, we selected specific TFs for downstream array experiments, comparing and analyzing the similarity between overexpression of specific TFs and pools screening, as well as perturbation patterns under different disease states. Here we only demonstrate some computational processes not available in existing software packages. Cell-cell communication (CCC) analysis was conducted using CellChat (v2.2.0) with default parameters (details omitted).

References: PMID: 31348891 PMCID: PMC6662628 DOI: 10.1016/j.cell.2019.06.029; PMID: 33597522 PMCID: PMC7889871 DOI: 10.1038/s41467-021-21246-9

Special Notes:

This project did not develop new algorithms but primarily applied published methods. For more detailed research, please refer to the repositories of the published methods:

snRNA-seq analysis is based on Hao, et al. 2023 (https://satijalab.org/seurat/)
Count tables for gRNA expression libraries are generated as shown in Hill et al 2018 (https://github.com/shendurelab/single-cell-ko-screens.git)
MOI calling and DEGs analysis are based on Santinha, A.J, et al. 2023 (https://github.com/plattlab/AAV-Perturb-seq) and Julia Joung, et al. 2023 (fengzhanglab/Joung_TFAtlas_Manuscript: v1.0 (zenodo.org)
hdWGCNA analysis is based on Morabito et al. 2023 (https://smorabit.github.io/hdWGCNA/)

The analysis process for other array data is consistent with the pools data analysis process and is not provided here. Please contact us if needed.

Data Availability:

Raw data and intermediate Seurat objects have been uploaded to the NGDC database. You can search for project PRJCA028495 (https://ngdc.cncb.ac.cn/search/all?&q=PRJCA028495) from the NGDC database to obtain download links.

Files

1MOI_cell_calling.zip

Files (1.2 GB)

Name	Size
1MOI_cell_calling.zip md5:9e3fbfb2e5bdcd3c5e0f176d86d6f59c	16.2 MB	Preview Download
array_data_analysis.zip md5:84830fc3ad099988e89739c26cf40c6e	43.0 MB	Preview Download
Candidates_for_Functional_Verification.zip md5:e684cb32254a1945bf836931f1016720	1.8 kB	Preview Download
DEGs_calling_pathway_correlation.zip md5:880a3b4b34f7ea14fd1651065beb1c52	109.7 MB	Preview Download
Disease_correlation.zip md5:1cf7c7854a31217f35264954d1284834	793.5 kB	Preview Download
gene_program_calling.zip md5:9ac05f3fcd5ce5d11bda916cba2178b0	799.8 MB	Preview Download
meta_heterogeneity_analysis.zip md5:9a7086205fab3505787963a107ba990d	11.9 kB	Preview Download
moduleTFs_cohubGene_identification.zip md5:56a94a7ed27dbfc4801a014ff2db98ab	11.9 MB	Preview Download
moduleTFs_motif_similarity_analysis.zip md5:27531fee2232844e6b1f738b9e52e634	1.1 kB	Preview Download
Perturbation_of_TF_in_different_brain_regions.zip md5:b3dc9a55ad6393502897ea786aa641ba	137.3 MB	Preview Download
selected_TFs_expression_across_brain_regions.zip md5:93bea4652a31c29313f2a4814afc52c2	46.0 MB	Preview Download
Specific_pathway_correlation_analysis.zip md5:4ea19c5be58746101109389d37a901d6	32.2 MB	Preview Download
subTFs_Perturb-seq_under_LPS.zip md5:ca0b6b3360b8e332bba1964de17ee9f2	1.1 MB	Preview Download
TF_dosages_analysis.zip md5:cf2712bd15bb2cac04563beab74f6b6b	3.7 kB	Preview Download
TFs_target_gene_analysis.zip md5:402a129b206e74febbf55e9ec57f64e6	1.8 MB	Preview Download
WGCNA_analysis.zip md5:20d54a1344d923521d67fab7864d1ab0	453 Bytes	Preview Download

	All versions	This version
Views	148	145
Downloads	410	410
Data volume	41.6 GB	41.6 GB

Computer Code for iGOF-Perturb-seq

Authors/Creators

Description

Key Information Provided:

How to Define Cells with MOI=1

Barcode Matrix Acquisition

Data Consolidation with getline.r

MOI Calculation with pipeline.all.sh

Key References:

Files

1MOI_cell_calling.zip

Files (1.2 GB)