Published May 7, 2025 | Version v1
Dataset Open

In silico Gene Perturbation Results from GeneRAIN Models

Creators

  • 1. BGI
  • 2. UNSW Sydney
  • 3. University of Sydney

Description

This dataset contains the processed results of in silico gene perturbation experiments conducted using various GeneRAIN models. GeneRAIN is a suite of Transformer-based models developed for learning gene expression relationships from large-scale bulk RNA-seq data. These experiments were designed to evaluate the ability of different GeneRAIN model architectures and normalization strategies to simulate transcriptomic responses to genetic perturbations.

The dataset comprises six gzipped CSV files, each representing the results from a specific GeneRAIN model and normalization method combination:

  • GeneRAIN.GPT_Binning_by_genes.perturb_gene_level_details.csv.gz: Results from the GeneRAIN GPT model using the Binning-By-Gene normalization method.
  • GeneRAIN.GPT_Z-Score.perturb_gene_level_details.csv.gz: Results from the GeneRAIN GPT model using the Z-Score normalization method.
  • GeneRAIN.Pred_expr_Binning_by_genes.perturb_gene_level_details.csv.gz: Results from the GeneRAIN BERT-Pred-Expr model using the Binning-By-Gene normalization method.
  • GeneRAIN.Pred_expr_Z-Score.perturb_gene_level_details.csv.gz: Results from the GeneRAIN BERT-Pred-Expr model using the Z-Score normalization method.
  • GeneRAIN.Pred_genes_Binning_by_genes.perturb_gene_level_details.csv.gz: Results from the GeneRAIN BERT-Pred-Genes model using the Binning-By-Gene normalization method.
  • GeneRAIN.Pred_genes_Z-Score.perturb_gene_level_details.csv.gz: Results from the GeneRAIN BERT-Pred-Genes model using the Z-Score normalization method.
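The file list above follows a regular naming pattern, so it can be captured as a small lookup table. The following is a minimal sketch in Python; the names SUFFIX, CONFIGS, FILES, and files_for are illustrative helpers introduced here, not part of the dataset or the GeneRAIN code.

```python
# Common suffix shared by all six file names listed above.
SUFFIX = ".perturb_gene_level_details.csv.gz"

# (file-name stem, model, normalization method), copied from the list above.
CONFIGS = [
    ("GeneRAIN.GPT_Binning_by_genes", "GPT", "Binning-By-Gene"),
    ("GeneRAIN.GPT_Z-Score", "GPT", "Z-Score"),
    ("GeneRAIN.Pred_expr_Binning_by_genes", "BERT-Pred-Expr", "Binning-By-Gene"),
    ("GeneRAIN.Pred_expr_Z-Score", "BERT-Pred-Expr", "Z-Score"),
    ("GeneRAIN.Pred_genes_Binning_by_genes", "BERT-Pred-Genes", "Binning-By-Gene"),
    ("GeneRAIN.Pred_genes_Z-Score", "BERT-Pred-Genes", "Z-Score"),
]

# Full file name -> (model, normalization method).
FILES = {stem + SUFFIX: (model, norm) for stem, model, norm in CONFIGS}


def files_for(model=None, normalization=None):
    """Return the file names matching the given model and/or normalization."""
    return [
        name
        for name, (m, n) in FILES.items()
        if (model is None or m == model)
        and (normalization is None or n == normalization)
    ]
```

For example, files_for(normalization="Z-Score") returns the three Z-Score files, which is convenient when comparing model architectures under a single normalization method.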

Each CSV file contains one row for each gene within each processed sample used in the in silico perturbation analysis. The columns provide detailed information about the sample, the gene, its expression state, the applied perturbation, and the resulting gene embeddings from the model:

  • Batch_Index: The index of the batch the sample belonged to.
  • Sample_Index_in_Batch: The index of the sample within its batch.
  • Dataset_Label: The label of the dataset partition (e.g., 'K562_essential').
  • Gene_Pos_In_Input: The position of the gene in the input sequence fed to the model (0-based index), typically based on expression ranking.
  • Gene_ID_Index: The numerical index representing the specific gene in the gene embedding space.
  • Gene_Symbol: The gene symbol corresponding to the Gene_ID_Index.
  • Input_Binned_Expr: The binned expression value of this gene in the baseline input fed to the model (relevant for binning-based models).
  • Output_Binned_Expr_True: The true binned expression value of this gene after perturbation, as provided by the input dataset (not predicted by the model).
  • Perturbed_Gene_ID: The Gene_ID_Index of the gene whose expression was artificially altered in the in silico perturbation for this specific sample. This value is the same for all rows corresponding to the same sample.
  • Is_Perturbed_Input_Gene: A boolean (True/False) indicating if this specific gene (Gene_ID_Index in this row) is the one that was perturbed in silico for this sample.
  • Gene_Emb_Baseline: A comma-separated string representing the embedding vector of this gene derived from the baseline (unperturbed) input.
  • Gene_Emb_Perturbed: A comma-separated string representing the embedding vector of this gene derived from the perturbed input.
  • Gene_ID_Perturbed_Input (Optional): For the 'GPT' and 'Bert_pred_tokens' models, this column gives the Gene_ID_Index found at this Gene_Pos_In_Input in the perturbed input sequence, which may differ from the baseline input sequence.
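Because each file holds one row per gene per sample, a common first step is to group rows by sample and pick out the perturbed gene. The sketch below uses only the Python standard library. It assumes that a sample is identified by the pair (Batch_Index, Sample_Index_in_Batch) and that the boolean column is written as the literal strings "True" and "False"; both are assumptions to check against the actual files. The demo rows and the gene symbols GENE_A and GENE_B are made up for illustration and contain only a subset of the columns.

```python
import csv
import gzip
import io
from collections import defaultdict


def read_rows(binary_stream):
    """Yield one dict per CSV row from a gzip-compressed binary stream."""
    with gzip.open(binary_stream, mode="rt", newline="") as text:
        yield from csv.DictReader(text)


def group_by_sample(rows):
    """Group rows by (Batch_Index, Sample_Index_in_Batch)."""
    samples = defaultdict(list)
    for row in rows:
        key = (int(row["Batch_Index"]), int(row["Sample_Index_in_Batch"]))
        samples[key].append(row)
    return dict(samples)


def perturbed_rows(sample_rows):
    """Return the rows flagged as the in silico perturbed gene.

    Assumes the flag is serialized as the string "True".
    """
    return [r for r in sample_rows if r["Is_Perturbed_Input_Gene"] == "True"]


# Made-up demo data with a subset of the documented columns.
DEMO_CSV = (
    "Batch_Index,Sample_Index_in_Batch,Gene_Symbol,Is_Perturbed_Input_Gene\n"
    "0,0,GENE_A,True\n"
    "0,0,GENE_B,False\n"
    "0,1,GENE_B,True\n"
)
demo_stream = io.BytesIO(gzip.compress(DEMO_CSV.encode("utf-8")))
samples = group_by_sample(read_rows(demo_stream))
```

With a downloaded file, the same functions can be applied by passing a file object opened in binary mode, for example open(path, "rb"), in place of demo_stream.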

These data can be used to analyze and compare the effects of in silico gene perturbations on gene representations across different GeneRAIN model configurations and normalization methods, supporting research into how these models capture and simulate biological responses.
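One simple way to quantify the effect of a perturbation on a gene's representation is the cosine distance between its baseline and perturbed embeddings. This is only one possible measure, chosen here for illustration; the dataset description does not state which metric the authors used. The sketch below works on embeddings that have already been converted to lists of floats, and the vectors and gene symbols in the demo are made up.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return float("nan")  # distance is undefined for a zero vector
    return 1.0 - dot / (norm_u * norm_v)


def rank_by_shift(baseline, perturbed):
    """Rank genes by embedding shift, largest first.

    Both arguments map a gene symbol to its embedding vector.
    """
    shifts = [
        (gene, cosine_distance(baseline[gene], perturbed[gene]))
        for gene in baseline
    ]
    return sorted(shifts, key=lambda item: item[1], reverse=True)


# Made-up two-dimensional embeddings for illustration only.
demo_baseline = {"GENE_A": [1.0, 0.0], "GENE_B": [1.0, 1.0]}
demo_perturbed = {"GENE_A": [0.0, 1.0], "GENE_B": [2.0, 2.0]}
ranked = rank_by_shift(demo_baseline, demo_perturbed)
```

In the demo, GENE_A rotates by 90 degrees and ranks first, while GENE_B only changes in magnitude and so has a cosine distance close to zero. Cosine distance ignores changes in vector length, so a Euclidean distance may be a better choice if magnitude matters for the analysis.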

Notes on usage:

  • The embedding vectors (Gene_Emb_Baseline and Gene_Emb_Perturbed) are stored as comma-separated strings and must be converted to numerical arrays before analysis.
  • GitHub repository: https://github.com/suzheng/GeneRAIN
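The conversion of the embedding strings mentioned in the note above can be sketched as follows. This minimal example uses the standard-library array module and assumes the strings contain only comma-separated decimal numbers with no brackets or quotes; the function names parse_embedding and parse_row_embeddings are illustrative.

```python
from array import array


def parse_embedding(text):
    """Convert a comma-separated string into a typed array of doubles."""
    return array("d", (float(value) for value in text.split(",")))


def parse_row_embeddings(row):
    """Return (baseline, perturbed) embeddings for one CSV row.

    `row` is a mapping with the Gene_Emb_Baseline and
    Gene_Emb_Perturbed columns, e.g. a dict from csv.DictReader.
    """
    baseline = parse_embedding(row["Gene_Emb_Baseline"])
    perturbed = parse_embedding(row["Gene_Emb_Perturbed"])
    if len(baseline) != len(perturbed):
        raise ValueError("baseline and perturbed embeddings differ in length")
    return baseline, perturbed


# Made-up values for illustration only.
demo_row = {"Gene_Emb_Baseline": "0.5,-1.25,3e-2", "Gene_Emb_Perturbed": "0.25,0,1"}
demo_baseline, demo_perturbed = parse_row_embeddings(demo_row)
```

The resulting arrays can be passed to numpy.asarray if NumPy is available. A malformed or empty string raises ValueError from float, which makes damaged rows easy to spot early.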

Files