Published February 27, 2023 | Version v1
Dataset Open

BEELINE

Description

This collection consists of over 400 single-cell gene expression datasets across four curated and six synthetic gene regulatory networks. It was created to benchmark algorithms for gene regulatory network inference in Pratapa et al. (2020).

 

Task: The collection can be used to study causal inference algorithms.

 

Summary: 

  • Size of collection: >400 datasets of varying size, each with 6 to 19 features
  • Task: Causal Inference Problem
  • Data Type: Mixed Data
  • Dataset Scope: Collection of Datasets
  • Ground Truth: Known Graph
  • Temporal Structure: Static Data
  • License: CC BY-NC 4.0 (see 10.5281/zenodo.3701939)
  • Missing Values: No Missing Values

 

Missingness Statement: There are no missing values.

 

Collection: (for a detailed description, see Peng et al. (2024); for simulation details, see Pratapa et al. (2020))

  • Curated: There are experiments on four curated gene regulatory networks: mCAD (Mammalian Cortical Area Development, 14 edges and 5 nodes), VSC (Ventral Spinal Cord Development, 15 edges and 8 nodes), HSC (Hematopoietic Stem Cell Differentiation, 30 edges and 11 nodes), and GSD (Gonadal Sex Determination, 79 edges and 18 nodes).
  • Synthetic: There are experiments on six synthetic gene regulatory networks: dyn-BF (Bifurcating, 12 edges and 5 nodes), dyn-BFC (Bifurcating Converging, 18 edges and 9 nodes), dyn-CY (Cycle, 6 edges and 5 nodes), dyn-LI (Linear, 8 edges and 7 nodes), dyn-LL (Linear Long, 19 edges and 18 nodes), and dyn-TF (Trifurcating, 20 edges and 7 nodes).

 

Files per Experiment:

  • GroundTruth.csv: This file represents the actual biological regulatory interactions between genes, typically derived from known databases, literature, or synthetic models. An edge weight of +1 represents activation, -1 represents inhibition.
  • refNetwork.csv: This file is a processed version of the ground truth network, keeping only the sign (+ or -) of interactions.
  • ExpressionData.csv: This file contains the RNAseq data, with genes as rows and cell IDs as columns.
  • PseudoTime.csv: This file contains the pseudotime, a computationally inferred measure that orders single cells along a trajectory to represent their progression through a biological process, such as differentiation or development.

Files

beeline_datasets.zip (63.1 MB)
md5:857cb335da79d09865cd7db2549fbda4
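
After downloading, the archive can be checked against the MD5 checksum listed above. The following is a minimal sketch using only the Python standard library. The local path `beeline_datasets.zip` is an assumption about where the file was saved, and `md5_of_file` is a hypothetical helper name.

```python
import hashlib
import os

EXPECTED_MD5 = "857cb335da79d09865cd7db2549fbda4"  # checksum from the file listing
ARCHIVE_PATH = "beeline_datasets.zip"  # assumed local download location


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks so the
    whole archive never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if os.path.exists(ARCHIVE_PATH):
    actual = md5_of_file(ARCHIVE_PATH)
    print("checksum OK" if actual == EXPECTED_MD5 else f"checksum mismatch: {actual}")
else:
    print(f"{ARCHIVE_PATH} not found; download it first")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the archive again is the simplest remedy.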

Additional details

Related works

Is derived from
Dataset: 10.5281/zenodo.3701939 (DOI)
Is published in
Journal article: 10.1038/s41592-019-0690-6 (DOI)

References