Published February 13, 2025 | Version 1.0.1
Dataset Open

NeuroScape

Description

NeuroScape: A Curated Dataset of Neuroscientific Articles from 1999 to 2023

This dataset comprises a collection of neuroscientific articles published between January 1, 1999, and December 31, 2023. The compilation includes information on articles and research domain clusters in multiple formats, including CSV, GraphML, and HDF5.

Scope and Selection Criteria

  • Source Journals: The articles in this dataset were retrieved from journals ranked in the first and second quartiles (Q1 and Q2) in the field of neuroscience according to the SCImago Journal Rank. Additionally, articles from Q1 multidisciplinary journals such as Nature, Science, and PLOS One were included.
  • Search Methodology: PubMed searches were conducted for each year using the journal name and publication year as query terms. All articles returned from these searches were initially included.
  • Discipline Classification: A neural network classifier was used to retain only articles related to neuroscience. Articles that did not meet the classifier's threshold were excluded.
  • Non-Exhaustiveness: This dataset does not encompass all neuroscientific articles published in the given period. Articles without abstracts or key metadata were omitted, and classification errors may have led to the exclusion of some relevant publications.
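
The per-journal, per-year search described under Search Methodology can be sketched as follows. The exact query syntax used to build the dataset is not stated in this description; the sketch assumes the standard PubMed field tags `[Journal]` and `[dp]` (date of publication), and the helper name `build_queries` is hypothetical. It only builds query strings and does not contact PubMed.

```python
def build_queries(journals, first_year=1999, last_year=2023):
    """Return one PubMed query string per (journal, year) pair."""
    queries = []
    for journal in journals:
        for year in range(first_year, last_year + 1):
            # Assumption: journal name and publication year are joined with AND.
            queries.append(f'"{journal}"[Journal] AND {year}[dp]')
    return queries


queries = build_queries(["Neuron", "Nature"])
print(len(queries))   # 50 (2 journals x 25 years)
print(queries[0])     # "Neuron"[Journal] AND 1999[dp]
```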

Changelog

Version 1.0.1 (Latest)

  • Fixed incorrect cluster citation graph: The previous version had an incorrect cluster_citation_density.graphml file. This has now been corrected.

Directory Structure

.
├── Code
│   ├── notebooks
│   │   ├── keyword_search.ipynb
│   │   ├── exploring_clusters.ipynb
│   │   ├── loading_article_shards.ipynb
│   │   ├── traversing_article_graph.ipynb
│   │   ├── discipline_classification.ipynb
│   │   └── from_generic_to_domain_embedding.ipynb
│   ├── requirements.txt
│   └── src
│       ├── data_types.py
│       └── utils.py
└── Data
    ├── CSV
    │   ├── neuroscience_articles_1999-2023.csv
    │   ├── neuroscience_clusters_1999-2023.csv
    │   └── neuroscience_dimensions_1999-2023.csv
    ├── Graphs
    │   ├── cluster_citation_density.graphml
    │   └── article_similarity.graphml
    ├── HDF5
    │   ├── DomainEmbeddings
    │   │   └── 2037 shard_#SHARD_ID.h5 files containing 200 articles
    │   └── VoyageAIEmbeddings
    │       ├── Large_02_Instruct
    │       │   └── 2037 shard_#SHARD_ID.h5 files containing 200 articles
    │       └── Lite_02_Instruct
    │           └── 2037 shard_#SHARD_ID.h5 files containing 200 articles
    └── Models
        ├── discipline_classification_model.pth
        └── domain_embedding_model.pth

Code

The Code folder contains minimal example code to help users get started with the dataset. It includes:

  • Jupyter Notebooks demonstrating how to work with the data through minimal usage examples.
  • Python Scripts with basic utilities for handling the dataset.

These examples provide a simple foundation for working with the dataset. More advanced analysis and demonstrations are covered in the accompanying publication.

CSV Files

Neuroscience Articles (neuroscience_articles_1999-2023.csv)

This file contains metadata on neuroscientific articles from 1999 to 2023.

Variables:

  • Pmid: PubMed ID (unique identifier).
  • Doi: Digital Object Identifier.
  • Type: Article type (Review or Research).
  • Title: Article title.
  • Year: Year of publication.
  • Month: Month of publication.
  • Age: Age of the article as of January 3, 2025.
  • Citations: Total number of citations.
  • Citation Rate: Citations divided by article age.
  • Cluster ID: The research cluster the article belongs to (neuroscience_clusters_1999-2023.csv).
  • Journal: The journal where the article was published.
  • Disciplines: Disciplines covered by the journal as classified by SCImago. The article does NOT necessarily qualify for all listed disciplines.
  • Abstract: The abstract of the article.
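
As a worked example of the Citation Rate definition above (citations divided by article age), the following sketch parses a small in-memory excerpt that reuses column names from the list. The rows are made up for illustration, the unit of Age is not stated in this description (the sample assumes years), and the helper `citation_rate` is hypothetical.

```python
import csv
import io

# Hypothetical two-row excerpt; the values are invented for illustration.
sample = io.StringIO(
    "Pmid,Type,Year,Age,Citations,Cluster ID\n"
    "111,Research,2010,15.0,30,7\n"
    "222,Review,2020,5.0,40,7\n"
)


def citation_rate(citations, age):
    """Citations divided by article age; returns 0.0 when age is not positive."""
    return citations / age if age > 0 else 0.0


rows = list(csv.DictReader(sample))
rates = {r["Pmid"]: citation_rate(int(r["Citations"]), float(r["Age"])) for r in rows}
print(rates)  # {'111': 2.0, '222': 8.0}
```

To run this against the real file, replace `sample` with an open handle to neuroscience_articles_1999-2023.csv, after checking that the header names match.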

Neuroscience Clusters (neuroscience_clusters_1999-2023.csv)

Clusters of related articles based on research themes.

Variables:

  • Cluster ID: Unique identifier for the cluster.
  • Title: Title of the research cluster.
  • Size: Number of articles in the cluster.
  • Year First Article: Year of the earliest article in the cluster.
  • MCR Research: Median citation rate for research articles.
  • MCR Review: Median citation rate for review articles.
  • Reference Krackhardt: Measure of internal vs. external references.
  • Citation Krackhardt: Measure of internal vs. external citations.
  • Most Cited Cluster: Cluster most frequently cited by articles in this cluster.
  • Most Citing Cluster: Cluster that cites this cluster the most.
  • Keywords: Keywords describing the cluster.
  • Description: A summary of the research in the cluster.
  • Focus: Whether the cluster is focused on content or methodology.
  • Most Similar Cluster: Cluster most semantically similar to this one.
  • Similarity: Cosine similarity score with the most similar cluster.
  • Distinguishing Features: Key features distinguishing the cluster from its similar cluster.
  • Open Questions: Outstanding research questions within the cluster.
  • Dimensions: Evaluation of dimensions including appliedness, modality, spatiotemporal scale, cognitive complexity, species focus, theoretical engagement, theory scope, methodological approach, and interdisciplinarity.
  • Trends: Emerging or declining trends between January 2021 and December 2023.
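
The Krackhardt and MCR variables above can be illustrated with a short sketch. This description does not give the exact formula or sign convention used for the dataset; the code assumes the standard Krackhardt E-I index, (E - I) / (E + I), where I counts links that stay inside the cluster and E counts links that leave it. The function name and the counts are hypothetical.

```python
from statistics import median


def krackhardt_ei(internal, external):
    """Standard Krackhardt E-I index: (E - I) / (E + I), ranging from -1 to 1."""
    total = internal + external
    if total == 0:
        return 0.0  # convention chosen for this sketch when a cluster has no links
    return (external - internal) / total


# Hypothetical counts: 60 references stay inside the cluster, 40 point outside.
print(krackhardt_ei(internal=60, external=40))  # -0.2

# Median citation rate over hypothetical per-article citation rates.
mcr_research = median([0.5, 2.0, 3.5])
print(mcr_research)  # 2.0
```

Under this convention, negative values indicate a cluster that mostly links to itself and positive values a cluster that mostly links outward.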

Neuroscience Dimensions (neuroscience_dimensions_1999-2023.csv)

Provides various research dimensions assessed for each cluster. Each dimension comes with specific binarized categories.

Key Variables:

  • Appliedness: Fundamental, translational, or clinical focus.
  • Modality: Auditory, visual, olfactory, gustatory, somatosensory.
  • Spatiotemporal Scale: Focus on molecular, cellular, system-level neuroscience.
  • Cognitive Complexity: Simple vs. complex cognitive processes.
  • Species: Human, non-human primate, rodent, etc.
  • Theory Engagement: Data-driven vs. hypothesis-driven research.
  • Theory Scope: Scope of theoretical frameworks utilized by the cluster.
  • Methodological Approach: Experimental, observational, computational, meta-analytic.
  • Interdisciplinarity: Low to very high.

HDF5 Files

The HDF5 directory contains two sets of embeddings for the article abstracts (DomainEmbeddings and VoyageAIEmbeddings). Each embedding folder contains 2037 HDF5 shard files, each holding about 200 articles stored in a custom-defined article type.
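
A minimal sketch for enumerating the shard files of one embedding folder is shown below. It relies only on the shard_#SHARD_ID.h5 naming pattern from the directory structure; the relative path is an assumption about where the archive was extracted, and `list_shards` is a hypothetical helper. Reading the shard contents is covered by loading_article_shards.ipynb and is not reproduced here.

```python
from pathlib import Path


def list_shards(embedding_dir):
    """Return all shard files in one embedding folder, sorted by file name."""
    # Sorting is lexicographic, so shard_10.h5 sorts before shard_2.h5
    # unless the IDs are zero-padded.
    return sorted(Path(embedding_dir).glob("shard_*.h5"))


# Assumption: the archive was extracted into the current working directory.
shards = list_shards("Data/HDF5/DomainEmbeddings")
print(f"{len(shards)} shard files found")
```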

Article Datatypes:

  • pmid, doi, title, type, journal, year, age, citationcount, citationrate, abstract: Corresponds directly with the CSV data.
  • embedding: Text embedding of the article's abstract. There are two versions.
  • out_links: List of PubMed IDs for articles in the dataset that are cited by this article (references).
  • in_links: List of PubMed IDs for articles in the dataset that cite this article (citations).
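
The relationship between out_links and in_links can be sketched with plain dictionaries: within the dataset, in_links is the inverse of out_links. The function name and the three-article example are hypothetical, and the sketch works on in-memory data rather than on the HDF5 shards.

```python
from collections import defaultdict


def derive_in_links(out_links):
    """Invert a {pmid: [cited pmids]} mapping into {pmid: [citing pmids]}."""
    in_links = defaultdict(list)
    for citing, cited_list in out_links.items():
        for cited in cited_list:
            # Links only cover articles inside the dataset, so skip unknown IDs.
            if cited in out_links:
                in_links[cited].append(citing)
    return {pmid: sorted(in_links[pmid]) for pmid in out_links}


out_links = {101: [102, 103], 102: [103], 103: []}
print(derive_in_links(out_links))  # {101: [], 102: [101], 103: [101, 102]}
```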

Please note that abstracts of articles in the subfolders of HDF5/VoyageAIEmbeddings have been embedded using Voyage AI's voyage-lite-02-instruct and voyage-large-02-instruct models, respectively. Those in the folder HDF5/DomainEmbeddings are voyage-large-02-instruct embeddings that have subsequently been transformed into a domain-specific, lower-dimensional embedding using a custom neural network (domain_embedding_model.pth).

Graph-Based Data

Article Similarity Graph (article_similarity.graphml)

A graph representation of article similarity based on cosine similarity between abstract embeddings (using the domain-specific embeddings produced by domain_embedding_model.pth).

  • Vertices: Each article is a node with pmid (PubMed ID) as an attribute.
  • Edges: Each article is connected to its 50 nearest neighbor articles (by cosine similarity).
  • Edge Weight: The cosine similarity score between the two articles.
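
The construction described above (nearest neighbors by cosine similarity, with the similarity as edge weight) can be sketched in plain Python. This is a brute-force illustration on three made-up two-dimensional vectors, not the procedure used to build the released graph; the function names are hypothetical, vectors are assumed to be non-zero, and whether the released graph is directed or deduplicated is not stated in this description.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def top_k_edges(embeddings, k):
    """Return (source, target, weight) edges to each article's k nearest neighbors."""
    edges = []
    for src, u in embeddings.items():
        scored = [(cosine(u, v), dst) for dst, v in embeddings.items() if dst != src]
        # Highest similarity first; keep the k best neighbors.
        for weight, dst in sorted(scored, reverse=True)[:k]:
            edges.append((src, dst, weight))
    return edges


# Hypothetical embeddings keyed by PubMed ID.
embeddings = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0]}
edges = top_k_edges(embeddings, k=1)
for edge in edges:
    print(edge)
```

With k=50 and the full set of domain embeddings, a brute-force loop like this would be slow; an approximate nearest-neighbor index is the usual choice at that scale.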

Citation Density Graph (cluster_citation_density.graphml)

Represents citation relationships between research clusters.

  • Vertices: Each cluster is a node, identified by its Cluster ID.
  • Edges: Each cluster is connected to:
    • The Most Citing Cluster: The cluster that cites articles in this cluster more than any other.
    • The Most Cited Cluster: The cluster that this cluster cites more than any other.
  • Edge Weight: The citation density, defined as:
    • The fraction of actual citations between two clusters relative to the maximum possible citations between them.
    • This calculation takes relative article ages into account to normalize citation activity over time.
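
One way to read the citation density definition above is sketched below. The exact age normalization used for the dataset is not specified in this description; the sketch assumes that a citation counts as possible only when the cited article is not newer than the citing one. The function name and the example data are hypothetical.

```python
def citation_density(citing, cited, citations):
    """Fraction of possible citations from `citing` to `cited` that actually exist.

    citing, cited: {pmid: publication year} for the two clusters.
    citations: set of (citing_pmid, cited_pmid) pairs.
    """
    # Assumption: a pair is "possible" only if the cited article is not newer.
    possible = [
        (a, b)
        for a, year_a in citing.items()
        for b, year_b in cited.items()
        if a != b and year_b <= year_a
    ]
    if not possible:
        return 0.0
    actual = sum(1 for pair in possible if pair in citations)
    return actual / len(possible)


cluster_a = {1: 2020, 2: 2015}
cluster_b = {3: 2018, 4: 2010}
density = citation_density(cluster_a, cluster_b, {(1, 3), (2, 4)})
print(density)
```

In this example three citing/cited pairs are possible (article 2 cannot cite the newer article 3) and two of them are realized, so the density is 2/3.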

Model Files

The dataset includes two pre-trained neural network models for classification and embedding transformation.

Discipline Classification Model (discipline_classification_model.pth)

  • Type: PyTorch model (.pth format)
  • Purpose: Classifies articles into scientific disciplines.

Domain Embedding Model (domain_embedding_model.pth)

  • Type: PyTorch model (.pth format)
  • Purpose: Transforms high-dimensional text embeddings into domain-specific lower-dimensional representations.

Files

NeuroScape_v101.zip (5.5 GB)
md5:d6273d39510defdc8fc13ee31a8d9a3c