There is a newer version of the record available.

Published October 31, 2025 | Version 1.10.2
Dataset Open

BIP! NDR (NoDoiRefs): a dataset of citations from papers without DOIs in computer science conferences and workshops

  • 1. Univ. of the Peloponnese & ATHENA RC
  • 2. ATHENA RC
  • 3. Univ. of the Peloponnese

Description

Overview

In Computer Science, conference and workshop papers are important contributions and carry more weight in research assessment than they do in other disciplines. However, many of these papers are not assigned a Digital Object Identifier (DOI), so their citations are not reported in widely used citation datasets such as OpenCitations and Crossref, which limits citation analysis. The Microsoft Academic Graph (MAG) previously addressed this issue by providing substantial coverage, but its discontinuation has left a gap in the available data.

BIP! NDR aims to alleviate this issue and support research assessment in Computer Science. To do so, it uses a workflow that identifies and retrieves open access papers lacking DOIs from the DBLP Corpus and then extracts citation information directly from their full text through text analysis.

The current version of the dataset contains ~4.3M citations made by approximately 211K open access Computer Science conference or workshop papers that, according to DBLP, do not have a DOI. The DBLP snapshot used for this version is the one released in September 2025.

Dataset files

1. Core Non-DOI Citation Dataset - bip_ndr_{version}.tar.gz

The dataset is formatted as a JSON Lines (JSONL) file (one JSON object per line) to facilitate file splitting and streaming.
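Because each line is a standalone JSON object, the archive can be read record by record without unpacking it to disk. The following is a minimal sketch, assuming the `.tar.gz` contains one or more JSONL files as regular members; the function name `iter_records` and the archive path passed to it are illustrative and not part of the dataset.

```python
import json
import tarfile


def iter_records(archive_path):
    """Yield one parsed JSON object per non-empty line of every file in the archive.

    Assumption: every regular file inside the archive is a JSONL file.
    """
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            handle = archive.extractfile(member)
            if handle is None:
                continue
            for raw_line in handle:  # lines are read lazily as bytes
                raw_line = raw_line.strip()
                if raw_line:
                    yield json.loads(raw_line)
```

Since the function is a generator, only one record is held in memory at a time, which is the property that the JSONL layout is meant to enable.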

Each JSON object has three main fields:

  • “_id”: a unique identifier,

  • “citing_paper”: the “dblp_id” of the citing paper,

  • “cited_papers”: array containing the objects that correspond to each reference found in the text of the “citing_paper”; each object may contain the following fields:

    • “dblp_id”: the “dblp_id” of the cited paper. Optional, but required when no “doi” is present.

    • “doi”: the DOI of the cited paper. Optional, but required when no “dblp_id” is present.

    • “bibliographic_reference”: the raw citation string as it appears in the citing paper.
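The record structure above can be flattened into citing/cited pairs with a short loop. This is a sketch under stated assumptions rather than an official reader: the description does not make clear whether “citing_paper” holds the “dblp_id” string directly or an object wrapping it, so the code accepts both; preferring the DOI over the “dblp_id” when both are present is an arbitrary choice; and the function name is hypothetical.

```python
import json


def iter_citation_pairs(lines):
    """Yield (citing_dblp_id, cited_identifier, raw_reference) tuples.

    `lines` is any iterable of JSONL text lines.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        citing = record["citing_paper"]
        # Assumption: "citing_paper" may be a plain string or {"dblp_id": ...}.
        if isinstance(citing, dict):
            citing = citing.get("dblp_id")
        for cited in record.get("cited_papers", []):
            # At least one of "doi" / "dblp_id" is expected to be present.
            identifier = cited.get("doi") or cited.get("dblp_id")
            yield citing, identifier, cited.get("bibliographic_reference")
```

The function can be fed an open file handle directly, for example `iter_citation_pairs(open(path, encoding="utf-8"))`, where `path` points to an extracted JSONL file.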

Changes from previous version:

  • Added more papers from DBLP.

2. Citation Intents Dataset - bip_ndr_ci_{version}.tar.gz

This file enriches the BIP! NDR dataset with citation-level intent classification.
It preserves the base structure of the previous file and adds a nested array of "citations" to each element of "cited_papers".

Each "citation" provides the local textual context, section, and intent of the citation in the following format:

  • "citation_id": Unique identifier in the format {citing_id}>{cited_id}_CIT{index} linking the citing and cited entities.
  • "section": The section of the citing paper where the citation occurs (e.g., Introduction, Methods, Results).
  • "intent": Inferred purpose of the citation based on textual context (see classification schema below).
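The "citation_id" format described above can be split back into its parts. The sketch below assumes that the citing identifier contains no ">" character and that the index after "_CIT" is a decimal integer; neither is stated in the description, and the identifiers in the usage line are invented for illustration.

```python
from typing import NamedTuple


class CitationId(NamedTuple):
    citing_id: str
    cited_id: str
    index: int


def parse_citation_id(citation_id: str) -> CitationId:
    """Split '{citing_id}>{cited_id}_CIT{index}' into its three components."""
    # Split on the first ">" (assumed absent from the citing identifier).
    citing_id, sep, rest = citation_id.partition(">")
    if not sep:
        raise ValueError(f"missing '>' separator: {citation_id!r}")
    # Split on the last "_CIT" so that a cited identifier containing it still parses.
    cited_id, sep, index = rest.rpartition("_CIT")
    if not sep or not index.isdigit():
        raise ValueError(f"missing or invalid '_CIT' suffix: {citation_id!r}")
    return CitationId(citing_id, cited_id, int(index))
```

For example, `parse_citation_id("conf/x/A20>conf/y/B19_CIT3")` returns `CitationId(citing_id="conf/x/A20", cited_id="conf/y/B19", index=3)`.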

The "intent" field follows the SciCite classification schema, which categorizes citations into three high-level functional types:

  1. background information: The citation states, mentions, or points to the background information giving more context about a problem, concept, approach, topic, or importance of the problem in the field.
  2. method: Making use of a method, tool, approach or dataset.
  3. results comparison: Comparison of the paper's results/findings with the results/findings of other work.

The classification is done with the Qwen2.5-14B-CIC-SciCite fine-tuned Large Language Model, published by ATHENA RC.
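A common first step with the intent annotations is to tally how often each label occurs. The following sketch assumes the nesting described above (record, then "cited_papers", then "citations", then "intent"); the function name is hypothetical, and the exact label strings stored in the files are not given here, so the code counts whatever values it encounters instead of hard-coding them.

```python
from collections import Counter


def count_intents(records):
    """Count the "intent" values over all citations in an iterable of parsed records."""
    counts = Counter()
    for record in records:
        for cited in record.get("cited_papers", []):
            # "citations" is the nested array added by the intents file.
            for citation in cited.get("citations", []):
                intent = citation.get("intent")
                if intent is not None:
                    counts[intent] += 1
    return counts
```

The returned `Counter` supports `most_common()`, which gives the labels ordered by frequency.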

Changes from previous version: 

  • Added more papers with citation intents.

Files

Files (402.3 MB)

Checksum Size
md5:27b1ff005caf1fcf8fa0eebd31e99ad4 382.5 MB
md5:9864ab826c44f091ec44c3db4063867a 19.8 MB
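A downloaded archive can be checked against the MD5 checksums listed above. This is a minimal sketch: the two digests are copied from the listing, while the helper names are hypothetical. The code only tests membership in the set of listed digests and does not assume which digest belongs to which file.

```python
import hashlib

# Digests copied from the file listing above.
LISTED_MD5 = {
    "27b1ff005caf1fcf8fa0eebd31e99ad4",
    "9864ab826c44f091ec44c3db4063867a",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_listed_checksum(path):
    """Return True if the file's MD5 digest matches one of the listed digests."""
    return md5_of_file(path) in LISTED_MD5
```

Reading in chunks keeps memory use constant, which matters for the larger of the two archives.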

Additional details

Related works

References
Conference paper: 10.1007/978-3-032-05409-8_13 (DOI)

Funding

European Commission
SciLake - Democratising and making sense out of heterogeneous scholarly content 101058573
European Commission
GraspOS - GraspOS: next Generation Research Assessment to Promote Open Science 101095129