BIP! NDR (NoDoiRefs): a dataset of citations from papers without DOIs in computer science conferences and workshops
- 1. Univ. of the Peloponnese & ATHENA RC
- 2. ATHENA RC
- 3. Univ. of the Peloponnese
Description
Overview
In Computer Science, conference and workshop papers are important contributions and carry more weight in research assessment than they do in other disciplines. However, many of these papers are not assigned a Digital Object Identifier (DOI), so their citations are not reported in widely used citation datasets such as OpenCitations and Crossref, which limits citation analysis. The Microsoft Academic Graph (MAG) previously addressed this issue by providing substantial coverage, but its discontinuation has left a gap in the available data.
BIP! NDR aims to fill this gap and to support research assessment in Computer Science. Its workflow identifies and retrieves open access papers that lack DOIs from the DBLP Corpus and, through text analysis, extracts citation information directly from their full text.
The current version of the dataset contains approximately 4.3M citations made by approximately 211K open access Computer Science conference or workshop papers that, according to DBLP, do not have a DOI. The DBLP snapshot used for this version is the one released in September 2025.
Dataset files
1. Core Non-DOI Citation Dataset - bip_ndr_{version}.tar.gz
The dataset is formatted as a JSON Lines (JSONL) file (one JSON Object per line) to facilitate file splitting and streaming.
Each JSON object has three main fields:
- “_id”: a unique identifier,
- “citing_paper”: the “dblp_id” of the citing paper,
- “cited_papers”: an array containing the objects that correspond to each reference found in the text of the “citing_paper”; each object may contain the following fields:
  - “dblp_id”: the “dblp_id” of the cited paper. Optional, but required if a “doi” is not present.
  - “doi”: the DOI of the cited paper. Optional, but required if a “dblp_id” is not present.
  - “bibliographic_reference”: the raw citation string as it appears in the citing paper.
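The record layout above can be read line by line. The following is a minimal sketch, not part of the dataset tooling: the sample record and all identifiers in it are made-up placeholders, and the helper names `iter_records` and `has_identifier` are hypothetical. It parses JSONL lines, checks that each cited paper carries a “dblp_id” or a “doi”, and builds (citing, cited) pairs.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def has_identifier(cited: dict) -> bool:
    """A cited paper should carry a "dblp_id", a "doi", or both."""
    return bool(cited.get("dblp_id") or cited.get("doi"))


# Made-up sample line; every identifier here is a placeholder, not real data.
sample_lines = [
    json.dumps({
        "_id": "example-0001",
        "citing_paper": "conf/example/Citing24",
        "cited_papers": [
            {"dblp_id": "conf/example/Cited20",
             "bibliographic_reference": "A. Author. An example title. 2020."},
            {"doi": "10.0000/example.doi",
             "bibliographic_reference": "B. Author. Another example. 2019."},
        ],
    })
]

records = list(iter_records(sample_lines))

# Build (citing, cited) pairs, preferring the DBLP id when both are present.
edges = [
    (rec["citing_paper"], cited.get("dblp_id") or cited.get("doi"))
    for rec in records
    for cited in rec["cited_papers"]
    if has_identifier(cited)
]
print(edges)
```

With the real data, `iter_records` could be given an open file handle for the JSONL file extracted from the archive, so that records are streamed rather than loaded at once.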
Changes from previous version:
- Added more papers from DBLP.
2. Citation Intents Dataset - bip_ndr_ci_{version}.tar.gz
This file enriches the BIP! NDR dataset with citation-level intent classification.
It preserves the same base structure as the previous file and adds a nested array of "citations" to each element of "cited_papers".
Each "citation" provides the local textual context, section, and intent of the citation in the following format:
- "citation_id": Unique identifier in the format {citing_id}>{cited_id}_CIT{index} linking the citing and cited entities.
- "section": The section of the citing paper where the citation occurs (e.g., Introduction, Methods, Results).
- "intent": Inferred purpose of the citation based on textual context (see classification schema below).
The "intent" field follows the SciCite classification schema, which categorizes citations into three high-level functional types:
- background information: The citation states, mentions, or points to the background information giving more context about a problem, concept, approach, topic, or importance of the problem in the field.
- method: Making use of a method, tool, approach or dataset.
- results comparison: Comparison of the paper's results/findings with the results/findings of other work.
The classification is done with the Qwen2.5-14B-CIC-SciCite fine-tuned Large Language Model, published by Athena RC.
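As an illustration of the "citations" structure, the sketch below splits a "citation_id" into its parts and tallies intents for one record. It rests on assumptions: that the citing identifier contains no ">" character, that the exact intent label strings match the names listed above, and that the sample record (including its section names and index values) is a made-up placeholder. The function names are hypothetical.

```python
from collections import Counter
from typing import Tuple


def parse_citation_id(citation_id: str) -> Tuple[str, str, int]:
    """Split "{citing_id}>{cited_id}_CIT{index}" into its three parts."""
    # Assumption: the citing id itself contains no ">" character.
    citing_id, sep, rest = citation_id.partition(">")
    cited_id, marker, index = rest.rpartition("_CIT")
    if not sep or not marker or not index.isdigit():
        raise ValueError(f"unexpected citation_id format: {citation_id!r}")
    return citing_id, cited_id, int(index)


def count_intents(record: dict) -> Counter:
    """Count the "intent" values over all citations of one record."""
    counts = Counter()
    for cited in record.get("cited_papers", []):
        for citation in cited.get("citations", []):
            counts[citation.get("intent", "unknown")] += 1
    return counts


# Made-up record; identifiers, sections, and labels are placeholders.
sample_record = {
    "_id": "example-0001",
    "citing_paper": "conf/example/Citing24",
    "cited_papers": [
        {
            "dblp_id": "conf/example/Cited20",
            "citations": [
                {"citation_id": "conf/example/Citing24>conf/example/Cited20_CIT1",
                 "section": "Introduction",
                 "intent": "background information"},
                {"citation_id": "conf/example/Citing24>conf/example/Cited20_CIT2",
                 "section": "Methods",
                 "intent": "method"},
            ],
        }
    ],
}

first_id = sample_record["cited_papers"][0]["citations"][0]["citation_id"]
parsed = parse_citation_id(first_id)
intent_counts = count_intents(sample_record)
print(parsed, dict(intent_counts))
```

Splitting on the first ">" and the last "_CIT" keeps the parse stable even if a cited identifier such as a DOI happens to contain either sequence.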
Changes from previous version:
- Added more papers with citation intents.
Files
Total size: 402.3 MB

| Name (MD5 checksum) | Size |
|---|---|
| md5:27b1ff005caf1fcf8fa0eebd31e99ad4 | 382.5 MB |
| md5:9864ab826c44f091ec44c3db4063867a | 19.8 MB |
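The MD5 checksums listed above can be used to verify a download. The sketch below is one way to do this with Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and the demonstration runs on a small temporary file rather than on the actual archives.

```python
import hashlib
import os
import tempfile


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare against a value given as "md5:<hex>" or plain "<hex>"."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Demonstration on a throwaway file whose MD5 is well known.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_path = tmp.name
demo_ok = matches_checksum(demo_path, "md5:5d41402abc4b2a76b9719d911017c592")
os.remove(demo_path)
print(demo_ok)
```

For a downloaded archive, the same call would take the local path of the file and the corresponding "md5:..." value from the table.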
Additional details
Related works
- References
  - Conference paper: 10.1007/978-3-032-05409-8_13 (DOI)