BIP! DB: A Dataset of Impact Measures for Research Products
Authors/Creators
- 1. IMSI, ATHENA RC
- 2. CNR
Description
Overview
This dataset contains citation-based impact indicators (also referred to as measures) for ~321M distinct persistent identifiers (PIDs) that correspond to various types of research products (publications, datasets, software, and other products).
The calculated indicators are organized into categories based on the aspect of impact they capture.
Influence indicators
Reflect the "total" impact of a research product; how established it is in general.
- Citation Count: The total number of citations of the product, the most well-known influence indicator.
- PageRank score: An influence indicator based on the PageRank (Page et al., 1999), a popular network analysis method. PageRank estimates the influence of each product based on its centrality in the whole citation network. It alleviates some issues of the Citation Count indicator (e.g., two products with the same number of citations can have significantly different PageRank scores if the aggregated influence of the products citing them is very different - the product receiving citations from more influential products will get a larger score).
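To make the difference between the two influence indicators concrete, the following is a minimal sketch on a toy citation network. It uses networkx purely for illustration; the dataset itself is produced with the bip-ranker software listed further below, and the toy edges are assumptions.

```python
# Illustrative only: a toy citation network, not the actual BIP! pipeline
# (the dataset is produced with https://github.com/athenarc/bip-ranker).
import networkx as nx

# Directed edge u -> v means "u cites v".
citations = [("p2", "p1"), ("p3", "p1"), ("p3", "p2"), ("p4", "p3"), ("p5", "p3")]
g = nx.DiGraph(citations)

citation_count = {n: g.in_degree(n) for n in g.nodes}  # Citation Count
pagerank = nx.pagerank(g, alpha=0.85)                   # PageRank score

# p1 and p3 have the same citation count (2), but their PageRank scores differ
# because the products citing them have different influence.
print(citation_count)
print(pagerank)
```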
Popularity indicators
Capture the "current" impact of a research product; how popular it currently is.
- RAM score: A popularity indicator based on the RAM (Ghosh et al., 2011) method. It is essentially a Citation Count where recent citations are considered as more important. This type of "time awareness" alleviates problems of methods like PageRank, which are biased against recently published products (new products need time to receive a number of citations that can be indicative for their impact).
- AttRank score: A popularity indicator based on the AttRank (Kanellos et al., 2020) method. AttRank alleviates PageRank's bias against recently published products by incorporating an attention-based mechanism, akin to a time-restricted version of preferential attachment, to explicitly capture a researcher's preference to examine products which received a lot of attention recently.
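The time-weighting idea behind these popularity indicators can be illustrated with a small sketch. The exponential decay weight and the gamma value below are illustrative assumptions, not the exact RAM/AttRank configuration used to produce this dataset.

```python
# Minimal sketch of time-weighted citation counting (the idea behind RAM):
# recent citations contribute more than old ones. The exponential weight and
# gamma value are illustrative assumptions, not the dataset's settings.
CURRENT_YEAR = 2025
GAMMA = 0.5  # decay factor in (0, 1); assumed for illustration

def time_weighted_citations(citing_years, current_year=CURRENT_YEAR, gamma=GAMMA):
    """Each citation made in year t contributes gamma ** (current_year - t)."""
    return sum(gamma ** (current_year - t) for t in citing_years)

# A product with many recent citations outscores one with the same number of
# older citations.
print(time_weighted_citations([2024, 2024, 2023]))  # mostly recent
print(time_weighted_citations([2015, 2016, 2017]))  # same count, older
```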
Impulse indicators
Measure the initial momentum that a research product received right after its publication.
- Incubation Citation Count (3-year CC): This impulse indicator is a time-restricted version of the Citation Count, where the time window length is fixed for all products (3 years) and its start depends on each product's publication date, i.e., only citations received within the first 3 years after a product's publication are counted.
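A minimal sketch of this windowed counting is given below; the inclusive window boundary and the helper name are assumptions made for illustration.

```python
# Sketch of the Incubation Citation Count: count only citations received within
# the first 3 years after a product's publication. The inclusive boundary
# handling is an assumption, not a documented detail of the dataset.
def incubation_citation_count(publication_year, citing_years, window=3):
    return sum(1 for t in citing_years if publication_year <= t <= publication_year + window)

print(incubation_citation_count(2018, [2018, 2019, 2020, 2024]))  # -> 3
```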
Field-weighted indicators
Capture the impact of a research product relative to the average performance in its field, accounting for differences in citation practices across disciplines.
- Field-Weighted Citation Impact (FWCI): A field-weighted indicator that measures how a research product performs compared to the global average in its research field. An FWCI of 1.0 indicates that the product is cited exactly as expected for similar publications in the same field; values above 1.0 indicate above-average impact, while values below 1.0 indicate below-average impact.
- 3-year FWCI: A time-restricted version of the FWCI that considers citations received within the first three years after publication. By limiting the citation window, this indicator captures the early relative impact of a research product, providing insight into how quickly it gains influence in its field.
In our analysis, the expected number of citations for each research product is computed by grouping them by topic, publication year, and product type and then averaging the citations within each group.
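The sketch below illustrates this grouping-and-averaging step and the resulting FWCI ratio on made-up records; the field names and values are assumptions, not the dataset's internal schema.

```python
# Hedged sketch of the FWCI computation described above: the expected citation
# count is the average over products sharing the same topic, publication year,
# and product type, and FWCI is actual citations divided by that average.
# Field names and example records are illustrative assumptions.
from collections import defaultdict

products = [
    {"id": "doi:10.1234/a", "topic": "T1", "year": 2020, "type": "publication", "citations": 12},
    {"id": "doi:10.1234/b", "topic": "T1", "year": 2020, "type": "publication", "citations": 4},
    {"id": "doi:10.1234/c", "topic": "T1", "year": 2020, "type": "publication", "citations": 8},
]

# Average citations per (topic, year, type) group.
groups = defaultdict(list)
for p in products:
    groups[(p["topic"], p["year"], p["type"])].append(p["citations"])
expected = {key: sum(vals) / len(vals) for key, vals in groups.items()}

for p in products:
    avg = expected[(p["topic"], p["year"], p["type"])]
    # As noted in the topic-related files section, a zero group average would
    # leave the score undefined (empty in the files).
    fwci = p["citations"] / avg if avg > 0 else None
    print(p["id"], fwci)  # 1.5, 0.5, 1.0 for this toy group (average = 8)
```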
More details about the aforementioned impact indicators, the way they are calculated and their interpretation can be found here and in the respective references (Kanellos et al., 2019).
Indicator calculation levels
The impact indicators are calculated at two levels:
- PID level: assuming that each PID corresponds to a distinct research product. Currently PIDs are DOIs, PMCIDs, and PMIDs.
- OpenAIRE-id level: leveraging PID synonyms based on OpenAIRE's deduplication algorithm (Manghi et al., 2020) - each distinct article has its own OpenAIRE id.
Impact classes
Each research product is also assigned an impact class, reflecting its percentile rank among all products in the dataset:
| Class | Percentile | Description |
|---|---|---|
| C1 | Top 0.01% | Exceptional impact |
| C2 | Top 0.1% | Very high impact |
| C3 | Top 1% | High impact |
| C4 | Top 10% | Good impact |
| C5 | Rest 90% | Remaining products |
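As an illustration of how the thresholds above translate into class labels, here is a hedged sketch mapping a percentile rank to a class; the exact ranking and tie-breaking used to produce the dataset are not specified here, so treat this as indicative only.

```python
# Sketch of mapping a percentile rank to the impact classes in the table above.
# The thresholds mirror the table; the ranking procedure itself is an assumption.
def impact_class(percentile_rank):
    """percentile_rank: fraction of products with a lower score, in [0, 1]."""
    top = 1.0 - percentile_rank  # e.g. 0.0001 means the product is in the top 0.01%
    if top <= 0.0001:
        return "C1"
    if top <= 0.001:
        return "C2"
    if top <= 0.01:
        return "C3"
    if top <= 0.10:
        return "C4"
    return "C5"

print(impact_class(0.9995))  # top 0.05% -> C2
```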
File structure
For each calculation level (PID / OpenAIRE-id) we provide five (5) compressed CSV files (one for each measure/score provided). The structure of the files differs slightly depending on the level:
- PID-level files: Each line follows the format:
identifier <tab> identifier_type <tab> score <tab> class
- OpenAIRE-id-level files: These files contain the keyword "openaire_ids" in the filename. Each line follows the format:
identifier <tab> score <tab> class
The parameter setting of each measure is encoded in the corresponding filename. For more details on the different measures/scores see our extensive experimental study (Kanellos et al., 2019) and the configuration of AttRank in the original paper (Kanellos et al., 2020).
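A hedged sketch for reading one of these files is shown below; the placeholder filename and the gzip compression are assumptions (only the column layouts above are documented), so adjust it to the files you actually download.

```python
# Sketch for reading a compressed score file; 4 tab-separated columns for
# PID-level files, 3 for "openaire_ids" files. Filename and gzip compression
# are assumptions for illustration.
import csv
import gzip

path = "example_scores.txt.gz"  # placeholder; use an actual downloaded file

with gzip.open(path, mode="rt", encoding="utf-8", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    for row in reader:
        if len(row) == 4:                 # PID-level file
            identifier, id_type, score, cls = row
        else:                             # OpenAIRE-id-level file
            identifier, score, cls = row
        # ... use identifier, score, cls ...
```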
Topic-related files
In addition to the main indicator files, the dataset also includes topic-level outputs, providing field-weighted impact indicators as well as percentile classes computed within the associated topics from OpenAlex.
Specifically, we associated all research products with their topics from OpenAlex using their DOIs. Since only DOIs are currently used for this association, all identifiers in these files refer to DOIs.
- Topic-specific impact classes file: For each topic and indicator, percentile classes are computed and provided in topic_based_impact_classes.txt in the following format:
identifier <tab> topic <tab> pagerank_class <tab> attrank_class <tab> 3-cc_class <tab> cc_class
- Field-weighted indicator files: Each line follows the format:
identifier <tab> topic <tab> score
Note that to prevent division by zero, the score column is left empty whenever the average score for a specific combination of topic, publication year, and product type equals zero.
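A short sketch for consuming the field-weighted indicator files, including the empty-score convention just described, is given below; the filename is a placeholder.

```python
# Sketch for the field-weighted indicator files: three tab-separated columns,
# where the score may be empty (see the note above). Filename is a placeholder.
import csv

with open("fwci_topic_scores.txt", newline="", encoding="utf-8") as f:
    for identifier, topic, score in csv.reader(f, delimiter="\t"):
        value = float(score) if score else None  # empty score -> undefined FWCI
        print(identifier, topic, value)
```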
Data sources
The data used to produce the citation network on which we calculated the provided measures have been gathered from the OpenAIRE Graph v10.8.1, including data from (a) OpenCitations' COCI & POCI dataset, (b) MAG (Sinha et al., 2015; Wang et al., 2019), and (c) Crossref. The union of all distinct citations found in these sources has been considered.
Additionally, all topic-related computations are derived from OpenAlex topics.
Access and Use
Find our Academic Search Engine built on top of these data here. Further note that we also provide all calculated scores through BIP! Finder's API.
Terms: These data are provided "as is", without any warranties of any kind. The data are provided under the CC0 license.
Changelog
v20.1
- Topic-based indicators now use Topics (instead of Concepts) from OpenAlex.
v19.1
- [major update] Added field-weighted indicators: FWCI and 3-year FWCI.
v19.0
- Added PMCID as an additional type of PID.
v15.1
- Fixed missing records that were unintentionally omitted in v15.0
- Ensured all popularity indicators correctly use current_year = 2025
v12.0
- Added PMIDs as an additional type of PID.
v10.0
- [Major update] Introduced deduplication of research products using the latest OpenAIRE article deduplication algorithm. Each node in the citation network is now a deduplicated product having a distinct OpenAIRE id.
- Corrected overcounting of citations caused by multiple versions of the same product.
- PID-level scores are now derived from deduplicated OpenAIRE nodes.
- Added filtering rules (described here) to remove PIDs with problematic metadata from the dataset.
v9.0
- [Major update] Introduced topic-specific impact classes for PID-identified products based on OpenAlex 2nd-level concepts.
v7.0
- [Major update] Added impact class labels (C1-C5) for each product, indicating percentile-based impact levels.
- Classes reflect relative position within the global score distribution.
v5.1
- [Major update] Introduced dual-level score computation: PID level and OpenAIRE ID level.
Files (71.3 GB)
| File (MD5 checksum) | Size |
|---|---|
| md5:997c62c833bfb0a35062e1ee0b103f1a | 1.9 GB |
| md5:54d3ecd789c0f96a52f84bf0fe28812a | 5.3 GB |
| md5:0c83ffe30e91f1829194ca42d046c349 | 5.6 GB |
| md5:844acde2417d5fb47801a7914c283448 | 8.2 GB |
| md5:0bbdcde024618177c35177be4f10a261 | 2.6 GB |
| md5:80a57076a67e98bc092462a27b2965a7 | 5.9 GB |
| md5:c55541d2137454bbee8845bbcfed12bb | 1.9 GB |
| md5:55ba0dd6db030582ba1fbd198098c940 | 5.3 GB |
| md5:579adb4c8e4c55184a7aa662c6d088d8 | 5.9 GB |
| md5:8083b6dc5249b0599de91653fc234cc1 | 8.5 GB |
| md5:a3b4c81501c186166b7dd65c49f0a5e4 | 2.4 GB |
| md5:168bdbde4b4326b9f0de3088f11438b3 | 5.7 GB |
| md5:6df32b8a773d965d9a1ac4eb54642612 | 2.3 GB |
| md5:bc55ff09d57ca29b9731fedbac8d755c | 5.6 GB |
| md5:ecce4a91de56865dcc0d471ee6a24b73 | 4.0 GB |
Additional details
Related works
- Is described by
- Conference paper: 10.1145/3442442.3451369 (DOI)
Funding
- European Commission
- OpenAIRE Nexus - OpenAIRE-Nexus Scholarly Communication Services for EOSC users 101017452
- European Commission
- GraspOS - GraspOS: next Generation Research Assessment to Promote Open Science 101095129
- European Commission
- SciLake - Democratising and making sense out of heterogeneous scholarly content 101058573
Software
- Repository URL
- https://github.com/athenarc/bip-ranker
- Programming language
- Python
- Development Status
- Active
References
- L. Page, S. Brin, R. Motwani, and T. Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab.
- Rumi Ghosh, Tsung-Ting Kuo, Chun-Nan Hsu, Shou-De Lin, and Kristina Lerman. 2011. Time-Aware Ranking in Dynamic Citation Networks. In Data Mining Workshops (ICDMW). 373–380
- I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Ranking Papers by their Short-Term Scientific Impact. CoRR abs/2006.00951 (2020)
- I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Impact-Based Ranking of Scientific Publications: A Survey and Experimental Evaluation. TKDE 2019 (early access)
- Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. 2015. An Overview of Microsoft Academic Service (MAS) and Applications. In Proceedings of the 24th International Conference on World Wide Web (WWW '15 Companion). ACM, New York, NY, USA, 243-246. DOI=http://dx.doi.org/10.1145/2740908.2742839
- K. Wang et al., "A Review of Microsoft Academic Services for Science of Science Studies", Frontiers in Big Data, 2019, doi: 10.3389/fdata.2019.00045
- P. Manghi, C. Atzori, M. De Bonis, A. Bardi, Entity deduplication in big data graphs for scholarly communication, Data Technologies and Applications (2020).