BIP! DB: A Dataset of Impact Measures for Research Products
Authors/Creators
- 1. IMSI, ATHENA RC
- 2. CNR
Description
Overview
This dataset contains citation-based impact indicators (also referred to as measures) for ~321M distinct persistent identifiers (PIDs) that correspond to various types of research products (publications, datasets, software, and other products).
The calculated indicators are organized into categories based on the aspect of impact they capture.
Influence indicators
Reflect the "total" impact of a research product; how established it is in general.
- Citation Count: The total number of citations of the product, the most well-known influence indicator.
- PageRank score: An influence indicator based on the PageRank (Page et al., 1999), a popular network analysis method. PageRank estimates the influence of each product based on its centrality in the whole citation network. It alleviates some issues of the Citation Count indicator (e.g., two products with the same number of citations can have significantly different PageRank scores if the aggregated influence of the products citing them is very different - the product receiving citations from more influential products will get a larger score).
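To make the difference between the two influence indicators concrete, the following is a minimal sketch on a toy citation network. It uses networkx purely for illustration; the dataset itself is produced with the bip-ranker software listed further below, and the toy edges are assumptions.

```python
# Illustrative only: a toy citation network, not the actual BIP! pipeline
# (the dataset is produced with https://github.com/athenarc/bip-ranker).
import networkx as nx

# Directed edge u -> v means "u cites v".
citations = [("p2", "p1"), ("p3", "p1"), ("p3", "p2"), ("p4", "p3"), ("p5", "p3")]
g = nx.DiGraph(citations)

citation_count = {n: g.in_degree(n) for n in g.nodes}  # Citation Count
pagerank = nx.pagerank(g, alpha=0.85)                   # PageRank score

# p1 and p3 have the same citation count (2), but their PageRank scores differ
# because the products citing them have different influence.
print(citation_count)
print(pagerank)
```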
Popularity indicators
Capture the "current" impact of a research product; how popular it currently is.
- RAM score: A popularity indicator based on the RAM (Ghosh et al., 2011) method. It is essentially a Citation Count where recent citations are considered as more important. This type of "time awareness" alleviates problems of methods like PageRank, which are biased against recently published products (new products need time to receive a number of citations that can be indicative for their impact).
- AttRank score: A popularity indicator based on the AttRank (Kanellos et al., 2020) method. AttRank alleviates PageRank's bias against recently published products by incorporating an attention-based mechanism, akin to a time-restricted version of preferential attachment, to explicitly capture a researcher's preference to examine products which received a lot of attention recently.
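The time-weighting idea behind these popularity indicators can be illustrated with a small sketch. The exponential decay weight and the gamma value below are illustrative assumptions, not the exact RAM/AttRank configuration used to produce this dataset.

```python
# Minimal sketch of time-weighted citation counting (the idea behind RAM):
# recent citations contribute more than old ones. The exponential weight and
# gamma value are illustrative assumptions, not the dataset's settings.
CURRENT_YEAR = 2025
GAMMA = 0.5  # decay factor in (0, 1); assumed for illustration

def time_weighted_citations(citing_years, current_year=CURRENT_YEAR, gamma=GAMMA):
    """Each citation made in year t contributes gamma ** (current_year - t)."""
    return sum(gamma ** (current_year - t) for t in citing_years)

# A product with many recent citations outscores one with the same number of
# older citations.
print(time_weighted_citations([2024, 2024, 2023]))  # mostly recent
print(time_weighted_citations([2015, 2016, 2017]))  # same count, older
```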
Impulse indicators
Measure the initial momentum that a research product received right after its publication.
- Incubation Citation Count (3-year CC): This impulse indicator is a time-restricted version of the Citation Count, where the time window length is fixed for all products (3 years) and its start depends on each product's publication date, i.e., only citations received within the first 3 years after a product's publication are counted.
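A minimal sketch of this windowed counting is given below; the inclusive window boundary and the helper name are assumptions made for illustration.

```python
# Sketch of the Incubation Citation Count: count only citations received within
# the first 3 years after a product's publication. The inclusive boundary
# handling is an assumption, not a documented detail of the dataset.
def incubation_citation_count(publication_year, citing_years, window=3):
    return sum(1 for t in citing_years if publication_year <= t <= publication_year + window)

print(incubation_citation_count(2018, [2018, 2019, 2020, 2024]))  # -> 3
```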
Field-weighted indicators
Capture the impact of a research product relative to the average performance in its field, accounting for differences in citation practices across disciplines.
- Field-Weighted Citation Impact (FWCI): A field-weighted indicator that measures how a research product performs compared to the global average in its research field. An FWCI of 1.0 indicates that the product is cited exactly as expected for similar publications in the same field; values above 1.0 indicate above-average impact, while values below 1.0 indicate below-average impact.
- 3-year FWCI: A time-restricted version of the FWCI that considers citations received within the first three years after publication. By limiting the citation window, this indicator captures the early relative impact of a research product, providing insight into how quickly it gains influence in its field.
In our analysis, the expected number of citations for each research product is computed by grouping them by topic, publication year, and product type and then averaging the citations within each group.
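The sketch below illustrates this grouping-and-averaging step and the resulting FWCI ratio on made-up records; the field names and values are assumptions, not the dataset's internal schema.

```python
# Hedged sketch of the FWCI computation described above: the expected citation
# count is the average over products sharing the same topic, publication year,
# and product type, and FWCI is actual citations divided by that average.
# Field names and example records are illustrative assumptions.
from collections import defaultdict

products = [
    {"id": "doi:10.1234/a", "topic": "T1", "year": 2020, "type": "publication", "citations": 12},
    {"id": "doi:10.1234/b", "topic": "T1", "year": 2020, "type": "publication", "citations": 4},
    {"id": "doi:10.1234/c", "topic": "T1", "year": 2020, "type": "publication", "citations": 8},
]

# Average citations per (topic, year, type) group.
groups = defaultdict(list)
for p in products:
    groups[(p["topic"], p["year"], p["type"])].append(p["citations"])
expected = {key: sum(vals) / len(vals) for key, vals in groups.items()}

for p in products:
    avg = expected[(p["topic"], p["year"], p["type"])]
    # As noted in the topic-related files section, a zero group average would
    # leave the score undefined (empty in the files).
    fwci = p["citations"] / avg if avg > 0 else None
    print(p["id"], fwci)  # 1.5, 0.5, 1.0 for this toy group (average = 8)
```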
More details about the aforementioned impact indicators, the way they are calculated and their interpretation can be found here and in the respective references (Kanellos et al., 2019).
Indicator calculation levels
The impact indicators are calculated at two levels:
- PID level: assuming that each PID corresponds to a distinct research product. Currently PIDs are DOIs, PMCIDs, and PMIDs.
- OpenAIRE-id level: leveraging PID synonyms based on OpenAIRE's deduplication algorithm (Manghi et al., 2020) - each distinct article has its own OpenAIRE id.
Impact classes
Each research product is also assigned an impact class, reflecting its percentile rank among all products in the dataset:
| Class | Percentile | Description |
|---|---|---|
| C1 | Top 0.01% | Exceptional impact |
| C2 | Top 0.1% | Very high impact |
| C3 | Top 1% | High impact |
| C4 | Top 10% | Good impact |
| C5 | Rest 90% | Remaining products |
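As an illustration of how the thresholds above translate into class labels, here is a hedged sketch mapping a percentile rank to a class; the exact ranking and tie-breaking used to produce the dataset are not specified here, so treat this as indicative only.

```python
# Sketch of mapping a percentile rank to the impact classes in the table above.
# The thresholds mirror the table; the ranking procedure itself is an assumption.
def impact_class(percentile_rank):
    """percentile_rank: fraction of products with a lower score, in [0, 1]."""
    top = 1.0 - percentile_rank  # e.g. 0.0001 means the product is in the top 0.01%
    if top <= 0.0001:
        return "C1"
    if top <= 0.001:
        return "C2"
    if top <= 0.01:
        return "C3"
    if top <= 0.10:
        return "C4"
    return "C5"

print(impact_class(0.9995))  # top 0.05% -> C2
```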
File structure
For each calculation level (PID / OpenAIRE-id) we provide five (5) compressed CSV files (one for each measure/score provided). The structure of the files differs slightly depending on the level:
- PID-level files: Each line follows the format:
identifier <tab> identifier_type <tab> score <tab> class
- OpenAIRE-id-level files: These files contain the keyword "openaire_ids" in the filename. Each line follows the format:
identifier <tab> score <tab> class
The parameter setting of each measure is encoded in the corresponding filename. For more details on the different measures/scores see our extensive experimental study (Kanellos et al., 2019) and the configuration of AttRank in the original paper (Kanellos et al., 2020).
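A hedged sketch for reading one of these files is shown below; the placeholder filename and the gzip compression are assumptions (only the column layouts above are documented), so adjust it to the files you actually download.

```python
# Sketch for reading a compressed score file; 4 tab-separated columns for
# PID-level files, 3 for "openaire_ids" files. Filename and gzip compression
# are assumptions for illustration.
import csv
import gzip

path = "example_scores.txt.gz"  # placeholder; use an actual downloaded file

with gzip.open(path, mode="rt", encoding="utf-8", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    for row in reader:
        if len(row) == 4:                 # PID-level file
            identifier, id_type, score, cls = row
        else:                             # OpenAIRE-id-level file
            identifier, score, cls = row
        # ... use identifier, score, cls ...
```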
Topic-related files
In addition to the main indicator files, the dataset also includes topic-level outputs, providing field-weighted impact indicators as well as percentile classes computed within the associated topics from OpenAlex.
Specifically, we associated all research products with their topics from OpenAlex using their DOIs. Since only DOIs are currently used for this association, all identifiers in these files refer to DOIs.
- Topic-specific impact classes file: For each topic and indicator, percentile classes are computed and provided in topic_based_impact_classes.txt in the following format:
identifier <tab> topic <tab> pagerank_class <tab> attrank_class <tab> 3-cc_class <tab> cc_class
- Field-weighted indicator files: Each line follows the format:
identifier <tab> topic <tab> score
Note that to prevent division by zero, the score column is left empty whenever the average score for a specific combination of topic, publication year, and product type equals zero.
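A short sketch for consuming the field-weighted indicator files, including the empty-score convention just described, is given below; the filename is a placeholder.

```python
# Sketch for the field-weighted indicator files: three tab-separated columns,
# where the score may be empty (see the note above). Filename is a placeholder.
import csv

with open("fwci_topic_scores.txt", newline="", encoding="utf-8") as f:
    for identifier, topic, score in csv.reader(f, delimiter="\t"):
        value = float(score) if score else None  # empty score -> undefined FWCI
        print(identifier, topic, value)
```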
Data sources
The data used to produce the citation network on which we calculated the provided measures have been gathered from the OpenAIRE Graph v10.8.1, including data from (a) OpenCitations' COCI & POCI dataset, (b) MAG (Sinha et al., 2015; Wang et al., 2019), and (c) Crossref. The union of all distinct citations found in these sources has been considered.
Additionally, all topic-related computations are derived from OpenAlex topics.
Access and Use
Find our Academic Search Engine built on top of these data here. Further note that we also provide all calculated scores through BIP! Finder's API.
Terms: These data are provided "as is", without any warranties of any kind. The data are provided under the CC0 license.
Changelog
v20.1
- Topic-based indicators now use Topics (instead of Concepts) from OpenAlex.
v19.1
- [major update] Added field-weighted indicators: FWCI and 3-year FWCI.
v19.0
- Added PMCID as an additional type of PID.
v15.1
- Fixed missing records that were unintentionally omitted in v15.0
- Ensured all popularity indicators correctly use current_year = 2025
v12.0
- Added PMIDs as an additional type of PID.
v10.0
- [Major update] Introduced deduplication of research products using the latest OpenAIRE article deduplication algorithm. Each node in the citation network is now a deduplicated product having a distinct OpenAIRE id.
- Corrected overcounting of citations caused by multiple versions of the same product.
- PID-level scores are now derived from deduplicated OpenAIRE nodes.
- Added filtering rules (described here) to remove PIDs with problematic metadata from the dataset.
v9.0
- [Major update] Introduced topic-specific impact classes for PID-identified products based on OpenAlex 2nd-level concepts.
v7.0
- [Major update] Added impact class labels (C1-C5) for each product, indicating percentile-based impact levels.
- Classes reflect relative position within the global score distribution.
v5.1
- [Major update] Introduced dual-level score computation: PID level and OpenAIRE ID level.
Files (71.3 GB)
| File (MD5 checksum) | Size |
|---|---|
| md5:997c62c833bfb0a35062e1ee0b103f1a | 1.9 GB |
| md5:54d3ecd789c0f96a52f84bf0fe28812a | 5.3 GB |
| md5:0c83ffe30e91f1829194ca42d046c349 | 5.6 GB |
| md5:844acde2417d5fb47801a7914c283448 | 8.2 GB |
| md5:0bbdcde024618177c35177be4f10a261 | 2.6 GB |
| md5:80a57076a67e98bc092462a27b2965a7 | 5.9 GB |
| md5:c55541d2137454bbee8845bbcfed12bb | 1.9 GB |
| md5:55ba0dd6db030582ba1fbd198098c940 | 5.3 GB |
| md5:579adb4c8e4c55184a7aa662c6d088d8 | 5.9 GB |
| md5:8083b6dc5249b0599de91653fc234cc1 | 8.5 GB |
| md5:a3b4c81501c186166b7dd65c49f0a5e4 | 2.4 GB |
| md5:168bdbde4b4326b9f0de3088f11438b3 | 5.7 GB |
| md5:6df32b8a773d965d9a1ac4eb54642612 | 2.3 GB |
| md5:bc55ff09d57ca29b9731fedbac8d755c | 5.6 GB |
| md5:ecce4a91de56865dcc0d471ee6a24b73 | 4.0 GB |
Additional details
Related works
- Is described by
- Conference paper: 10.1145/3442442.3451369 (DOI)
Funding
- European Commission
- OpenAIRE Nexus - OpenAIRE-Nexus Scholarly Communication Services for EOSC users 101017452
- European Commission
- GraspOS - GraspOS: next Generation Research Assessment to Promote Open Science 101095129
- European Commission
- SciLake - Democratising and making sense out of heterogeneous scholarly content 101058573
Software
- Repository URL
- https://github.com/athenarc/bip-ranker
- Programming language
- Python
- Development Status
- Active
References
- L. Page, S. Brin, R. Motwani, and T. Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab.
- Rumi Ghosh, Tsung-Ting Kuo, Chun-Nan Hsu, Shou-De Lin, and Kristina Lerman. 2011. Time-Aware Ranking in Dynamic Citation Networks. In Data Mining Workshops (ICDMW). 373–380
- I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Ranking Papers by their Short-Term Scientific Impact. CoRR abs/2006.00951 (2020)
- I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Impact-Based Ranking of Scientific Publications: A Survey and Experimental Evaluation. TKDE 2019 (early access)
- Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. 2015. An Overview of Microsoft Academic Service (MAS) and Applications. In Proceedings of the 24th International Conference on World Wide Web (WWW '15 Companion). ACM, New York, NY, USA, 243-246. DOI=http://dx.doi.org/10.1145/2740908.2742839
- K. Wang et al., "A Review of Microsoft Academic Services for Science of Science Studies", Frontiers in Big Data, 2019, doi: 10.3389/fdata.2019.00045
- P. Manghi, C. Atzori, M. De Bonis, A. Bardi, Entity deduplication in big data graphs for scholarly communication, Data Technologies and Applications (2020).