Published March 12, 2024 | Version 13

BIP! DB: A Dataset of Impact Measures for Research Products

Description

This dataset contains citation-based impact indicators (a.k.a. "measures") for ~187.8M distinct PIDs (persistent identifiers) that correspond to research products (scientific publications, datasets, etc.). In particular, for each PID, we have calculated the following indicators (organized in categories based on the semantics of the impact aspect that they best capture):

Influence indicators (i.e., indicators of the "total" impact of each research product; how established it is in general)

Citation Count: The total number of citations of the product, the most well-known influence indicator.

PageRank score: An influence indicator based on PageRank [1], a popular network analysis method. PageRank estimates the influence of each product based on its centrality in the whole citation network. It alleviates some issues of the Citation Count indicator (e.g., two products with the same number of citations can have significantly different PageRank scores if the aggregated influence of the products citing them is very different: the product receiving citations from more influential products will get a larger score).
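As a rough illustration of this difference, the following minimal sketch (not the BIP! DB pipeline) computes both indicators on a toy citation network; the graph, damping factor, and use of the networkx library are assumptions for illustration only:

```python
# Illustrative toy example (not the BIP! DB pipeline): compare plain citation
# counts with PageRank on a small citation network. Edge (a, b) means "a cites b".
import networkx as nx

edges = [
    ("p1", "p3"), ("p2", "p3"),                # p3 receives 2 citations from uncited products
    ("p4", "p5"), ("p6", "p5"),                # p5 also receives 2 citations,
    ("p7", "p4"), ("p8", "p4"), ("p9", "p4"),  # but one of its citers (p4) is itself well cited
]
g = nx.DiGraph(edges)

citation_count = {n: g.in_degree(n) for n in g.nodes}
pagerank = nx.pagerank(g, alpha=0.85)  # damping factor is an arbitrary, common choice

for n in sorted(g.nodes):
    print(f"{n}: citations={citation_count[n]}, pagerank={pagerank[n]:.3f}")
```

Here p3 and p5 have the same Citation Count, but p5 obtains a higher PageRank score because one of its citing products is itself heavily cited.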

Popularity indicators (i.e., indicators of the "current" impact of each research product; how popular the product is currently)

RAM score: A popularity indicator based on the RAM [2] method. It is essentially a Citation Count where recent citations are considered more important. This type of "time awareness" alleviates problems of methods like PageRank, which are biased against recently published products (new products need time to receive a number of citations that can be indicative of their impact).
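Below is a minimal sketch of this general idea (exponentially down-weighting older citations); the decay factor, yearly granularity, and toy data are illustrative assumptions, not the exact RAM configuration used for this dataset:

```python
# Illustrative sketch of time-aware citation weighting in the spirit of RAM:
# recent citations contribute more than older ones. The decay factor, yearly
# granularity, and toy data are assumptions for illustration only.
CURRENT_YEAR = 2024
GAMMA = 0.5  # hypothetical decay factor in (0, 1)

citation_years = [2015, 2016, 2022, 2023, 2023]  # years of the citing products

plain_citation_count = len(citation_years)
time_weighted_score = sum(GAMMA ** (CURRENT_YEAR - year) for year in citation_years)

print(f"citation count:      {plain_citation_count}")
print(f"time-weighted score: {time_weighted_score:.3f}")  # dominated by the recent citations
```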

AttRank score: A popularity indicator based on the AttRank [3] method. AttRank alleviates PageRank's bias against recently published products by incorporating an attention-based mechanism, akin to a time-restricted version of preferential attachment, to explicitly capture a researcher's preference to examine products which received a lot of attention recently.
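The sketch below conveys the general idea only and is not the exact AttRank formulation: it biases the teleportation (personalization) vector of a PageRank-style computation toward recently cited products; the toy graph, parameter values, and networkx usage are assumptions for illustration:

```python
# Rough illustration of the general idea only (NOT the exact AttRank method):
# bias the teleportation (personalization) vector of a PageRank-style
# computation toward products that attracted citations recently.
# The toy graph, years, and parameter values are made up for illustration.
import networkx as nx

# Tuple (src, dst, year) means "src (published in year) cites dst".
citations = [("p1", "p2", 2015), ("p3", "p2", 2016), ("p4", "p5", 2023), ("p6", "p5", 2024)]
g = nx.DiGraph([(src, dst) for src, dst, _ in citations])

# "Attention": citations received during the last few years (threshold is arbitrary).
recently_cited = [dst for _, dst, year in citations if year >= 2022]
attention = {n: recently_cited.count(n) + 1e-9 for n in g.nodes}  # small floor avoids an all-zero vector

plain = nx.pagerank(g, alpha=0.5)
attention_biased = nx.pagerank(g, alpha=0.5, personalization=attention)

print({n: round(s, 3) for n, s in plain.items()})
print({n: round(s, 3) for n, s in attention_biased.items()})  # p5 now outranks p2
```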

Impulse indicators (i.e., indicators of the initial momentum that the research product received right after its publication)

Incubation Citation Count (3-year CC): This impulse indicator is a time-restricted version of the Citation Count: the length of the time window is fixed (3 years) for all products, while its start depends on each product's publication date, i.e., only citations received within the first 3 years after a product's publication are counted.
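For illustration, a minimal sketch of this windowed counting follows; the toy publication and citation years, and the exact boundary handling, are assumptions:

```python
# Illustrative sketch: count only citations received within the first 3 years
# after a product's publication. The boundary handling and the toy
# publication/citation years are assumptions for illustration.
INCUBATION_YEARS = 3

publication_year = 2015
citation_years = [2015, 2016, 2017, 2020, 2023]

three_year_cc = sum(1 for y in citation_years if y - publication_year <= INCUBATION_YEARS)
print(three_year_cc)  # 3: the 2020 and 2023 citations fall outside the window
```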

More details about the aforementioned impact indicators, the way they are calculated and their interpretation can be found here and in the respective references (e.g., in [5]).

From version 5.1 onward, the impact indicators are calculated at two levels:

  • The PID level (assuming that each PID corresponds to a distinct research product).
  • The OpenAIRE-id level (leveraging PID synonyms based on OpenAIRE's deduplication algorithm [4] - each distinct article has its own OpenAIRE id).

Previous versions of the dataset only provided the scores at the PID level.

From version 12 onward, two types of PIDs are included in the dataset: DOIs and PMIDs (before that version, only DOIs were included). 

Also, from version 7 onward, for each product in our files we also offer an impact class, which informs the user about the percentile into which the product's score falls compared to the impact scores of the rest of the products in the database. The impact classes are: C1 (in top 0.01%), C2 (in top 0.1%), C3 (in top 1%), C4 (in top 10%), and C5 (in bottom 90%).
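For illustration, the following minimal sketch shows how such classes could be derived from a product's rank among all scored products; the boundary and tie handling are assumptions and not necessarily those used in BIP! DB:

```python
# Illustrative sketch (not the BIP! DB code): derive an impact class from a
# product's rank among all scored products. Boundary and tie handling here
# are assumptions for illustration.
def impact_class(rank: int, total: int) -> str:
    """rank = 1 for the highest-scoring product."""
    top_fraction = rank / total
    if top_fraction <= 0.0001:
        return "C1"  # top 0.01%
    if top_fraction <= 0.001:
        return "C2"  # top 0.1%
    if top_fraction <= 0.01:
        return "C3"  # top 1%
    if top_fraction <= 0.10:
        return "C4"  # top 10%
    return "C5"      # bottom 90%

total_products = 1_000_000  # hypothetical database size
for rank in (50, 500, 5_000, 50_000, 500_000):
    print(rank, impact_class(rank, total_products))  # C1, C2, C3, C4, C5
```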

Finally, before version 10, the calculation of the impact scores (and classes) was based on a citation network having one node for each product with a distinct PID that we could find in our input data sources. However, from version 10 onward, the nodes are deduplicated using the most recent version of the OpenAIRE article deduplication algorithm. This enables a correction of the scores; more specifically, we avoid counting citation links multiple times when they are made by multiple versions of the same product. As a result, each node in the citation network we build is a deduplicated product with a distinct OpenAIRE id. We still report the scores at the PID level (i.e., we assign a score to each of the versions/instances of a product); however, these PID-level scores are simply the scores of the respective deduplicated nodes propagated accordingly (i.e., all versions of the same deduplicated product receive the same scores). We have removed a small number of instances (having a PID) that were erroneously assigned to multiple deduplicated records in the OpenAIRE Graph.
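The following minimal sketch illustrates this propagation step using made-up identifiers; it is not the actual OpenAIRE/BIP! DB code:

```python
# Illustrative sketch (not the actual pipeline): scores are computed per
# deduplicated product (OpenAIRE id) and propagated to all of its PIDs, so
# every version/instance of the same product gets the same score.
# The identifiers and synonym mapping below are made up for illustration.
dedup_scores = {"openaire::product1": 12.4, "openaire::product2": 3.0}
pid_synonyms = {
    "openaire::product1": ["doi:10.1234/a", "doi:10.1234/a.v2", "pmid:111111"],
    "openaire::product2": ["doi:10.5678/b"],
}

pid_scores = {
    pid: score
    for openaire_id, score in dedup_scores.items()
    for pid in pid_synonyms.get(openaire_id, [])
}
print(pid_scores)  # every synonym PID carries its deduplicated product's score
```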

For each calculation level (PID / OpenAIRE-id) we provide five (5) compressed CSV files (one per measure/score), where each line follows the format "identifier <tab> score <tab> class". The parameter setting of each measure is encoded in the corresponding filename. For more details on the different measures/scores, see our extensive experimental study [5] and the configuration of AttRank in the original paper [3]. Files for the OpenAIRE-ids case contain the keyword "openaire_ids" in the filename.
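As a reading aid, the sketch below loads one such file, assuming gzip compression and a hypothetical filename; adjust both to match the actual files in this record:

```python
# Minimal sketch for loading one of the score files, assuming gzip-compressed
# tab-separated lines of the form "identifier<TAB>score<TAB>class".
# The filename below is hypothetical; substitute an actual file from this record.
import gzip

scores = {}
with gzip.open("pagerank_scores.txt.gz", "rt", encoding="utf-8") as fh:
    for line in fh:
        identifier, score, impact_class = line.rstrip("\n").split("\t")
        scores[identifier] = (float(score), impact_class)

print(len(scores), "products loaded")
```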

From version 9 onward, we also provide topic-specific impact classes for PID-identified products. In particular, we associated those products with 2nd-level concepts from OpenAlex; we chose to keep only the three most dominant concepts for each product, based on their confidence score, and only if this score was greater than 0.3. Then, for each product and impact measure, we computed its class within its respective concepts. Finally, we provide the "topic_based_impact_classes.txt" file, where each line follows the format "identifier <tab> concept <tab> pagerank_class <tab> attrank_class <tab> 3-cc_class <tab> cc_class".
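A similar sketch for the topic-based file, assuming it is read as plain tab-separated text (the field names used for grouping are only illustrative):

```python
# Sketch for parsing "topic_based_impact_classes.txt"; each line is expected to be
# "identifier<TAB>concept<TAB>pagerank_class<TAB>attrank_class<TAB>3-cc_class<TAB>cc_class".
# A product can appear once per associated concept (up to three concepts).
import csv
from collections import defaultdict

classes_by_pid = defaultdict(list)
with open("topic_based_impact_classes.txt", newline="", encoding="utf-8") as fh:
    for identifier, concept, pr_cls, attrank_cls, cc3_cls, cc_cls in csv.reader(fh, delimiter="\t"):
        classes_by_pid[identifier].append(
            {"concept": concept, "pagerank": pr_cls, "attrank": attrank_cls,
             "3-year cc": cc3_cls, "cc": cc_cls}
        )

print(len(classes_by_pid), "products with topic-based classes")
```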

The data used to produce the citation network on which we calculated the provided measures have been gathered from the OpenAIRE Graph v7.1.0, including data from (a) OpenCitations' COCI & POCI datasets, (b) MAG [6,7], and (c) Crossref. The union of all distinct citations that could be found in these sources has been considered. In addition, versions later than v.10 leverage the filtering rules described here to remove from the dataset PIDs with problematic metadata.

References:

[1] L. Page, S. Brin, R. Motwani, and T. Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab.

[2] Rumi Ghosh, Tsung-Ting Kuo, Chun-Nan Hsu, Shou-De Lin, and Kristina Lerman. 2011. Time-Aware Ranking in Dynamic Citation Networks. In Data Mining Workshops (ICDMW). 373–380

[3] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Ranking Papers by their Short-Term Scientific Impact. CoRR abs/2006.00951 (2020)

[4]  P. Manghi, C. Atzori, M. De Bonis, A. Bardi, Entity deduplication in big data graphs for scholarly communication, Data Technologies and Applications (2020).

[5] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Impact-Based Ranking of Scientific Publications: A Survey and Experimental Evaluation. TKDE 2019 (early access)

[6] Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. 2015. An Overview of Microsoft Academic Service (MAS) and Applications. In Proceedings of the 24th International Conference on World Wide Web (WWW '15 Companion). ACM, New York, NY, USA, 243-246. DOI=http://dx.doi.org/10.1145/2740908.2742839

[7] K. Wang et al., "A Review of Microsoft Academic Services for Science of Science Studies", Frontiers in Big Data, 2019, doi: 10.3389/fdata.2019.00045    

Find our Academic Search Engine built on top of these data here. Further note that we also provide all calculated scores through BIP! Finder's API.

Terms of use: These data are provided "as is", without any warranties of any kind. The data are provided under the CC0 license.

More details about BIP! DB can be found in our relevant peer-reviewed publication:

Thanasis Vergoulis, Ilias Kanellos, Claudio Atzori, Andrea Mannocci, Serafeim Chatzopoulos, Sandro La Bruzzo, Natalia Manola, Paolo Manghi: BIP! DB: A Dataset of Impact Measures for Scientific Publications. WWW (Companion Volume) 2021: 456-460

We kindly request that any published research that makes use of BIP! DB cite the above article.

Files

Files (38.0 GB)

  • 1.1 GB (md5:409753482584dc1b585ad59ebf83d7d5)
  • 5.5 GB (md5:73b02fa1e6697299aadcc849340b41a1)
  • 1.6 GB (md5:b246f7290266c7c618e2ab4616ee583a)
  • 6.2 GB (md5:c8389be098b831731b6ed2a3f1a4bfdc)
  • 1.2 GB (md5:b67be52b84c365134e6ff18cd4e9de6c)
  • 5.5 GB (md5:39d336e577d2cd38257b6e104b8ab2c3)
  • 1.5 GB (md5:e731acd83747068230badf429599184c)
  • 6.0 GB (md5:4e09a5245c32792d8628908ab95ccebf)
  • 1.4 GB (md5:29f0c629e8dfbf3d377308626c22de72)
  • 5.8 GB (md5:6f056ac5a6470c0cbfb80acdf026b832)
  • 2.2 GB (md5:cc12de0604b8e68912bc564227c07627)

Additional details

Funding

OpenAIRE Nexus – OpenAIRE-Nexus Scholarly Communication Services for EOSC users 101017452
European Commission
GraspOS – GraspOS: next Generation Research Assessment to Promote Open Science 101095129
European Commission
SciLake – Democratising and making sense out of heterogeneous scholarly content 101058573
European Commission
