There is a newer version of the record available.

Published February 22, 2023 | Version 9
Dataset Open

BIP! DB: A Dataset of Impact Measures for Scientific Publications

Description

This dataset contains impact measures (metrics/indicators) for ~136M distinct DOIs that correspond to scientific articles. In particular, for each article we have calculated the following measures:

  • Citation count: The total number of citations, reflecting the "influence" (i.e., the total impact) of an article.

  • Incubation Citation Count (3-year CC): This is a time-restricted version of the citation count in which the time window has the same fixed length for all papers and starts at each paper's publication date, i.e., only citations received within the first 3 years after a paper’s publication are counted. This measure can be seen as an indicator of a paper's "impulse", i.e., its initial momentum directly after its publication.

  • PageRank score: This is a citation-based measure reflecting the "influence" (i.e., the total impact) of an article. It is based on the PageRank [1] network analysis method. In the context of citation networks, PageRank estimates the importance of each article based on its centrality in the whole network.

  • RAM score: This is a citation-based measure reflecting the "popularity" (i.e., the current impact) of an article. It is based on the RAM [2] method and is essentially a citation count in which recent citations are considered more important. This type of “time awareness” alleviates problems of methods like PageRank, which are biased against recently published articles (new articles need time to receive a “sufficient” number of citations). Hence, RAM is more suitable for capturing the current “hype” of an article.

  • AttRank score: This is a citation network analysis-based measure reflecting the "popularity" (i.e., the current impact) of an article. It is based on the AttRank [3] method. AttRank alleviates PageRank’s bias against recently published papers by incorporating an attention-based mechanism, akin to a time-restricted version of preferential attachment, to explicitly capture a researcher’s preference to read papers that have recently received a lot of attention. This is why it is more suitable for capturing the current “hype” of an article.
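The two count-based measures in the list above can be illustrated with a minimal Python sketch. The toy papers, publication years, and citation pairs are hypothetical, and the sketch assumes year-level granularity with an inclusive 3-year window; the dataset's actual computation may treat dates differently.

```python
from collections import Counter

# Hypothetical toy data: publication year per paper and (citing, cited) pairs.
pub_year = {"A": 2010, "B": 2011, "C": 2012, "D": 2015, "E": 2016}
citations = [("B", "A"), ("C", "A"), ("D", "A"), ("E", "A"), ("D", "C"), ("E", "C")]


def citation_count(citations):
    """Total number of citations received by each cited paper."""
    return Counter(cited for _, cited in citations)


def incubation_citation_count(citations, pub_year, window=3):
    """Citations received within `window` years of the cited paper's publication.

    Assumption: years are compared as integers and the window is inclusive.
    """
    counts = Counter()
    for citing, cited in citations:
        if 0 <= pub_year[citing] - pub_year[cited] <= window:
            counts[cited] += 1
    return counts


cc = citation_count(citations)
cc3 = incubation_citation_count(citations, pub_year)
print(dict(cc))   # citation count per paper
print(dict(cc3))  # 3-year citation count per paper
```

In this toy example, paper A has four citations in total, but only two of them arrive within three years of its publication, so its 3-year CC is lower than its citation count.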

More details about the aforementioned impact measures, the way they are calculated and their interpretation can be found here.
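As a rough illustration of how a PageRank-style score behaves on a citation network, the following Python sketch runs a plain power iteration on a toy graph. It is not the authors' implementation: the damping factor, the iteration count, the uniform handling of papers without references, and the toy edges are all assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank over (citing, cited) pairs.

    Assumptions: uniform teleportation, and the score of papers that cite
    nothing (dangling nodes) is redistributed uniformly over all papers.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for citing, cited in edges:
        out_links[citing].append(cited)

    n = len(nodes)
    score = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(score[node] for node in nodes if not out_links[node])
        new_score = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for node in nodes:
            for target in out_links[node]:
                new_score[target] += damping * score[node] / len(out_links[node])
        score = new_score
    return score


# Hypothetical toy citation graph: B and C cite A, C also cites B, D cites C.
toy_edges = [("B", "A"), ("C", "A"), ("C", "B"), ("D", "C")]
toy_scores = pagerank(toy_edges)
for paper, value in sorted(toy_scores.items(), key=lambda item: -item[1]):
    print(paper, round(value, 4))
```

The scores sum to 1, and the most central paper in the toy graph (A) receives the highest score.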

For version 5.1 onward, the impact measures are calculated at two levels:

  • The DOI level (assuming that each DOI corresponds to a distinct scientific article).
  • The OpenAIRE-id level (leveraging DOI synonyms based on OpenAIRE's deduplication algorithm [4]; each distinct article has its own OpenAIRE id).

Previous versions of the dataset only provided the scores at the DOI level.

Also, for version 7 onward, we offer an impact class for each article in our files. The class informs the user about the percentile range into which the article's score falls, compared to the impact scores of the rest of the articles in the database. The impact classes are: C1 (in top 0.01%), C2 (in top 0.1%), C3 (in top 1%), C4 (in top 10%), and C5 (in bottom 90%).
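The class boundaries above can be sketched in Python as a rank-based assignment. This is only one plausible reading of the description: the function name is hypothetical, ties are broken arbitrarily by sort order, and the dataset's own procedure for ties and boundaries may differ.

```python
def impact_classes(scores):
    """Map each identifier to C1..C5 based on its rank among all scores.

    Assumption: a paper at rank r (1 = highest score) out of n papers is in
    the top 1/d of the database when r * d <= n.
    """
    # (denominator, label): top 0.01%, top 0.1%, top 1%, top 10%.
    bounds = [(10000, "C1"), (1000, "C2"), (100, "C3"), (10, "C4")]
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    classes = {}
    for rank, identifier in enumerate(ranked, start=1):
        label = "C5"
        for denominator, name in bounds:
            if rank * denominator <= n:
                label = name
                break
        classes[identifier] = label
    return classes


# Hypothetical example: 10,000 papers with distinct scores.
example_scores = {f"id{i}": float(10000 - i) for i in range(10000)}
example_classes = impact_classes(example_scores)
print(example_classes["id0"], example_classes["id5000"])
```

Integer comparisons are used instead of floating-point percentages so that papers exactly on a boundary are classified consistently.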

For each calculation level (DOI / OpenAIRE-id) we provide five (5) compressed CSV files (one for each measure/score provided) where each line follows the format “identifier <tab> score <tab> class”. The parameter setting of each measure is encoded in the corresponding filename. For more details on the different measures/scores see our extensive experimental study [5] and the configuration of AttRank in the original paper [3]. Files for the OpenAIRE-ids case contain the keyword "openaire_ids" in the filename.
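A file in the “identifier <tab> score <tab> class” format could be read along the following lines. This is a sketch under assumptions: the description only says the files are compressed, so gzip is assumed here; the absence of a header row is also assumed, and the function name is hypothetical.

```python
import csv
import gzip


def read_scores(path):
    """Yield (identifier, score, impact_class) from a gzip-compressed TSV file.

    Assumptions: gzip compression, UTF-8 text, no header row.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps any quote characters inside identifiers untouched.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) != 3:
                continue  # skip blank or malformed lines
            identifier, score, impact_class = row
            yield identifier, float(score), impact_class
```

Because the function is a generator, a multi-gigabyte file can be processed line by line without loading it into memory, for example `for doi, score, cls in read_scores(path): ...`.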

From version 9 onward, we also provide topic-specific impact classes for DOI-identified publications. In particular, we associated those articles with 2nd level concepts from OpenAlex (284 in total); we chose to keep only the three most dominant concepts for each publication, based on their confidence score, and only if this score was greater than 0.3. Then, for each publication and impact measure, we compute its class within its respective concepts. Finally, we provide the "topic_based_impact_classes.txt" file, where each line follows the format “identifier <tab> concept <tab> pagerank_class <tab> attrank_class <tab> 3-cc_class <tab> cc_class”.
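The concept-selection rule and the line format described above can be sketched in Python as follows. The function names, the field name `three_cc_class`, and the example concepts and confidence values are hypothetical; only the thresholds (three concepts, confidence greater than 0.3) and the column order come from the description.

```python
def dominant_concepts(concept_scores, max_concepts=3, min_confidence=0.3):
    """Keep at most `max_concepts` concepts whose confidence exceeds `min_confidence`."""
    kept = [(c, s) for c, s in concept_scores.items() if s > min_confidence]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [concept for concept, _ in kept[:max_concepts]]


# Column order as given in the description of topic_based_impact_classes.txt.
FIELDS = (
    "identifier",
    "concept",
    "pagerank_class",
    "attrank_class",
    "three_cc_class",
    "cc_class",
)


def parse_topic_line(line):
    """Split one tab-separated line into a dict keyed by FIELDS."""
    return dict(zip(FIELDS, line.rstrip("\n").split("\t")))


# Hypothetical confidence scores for one publication.
example = {
    "Machine learning": 0.9,
    "Data mining": 0.6,
    "Statistics": 0.45,
    "Algorithm": 0.4,
    "Law": 0.1,
}
print(dominant_concepts(example))
print(parse_topic_line("10.1000/a\tData mining\tC3\tC4\tC5\tC4\n"))
```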

The data used to produce the citation network on which we calculated the provided measures have been gathered from (a) the OpenCitations’ COCI dataset (Dec-2022 version), (b) a MAG [6,7] snapshot from Dec-2021, and (c) a Crossref snapshot from Jan-2023. The union of all distinct DOI-to-DOI citations that could be found in these sources has been considered (entries without a DOI were omitted).
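Merging several citation sources into one set of distinct DOI-to-DOI pairs might look like the Python sketch below. The normalization step (lowercasing and stripping common resolver prefixes) is an assumption added for illustration; the description does not say how DOIs were normalized. The function names and sample pairs are hypothetical.

```python
def normalize_doi(doi):
    """Lowercase a DOI and strip common resolver prefixes (an assumed convention)."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def merge_citations(*sources):
    """Union of distinct (citing DOI, cited DOI) pairs across all sources.

    Pairs in which either side lacks a DOI are omitted.
    """
    merged = set()
    for source in sources:
        for citing, cited in source:
            if not citing or not cited:
                continue
            merged.add((normalize_doi(citing), normalize_doi(cited)))
    return merged


# Hypothetical sample pairs standing in for the three sources.
coci = [("10.1/A", "10.1/b")]
mag = [("10.1/a", "10.1/B"), ("10.1/c", None)]
crossref = [("https://doi.org/10.1/c", "10.1/a")]
network = merge_citations(coci, mag, crossref)
print(sorted(network))
```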

References:

  1. L. Page, S. Brin, R. Motwani, and T. Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab.

  2. Rumi Ghosh, Tsung-Ting Kuo, Chun-Nan Hsu, Shou-De Lin, and Kristina Lerman. 2011. Time-Aware Ranking in Dynamic Citation Networks. In Data Mining Workshops (ICDMW). 373–380.

  3. I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Ranking Papers by their Short-Term Scientific Impact. CoRR abs/2006.00951 (2020)

  4. P. Manghi, C. Atzori, M. De Bonis, A. Bardi, Entity deduplication in big data graphs for scholarly communication, Data Technologies and Applications (2020).

  5. I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Impact-Based Ranking of Scientific Publications: A Survey and Experimental Evaluation. TKDE 2019 (early access)

  6. Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. 2015. An Overview of Microsoft Academic Service (MAS) and Applications. In Proceedings of the 24th International Conference on World Wide Web (WWW '15 Companion). ACM, New York, NY, USA, 243-246. DOI=http://dx.doi.org/10.1145/2740908.2742839

  7. K. Wang et al., “A Review of Microsoft Academic Services for Science of Science Studies”, Frontiers in Big Data, 2019, doi: 10.3389/fdata.2019.00045    

Our academic search engine, built on top of these data, can be found here. Note also that all calculated scores are available through BIP! Finder’s API.

Terms of use: These data are provided "as is", without any warranties of any kind. The data are provided under the Creative Commons Attribution 4.0 International license.

More details about BIP! DB can be found in our relevant peer-reviewed publication:

Thanasis Vergoulis, Ilias Kanellos, Claudio Atzori, Andrea Mannocci, Serafeim Chatzopoulos, Sandro La Bruzzo, Natalia Manola, Paolo Manghi: BIP! DB: A Dataset of Impact Measures for Scientific Publications. WWW (Companion Volume) 2021: 456-460

We kindly request that any published research that makes use of BIP! DB cite the above article.

Files (26.0 GB)

MD5 checksum                        Size
b2d74f0f772c116c9b83015f5ca9a52a    1.3 GB
a4e77f9df11a48bbbd357a03dd1f2d9a    2.8 GB
66694c3ffa04f326e25c67d21fd755eb    3.2 GB
5e88cb7c582c196178e041b3895df7cc    1.8 GB
5034d22a2e32b0617e209b941cc371de    1.3 GB
a58a272f57fc196459b3a7984dd5945f    2.8 GB
070c5c399f3b0f2650fbc8c3879d12b3    2.0 GB
b2d74140b9cb7d538b904821cac1f87e    3.1 GB
a51baa0e9e80b2920c157324f04970c9    2.4 GB
14cfdb2cd80c02e8f55be1f0efe1ceb5    3.1 GB
cfda6c74c0be6165a5870c53279ece8b    2.1 GB
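
A downloaded file can be checked against the MD5 checksums listed above with Python's standard hashlib module. The sketch below reads the file in chunks so that multi-gigabyte files do not need to fit in memory; the function names and any file path passed to them are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 digest with an expected value such as 'md5:b2d7...'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum(local_path, "md5:b2d74f0f772c116c9b83015f5ca9a52a")` returns True only if the local file is byte-for-byte identical to the published one.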

Additional details

Funding

OpenAIRE Nexus – OpenAIRE-Nexus Scholarly Communication Services for EOSC users 101017452
European Commission
