Published February 12, 2026 | Version v1
Dataset Open

Compact representation of the OpenAIRE citation graph

  • 1. Czech Academy of Sciences, Institute of Computer Science

Description

The OpenAIRE graph contains a large citation graph, with over 200 million publications and
over 2 billion citations. The current graph is available as a dump with metadata that totals
∼ 2.5 TB uncompressed, which makes it hard to process on conventional computers. To make this
network more accessible to the community, we provide a processed OpenAIRE citation graph that is
reduced to fit in 16 GB of RAM while preserving the full graph structure. The processed data is
also stored in a very simple format that allows straightforward further manipulation. We also
provide a Python pipeline that can be used to process future releases of the OpenAIRE graph.

 

The files are (size when loaded in RAM):

  • publications.tsv.xz (16 GB) - The nodes in the citation graph.
  • citations.tsv.xz (16 GB*) - The edges in the citation graph.
  • publication_large.tsv.xz (309 GB) - The nodes, with several additional fields for further features.
  • pipeline.tar.xz - The PySpark pipeline used to produce the other files. Contains a Singularity container for reproducibility and portability. The code is also available on Codeberg.

All files are xz-compressed (.xz).

* citations.tsv can be loaded using int32, thus reducing its memory footprint.
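
As a minimal sketch of the int32 note above, the following loads an edge list into a NumPy array. It assumes that citations.tsv.xz holds two tab-separated integer columns (citing and cited nodeId) and that every id fits in a signed 32-bit integer. The column order, the presence of a header row, and the function name load_citations are assumptions for illustration, not taken from the dataset documentation.

```python
import lzma

import numpy as np


def load_citations(path, skip_header=False):
    """Load an xz-compressed, tab-separated edge list as an (n, 2) int32 array.

    Assumption: each row is "<source nodeId>\t<target nodeId>".
    Set skip_header=True if the file turns out to start with a header row.
    """
    with lzma.open(path, "rt", encoding="utf-8") as handle:
        return np.loadtxt(
            handle,
            dtype=np.int32,      # 4 bytes per id instead of 8 for int64
            delimiter="\t",
            skiprows=1 if skip_header else 0,
            ndmin=2,             # keep a 2-D shape even for a single edge
        )


if __name__ == "__main__":
    edges = load_citations("citations.tsv.xz")
    print(edges.shape, edges.dtype, f"{edges.nbytes / 1e9:.1f} GB in memory")
```

With int32, each edge costs 8 bytes, so roughly 2 billion edges come to about 16 GB, half of what int64 would need.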

 

The preprint of the data paper is available on arXiv.

 

The fields in publication_large.tsv.xz:

Field        Explanation                                                Memory usage (GB)
nodeId       Unique internal identifier for the node (publication)      2
openaireId   Identifier assigned by the OpenAIRE platform               18
doi          Digital Object Identifier of the publication               13
title        Title of the publication                                   28
authors      List of authors associated with the publication            20
description  Abstract or short description of the publication           192
date         Date when the publication was published                    11
container    Journal, conference, or repository where it was published  13
citations    Number of times the publication has been cited             2
language     Language in which the publication is written               10
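
Since the description field accounts for most of the memory, one way to work with the large file is to stream it and keep only the fields that are needed. The sketch below illustrates this with the standard library. It assumes the file is tab-separated with a header row that uses the field names listed above and that fields are not quoted; these details and the helper name iter_fields are assumptions, not confirmed by the dataset documentation.

```python
import csv
import lzma


def iter_fields(path, fields):
    """Yield one tuple per row containing only the requested fields.

    Assumptions: the first row is a header with the field names from the
    table above, columns are tab-separated, and no quoting is used.
    """
    with lzma.open(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield tuple(row[name] for name in fields)


if __name__ == "__main__":
    # Keep only the small columns; the 192 GB description field is never stored.
    for node_id, doi, date in iter_fields(
        "publication_large.tsv.xz", ["nodeId", "doi", "date"]
    ):
        print(node_id, doi, date)
        break
```

Because rows are decompressed and parsed one at a time, memory use stays bounded by what the caller chooses to keep.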

 

Files

Files (53.5 GB)

Checksum                              Size
md5:d71eb32f759d27618befcc5e341bd954  8.5 GB
md5:d4ba24431cef2de7187ea24da2bbdfbe  1.8 GB
md5:bdd845fcdc637901980c82c62cc758cd  962.5 MB
md5:695e8fd95a74c65c158d35abfea149bf  42.2 GB
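
A downloaded file can be checked against the MD5 values listed above. The following is a small sketch using Python's hashlib; the helper names md5_of_file and matches_checksum are hypothetical, and the path in the usage example is a placeholder, since the listing does not show which checksum belongs to which file.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    expected_hex = expected.split(":")[-1].strip().lower()
    return md5_of_file(path) == expected_hex


if __name__ == "__main__":
    # Placeholder path and checksum pairing; adjust to the file actually downloaded.
    ok = matches_checksum("downloaded_file.xz", "md5:d71eb32f759d27618befcc5e341bd954")
    print("checksum OK" if ok else "checksum mismatch")
```

Reading in chunks keeps memory use constant even for the largest archive.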