Published April 9, 2026 | Version v2
Dataset Open

Compact representation of the OpenAIRE citation graph

  • 1. Czech Academy of Sciences, Institute of Computer Science

Description

We are making available a distilled version of the OpenAIRE citation graph. The complete graph is shared as two files totalling ~11 GB. We also provide a larger file that includes additional publication fields (see table below). For a complete description, see the preprint of the data paper on arXiv.

 

There are three data files, each served in two formats, TSV (Tab-Separated Values) and Parquet:

  • publications.tsv.xz and publications.parquet - The nodes in the citation graph, and their primary DOI.
  • citations.tsv.xz and citations.parquet - The edges in the citation graph.
  • publications_large.tsv.xz and publications_large.parquet - The nodes, but with several additional fields.
  • pipeline.tar.xz - The PySpark pipeline used to produce the other files. Contains a Singularity/Apptainer container for reproducibility and portability. The code is also available at Codeberg.

All files are compressed, either with xz (.xz) or in the Parquet format (.parquet).
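The .tsv.xz files can be read without first decompressing them to disk. The following is a minimal sketch using pandas; the helper names iter_tsv_xz and count_rows are hypothetical, and it assumes the TSV files begin with a header row, which the description does not state (pass header=None otherwise).

```python
import pandas as pd


def iter_tsv_xz(path, chunksize=1_000_000, **read_csv_kwargs):
    """Yield DataFrame chunks from an xz-compressed, tab-separated file.

    Assumption: the first line is a header row. Pass header=None via
    read_csv_kwargs if the file turns out to have no header.
    """
    with pd.read_csv(
        path,
        sep="\t",
        compression="xz",
        chunksize=chunksize,
        **read_csv_kwargs,
    ) as reader:
        for chunk in reader:
            yield chunk


def count_rows(path, chunksize=1_000_000):
    """Count data rows while holding only one chunk in memory at a time."""
    return sum(len(chunk) for chunk in iter_tsv_xz(path, chunksize=chunksize))
```

For example, count_rows("citations.tsv.xz") would stream through the edge list chunk by chunk, provided the file has been downloaded to the working directory.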

 

Memory-efficient loading of the Parquet files using pandas

To load citations.parquet and publications.parquet in a memory-efficient way, use the PyArrow backend:

import pandas as pd

df_pubs = pd.read_parquet(
    "publications.parquet",
    engine="pyarrow",
    dtype_backend="pyarrow",
)

df_cites = pd.read_parquet(
    "citations.parquet",
    engine="pyarrow",
    dtype_backend="pyarrow",
)

The large publications file, i.e. publications_large.parquet, will not fit in memory on most machines. It is, however, simple to select a subset of columns and load only those.

df_large = pd.read_parquet(
    "publications_large.parquet",
    columns=["nodeId", "title", "pid_dois"],
    engine="pyarrow",
    dtype_backend="pyarrow",
)

Here, the nodeId, title, and pid_dois columns are selected. 

 

The fields in publications_large:

Field | Type | Explanation | Memory usage (GB) | Percentage of filled fields
nodeId | int32 | Unique internal identifier for the node (publication) | 0.8 | 100.00%
openaireId | str | Identifier assigned by the OpenAIRE platform | 9.6 | 100.00%
title | str | Title of the publication | 16.5 | 99.41%
authors | list[str] | List of authors associated with the publication | 11.0 | 83.84%
description | str | Abstract or short description of the publication | 131.3 | 57.17%
date | datetime | Date when the publication was published | 0.8 | 97.33%
container | str | Journal, conference, or repository where it was published | 2.2 | 68.45%
citations | int | Number of times the publication has been cited | 1.6 | 97.33%
language | str | Language in which the publication is written | 0.2 | 99.99%
pid_dois | list[str] | DOI identifiers | 5.6 | 80.70%
pid_mag_ids | list[str] | MAG IDs | 2.0 | 44.27%
pid_pmids | list[str] | PubMed IDs | 1.2 | 18.18%
pid_handles | list[str] | Persistent handles | 1.1 | 8.33%
pid_pmcs | list[str] | PubMed Central IDs | 0.9 | 4.77%
pid_arxiv_ids | list[str] | ArXiv IDs | 0.9 | 1.38%

Memory usage is the size each column takes up in memory when loaded into a pandas DataFrame using the code snippets above. The percentage of filled fields is the share of non-null entries in each column.
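Comparable per-column figures can be computed for any loaded DataFrame. The sketch below is one plausible way to do so, not necessarily how the table was produced; the function name column_stats is hypothetical, and it assumes 1 GB = 10^9 bytes.

```python
import pandas as pd


def column_stats(df):
    """Return per-column memory usage (GB) and percentage of non-null entries."""
    # deep=True also counts the contents of object (e.g. string) columns.
    memory_gb = df.memory_usage(index=False, deep=True) / 1e9  # assumes 1 GB = 1e9 bytes
    filled_pct = df.notna().mean() * 100
    return pd.DataFrame({"memory_gb": memory_gb, "filled_pct": filled_pct})
```

Applied to a frame such as df_pubs, column_stats(df_pubs) returns one row per column.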

Files (141.6 GB)

md5 checksum | Size
md5:6c62b3e68d5c64438c655e0d39c381bd | 8.8 GB
md5:c6b0ba6c4b6bf6457d08b36ccd4a3c5c | 9.7 GB
md5:84f4ce5887d8af5af57bcb859878e167 | 1.7 GB
md5:367f4596eb6211afab4320cf1363fe0b | 1.9 GB
md5:d7502a2e7dc6c39a0476c7a5c66d853f | 1.0 GB
md5:b6022a1b3dd877554676c7ead973a2a3 | 72.5 GB
md5:de53759ce562806411468cc6dc7ecc64 | 45.9 GB
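Given the size of the files, it may be worth checking a finished download against the md5 checksums listed above. The following is a minimal sketch using only the Python standard library; the function names md5_of_file and matches_checksum are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in fixed-size blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

A call such as matches_checksum("citations.parquet", "md5:...") returns True only if the local file is byte-identical to the published one; which checksum belongs to which file has to be taken from the record's file listing.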