Compact representation of the OpenAIRE citation graph

Skarding, Joakim; Sanda, Pavel

doi:10.5281/zenodo.19207803

Published April 9, 2026 | Version v2

Dataset Open

1. Czech Academy of Sciences, Institute of Computer Science

When working with this dataset, please cite the accompanying article: Skarding, J. and Sanda, P. (2026) ‘Making the Complete OpenAIRE Citation Graph Easily Accessible Through Compact Data Representation’, Journal of Open Humanities Data, 12(1), p. 63. Available at: https://doi.org/10.5334/johd.520.

We are making available a distilled version of the OpenAIRE citation graph. The complete graph is shared as two files totalling ~11 GB. We also provide a larger file that includes additional publication fields (see the table below). For a complete description, see the accompanying data paper cited above.

There are three data files, each provided in two formats, TSV (Tab-Separated Values) and Parquet:

publications.tsv.xz and publications.parquet - The nodes in the citation graph and their primary DOI.
citations.tsv.xz and citations.parquet - The edges in the citation graph.
publications_large.tsv.xz and publications_large.parquet - The nodes, with several additional fields.
pipeline.tar.xz - The PySpark pipeline used to produce the other files. It contains a Singularity/Apptainer container for reproducibility and portability. The code is also available on Codeberg.

All files are compressed, either as .xz archives or as Parquet files.
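
For the TSV variants, the xz-compressed files can be streamed row by row without first decompressing them to disk. The following is a minimal sketch using only the Python standard library. The helper name iter_tsv_xz is hypothetical, and the sketch assumes plain tab-separated lines without quoting or embedded newlines, which is not confirmed here and should be checked against the data paper.

```python
import lzma


def iter_tsv_xz(path, limit=None):
    """Yield rows (lists of strings) from an xz-compressed TSV file.

    Assumption: one record per line, fields separated by tabs,
    no quoting and no embedded newlines.
    """
    # lzma.open in text mode decompresses on the fly, so the whole
    # file is never held in memory.
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            if limit is not None and index >= limit:
                break
            yield line.rstrip("\n").split("\t")
```

Calling list(iter_tsv_xz("publications.tsv.xz", limit=5)) would then return the first five rows, which is a quick way to check whether the file has a header line and how many fields each row carries.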

Memory efficient loading of the Parquet files using Pandas

For memory-efficient loading of citations.parquet and publications.parquet, use the PyArrow backend (the dtype_backend argument requires pandas 2.0 or later):

import pandas as pd

df_pubs = pd.read_parquet(
    "publications.parquet",
    engine="pyarrow",
    dtype_backend="pyarrow",
)

df_cites = pd.read_parquet(
    "citations.parquet",
    engine="pyarrow",
    dtype_backend="pyarrow",
)

The large publications file, publications_large.parquet, will not fit in memory on most machines. It is, however, simple to select a subset of columns and load only those.

df_large = pd.read_parquet(
    "publications_large.parquet",
    columns=["nodeId", "title", "pid_dois"],
    engine="pyarrow",
    dtype_backend="pyarrow",
)

Here, the nodeId, title, and pid_dois columns are selected.

The fields in publications_large:

Field	Type	Explanation	Memory usage (GB)	Percentage of filled fields
nodeId	int32	Unique internal identifier for the node (publication)	0.8	100.00%
openaireId	str	Identifier assigned by the OpenAIRE platform	9.6	100.00%
title	str	Title of the publication	16.5	99.41%
authors	list[str]	List of authors associated with the publication	11.0	83.84%
description	str	Abstract or short description of the publication	131.3	57.17%
date	datetime	Date when the publication was published	0.8	97.33%
container	str	Journal, conference, or repository where it was published	2.2	68.45%
citations	int	Number of times the publication has been cited	1.6	97.33%
language	str	Language in which the publication is written	0.2	99.99%
pid_dois	list[str]	DOI identifiers	5.6	80.70%
pid_mag_ids	list[str]	MAG IDs	2.0	44.27%
pid_pmids	list[str]	PubMed IDs	1.2	18.18%
pid_handles	list[str]	Persistent handles	1.1	8.33%
pid_pmcs	list[str]	PubMed Central IDs	0.9	4.77%
pid_arxiv_ids	list[str]	ArXiv IDs	0.9	1.38%

Memory usage is the size each column takes up in memory when loaded into a pandas DataFrame using the code snippets above. The percentage of filled fields shows the share of non-null entries in each column.
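
The memory figures in the table can be used to estimate in advance whether a chosen subset of columns is likely to fit in RAM. The following is a minimal sketch with the values copied from the table above. The function name estimated_memory_gb is hypothetical, and actual usage may differ with other pandas or PyArrow versions.

```python
# Per-column memory usage in GB, copied from the table above.
MEMORY_GB = {
    "nodeId": 0.8,
    "openaireId": 9.6,
    "title": 16.5,
    "authors": 11.0,
    "description": 131.3,
    "date": 0.8,
    "container": 2.2,
    "citations": 1.6,
    "language": 0.2,
    "pid_dois": 5.6,
    "pid_mag_ids": 2.0,
    "pid_pmids": 1.2,
    "pid_handles": 1.1,
    "pid_pmcs": 0.9,
    "pid_arxiv_ids": 0.9,
}


def estimated_memory_gb(columns):
    """Sum the table's memory estimates for the given column names."""
    unknown = [name for name in columns if name not in MEMORY_GB]
    if unknown:
        raise KeyError(f"Unknown column(s): {unknown}")
    return round(sum(MEMORY_GB[name] for name in columns), 1)
```

For the nodeId, title, and pid_dois columns this gives about 22.9 GB, compared with about 185.7 GB for all columns together, most of which is the description column.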

Files (141.6 GB)

Name	MD5	Size
citations.parquet	6c62b3e68d5c64438c655e0d39c381bd	8.8 GB
citations.tsv.xz	c6b0ba6c4b6bf6457d08b36ccd4a3c5c	9.7 GB
pipeline.tar.xz	84f4ce5887d8af5af57bcb859878e167	1.7 GB
publications.parquet	367f4596eb6211afab4320cf1363fe0b	1.9 GB
publications.tsv.xz	d7502a2e7dc6c39a0476c7a5c66d853f	1.0 GB
publications_large.parquet	b6022a1b3dd877554676c7ead973a2a3	72.5 GB
publications_large.tsv.xz	de53759ce562806411468cc6dc7ecc64	45.9 GB
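
Given the file sizes, it can be worth verifying a download against the MD5 checksums listed above before processing it. The following is a minimal sketch using Python's hashlib. The helper names md5_of_file and matches_md5 are hypothetical, and the file is read in chunks so that even the largest files do not need to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_hex):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

For example, matches_md5("publications.parquet", "367f4596eb6211afab4320cf1363fe0b") should return True for an intact download of that file.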
