Compact representation of the OpenAIRE citation graph
Description
We are making available a distilled version of the OpenAIRE citation graph. The complete graph is shared as two files totalling ~11 GB. We also provide a larger file that includes additional publication fields (see the table below). For a complete description, see the preprint of the data paper on arXiv.
There are three data files, each served in two formats, TSV (tab-separated values) and Parquet, plus an archive containing the pipeline:
- publications.tsv.xz and publications.parquet - The nodes in the citation graph, and their primary DOI.
- citations.tsv.xz and citations.parquet - The edges in the citation graph.
- publication_large.tsv.xz and publications_large.parquet - The nodes, but with several additional fields.
- pipeline.tar.xz - The PySpark pipeline used to produce the other files. Contains a Singularity/Apptainer container for reproducibility and portability. The code is also available on Codeberg.
All files are compressed, either with xz (.xz) or as Parquet (.parquet).
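The TSV files can be read without first decompressing them to disk. The following is a minimal sketch using only the Python standard library. It assumes that each TSV file starts with a header row, which is not confirmed here, and `iter_tsv_rows` is a hypothetical helper name introduced for illustration.

```python
import csv
import lzma
from typing import Dict, Iterator


def iter_tsv_rows(path: str) -> Iterator[Dict[str, str]]:
    """Yield one row at a time from an xz-compressed TSV file.

    Assumption: the first line of the file is a header row. Each row is
    returned as a dict keyed by the header names, with string values.
    """
    # lzma.open in text mode decompresses lazily while the file is read,
    # so the whole file is never held in memory at once.
    with lzma.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row
```

Iterating over `iter_tsv_rows("citations.tsv.xz")` would then process the edges one at a time. Very long fields may require raising the limit with `csv.field_size_limit`.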
Memory-efficient loading of the Parquet files using pandas
To load citations.parquet and publications.parquet in a memory-efficient way, use the PyArrow backend:
import pandas as pd

df_pubs = pd.read_parquet(
    "publications.parquet",
    engine="pyarrow",
    dtype_backend="pyarrow",
)
df_cites = pd.read_parquet(
    "citations.parquet",
    engine="pyarrow",
    dtype_backend="pyarrow",
)
The large publications file, publications_large.parquet, will not fit in memory on most machines. It is, however, simple to select a subset of columns and load only those:
df_large = pd.read_parquet(
    "publications_large.parquet",
    columns=["nodeId", "title", "pid_dois"],
    engine="pyarrow",
    dtype_backend="pyarrow",
)
Here, the nodeId, title, and pid_dois columns are selected.
The fields in the large publications file:
| Field | Type | Explanation | Memory usage (GB) | Percentage of filled fields |
|---|---|---|---|---|
| nodeId | int32 | Unique internal identifier for the node (publication) | 0.8 | 100.00% |
| openaireId | str | Identifier assigned by the OpenAIRE platform | 9.6 | 100.00% |
| title | str | Title of the publication | 16.5 | 99.41% |
| authors | list[str] | List of authors associated with the publication | 11.0 | 83.84% |
| description | str | Abstract or short description of the publication | 131.3 | 57.17% |
| date | datetime | Date when the publication was published | 0.8 | 97.33% |
| container | str | Journal, conference, or repository where it was published | 2.2 | 68.45% |
| citations | int | Number of times the publication has been cited | 1.6 | 97.33% |
| language | str | Language in which the publication is written | 0.2 | 99.99% |
| pid_dois | list[str] | DOI identifiers | 5.6 | 80.70% |
| pid_mag_ids | list[str] | MAG IDs | 2.0 | 44.27% |
| pid_pmids | list[str] | PubMed IDs | 1.2 | 18.18% |
| pid_handles | list[str] | Persistent handles | 1.1 | 8.33% |
| pid_pmcs | list[str] | PubMed Central IDs | 0.9 | 4.77% |
| pid_arxiv_ids | list[str] | ArXiv IDs | 0.9 | 1.38% |
Memory usage is the size each column takes up in memory when loaded into a pandas DataFrame using the code snippets above. The percentage of filled fields is the share of non-null entries in each column.
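Both statistics can be recomputed for any subset of columns that has been loaded. The following is a rough sketch: `column_stats` is a hypothetical helper, it assumes 1 GB = 10^9 bytes, and the resulting numbers depend on the pandas and PyArrow versions, so they may not match the table exactly.

```python
import pandas as pd


def column_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column memory usage (GB) and percentage of non-null values."""
    # deep=True also counts the memory held by object-dtype values.
    # Assumption: 1 GB is taken as 1e9 bytes.
    memory_gb = df.memory_usage(index=False, deep=True) / 1e9
    # notna() marks non-null cells; the column mean is the filled fraction.
    filled_pct = df.notna().mean() * 100
    return pd.DataFrame({"memory_gb": memory_gb, "filled_pct": filled_pct})
```

Calling `column_stats(df_large)` on a loaded DataFrame returns one row per column.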
Files
(141.6 GB)
| MD5 checksum | Size |
|---|---|
| 6c62b3e68d5c64438c655e0d39c381bd | 8.8 GB |
| c6b0ba6c4b6bf6457d08b36ccd4a3c5c | 9.7 GB |
| 84f4ce5887d8af5af57bcb859878e167 | 1.7 GB |
| 367f4596eb6211afab4320cf1363fe0b | 1.9 GB |
| d7502a2e7dc6c39a0476c7a5c66d853f | 1.0 GB |
| b6022a1b3dd877554676c7ead973a2a3 | 72.5 GB |
| de53759ce562806411468cc6dc7ecc64 | 45.9 GB |
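The MD5 checksums listed above can be used to verify a download. The sketch below uses only the standard library and reads the file in chunks, which matters for the larger files. The helper names `md5_of_file` and `matches_checksum` are hypothetical.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a large download never sits fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 digest with a listed checksum.

    An optional "md5:" prefix and letter case in expected_md5 are ignored.
    """
    expected = expected_md5.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

Pass the path of a downloaded file together with its checksum from the listing. A `False` result suggests an incomplete or corrupted download.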