Compact representation of the OpenAIRE citation graph
Description
The OpenAIRE graph contains a large citation graph dataset, with over 200 million publications and
over 2 billion citations. The current graph is available as a dump with metadata which, uncompressed,
totals ∼2.5 TB. This makes it hard to process on conventional computers. To make this network
more accessible to the community, we provide a processed OpenAIRE graph that is downscaled to fit in
16 GB of RAM while preserving the full graph structure. In addition, we offer the processed data
in a very simple format that allows straightforward further manipulation. We also provide a Python
pipeline, which can be used to process future releases of the OpenAIRE graph.
The files are (size when loaded in RAM):
- publications.tsv.xz (16 GB) - The nodes in the citation graph.
- citations.tsv.xz (16 GB*) - The edges in the citation graph.
- publication_large.tsv.xz (309 GB) - The nodes, with several additional fields for further features.
- pipeline.tar.xz - The PySpark pipeline used to produce the other files. It contains a Singularity container for reproducibility and portability. The code is also available at Codeberg.
All files are xz-compressed (.xz).
* citations.tsv can be loaded using int32, thus reducing its memory footprint.
The preprint of the data paper is available on arXiv.
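The int32 note above can be illustrated with a minimal sketch that uses only the Python standard library. It assumes that the edge file holds two tab-separated integer columns (citing node, cited node) and that a header line may or may not be present. Both points are assumptions, not confirmed by this description, and `load_edges_int32` is a hypothetical helper name.

```python
import lzma
from array import array


def load_edges_int32(path, has_header=False):
    """Read a two-column, tab-separated .xz file into two 4-byte integer arrays.

    Assumption: column 1 is the citing node id and column 2 the cited node id.
    """
    sources = array("i")  # "i" is a signed C int, 4 bytes on common platforms
    targets = array("i")
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        if has_header:
            next(handle, None)  # skip the header line if the file has one
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            src, dst = line.split("\t")[:2]
            # append() raises OverflowError if an id does not fit in a C int
            sources.append(int(src))
            targets.append(int(dst))
    return sources, targets
```

Stored this way, each edge takes 8 bytes (two 4-byte ids) instead of the 16 bytes needed with 64-bit integers. A line-by-line loop is slow for billions of edges, so for real use the same idea would more likely be applied through a numerical library with a 32-bit integer dtype.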
The fields in publication_large.tsv:
| Field | Explanation | Memory usage (GB) |
|---|---|---|
| nodeId | Unique internal identifier for the node (publication) | 2 |
| openaireId | Identifier assigned by the OpenAIRE platform | 18 |
| doi | Digital Object Identifier of the publication | 13 |
| title | Title of the publication | 28 |
| authors | List of authors associated with the publication | 20 |
| description | Abstract or short description of the publication | 192 |
| date | Date when the publication was published | 11 |
| container | Journal, conference, or repository where it was published | 13 |
| citations | Number of times the publication has been cited | 2 |
| language | Language in which the publication is written | 10 |
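Since the description field alone accounts for most of the memory, it can help to stream the large node file and keep only the columns that are needed. The sketch below assumes a header row whose column names match the field names in the table above, tab-separated values, and no quoting. None of these are confirmed here, and `iter_selected_fields` is a hypothetical helper.

```python
import csv
import lzma


def iter_selected_fields(path, wanted=("nodeId", "doi")):
    """Yield one dict per row containing only the requested columns.

    Assumption: the first line of the file is a header naming the columns.
    """
    # Abstracts can be long, so raise the csv module's per-field size limit.
    csv.field_size_limit(2**31 - 1)
    with lzma.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield {name: row[name] for name in wanted}
```

Because the function is a generator, rows are decompressed and parsed one at a time, so memory use stays small regardless of the file size.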
Files (53.5 GB)
| md5 | Size |
|---|---|
| d71eb32f759d27618befcc5e341bd954 | 8.5 GB |
| d4ba24431cef2de7187ea24da2bbdfbe | 1.8 GB |
| bdd845fcdc637901980c82c62cc758cd | 962.5 MB |
| 695e8fd95a74c65c158d35abfea149bf | 42.2 GB |
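The md5 values listed for the files can be used to check a download for corruption. The following is a minimal sketch using Python's standard `hashlib` module; `md5_of_file` is a hypothetical helper name, and the chunk size is an arbitrary choice.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that multi-gigabyte files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string can then be compared with the md5 value listed for the corresponding downloaded file.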