COMET arXiv preprint author affiliation extraction and ROR ID matching results
Description
# arXiv Author Affiliations
This dataset contains author affiliation data extracted from arXiv works, matched to Research Organization Registry (ROR) identifiers.
## Dataset Description
This dataset was generated from all arXiv works as of 2025/12. The source PDFs were converted to markdown using [markitdown](https://github.com/microsoft/markitdown), and author affiliations were then extracted using [cometadata/affiliation-parsing-lora-Qwen3-8B-distil-GLM_4.5_Air](https://huggingface.co/cometadata/affiliation-parsing-lora-Qwen3-8B-distil-GLM_4.5_Air). The extracted affiliations were matched to ROR IDs using the [single search matching strategy](https://doi.org/10.71938/zz90-g810) in the ROR API.
The dataset contains approximately 2.8 million arXiv papers with 12.1 million author-affiliation pairs, of which 75.8% were successfully matched to ROR identifiers. Approximately 10,000 works were excluded because they exceeded the context size of the model.
## Data Format
Each line is a JSON object with the following structure:
```json
{
  "arxiv_id": "arXiv:1109.3791",
  "doi": "10.48550/arxiv.1109.3791",
  "version": "v1",
  "prediction": [
    {
      "name": "Author Name",
      "affiliations": [
        {
          "affiliation": "Department of Computer Science, Example University",
          "ror_id": "https://ror.org/example123"
        }
      ]
    }
  ]
}
```
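As a rough illustration of how the line-delimited records above might be consumed, the sketch below flattens each record into one row per author-affiliation pair. The function name `iter_author_affiliations` and the file name `affiliations.jsonl` are hypothetical and used only for illustration; the actual name of the distributed file is not stated here.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple

Row = Tuple[str, str, str, Optional[str]]


def iter_author_affiliations(lines: Iterable[str]) -> Iterator[Row]:
    """Yield (arxiv_id, author_name, affiliation_text, ror_id) per pair.

    `lines` is any iterable of JSON strings, one record per line,
    such as an open text file handle.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        for author in record.get("prediction") or []:
            for aff in author.get("affiliations") or []:
                yield (
                    record["arxiv_id"],
                    author["name"],
                    aff["affiliation"],
                    aff.get("ror_id"),  # None when no ROR match was found
                )


# A single in-memory record mirroring the documented structure.
sample = json.dumps({
    "arxiv_id": "arXiv:1109.3791",
    "doi": "10.48550/arxiv.1109.3791",
    "version": "v1",
    "prediction": [{
        "name": "Author Name",
        "affiliations": [{
            "affiliation": "Department of Computer Science, Example University",
            "ror_id": "https://ror.org/example123",
        }],
    }],
})
rows = list(iter_author_affiliations([sample]))

# With a real file (hypothetical name), pass the handle directly so the
# 2.4 GB file is streamed rather than loaded into memory:
#   with open("affiliations.jsonl", encoding="utf-8") as fh:
#       for row in iter_author_affiliations(fh):
#           ...
```

Streaming line by line keeps memory use flat regardless of file size, which matters for a file of this scale.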
### Fields
| Field | Description |
|-------|-------------|
| `arxiv_id` | arXiv identifier with prefix (e.g., `arXiv:1109.3791`) |
| `doi` | DOI derived from arXiv ID (e.g., `10.48550/arxiv.1109.3791`) |
| `version` | Paper version from arXiv (e.g., `v1`, `v2`) |
| `prediction` | Array of authors with their affiliations |
| `prediction[].name` | Author name as extracted from the paper |
| `prediction[].affiliations` | Array of affiliation objects |
| `prediction[].affiliations[].affiliation` | Raw affiliation text |
| `prediction[].affiliations[].ror_id` | ROR identifier URL, or `null` if no match found |
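The field definitions above suggest two simple checks a consumer might run: recomputing the share of affiliations with a non-null `ror_id`, and confirming that `doi` agrees with `arxiv_id`. The sketch below is a minimal version of both. The helper names are hypothetical, and the DOI rule (strip the `arXiv:` prefix, prepend `10.48550/arxiv.`) is inferred only from the example values in the table, so it is an assumption rather than a documented guarantee.

```python
from typing import Any, Dict, Iterable


def doi_from_arxiv_id(arxiv_id: str) -> str:
    """Build the DOI implied by the documented example pair.

    Assumption: the DOI is '10.48550/arxiv.' followed by the identifier
    without its 'arXiv:' prefix, as in the table's example values.
    """
    prefix = "arXiv:"
    bare = arxiv_id[len(prefix):] if arxiv_id.startswith(prefix) else arxiv_id
    return "10.48550/arxiv." + bare


def ror_match_rate(records: Iterable[Dict[str, Any]]) -> float:
    """Return the fraction of affiliation entries with a non-null ror_id."""
    total = 0
    matched = 0
    for record in records:
        for author in record.get("prediction") or []:
            for aff in author.get("affiliations") or []:
                total += 1
                if aff.get("ror_id") is not None:
                    matched += 1
    return matched / total if total else 0.0


# Two toy records: one matched affiliation and one unmatched (null) one.
toy_records = [
    {
        "arxiv_id": "arXiv:1109.3791",
        "doi": "10.48550/arxiv.1109.3791",
        "version": "v1",
        "prediction": [{
            "name": "Author Name",
            "affiliations": [
                {"affiliation": "Example University",
                 "ror_id": "https://ror.org/example123"},
                {"affiliation": "Unrecognized Institute", "ror_id": None},
            ],
        }],
    },
]
rate = ror_match_rate(toy_records)
consistent = all(doi_from_arxiv_id(r["arxiv_id"]) == r["doi"] for r in toy_records)
```

On the full dataset, a rate computed this way should land near the reported 75.8% only if that figure counts individual affiliation entries the same way, which this sketch assumes.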
Files
| Size | Checksum |
|---|---|
| 2.4 GB | md5:91fb8154e4f422483fed54190ff6c745 |
Additional details
Related works
- Is derived from
  - Model: https://huggingface.co/cometadata/affiliation-parsing-lora-Qwen3-8B-distil-GLM_4.5_Air (URL)
  - Dataset: 10.34740/kaggle/dsv/7548853 (DOI)
Dates
- Available: 2025-02-16