Published February 16, 2026 | Version 1.0
Dataset Open

COMET arXiv preprint author affiliation extraction and ROR ID matching results

  • 1. Stanford University
  • 2. California Digital Library
  • 3. Research Organization Registry

Description

# arXiv Author Affiliations

This dataset contains author affiliation data extracted from arXiv works, matched to Research Organization Registry (ROR) identifiers.

## Dataset Description

This dataset was generated from all arXiv works available as of December 2025. The source PDFs were converted to markdown using [markitdown](https://github.com/microsoft/markitdown), and author affiliations were then extracted from the markdown using [cometadata/affiliation-parsing-lora-Qwen3-8B-distil-GLM_4.5_Air](https://huggingface.co/cometadata/affiliation-parsing-lora-Qwen3-8B-distil-GLM_4.5_Air). The extracted affiliation strings were matched to ROR IDs using the [single search matching strategy](https://doi.org/10.71938/zz90-g810) in the ROR API.
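To make the matching step more concrete, the following is a minimal sketch of querying the ROR API for one affiliation string and reading the result. It is not the code used to build this dataset. It assumes the ROR v2 affiliation endpoint and its response shape (an `items` list whose entries carry a `chosen` flag and an `organization` object), and it does not show how the single search strategy is selected; see the ROR documentation for that. The helper names and the mock response are hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed endpoint: the public ROR v2 organizations route.
ROR_ORGANIZATIONS_URL = "https://api.ror.org/v2/organizations"


def affiliation_query_url(affiliation: str) -> str:
    """Build a URL that asks ROR to match a raw affiliation string."""
    return f"{ROR_ORGANIZATIONS_URL}?{urlencode({'affiliation': affiliation})}"


def chosen_ror_id(response: dict) -> Optional[str]:
    """Return the ROR ID of the candidate flagged as chosen, else None."""
    for item in response.get("items", []):
        if item.get("chosen"):
            return item["organization"]["id"]
    return None


# Hand-written mock response for illustration; not real API output.
mock_response = {
    "items": [
        {"chosen": False, "organization": {"id": "https://ror.org/example000"}},
        {"chosen": True, "organization": {"id": "https://ror.org/example123"}},
    ]
}

url = affiliation_query_url("Example University")
ror_id = chosen_ror_id(mock_response)
```

Fetching `url` with any HTTP client and passing the decoded JSON to `chosen_ror_id` would complete the round trip. A result of `None` corresponds to the `null` value stored in the dataset when no match is found.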

The dataset contains approximately 2.8 million arXiv papers with 12.1 million author-affiliation pairs, of which 75.8% were matched to ROR identifiers. Approximately 10,000 works were excluded because their text exceeded the context size of the extraction model.

## Data Format

Each line is a JSON object with the following structure:

```json
{
  "arxiv_id": "arXiv:1109.3791",
  "doi": "10.48550/arxiv.1109.3791",
  "version": "v1",
  "prediction": [
    {
      "name": "Author Name",
      "affiliations": [
        {
          "affiliation": "Department of Computer Science, Example University",
          "ror_id": "https://ror.org/example123"
        }
      ]
    }
  ]
}
```
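Because each line is an independent JSON object, the file can be streamed record by record without loading it all into memory. The sketch below shows one way to do this; `iter_records` is a hypothetical helper name, and the in-memory sample stands in for the real file.

```python
import io
import json
from typing import Iterator, TextIO


def iter_records(lines: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines, such as a trailing newline
            yield json.loads(line)


# Tiny in-memory stand-in for the real file; the values are illustrative.
sample = io.StringIO(
    '{"arxiv_id": "arXiv:1109.3791", "doi": "10.48550/arxiv.1109.3791", '
    '"version": "v1", "prediction": []}\n'
    "\n"
)
records = list(iter_records(sample))
```

For the downloaded file, the same function accepts an open text handle, for example `open(path, encoding="utf-8")`, where `path` is wherever the file was saved. If the download turns out to be compressed, it would need to be decompressed or opened with the matching reader first.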

### Fields

| Field | Description |
|-------|-------------|
| `arxiv_id` | arXiv identifier with prefix (e.g., `arXiv:1109.3791`) |
| `doi` | DOI derived from arXiv ID (e.g., `10.48550/arxiv.1109.3791`) |
| `version` | Paper version from arXiv (e.g., `v1`, `v2`) |
| `prediction` | Array of authors with their affiliations |
| `prediction[].name` | Author name as extracted from the paper |
| `prediction[].affiliations` | Array of affiliation objects |
| `prediction[].affiliations[].affiliation` | Raw affiliation text |
| `prediction[].affiliations[].ror_id` | ROR identifier URL, or `null` if no match found |
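The nested fields above can be flattened into author-affiliation pairs, which is the unit the match statistics are reported in. The following sketch does that and computes the share of pairs with a non-null `ror_id`. The function names are hypothetical, and the DOI helper only mirrors the pattern visible in the example values, so it should be read as an assumption and not as the authors' derivation rule.

```python
from typing import Iterable, Iterator, Optional, Tuple

# (arxiv_id, author name, raw affiliation text, ROR URL or None)
Pair = Tuple[str, str, str, Optional[str]]


def iter_pairs(record: dict) -> Iterator[Pair]:
    """Yield one tuple per author-affiliation pair in a record."""
    for author in record.get("prediction") or []:
        for aff in author.get("affiliations") or []:
            yield (
                record["arxiv_id"],
                author["name"],
                aff["affiliation"],
                aff.get("ror_id"),  # None when the JSON value is null
            )


def match_rate(pairs: Iterable[Pair]) -> float:
    """Fraction of pairs that carry a ROR identifier (0.0 if no pairs)."""
    total = matched = 0
    for *_, ror_id in pairs:
        total += 1
        matched += ror_id is not None
    return matched / total if total else 0.0


def doi_from_arxiv_id(arxiv_id: str) -> str:
    """Assumed pattern: drop the 'arXiv:' prefix, prepend '10.48550/arxiv.'."""
    return "10.48550/arxiv." + arxiv_id.split(":", 1)[-1]


# Illustrative record with one matched and one unmatched affiliation.
record = {
    "arxiv_id": "arXiv:1109.3791",
    "prediction": [
        {
            "name": "Author Name",
            "affiliations": [
                {"affiliation": "Example University",
                 "ror_id": "https://ror.org/example123"},
                {"affiliation": "Unknown Institute", "ror_id": None},
            ],
        }
    ],
}
pairs = list(iter_pairs(record))
rate = match_rate(pairs)
```

Applied across every record in the file, `match_rate` should land near the reported overall figure, though small differences are possible depending on how empty or duplicate affiliations are counted.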

Files

Total size: 2.4 GB
md5:91fb8154e4f422483fed54190ff6c745
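The MD5 checksum listed for the file can be used to confirm that a download completed without corruption. The sketch below hashes a binary stream in chunks so that a file of this size never has to fit in memory. `md5_of_stream` is a hypothetical helper, and the short byte strings are stand-ins for the real file.

```python
import hashlib
import io
from typing import BinaryIO

# Checksum published on this page for the dataset file.
EXPECTED_MD5 = "91fb8154e4f422483fed54190ff6c745"


def md5_of_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a binary stream, read in 1 MiB chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_expected(stream: BinaryIO, expected: str = EXPECTED_MD5) -> bool:
    """True when the stream's digest equals the expected checksum."""
    return md5_of_stream(stream) == expected.lower()


# Demonstration on in-memory bytes rather than the 2.4 GB download.
demo_digest = md5_of_stream(io.BytesIO(b"hello"))
demo_ok = matches_expected(io.BytesIO(b"hello"), demo_digest)
```

For the real file, open it in binary mode, for example `open(path, "rb")`, and pass the handle to `matches_expected`. MD5 is adequate for detecting accidental corruption, which is its role here, but it is not a safeguard against deliberate tampering.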

Additional details

Funding

The Navigation Fund
A Community-Built Source of Open Research Metadata

Dates

Available
2025-02-16