arXiv abstracts and titles from 1,469 single-authored papers (100 unique authors) in computer science
Contributors
Researchers:
- 1. ISTI-CNR
- 2. Scuola Normale Superiore Pisa
Description
This dataset is intended for authorship analysis experiments. It consists of abstracts of single-author papers from arXiv, crawled through the arXiv API by querying a list of computer-science-related keywords ("deep learning", "machine learning", "information retrieval", "computer science", "data mining", "support vector", "logistic regression", "artificial intelligence", "supervised learning"). The number of papers per author roughly follows a power-law distribution, with a few prolific authors and many authors accounting for very few papers each; we retained only authors with at least 10 papers, resulting in a total of 1,469 documents from 100 authors. The two most prolific authors (Peter D. Turney and Subhash Kak) have 34 abstracts each, the 10 most prolific authors have written 22 or more articles, and 50% of the authors have no more than 12 abstracts to their names. To divide the corpus into a training set and a test set we perform a stratified split: each author's documents are split into a training portion (70%) and a test portion (30%). We use these documents as examples of "scientific communication", characterised by a precise and compact style with an abundance of technical terminology.
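The author filtering and per-author 70/30 split described above can be sketched as follows. This is a minimal illustration, not the authors' actual procedure: the function names `filter_prolific` and `stratified_split`, the `(author, text)` record layout, the fixed seed, and the use of rounding to size each author's test portion are all assumptions. The column layout of the CSV file is not stated in this record, so the sketch works on in-memory pairs rather than reading the file.

```python
import random
from collections import defaultdict


def filter_prolific(records, min_papers=10):
    """Keep only records whose author has at least `min_papers` documents.

    `records` is a list of (author, text) pairs (an assumed layout).
    """
    counts = defaultdict(int)
    for author, _ in records:
        counts[author] += 1
    return [(a, t) for a, t in records if counts[a] >= min_papers]


def stratified_split(records, test_fraction=0.3, seed=0):
    """Split each author's documents into train and test portions.

    The split is done author by author, so every author appears in both
    sets with roughly the same proportion. The test size per author is
    round(n * test_fraction), which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    by_author = defaultdict(list)
    for author, text in records:
        by_author[author].append(text)

    train, test = [], []
    for author in sorted(by_author):  # sorted for reproducibility
        docs = list(by_author[author])
        rng.shuffle(docs)
        n_test = round(len(docs) * test_fraction)
        test.extend((author, d) for d in docs[:n_test])
        train.extend((author, d) for d in docs[n_test:])
    return train, test


# Toy usage with made-up authors and placeholder texts.
toy = (
    [("a", f"a-{i}") for i in range(10)]
    + [("b", f"b-{i}") for i in range(20)]
    + [("c", f"c-{i}") for i in range(3)]  # dropped: fewer than 10 papers
)
kept = filter_prolific(toy)
train_set, test_set = stratified_split(kept)
```

Because the split is computed per author, an author with 10 abstracts contributes 3 of them to the test set under this rounding rule; a different rounding choice would shift the overall proportions slightly.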
Files
arXiv_100authors_comp_sci.csv
Name | Size | MD5 |
---|---|---|
arXiv_100authors_comp_sci.csv | 1.3 MB | 53d8561cedfb08c55ae321c2ae4f9b0b |
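
After downloading, the MD5 checksum listed for the file can be used to confirm that the copy is intact. The following is a minimal sketch using only the Python standard library; the helper name `md5_of_file` and the local path in the usage comment are assumptions, since where the file is saved depends on the reader's setup.

```python
import hashlib

# Checksum as listed in this record for arXiv_100authors_comp_sci.csv.
EXPECTED_MD5 = "53d8561cedfb08c55ae321c2ae4f9b0b"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage (assumes the file was saved in the current directory):
# ok = md5_of_file("arXiv_100authors_comp_sci.csv") == EXPECTED_MD5
```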