AminerMag X Dataset
Authors/Creators
- 1. Argonne National Laboratory
- 2. Georgia Institute of Technology
- 3. Oak Ridge National Laboratory
- 4. Wake Forest University
Description
This dataset is a subset of the Microsoft Open Academic Graph (OAG), which unifies the Microsoft Academic Graph (MAG) and ArnetMiner (AMiner) academic graphs, containing 166,192,182 and 154,771,162 papers respectively. From OAG, a subset of 37,732,477 papers with available abstracts and citation information was selected. The abstracts were preprocessed with stop-word removal and stemming to form a vocabulary of 1,333 unique words. This vocabulary and the corpus of papers were used to form a sparse 1,333 × 37,732,477 term-document matrix with 1,295,114,641 nonzeros, in which each column represents a paper as a tf-idf vector. The resulting matrix was used as X in the real-world experiments. The symmetric graph Laplacian matrix S was then formed from the citation graph. Each of the 966,206,008 nonzeros of the resulting 37,732,477 × 37,732,477 matrix represents a citation between two papers.
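As a rough illustration of the construction described above, the following toy sketch builds a sparse term-document matrix whose columns are tf-idf vectors, together with the symmetric nonzero pattern of a citation graph. Everything in it is an assumption made for illustration: the function names `tfidf_columns` and `symmetrize` are hypothetical, the weighting used is raw term count times log(N/df), and the exact tf-idf variant and Laplacian normalization used for this dataset are not stated in the description.

```python
import math
from collections import Counter

def tfidf_columns(docs, vocab):
    """Return {(term_index, doc_index): weight}, a sparse term-document matrix.

    Assumed weighting (not confirmed by the dataset description):
    raw term count multiplied by log(n_docs / document_frequency).
    """
    index = {word: i for i, word in enumerate(vocab)}
    n_docs = len(docs)
    # Document frequency: number of documents containing each vocabulary term.
    df = Counter()
    for tokens in docs:
        df.update({t for t in tokens if t in index})
    entries = {}
    for j, tokens in enumerate(docs):
        counts = Counter(t for t in tokens if t in index)
        for term, tf in counts.items():
            weight = tf * math.log(n_docs / df[term])
            if weight != 0.0:  # store only nonzeros, as in a sparse matrix
                entries[(index[term], j)] = weight
    return entries

def symmetrize(citations):
    """Return the (i, j) nonzero positions of a symmetric citation pattern."""
    pattern = set()
    for src, dst in citations:
        if src != dst:  # ignore self-citations in this toy example
            pattern.add((src, dst))
            pattern.add((dst, src))
    return pattern

# Toy corpus: three already-stemmed, stop-word-free "abstracts".
docs = [["graph", "matrix", "graph"], ["matrix", "tensor"], ["tensor", "tensor", "graph"]]
vocab = ["graph", "matrix", "tensor"]
X = tfidf_columns(docs, vocab)            # keys are (row = term, column = paper)
S_pattern = symmetrize([(0, 1), (2, 0)])  # paper 0 cites 1, paper 2 cites 0
```

Storing only the nonzero entries mirrors why the real matrices are distributed in sparse form: most of the 1,333 vocabulary terms do not occur in any given abstract.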
Files
(44.2 GB)
| Checksum | Size |
|---|---|
| md5:6455adf4637ffcdc667a5f9fd02a5c17 | 301.9 MB |
| md5:569a619678bc5e68a9f1220304f98c31 | 43.9 GB |
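The MD5 checksums listed above can be used to confirm that a download completed without corruption. The sketch below uses only Python's standard `hashlib` module and reads the file in chunks, which matters for a file of tens of gigabytes. The helper names `md5_of_file` and `matches_checksum` are hypothetical, and so is the local file path in the usage comment, since the file names are not shown in this record.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading 1 MiB at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_checksum(path, expected):
    """Compare a file's MD5 with an expected value such as 'md5:6455...'."""
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected.lower()

# Usage (the path is a placeholder for wherever the download was saved):
# ok = matches_checksum("path/to/downloaded_file",
#                       "md5:6455adf4637ffcdc667a5f9fd02a5c17")
```

A mismatch usually indicates an incomplete or corrupted transfer, in which case downloading the file again is the simplest remedy.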
Additional details
Funding
- U.S. National Science Foundation
  - CAREER: Communication-Avoiding Tensor Decomposition Algorithms (1942892)
- U.S. National Science Foundation
  - Collaborative Research: OAC Core: Robust, Scalable, and Practical Low-Rank Approximation (2106920)