Published January 20, 2022 | Version 1.0
Dataset Open


  • 1. Martin-Luther-Universität Halle-Wittenberg
  • 2. Leipzig University


The Webis MS MARCO Anchor Text 2022 dataset enriches Versions 1 and 2 of the MS MARCO document collection with anchor text extracted from six Common Crawl snapshots. The snapshots cover the years 2016 to 2021 and contain between 1.7 and 3.4 billion documents each. For documents with more than 1,000 anchor texts, we sampled 1,000 anchor texts at random; for documents with at most 1,000 anchor texts, we kept all of them. As a result, all anchor text is included for 94% of the documents in Version 1 and 97% of the documents in Version 2. Overall, the dataset enriches 1,703,834 documents in Version 1 and 4,821,244 documents in Version 2 with anchor text.
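The per-document sampling rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' extraction pipeline (which is described in the associated paper): the function name `sample_anchor_texts`, the in-memory dictionary layout, and the fixed `seed` are hypothetical choices made for the example.

```python
import random
from typing import Dict, List

# Cap taken from the dataset description: at most 1,000 anchor texts per document.
MAX_ANCHORS_PER_DOC = 1000


def sample_anchor_texts(
    anchors_by_doc: Dict[str, List[str]],
    cap: int = MAX_ANCHORS_PER_DOC,
    seed: int = 0,  # hypothetical: fixed seed so the sketch is reproducible
) -> Dict[str, List[str]]:
    """Keep all anchor texts for small documents, sample `cap` for large ones."""
    rng = random.Random(seed)
    sampled: Dict[str, List[str]] = {}
    for doc_id, anchors in anchors_by_doc.items():
        if len(anchors) > cap:
            # More than `cap` anchor texts: draw `cap` of them without replacement.
            sampled[doc_id] = rng.sample(anchors, cap)
        else:
            # At most `cap` anchor texts: keep every one of them.
            sampled[doc_id] = list(anchors)
    return sampled
```

Under this rule, a document's anchor text is complete exactly when it had at most `cap` anchor texts before sampling, which is the case for the 94% (Version 1) and 97% (Version 2) of documents mentioned above.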

Cleaned versions of the MS MARCO Anchor Text 2022 dataset are available in ir_datasets, Zenodo and Hugging Face. The raw dataset with additional information and all metadata for the extracted anchor texts (roughly 100GB) is available on Hugging Face and

The construction of the Webis MS MARCO Anchor Text 2022 dataset is described in detail in the associated paper. If you use this dataset, please cite:
  address =               {Berlin Heidelberg New York},
  author =                {Maik Fr{\"o}be and Sebastian G{\"u}nther and Maximilian Probst and Martin Potthast and Matthias Hagen},
  booktitle =             {Advances in Information Retrieval. 44th European Conference on IR Research (ECIR 2022)},
  editor =                {Matthias Hagen and Suzan Verberne and Craig Macdonald and Christin Seifert and Krisztian Balog and Kjetil N{\o}rv\r{a}g and Vinay Setty},
  month =                 apr,
  publisher =             {Springer},
  series =                {Lecture Notes in Computer Science},
  site =                  {Stavanger, Norway},
  title =                 {{The Power of Anchor Text in the Neural Retrieval Era}},
  year =                  2022


Files (3.4 GB)

Size
99.4 MB
139.2 MB
135.8 MB
230.8 MB
171.3 MB
254.8 MB
145.9 MB
238.4 MB
173.6 MB
284.2 MB
144.0 MB
253.6 MB
407.9 MB
751.7 MB