Published January 20, 2022 | Version 1.0
Dataset Open

Webis-MS-MARCO-Anchor-Texts-22

  • 1. Martin-Luther-Universität Halle-Wittenberg
  • 2. Leipzig University

Description

The Webis MS MARCO Anchor Text 2022 dataset enriches Versions 1 and 2 of the MS MARCO document collection with anchor text extracted from six Common Crawl snapshots. The snapshots cover the years 2016 to 2021 and contain between 1.7 and 3.4 billion documents each. For documents with more than 1,000 anchor texts, we sampled 1,000 anchor texts at random; for documents with fewer than 1,000 anchor texts, we kept all of them. As a result, all anchor text is included for 94% of the documents in Version 1 and 97% of the documents in Version 2. Overall, the MS MARCO Anchor Text 2022 dataset enriches 1,703,834 documents for Version 1 and 4,821,244 documents for Version 2 with anchor text.
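The per-document sampling rule described above can be illustrated with a minimal Python sketch. This is not the authors' extraction code: the function name `sample_anchor_texts`, the `seed` parameter, and the choice to keep all anchor texts for a document with exactly 1,000 of them are assumptions made for illustration.

```python
import random

# Cap stated in the dataset description.
MAX_ANCHOR_TEXTS = 1000


def sample_anchor_texts(anchor_texts, cap=MAX_ANCHOR_TEXTS, seed=None):
    """Keep every anchor text if there are at most `cap` of them;
    otherwise draw `cap` anchor texts uniformly at random without replacement.

    Assumption: a document with exactly `cap` anchor texts keeps all of them
    (the description does not spell out this boundary case).
    """
    if len(anchor_texts) <= cap:
        return list(anchor_texts)
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    return rng.sample(anchor_texts, cap)


# Hypothetical inputs: one rarely linked and one heavily linked document.
few = ["home page", "click here", "official site"]
many = [f"anchor text {i}" for i in range(2500)]

kept_few = sample_anchor_texts(few)
kept_many = sample_anchor_texts(many, seed=0)
print(len(kept_few), len(kept_many))
```

Under this rule, only documents above the cap lose anchor texts, which is why the description can report that the full anchor text is retained for the large majority of documents.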

Cleaned versions of the MS MARCO Anchor Text 2022 dataset are available in ir_datasets, on Zenodo, and on Hugging Face. The raw dataset with additional information and all metadata for the extracted anchor texts (roughly 100 GB) is available on Hugging Face and files.webis.de.

The construction of the Webis MS MARCO Anchor Text 2022 dataset is described in detail in the associated paper. If you use this dataset, please cite:
@InProceedings{froebe:2022a,
  address =               {Berlin Heidelberg New York},
  author =                {Maik Fr{\"o}be and Sebastian G{\"u}nther and Maximilian Probst and Martin Potthast and Matthias Hagen},
  booktitle =             {Advances in Information Retrieval. 44th European Conference on IR Research (ECIR 2022)},
  editor =                {Matthias Hagen and Suzan Verberne and Craig Macdonald and Christin Seifert and Krisztian Balog and Kjetil N{\o}rv\r{a}g and Vinay Setty},
  month =                 apr,
  publisher =             {Springer},
  series =                {Lecture Notes in Computer Science},
  site =                  {Stavanger, Norway},
  title =                 {{The Power of Anchor Text in the Neural Retrieval Era}},
  year =                  2022
}

Files

Files (3.4 GB)

MD5 checksum                          Size
md5:20162fb6366d949c7bbb2a44a790a757  99.4 MB
md5:b5b9cbaa8a8e2c0583e38ada924b4662  139.2 MB
md5:77d5120a27aa4687bbf3e70d1fe4cf11  135.8 MB
md5:9d8a7fc9d2b3966ffb4d95cf8433fac4  230.8 MB
md5:1db0099e7b095dc4c402ab96a930a529  171.3 MB
md5:fad1572ac5e3ae1074e9b5bf7a873d39  254.8 MB
md5:39897e3d9bac254d43eb39e025701771  145.9 MB
md5:0e9d6e2fc7cd60fd06035ec1e611d384  238.4 MB
md5:a7a52b20bde6309b2d6833bd5322c143  173.6 MB
md5:29044faf51e9a097151d7a6812bdf9a2  284.2 MB
md5:26ea8edbfbeaeae49e0eb89dedcd6157  144.0 MB
md5:c36ac19670388a550d60dcb12d7fcf4e  253.6 MB
md5:4f7af19c455976f7c2606b97ffb7a89f  407.9 MB
md5:8b96dbaf4efcae08e0ee307e03f3434d  751.7 MB
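
The MD5 checksums listed above can be used to check that a download completed without corruption. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `matches_listed_checksum` and the file name in the usage comment are hypothetical, since the listing here does not show the file names.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so that large files do not have to fit into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex digest>'
    (the form used in the file listing) or as a bare hex digest."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Usage (hypothetical local file name; substitute the file you downloaded):
# ok = matches_listed_checksum("downloaded-file", "md5:20162fb6366d949c7bbb2a44a790a757")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the file again is the simplest remedy.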