Published March 23, 2023 | Version Document Processors Version 1.0.0
Dataset Open

1st International Workshop on Open Web Search #wows2024 at ECIR 2024: Document Processors

Description

The First International Workshop on Open Web Search (WOWS), hosted at [ECIR 2024](https://www.ecir2024.org/), aimed to promote and discuss ideas and approaches to open up the web search ecosystem so that small research groups and young startups can leverage the web to foster an open and diverse search market. The workshop had two calls that support collaborative and open web search engines: (1) for scientific contributions, and (2) for open-source implementations. This repository collects the outputs of all submitted document processing components on public datasets for the second call, which aims to gather open-source prototypes and gain practical experience with collaborative, cooperative evaluation of search engines and their components using the [TIREx Information Retrieval Evaluation Platform](https://www.tira.io/tirex) hosted on [TIRA](https://www.tira.io).

 

 

Citations

If you reuse these resources, please cite TIRA, TIREx, and the corresponding datasets. The corresponding BibTeX entries are:

For TIREx:

@InProceedings{froebe:2023e,
  author =                   {Maik Fr{\"o}be and {Jan Heinrich} Reimer and Sean MacAvaney and Niklas Deckers and Simon Reich and Janek Bevendorff and Benno Stein and Matthias Hagen and Martin Potthast},
  booktitle =                {46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023)},
  doi =                      {10.1145/3539618.3591888},
  editor =                   {Hsin{-}Hsi Chen and Wei{-}Jou (Edward) Duh and Hen{-}Hsen Huang and Makoto P. Kato and Josiane Mothe and Barbara Poblete},
  ids =                      {potthast:2023t},
  isbn =                     {9781450394086},
  month =                    jul,
  numpages =                 11,
  pages =                    {2826--2836},
  publisher =                {ACM},
  site =                     {Taipei, Taiwan},
  title =                    {{The Information Retrieval Experiment Platform}},
  url =                      {https://dl.acm.org/doi/10.1145/3539618.3591888},
  year =                     2023
}

For TIRA:

@InProceedings{froebe:2023b,
  address =                  {Berlin Heidelberg New York},
  author =                   {Maik Fr{\"o}be and Matti Wiegmann and Nikolay Kolyada and Bastian Grahm and Theresa Elstner and Frank Loebe and Matthias Hagen and Benno Stein and Martin Potthast},
  booktitle =                {Advances in Information Retrieval. 45th European Conference on {IR} Research ({ECIR} 2023)},
  doi =                      {10.1007/978-3-031-28241-6_20},
  editor =                   {Jaap Kamps and Lorraine Goeuriot and Fabio Crestani and Maria Maistro and Hideo Joho and Brian Davis and Cathal Gurrin and Udo Kruschwitz and Annalina Caputo},
  ids =                      {potthast:2023h},
  month =                    apr,
  pages =                    {236--241},
  publisher =                {Springer},
  series =                   {Lecture Notes in Computer Science},
  site =                     {Dublin, Ireland},
  title =                    {{Continuous Integration for Reproducible Shared Tasks with TIRA.io}},
  url =                      {https://link.springer.com/chapter/10.1007/978-3-031-28241-6_20},
  year =                     2023
}


All document processors are described in the corresponding WOWS papers; please cite the papers and underlying approaches accordingly.

Furthermore, please cite the datasets that you use.

Args.me

If you re-use the Args.me indices, please additionally cite:

@InProceedings{bondarenko:2021d,
  address =                  {Berlin Heidelberg New York},
  author =                   {Alexander Bondarenko and Lukas Gienapp and Maik Fr{\"o}be and Meriem Beloucif and Yamen Ajjour and Alexander Panchenko and Chris Biemann and Benno Stein and Henning Wachsmuth and Martin Potthast and Matthias Hagen},
  booktitle =                {Experimental IR Meets Multilinguality, Multimodality, and Interaction. 12th International Conference of the CLEF Association (CLEF 2021)},
  editor =                   {{K. Sel{\c{c}}uk} Candan and Bogdan Ionescu and Lorraine Goeuriot and Henning M{\"u}ller and Alexis Joly and Maria Maistro and Florina Piroi and Guglielmo Faggioli and Nicola Ferro},
  ids =                      {potthast:2021t},
  month =                    sep,
  pages =                    {450--467},
  publisher =                {Springer},
  series =                   {Lecture Notes in Computer Science},
  site =                     {Bucharest, Romania},
  title =                    {{Overview of Touch{\'e} 2021: Argument Retrieval}},
  volume =                   12880,
  year =                     2021
}

@InProceedings{bondarenko:2022f,
  address =                  {Berlin Heidelberg New York},
  author =                   {Alexander Bondarenko and Maik Fr{\"o}be and Johannes Kiesel and Shahbaz Syed and Timon Gurcke and Meriem Beloucif and Alexander Panchenko and Chris Biemann and Benno Stein and Henning Wachsmuth and Martin Potthast and Matthias Hagen},
  booktitle =                {Experimental IR Meets Multilinguality, Multimodality, and Interaction. 13th International Conference of the CLEF Association (CLEF 2022)},
  editor =                   {Alberto Barr{\'o}n-Cede{\~n}o and Giovanni Da San Martino and Mirko Degli Esposti and Fabrizio Sebastiani and Craig Macdonald and Gabriella Pasi and Allan Hanbury and Martin Potthast and Guglielmo Faggioli and Nicola Ferro},
  ids =                      {potthast:2022j},
  month =                    sep,
  numpages =                 29,
  publisher =                {Springer},
  series =                   {Lecture Notes in Computer Science},
  site =                     {Bologna, Italy},
  title =                    {{Overview of Touch{\'e} 2022: Argument Retrieval}},
  year =                     2022
}

Antique

If you re-use the Antique indices, please additionally cite:

@inproceedings{hashemi:2020,
  author    = {Helia Hashemi and Mohammad Aliannejadi and Hamed Zamani and W. Bruce Croft},
  editor    = {Joemon M. Jose and Emine Yilmaz and Jo{\~{a}}o Magalh{\~{a}}es and Pablo Castells and Nicola Ferro and M{\'{a}}rio J. Silva and Fl{\'{a}}vio Martins},
  title     = {{ANTIQUE:} {A} Non-factoid Question Answering Benchmark},
  booktitle = {Advances in Information Retrieval - 42nd European Conference on {IR} Research, {ECIR} 2020, Lisbon, Portugal, April 14-17, 2020, Proceedings, Part {II}},
  series    = {Lecture Notes in Computer Science},
  volume    = {12036},
  pages     = {166--173},
  publisher = {Springer},
  year      = {2020},
}


CORD-19

If you re-use the CORD-19 indices, please additionally cite:

@article{voorhees:2020,
  author    = {Ellen M. Voorhees and Tasmeer Alam and Steven Bedrick and Dina Demner{-}Fushman and William R. Hersh and Kyle Lo and Kirk Roberts and Ian Soboroff and Lucy Lu Wang},
  title     = {{TREC-COVID:} constructing a pandemic information retrieval test collection},
  journal   = {{SIGIR} Forum},
  volume    = {54},
  number    = {1},
  pages     = {1:1--1:12},
  year      = {2020},
}

@article{wang:2020,
  author    = {Lucy Lu Wang and Kyle Lo and Yoganand Chandrasekhar and Russell Reas and Jiangjiang Yang and Darrin Eide and Kathryn Funk and Rodney Kinney and Ziyang Liu and William Merrill and Paul Mooney and Dewey A. Murdick and Devvret Rishi and Jerry Sheehan and Zhihong Shen and Brandon Stilson and Alex D. Wade and Kuansan Wang and Chris Wilhelm and Boya Xie and Douglas Raymond and Daniel S. Weld and Oren Etzioni and Sebastian Kohlmeier},
  title     = {{CORD-19:} The Covid-19 Open Research Dataset},
  journal   = {CoRR},
  volume    = {abs/2004.10706},
  year      = {2020},
  eprinttype = {arXiv},
  eprint    = {2004.10706},
}

Cranfield

If you re-use the Cranfield indices, please additionally cite:

@inproceedings{cleverdon:1967,
  title={The {C}ranfield tests on index language devices},
  author={Cleverdon, Cyril},
  booktitle={{ASLIB} Proceedings},
  year={1967},
  pages     = {173--192},
  organization={MCB UP Ltd. (Reprinted in Readings in Information Retrieval, Karen Sparck-Jones and Peter Willett, editors, Morgan Kaufmann, 1997)}
}

@inproceedings{cleverdon:1991,
  author    = {Cyril W. Cleverdon},
  editor    = {Abraham Bookstein and Yves Chiaramella and Gerard Salton and Vijay V. Raghavan},
  title     = {The Significance of the {C}ranfield Tests on Index Languages},
  booktitle = {Proceedings of the 14th Annual International {ACM} {SIGIR} Conference on Research and Development in Information Retrieval. Chicago, Illinois, USA, October 13-16, 1991 (Special Issue of the {SIGIR} Forum)},
  pages     = {3--12},
  publisher = {{ACM}},
  year      = {1991},
}

Medline TREC Genomics

If you re-use the Medline TREC Genomics indices, please additionally cite:

@inproceedings{hersh:2004,
  author    = {William R. Hersh and Ravi Teja Bhupatiraju and L. Ross and Aaron M. Cohen and Dale Kraemer and Phoebe Johnson},
  editor    = {Ellen M. Voorhees and Lori P. Buckland},
  title     = {{TREC} 2004 Genomics Track Overview},
  booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
  series    = {{NIST} Special Publication},
  volume    = {500-261},
  publisher = {National Institute of Standards and Technology {(NIST)}},
  year      = {2004},
}

@inproceedings{hersh:2005,
  author    = {William R. Hersh and Aaron M. Cohen and Jianji Yang and Ravi Teja Bhupatiraju and Phoebe M. Roberts and Marti A. Hearst},
  editor    = {Ellen M. Voorhees and Lori P. Buckland},
  title     = {{TREC} 2005 Genomics Track Overview},
  booktitle = {Proceedings of the Fourteenth Text REtrieval Conference, {TREC} 2005, Gaithersburg, Maryland, USA, November 15-18, 2005},
  series    = {{NIST} Special Publication},
  volume    = {500-266},
  publisher = {National Institute of Standards and Technology {(NIST)}},
  year      = {2005},
}



Medline TREC Precision Medicine

If you re-use the Medline TREC Precision Medicine indices, please additionally cite:

@inproceedings{roberts:2017,
  author    = {Kirk Roberts and Dina Demner{-}Fushman and Ellen M. Voorhees and William R. Hersh and Steven Bedrick and Alexander J. Lazar and Shubham Pant},
  editor    = {Ellen M. Voorhees and Angela Ellis},
  title     = {Overview of the {TREC} 2017 Precision Medicine Track},
  booktitle = {Proceedings of The Twenty-Sixth Text REtrieval Conference, {TREC} 2017, Gaithersburg, Maryland, USA, November 15-17, 2017},
  series    = {{NIST} Special Publication},
  volume    = {500-324},
  publisher = {National Institute of Standards and Technology {(NIST)}},
  year      = {2017},
}

@inproceedings{roberts:2018,
  author    = {Kirk Roberts and Dina Demner{-}Fushman and Ellen M. Voorhees and William R. Hersh and Steven Bedrick and Alexander J. Lazar},
  editor    = {Ellen M. Voorhees and Angela Ellis},
  title     = {Overview of the {TREC} 2018 Precision Medicine Track},
  booktitle = {Proceedings of the Twenty-Seventh Text REtrieval Conference, {TREC} 2018, Gaithersburg, Maryland, USA, November 14-16, 2018},
  series    = {{NIST} Special Publication},
  volume    = {500-331},
  publisher = {National Institute of Standards and Technology {(NIST)}},
  year      = {2018},
}


MS MARCO (TREC Deep Learning 2019 and 2020)

If you re-use the MS MARCO indices, please additionally cite:

@inproceedings{craswell:2019,
  author    = {Nick Craswell and Bhaskar Mitra and Emine Yilmaz and Daniel Campos and Ellen M. Voorhees},
  booktitle = {28th International Text Retrieval Conference, {TREC} 2019, Gaithersburg, Maryland, USA},
  editor    = {{Ellen M.} Voorhees and Angela Ellis},
  month     = nov,
  title     = {{Overview of the {TREC} 2019 Deep Learning Track}},
  publisher = {National Institute of Standards and Technology (NIST)},
  series    = {NIST Special Publication},
  year      = {2019}
}

@inproceedings{craswell:2020,
  author    = {Nick Craswell and Bhaskar Mitra and Emine Yilmaz and Daniel Campos},
  editor    = {Ellen M. Voorhees and Angela Ellis},
  title     = {{Overview of the {TREC} 2020 Deep Learning Track}},
  booktitle = {Proceedings of the 29th Text REtrieval Conference, {TREC} 2020, Virtual Event, Gaithersburg, MD, USA, November 16-20, 2020},
  series    = {{NIST} Special Publication},
  volume    = {1266},
  publisher = {National Institute of Standards and Technology {(NIST)}},
  year      = {2020},
}
 

NFCorpus

If you re-use the NFCorpus indices, please additionally cite:

@inproceedings{boteva:2016,
  author    = {Vera Boteva and Demian Gholipour Ghalandari and Artem Sokolov and Stefan Riezler},
  editor    = {Nicola Ferro and Fabio Crestani and Marie{-}Francine Moens and Josiane Mothe and Fabrizio Silvestri and Giorgio Maria Di Nunzio and Claudia Hauff and Gianmaria Silvello},
  title     = {A Full-Text Learning to Rank Dataset for Medical Information Retrieval},
  booktitle = {Advances in Information Retrieval - 38th European Conference on {IR} Research, {ECIR} 2016, Padua, Italy, March 20-23, 2016. Proceedings},
  series    = {Lecture Notes in Computer Science},
  volume    = {9626},
  pages     = {716--722},
  publisher = {Springer},
  year      = {2016},
}

LongEval

Please cite the [corresponding dataset](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-5151):

@misc{11234/1-5151,
 title = {{LongEval} Click-Model Relevance Judgements (Qrels)},
 author = {Galu{\v s}{\v c}{\'a}kov{\'a}, Petra and Devaud, Romain and Gonzalez-Saez, Gabriela and Mulhem, Philippe and Goeuriot, Lorraine and Piroi, Florina and Popel, Martin},
 url = {http://hdl.handle.net/11234/1-5151},
 note = {{LINDAT}/{CLARIAH}-{CZ} digital library at the Institute of Formal and Applied Linguistics ({{\'U}FAL}), Faculty of Mathematics and Physics, Charles University},
 copyright = {Qwant {LongEval} Attribution-{NonCommercial}-{ShareAlike} License},
 year = {2023}
}

The index can be re-used in the [LongEval 2024](https://clef-longeval.github.io/) shared task hosted at [CLEF 2024](https://clef2024.imag.fr/). The documents (and thereby the derived PyTerrier index) are under the Qwant LongEval Attribution-NonCommercial-ShareAlike License; by reusing the indices, you accept and agree to the terms of this license.

Files

2024-03-19-17-50-12.zip

Files (1.7 GB)

Checksum (MD5)                        Size
md5:07d8fd9ab63569f6ea2a80aee86db55a  55.8 kB
md5:6698ba27029448dac2a8a7293d0376a2  20.0 MB
md5:f22fdcb8c25255e64a2380de42414689  20.0 MB
md5:b019d109841dfce4db65bb315750024b  17.6 MB
md5:00d788fd9ccc6eba51558800ba94c731  460.2 kB
md5:44504a96a16e4e4eaef760025cb9c91b  63.7 MB
md5:e23d2535cc67c7bafe513c40510da6fa  95.1 MB
md5:9b97188e7d7383d1cea4dc6630c37081  63.7 MB
md5:047e5276a69ab7353f584ab08e817539  217.0 kB
md5:1895e01418d7170ac02be6ac1a9a185f  43.7 MB
md5:ea47bb4a0f7d8999db1e510d45096214  64.7 MB
md5:365ff525cca8302608c2df113eaad170  7.8 MB
md5:3d07a6c1364534a3c62825316703845a  373.7 MB
md5:93d2166f8498bb664afc1782ffdf2106  507.3 MB
md5:290a3fb49d516b4dee3c1ddcf61c0884  97.7 MB
md5:467241170d83d8320df5207a20c95454  66.9 MB
md5:cb813f8d7193b436132e035613f983d4  102.3 MB
md5:4c6a959deadcbd8a4327c10270ff5b63  97.7 MB
md5:0fe6df76b5fd121c395a5008c0485447  28.4 MB
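
The MD5 checksums listed above can be used to check that a downloaded file arrived intact. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `matches_checksum` are illustrative assumptions and are not part of TIRA, TIREx, or this record.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 digest with an expected value.

    `expected` may be written with or without the 'md5:' prefix
    used in the file listing (hypothetical helper, for illustration only).
    """
    expected_hex = expected.strip().lower().split(":", 1)[-1]
    return md5_of_file(path) == expected_hex
```

For example, calling `matches_checksum(local_path, "md5:...")` with the path of a downloaded file and its checksum from the listing returns `True` when the two agree. Reading in chunks keeps memory use constant even for the larger archives.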

Additional details

Software

Repository URL: https://github.com/OpenWebSearch/wows-code/tree/main/ecir24
Programming language: Python
Development status: Active