Published September 10, 2024 | Version 2.0.0
Software Open

ChatNoir Resiliparse

  • 1. Bauhaus-Universität Weimar
  • 2. Friedrich-Schiller-Universität Jena
  • 3. Leipzig University and ScaDS.AI

Description

This package contains ChatNoir Resiliparse, a collection of robust and fast processing tools for parsing and analyzing web archive data.

Paper Abstract Elastic ChatNoir

Elastic ChatNoir is an Elasticsearch-based search engine offering a freely accessible search interface for the two ClueWeb corpora and the Common Crawl, together about 3 billion web pages. Running across 130 nodes, Elastic ChatNoir features subsecond response times comparable to commercial search engines. Unlike most commercial search engines, it also offers a powerful API that is available free of charge to IR researchers. Elastic ChatNoir’s main purpose is to serve as a baseline for reproducible IR experiments and user studies for the coming years, empowering research at a scale not attainable to many labs beforehand, and to provide a platform for experimenting with new approaches to web search.

Paper Abstract FastWARC

Web search and other large-scale web data analytics rely on processing archives of web pages stored in a standardized and efficient format. Since its introduction in 2008, the IIPC's Web ARCive (WARC) format has become the standard format for this purpose. As a list of individually compressed records of HTTP requests and responses, it allows for constant-time random access to all kinds of web data via off-the-shelf open source parsers in many programming languages, such as WARCIO, the de-facto standard for Python. When processing web archives at the terabyte or petabyte scale, however, even small inefficiencies in these tools add up quickly, resulting in hours, days, or even weeks of wasted compute time. Reviewing the basic components of WARCIO and analyzing its bottlenecks, we proceed to build FastWARC, a new high-performance WARC processing library for Python, written in C++/Cython, which yields performance improvements by factors of 1.6-8x.
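The random-access property mentioned in the abstract can be illustrated with a toy sketch that uses only the Python standard library. This is not the FastWARC or WARCIO implementation, and the payloads below are not valid WARC records. The helper names `write_records` and `read_record_at` are hypothetical and exist only for this illustration. The sketch shows that when every record is its own gzip member, a reader that knows a record's byte offset can seek to it and decompress that record alone, without touching the preceding data.

```python
import gzip
import io
import zlib


def write_records(payloads):
    """Concatenate individually gzip-compressed payloads.

    Returns the combined bytes and the byte offset at which each
    compressed record starts (hypothetical helper for illustration).
    """
    buf = io.BytesIO()
    offsets = []
    for payload in payloads:
        offsets.append(buf.tell())          # remember where this member begins
        buf.write(gzip.compress(payload))   # one gzip member per record
    return buf.getvalue(), offsets


def read_record_at(stream, offset, chunk_size=4096):
    """Seek to `offset` and decompress exactly one gzip member."""
    stream.seek(offset)
    # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    out = bytearray()
    while not decomp.eof:                   # eof becomes True at the member's end
        chunk = stream.read(chunk_size)
        if not chunk:
            break                           # truncated input: return what we have
        out += decomp.decompress(chunk)
    return bytes(out)


# Toy payloads standing in for WARC records (assumed, simplified content).
payloads = [
    b"record 0: https://example.org/a",
    b"record 1: https://example.org/b",
    b"record 2: https://example.org/c",
]
archive, offsets = write_records(payloads)
second = read_record_at(io.BytesIO(archive), offsets[1])
```

In practice the offsets would come from an external index built in a separate pass, so a lookup costs one seek plus the decompression of a single record.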

Links

Link to papers on Springer and arXiv:

Link to papers on Webis:

Citation

If you use ChatNoir or any of its tools (like Resiliparse) for a publication, please be sure to cite our paper:

@InProceedings{bevendorff:2018,
  address =             {Berlin Heidelberg New York},
  author =              {Janek Bevendorff and Benno Stein and Matthias Hagen and Martin Potthast},
  booktitle =           {Advances in Information Retrieval. 40th European Conference on IR Research (ECIR 2018)},
  editor =              {Leif Azzopardi and Allan Hanbury and Gabriella Pasi and Benjamin Piwowarski},
  month =               mar,
  publisher =           {Springer},
  series =              {Lecture Notes in Computer Science},
  site =                {Grenoble, France},
  title =               {{Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl}},
  year =                2018
}

If you use FastWARC, you can also cite this paper:

@InProceedings{bevendorff:2021,
  author =                {Janek Bevendorff and Martin Potthast and Benno Stein},
  booktitle =             {3rd International Symposium on Open Search Technology (OSSYM 2021)},
  editor =                {Andreas Wagner and Christian Guetl and Michael Granitzer and Stefan Voigt},
  month =                 oct,
  publisher =             {International Open Search Symposium},
  site =                  {CERN, Geneva, Switzerland},
  title =                 {{FastWARC: Optimizing Large-Scale Web Archive Analytics}},
  year =                  2021
}

Files

chatnoir-resiliparse-v2.zip (899.2 kB)
md5:754576ef1277f0ffbd84d8a3989f986b

Additional details

Related works

Is derived from
Software: https://github.com/chatnoir-eu/chatnoir-resiliparse (URL)
Is published in
Conference paper: 10.1007/978-3-319-76941-7_83 (DOI)

Funding

European Commission
OpenWebSearch.EU - Piloting a Cooperative Open Web Search Infrastructure to Support Europe's Digital Sovereignty (grant no. 101070014)