ChatNoir Resiliparse
Authors/Creators
- 1. Bauhaus-Universität Weimar
- 2. Friedrich-Schiller-Universität Jena
- 3. Leipzig University and ScaDS.AI
Description
This package contains ChatNoir Resiliparse, a collection of robust and fast processing tools for parsing and analyzing web archive data.
Paper Abstract: Elastic ChatNoir
Elastic ChatNoir is an Elasticsearch-based search engine offering a freely accessible search interface for the two ClueWeb corpora and the Common Crawl, together about 3 billion web pages. Running across 130 nodes, Elastic ChatNoir features subsecond response times comparable to commercial search engines. Unlike most commercial search engines, it also offers a powerful API that is available free of charge to IR researchers. Elastic ChatNoir’s main purpose is to serve as a baseline for reproducible IR experiments and user studies for the coming years, empowering research at a scale not attainable to many labs beforehand, and to provide a platform for experimenting with new approaches to web search.
Paper Abstract: FastWARC
Web search and other large-scale web data analytics rely on processing archives of web pages stored in a standardized and efficient format. Since its introduction in 2008, the IIPC's Web ARCive (WARC) format has become the standard format for this purpose. As a list of individually compressed records of HTTP requests and responses, it allows for constant-time random access to all kinds of web data via off-the-shelf open source parsers in many programming languages, such as WARCIO, the de-facto standard for Python. When processing web archives at the terabyte or petabyte scale, however, even small inefficiencies in these tools add up quickly, resulting in hours, days, or even weeks of wasted compute time. Reviewing the basic components of WARCIO and analyzing its bottlenecks, we proceed to build FastWARC, a new high-performance WARC processing library for Python, written in C++/Cython, which yields performance improvements by factors of 1.6-8x.
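The random-access property mentioned in the abstract follows from compressing every record as its own gzip member. The following minimal sketch illustrates only that idea with the Python standard library; it does not use FastWARC or WARCIO, the toy records are not valid WARC records, and the names `records`, `offsets`, and `read_record` are hypothetical.

```python
import gzip
import io
import zlib

# Hypothetical toy payloads standing in for WARC records (not real WARC syntax).
records = [b"record-0: hello", b"record-1: web", b"record-2: archive"]

# Compress each record as a separate gzip member and note its start offset.
buf = io.BytesIO()
offsets = []
for rec in records:
    offsets.append(buf.tell())
    buf.write(gzip.compress(rec))
archive = buf.getvalue()


def read_record(data: bytes, offset: int) -> bytes:
    """Decompress exactly one gzip member that starts at `offset`."""
    # wbits=31 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(wbits=31)
    # Decompression stops at the end of the first member; bytes belonging
    # to later members end up in decompressor.unused_data.
    return decompressor.decompress(data[offset:])


# Jump straight to the second record without touching the first one.
print(read_record(archive, offsets[1]))
```

Given an index of offsets, any record can be reached with a single seek, which is why per-record compression is used despite its slightly worse compression ratio compared to compressing the whole file at once.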
Links
Links to the papers on Springer and arXiv:
- Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl
- FastWARC: Optimizing Large-Scale Web Archive Analytics
Links to the papers on Webis:
- Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl
- FastWARC: Optimizing Large-Scale Web Archive Analytics
Citation
If you use ChatNoir or any of its tools (like Resiliparse) for a publication, please be sure to cite our paper:
@InProceedings{bevendorff:2018,
address = {Berlin Heidelberg New York},
author = {Janek Bevendorff and Benno Stein and Matthias Hagen and Martin Potthast},
booktitle = {Advances in Information Retrieval. 40th European Conference on IR Research (ECIR 2018)},
editor = {Leif Azzopardi and Allan Hanbury and Gabriella Pasi and Benjamin Piwowarski},
month = mar,
publisher = {Springer},
series = {Lecture Notes in Computer Science},
site = {Grenoble, France},
title = {{Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl}},
year = 2018
}
If you use FastWARC, you can also cite this paper:
@InProceedings{bevendorff:2021,
author = {Janek Bevendorff and Martin Potthast and Benno Stein},
booktitle = {3rd International Symposium on Open Search Technology (OSSYM 2021)},
editor = {Andreas Wagner and Christian Guetl and Michael Granitzer and Stefan Voigt},
month = oct,
publisher = {International Open Search Symposium},
site = {CERN, Geneva, Switzerland},
title = {{FastWARC: Optimizing Large-Scale Web Archive Analytics}},
year = 2021
}
Files
| Name | Size | Checksum |
|---|---|---|
| chatnoir-resiliparse-v2.zip | 899.2 kB | md5:754576ef1277f0ffbd84d8a3989f986b |
Additional details
Related works
- Is derived from
  - Software: https://github.com/chatnoir-eu/chatnoir-resiliparse (URL)
- Is published in
  - Conference paper: 10.1007/978-3-319-76941-7_83 (DOI)
Software
- Repository URL: https://github.com/chatnoir-eu/chatnoir-resiliparse