warctika: warctika 1.0: First production release

Tom Nicholls

doi:10.5281/zenodo.12183

Published October 10, 2014 | Version v1.0

Software | Open

Affiliation: Oxford Internet Institute, University of Oxford

This library handles web crawl data fetched with the Heritrix web crawler (or other tools that produce WARC files). It extracts the plain text from structured formats and re-saves the data as WARC "conversion" records.

The primary use of this tool is to extract text from web crawl datasets for machine learning and supervised classification work.

WARC (Web ARChive) is a file format for storing web crawls: http://bibnum.bnf.fr/WARC/
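To make the idea of a "conversion" record concrete, here is a minimal sketch of the general technique, not warctika's own code. It uses only the Python standard library. The function and class names (`html_to_text`, `make_conversion_record`, `TextExtractor`) are hypothetical, and the simple HTML parser is a stand-in assumption for whatever text extractor the library actually uses. The record layout follows the WARC 1.0 format: a version line, named header fields, a blank line, the content block, and two trailing CRLFs.

```python
import uuid
from datetime import datetime, timezone
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect text nodes, skipping <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only visible, non-empty text.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def make_conversion_record(target_uri, refers_to_id, text):
    """Serialise plain text as a WARC/1.0 'conversion' record (bytes).

    refers_to_id is the WARC-Record-ID of the record the text was
    extracted from, e.g. the original 'response' record.
    """
    block = text.encode("utf-8")
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        ("WARC-Type", "conversion"),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", now),
        ("WARC-Target-URI", target_uri),
        ("WARC-Refers-To", refers_to_id),
        ("Content-Type", "text/plain; charset=utf-8"),
        # Content-Length counts the bytes of the block only.
        ("Content-Length", str(len(block))),
    ]
    head = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % kv for kv in headers) + "\r\n"
    return head.encode("utf-8") + block + b"\r\n\r\n"
```

A caller would pass the HTML payload of a crawled response through `html_to_text` and then hand the result, together with the original record's ID and URI, to `make_conversion_record`. Reading and writing the WARC files themselves is left to a WARC library such as hanzo.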

This code depends on the hanzo library, which can be installed with 'pip install warctools'. Beware that several old versions of it are published under different names in the package index.

At this stage the software should be considered feature-complete, though minor additions may follow.

Files

Name	Size	Checksum
warctika-v1.0.zip	24.5 kB	md5:be744df32c672ac10974957801b5f9f1
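
The md5 value listed for the archive can be used to check a download. The following is a small sketch using Python's standard `hashlib` module. The function names and the local file path are assumptions for illustration, and only the expected digest comes from the listing above.

```python
import hashlib

# Digest taken from the file listing for warctika-v1.0.zip.
EXPECTED_MD5 = "be744df32c672ac10974957801b5f9f1"


def md5_of_file(path, chunk_size=65536):
    """Return the hex md5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, expected=EXPECTED_MD5):
    """True if the file at path has the expected md5 digest."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_listing("warctika-v1.0.zip")` should return `True` for an intact copy saved under that name in the current directory.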

Additional details

Is supplement to: https://github.com/pmyteh/warctika/tree/v1.0

