Published November 29, 2022 | Version v0.3.4
Software Open

Living-with-machines/alto2txt

  • 1. The Alan Turing Institute
  • 2. EPCC, The University of Edinburgh
  • 3. British Library
  • 4. The Alan Turing Institute; RiCEP, Academy of Finland

Description

alto2txt: Extract plain text from newspapers

Converts XML (in METS 1.8/ALTO 1.4, METS 1.3/ALTO 1.4, BLN or UKP format) publications to plaintext articles and generates minimal metadata.

Full documentation and demo instructions.
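
To give a rough idea of what the XML-to-plaintext conversion involves, here is a minimal sketch in Python. It is not alto2txt's own implementation. It relies only on the ALTO convention that recognised words are stored in the CONTENT attribute of String elements nested inside TextLine elements. The function names and the sample document are hypothetical, and the sketch ignores METS, BLN and UKP inputs as well as metadata generation.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop any '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(xml_source: str) -> str:
    """Join the CONTENT of ALTO String elements, one output line per TextLine."""
    root = ET.fromstring(xml_source)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Only String children carry text; SP (space) and HYP elements are skipped.
        words = [
            child.attrib.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(word for word in words if word))
    return "\n".join(lines)


# Hypothetical, heavily simplified ALTO fragment used only for illustration.
SAMPLE = """<alto>
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="Hello"/><SP/><String CONTENT="world"/></TextLine>
    <TextLine><String CONTENT="Second"/><SP/><String CONTENT="line"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(alto_to_text(SAMPLE))
```

Matching on the local tag name rather than a fixed namespace keeps the sketch independent of the ALTO schema version, at the cost of being less strict than a schema-aware parser.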

Added

  • Added PyPI version and MIT license badges to README.md
  • Added pytest-cov with default options to assess documentation
  • Added isort to .pre-commit-config.yaml to sort imports consistently
  • Added pycln to .pre-commit-config.yaml to check unused imports
  • Added pycln configuration to pyproject.toml
  • Added alto2txt as a command line script in pyproject.toml

Changed

  • Switched from Apache v2.0 license to MIT license, in line with project recommendations.
  • Updated mypy in .pre-commit-config.yaml

Deprecated

  • Replace extract_publications_text.py with the alto2txt command line interface script specified in pyproject.toml

Removed

  • setup.py
  • requirements.txt

Fixed

  • Fixed python = ">3.6.0" in pyproject.toml rather than >3.7 for consistency with documentation
  • Fixed licensing ambiguity (now all should be MIT)
  • Fixed typos in README.md
  • Fixed superfluous imports via pycln in pre-commit

Files

Living-with-machines/alto2txt-v0.3.4.zip (1.0 MB)
md5:948ce84fac76ea3d50ccfe202633b38b
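
The md5 value listed for the archive can be used to check that a download is intact. The following is a minimal sketch using Python's standard hashlib module. The helper names and the local file name in the usage note are assumptions, not part of the record.

```python
import hashlib

# Checksum as listed for the release archive in this record.
EXPECTED_MD5 = "948ce84fac76ea3d50ccfe202633b38b"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected hex string."""
    return md5_of_file(path) == expected.lower()
```

Assuming the archive was saved locally as alto2txt-v0.3.4.zip (a hypothetical file name), calling verify_download("alto2txt-v0.3.4.zip") should return True for an uncorrupted copy. MD5 is adequate for detecting accidental corruption but is not a safeguard against deliberate tampering.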

Additional details

Related works

Is documented by
Software documentation: https://living-with-machines.github.io/alto2txt (URL)
Is supplement to
Software: https://github.com/Living-with-machines/alto2txt/tree/v0.3.4 (URL)

Funding

Living with Machines AH/S01179X/1
UK Research and Innovation