Published December 3, 2024 | Version v2.0.0
Software
Open
Trafilatura
Description
Breaking changes:
- Python 3.6 and 3.7 deprecated (#709)
- `bare_extraction()`:
  - now returns an instance of the `Document` class by default
  - `as_dict` deprecation warning → use the `.as_dict()` method on the return value (#730)
- `bare_extraction()` and `extract()`: `no_fallback` deprecation warning → use `fast` instead (#730)
- downloads: remove `decode` argument in `fetch_url()` → use `fetch_response` instead (#724)
- deprecated graphical user interface now removed (#713)
- extraction: move `max_tree_size` parameter to `settings.cfg` (#742)
- use type hinting (#721, #723, #748)
- see Python and CLI deprecations in the docs
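The changes to `bare_extraction()` listed above can be absorbed with a small compatibility layer. The following is a minimal sketch, not the library's own migration path: `result_to_dict` and `extract_fields` are hypothetical helper names, and the only Trafilatura calls assumed are `bare_extraction(..., fast=True)` and the `.as_dict()` method on its return value, both named in these notes.

```python
from typing import Any, Optional


def result_to_dict(result: Any) -> Optional[dict]:
    """Normalize an extraction result to a plain dict (hypothetical helper).

    Accepts None (nothing extracted), a dict (older behaviour), or any
    object exposing an ``as_dict()`` method (the 2.0 default).
    """
    if result is None:
        return None
    if isinstance(result, dict):
        return result
    return result.as_dict()


def extract_fields(html: str) -> Optional[dict]:
    """Run the extraction with 2.0-style arguments (hypothetical helper)."""
    # Imported lazily so result_to_dict() stays usable without the package.
    from trafilatura import bare_extraction

    # `fast=True` replaces the deprecated `no_fallback=True`.
    return result_to_dict(bare_extraction(html, fast=True))
```

Code that previously passed `as_dict=True` or indexed the result directly can call `result_to_dict()` on the return value instead, so the same code path works whether a dict or a `Document` instance comes back.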
Fixes:
- set `options.source` before raising error on empty doc tree by @dmoklaf (#707)
- robust encoding in `options.source` (#717)
- more robust mapping for conversion to HTML (#721)
- CLI downloads: use all information in settings file (#734)
- downloads: cleaner urllib3 code (#736)
- refine table markdown output by @unsleepy22 (#752)
- extraction fix: images in text nodes by @unsleepy22 (#757)
Metadata:
- more robust URL extraction (#710)
Command-line interface:
- CLI: print URLs early for feeds and sitemaps with `--list` with @gremid (#744)
- CLI: add 126 exit code for high error ratio (#747)
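A calling script can use the new exit code to tell a run with a high error ratio apart from other failures. This is a minimal sketch under stated assumptions: only the value 126 comes from these notes, treating 0 as success follows the usual process convention, and `classify_exit` and `run_and_classify` are hypothetical names.

```python
import subprocess
import sys
from typing import Sequence

HIGH_ERROR_RATIO = 126  # exit code named in the release notes


def classify_exit(code: int) -> str:
    """Map a process exit code to a coarse status label."""
    if code == 0:
        return "ok"  # assumption: conventional success code
    if code == HIGH_ERROR_RATIO:
        return "high-error-ratio"
    return "failed"


def run_and_classify(command: Sequence[str]) -> str:
    """Run a command and classify its exit code without raising on failure."""
    completed = subprocess.run(command, check=False)
    return classify_exit(completed.returncode)


# Demonstration with a stand-in child process that exits with 126,
# so the sketch runs without the CLI being installed.
demo = run_and_classify([sys.executable, "-c", "raise SystemExit(126)"])
```

In practice the command would be the actual CLI invocation, for example `["trafilatura", "-i", "list.txt", "-o", "txtfiles/"]`, where the file and directory names are illustrative.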
Maintenance:
- remove already deprecated functions and args (#716)
- add type hints (#723, #728)
- setup: use `pyproject.toml` file (#715)
- simplify code (#708, #709, #727)
- better debug messages in `main_extractor` (#714)
- evaluation: review data, update packages, add magic_html (#731)
- setup: explicit exports through `__all__` (#740)
- tests: extend coverage (#753)
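For readers unfamiliar with explicit exports, the sketch below shows the general Python mechanism: `__all__` determines which names a wildcard import exposes. It uses a throwaway in-memory module named `demo_pkg` (hypothetical, unrelated to the package's real layout) and does not reproduce the project's actual export list.

```python
import sys
import types

# Source of a throwaway module that declares its public API explicitly.
source = '''
__all__ = ["extract"]

def extract(text):
    return text.strip()

def internal_tool(text):
    return text
'''

# Build the module in memory and register it so it can be imported by name.
module = types.ModuleType("demo_pkg")
exec(source, module.__dict__)
sys.modules["demo_pkg"] = module

# A wildcard import only pulls in the names listed in __all__.
namespace = {}
exec("from demo_pkg import *", namespace)
exported = sorted(name for name in namespace if not name.startswith("__"))
```

Names left out of `__all__` remain reachable through explicit attribute access (`module.internal_tool` here); the list documents the intended public surface and does not enforce privacy.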
Documentation:
- fix link in `docs/index.html` by @nzw0301 (#711)
- remove docs from published packages (#743)
- update docs (#745)
Files
Name | Size |
---|---|
adbar/trafilatura-v2.0.0.zip (md5:e59e93e69b715077c664b4e2ea8aa5db) | 31.8 MB |
Additional details
Related works
- Is supplement to: Software: https://github.com/adbar/trafilatura/tree/v2.0.0 (URL)
Software
- Repository URL: https://github.com/adbar/trafilatura