Published January 16, 2020 | Version v0.6.1
Software
Open
adbar/htmldate: Find original and updated publication dates of web pages using common patterns, heuristics and robust extraction
Description
htmldate finds original and updated publication dates of any web page. All the steps needed from web page download to HTML parsing, scraping and text analysis are included.
In a nutshell, with Python:
    >>> from htmldate import find_date
    >>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
    '2016-12-23'
    >>> find_date('https://netzpolitik.org/2016/die-cider-connection-abmahnungen-gegen-nutzer-von-creative-commons-bildern', original_date=True)
    '2016-06-23'
On the command-line:
    $ htmldate -u "http://blog.python.org/2016/12/python-360-is-now-available.html"
    2016-12-23
Releases are used in production and are meant to be archived on Zenodo for reproducibility and citability.
For more information see htmldate.readthedocs.io
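To give a rough idea of what "common patterns" and "heuristics" can mean for date extraction, here is a minimal standard-library sketch. It is not htmldate's implementation: the names `guess_date`, `MetaDateParser` and `DATE_META_NAMES`, the list of meta keys, and the URL pattern are all assumptions chosen for illustration. It looks for a date in a few typical `<meta>` tags first and falls back to a `/YYYY/MM/DD/` pattern in the URL.

```python
import re
from datetime import date
from html.parser import HTMLParser
from typing import List, Optional

# Assumption: a small, hypothetical set of meta keys that often carry a date.
DATE_META_NAMES = {"article:published_time", "date", "dc.date", "pubdate", "datepublished"}

# YYYY-MM-DD anywhere in a string, and /YYYY/MM/DD/ inside a URL path.
ISO_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
URL_DATE = re.compile(r"/(\d{4})/(\d{2})/(\d{2})(?:/|$)")


class MetaDateParser(HTMLParser):
    """Collect the content of <meta> tags whose key looks date-related."""

    def __init__(self) -> None:
        super().__init__()
        self.candidates: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {key.lower(): (value or "") for key, value in attrs}
        key = (
            attributes.get("property")
            or attributes.get("name")
            or attributes.get("itemprop")
            or ""
        ).lower()
        if key in DATE_META_NAMES and attributes.get("content"):
            self.candidates.append(attributes["content"])


def _to_iso(match) -> Optional[str]:
    """Return the matched date as YYYY-MM-DD, or None if it is not a real date."""
    year, month, day = (int(group) for group in match.groups())
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        return None


def guess_date(html: str, url: Optional[str] = None) -> Optional[str]:
    """Heuristic 1: date-like meta tags. Heuristic 2: date pattern in the URL."""
    parser = MetaDateParser()
    parser.feed(html)
    for value in parser.candidates:
        match = ISO_DATE.search(value)
        if match and _to_iso(match):
            return _to_iso(match)
    if url:
        match = URL_DATE.search(url)
        if match and _to_iso(match):
            return _to_iso(match)
    return None
```

Called on a page containing `<meta property="article:published_time" content="2016-12-23T10:00:00+00:00">`, `guess_date` returns `'2016-12-23'`; with no usable markup and no dated URL it returns `None`. A real extractor has to handle many more markup variants, date formats and the distinction between original and updated dates, which is what the package itself is for.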
Files
Name | Size |
---|---|
adbar/htmldate-v0.6.1.zip (md5:6f6fcb96b7d0086bec22cd7a8682ea6a) | 1.2 MB |
Additional details
Related works
- Is supplement to
  - Software: https://github.com/adbar/htmldate/tree/v0.6.1 (URL)
  - Conference paper: https://konvens.org/proceedings/2019/papers/kaleidoskop/camera_ready_barbaresi.pdf (URL)
- References
  - Conference paper: https://hal.archives-ouvertes.fr/hal-01371704v2/document (URL)