There is a newer version of the record available.

Published January 16, 2020 | Version v0.6.1
Software Open

adbar/htmldate: : Find original and updated publication dates of web pages using common patterns, heuristics and robust extraction

  • 1. Berlin-Brandenburg Academy of Sciences

Description

htmldate finds original and updated publication dates of any web page. All the steps needed from web page download to HTML parsing, scraping and text analysis are included.

In a nutshell, with Python:

from htmldate import find_date
find_date('http://blog.python.org/2016/12/python-360-is-now-available.html') '2016-12-23'
find_date('https://netzpolitik.org/2016/die-cider-connection-abmahnungen-gegen-nutzer-von-creative-commons-bildern', original_date=True)
'2016-06-23'

On the command-line:

$ htmldate -u "http://blog.python.org/2016/12/python-360-is-now-available.html" 2016-12-23

Releases used in production and meant to be archived on Zenodo for reproducibility and citability.

For more information see htmldate.readthedocs.io

Files

adbar/htmldate-v0.6.1.zip

Files (1.2 MB)

Name Size Download all
md5:6f6fcb96b7d0086bec22cd7a8682ea6a
1.2 MB Preview Download

Additional details