Published August 12, 2025 | Version v6
Dataset
Open
A new, comprehensive database of all proceedings of the Australian Parliamentary Debates (1998-2022)
Description
This database contains the proceedings of every sitting day of the House of Representatives in the Australian Parliament from 2 March 1998 to 8 September 2022, in both CSV and Parquet formats. The data were parsed entirely from the XML Hansard transcripts available on the Australian Parliament website.
The database is distributed as the archive hansard-corpus.zip, which contains the full Hansard corpus twice: once in CSV form and once in Parquet form.
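As a minimal sketch of how one might inspect the archive before extracting it, the snippet below groups the entries of a zip file by extension, so the CSV and Parquet copies can be counted separately. The internal folder layout and file names of hansard-corpus.zip are not described here, so the function makes no assumption about them; `members_by_format` is a hypothetical helper name, and it uses only the Python standard library.

```python
import sys
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def members_by_format(archive):
    """Group the file entries of a zip archive by lowercase extension.

    `archive` may be a path or a binary file-like object.
    Directory entries are skipped.
    """
    groups = defaultdict(list)
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            suffix = PurePosixPath(info.filename).suffix.lower()
            groups[suffix].append(info.filename)
    return dict(groups)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (path supplied by the caller): python inspect_zip.py hansard-corpus.zip
    for suffix, names in sorted(members_by_format(sys.argv[1]).items()):
        print(suffix or "(no extension)", len(names))
```

If the two formats mirror each other one-to-one, the counts for `.csv` and `.parquet` should match; that expectation is an assumption and worth checking rather than relying on.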
Since the last version released on 6 July 2023, we have made the following updates:
- Standardized the formatting of the "name" column for consistency and completeness.
- Re-populated the "name.id", "uniqueID", and "gender" variables to correct for any errors due to parsing or Hansard transcription. The correct "name" and "name.id" mapping was identified using data from the ausPH R package.
- Added "member" and "senator" flag variables using data from the AustralianPoliticians R package.
- Manually fixed cases where an MP quoted someone else in their speech and the quotation was incorrectly split onto a new row.
Files
hansard-corpus.zip (787.6 MB)
Name | Size |
---|---|
hansard-corpus.zip (md5:88d9e352a8a63f061cd80dfdc60fe192) | 787.6 MB |
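The MD5 checksum listed for hansard-corpus.zip can be used to confirm that a download completed intact. The following is a minimal sketch using Python's standard `hashlib` module; `md5_of_file` is a hypothetical helper name, and the local path is supplied by the caller and assumed to point at the downloaded archive.

```python
import hashlib
import sys

# Checksum as listed for hansard-corpus.zip in the Files table.
EXPECTED_MD5 = "88d9e352a8a63f061cd80dfdc60fe192"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python verify_md5.py path/to/hansard-corpus.zip
    actual = md5_of_file(sys.argv[1])
    print("OK" if actual == EXPECTED_MD5 else f"MISMATCH: {actual}")
```

Reading in chunks keeps memory use flat, which matters for an archive of several hundred megabytes.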
Additional details
Software
- Repository URL: https://github.com/lindsaykatz/hansard-proj
- Programming language: R