Published August 12, 2025 | Version v6
Dataset Open

A new, comprehensive database of all proceedings of the Australian Parliamentary Debates (1998-2022)

  • 1. University of Toronto

Description

This database contains the proceedings of every sitting day of the House of Representatives of the Australian Parliament from 02 March 1998 to 08 September 2022, in both CSV and Parquet formats. The data were parsed entirely from the XML Hansard transcripts available on the Australian Parliament website.

The database is distributed as the archive hansard-corpus.zip, which contains the full Hansard corpus in both CSV and Parquet formats.
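As a rough sketch of how the archive could be inspected before extracting it, the following Python snippet groups its members by file extension using only the standard library. The internal folder layout and file names of hansard-corpus.zip are not described in this record, so the helper name `list_corpus_files` and any paths are illustrative assumptions.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def list_corpus_files(zip_path):
    """Group the members of a zip archive by lowercase file extension.

    Hypothetical helper: the record does not document the archive layout,
    so nothing here assumes particular folder or file names.
    """
    by_ext = defaultdict(list)
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            ext = PurePosixPath(name).suffix.lstrip(".").lower()
            by_ext[ext].append(name)
    return dict(by_ext)
```

Calling `list_corpus_files("hansard-corpus.zip")` on the downloaded archive should return a dictionary whose `"csv"` and `"parquet"` keys list the corresponding files, assuming the files carry those extensions.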

Since the last version released on 6 July 2023, we have made the following updates:

  1. Standardized the formatting of the "name" column for consistency and completeness.
  2. Re-populated the "name.id", "uniqueID", and "gender" variables to correct for any errors due to parsing or Hansard transcription. The correct "name" and "name.id" mapping was identified using data from the ausPH R package.
  3. Added "member" and "senator" flag variables using data from the AustralianPoliticians R package.
  4. Manually fixed any cases where an MP was quoting someone else in their speech, and that quotation was incorrectly separated onto a new row.
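One way to confirm that a file from this version carries the variables mentioned in the updates above is to read its header row and check for them. This is a minimal sketch that assumes an already extracted CSV file, UTF-8 encoding, and column names in the first row; the function name `missing_columns` is hypothetical, and the column list is taken only from the names quoted in this description.

```python
import csv

# Column names quoted in the update notes; the full schema is not listed here.
EXPECTED_COLUMNS = ["name", "name.id", "uniqueID", "gender", "member", "senator"]


def missing_columns(csv_path, expected=EXPECTED_COLUMNS):
    """Return the expected column names absent from the CSV header row.

    Assumes (not confirmed by the record) UTF-8 text with a header line.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    present = set(header)
    return [col for col in expected if col not in present]
```

An empty list means every expected column is present; a non-empty list may indicate an older version of the dataset or a different file layout.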

Files

hansard-corpus.zip

Size: 787.6 MB
md5:88d9e352a8a63f061cd80dfdc60fe192
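The md5 value listed for the archive can be used to check that a download completed without corruption. The sketch below computes the digest in chunks so the file is not loaded into memory at once; the function names and the local file path are assumptions, while the checksum is the one listed for hansard-corpus.zip.

```python
import hashlib

# Checksum as listed for hansard-corpus.zip in this record.
EXPECTED_MD5 = "88d9e352a8a63f061cd80dfdc60fe192"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_published_checksum("hansard-corpus.zip")` should return `True` for an intact copy of this version of the archive, assuming it was saved under that name in the working directory.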

Additional details

Software

Repository URL: https://github.com/lindsaykatz/hansard-proj
Programming language: R