Published October 8, 2025 | Version v7
Dataset Open

A new, comprehensive database of all proceedings of the Australian Parliamentary Debates (1998-2025)

  • 1. University of Toronto

Description

This database contains the proceedings of every sitting day of the House of Representatives in the Australian Parliament from 02 March 1998 to 31 July 2025, in Parquet format. These data were parsed entirely from the XML Hansard transcripts available on the Australian Parliament website.

The database is stored in the folder corpus_1998_to_2025.parquet, which contains the full Hansard corpus.

Since the previous version, released on 12 August 2025, we have made the following updates:
  1. Standardized and validated the electorate column.
  2. Standardized and validated the party abbreviation column.
  3. Added a column with the full party name.
  4. Improved question and answer flagging: rows beginning with phrases such as "My question is to" or "My question goes to" that had not been flagged as questions were identified and corrected.
  5. Fixed many instances of incorrectly flagged interjections.
  6. Moved detected interjections that were not on their own row onto separate rows.
  7. Validated that all MPs who were present on each sitting day were actually Members of Parliament on that day.
  8. Identified and fixed any rows with a null body.
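
The question-flagging fix in item 4 can be illustrated with a minimal sketch. This is not the project's own code (the repository is written in R). The column names `body` and `question`, and the helper names, are assumptions made for illustration, and the pattern covers only the two opener phrases quoted above.

```python
import re

# Matches the two opener phrases quoted in the changelog, ignoring case and
# leading whitespace. The real fix may cover more phrasings ("etc.").
QUESTION_OPENERS = re.compile(r"^\s*my question (is|goes) to\b", re.IGNORECASE)


def looks_like_question(body):
    """Return True when a row's text begins with a known question opener."""
    if body is None:
        return False
    return QUESTION_OPENERS.match(body) is not None


def flag_missed_questions(rows):
    """Return copies of the rows, setting 'question' to True where it was missed.

    Each row is a dict with the assumed keys 'body' and 'question'.
    Rows already flagged as questions are left unchanged.
    """
    fixed = []
    for row in rows:
        updated = dict(row)
        if not updated.get("question") and looks_like_question(updated.get("body")):
            updated["question"] = True
        fixed.append(updated)
    return fixed


# Illustrative rows, not taken from the corpus.
rows = [
    {"body": "My question is to the Prime Minister.", "question": False},
    {"body": "I thank the member for her question.", "question": False},
    {"body": None, "question": False},
]
flagged = flag_missed_questions(rows)
```

Anchoring the pattern at the start of the text keeps rows that merely mention a question, such as an answer thanking a member for a question, from being flagged.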

Files

Files (579.6 MB)

Name Size
md5:4910153c4c555ea7811887b1803f39e0
579.6 MB
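
A downloaded copy can be checked against the MD5 checksum listed above. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the local file path is left to the caller because the file name is not given here.

```python
import hashlib

# Checksum as published in the file listing.
EXPECTED_MD5 = "4910153c4c555ea7811887b1803f39e0"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, expected=EXPECTED_MD5):
    """Return True when the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

Reading in chunks keeps memory use low for a download of this size. Call `matches_published_checksum` with the path where the file was saved.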

Additional details

Software

Repository URL
https://github.com/lindsaykatz/hansard-proj
Programming language
R