Published February 22, 2022 | Version v1
Dataset Open

University of Notre Dame News: A Reading

Authors/Creators

  • 1. University of Notre Dame

Description

I have done a bit of analysis -- reading -- against the set of news stories distributed by the University of Notre Dame, and below is some of what I learned.

Notes

I used a program called wget to crawl the news site and cache the result. Upon closer inspection of the cache, I noticed that some of the Web pages were echoed and indexed in a number of auxiliary pages. I deleted the echoes and index pages, and I copied all of the news stories to a single directory. I then applied a tool called the Distant Reader Toolbox against the directory. This resulted in a data set of news stories, which I proceeded to analyze.
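The note above does not say how the echoes and index pages were identified, so the following is only a minimal sketch of one possible approach. It assumes that echoed pages are byte-for-byte duplicates and that index pages are named index.html. The function name collect_unique and its parameters are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def collect_unique(cache_dir, out_dir, skip_names=("index.html",)):
    """Copy each distinct HTML file under cache_dir into out_dir.

    Assumptions (not taken from the record): index pages are recognized
    by file name, and echoes are files whose bytes duplicate a file that
    has already been copied. Returns the list of copied paths.
    """
    cache_dir, out_dir = Path(cache_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    seen = set()
    copied = []
    # sorted() materializes the listing before any file is copied.
    for path in sorted(cache_dir.rglob("*.html")):
        if path.name in skip_names:
            continue  # assumed index page
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            continue  # assumed echo of a page already copied
        seen.add(digest)
        # Prefix with a short digest so same-named files from different
        # subdirectories do not overwrite each other.
        target = out_dir / f"{digest[:12]}-{path.name}"
        shutil.copy2(path, target)
        copied.append(target)
    return copied
```

Hashing file contents only catches exact copies. Pages that repeat a story inside different navigation markup would need a looser comparison, such as comparing the extracted text.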

Methods

All Distant Reader data sets ("study carrels") are created using the same method. First, a set of narrative files of just about any type and any number is saved in a folder/directory. Second, the plain text is pulled from each file and saved. Third, feature extraction is done against the plain text to create tab-delimited indexes of bibliographics, email addresses, URLs, parts of speech, named entities, and computed keywords. Fourth, all of the indexes are reduced to an SQLite database file. Finally, everything (the original files, the plain text files, the indexes, and the SQLite database) is compressed into a zip file for distribution. The result is a platform- and network-independent data set that can be read and processed by any number of GUI applications, programming languages, or a Python module called the Distant Reader Toolbox.
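Because the indexes are reduced to an SQLite database inside the zip file, a carrel can be inspected with nothing more than the Python standard library. The sketch below makes no claim about the carrel's actual layout or schema: it assumes only that the database is stored as a member whose name ends in ".db" (a guess), and it reports whatever tables it finds. The function name list_tables is hypothetical.

```python
import sqlite3
import tempfile
import zipfile


def list_tables(zip_path, suffix=".db"):
    """Return {table_name: row_count} for the first database in the zip.

    The ".db" suffix is an assumption about how the database is named;
    pass a different suffix if the archive uses another convention.
    """
    with zipfile.ZipFile(zip_path) as archive:
        members = [n for n in archive.namelist() if n.endswith(suffix)]
        if not members:
            raise FileNotFoundError(f"no member ending in {suffix!r}")
        with tempfile.TemporaryDirectory() as workdir:
            # Extract only the database, not the whole archive.
            db_file = archive.extract(members[0], path=workdir)
            connection = sqlite3.connect(db_file)
            try:
                names = [row[0] for row in connection.execute(
                    "SELECT name FROM sqlite_master "
                    "WHERE type = 'table' ORDER BY name")]
                counts = {}
                for name in names:
                    # Quote the identifier, since table names are unknown.
                    quoted = '"' + name.replace('"', '""') + '"'
                    counts[name] = connection.execute(
                        f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
                return counts
            finally:
                connection.close()
```

Calling list_tables("index.zip") on a downloaded copy would show which indexes were loaded and how many rows each holds, which is a reasonable first step before writing queries against specific tables.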

Files

index.zip (1.2 GB)
md5:6fe87e4e1d8ce9ebac879e5c006da224

Additional details

Software

Repository URL
https://github.com/ericleasemorgan/reader-toolbox
Programming language
Python
Development Status
Active