University of Notre Dame News: A Reading

Morgan, Eric Lease

doi:10.5281/zenodo.11475087

Published February 22, 2022 | Version v1

Dataset Open

Morgan, Eric Lease¹

1. University of Notre Dame

I have done a bit of analysis -- reading -- against the set of news distributed by the University of Notre Dame, and below is some of what I learned.

Notes

I used a program called wget to crawl the news site and cache the result. Upon closer inspection of the cache, I noticed that some of the Web pages were echoed and indexed in a number of auxiliary pages. I deleted the echoes and index pages, and I copied all of the news stories to a single directory. I then applied a tool called the Distant Reader Toolbox against the directory. This resulted in a data set of news stories which I proceeded to analyze.
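The clean-up step can be pictured with a small sketch. It is not the procedure actually used here, which is not described in detail. It assumes that an "echo" is a byte-identical copy of another page and that the cached pages end in .html; the function name collect_unique_pages and both directory arguments are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def collect_unique_pages(cache_dir, target_dir, pattern="*.html"):
    """Copy one instance of each distinct page from cache_dir into target_dir.

    Assumption (not from the source): an "echo" is a byte-identical duplicate.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    seen = set()
    copied = []
    # sorted() materializes the file list before anything is copied
    for path in sorted(Path(cache_dir).rglob(pattern)):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            continue  # same content already copied; skip the echo
        seen.add(digest)
        # Prefix the name with part of the digest so that same-named files
        # from different folders do not overwrite each other
        destination = target / f"{digest[:12]}-{path.name}"
        shutil.copy2(path, destination)
        copied.append(destination)
    return copied
```

Index pages that merely list stories would not be caught by a content hash and would still need to be removed by other means, such as by file name or by hand.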

Methods

All Distant Reader data sets ("study carrels") use the same method of creation. First, a set of narrative files of just about any type and any number is saved in a folder/directory. Second, the plain text is pulled from each file and saved. Third, feature extraction is done against the plain text to create tab-delimited indexes of bibliographics, email addresses, URLs, parts-of-speech, named-entities, and computed keywords. Fourth, all of the indexes are reduced to an SQLite database file. Finally, everything (the original files, the plain text files, the indexes, and the SQLite database) is compressed into a zip file for distribution. The result is a platform- and network-independent data set that can be read and processed by any number of GUI applications, programming languages, or a Python module called the Distant Reader Toolbox.
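To make the third and fourth steps more concrete, here is a minimal sketch of extracting two kinds of features (URLs and email addresses) from plain text, writing them as a tab-delimited index, and loading them into SQLite. It is an illustration only, not the Distant Reader Toolbox's implementation. The regular expressions are deliberately simple, and the function names, the column names (id, type, value), and the table name (features) are all hypothetical.

```python
import csv
import io
import re
import sqlite3

# Deliberately simple patterns, for illustration only
URL_PATTERN = re.compile(r"https?://[^\s<>]+")
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_features(doc_id, text):
    """Return (id, type, value) rows for the URLs and email addresses in text."""
    # Trailing sentence punctuation is trimmed from each matched URL
    rows = [(doc_id, "url", url.rstrip(".,;")) for url in URL_PATTERN.findall(text)]
    rows += [(doc_id, "email", address) for address in EMAIL_PATTERN.findall(text)]
    return rows


def to_tsv(rows):
    """Serialize rows as a tab-delimited index with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "type", "value"])  # hypothetical column names
    writer.writerows(rows)
    return buffer.getvalue()


def to_sqlite(rows):
    """Load rows into an in-memory SQLite database (hypothetical schema)."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE features (id TEXT, type TEXT, value TEXT)")
    connection.executemany("INSERT INTO features VALUES (?, ?, ?)", rows)
    connection.commit()
    return connection
```

Once the rows are in a database, questions such as "which documents mention a given URL?" become ordinary SQL queries, which is the practical benefit of reducing the indexes to SQLite.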

Files

Files (1.2 GB)

Name	Size	Checksum
index.zip	1.2 GB	md5:6fe87e4e1d8ce9ebac879e5c006da224
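Since the record lists an MD5 checksum for index.zip, a downloaded copy can be checked against it. The following is a small sketch using Python's standard hashlib module; the function names and the local file path are assumptions, and only the checksum value comes from the listing above.

```python
import hashlib

# Checksum listed for index.zip in this record
EXPECTED_MD5 = "6fe87e4e1d8ce9ebac879e5c006da224"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use low for a file of about 1.2 GB
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """Return True when the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, is_intact("index.zip") should return True for a complete download saved under that (assumed) name in the current directory.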

Additional details

Is described by: https://distantreader.org/ (URL)
Is identical to: http://carrels.distantreader.org/curated-notre_dame_news-2022/index.zip (URL)
Is part of: http://carrels.distantreader.org (URL)
Is variant form of: http://carrels.distantreader.org/curated-notre_dame_news-2022/ (URL)

Repository URL: https://github.com/ericleasemorgan/reader-toolbox
Programming language: Python
Development Status: Active

