Module 4 Exercises

There are many different tools and approaches you could use to visualize your data, both as a preliminary pass to spot the holes and also for more formal analysis. For this module, I would like you to select the two exercises that seem most germane to your own research project. You are welcome to work through more of them, of course, but I want the exercises to move your own research forward. Some of these I wrote; some are adapted from The Macroscope; others are adapted or used holus-bolus from scholars like Miriam Posner, Fred Gibbs, and Heather Froehlich (and I'm grateful that they shared their materials!).

N.B. If you start working with the R exercises below, I would suggest you read the introductory bits from Lincoln Mullen's book-in-progress, DH Methods in R, especially the 'setup' part under 'getting started' (pay attention to the bit on installing packages and dependencies). If you spot any exercises in Mullen's book that seem relevant to your project, you may do those as an alternative to the ones here. Alternatively, go to Swirl and learn the basics of R within R (it's an interactive tutorial; try it out). DHNow recently linked to a new Basic Text Mining in R tutorial, which is worth checking out as well.

Texts: Topic Modeling Tool; Topic Modeling in R; Text Analysis with OverviewProject; Corpus Linguistics with AntConc; Text Analysis with Voyant; Text Analysis in R
Networks: Network analysis and visualization; Converting 2-mode to 1-mode; Graphing the Net; Choose your own adventure
Maps: Simple mapping & georectifying; QGIS (tutorials by Fred Gibbs); Geoparsing with Python; Palladio with Posner
Charts: Quick charts using RAW

Exercise 1

Network Visualization

In exercise 1, you will transform your Texan Correspondence data into a network, which you will then visualize with Gephi. The detailed instructions are here. I would recommend that you also take a long look at Scott Weingart's series, Networks Demystified. Finally, heed our warning.


Exercise 2

Topic Modeling Tool

In exercise 2, you will use the 'Topic Modeling Tool' to create a simple topic model and a webpage that allows you to browse the results.

  1. Download the tool. (The site for the tool is https://code.google.com/p/topic-modeling-tool/.)
  2. Make sure you have the Colonial Newspaper Database handy on your machine. (You can grab my copy from here).
  3. Double-click on the file you downloaded in step 1. This will open a java-based graphical user interface with one of the most common topic-modeling approaches, 'Latent Dirichlet Allocation'.
  4. Set the input to be the Colonial Newspaper Database.
  5. Set the output to be somewhere neat and tidy on your computer.
  6. Set the number of topics you'd like to model.
  7. Click 'train topics' to run the algorithm.
  8. When it finishes, go to the folder you selected for output, and find the file 'all_topics.html' in the 'output_html' folder. Click on that, and you now have a browser-based way of navigating your topics and documents. In the output_csv folder created, you will find the same information as csv, which you could then input into a spreadsheet for other kinds of visualizations (which we'll talk about in class.)

Make a note in your open notebook about your process and your observations. How does reading this material in this way change, challenge, or focus your understanding of the material?

Going further: Remember when we did the Canadiana API and WGET exercises in Module 2? Somewhere on your machine you have a collection of those materials. Now, you can load those materials into the Topic Modeling Tool if you have all of the txt files in a single folder. In the case of the slavery documents, that was something like 7500 items. That's a lot of drag-and-drop. You can 'flatten' the folder structure so that all of the documents in your subfolders are put into a single folder. If you are on a Mac, try these instructions. On a PC, try this one (there are scripts you can use, but for the time being this is probably simplest). Then, you can point your topic modeling tool at your flattened folder, and boom, you have a topic model fitted to your collection.
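If you would rather script the flattening step, here is a minimal sketch in R. The function name flatten_folder and the demo folders are placeholders invented for illustration; nothing here is part of the Topic Modeling Tool. It copies every .txt file found under a folder (subfolders included) into one destination folder, and prefixes a running number so that files sharing a name do not overwrite one another.

```r
# Hypothetical helper: copy every .txt file under `source_dir`
# (including subfolders) into the single folder `flat_dir`.
flatten_folder <- function(source_dir, flat_dir) {
  dir.create(flat_dir, recursive = TRUE, showWarnings = FALSE)
  txt_files <- list.files(source_dir, pattern = "\\.txt$",
                          recursive = TRUE, full.names = TRUE)
  # Prefix a running number so same-named files from different
  # subfolders do not collide in the flat folder.
  new_names <- sprintf("%05d_%s", seq_along(txt_files), basename(txt_files))
  copied <- file.copy(txt_files, file.path(flat_dir, new_names))
  invisible(sum(copied))
}

# Demonstration on a throwaway folder tree (assumed layout, not your data).
demo_src <- tempfile("nested_demo")
dir.create(file.path(demo_src, "a", "b"), recursive = TRUE)
writeLines("first document", file.path(demo_src, "a", "doc.txt"))
writeLines("second document", file.path(demo_src, "a", "b", "doc.txt"))

demo_flat <- tempfile("flat_demo")
n_copied <- flatten_folder(demo_src, demo_flat)
print(list.files(demo_flat))
```

To use it on your own materials, you would swap demo_src and demo_flat for the paths to your downloaded collection and an empty output folder. Keep the output folder outside the source folder so a second run does not pick up its own copies.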


Exercise 3

Topic Modeling in R

Exercise 2 was quite a simple way to do topic modeling. There are more powerful ways, and one of these uses a program called RStudio, which is an interface for the R statistical programming language. R is a powerful language for exploring, visualizing, and manipulating all kinds of data, including text. It is, however, not the most intuitive of environments to work in, which is why we will use RStudio. Download the free version and install it on your machine. Note that you need to have R downloaded and installed first! Then, go to http://tryr.codeschool.com/ and work your way through some of that tutorial. This tutorial mimics working right in the R console. Remember working in git bash or the terminal in Module 3? It's somewhat similar to that, but just for R. A handy pdf that explains a bit more about working within the RStudio environment can be had here. In essence, you put your code in the script window, execute each line of it, and the output appears in the console or in the plots window. This handout will guide you around the interface.

In this exercise, we're going to grab the Colonial Newspaper Database from my github page, do some exploratory visualizations, and then create a topic model whose output can then be visualized further in other platforms (including as a network in Gephi). The walkthrough can be found here. Each gray block is something to copy-and-paste into your script window in RStudio. Then, put the cursor at the start of the first line, and hit ctrl+enter to get RStudio to execute each line. In the walkthrough, when you get to another gray block, just copy and paste it into your script window after the earlier block. Work your way through the walkthrough. The walkthrough gives you an indication of what the output should look like as you move through it. (The walkthrough was written inside R, and then turned into HTML using an R package called 'knitr'. You can see that this has implications for open research! For reference, here's the original Rmd (R markdown) file that generated the walkthrough.)

By the way: when you run this line: topic.model$train(1000) your console will fill up with data as it iterates 1000 times over the entire corpus, fitting a topic model to it. This is as it should be!

In this way, you'll build up an entire script for topic modeling materials you find on the web. You can then save your script and upload it to your open notebook. In the future, you'd be able to make just a few changes here and there in order to grab and explore different data.

Make a note in your open notebook about your process and your observations.

Going further: If you wanted to use that script on the materials you collected in module 2, you would have to tell R to load up those materials from a directory, rather than by reading a csv file. Take a look at my script for topic modeling the Ferguson Grand Jury documents, especially this line:

documents <- mallet.read.dir("originaldocs/1000chunks/")

You feed it the path to your documents. If you are on a Windows machine, the path would look a bit different, for instance:

"C:\\research\\originaldocs\\1000chunks\\"
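If juggling slashes and backslashes gets confusing, one option is to let R build the path for you. The sketch below is an illustration only: the folder under tempfile() is a stand-in for your own documents folder, and read_text_dir is a hypothetical helper written in base R, not a function from the mallet package. It returns one row per file with an id and a text column, which is roughly the shape of table I understand mallet.read.dir to produce, so you can inspect what is being loaded.

```r
# Build the path from its parts. file.path() joins them with "/",
# which R accepts on Windows as well as on Mac and Linux.
docs_dir <- file.path(tempfile("research"), "originaldocs", "1000chunks")  # stand-in location
dir.create(docs_dir, recursive = TRUE)
writeLines("a small sample document", file.path(docs_dir, "chunk1.txt"))
writeLines("another sample document", file.path(docs_dir, "chunk2.txt"))

# Hypothetical helper: read every .txt file in a folder into a data frame,
# using the file path as the document id.
read_text_dir <- function(dir_path) {
  files <- list.files(dir_path, pattern = "\\.txt$", full.names = TRUE)
  texts <- vapply(
    files,
    function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
    character(1)
  )
  data.frame(id = files, text = unname(texts), stringsAsFactors = FALSE)
}

documents_df <- read_text_dir(docs_dir)
print(nrow(documents_df))
```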


Exercise 4

Text Analysis with Overview

In exercise 4, we're going to look at the Colonial Newspaper Database again, but this time using a tool called 'Overview'. Overview uses a different approach than the topic models we've been discussing. In essence, it looks at word frequencies and their distributions within a document, and within a corpus, to organize the documents into folders of progressively similar word use.
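To get a feel for what 'similar word use' can mean, here is a toy sketch in base R. It is not Overview's code, and the three sentences are invented. It weights each word by how often it occurs in a document and how rare it is across the corpus (a common weighting called TF-IDF, swapped in here as an assumption), then compares documents by cosine similarity.

```r
# Three invented mini-documents.
docs <- c(d1 = "ship arrived from london with goods",
          d2 = "ship sailed for london",
          d3 = "public sale of lands and goods")

# Split into lowercase words and collect the vocabulary.
tokens <- strsplit(tolower(docs), "[^a-z]+")
vocab <- sort(unique(unlist(tokens)))

# Term frequency matrix: one row per document, one column per word.
tf <- t(vapply(
  tokens,
  function(words) as.numeric(table(factor(words, levels = vocab))),
  numeric(length(vocab))
))
rownames(tf) <- names(docs)
colnames(tf) <- vocab

# Inverse document frequency: words found in fewer documents weigh more.
doc_freq <- colSums(tf > 0)
idf <- log(length(docs) / doc_freq)
tfidf <- sweep(tf, 2, idf, `*`)

# Cosine similarity between every pair of documents.
norms <- sqrt(rowSums(tfidf^2))
similarity <- (tfidf %*% t(tfidf)) / (norms %o% norms)
print(round(similarity, 2))
```

Documents that share distinctive words score closer to 1, and documents with no words in common score 0. A tool can then group the high-scoring documents together.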

You can download Overview to run on your own machine, but for our purposes, the hosted version at https://www.overviewdocs.com/ is sufficient. Go to that page, watch the video, create an account, and then log in. (More help about how Overview works may be found on their blog, including helpful videos.)

Once you're inside, click 'import from a CSV file', and upload the CND.csv (which you can download and save to your own machine from here: right-click and 'save as'). On the 'UPLOAD A CSV FILE' page in Overview click 'browse' and select the CND.csv. It will give you a preview. There are a number of options here - you can tell Overview which words to ignore, and which words to give added importance to. What words will you select? Make a note in your notebook. Then hit 'upload'.

A new page appears, called 'YOUR DOCUMENT SETS'. Click on the one you just uploaded. A file folder tree showing documents of progressively greater similarity will open; on the right hand side will be the list of documents within each box (the box in question will be greyed out when you click on it, so you know where you are). You can search for words in your document, and Overview will tell you where they are; you can tag documents that you find interesting. The Overview system allows you to jump between a distant, macroscopic view and a close, document level view. Jump back and forth, see what you can find. For suggestions about how to use Overview effectively, try their blog. Make notes about what you observe in your notebook. Also, you can export your tagged document set from Overview, so that you could visualize the patterns of tagging in a spreadsheet (for instance).

Going further: Do you see how you could upload the documents that you collected during Module 2?


Exercise 5

Corpus Linguistics with AntConc

Heather Froehlich has put together an excellent step-by-step guide to using AntConc for exploring textual patterns within, and across, corpora of texts. Work your way through her tutorial.

Can you get our example materials (from the Colonial Newspaper Database) into AntConc? This might help you to split the csv into individual txt files. Alternatively, do you have any materials of your own, already collected? Feed them into AntConc. What patterns do you see? What if you compare your materials against other corpora of texts?
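If you want to do the splitting yourself in R, a minimal sketch follows. The column names id and text and the two sample rows are assumptions made up for illustration; check the real column names in CND.csv (for example with names(articles)) and adjust before running it on the database.

```r
# Toy stand-in for the database; the columns `id` and `text` are assumed names.
articles <- data.frame(
  id = c("article-001", "article-002"),
  text = c("Arrived the ship Mary from London.",
           "Notice of a public sale of lands."),
  stringsAsFactors = FALSE
)
# With a real file you would start from something like:
# articles <- read.csv("CND.csv", stringsAsFactors = FALSE)

# Write one .txt file per row, named after the row's id.
out_dir <- tempfile("cnd_txt")
dir.create(out_dir)
for (i in seq_len(nrow(articles))) {
  writeLines(articles$text[i],
             file.path(out_dir, paste0(articles$id[i], ".txt")))
}
print(list.files(out_dir))
```

You can then point AntConc at the folder of txt files that results.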

FYI, here is a collection of corpora that you can explore.


Exercise 6

Text Analysis with Voyant

In module 3, if you recall, we worked through how to transform XML using stylesheets. Melodee Beals used a stylesheet to transform her database into a series of individual txt files. In the exercises above, a transformer was used to make the database into a single CSV file. In this exercise, we are going to use Voyant Tools to visualize patterns in word use in the database. Voyant can read either a CSV or text files. The advantage of uploading a folder of text files is that, if the files are in chronological order, Voyant's default visualizations will also be arranged in chronological order and thus we can see change over time.

Go to http://voyant-tools.org. Paste the URL to the csv of the CND database: https://raw.githubusercontent.com/shawngraham/exercise/gh-pages/CND.csv .

Now, open a new browser window, and go here http://voyant-tools.org/?corpus=colonial-newspapers&stopList=stop.en.taporware.txt

Do you see the difference? In the latter window, the individual articles have been uploaded individually, and thus are treated as individual documents in chronological order.

Explore the corpus, comparing terms over time, looking at keywords in context, and using the RezoViz tool to create a graph where people, places, and organizations that appear in the same documents (and across documents) are connected (you can find 'rezoviz' under the cogwheel icon at the top right of the panel). You can embed any of the tools in your blogs by using the 'save' icon and getting the iframe or embed code. You can apply 'stopwords' by clicking on the cogwheel in any of the different tools, and selecting stopwords. Apply the stopwords globally, and you'll only have to do this once! What patterns do you see? What do different tools highlight? Which ones are useful? What patterns do you see that strike you as interesting? Note this all down.

Going further: Upload materials you collected in module 2 and explore them.


Exercise 7

Quick Charts Using RAW

A quick chart can be a handy thing to have. Google spreadsheets, Microsoft Excel, and a host of other programs can make excellent charts quickly with their wizard functions. Never hesitate to turn to these. However, they are not always good with non-numeric data. In module 3, you used the NER to extract place names from a text. After some further munging with regex, you might have ended up with a CSV that looks like this. Can we do a quick visualization of this information? One useful tool is RAW. Open that in a new window. Copy the table of data of places mentioned in the Texan correspondence, and paste it into the data input box at the top of the RAW screen.

A quick data munge

You should get an error message, to the effect that you need to check 'line 2'. What's gone wrong? RAW has checked the number of values you have in that row, and compared it to the number of columns in row 1 (which contains all the column names). It sees that the two don't match. What we need to do is add a default null value in those cells. So, go to Google Sheets, click the 'go to google sheets' button, and then click on the big green plus sign to start a new sheet. Paste the following into the top-left cell (cell A1):

=IMPORTDATA("https://raw.githubusercontent.com/hist3907b-winter2015/module4-holes/master/texas.csv")

Pretty neat, eh? Now, here's the thing: even though your sheet looks like it is filled with information, it's not (at least, as far as the script we are about to run is concerned). That is to say, the sheet itself only has one cell of data, and that one cell is grabbing info from elsewhere on the web and dynamically filling the sheet. The script we're going to run works only on static values (more or less).

So, place your cursor in cell B1. On a Mac, hit shift+cmd+downarrow. On a Windows machine, hit shift+ctrl+downarrow. Then, on a Mac, hit shift+cmd+rightarrow; on Windows, hit shift+ctrl+rightarrow. Then copy all of that data (cmd+c or ctrl+c). Then, under 'Edit' select 'paste special' -> 'paste VALUES only'.

The formula you put in cell A1 now says #REF!. You can delete this now. This mucking about is necessary so that the add on script we are about to run will work.

We now need to fill those empty values. In the tool bar, click add ons -> get add ons. Search for blanks. You want to add Blank Detector.

Now, click somewhere in your data. On Mac, hit cmd+a. On Windows, hit ctrl+a. This highlights all of your data. Click Add ons -> blank detector -> detect cells. A dialogue panel will open on the right hand side of your screen. Click the button beside 'set value' and type in null. Hit run. All of the blank cells will fill with the word null. Delete column A (which formerly had record numbers, but is now just filled with the word null; we don't need it). If you get the error 'run exceeded maximum time', just hit the run button again. This script might take a few minutes.

You can now copy and paste your table of data into the data input box in RAW, and you should get the green thumbs up saying x records have been successfully parsed!
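The same repair can also be sketched in R, if you would rather not go through Google Sheets. The rows below are an invented sample with the same problem (some rows have fewer values than the header); the column names are assumptions and not the actual contents of texas.csv.

```r
# Invented sample: rows 2 and 3 of the data have fewer values than the header.
raw_lines <- c("place1,place2,place3",
               "Austin,Houston,Galveston",
               "Brazos",
               "Mexico,California")
ragged_csv <- tempfile(fileext = ".csv")
writeLines(raw_lines, ragged_csv)

# fill = TRUE pads short rows with blanks; na.strings = "" marks blanks as missing.
places <- read.csv(ragged_csv, fill = TRUE, na.strings = "",
                   stringsAsFactors = FALSE)

# Put the word "null" in every missing or empty cell.
places[is.na(places) | places == ""] <- "null"

# Save a cleaned copy whose contents can be pasted into RAW.
filled_csv <- tempfile(fileext = ".csv")
write.csv(places, filled_csv, row.names = FALSE, quote = FALSE)
print(places)
```

Every row now has the same number of values as the header row, which is what RAW was checking for.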

Playing with RAW

RAW takes your data, and depending on your choices, passes it into chart templates built on the d3.js code library. D3.js is a powerful library for making all sorts of charts (including interactive ones). If this sort of thing interests you, you can follow the tutorials in Elijah Meeks' excellent new book.

With your data pasted in, you can now experiment with a number of different visualizations that are all built on the d3.js code library. Try the ‘alluvial’ diagram. Pick place1 and place2 as your dimensions - you click and drag the green boxes under 'map your data' into the 'steps' box. Leave the 'size' box empty. Under 'customize your visualization' you can click inside the 'width' box to make the diagram wider and more legible.

Does anything jump out? Try place3 and place4. Try place1, place2, place3, and place4 in a single alluvial diagram. When we look at the original letters, we see that the writer often identified the town in which he was writing, and the town of the addressee. Why choose the third and fourth places? Perhaps it makes sense, for a given research question, to assume that with the pleasantries out of the way the writers will discuss the places important to their message. Experiment! This is one of the joys of working with data: experimenting to see how you can deform your materials to see them in a new light.
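RAW draws the diagram for you, but it can help to see the counts behind it. The sketch below is only an illustration: the four rows are invented, and it simply counts how often each place1 and place2 pair occurs together, which is the kind of tally the widths of the bands in an alluvial diagram reflect.

```r
# Invented rows standing in for the place1 and place2 columns.
letters_places <- data.frame(
  place1 = c("Austin", "Austin", "Houston", "Austin"),
  place2 = c("Houston", "Houston", "Galveston", "Galveston"),
  stringsAsFactors = FALSE
)

# Cross-tabulate the two columns and turn the result into a long table.
pair_counts <- as.data.frame(
  table(letters_places$place1, letters_places$place2),
  stringsAsFactors = FALSE
)
names(pair_counts) <- c("place1", "place2", "n")

# Drop pairs that never occur and list the most frequent pairs first.
pair_counts <- pair_counts[pair_counts$n > 0, ]
pair_counts <- pair_counts[order(-pair_counts$n), ]
print(pair_counts)
```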

You can export your visualization under the 'download' box at the bottom of the RAW page - your choices are as a simple raster image (png), a vector image (svg) or a data representation (json).


Exercise 8

Simple Mapping and Georectifying

In this exercise, you will find a historical map online, upload a copy to a mapwarper service, georectify it, and then display the map online, via a hosted service like CartoDB, and also through a map you will build yourself using leaflet.js. Finally, we will also convert csv to geojson using http://togeojson.com/, and we'll map that as a github gist. We'll also grab a geojson file hosted on github gist and import it into cartodb.

Georectifying

Georectifying is the process of taking an image (whether it is of a historical map, chart, airphoto, or whatever) and manipulating its geometry so that it matches a geographic projection. Think of it like this: you take your handdrawn map, and use pushpins to pin down known locations on your map to a globe. As you pin, your image stretches and warps. Traditionally, this has not been an easy thing to do, if you are new to GIS. In recent years, the curve has flattened significantly. In this exercise, we'll grab an image, upload it to the Harvard Library MapWarper service, and then export it as a tileset which can be used in other mapping programs.

  1. Get a historical map. I like the Fire Insurance plans from the Gatineau Valley Historical Society; I'm sure you can find others to suit your interests.
  2. Right-click and choose 'save as...' to grab a copy. Save it somewhere handy.
  3. Go to Harvard World MapWarp and sign up for an account. Then log in.
  4. Go to the upload screen:
  5. Fill in as much of the metadata as you can. Then select your map from your computer, and upload it.
  6. On the next page, click 'rectify'.
  7. Pan and zoom both maps until you're sure you're looking at the same area in both. Double click in a map, select the pencil icon, and click on a point (location) you are sure you can match in the other window. Then click on the other map window, select the pencil, and then click on the same point. The 'add control point' button below and between both maps will light up. Click on this to confirm that this is a control point you want. Do this at least three times; the more times you can do it, the better the map warp.
  8. Having selected your control points, click on 'warp image'.
  9. You can now click on the 'export' panel, and get the URL for your georectified image in a few different formats. If you clicked on the KML option, a google map window will open like so. For many webmapping applications, the Tiles (Google/OSM scheme): Tiles Based URL is what you want. You'll get a URL like this: http://warp.worldmap.harvard.edu/maps/tile/4152/z/x/y.png Save that info. You'll need it later.

You have now georectified a map. Let's use that map as a base layer in Palladio.

We need some place data for Palladio. Here's what I'm using
Note how I've formatted this data. I'll be copying and pasting it into Palladio. (For more on how to input geographic data into Palladio, see this tutorial). Basically, you want something like this:

Place Coordinates
Mexico 23.634501,-102.552784
California 36.778261,-119.4179324
Brazos 32.661389,-98.121667

etc: that is, a tab between 'Place' and 'Coordinates' in the first line, a tab between 'Mexico' and the latitude, and a comma between latitude and longitude.
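If your places and coordinates already sit in separate columns, a few lines of R can produce this layout for pasting into Palladio. This is a minimal sketch using the three places listed above; the column names lat and lon are assumptions chosen for the example.

```r
# The three example places from above, with latitude and longitude in separate columns.
places <- data.frame(
  Place = c("Mexico", "California", "Brazos"),
  lat = c(23.634501, 36.778261, 32.661389),
  lon = c(-102.552784, -119.4179324, -98.121667),
  stringsAsFactors = FALSE
)

# A tab between the place and its coordinates; a comma between latitude and longitude.
palladio_lines <- c(
  "Place\tCoordinates",
  paste0(places$Place, "\t", places$lat, ",", places$lon)
)
writeLines(palladio_lines)
```

Copy the printed lines from the console and paste them into Palladio's data box.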

  1. Go to Palladio. Hit 'start' then 'upload spreadsheet or csv'. In the box, paste in your data. You can progress to the next step without having any real data: just paste or type something in - see the video below. Obviously, you won't have any points on your map, but if you were having trouble with that step, this allows you to bypass it to continue on with this tutorial.
  2. Click on 'map'. Under 'places', select 'coordinates'. Then, click 'add new layer'. In the popup, beside 'Choose one of Palladio default layers or create a new one.', select 'custom'. This is where you're going to paste in that tiles-based URL from the map warper. Paste it in, but replace the z/x/y part with {z}/{x}/{y}. Click add.
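The URL edit in step 2 is easy to get wrong by hand, so here is a small sketch of it in R, using the example URL from the map warper step above. Your own URL will have a different map number.

```r
# Example tiles-based URL from the export panel (yours will differ).
tile_url <- "http://warp.worldmap.harvard.edu/maps/tile/4152/z/x/y.png"

# Swap the literal z/x/y for the {z}/{x}/{y} placeholders; fixed = TRUE
# means the pattern is matched as plain text, not as a regular expression.
tile_template <- sub("/z/x/y", "/{z}/{x}/{y}", tile_url, fixed = TRUE)
print(tile_template)
```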

Here is a video walk through; places where you might have got into trouble include getting past the initial data entry box on Palladio, and finding where exactly to paste in your georectified map url.

Congratulations! You've georectified a map, and used it as a base layer for a visualization of some point data. Here are some notes on using a georectified map with the CartoDB service.



Exercise 9

Text Analysis in R

I would suggest, before you try this, that you look at the walkthrough for exercise 3, and that you become familiar with R. Then, you can try this tutorial, starting at page 3. On that page, the author tells you to create a folder called /corpus/text, and to fill it with text files you'd like to analyse. So why not grab some of the materials you collected in module 2? The problem is, where is this folder supposed to go? In RStudio, find out where your working directory is by typing

getwd()

in the console. Then, you can create the /corpus/text folder & subfolder at that location. Alternatively, you can set the working directory to wherever you like, like so:

setwd("C:/my-working-folder/") on a pc, or setwd("~/my-working-folder/") on a mac.
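You can also create the corpus/text folders from inside R rather than in your file browser. In this minimal sketch the project folder is a temporary stand-in (an assumption for the example); replace it with your own working folder.

```r
# Show where R is currently working.
print(getwd())

# Stand-in for your own project folder; swap in a real path such as "~/my-working-folder".
project_dir <- tempfile("my-working-folder")

# Create the corpus folder and its text subfolder in one step.
corpus_dir <- file.path(project_dir, "corpus", "text")
dir.create(corpus_dir, recursive = TRUE)

# Confirm the folder now exists.
print(dir.exists(corpus_dir))
```

Once the folders exist, copy your text files into corpus/text and carry on with the tutorial.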

Then, to get going, you'd need

install.packages("tm")

library(tm)

You can then work through the entire pdf, or jump ahead to page 37 to see what the completed script would look like (here's my version using the CND again). Make notes of what you find. Google any error messages you find to try to figure out a solution.

Exercise 10

QGIS

There are many excellent tutorials around concerning how to get started with GIS. Our own library, in the MADGIC centre, has tremendous resources, and I would encourage you to speak with the map librarians before embarking on any serious mapping projects. In the short term, the historian Fred Gibbs has an excellent series on using the open source GIS platform QGIS to make and map historical data.

For this exercise, I would recommend you try Gibbs' first tutorial,

'Making a map with QGIS'

...and then, try georectifying a historical map and adding it to your GIS:

'Using historical maps with QGIS'

Going Further

There are many tutorials at The Programming Historian that are appropriate here. Try some under the 'data manipulation' or 'distant reading' headings.

If you're into social media as a data source, you might try Twarc.