Game Walkthrough Corpus (GWTC)

doi:10.5281/zenodo.4559183

Published February 12, 2021 | Version 0.99

Dataset Open

1. Leipzig University

Data collectors:

1. Leipzig University (Student)

Motivation

The Game Walkthrough Corpus (GWTC) contains 12,295 unique
walkthrough documents that cover a total of 6,117 games. For each game walkthrough,
it provides frequencies of unigrams and bigrams, treating it as a bag of words. In
addition, it provides word frequencies on the sentence level. Furthermore, the GWTC
contains a range of game-related metadata, including title, publisher, developer, year,
genre, etc. All the language statistics and metadata are stored in separate plain text files
and can be referenced by means of uniform resource names (URNs). These URNs can
also be used to derive any combination of statistics and metadata. Researchers, for
instance, can investigate the most frequent unigrams for games in the “Adventure”
genre. This way, the GWTC can be reused in various ways, for different kinds of
research questions on the topic of gaming language, which may be summarized as
“distant playing”.
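
As a rough illustration of combining statistics and metadata, the following Python sketch sums unigram counts over all documents of one genre. It does not read the actual GWTC files: the dictionaries, the placeholder URNs and the function name `top_unigrams_for_genre` are assumptions made for this example, and the direct URN-to-genre lookup simplifies whatever mapping the real files use.

```python
from collections import Counter

# Hypothetical in-memory stand-ins for the corpus files. The real GWTC
# files are plain text; their exact layout is NOT assumed here.
# The keys are placeholder URNs, not actual GWTC identifiers.
unigrams_by_urn = {
    "urn:example:1": Counter({"door": 4, "key": 3, "the": 10}),
    "urn:example:2": Counter({"key": 2, "sword": 6, "the": 7}),
    "urn:example:3": Counter({"lap": 6, "the": 3}),
}
genre_by_urn = {
    "urn:example:1": "Adventure",
    "urn:example:2": "Adventure",
    "urn:example:3": "Racing",
}


def top_unigrams_for_genre(genre, unigrams, genres, n=3):
    """Sum the per-document unigram counts of all documents in one genre."""
    total = Counter()
    for urn, counts in unigrams.items():
        if genres.get(urn) == genre:
            total.update(counts)
    return total.most_common(n)


adventure_top = top_unigrams_for_genre("Adventure", unigrams_by_urn, genre_by_urn)
print(adventure_top)
```

In practice a stop word filter would probably be applied first, since function words such as "the" dominate raw frequency lists.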

Copyright Information

Game walkthroughs are protected by individual copyright notices that are often very strict. For this reason, this data set does not include the documents themselves but instead provides various data formats that are useful for text mining and distant reading methods and do not allow the documents to be recreated. It is highly unlikely that even a single sentence can be reconstructed from the published data.
Since only text mining statistics about the documents are published, and not the documents themselves, not even in part, this project does not violate copyright.
Links to the original documents are available in the sourceUrls file in the data folder.

File Information

data folder: document data

bagofwords: Word frequencies per document
bigrams: Bigram frequencies per document
corpusstats: Min, avg and max token count, type count, type/token ratio, documents per game plus corresponding standard deviation
game_walkthrough_mapping: Documents per game
game_walkthrough_mapping: Number of documents per game
sentencecollocations: Word frequencies per sentence per document
sourceUrls: Links to original text
textlength: Number of characters per document
tfidf_deu: Word significance per document (German)
tfidf_eng: Word significance per document (English)
tokencount: Number of words per document
typecount: Number of unique words per document
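
To make the relation between these statistics concrete, here is a minimal Python sketch that derives token count, type count and type/token ratio from a bag of words, using the usual linguistic sense of token (every word occurrence) and type (each distinct word). The sample data and the names `bags` and `describe` are hypothetical, and the choice of population standard deviation is an assumption; the corpus may use the sample variant.

```python
import statistics
from collections import Counter

# Hypothetical bag-of-words data for three documents. The real
# "bagofwords" file format is not assumed here.
bags = {
    "doc_a": Counter({"go": 3, "north": 1, "then": 2}),
    "doc_b": Counter({"open": 2, "door": 2}),
    "doc_c": Counter({"jump": 5, "run": 1, "duck": 1, "wait": 1}),
}


def describe(bag):
    """Return token count, type count and type/token ratio for one document."""
    tokens = sum(bag.values())  # every word occurrence
    types = len(bag)            # each distinct word once
    return {"tokens": tokens, "types": types, "ttr": types / tokens}


per_doc = {name: describe(bag) for name, bag in bags.items()}

# Corpus-level summary of the token counts, in the spirit of "corpusstats".
token_counts = [stats["tokens"] for stats in per_doc.values()]
summary = {
    "min": min(token_counts),
    "avg": statistics.mean(token_counts),
    "max": max(token_counts),
    # Assumption: population standard deviation.
    "stdev": statistics.pstdev(token_counts),
}
print(per_doc)
print(summary)
```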

metadata: game metadata

file names that do not start with "_": metadata [filename] per game
_all: All metadata in one file
_mapping_release_date*: Metadata combined with release date for time series

doc folder: documentation

createdata: Python script to create content of data folder
extractMetainformation: Python script to create content of metadata folder
metadata_rawg: Game metadata collected from RAWG
metadata_steam: Game metadata collected from Steam
metadata_symbol: Quality control. Relation of text in source HTML and extracted text
titlesandurns: Game titles mapped to project identifiers

Walkthrough Sources

https://portforward.com/games/walkthroughs/
https://www.neoseeker.com
https://www.spieletipps.de
https://jayisgames.com/
http://gamesetter.com/

Corpus Statistics

Number of unique games: 6,013
Number of documents: 12,295
Genre associations: 3,806
Gameplay tags: 10,246
Release dates: 2,443
Developers: 3,152
Publishers: 2,782
Steam IDs: 1,086
Platform associations: 5,293 (PC, Gameboy, iOS, Linux,...)
Game language associations: 4,631
Languages: English, German and a small amount of French

External Resources

Project Website: https://www.informatik.uni-leipzig.de/~jtiepmar/forschung/gwtc/
Bitbucket: https://bitbucket.org/jtiepmar/game-walkthrough-corpus/src/master/

There are two versions of the GWTC available for download: ver. 0.99 contains all the above corpus files, plus the Git files. Note that after downloading ver. 0.99, the Git folders may be hidden by default, depending on your operating system. Ver. 1.0 is a cleaned-up version that comes without the Git files.

Files

game-walkthrough-corpus.zip (1.7 GB)
md5:66cdfdad17dcdf63fb2ce95f324d0d3a
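
The MD5 checksum listed for game-walkthrough-corpus.zip can be used to check a download for corruption. The following Python sketch shows one way to do this with the standard library; the function names `md5_of_file` and `is_intact` and the local file path in the usage note are assumptions for illustration.

```python
import hashlib

# Checksum as listed for game-walkthrough-corpus.zip.
EXPECTED_MD5 = "66cdfdad17dcdf63fb2ce95f324d0d3a"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reading avoids loading the 1.7 GB archive into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected
```

Assuming the archive was saved to the current directory, `is_intact("game-walkthrough-corpus.zip")` should return `True` for an undamaged download.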
