types2: Type and Hapax Accumulation Curves

Suomela, Jukka

types2 is a tool for analysing textual diversity, richness, and productivity in text corpora and other data sets.

With this tool, we can analyse data sets from the perspective of the following statistics:

  • number of words: the total number of running words in the text corpus
  • number of tokens: the words of interest in our study
  • number of types: how many distinct tokens we have seen
  • number of hapaxes: how many types have occurred exactly once
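
As a rough illustration of the four statistics above, the following Python sketch counts them for a toy word list. It is not part of types2: the function `count_statistics`, the `is_token` predicate, and the choice of words ending in "ness" as the tokens of interest are all assumptions made for this example.

```python
from collections import Counter

def count_statistics(words, is_token):
    """Count words, tokens, types, and hapaxes in a list of running words.

    `is_token` is a hypothetical predicate that selects the words of
    interest; it is an assumption of this sketch, not a types2 feature.
    """
    tokens = [w for w in words if is_token(w)]
    freq = Counter(tokens)  # frequency of each distinct token (type)
    return {
        "words": len(words),
        "tokens": len(tokens),
        "types": len(freq),
        "hapaxes": sum(1 for c in freq.values() if c == 1),
    }

# Toy corpus; the tokens of interest are words ending in "ness".
words = "the darkness and the kindness of the darkness met sadness".split()
stats = count_statistics(words, lambda w: w.endswith("ness"))
print(stats)
```

Here "darkness" occurs twice, so it counts as one type but not as a hapax, while "kindness" and "sadness" are both types and hapaxes.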

We are usually interested in how the number of types or hapaxes compares with the number of words or tokens. With types2, we can analyse the relationships between all four quantities.
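
One way to picture this relationship is an accumulation curve: process the corpus one sample at a time and record how many tokens, types, and hapaxes have been seen so far. The sketch below is a minimal illustration under assumed conventions (each sample is a list of tokens, and the helper name `accumulation_curve` is hypothetical); it does not reproduce the input format or the implementation of types2.

```python
from collections import Counter

def accumulation_curve(samples):
    """Return (tokens, types, hapaxes) totals after each sample, in order."""
    freq = Counter()
    tokens_seen = 0
    hapaxes = 0
    curve = [(0, 0, 0)]  # the curve starts at the origin
    for sample in samples:
        for token in sample:
            freq[token] += 1
            if freq[token] == 1:
                hapaxes += 1   # first occurrence: a new hapax
            elif freq[token] == 2:
                hapaxes -= 1   # second occurrence: no longer a hapax
        tokens_seen += len(sample)
        curve.append((tokens_seen, len(freq), hapaxes))
    return curve

samples = [
    ["darkness", "kindness"],
    ["darkness"],
    ["sadness", "kindness", "illness"],
]
curve = accumulation_curve(samples)
print(curve)
```

Note that the type count never decreases along the curve, whereas the hapax count can go down when a token is seen for the second time.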

The tool can be used for visualisation, statistical hypothesis testing, and exploratory data analysis. In the statistical analysis, we use nonparametric methods (more specifically, Monte Carlo permutation tests). The only modelling assumption is that, under the null hypothesis, individual “samples” are exchangeable.
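
To make the idea of a Monte Carlo permutation test concrete, here is a deliberately simplified sketch. It asks whether a chosen group of samples contains unusually few types compared with randomly chosen groups of the same number of samples, relying only on the exchangeability assumption stated above. The function names, the representation of samples as token lists, and the choice to match groups by sample count (rather than by word or token count) are assumptions of this example and should not be read as the procedure types2 actually uses.

```python
import random

def count_types(samples):
    """Number of distinct tokens in a collection of samples."""
    return len({token for sample in samples for token in sample})

def permutation_p_value(samples, subset_indices, rounds=2000, seed=0):
    """Estimate a one-sided p-value for "the subset has few types".

    Under the null hypothesis the samples are exchangeable, so a random
    group of the same number of samples is as likely as the observed one.
    """
    rng = random.Random(seed)
    k = len(subset_indices)
    observed = count_types([samples[i] for i in subset_indices])
    at_most = 0
    for _ in range(rounds):
        chosen = rng.sample(samples, k)  # random group of k samples
        if count_types(chosen) <= observed:
            at_most += 1
    # Add-one correction keeps the estimate strictly positive.
    return (at_most + 1) / (rounds + 1)

samples = [["a"], ["a"], ["a"], ["b", "c"], ["d", "e"],
           ["f", "g"], ["h", "i"], ["j", "k"]]
p = permutation_p_value(samples, subset_indices=[0, 1, 2])
print(p)
```

In this toy data the first three samples share a single type, so only a small fraction of random groups are as poor in types, and the estimated p-value is correspondingly small.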

The software was written by Jukka Suomela, and the system was designed and developed in collaboration with Tanja Säily.

