Preprint Open Access

Zeta & Eta: An Exploration and Evaluation of two Dispersion-based Measures of Distinctiveness

Du, Keli; Dudar, Julia; Rok, Cora; Schöch, Christof

In Corpus Linguistics, numerous statistical measures have been adopted to analyze large amounts of textual data from a contrastive perspective in order to extract characteristic or “distinctive” features. While the most widely used keyness measures are based on word frequency, a growing number of research papers have recently suggested dispersion-based measures as a better solution. These, however, are not new to Computational Literary Studies (CLS): in 2007, John Burrows introduced Zeta, a statistical measure based mainly on the degree of dispersion of a feature in a text corpus. In this paper, we introduce Eta, a new measure of distinctiveness based on the deviation of proportions (DP) suggested by Stefan Gries. By comparing Eta with Zeta, we demonstrate that both measures are able to identify relevant, interpretable distinctive words in a target corpus. Additionally, we make a first attempt to detect the key differences between the two measures by interpreting the top distinctive words.
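To make the two dispersion notions named in the abstract concrete, here is a minimal Python sketch. It assumes one common formulation of Zeta (the share of target segments containing a word minus the share of comparison segments containing it) and the usual definition of Gries's deviation of proportions. The function names and toy data are hypothetical, and the sketch does not reproduce the paper's exact definition of Eta, which the abstract does not spell out.

```python
from typing import Sequence, Set


def document_proportion(segments: Sequence[Set[str]], word: str) -> float:
    """Share of segments that contain the word at least once."""
    return sum(word in seg for seg in segments) / len(segments)


def zeta(target: Sequence[Set[str]], comparison: Sequence[Set[str]], word: str) -> float:
    """Assumed Zeta variant: difference of document proportions, in [-1, 1]."""
    return document_proportion(target, word) - document_proportion(comparison, word)


def deviation_of_proportions(word_counts: Sequence[int], part_sizes: Sequence[int]) -> float:
    """Gries's DP: half the summed absolute differences between the observed
    share of a word's occurrences in each corpus part and that part's share
    of the whole corpus. 0 means perfectly even dispersion."""
    total_word = sum(word_counts)
    total_size = sum(part_sizes)
    if total_word == 0 or total_size == 0:
        raise ValueError("word and corpus totals must be non-zero")
    return 0.5 * sum(
        abs(count / total_word - size / total_size)
        for count, size in zip(word_counts, part_sizes)
    )


# Toy data (hypothetical): each segment is the set of word types it contains.
target = [{"moon", "sea"}, {"moon"}, {"sea"}, {"moon", "ship"}]
comparison = [{"sea"}, {"ship"}, {"moon"}, {"sea"}]

print(zeta(target, comparison, "moon"))
print(deviation_of_proportions([5, 5, 0, 0], [100, 100, 100, 100]))
```

In this sketch, a positive Zeta score marks a word that is spread more consistently across the target segments than across the comparison segments. DP, in contrast, describes how evenly a word is dispersed within a single corpus.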


DFG Priority Programme SPP 2207 "Computational Literary Studies"


Subproject: "Zeta und Konsorten. Distinktivitätsmaße für die Digitalen Literaturwissenschaften" (Zeta and Company: Measures of Distinctiveness for Digital Literary Studies)

