Computer-based statistical description of phonetical balance for Romanian utterances


Introduction
One of the important changes imposed by the Digital Era concerns the way in which we secure and access our assets. Traditionally understood as a physical object that belongs to the owner, the key that grants access to one's assets has gradually shifted towards something the owner knows and, more recently, towards who the owner is. In fact, the etymology of the word shows the key as a metal piece for opening locks (via Middle English keie, cognate with the Middle Low German keie, which means lance or spear), emphasizing the physical nature of the object. The Digital Era, however, has transformed the key into a piece of information, something that (only) the owner knows, usually in the form of an alphanumeric sequence that the owner provides before accessing digital resources such as financial data in e-banking systems, medical files on e-health platforms, personal data held by cloud storage providers, etc. [1]. In a way, the plethora of digital security solutions, in particular the software ones (the hashing methods used to conceal the digital keys, the mechanisms and algorithms used to tunnel and wrap them, the certification systems used to secure data transfers, etc.), masks the central position that alphanumeric digital keys currently have in the information ecosystem. Moreover, although numerous hybrid authentication methods exist, e.g., two-factor authentication, where the addition of a second recognition method strengthens the verification of the owner's identity [2], we observe that one component of this scheme remains invariant: the (tried and tested) alphanumeric key.
Naturally, the assets themselves have shifted towards the digital realm as well, as have many of our activities. In fact, digital technologies have substantially changed the way we socialize and entertain ourselves, the way we work and learn, and so on, and we now have a new discipline, namely digital sociology, solely dedicated to the way these technologies are impacting our everyday life [3].
Finally, the past decades have placed more emphasis on (implicitly automated) biometric security solutions, which allow the owner of some assets to access them not by using something that he possesses or something that he knows, as done in the past, but rather by identifying himself as the owner [4]. The term biometrics itself is defined as "automated recognition of individuals based on their behavioral and biological characteristics" (see ISO/IEC JTC1 SC37), the most common biometric traits used for user authentication being the fingerprint, the face, the iris, the palm-print, the retina, and the voice. In fact, any human physiological and/or behavioral characteristic can be used as a biometric characteristic, as long as it satisfies a series of requirements such as universality (every person using the system should possess the trait), uniqueness (the trait should be sufficiently different across individuals in the relevant population that they can be distinguished from one another), etc. At the moment, the voice is one of the most used traits [4].
As part of building a voice-biometrics identification system which is largely text-independent (i.e., no pass phrase) and shows little sensitivity to ambient noise [5], we revisit, by means of extensive computer-based investigations, the concept of phonetical balance for Romanian utterances. The goal of our investigation is to obtain statistical descriptors of the phonology of the Romanian language that will be helpful in the development stages of the aforementioned voice-biometrics identification system. To this end, we go beyond the standard distribution of phonemes and analyze the distribution of consonant clusters in Romanian words to identify the most important ones. Moreover, we propose a simple indicator that measures vowel-consonant sequences to show that large clusters of vowels or consonants are infrequent.

Distribution of phonemes
While the mathematics behind the distribution of phonemes in a given text is relatively simple, the main technical challenge lies in finding a set of texts, usually very large, that is representative of the language under scrutiny. In the case of the Romanian language [6], the text was acquired using the Web-as-resource or Web-as-corpus approach (considering as sources mainly online Romanian newspapers and transcripts of the discussions in the European Parliament) [7], which produced more than 9 million phrases, the largest Romanian plain-text corpus to date [7]. The results of this analysis clearly show that the 34 phonemes identified differ markedly in usage, some of them being very common, while others are infrequent. In fact, the six most frequent phonemes account for more than 50% of the entire phoneme usage, while the six least frequent phonemes have occurrence frequencies between 0.27% and 0.03%.
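To illustrate how simple the counting step is, the following minimal Python sketch computes relative symbol frequencies and the cumulative share of the k most frequent symbols. It is not the pipeline of Refs. [6, 7]: it counts letters as a stand-in for phonemes (a proper analysis would first require a grapheme-to-phoneme transcription), and the names symbol_frequencies, top_share and ROMANIAN_LETTERS are hypothetical.

```python
from collections import Counter

# Assumption: letters stand in for phonemes; this is NOT a phonemic transcription.
ROMANIAN_LETTERS = set("aăâbcdefghiîjklmnopqrsștțuvwxyz")


def symbol_frequencies(text, alphabet=ROMANIAN_LETTERS):
    """Return {symbol: relative frequency}, most frequent symbol first."""
    counts = Counter(ch for ch in text.lower() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {sym: n / total for sym, n in counts.most_common()}


def top_share(freqs, k):
    """Cumulative share of the k most frequent symbols."""
    return sum(sorted(freqs.values(), reverse=True)[:k])


# Toy input only; a meaningful estimate needs a very large corpus.
sample = "Oaia e proastă"
freqs = symbol_frequencies(sample)
share_top2 = top_share(freqs, 2)
```

Applied to a large, phonemically transcribed corpus, top_share(freqs, 6) would be the quantity that the text reports as exceeding 50%.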
The aforementioned distribution of phonemes (complemented with related results in Ref. [8]) gives an accurate global description of the Romanian language, which can now be compared with languages (such as English and French) for which such statistical descriptions have a long history. We note, however, that in the case of smaller texts this statistical description is insufficient, as two small texts with similar, if not identical, distributions of phonemes may in fact be substantially different from a phonetic point of view. As an elementary example we note here two simple Romanian sentences, namely S1: "Oaia e proastă" (in English: The sheep is dumb) and S2: "Ia ta e poroasă" (in English: Your blouse is porous), which consist of the same phonemes but differ considerably. The first sentence, for instance, has a run of five consecutive vowel letters once spaces are ignored (namely "oaia e") and two consonant clusters of two letters each (namely "pr" and "st"), while the second has no consonant cluster and its longest series of vowels is of length two (for example "ae" and "oa").
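The contrast between S1 and S2 can be checked mechanically. The sketch below is a minimal illustration, assuming that letters are used in place of phonemes, that spaces are ignored, and that the Romanian vowel letters are a, ă, â, e, i, î, o, u; the helper names letter_runs and longest_run are hypothetical.

```python
from itertools import groupby

# Assumption: vowel letters of the Romanian alphabet (letters, not phonemes).
VOWELS = set("aăâeiîou")


def letter_runs(sentence):
    """Split a sentence (spaces ignored) into maximal vowel ('V') or consonant ('C') runs."""
    letters = [ch for ch in sentence.lower() if ch.isalpha()]
    return [("V" if is_vowel else "C", "".join(group))
            for is_vowel, group in groupby(letters, key=lambda ch: ch in VOWELS)]


def longest_run(sentence, kind):
    """Length of the longest run of the given kind ('V' or 'C')."""
    return max((len(run) for k, run in letter_runs(sentence) if k == kind), default=0)


def same_letters(a, b):
    """True if both sentences use exactly the same multiset of letters."""
    return (sorted(ch for ch in a.lower() if ch.isalpha())
            == sorted(ch for ch in b.lower() if ch.isalpha()))


s1 = "Oaia e proastă"
s2 = "Ia ta e poroasă"
identical_inventory = same_letters(s1, s2)
s1_runs = letter_runs(s1)
s2_runs = letter_runs(s2)
```

Under these assumptions the two sentences share the same letter inventory, yet S1 contains longer vowel runs and two-letter consonant runs, whereas every consonant run in S2 has length one.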
In the language of statistical physics, if the statistical ensemble is very large, then the distribution of phonemes becomes the key descriptor of the text under scrutiny, as all other features (say, vowel and consonant clusters, distribution of words, and so on) average out. For short texts, however, the distribution of phonemes should be complemented with additional information about the vowel-consonant sequences and the types of consonant clusters.

Distribution of consonant clusters
Motivated by the previous example, we embarked on a detailed statistical study of the Romanian vocabulary using the database of Dexonline [9], which is an open-source collection of the main dictionaries of the Romanian language. For our analysis we used a set of more than 90,000 words, which roughly corresponds to the lexicon in Ref. [10], the main dictionary of the Romanian language.
Through a thorough statistical analysis we identified all two-consonant clusters, independent of their position in a word, and ranked them according to their occurrence rate. To simplify the analysis, the results for the letter s also include the letter ș, while those for the letter t also include the letter ț. The bubble plot in Fig. 1 shows the main consonant clusters, indicating their occurrence rate through the size of the bubble. The main message of the plot is that there are a few frequent consonant clusters (such as "st", "nt", "tr", "pr", etc.), which appear in 5% to 10% of all words in the Romanian vocabulary, and numerous other infrequent clusters (such as "lb", "sf", etc.). A rapid inspection of the plot shows a tendency for consonant clusters to use letters from the second half of the consonant series. Moreover, there is a clear asymmetry with respect to the first diagonal, meaning that the frequency rates of a given cluster and its inverse are substantially different. To see this, consider the "tr" cluster, which is more frequent than the "rt" cluster, or, more clearly, the "st" and "ts" clusters: "st" is one of the most important Romanian clusters, while "ts" is nonexistent.
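The counting procedure described above can be sketched as follows. This is a minimal illustration, not the code used for Fig. 1: it assumes that a two-consonant cluster is any pair of adjacent consonant letters, that each word contributes at most once per cluster (so the result is the fraction of words containing the cluster), and that ș and ț are folded into s and t as stated in the text. The function names and the toy vocabulary are hypothetical; the actual study used the Dexonline word list.

```python
from collections import Counter

VOWELS = set("aăâeiîou")
# Fold ș/ț (comma-below and legacy cedilla forms) into s/t, as done in the text.
FOLD = str.maketrans({"ș": "s", "ț": "t", "ş": "s", "ţ": "t"})


def consonant_pairs(word):
    """Set of adjacent consonant-letter pairs occurring in `word`."""
    w = word.lower().translate(FOLD)
    return {a + b for a, b in zip(w, w[1:])
            if a.isalpha() and b.isalpha() and a not in VOWELS and b not in VOWELS}


def cluster_word_rates(words):
    """Fraction of words containing each cluster, most frequent first."""
    words = list(words)
    if not words:
        return {}
    counts = Counter()
    for word in words:
        counts.update(consonant_pairs(word))  # a set, so one count per word
    return {cluster: n / len(words) for cluster, n in counts.most_common()}


# Hypothetical toy vocabulary, for illustration only.
toy_vocabulary = ["strâmb", "prompt", "stradă", "trist", "sare", "oaie", "alb"]
rates = cluster_word_rates(toy_vocabulary)
```

Looking up both rates.get("tr") and rates.get("rt") for a given vocabulary gives a direct check of the asymmetry between a cluster and its inverse.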
What is more interesting is that the distribution of the consonant clusters follows a scale-free-like distribution. Taking P(k) as the probability that a given cluster appears in k words, we observed (see Fig. 2, the upper panel) that $P(k) \approx k^{-\gamma}$, where $\gamma \approx 3.2$. This statistical behavior shows a striking resemblance to the so-called Zipf law, which states that the frequency of a given word is inversely proportional to its rank in the frequency table [11]. A classical textual econometrics study on the Brown Corpus of American English showed "the" as the most frequent word in the vocabulary (with an occurrence rate of almost 7%), "of" as the second most frequent word with an occurrence rate of roughly 3.5%, etc. Similarly, it has been shown that the distribution of word sequences (the so-called n-grams) follows the same pattern, provided the reference texts are large enough [12]. The key feature of the Zipf law is that on a log-log plot the distribution is linear with a negative slope, which is similar to what we noticed in Fig. 2 (the upper panel). In our case "st" was the most frequent consonant cluster, with a frequency rate of 9.4%, "nt" was the second most frequent cluster (8.5% frequency rate), "tr" was the third one (7% frequency rate), etc. This distribution is typical of many systems, ranging from social networks (such as the collaboration of movie actors in films and the co-authorship of papers) to the internet, protein-protein interactions, etc. [13]. While Zipf's law has been verified in numerous contexts, the mechanisms behind it remain largely elusive, despite numerous models which capture some of its features. Zipf, for example, understood the law through the principle of least effort, which has often been revisited by means of advanced mathematical models [14], while others consider the preferential attachment mechanism, which basically says that speakers tend to use some words more often than others [15].
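A rough way to estimate the exponent γ from cluster counts is a straight-line fit on the log-log plot. The sketch below assumes this simple least-squares approach; the text does not state how γ ≈ 3.2 was obtained, and log-log regression is known to be a biased estimator for power laws, so this is only an illustration. The function names are hypothetical.

```python
import math
from collections import Counter


def degree_distribution(word_counts):
    """P(k): fraction of clusters that appear in exactly k words (k > 0)."""
    hist = Counter(k for k in word_counts if k > 0)
    total = sum(hist.values())
    return {k: n / total for k, n in sorted(hist.items())}


def fit_power_law_exponent(pk):
    """Least-squares slope of log P(k) versus log k; returns gamma = -slope."""
    if len(pk) < 2:
        raise ValueError("need at least two distinct k values for a fit")
    xs = [math.log(k) for k in pk]
    ys = [math.log(p) for p in pk.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Synthetic check: an exact power law with exponent 3.2 is recovered.
synthetic_pk = {k: k ** -3.2 for k in range(1, 50)}
gamma = fit_power_law_exponent(synthetic_pk)
```

With real data, the input to degree_distribution would be the list of per-cluster word counts, one integer per cluster; the noisy tail at large k usually calls for binning or a maximum-likelihood estimate instead of a plain regression.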
It is tempting to look for a correlation between the frequency rates of the consonant clusters and their etymology, but such an analysis offers little insight, as the Romanian dictionaries only record the language from which a given word entered the Romanian vocabulary, and not the language of origin [16]. Finally, we note that the distribution of three- and four-letter consonant clusters brings only a small correction to the aforementioned statistics and that a detailed study will be reported elsewhere.

Phonetical balance
The previous discussion on consonant clusters brings some clarification to the phonology of the Romanian language, but an instrument is needed to quantify the vowel-consonant sequences. To this end, we introduce for each word in the vocabulary the function

$$ m = \frac{1}{n-1} \sum_{j=1}^{n-1} l_j \, l_{j+1}, \tag{1} $$

where n is the number of letters of the word and $l_j$ is a two-valued function equal to +1 if the j-th letter is a vowel and −1 if the j-th letter is a consonant. Note that a normalization factor equal to 1/(n − 1) has been introduced in the definition of m so that the values of m for words of different lengths can be compared directly. For words consisting of perfectly alternating sequences of vowels and consonants, such as "calamitate" (in English: calamity), "repetare" (in English: repetition), "sare" (in English: salt), etc., the m-function is equal to −1, while for the (admittedly fewer) words consisting almost entirely of vowels or consonants, such as "oaie" (in English: sheep) and "ouă" (in English: eggs) on the vowel side and "strâmb" (in English: crooked) and "prompt" (in English: prompt) on the consonant side, the m-function is always larger than 0. In fact, for "oaie" and "ouă" the m-function is exactly +1, but this holds only for a very short list of words. In Figs. 3 and 4 we show the distribution of the number of Romanian words as a function of the parameter m, using histograms with different numbers of bins, to show that the majority of words correspond to negative values of m (the average value of m, ⟨m⟩, being in fact very close to −0.5), thereby indicating that most Romanian words are structured as (slightly) imperfect vowel-consonant sequences. Fig. 4, in particular, shows very clearly that there is a large set of words, around 18.5% of the investigated vocabulary, which corresponds to m = −1.
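The m-function is straightforward to evaluate. The sketch below assumes that m is the normalized sum of products of neighbouring letter signs, m = (1/(n − 1)) Σ l_j l_{j+1} with l_j = +1 for a vowel and −1 for a consonant, which reproduces the example values quoted in the text; the vowel set and the function name m_value are assumptions made for illustration. Exact fractions are used so that the results can be compared directly with values such as 3/11.

```python
from fractions import Fraction

# Assumption: vowel letters of the Romanian alphabet.
VOWELS = set("aăâeiîou")


def m_value(text):
    """Vowel-consonant indicator m for a word or sentence (non-letters are ignored)."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    n = len(letters)
    if n < 2:
        raise ValueError("m is undefined for fewer than two letters")
    signs = [1 if ch in VOWELS else -1 for ch in letters]  # l_j = +1 vowel, -1 consonant
    pair_sum = sum(a * b for a, b in zip(signs, signs[1:]))  # sum of l_j * l_{j+1}
    return Fraction(pair_sum, n - 1)


examples = ["calamitate", "sare", "oaie", "strâmb", "prompt"]
values = {word: m_value(word) for word in examples}
```

Because spaces are dropped before the signs are assigned, the same function can be applied to whole sentences as well as to single words.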
This shows, incidentally, that the first sentence discussed in Sec. 2 consists of words from the tail of the distribution of m (though its consonant clusters are among the very frequent ones), while the second sentence consists of words from the bulk of the distribution. Let us also note that approximately 26% of the Romanian words correspond to m < −0.75. Finally, let us mention that it is very tempting to compute the global m value considering the reported frequencies of Romanian words [17], but those computations were done before the advent of the computer and the validity of some of the reported frequencies has been questioned [16]. We can, however, use equation (1) for entire sentences, just as for words, and obtain $m_{S1} = 3/11$ for the first sentence and $m_{S2} = -5/11$ for the second one, which indicates that the first sentence is less representative in a statistical sense with respect to the vowel-consonant (or consonant-vowel) sequences than the second one.

In this paper we have analyzed by computational means the phonetical balance of Romanian words and introduced two indicators that go beyond the standard distribution of phonemes. We have shown that the distribution of consonant clusters in Romanian words obeys a scale-free-like distribution and that large clusters of vowels or consonants are infrequent. The distribution of consonant clusters is similar to the well-known Zipf law, which gives the distribution of words and short sentences, in that there are a few very frequent consonant clusters and numerous others which are considerably less frequent. Our results suggest that a reliable voice-based biometrics solution should be benchmarked in particular against utterances which consist of words with infrequent consonant clusters and words with positive m-values, as their statistical unrepresentativeness makes them good candidates for identifying the flaws of a given biometrics solution.
As a natural extension of this work we intend to refine the current results by taking into account the position of a consonant cluster with respect to the syllables of a word. Moreover, future research should focus on a global indicator (such as the Shannon entropy) which considers not only the relations between nearest-neighbour letters, but also long-range in-word correlations between letters and clusters.