Relationship between phoneme-level spectral acoustics and speech intelligibility in healthy speech: a systematic review

ABSTRACT This study aims to systematically review, according to the PRISMA guidelines, original articles investigating the link between spectral acoustic measures in healthy talkers and perceived speech intelligibility. Twenty-two studies were retained. Eighteen papers investigated vowel acoustics, one studied glides and eight articles investigated consonants, mostly sibilants. Various spectral measures and intelligibility estimates were used. The following measures were shown to be linked to sub-lexical perceived speech intelligibility ratings: for vowels, steady-state F1 and F2 measures, the F1 range, the [i]-[u] F2 difference, F0-F1 and F1-F2 differences in [ɛ-ae] and [I-ɛ], the vowel space area, the mean amount of formant movement, the vector length and the spectral change measure; for consonants, the centroid energy and the spectral peak in the [s]-sound, as well as the steady-state F1 offset frequency in vowels preceding [t] and [d]. To conclude, as speech is highly variable even in healthy adult speakers, a better understanding of the imprecisions in healthy spontaneous speech will provide a more realistic baseline for the investigation of disordered speech. To date, no acoustic measure is able to predict speech intelligibility to a large extent. Extensive research is still needed to identify relevant acoustic combinations that could account for perceived speech variations (e.g. vowel and consonant reductions) and to gather normative data from a large number of healthy speakers. To that end, speech-related terms (e.g. intelligibility, comprehensibility, severity) need to be clearly defined and methodologies described in sufficient detail to allow for replication, cross-comparisons/meta-analyses and pooling of data.


Introduction
Speech is an essential function in everyday life that requires complex interactions between the generation of air pressure, the vibration of the vocal folds, and the modulation by the resonating cavities of the phonatory tract (Fitch, 2000; Honda, 2008). Not being correctly understood, for example in dysarthria (Stipancic, Tjaden, & Wilding, 2016), can limit educational, occupational and social participation, hence reducing the quality of life (Hustad, 2008). Therefore, when speech production is impaired, assessing and quantifying the deficit is essential to determine the overall degree of impairment as well as to provide a follow-up measure (Raymond D. Kent, 1992; Miller, 2013; Stipancic et al., 2016; Sussman & Tjaden, 2012).
However, speech is not only variable in a pathological context (Benzeguiba et al., 2007; Miller, 2013). Some healthy talkers are indeed more intelligible than others, which was shown to be linked to the speaker's acoustic-phonetic production rather than to the listener's perception (Bond & Moore, 1994; Cox, Alexander, & […]). […] In this study, we will use the psycholinguistic model of Levelt (Levelt, 1995; Levelt, Roelofs, & Meyer, 1999) as the reference model of speech production. In this model (Levelt et al., 1999), the constituent segments (phonemes) as well as the metrical frame (syllable number and lexical stress position) are retrieved for each word. The phonemes are then associated with the frame, and the resulting phonological syllable is compared against the 'syllabary' (Schiller, 2006). The syllabary contains the articulatory gesture plans of frequent phonological syllables; for infrequent syllables, sub-syllabic units must be retrieved (Aichert & Ziegler, 2004; Levelt, 1995; Levelt et al., 1999). Other speech production models, such as Guenther's DIVA model (Bohland & Guenther, 2006; Guenther, 1995; Guenther, Ghosh, & Tourville, 2006), also consider phonemes and syllables as the basic units. Levelt's model leads us to the term 'intelligibility'. While this term is used in various contexts, each of which colours its definition, in this work and in accordance with Levelt's model, intelligibility is defined as the accuracy with which the acoustic signal is decoded by the listener at the segmental (phoneme and syllable) levels (Ghio et al., 2018; Hustad, 2008; Lalain et al., 2020; Yorkston, Strand, & Kennedy, 1996). Both the chosen speech production model and the definition of intelligibility thus led us to focus on phoneme-level measures in this review, keeping in mind that syllable-level measures also contribute to speech intelligibility in running speech.
As per the above definition, the most appropriate way to perceptually assess intelligibility is to minimize signal-independent (lexical, syntactic and semantic) cues (Ghio et al., 2018; Lindblom, 1990), in order to focus on the speech production processes of sub-lexical units. This can be done using vowel, consonant, syllable or word identification scores, or pseudowords (Ghio et al., 2018; Lalain et al., 2020; Tremblay et al., 2017). The Frenchay Dysarthria Assessment (Enderby & Palmer, 2008), for example, makes use of orthographic transcriptions to compute a percentage of correctly identified items. Speech intelligibility can also be assessed with an identification task using minimal pairs, as in the Diagnostic Rhyme Test, DRT (Voiers, 1983), or in the Single Word Intelligibility Test (Ray D. …). Other tasks exist, such as overall ratings of speech intelligibility on visual analogue scales, using sentences. Although not in line with the above definition, they are a substantial part of the measures commonly used in clinical practice under the umbrella term 'intelligibility'. They must therefore be considered but differentiated from measures that fit the more specific definition. While these various perceptual tasks are very informative, they rely on subjective ratings that are biased, among other factors, by the familiarity of the rater with the subject's speech and with the test stimuli; they are also usually time-consuming (Fontan et al., 2014). While perceptual measures remain the gold standard in clinical settings (Kent & Kim, 2003; Stipancic et al., 2016), the acoustic analysis of speech provides a more objective assessment method that helps alleviate the various inherent biases of perceptual methods. Therefore, these objective measures are increasingly gaining interest for speech assessment purposes (Carmichael & Green, 2004; T. Lee et al., 2016; Maniwa, Jongman, & Wade, 2009).
An important question that arises is whether the imprecisions in healthy speech can be captured by acoustic-phonetic measures. Several studies, such as in healthy ageing (Hazan, 2017;Kuruvilla-Dugdale, Dietrich, McKinley, & Deroche, 2020), indicate that a large part of the variability in healthy speech is indeed 'traceable to specific acoustic-phonetic characteristics of the talker' (Bradlow, Torretta, & Pisoni, 1996;Metz, Schiavetti, Samar, & Sitler, 1990). The field of study of acoustic measurements in speech is vast, and we have chosen to conduct this systematic review on one of its aspects, the frequency-domain measures. Indeed, it has been shown that spectral cues have a greater contribution than temporal features in stimuli identification by normal-hearing listeners (Souza, Wright, Blackburn, Tatman, & Gallun, 2015). Furthermore, hearing-impaired listeners have difficulties in the identification of consonants (Dubno, Dirks, & Schaefer, 1989;Preminger & Wiley, 1985) and vowels (Li, Ning, Brashears, & Rife, 2008;Molis & Leek, 2011) due to a loss in the frequency content, which highlights the importance of spectral cues in phoneme-level intelligibility.
We have introduced the interest of focusing on the behaviour of segmental spectral measures in healthy speech before using these objective intelligibility measures in specific speech-disordered populations. Therefore, the objective of this study is to systematically review papers investigating the link between spectral acoustic measures and perceived speech intelligibility in 'natural' (that is, not consciously altered) speech in healthy talkers, as rated by healthy listeners without hearing loss or cognitive impairment and considering a 'normal' sound wave transfer (Fontan, 2012).

Protocol and registration
This systematic review has been carried out according to the guidelines of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement and checklist. These recommendations help the researcher to carry out a rigorous and transparent review of the scientific literature by providing procedures on how to search for, select and analyse papers retrieved from scientific databases.
This study was registered on PROSPERO under the registration number CRD42019129597.

Eligibility criteria
In order to be included in this review, articles had to:
- address both notions of intelligibility and speech-related spectral acoustics (excluding papers addressing environmental acoustics);
- investigate natural speech of healthy adult speakers over 18 years of age (thus also excluding papers studying modified or vocoded speech when no data about the unprocessed speech was also provided);
- use segmental acoustics (not only global acoustic measures, such as the long-term average spectrum over a whole sentence);
- be written in English;
- be original articles (oral presentations, case studies, author letters, conference proceedings, and reviews were excluded);
- include at least six healthy speakers.
Other exclusion criteria were:
- the exclusive investigation of voice/phonation (dysphonia, voice quality measures), and not speech per se;
- addressing tonal languages, for which intelligibility analyses additionally rely on lexical tone and prosody (Ding, McLoughlin, & Tan, 2003; Yiu, van Hasselt, Williams, & Woo, 1994);
- the exclusive use of durational measures (such as vowel length or speaking rate);
- the study of the perception of speech by hearing-impaired listeners;
- the application of automatic speech processing techniques, such as deep neural networks.
All eligibility criteria had to be met in order for the papers to be included in this review.

Information sources and search strategy
The literature search was carried out on the fourth of December 2018 in two biomedical databases: Embase and PubMed. No date-related exclusion criterion was used, as some relevant sources known to the authors date back to the mid-1950s. All references of the included papers were checked for additional relevant articles. The search terms and syntax are listed in Table 1.
The titles and abstracts were retrieved via EndNote X9 and screened by two independent raters (TP and MB), applying the aforementioned selection criteria. In view of the large number of abstracts, the whole set was divided into two. Each rater thus reviewed half of the whole set, plus a randomly selected 20% of the abstracts, taken from the other half. Hence, 40% of the abstracts were read by both raters, allowing a weighted Kappa to be computed to assess the inter-rater agreement. Agreement interpretation guidelines (Landis & Koch, 1977) are: <.00: poor; .00-.20: slight; .21-.40: fair; .41-.60: moderate; .61-.80: substantial; .81-1.00: almost perfect. Differences in the eligibility ratings were resolved by reaching a consensus. The full-text articles of the selected papers were then retrieved and reviewed by each rater. A flowchart illustrating the article selection process according to the PRISMA guidelines is shown in Figure 1 in the Results section.
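As an illustration of the agreement statistic described above, the following minimal sketch computes a weighted Kappa for two raters and maps it onto the Landis and Koch (1977) bands. It assumes that ratings are coded as integer category indices and that linear disagreement weights are used; the weighting scheme actually applied in this review is not specified, and the function names are hypothetical.

```python
def weighted_kappa(ratings_a, ratings_b, n_categories):
    """Cohen's weighted kappa with linear disagreement weights (an assumption)."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    k = n_categories
    # Observed joint proportions of (rater A category, rater B category).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[a][b] += 1.0 / n
    # Marginal proportions of each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    span = max(k - 1, 1)
    disagree_obs = 0.0
    disagree_exp = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / span  # 0 on the diagonal, 1 at maximal distance
            disagree_obs += weight * observed[i][j]
            disagree_exp += weight * row[i] * col[j]
    if disagree_exp == 0.0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - disagree_obs / disagree_exp


def landis_koch_label(kappa):
    """Interpretation bands quoted in the text (Landis & Koch, 1977)."""
    if kappa < 0.0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Toy data: 0 = exclude, 1 = include (not the ratings of this review).
print(landis_koch_label(weighted_kappa([0, 0, 1, 1], [0, 0, 1, 1], 2)))
```

With only two categories (include/exclude), the linear weights reduce to the unweighted Kappa; the weights only matter when more than two ordered categories are rated.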

Critical appraisal of methodological quality and level of evidence
The methodological quality of the selected papers was rated using the QualSyst tool (Kmet, Lee, & Cook, 2004). This tool was developed as a scoring system in order to methodologically assess the quality of quantitative as well as of qualitative research papers, by analysing, among others, the study design, the research question, the study group selection and description, and the control of confounding factors. As interpretation guidelines, a score >80% was considered as strong methodological quality, 60-79% as good, 50-59% as appropriate and <50% as poor quality. The latter was considered as an exclusion criterion.
The National Health and Medical Research Council Hierarchy (NHMRC, 1999) was used to assess the level of evidence. Six levels are described: Level I (highest level), systematic reviews of randomized controlled trials; Level II, randomized controlled trials; Level III-1, pseudo-randomized controlled trials; Level III-2, comparative studies with concurrent controls and allocation not randomized (cohort studies, case control studies, or interrupted time series with a control group); Level III-3, comparative studies without concurrent controls: historical controls, two or more single-arm studies, or interrupted time series without a parallel control group; and Level IV (lowest level), case series. Level IV research reports, expert opinions and systematic reviews were not analysed further.

Data items
After selection based on the eligibility criteria and the methodological quality assessment, the following information was extracted for each article: the study population (number, age, gender, language), the speech sample used for the acoustic measure(s) (targeted phonemes), the acoustic parameter(s), the perceptual intelligibility measure(s), the main conclusion regarding the link between acoustics and intelligibility and the descriptive data if available.
No contact was sought with authors to inquire about unreported data.

Study selection
A total of 4818 titles and abstracts were retrieved from the databases (after automatic removal of most of the duplicates). Each of the two independent raters screened half of these records (2405), as well as 20% (964) of the other half. The raters agreed on the eligibility criteria for 1792/1928 (93%) abstracts, with a weighted Kappa of .89, corresponding to an 'almost perfect' agreement according to the guidelines of Landis and Koch (1977).
Two hundred and sixty-seven full-text articles were reviewed, of which 22 were retained. Nine of these studies addressed the association between spectral acoustic and perceptual measures (A01-A09). The remaining 13 papers, albeit not assessing the link per se, were retained because they provided quantitative data for both acoustic measures and perceptual ratings in healthy speakers, which provides useful information.
The study selection process is illustrated in Figure 1. A detailed synthesis of the 22 included studies is available in Appendix A. For readability purposes, an identification code has been assigned to each of the 22 papers (see Table 2), which will be used for the in-text citations throughout this article.

Quality assessment
The QualSyst scores of the 22 papers ranged from 71% (good methodological quality) to 100% (strong quality). Only one article's methodological quality was graded as 'good'; the other 21 were rated as 'strong'.
According to the NHMRC hierarchy for the level of evidence assessment, 14 papers were categorized as level III-2 evidence ('comparative study with concurrent controls') and the other eight papers as level III-3 evidence ('comparative study without concurrent controls'). The rating for each individual paper can be found in Table 2.

Study populations
Out of the 22 studies, 14 originally included both a subject group and a healthy control group, of which only the healthy control group was kept for the present analysis. The remaining eight studies only included healthy speakers as a study group. Keeping in mind that only studies including more than five subjects were retained, the median size of the study sample was 15 (min.: 8, max.: 93), with an interquartile range of 18.5. Regarding the gender distribution in the samples, most of the studies (20/22, 91%) included both men and women. In 13 of these studies (65%), the men/women ratio was 1:1 (i.e., perfect gender balance). Four studies showed a small gender imbalance (i.e., less than 20% difference between both gender groups), while three showed a preponderance of men (>20% difference). Of the two remaining studies, one included only men (A10), and the other did not mention the subjects' gender(s) (A14). With regard to the age factor, half of the studies were carried out on groups aged more than 50 years, 10 on subjects aged less than 50 years, and one did not report the study population's age (A03). Regarding the investigated languages, seventeen out of the 22 studies (73%) were carried out in English. Eleven of these used American English (of which three specified an Upper Midwest dialect), one used British English, one used New Zealand English, and the remaining four did not specify the English variant. Two studies were carried out in Dutch, two in French (of which one in Quebec French), one in German, and one both in Korean and in English.
Notes to Table 2. a Methodological quality: strong > 80%; good 60-79%; appropriate 50-59%; poor < 50%. b NHMRC hierarchy: Level I Systematic reviews; Level II Randomized control trials; Level III-1 Pseudo-randomized control trials; Level III-2 Comparative studies with concurrent controls and allocation not randomized (cohort studies), case control studies, or interrupted time series with a control group; Level III-3 Comparative studies with historical control, two or more single-arm studies, or interrupted time series without a control group; Level IV Case series. Note: The studies were ordered according to (1) the type of outcome: A01-A05 = direct correlation between acoustics and perceptual ratings; A06-A09 = indirect investigation of the link between acoustics and perceptual ratings; A10-A22 = quantitative data for both acoustics and perceptual ratings, without investigation of the link; (2) the chronological order.

Speech samples and spectral measures
The different phonemes analysed in the studies were extracted from isolated words or from words in sentences. Two studies analysed isolated phonemes (…).
Glides. One article (A22) studied the two glides [w, j] in addition to vowels, using the F2 slopes as a measure of the rate of phonatory tract modification.
Among the eight papers investigating consonants, five used spectral moment analyses. Four of them used the first moment, while the fifth used the second moment. The remaining acoustic measures were studied in single studies and are reported in the outcome table (Appendix A).

Perceptual measures
Percent correct identification. Ten studies used the percentage of correctly identified stimuli. One paper did not describe the identification task (A02). The remaining nine all used a multiple-choice task, six in which the listener had to choose the target in a list of words, two in which the listener had to choose between two targets (A06 and A07), and one (A19) in which the listener had to choose the target vowel among 12 vowels (monophthongs or diphthongs). None of the studies used a transcription task.
Ordinal scales. Seven studies used Likert-type equal appearing interval scales, out of which five asked the listeners to rate the 'overall intelligibility', three asked them to rate the 'articulation', one the 'speech clarity', one the 'speech precision' and one the 'speech severity'. Two studies used rating scales where a high score indicated a good speech rating ('positive scales'); four studies used 'negative scales' (a high score meaning a negative rating). One study used both types of scales (A13).
Visual analogue scales (VAS). Five papers used visual analogue scales, out of which two asked the raters to judge the 'speech clarity' (A05, A16) and the others respectively the 'overall intelligibility' (A17), the 'speech precision' (A17), the 'articulatory precision' (A20), the 'ease of understanding' (A20) and 'how much [the listener] understood of what the person said' (A22).
Three of the studies used positive VAS scales (a high score meaning a good overall intelligibility), the other two used negative VAS scales (a high score indicating a low overall intelligibility).
Direct magnitude estimation (DME). Two studies used direct magnitude estimation with a modulus of 100. In one study, listeners were asked to rate 'overall severity' (negative scale) (A01), in the other they were asked to rate 'overall intelligibility' (instruction: 'ease to understand') on a positive scale (A12).

Outcome measure
Nine of the 22 retained articles analysed the link between spectral acoustic and perceptual measures. Two different methodologies can be identified. Five articles (A01-A05) directly addressed the correlation between acoustic and perceptual measures (VAS, DME and Likert scales or percent-identification scores). Four other articles (A06-A09) indirectly investigated the link between acoustics and perceptual ratings, by investigating acoustic differences between groups that had been created based on their intelligibility (A09), or by analysing acoustic differences between two correctly perceived phonemes/syllables: [ae] and [I] vs. [ɛ] (A08). The remaining 13 articles (A10-A22) analysed spectral measures as well as perceptual measures but did not directly address the association between both.

Summary of findings
The conclusions of the different studies are reported in the outcome table (Appendix A), sorted into three categories: the studies directly addressing the link between spectral and perceptual measures; the studies indirectly investigating this link; and the studies only providing descriptive data for acoustics and perceptual ratings without analysing the link.
Regarding the first category, the significant and non-significant correlations are shown in Table 3. Significant correlations between spectral measures and perceptual ratings have been measured in vowels only, for steady-state F1 and F2 measures (A04), the F1 range in men (A03), the [i] vs [u] F2 difference (A02), the vowel space area (A03), the relative change in the acoustic-articulatory vowel space area (A05), the mean amount of formant movement in women (A03) and the dynamic vector length measure (A04).
Among the studies that indirectly addressed the link between spectral acoustics and perceptual estimates (i.e., without correlations), A06 and A07 targeted consonant measures, whereas A08 and A09 focused on vowels. In A06, the fricative centroid energy and the fricative spectral peak in the [s]-sound in [si] and [su] were found to be acoustic underliers of the coarticulation effect, the values being significantly higher for the [s] in the syllables identified as [si]. A07 found significantly higher steady-state F1 offset frequencies in vowels preceding [t] than for [d], in native English speakers. The authors concluded that this acoustic measure is a good indicator of the correct perception of the voiced/voiceless contrast in apico-alveolar stop consonants. Regarding the measures targeting vowels, significant F0-F1 and F1-F2 differences were found in A08, for the correctly identified vowels in the pairs [ɛ-ae] and [I-ɛ]. Hence, the authors concluded that these measures are related to the speech intelligibility, as they seem to be linked to the perception of the tongue-height contrast. The F1-F2 difference was considered to be the primary cue, whereas the F0-F1 difference was interpreted as a secondary cue, linked to the F2-F1 difference. In A09, the 'spectral change' measure was found to be significantly larger for speakers with a high clear speech word identification benefit.

Discussion
The data from this review confirms the highly variable nature of speech in healthy adult speakers. In light of the differing rating tasks and instructions (e.g., rating on visual analogue scales of intelligibility vs articulatory precision) and targeted speech units (e.g., percent correct identification of phonemes vs words), no aggregated variability measure could be computed across the studies in this review. Among the studies using percent correct identification, for example, while four found values higher than 90% (on words, isolated vowels and vowels in CVC syllables), four others found mean scores between 60.6% and 71% (on phonemes in CVC syllables and on syllables). Speech variability in healthy speakers is also found between subjects within the different studies. For example, while three of the studies using percent correct identification scores report a relatively low standard deviation (ranging from 1.12% to 4%), the studies using ordinal scales show a higher variability: if all the results are normalized to percentages, the standard deviations range from 6.25% to 12%. These results illustrate that even in healthy talkers, the physiological limits do not always allow the speech production system to meet the many demands of spontaneous speech. The resulting 'imprecisions' are mainly found at the phoneme level (Rossi & Peter-Defare, 1998; Schiller, 2006), leading to a certain overlap of speech sound categories, i.e., vowel and consonant reductions, as well as phoneme omissions (Benzeguiba et al., 2007; Guenther, 1995; Meunier, 2007; Van Son & Pols, 1996). The aim of this review was to investigate further how the variations in healthy speech can be measured in order to be taken into account when analysing speech in patient populations. Indeed, the publication dates of the retained papers, of which only three date back to the 1990s, illustrate that the rise of technology has led to an increasing interest in the acoustic investigation of speech. This is mainly due to the fact that acoustic measures no longer have to be carried out manually and are thus faster to obtain as well as more reliable.
In the next section, we will thus focus on the spectral acoustic underpinnings of intelligibility.

Spectral measures of speech intelligibility in healthy speakers
In our review, most of the studies using spectral measures focused on vowels. Vowel reduction in informal speech is a well-described, universal phenomenon (Van Son & Pols, 1996). Two types of reduction are found (Maurová Paillereau, 2016): vowel centralization and contextual assimilation. Vowel centralization is observed when formant frequencies tend to those of a neutral vowel, whereas contextual assimilation occurs when a vowel's formant frequencies change toward the acoustic loci of neighbouring consonants.
The data in this review shows that steady-state formant measures (F1, F2, F1 range, F2 difference between /i/ and /u/, F1-F2 difference in [ɛ-ae] and [I-ɛ], vowel space area [VSA]) are linked with vowel identification scores (A02, A03, A04, A08). The VSA is commonly used to account for vowel centralization, often in pathological speech (Liu, Tsao, & Kuhl, 2005; Sapir, Połczyńska, & Tobin, 2009; Weismer, Jeng, Laures, Kent, & Kent, 2001), but has also been shown to be sensitive to intelligibility differences in healthy speech (Bond & Moore, 1994) and to articulatory changes in clear speech (Lam, Tjaden, & Wilding, 2012; Smiljanić & Bradlow, 2009). The VSA is to some extent related to the size and shape of the resonance cavities created by the jaw and tongue positions (Sandoval, Berisha, Utianski, Liss, & Spanias, 2013), and thereby provides a global overview of the articulatory working space. However, it has shown inconsistent results (Lansford & Liss, 2014) and might not be sensitive enough to subtle vowel articulation changes, both in healthy speech (Ferguson & Kewley-Port, 2007) and in motor speech disorders (Whitfield & Goberman, 2014). It has been argued that not all Euclidean distances of the vowel space contribute equally to the differentiation between healthy and pathological speakers. In light of this asymmetry in the sensitivity of vowel formants to articulatory changes, the Euclidean distance between /i/ and /u/, which was found to be the most sensitive marker, has been suggested instead. The F2 difference between /i/ and /u/ was also shown to be related to vowel intelligibility in A02. Furthermore, Lam et al. (2012) found that in clear speech, high tense and lax vowels (/i, ɪ, u, ʊ/) contributed most to the vowel space expansion. These observations indicate that the formant measures in these vowels should be prioritized for diagnostic purposes.
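To make the VSA concrete, the sketch below computes the area of the polygon spanned by corner vowels in the F1×F2 plane with the shoelace formula, together with the Euclidean distance between two vowels. This is one common way of operationalizing these measures and not necessarily the procedure of any reviewed study; the formant values and function names are hypothetical.

```python
import math


def vowel_space_area(corners):
    """Polygon area (Hz^2) from (F1, F2) corner vowels listed in order
    around the polygon, using the shoelace formula."""
    if len(corners) < 3:
        raise ValueError("at least three corner vowels are needed")
    twice_area = 0.0
    for (x1, y1), (x2, y2) in zip(corners, corners[1:] + corners[:1]):
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def formant_distance(vowel_a, vowel_b):
    """Euclidean distance (Hz) between two vowels in the F1xF2 plane."""
    return math.hypot(vowel_a[0] - vowel_b[0], vowel_a[1] - vowel_b[1])


# Hypothetical steady-state (F1, F2) values in Hz, for illustration only.
vowel_i = (300.0, 2300.0)
vowel_a = (750.0, 1300.0)
vowel_u = (320.0, 800.0)

print(vowel_space_area([vowel_i, vowel_a, vowel_u]))  # triangular VSA
print(formant_distance(vowel_i, vowel_u))             # /i/-/u/ distance
```

Because the area is expressed in Hz², it depends on the scale of each speaker's formants, which is one motivation for the normalized alternatives discussed next.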
Several alternatives to the VSA have been suggested, such as the vowel articulation index (VAI) (A15) and its inverse, the formant centralization ratio (FCR) (A20), designed to minimize inter-speaker variability and maximize the sensitivity to vowel reduction (Sapir, Ramig, Spielman, & Fox, 2010). However, all of the above measures only use the midpoint of three to four corner vowels of the vowel space. Whitfield and Goberman (2014) therefore suggested another alternative measure, the acoustic-articulatory vowel space (AAVS), which uses formant measures across the voiced portions of a whole utterance in continuous speech and thus provides a more global and supposedly more sensitive measure (Whitfield & Goberman, 2014, 2017). Furthermore, the AAVS has been shown to be significantly larger in clear speech (A05) (Whitfield & Goberman, 2017). It would therefore be interesting to further investigate how the AAVS correlates with segmental perceptual intelligibility estimates and accounts for variations in healthy speech.
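The relation between the FCR and the VAI can be illustrated with a short sketch. The formula used here, FCR = (F2u + F2a + F1i + F1u) / (F2i + F1a), is the one generally attributed to Sapir et al. (2010); it is not spelled out in the present review, so it should be read as an assumption, as should the hypothetical formant values.

```python
def formant_centralization_ratio(vowel_i, vowel_a, vowel_u):
    """FCR from (F1, F2) pairs in Hz for /i/, /a/ and /u/.
    Formula assumed after Sapir et al. (2010); the value increases
    as the vowels centralize."""
    f1_i, f2_i = vowel_i
    f1_a, f2_a = vowel_a
    f1_u, f2_u = vowel_u
    return (f2_u + f2_a + f1_i + f1_u) / (f2_i + f1_a)


def vowel_articulation_index(vowel_i, vowel_a, vowel_u):
    """VAI, defined as the inverse of the FCR."""
    return 1.0 / formant_centralization_ratio(vowel_i, vowel_a, vowel_u)


# Hypothetical steady-state (F1, F2) values in Hz, for illustration only.
vowel_i = (300.0, 2300.0)
vowel_a = (750.0, 1300.0)
vowel_u = (320.0, 800.0)

print(formant_centralization_ratio(vowel_i, vowel_a, vowel_u))
print(vowel_articulation_index(vowel_i, vowel_a, vowel_u))
```

Since both indices are ratios of formant frequencies, they are dimensionless, which is consistent with their stated aim of reducing inter-speaker variability.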
Regarding dynamic formant measures, the 'formant movement' (A03), 'vector length' (A04) and 'spectral change' (A09) measures show that vowels with larger changes in the F1×F2 space are significantly better identified. Lam et al. (2012) also found increased values of dynamic vowel formant measures in clear speech. These measures are related to intra-vowel antero-posterior tongue movements and changes in tongue height. They could thus also be useful in the investigation of imprecisions due to motor constraints in informal speech and subsequently in pathological speech.
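As a rough illustration of what a 'change in the F1×F2 space' can mean, the sketch below computes two quantities from a formant track sampled at several time points within a vowel: the straight-line distance between the first and last samples, and the summed length of the path through all samples. The exact definitions and sampling points used in A03, A04 and A09 are not given here, so these functions are hypothetical stand-ins rather than the authors' measures.

```python
import math


def vector_length(first_point, last_point):
    """Straight-line distance (Hz) between two (F1, F2) samples of a vowel."""
    return math.hypot(last_point[0] - first_point[0],
                      last_point[1] - first_point[1])


def trajectory_length(track):
    """Summed distance (Hz) along consecutive (F1, F2) samples of a vowel."""
    return sum(vector_length(p, q) for p, q in zip(track, track[1:]))


# Hypothetical (F1, F2) samples in Hz taken at three points in a vowel.
track = [(600.0, 1700.0), (560.0, 1850.0), (500.0, 1950.0)]

print(vector_length(track[0], track[-1]))
print(trajectory_length(track))
```

The path length is never smaller than the straight-line distance, so the two values diverge for vowels whose formants follow a curved trajectory.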
Studies targeting the spectral features of consonants are rarer in our review, although consonants are reduced as much as vowels in informal speech and this articulatory reduction affects their intelligibility (Van Son & Pols, 1999). In this review, the fricative centroid energy and the fricative spectral peak in the [s]-sound in [si] and [su] were found to be acoustic underliers of the coarticulation effect (A06). The fricative centroid energy (or 'centre of gravity' [CoG]) is the first of the four spectral moment measures (Jongman, Wayland, & Wong, 2000) and corresponds to the 'frequency that divides the spectrum into two halves' (Yoon, 2015). It has been shown to be decreased in non-plosives in spontaneous speech of healthy talkers (Van Son & Pols, 1996), making it a relevant acoustic measure of consonant reduction. Spectral moment measures consider and describe the whole spectrum as a statistical distribution. Evers et al. argued that it is wiser to consider the global aspect of sibilant spectra rather than specific frequency regions (Evers, Reetz, & Lahiri, 1998). Indeed, sibilants are characterized by two sound sources, one at the tongue constriction and one at the teeth (Fant, 1960), which makes spectral peak locations difficult to predict. Also, the spectral shape of consonants is less well defined than the clear formant structure of vowels. Therefore, the description of the overall spectral shape of consonants should be preferred to the use of specific frequency regions ('formant patterns') (Fant, 1960; Stevens & Blumstein, 1978). Another argument in favour of using spectral moments is that they are said to be correlated with the length and shape of the cavity in front of the articulatory constriction (Behrens & Blumstein, 1988; Kay, 2012; Stevens, 1998; Yoon, 2015). Hence, they can lead to an articulatory interpretation. However, study A06 demonstrates that spectral moments are likely to vary according to the vowel context, i.e., to coarticulation.
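Treating the spectrum as a statistical distribution, as described above, can be sketched as follows. The example normalizes a power spectrum so that it sums to one and computes the four moments from it: the first moment as the power-weighted mean frequency, then variance, skewness and excess kurtosis. Conventions differ between tools (for instance, weighting by amplitude rather than power, or reporting the standard deviation instead of the variance), so the choices below are assumptions, and the spectrum values are hypothetical.

```python
def spectral_moments(freqs_hz, power):
    """Return (centre of gravity, variance, skewness, excess kurtosis)
    of a spectrum treated as a probability distribution over frequency."""
    if len(freqs_hz) != len(power) or not freqs_hz:
        raise ValueError("need matching, non-empty frequency and power lists")
    total = sum(power)
    if total <= 0.0:
        raise ValueError("the spectrum must contain some energy")
    weights = [value / total for value in power]  # normalize to sum to 1
    mean = sum(f * w for f, w in zip(freqs_hz, weights))
    variance = sum((f - mean) ** 2 * w for f, w in zip(freqs_hz, weights))
    if variance == 0.0:
        raise ValueError("higher moments are undefined for a single-bin spectrum")
    third = sum((f - mean) ** 3 * w for f, w in zip(freqs_hz, weights))
    fourth = sum((f - mean) ** 4 * w for f, w in zip(freqs_hz, weights))
    skewness = third / variance ** 1.5
    excess_kurtosis = fourth / variance ** 2 - 3.0
    return mean, variance, skewness, excess_kurtosis


# Hypothetical three-bin power spectrum, for illustration only.
cog, var, skew, kurt = spectral_moments([1000.0, 2000.0, 3000.0],
                                        [1.0, 2.0, 1.0])
print(cog, var, skew, kurt)
```

In practice the input would be the spectrum of a windowed portion of the fricative noise, and the result would depend on the analysis window and frequency range, which are left open here.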
Just as for vowels, another type of measure used in the retained papers is the dynamic formant transition, including the F2 slope. The F2 slope measure, used in glides in A22, is 'a dynamic measure that reflects the rate at which speech movements can be performed' (R. D. …) and is thus related to speaking rate. Van Son and Pols (1999), investigating acoustic correlates of consonant reduction in healthy speech, found that the F2 slope difference (i.e., the difference between the F2 slopes at the VC and CV boundaries in VCV syllables) is lower in spontaneous than in read speech. This reduced F2 slope difference indicated a lower consonant-induced coarticulation in the VCV syllable, and thus a reduced consonant articulation. The use of formant transition measures is all the more noteworthy since it has been shown that in healthy ageing a decrease in intelligibility can be partly attributed to slower tongue movements (Kuruvilla-Dugdale et al., 2020).
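One simple way to obtain an F2 slope is to fit a straight line to F2 values sampled over the transition and take its gradient. The sketch below uses an ordinary least-squares fit; the retained studies may have used other definitions (for example, the F2 change divided by the transition duration), so the method, the units (Hz/ms) and the sample values are assumptions.

```python
def f2_slope(times_ms, f2_hz):
    """Least-squares slope (Hz/ms) of F2 over time across a transition."""
    if len(times_ms) != len(f2_hz) or len(times_ms) < 2:
        raise ValueError("need at least two matching time and F2 samples")
    mean_t = sum(times_ms) / len(times_ms)
    mean_f = sum(f2_hz) / len(f2_hz)
    covariance = sum((t - mean_t) * (f - mean_f)
                     for t, f in zip(times_ms, f2_hz))
    spread = sum((t - mean_t) ** 2 for t in times_ms)
    if spread == 0.0:
        raise ValueError("time samples must not all be identical")
    return covariance / spread


# Hypothetical F2 track of a glide-to-vowel transition, for illustration only.
times_ms = [0.0, 10.0, 20.0, 30.0]
f2_hz = [1000.0, 1200.0, 1400.0, 1600.0]
print(f2_slope(times_ms, f2_hz))
```

A slope difference such as the one described by Van Son and Pols (1999) could then be obtained by applying the same function to the VC and CV portions separately and subtracting the results.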
To summarize this discussion, we highlighted the importance of investigating variations at the phoneme level in healthy speech, using acoustic measures to analyse both vowel and consonant reductions. Various spectral acoustic measures, mainly on vowels, proved to be related to perceived speech intelligibility in healthy speakers. However, the results show that none of these measures accounts for a large percentage of the variance in the perceptual intelligibility scores. While acoustic measures allow for a more objective investigation of speech, they do not comprehensively represent the speech signal, but rather target specific cues that are believed to be theoretically relevant. One should also keep in mind that the accurate perception of phonemes relies on several phonemic features (Jakobson, Fant, & Halle, 1951) and it is not one sole feature, but the whole set of speech units that makes up the notion of intelligibility (Flanagan, 1972, p. 311). Hence, a combination of acoustic measures, taking into account various phonemic traits and spectral aspects, could be a first step towards a more comprehensive assessment of speech intelligibility (e.g., Bradlow et al., 1996; Ray D. Kent, Kent, et al., 1989; J. Lee, Hustad, & Weismer, 2014; Lindblom, 1990; Weismer, 2008). Furthermore, there is a complex entanglement of segmental acoustic features with factors at other levels of granularity such as intonation, stress (e.g., acoustic differences between stressed and unstressed vowels in A19), voice quality and speech rate. This has been demonstrated in connected speech (Metz et al., 1990) as well as in clear speech (Kuruvilla-Dugdale et al., 2020; Smiljanić & Bradlow, 2009; Whitfield & Goberman, 2017).
Finally, before segmental acoustic measures are used on specific patient populations, extensive research is still needed to better understand their behaviour in healthy speakers, to identify relevant acoustic combinations that could account for perceived speech variations and to provide normative data from a large set of healthy speakers.

Further perspectives and future directions of research
From the analyses made throughout this review, a few leads for further studies can be considered. First, the diversity of the methodologies used in the retained papers demonstrates that speech can be investigated in many different ways at a perceptual as well as at an acoustic level. Of the 22 retained papers in our review, only five addressed the definition of the targeted speech-related concept(s), of which four (A08, A12, A20, A22) provided a definition of intelligibility. In light of the various terms used to refer to speech production, each of which refers to a specific concept, unambiguous definitions should be provided in research papers. Also, the rating tasks and the acoustic measures should be extensively described, so as to allow the reader to interpret the results accordingly, as well as for the methods to be replicable. It can be observed that even when several studies use the same measure, the study population, the phonemic sample, the computing method and the reporting of the results differ widely and are sometimes not reported (according to the aim of each study), which makes it difficult to relate the resulting values. To illustrate this point, an attempt to compare the results of similar acoustic measures used in the different studies is shown in Appendix C.
In this review, we have observed a majority of studies focusing on vowels when it comes to spectral cues. Vowels play an important role in speech intelligibility (Chen, Wong, & Wong, 2013;Cole, Yan, Mak, & Fanty, 1996;Kewley-Port, Burkle, & Lee, 2007) and are also more convenient to analyse spectrally, as they are by definition voiced and composed of periodic waveforms and can be sustained (in contrast to plosive consonants). However, consonants also significantly contribute to speech intelligibility. Lindblom (1990) already postulated that despite the coarticulation effects, a combination of spectral features could allow for a good distinction between stop consonants. Furthermore, while vowels were found to have a more important effect on talker identity discrimination, consonants are essential for word identification (Bonatti, Peña, Nespor, & Mehler, 2005;Owren & Cardillo, 2006). The consonant intelligibility, their variability and reductions in healthy speech, as well as related spectral cues (in addition to the more investigated time-domain cues), should therefore be further explored.
Some considerations can also be highlighted with regard to the study populations. The majority of the studies included both men and women in a balanced ratio. However, very few of them actually differentiated the results by gender, especially in the control groups, for which the results are very often pooled. It is well known that vowel formant values, for example, vary between men and women (Bradlow et al., 1996;Coleman, 1971;Yang, 1996). Generally speaking, greater account still needs to be taken of this factor, and the study group's gender information should systematically be specified. One possible way to address the issue of across-sex value comparisons is to use Bark scales (Fletcher, McAuliffe, Lansford, & Liss, 2017), as could be observed in some of the studies in this review. Also, while half of the studies were carried out on study groups aged more than 50 years, none of the studies investigated the impact of age in adults on the spectral measures or on perceived speech intelligibility. It would be noteworthy to take the age factor into account in order to analyse the evolution of speech-related acoustics and perceived intelligibility in normal ageing (Kuruvilla-Dugdale et al., 2020). Indeed, speech has been shown to vary across the lifespan due to physiological and neuromuscular modifications (Benjamin, 1997;Bilodeau-Mercure & Tremblay, 2016;Hazan, 2017;Hazan et al., 2018;Hooper & Cralidis, 2009;Tremblay et al., 2017). The study of speech modifications in 'normal' ageing as compared to pathological ageing might help further understand speech production strategies in healthy speech.
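As an illustration of the Bark-scale approach mentioned above, the sketch below converts formant frequencies from Hertz to Bark. It assumes Traunmüller's (1990) analytical approximation, without its corrections for very low and very high frequencies; the retained studies may have used a different conversion formula, and the function name `hz_to_bark` is hypothetical.

```python
import numpy as np

def hz_to_bark(frequency_hz):
    """Convert frequency in Hz to Bark.

    Assumption: Traunmüller's (1990) approximation, without the
    end-of-scale corrections; other Bark formulas exist.
    """
    f = np.asarray(frequency_hz, dtype=float)
    return 26.81 * f / (1960.0 + f) - 0.53


# Example with made-up F1/F2 values (Hz) for one vowel from two speakers
speaker_a = hz_to_bark([310.0, 2790.0])
speaker_b = hz_to_bark([270.0, 2290.0])
```

The conversion compresses high frequencies, so formant differences expressed in Bark are closer to auditory distances than differences in Hertz; it does not by itself remove anatomical differences between speakers.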

Limitations
The studies discussed in this systematic review have been retrieved from two databases (PubMed and Embase) that were thought to include papers from the targeted topic. We are, however, aware that there might be studies from other sources that address the subject but that are not referenced in these two databases.
Regarding the acoustic measures considered in this review, we would like to underline that time-domain measures were not taken into account in order to limit the noise in the initial database search (e.g., studies about the speaking rate, prosody and pauses in fluency disorders …). As explained in the introduction, only frequency-domain measures were included. In a future study, it would, however, also be interesting to investigate the link between time-domain measures (such as the voice onset time) and perceived intelligibility, as time- and frequency-domain measures provide complementary data (Floegel, Fuchs, & Kell, 2020; Li et al., 2008). The resulting higher number of studies focusing on vowels might also stem from this methodological decision. Further studies on time-domain measures could clarify whether this is a general trend among phoneme-level measures, or whether it is limited to spectral measures.
Last but not least, while this review focused on studies written in English, it would also be informative to review studies written in (and thus focusing on) other languages. The most contrastive example illustrating the interest of investigating other languages is that of tonal languages. In the latter, the acoustic and perceptual underpinnings of speech intelligibility might be very different from those in Western languages. Suprasegmental measures (e.g., F0 contour) might, for example, contribute to intelligibility to a higher degree than phoneme-level measures (Chen & Loizou, 2011).

Conclusions
Our results highlight that speech is highly variable within and across healthy adult speakers, which stresses the need for further studies regarding the acoustic underpinnings of speech intelligibility in healthy speech. Healthy speech shows inherent imprecisions and is thus not, as often presumed, 100% accurate. A better understanding of the imprecisions in healthy spontaneous speech will provide a more realistic baseline for the investigation of disordered speech.
The direct investigation of the correlation between spectral cues and speech intelligibility estimates remains scarce, especially in consonants. In this review, for vowels, the following measures were shown to be linked to sub-lexical perceived speech intelligibility ratings: steady-state F1 and F2 measures, the F1 range, the [i]-[u] F2 difference, F0-F1 and F1-F2 differences in [ɛ-ae] and [I-ɛ], the vowel space area, the mean amount of formant movement, the vector length and the spectral change measure. For consonants, only the fricative centroid energy and the fricative spectral peak in the [s]-sound, as well as the steady-state F1 offset frequencies in vowels preceding [t] and [d] have shown a significant link with phoneme identification scores.
An important question is raised by this review: Can perceived intelligibility be quantified by single acoustic measures? It indeed appears that, to date, no acoustic measure is able to predict speech intelligibility to a large extent. There is still extensive research to be carried out to identify relevant acoustic combinations that could account for perceived speech variations (e.g., vowel and consonant reductions) in healthy speech. Subsequently, normative data will have to be gathered from a large number of healthy speakers in order to then investigate these measures in specific patient populations. To that end, speech-related terms (e.g., intelligibility, comprehensibility, severity) need to be clearly defined and methodologies described in sufficient detail to allow for replication, cross-comparisons/meta-analyses and pooling of data.

Note 1. Studies using perceptual assessment methods that fitted the umbrella-term 'intelligibility' rather than the more specific definition focusing on low-level segmental units were not excluded a priori but differentiated in the Discussion.

Overall speech severity: direct magnitude estimation (DME) using a modulus with the value of 100 (= moderately severe)

A. Studies describing associations between acoustic variables in healthy speakers and auditory perception
Association:
- Regression between vowel space area and overall speech severity: not significant
- Regression between first moment difference and overall speech severity: not significant

Intelligibility rating on a 10-point Equal Appearing Interval scale (1-10: totally unintelligible-completely intelligible)

Perceptual
- Intelligibility rating: N.R.

Acoustic

(1) Vowel Space Area (Fletcher et al., 2017): the first and second formant values of the corner vowels of the investigated language are used as coordinates in an F1/F2 space to construct a vowel triangle or quadrilateral. The area of the resulting triangle or quadrilateral is then computed using classic formulas such as:

\[ \mathrm{VSA} = \frac{1}{2}\,\bigl|\,F1_{v1}(F2_{v2} - F2_{v3}) + F1_{v2}(F2_{v3} - F2_{v1}) + F1_{v3}(F2_{v1} - F2_{v2})\,\bigr| \]

(where v1, v2 and v3 are the corner vowels of the vowel triangle)

(2) Articulatory-Acoustic Vowel Space (A05, A16): 'This space is calculated as the square root of the generalized variance of all sampled vowel formants in the F1-F2 coordinate plot. The generalized variance for the AAVS is calculated as the product of the variance of the F1 data, the variance of the F2 data, and the portion of the unshared variance between them. The square root of the generalized variance provides a measure of formant variability that is the equivalent to a bivariate standard deviation in F1-F2 space. Therefore, an increase in the range or spread of F1 or F2 values in an utterance would yield a larger AAVS.' (Whitfield & Goberman, 2017)

(3) Steady-state F1 and F2 measures: the first and second formants are extracted, usually at the temporal midpoint. They can then be compared, for example, between vowels or between speaker groups.

(4) F1 and F2 ranges: subtraction of the lowest F1/F2 value from the highest

(5) F0-F1 difference (A08): Euclidean distance between the fundamental frequency and the first formant

(6) F1-F2 difference (A02, A08): Euclidean distance between the first and second formants

(7)

Glide measure

(1) F2 slope (A20): the overall frequency shift in Hertz (transition extent) in a glide, divided by the transition duration (in ms), as a measure of the rate of phonatory tract modification
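The vowel-space definitions listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in any retained study: the function names are hypothetical, the polygon area uses the shoelace formula with vertices given in order around the perimeter, and the AAVS uses the sample covariance (denominator n - 1), which the quoted definition does not specify.

```python
import numpy as np

def vowel_space_area(corner_formants):
    """Area of the vowel triangle or quadrilateral in the F1/F2 plane.

    `corner_formants` is a sequence of (F1, F2) pairs in Hz, ordered
    around the perimeter of the polygon (assumption of this sketch).
    """
    pts = np.asarray(corner_formants, dtype=float)
    if pts.ndim != 2 or pts.shape[1] != 2 or len(pts) < 3:
        raise ValueError("need at least three (F1, F2) pairs")
    f1, f2 = pts[:, 0], pts[:, 1]
    # Shoelace formula: half the absolute sum of cross products
    return 0.5 * abs(np.dot(f1, np.roll(f2, -1)) - np.dot(f2, np.roll(f1, -1)))


def articulatory_acoustic_vowel_space(f1_values, f2_values):
    """Square root of the generalized variance of the F1-F2 samples.

    The generalized variance is the determinant of the 2x2 covariance
    matrix, i.e. var(F1) * var(F2) * (1 - r^2).
    Assumption: sample covariance with denominator n - 1.
    """
    cov = np.cov(np.vstack([f1_values, f2_values]))
    generalized_variance = np.linalg.det(cov)
    # Guard against tiny negative values caused by floating-point rounding
    return float(np.sqrt(max(generalized_variance, 0.0)))


def formant_range(values):
    """Range of a formant: highest value minus lowest value (Hz)."""
    values = np.asarray(values, dtype=float)
    return float(values.max() - values.min())


# Example with made-up corner vowel values (F1, F2) in Hz
area = vowel_space_area([(300, 2300), (700, 1200), (300, 800)])
```

The area depends only on the corner vowels, whereas the AAVS uses every sampled formant pair, which is why the two measures can diverge for the same speaker.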