Effects of oral and oropharyngeal cancer on speech intelligibility using acoustic analysis: Systematic review

The development of automatic tools based on acoustic analysis makes it possible to overcome the limitations of perceptual assessment for patients with head and neck cancer. The aim of this study is to provide a systematic review of the literature describing the effects of oral and oropharyngeal cancer on speech intelligibility using acoustic analysis.


| INTRODUCTION
Head and neck cancer (HNC) has major functional repercussions on the upper aerodigestive tract (breathing, swallowing, and phonation/speech). Because of the sensorimotor impairment related to the presence of the tumor in the anatomical regions involved in the articulation of speech, a functional impairment at the level of communication is likely to appear. 1 The speech-related quality of life will also be impacted. 2,3 In this oncological context, various factors can affect the quality of speech, including the treatments, the size of the tumor, 4-6 or its location. 7,8 With the increasing incidence rate of oropharyngeal cancer, 9,10 the evaluation of speech and its disorders becomes a major issue in the management of patients with HNC.
This evaluation is mainly based on a perceptual assessment: therapists, mainly speech pathologists, assess the quality of the patient's speech production. But these methods have two major limitations. First, most of the tools are intended for voice quality assessment in laryngeal cancers, 11 whereas speech disorder is the most common symptom in cancers of the oral cavity and the oropharynx. 12 Second, these measures are known to show great interjudge and intrajudge variability. Indeed, the reliability of perceptual estimates is mostly listener-dependent. 13 The degree of familiarity of the listener with the patient or with the task might increase predictability and improve the functional speech scores given by the rater. The rating by an expert in the pathology field, or by a rater who is familiar with the patient, can be very different from that of a naive listener. Moreover, the reproducibility of the perceptual assessment is also subject to intrajudge variations. The emotional context or the mental alertness of the judge at the time of the assessment may influence the outcome. 14 Recently, technological developments have allowed the investigation of new tools for speech evaluation based on objective data. 15 For this purpose, acoustic speech analysis is currently a growing field of research.

| Review question
The aim of this article is to provide a systematic review of the literature describing the effects of HNC on speech intelligibility using acoustic analysis. This review will focus on speech intelligibility in adults with oral or oropharyngeal cancer assessed by acoustic measures.

| Ethical approval
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.

| Protocol and registration
The methodology and reporting of this systematic review were guided by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement and checklist. The PRISMA statement and checklist are designed to guide researchers in the essential and transparent reporting of systematic reviews. 16,17

| Eligibility criteria
To be eligible for inclusion in this systematic review, articles were required to describe the effects of oral and oropharyngeal cancer on speech intelligibility using acoustic analysis.
Only articles meeting all of the following criteria were included:
• Assessment of speech intelligibility,
• Use of acoustics and related terms (such as acoustic analysis, phonetics, signal processing, sound spectrography, etc),
• Patients with oral or oropharyngeal cancer.
In this study, speech intelligibility is defined as the degree to which a message can be understood by a listener, 18 the proportion of understood speech, 19 or the rate of correctly transcribed words. 20 Speech intelligibility impairment is described as a functional speech deficit decreasing the ability to interact with others. 21 Exclusion criteria were:
• The absence of the original larynx (exclusion of total or partial laryngectomies, larynx prostheses, etc),
• Studies addressing child populations,
• Papers that were not original articles, such as abstracts, conference proceedings, and reviews,
• Case studies,
• Articles not published in English.

| Data sources and search strategies
A literature search was performed in two different electronic databases, to gather relevant literature: PubMed and Embase. These two databases were selected based on the subject of this research. Note that a third database, Web of Science (WoS), did not retrieve any new reference.
All publications dated up to December 4, 2018 were included, with no limitations regarding the publication dates.
The search terms are listed in Table 1. All abstracts were reviewed by two independent raters. Differences of opinion about the eligibility of articles were settled by consensus. A flowchart of the selection process according to PRISMA 16 is shown in Figure 1.

| Methodological quality and level of evidence
The National Health and Medical Research Council (NHMRC) Evidence Hierarchy was used to assess the level of evidence, from I (Systematic reviews) to IV (Case series). 22 The QualSyst critical appraisal tool by Kmet et al 23 provides systematic, reproducible, and quantitative means of assessing the methodological quality of research over a broad range of study designs. A QualSyst score higher than 80% was interpreted as strong quality, 60% to 79% as good quality, 50% to 59% as adequate quality, and lower than 50% as poor methodological quality. Studies with poor methodological quality were excluded from further analysis.
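The QualSyst banding applied above is a simple threshold rule; as an illustration only (the function name and the handling of the 80% boundary are our own assumptions, since the review's wording leaves scores of exactly 80% between two bands), it can be sketched as:

```python
def qualsyst_category(score_pct: float) -> str:
    """Map a QualSyst percentage score to the quality bands used in this review.

    Bands per the review's wording: >80% strong, 60%-79% good,
    50%-59% adequate, <50% poor (poor-quality studies are excluded).
    """
    if score_pct > 80:
        return "strong"
    if score_pct >= 60:
        return "good"
    if score_pct >= 50:
        return "adequate"
    return "poor"
```

A study scoring 45%, for example, would fall in the "poor" band and be excluded from further analysis.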

| Data extraction
After assessment of methodological quality, data from all remaining articles were extracted for the following categories: number of participants in the study and their characteristics (age, diagnosis, and language spoken), acoustic parameters (and their definitions), comparison criterion or criteria, speech sample, and authors' main conclusions.
Additionally, geographic bibliometric data were extracted using the Netscity tool 1 (by the Netscience project of the Labex SMS, Toulouse, France).

| Study selection
A total of 488 records were retrieved from the two electronic databases. Two independent reviewers screened all records and assessed 196 full-text articles for eligibility. A final total of 22 articles met the inclusion criteria and were included in this review (see Figure 1).

| Quality assessment
The overall quality of the studies, as assessed by the QualSyst tool, ranged from "good" to "strong," with four studies ranked as "good" and 18 as "strong." Based on the NHMRC evidence hierarchy, 22 a total of 20 studies were classified as level III evidence (14 as III-3: "Comparative studies with two or more single-arm studies"; six as III-2: "Comparative studies with concurrent controls and allocation not randomized [cohort studies], or case control studies"), and two as level IV evidence (Case series). No article of a low level of evidence had to be excluded. The ratings of all 22 included articles are listed in Table 2.
The full outcome table on the 22 retained articles can be found in Appendix A.

| Bibliometric data
The field of acoustic parameters in speech analysis in patients treated for HNC mainly concerns teams located in three geographical areas: Western Europe (mainly the Netherlands), North America, and the Far East (Japan and South Korea). Some collaborations between teams are noted: between Finland and Canada, and between South Korea and the United States (see Figure 2).
This will have an influence on the languages of the study speech samples. Most of the studies selected in this review have been published since 2010 (13/22, 59%). The use of cepstral coefficients and of machine learning tools in speech assessment in an oncological context started around 2010 (see Figure 3). The field of speech acoustic analysis is therefore growing, due to the recent use of new acoustic measures.
Two studies 30,34 use patient data from retrospective corpora.
All participants in the 22 studies had cancers of the oral cavity or of the oropharynx at the time of the study.
In total, 11 studies (50%) address patients treated for cancer of the oral cavity only. The anatomical sites mainly (9/11) involve the tongue (treated by total 28,35,38,39,43,44 or partial glossectomy 24,31,33 ). The remaining two studies investigate maxillary tumors. 37,45 Six studies (27%) include both patients treated for cancer of the oral cavity and patients treated for oropharynx cancer. 26,27,29,30,34,40 Only five (23%) include solely patients with an oropharyngeal tumor location. Two address patients with a tumor extension to the soft palate. 41,42 The other three relate to the tonsil, alone 25 or in comparison with the area of the base of the tongue. 32,36 The distribution of the tumor locations is illustrated in Figure 4.

| Comparison outcomes
The different comparison outcomes used in the studies are shown in Table 4.
Six studies (27%) compare acoustic measures with a perceptual outcome. The latter is an intelligibility score assigned by judges using a Likert-type ordinal scale, either globally 24 or on specific parameters such as articulation, nasality, or "weakness." 26,27,29,45 One study uses the percentage of correctly identified consonants. 25 Five studies (23%) investigate the performance of acoustic scores, either by analyzing differences between the investigated parameters or by comparing the results with existing data: comparison of formants, 28 comparison of the performance of two spectral parameters, 30,34 and comparison with the same parameters from other software or with existing norms. 37,41 Three studies (14%) compare acoustic parameters before and after treatment. 32,35,36 Finally, eight studies (36%) compare the same parameters between a subject group and a control group. 31
One study carries out analyses at the sentence level, 40 and four use a more global analysis on a read text. 25,30,34,41 One study does not report the composition of its speech sample. 42 These results are shown in Table 5.
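The most common comparison outcome above is a correlation between an acoustic measure and perceptual ratings. As a minimal sketch of that computation (the variable names and all data values are hypothetical, not taken from the reviewed studies), the Pearson coefficient can be computed with the standard library alone:

```python
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical data: F2 of /i/ (Hz) per speaker, and mean intelligibility
# ratings from listeners on a Likert-type scale.
f2_i = [2100, 1950, 2250, 1800, 2300]
ratings = [5.2, 4.1, 6.0, 3.5, 6.3]
r = pearson_r(f2_i, ratings)
```

In practice the reviewed studies also report significance levels (e.g., P < .05), which this sketch omits.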

| Acoustic measures
The acoustic parameters analyzed in the included studies, reported below, are shown in Appendix B. Figure 6 represents the distribution of the units of analysis in the articles.

| Nasalance (seven articles)
Seven articles focus on the analysis of nasality. Three studies carried out the nasalance analysis on vowels, 32,36,37 one on sentences, 40 and two on a read text. 25,41 One study does not report the speech unit used. 42 Most of the studies compute a nasality score using dedicated software (Praat, 36 Dr. Speech 37 ) or nasometers. 40-42 The ratio of the acoustic energy emerging from the nasal and the oral cavities is calculated in two studies. 25,32 The nasalance score presents a significant association with perceptual assessment in extended resection or reconstruction of the soft palate. 25

| Vowels (nine articles)
Nine articles analyze formants on vowels, mainly F1 and F2. Of these, three also study F3 28,31,33 and one analyzes formants up to F12. 45 The vowel space area (VSA) is used in two studies 26,43 and the transition slope is only found in one. 43 Four studies investigate acoustic differences before and after treatment. After tongue surgery, significant differences are found in F1 and F2, 35,39 with F1 generally being increased and F2 being lowered. The acoustic measures are impacted by local reconstruction, 43 as well as by a well-adapted palate-lowering prosthesis, which is shown to modify F1, F2, and F3 in patients treated by subtotal glossectomy. 28 Two studies show a correlation between acoustic measures and perceived intelligibility: F2 of /i/ (r = 0.35) and the size of the VSA are linked with intelligibility (r = 0.39, P < .05) and articulation (r = 0.42, P < .05) ratings, 26 and F7 and F12 of /i/ are also highly correlated with perceptual ratings (r = 0.84). 45 A single study does not find any significant correlation between acoustics and perceptual assessment on F0, F1, and F2. 29 The studies comparing subjects and healthy controls find that F2 and F3 are lower in the patient group. 31 For women, significant differences between subjects and controls are found for F2 and F3, but only for F1 for men. 33
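The VSA mentioned above is the area of the polygon spanned by the corner vowels in the (F1, F2) plane. As an illustrative sketch only (the reviewed studies do not publish their computation; the function and the formant values below are our own assumptions), it can be obtained with the shoelace formula:

```python
def vowel_space_area(formants):
    """Area (Hz^2) of the polygon spanned by (F1, F2) corner-vowel points.

    Uses the shoelace formula; points must be given in polygon order
    (clockwise or counterclockwise).
    """
    n = len(formants)
    area = 0.0
    for i in range(n):
        f1a, f2a = formants[i]
        f1b, f2b = formants[(i + 1) % n]  # wrap around to close the polygon
        area += f1a * f2b - f1b * f2a
    return abs(area) / 2.0


# Hypothetical corner vowels /i/, /a/, /u/ as (F1, F2) pairs in Hz.
vsa = vowel_space_area([(300, 2300), (750, 1300), (350, 800)])
```

A reduced VSA after treatment would then appear as a smaller area for the same set of corner vowels.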

| Consonants (five articles)
Three studies analyze spectral moments on plosives and fricatives: the center of gravity/spectral mean 24,32,38 and the spectral skewness. 24,26,38 The Klatt Voice Onset Time (VOT) is also analyzed on both consonant groups in one study. 24 On plosive consonants, the duration of the air pressure release is measured in two studies. 26,32 The /t/ consonant peak energy frequency and the formant transition in the syllable /ta/ are analyzed in Reference 44.
On fricatives, the friction duration and the band energy are calculated in two studies. 32,38 In Reference 32, F1, F2, and F3 are measured on liquids /l/ and /R/.
The results show that the duration of the air pressure release in /k/ is linked with intelligibility and articulation estimates. 26 In addition, the center of gravity and the skewness correlate with the perceptual evaluation in specific contexts (iCi and αCα contexts). 24 The pretreatment vs post-treatment comparison shows that the spectral mean and the skewness are good measures of short-term effects, whereas the friction duration on /s, z/ does not seem to be relevant for long-term effects. One year after chemoradiotherapy, the spectral burst peak frequency of /k/ is weakened, a significantly higher F3 with lower intensity is found on /l/, and a significantly higher spectral burst frequency on /t/ is noted. 32 Across different contexts, the Klatt VOT seems congruent with the perceptual assessment. 24 Lastly, the formant variance F2-F3 at the transition between plosive and vowel returns to normal after surgery, and the consonant peak energy frequency is lower presurgery for some subjects. 44
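The spectral moments recurring in these studies treat the normalized magnitude spectrum as a probability distribution over frequency. As a minimal sketch under that standard definition (the function name and input format are our own; the studies themselves used tools such as Praat), center of gravity and skewness can be computed as:

```python
def spectral_moments(freqs, mags):
    """Center of gravity (spectral mean, Hz) and skewness of a magnitude spectrum.

    Treats the normalized magnitudes as a probability distribution over
    frequency; assumes at least two distinct frequencies with nonzero energy.
    """
    total = sum(mags)
    p = [m / total for m in mags]  # normalize to weights summing to 1
    cog = sum(f * w for f, w in zip(freqs, p))               # 1st moment
    var = sum(w * (f - cog) ** 2 for f, w in zip(freqs, p))  # 2nd central moment
    sd = var ** 0.5
    skew = sum(w * ((f - cog) / sd) ** 3 for f, w in zip(freqs, p))
    return cog, skew
```

A /s/ spectrum shifted toward lower frequencies after glossectomy, for instance, would show a lower center of gravity and a more positive skewness than a control production.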

| Global speech (three articles)
Two articles study the performance of different acoustic features, computed from existing corpora, in order to classify speech into two categories (intelligible/unintelligible). The investigated features are: Mel-frequency cepstrum coefficients (MFCC) and Mel S-transform cepstrum coefficients (MSCC) in Reference 30 and multiresolution sinusoidal transform coding (MRSTC) in Reference 34. These features are fed to different classifying algorithms that output a binary decision on the intelligibility: article 34 uses a regression-based classifier, article 30 a support vector machine (SVM). A third article uses an artificial neural network (ANN) to predict articulation quality and nasalance. 27 MSCC yield better results than MFCC in classifying intelligible and unintelligible speech on retrospective corpora, and MRSTC show a better classification when they are fed to an SVM. 34 ANNs significantly predict perceived articulation quality on /α/, as well as perceptual hypernasality on /i/ and /u/. 27
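The classifiers used in these articles (SVM, regression-based, ANN) are not published as code. As a language-agnostic stand-in for that pipeline, not a reproduction of any reviewed method, the sketch below trains a minimal perceptron: per-utterance feature vectors (in the studies, cepstral summaries such as MFCC means) are mapped to a binary intelligibility label. All names and data are hypothetical.

```python
def train_perceptron(X, y, epochs=20, lr=0.1):
    """Train a tiny linear classifier on feature vectors X with binary
    labels y (1 = intelligible, 0 = unintelligible)."""
    w = [0.0] * len(X[0])  # one weight per feature dimension
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = t - pred  # -1, 0, or +1
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b


def predict(w, b, x):
    """Binary intelligibility decision for one feature vector."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
```

In the reviewed studies the external validation criterion for such a classifier is the perceptual judgment, so the labels y would come from listener ratings.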

| DISCUSSION
The main goal of this study was to review the scientific literature studying the effects on speech intelligibility of oral or oropharyngeal cancer, using acoustic parameters.
Two main lines of thought emerge from the analysis of the 22 selected articles, regarding the choice of the acoustic parameters, and the unit of analysis chosen to assess intelligibility.

| Acoustic parameters according to participants' characteristics
Looking at the most investigated acoustic analyses in the studies retained for this review, two main fields emerge: nasality measures and vowel acoustics.
The location of the tumor plays a role in the choice of the acoustic parameter. Most of the studies including patients with oropharyngeal cancer use nasalance measures as one of the criteria impacting intelligibility. Among these studies, five include only oropharyngeal cancer 25,27,36,41,42 and two include patients undergoing surgery for the oropharynx or the oral cavity. 32,40 Because of its location, the oropharyngeal pathology has an impact on the dynamics of the anatomical structures that account for speech nasality, particularly through its effect on the soft palate or the tonsil.
The majority of the studies including patients with oral cavity cancer analyze acoustics on vowels and consonants. While nasalance is mainly assessed at the sentence or text level, most of the other analyses focus on the acoustic characteristics of isolated vowels, produced singly or, more rarely, extracted from syllables or continuous speech. The analyses are mainly carried out on the first formants, which are known to be directly impacted by the oral pathology: the opening of the jaw modifies F1 and the position of the tongue modifies F2. The studies linking these formant measures to perceived intelligibility (perceptual comparison criteria are used in three of the nine articles addressing formant measures) put forward three main parameters of interest: the size of the VSA, 26 F2 in the vowel /i/, and ANN-based nasalance scores on /i/. 27 Regarding the analyses on consonants, the consonant type determines which acoustic parameters are used. On the plosives [p t k], the spectral analysis of the burst and the air pressure release seem relevant. 26,32 The center of gravity, spectral slope, and band energy are more commonly used for fricatives. 26,32,38,44 Thus, the acoustic parameters analyzed depend on the location of the tumor: the analyses on vowels and consonants relate mainly to oral cavity patients, whereas nasalance mainly concerns patients treated for oropharynx cancer. This is congruent with the expected functional impact of the morphological and dynamic changes consecutive to treatment. It therefore seems appropriate, in the clinical assessment, to adapt the choice of acoustic parameters to the pathology presented by the patient.
Regarding the size of the tumor, intelligibility in the context of small tumors is mainly analyzed on vowels (mostly formant analysis) and on consonants (spectral moments). Nasality is only investigated in one study, using an ANN on vowels. 27 The three studies including larger tumors 30,34,42 mainly use cepstral coefficients (MFCC, MSCC, MRSTC). 30,34 The use of feature extraction and of neural networks is fairly recent in the field of intelligibility assessment and shows promising performance in terms of intelligible/unintelligible binary classification, with the perceptual judgment as the external validation criterion.
The size of the tumor, in accordance with the impact on the anatomical structures involved in speech production, seems to determine the acoustic criteria. Phoneme-specific acoustic parameters are thus mainly used in tumors of small volumes, having a lesser impact on speech dynamics. Regarding tumors of larger volumes, studies look for more general speech-quality parameters to categorize speech as intelligible or unintelligible.
Subcategory analyses by treatment and by language did not reveal any trend, particularly because of the small numbers of studies and patients in each category. Only two languages are found in more than one article: English and Dutch, in four studies. Among them, two analyze the same cardinal vowels and the first two formants 26,29 : they show different results regarding the correlation between these scores and the perceptual assessment of intelligibility. More studies are thus required to specifically study the effect of the phonemic constitution of a language on patients' intelligibility after treatment.
To summarize, a tight link seems to exist between the acoustic parameters and the tumor location, as well as between these parameters and the tumor size. Moreover, there is great variability in the acoustic parameters used across studies, mainly at the segmental level. The use of cepstral parameters and machine learning tools allows continuous speech analysis, but these techniques are still very recent and further research is needed. Currently, acoustic parameters seem relevant to complement the perceptual assessment of speech carried out in current practice. It would therefore seem appropriate to investigate more comprehensive analysis models that not only classify patients' speech according to their functional intelligibility performance, but also study the fine acoustic impact of a tumor to enable targeted management of analytic deficits.

| Speech samples
The analysis of the speech samples on which the acoustic parameters are measured shows a predominance of the study of isolated phonemes (vowels or consonants). Sentences or texts are used mostly for the measurement of cepstral coefficients (such as MFCC or MSCC) or nasalance.
However, from a functional point of view, the analysis of semi-spontaneous or spontaneous speech would be the closest way to predict intelligibility in the patient's daily life. From our review, we notice that there are no studies using such tasks, for example image description or spontaneous speech analysis.

| Study limitations
This systematic review surveyed two databases (PubMed and Embase). The WoS was also surveyed, but no entry was found that was not already present in the first two databases (ie, all articles found in the WoS were duplicates of PubMed and Embase entries). However, it cannot be excluded that other studies exist outside the scope of this search.
In the 22 articles that were selected, two pairs of studies were carried out on identical or very similar corpora: References 26 and 27, and References 38 and 39. However, both pairs were retained because their main objectives were different and complementary: Reference 26 focused on formant analysis while Reference 27 used ANN; Reference 38 investigated the analysis of the spectral moments on consonants, while Reference 39 studied formants in vowels.
The great variability of the included studies underlines the need for the development of standardized tools for acoustic evaluation in patients treated for HNC. Standardization would enable more precise and reliable assessments in the diagnosis of speech disorders and their severity, but also in intraindividual comparisons during patient follow-up.

| Future directions for research
Numerous acoustic parameters allow differentiating subjects suffering from cancer of the oral cavity or of the oropharynx from healthy controls. This is the case for formant analysis, mainly in cancers of the oral cavity, 31,33,39,43,44 but also for nasality scores in two studies. 40,42 The clinical validity of these measures has thereby been underlined. Other parameters allow the measurement of a change before/after treatment, such as spectral burst frequencies on /t/ and /k/ 32 or nasalance scores 36 for patients with oropharyngeal cancer, and F1 and F2 for patients with oral cavity cancer. 35 These parameters therefore show good responsiveness.
However, one important question still needs to be addressed: Which gold standard can be used to evaluate the criterion validity of these different parameters? Six studies choose the perceptual evaluation as a gold standard, which is currently the standard in clinical practice. The discussion on the choice of this gold standard remains open.
When conducting our initial database search, the inclusion term "intelligibility" led to many articles not addressing speech per se, but the quality of voice. It seems that no consensus has been reached in the literature regarding the definition of intelligibility.
Moreover, most of the studies focused on the quality of acoustic-phonetic decoding on phonemes (vowels and consonants) to account for speech intelligibility. However, several additional factors can affect the quality of speech. The inclusion of other elements of the speech signal in addition to acoustic-phonetic decoding 21 -such as nasality, speech rate, 46 and other temporal and/or prosodic parameters related to perceived impairment 47 -defines the more complex notion of speech disorder severity.
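Among the temporal parameters mentioned above, speech rate is straightforward to derive once speech chunks have been time-aligned. As an illustrative sketch only (the input format and all values are hypothetical, not drawn from any reviewed study), the usual distinction between speech rate, articulation rate, and pause ratio can be computed as:

```python
def rate_measures(chunks, total_duration):
    """Temporal measures from time-aligned speech chunks.

    chunks: list of (start_s, end_s, syllable_count) tuples for the
    stretches of actual speech within a recording of total_duration seconds.
    """
    syllables = sum(n for _, _, n in chunks)
    phonation = sum(end - start for start, end, _ in chunks)
    speech_rate = syllables / total_duration   # syll/s, pauses included
    articulation_rate = syllables / phonation  # syll/s, pauses excluded
    pause_ratio = 1 - phonation / total_duration
    return speech_rate, articulation_rate, pause_ratio
```

A patient with a preserved articulation rate but a high pause ratio, for instance, would show a reduced speech rate even though the articulated stretches themselves are produced at a normal tempo.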
The differentiation between the notions of intelligibility and severity of a speech disorder can also be applied to the question of the impact of these disorder levels at a functional (ie, communication) and at a psychosocial level.
Automatic speech analysis is mainly performed at the segmental level, a context allowing better control of the patient's speech production. Speech assessment on a read text, which is semi-spontaneous speech, also allows controlling the context of speech production. Although the majority of the speech units in the selected studies are isolated phonemes, and more rarely sentences or texts, none investigated semi-spontaneous or spontaneous speech. True spontaneous speech is based on nonconstrained productions, such as conversational speech. The automatic analysis of spontaneous speech is more complex to perform, both because it provides no reference against which to compare the patient's performance and because it involves many associated linguistic dimensions (phonemic, lexical, syntactic, prosodic). However, the functional impact of the speech disorder lies in the decrease of the patient's ability to transmit a message. Despite these challenges, acoustic measurements on spontaneous speech need to be developed: this context of production is the closest to the communication situations experienced by patients on a daily basis, in communication with peers. Thus, the development of automatic tools objectively measuring speech on a picture-description task or spontaneous speech (such as talking about one's last holidays), using specific parameters (eg, acoustics on phonemes, coarticulation, prosody, speech rate, etc), seems to be an interesting lead for future research, facilitated by the recent evolution of technology. 48 Within a perspective of speech evaluation closely reflecting the patient's daily production, the functional impact of the speech disorder must be taken into consideration.
Thus, an overall assessment of speech seems relevant. It would include an objective assessment using specific acoustic measures, chosen according to tumor location; a perceptual evaluation, which is more global because it involves the complexity of speech disorder perception; and new tools for measuring functional speech impairment (such as self-questionnaires). On the one hand, this would allow a more reliable and accurate assessment of the deficits caused by the tumor or its treatment. Relevant linguistic units remain to be identified and studied in the speech signal to improve the intelligibility measurement of speech production disorders. On the other hand, this overall assessment could better take into account the functional consequences on daily-life communication, through the assessment of associated deficits or communication needs. Indeed, the correlation between perceptually assessed severity of speech impairment and quality of life is only moderate. 4 A multidimensional assessment of speech disorders would allow customizing therapeutic protocols in rehabilitation, by capturing new information in the speech signal and targeting deficits more objectively, but also by anticipating the functional and psychosocial impact and adapting therapeutic strategies.
Moreover, the automatic acoustic analysis tools, in addition to categorizing speech into intelligible/unintelligible, could also be used to determine finer cutoff points for speech disorder severity levels, depending on the functional impact.

| CONCLUSION
Speech assessment in patients with cancer of the oral cavity or of the oropharynx by objective acoustic measures is in development. While many studies focus on the acoustic analysis of isolated phonetic features, the link with functional consequences and psychosocial repercussions must be studied.
More studies are needed to develop new automatic tools and to determine what information they can elicit about self-perceived impairment and speech-related quality of life.