A Phonological Control Method on A Speech Compensation System for Dysarthria Using A Standardized Space

Abstract-We have developed a speech compensation system for dysarthria. The system aims to improve the phonological properties of vowels without losing speaker individuality. We propose a method for phonological control of vowels that uses a standardized space to control vowels in the normalized articulation space, in which speaker individuality is normalized. The method maps the original dysarthric speaker's normalized articulation space to a standardized space, and then maps from the standardized space to the normalized articulation space of a target speaker assumed to have normal speech, thereby improving the phonological properties of vowels. We confirm phonological control of vowels through a processing simulation, a comparison of different target speakers, and a processing simulation using a dummy original speaker who imitates dysarthric speech.

I. INTRODUCTION
Speech is one of the important means of communication in daily life, and deterioration of the phonological features of speech due to dysfunction of the speech organs can make communication difficult. To assist people with speech impairments, we have been developing a speech compensation system.
In a previous study, we proposed a method for normalizing degraded formant frequencies in an articulatory space called the "hv-plane," which normalizes speaker differences in vocal tract length [1]. We also proposed a function for inverse mapping (formant restoration) from the hv-plane to a unique coordinate point in the actual formant space [2]. If the speaker from which the formant restoration function is created and the speaker to be restored are the same, formant restoration can be performed with an average restoration error rate of about 5%, regardless of speaker or context [3]. The hv-plane can normalize speaker individualities from the actual formant space and can separate phonological features, suggesting that the formant restoration function includes speaker individualities such as vocal tract length and shape [4].
In this paper, we propose a vowel phonological control method for speech compensation systems. Dysarthric speakers have different vowel hv distributions depending on the type and severity of the disorder. Various vowel hv distributions for each dysarthric speaker are standardized using a standardized space, and vowel phonological control is performed to obtain a normal vowel hv distribution.
This work was supported by JSPS KAKENHI Grant Number 17K01568 and a donation by Research & Consulting of Regional Science Co., Ltd.

II. SPEECH COMPENSATION SYSTEM
A. System overview
A speech compensation system improves the phonological features of vowels in impaired speech without losing the individuality features of speakers [5]. As Figure 1 shows, this system consists of three blocks: an analysis block, a normalization block, and a synthesis block.

• Analysis block
This block analyzes the acoustic features of input dysarthric speech. It estimates formant frequencies (F_1, F_2, F_3) by an inverse filter control (IFC) method [6], estimates the fundamental frequency (F_0) by autocorrelation function peak picking of a vocal fold source differential wave obtained by passing input speech through an inverse filter composed of estimated formant frequencies, and extracts the root mean square (RMS) from input speech.

• Normalization block
This block normalizes degraded vowel formant frequencies using features extracted by the analysis block. It first performs mapping from an actual formant space to a normalized articulation space that normalizes differences in vocal tract length for each speaker. This normalized articulation space is the hv-plane, and we call mapping to the hv-plane "hv conversion." This block next performs normalization processing on the hv-plane, then finally performs inverse mapping from the hv-plane to the actual formant space (formant restoration). In this paper, we propose a vowel phonological control method using a standardized space as a normalization process.

• Synthesis block
This block performs speech synthesis by formant analysis and synthesis using the formant frequencies normalized by the normalization block.
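
As a rough illustration of the F_0 and RMS steps of the analysis block, the following Python sketch applies autocorrelation peak picking to a synthetic periodic frame. It is not the system's implementation: the inverse filtering by the IFC method is omitted, the synthetic frame merely stands in for the vocal fold source differential wave, and the sampling rate, the 60-400 Hz search range, and all function names are assumptions made for this example.

```python
import math


def rms(frame):
    """Root mean square of one analysis frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def autocorr(frame, lag):
    """Unnormalized autocorrelation of the frame at a single lag."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))


def estimate_f0(frame, fs, f_min=60.0, f_max=400.0):
    """Pick the autocorrelation peak inside an assumed F0 search range."""
    lag_min = int(fs / f_max)
    lag_max = int(fs / f_min)
    best_lag = max(range(lag_min, lag_max + 1), key=lambda lag: autocorr(frame, lag))
    return fs / best_lag


# Synthetic stand-in for the source wave: 125 Hz fundamental plus one harmonic.
FS = 8000
frame = [
    math.sin(2 * math.pi * 125 * n / FS) + 0.5 * math.sin(2 * math.pi * 250 * n / FS)
    for n in range(1024)
]
f0 = estimate_f0(frame, FS)
level = rms(frame)
```

A real frame would need windowing and a voiced/unvoiced decision before the peak is trusted; both are left out to keep the sketch short.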

B. hv conversion
To normalize differences in speaker vocal tract lengths from the actual formant space consisting of the first to third formant frequencies, variables h and v are defined as in (1), where h (horizontal) is a parameter corresponding to the front-back position of the tongue and v (vertical) corresponds to opening of the chin. (h, v) = (1, 1) indicates the articulation position of the neutral vowel (F_1 : F_2 : F_3 = 1 : 3 : 5) in the actual formant space. Vowel colorization has been proposed as a method for normalizing speaker individuality in a formant space [7]. In this method, the first to third formant cyclic ratios correspond to the primary color signals RGB, with vowels colorized by mapping from an actual formant space to RGB space as in (2), thereby rendering neutral vowels colorless.
In the speech compensation system, the hv-plane is used as a space in which the articulation state can be expressed in addition to the speaker normalization ability of the RGB space.

C. Formant restoration
Articulation points (h, v) in the hv-plane have a one-to-many relationship with sample points in the actual formant space (F_1, F_2, F_3) and cannot be uniquely inverse-mapped to the actual formant space. If an arbitrary articulation point in the hv-plane is (h', v') and the formant space point obtained by inverse mapping is (F_1, F_2, F_3), their relation can be expressed as (3), a modification of (2). Applying (3) to the relation of the formant ratio gives (4). By providing a constant c, it is possible to inversely map from the hv-plane to a unique point in the actual formant space, as in (5).
If (h', v') = (1, 1) and c = 500 in (5), the formant frequency (F_1, F_2, F_3) = (500, 1500, 2500) is the neutral vowel of an adult male with a vocal tract length of 17.5 cm. For the articulation shape corresponding to articulation point (h', v'), the formant ratio of each speaker is uniquely determined by (4), and the absolute value of the formant is determined by (5) with a constant c depending on the vocal tract length. In (5), the constant c gives the value of F_3.
If we define a function g(h', v') with (h', v') as a variable and set c ≡ 500g(h', v'), we can express (5) as (6), which can uniquely recover the formant frequencies (F_1, F_2, F_3) from any articulation point (h', v'). The function g(h', v') is called the formant restoration function, and inverse mapping from the hv-plane to the formant space is called formant restoration. g(h, v) is defined for each speaker as a three-dimensional surface on the hv-plane as in (7), and coefficients a_0, a_1, ..., a_9 for each speaker are estimated by multiple regression analysis.
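
The regression step for the formant restoration function can be illustrated with a minimal Python sketch. It assumes that the surface in (7) is a full cubic polynomial in h and v, which has exactly ten terms and so matches the ten coefficients a_0 to a_9. The ordering of the terms, the synthetic training samples, and all function names are hypothetical and not taken from the paper; only the relation c = 500 g comes from the text.

```python
def cubic_terms(h, v):
    """Ten monomials of a full bivariate cubic (term order is an assumption)."""
    return [1.0, h, v, h * h, h * v, v * v, h ** 3, h * h * v, h * v * v, v ** 3]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_restoration(samples):
    """Least-squares estimate of a_0..a_9 from (h, v, g) samples via normal equations."""
    rows = [cubic_terms(h, v) for h, v, _ in samples]
    ys = [g for _, _, g in samples]
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(ata, aty)


def g_value(coeffs, h, v):
    """Evaluate the fitted surface g(h, v)."""
    return sum(a * t for a, t in zip(coeffs, cubic_terms(h, v)))


# Hypothetical "true" surface used only to generate synthetic samples.
TRUE = [1.0, 0.1, -0.05, 0.02, 0.01, -0.03, 0.005, 0.0, -0.004, 0.002]
grid = [0.5 + 0.2 * i for i in range(6)]
samples = [(h, v, g_value(TRUE, h, v)) for h in grid for v in grid]
coeffs = fit_restoration(samples)
c = 500.0 * g_value(coeffs, 1.0, 1.0)  # constant c at the neutral articulation point
```

With measured data, each sample would pair an articulation point with the g value observed for that speaker, and a library least-squares routine would be a more robust choice than the hand-written normal equations used here.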
III. VOWEL PHONOLOGICAL CONTROL METHOD THROUGH A STANDARDIZED SPACE

A. Method overview
To improve the phonological properties of vowels, this method maps the hv-plane of the impaired speech (the original speaker's hv-plane) to the standardized space and then to the hv-plane of the target normal speech (the target speaker's hv-plane). Dysarthric speakers have unique vowel hv distributions depending on the type and severity of the disorder. Through the standardized space, it is possible to normalize differences in the vowel hv distribution that vary depending on the dysarthric speaker. Therefore, even when the dysarthric speaker to be processed changes (that is, when the original speaker's hv-plane changes), the same normalization processing can be performed.

B. Mapping from the original speaker's hv-plane to the standardized space
From the hv distribution of dysarthric speech, five basic points (h_k, v_k) corresponding to the gravity centers of the five Japanese vowels "a, i, u, e, o" and the gravity center (h_g, v_g) of the five basic points are calculated. The declination of (h_k, v_k) when (h_g, v_g) is the origin point is calculated by (8). As Figure 2(a) shows, (h_k, v_k) is rotated using (9) so that the declination θ_a of (h_a, v_a) is 0 rad with origin (h_g, v_g), which gives the rotation coordinates (h_tk, v_tk) of the point. The declination θ_tk of (h_tk, v_tk) is calculated by (10). From the above, θ_ta is 0 rad.
Consider the case where an arbitrary point X(h_x, v_x) is input. The rotation coordinate X_t(h_tx, v_tx) and declination θ_tx are calculated by setting k ≡ x in (9) and (10). As Table I shows, segments seg_1 to seg_5 are set from θ_ta to θ_to according to the argument range. The segment to which θ_tx belongs is determined from seg_1 to seg_5, and the two basic points (h_1, v_1) and (h_2, v_2) corresponding to that segment are set. Point P(h_p, v_p) is the intersection of the straight line passing through basic points (h_1, v_1) and (h_2, v_2) and the straight line passing through the origin point and point X_t. Point P transitions linearly between the two basic points, and its position can be represented as a ratio β from 0 to 1 measured from (h_1, v_1), as in (11).
As Figure 2(b) shows, the radii of points X_t and P are r_tx and r_p, and the radius r_n and declination θ_n mapped to the standardized space can be calculated by (12), which is an interpolation function for standardizing the argument. If the declinations of basic points (h_1, v_1) and (h_2, v_2) are θ_m and θ_{m+1}, coefficients c_1 and c_2 in (12) are given by (13). Assuming that a point in the standardized space represented by radius r_n and declination θ_n is X'(h'_x, v'_x), its coordinates are given by (14).

C. Mapping from standardized space to target speaker's hv-plane
By mapping from the original speaker's hv-plane to the standardized space, the declination and radius are standardized even when the gravity centers of the basic points for the five vowels differ for each speaker. Next, we map from the standardized space to the target speaker's hv-plane with normal phonological properties.
For the point X'(h'_x, v'_x) mapped from the original speaker's hv-plane to the standardized space, the declination Δ_s to the corresponding basic point (h_1, v_1) is obtained. The ratio α of Δ_s to the declination between the two basic points (h_1, v_1) and (h_2, v_2) is calculated by (15). In the target speaker's hv-plane, using the two basic points (h_t1, v_t1), (h_t2, v_t2) and ratio α corresponding to the segment of point X' in the standardized space, a point (h_s, v_s) on the straight line connecting the two basic points is calculated by (16). (h_s, v_s) in (16) corresponds to the declination in the target speaker's hv-plane. Using radius r_n of point X' in the standardized space, (h', v') of the target speaker's hv-plane corresponding to point X' are obtained by (17).

D. Processing under proposed method
Figure 3 shows an example of processing using the vowel phonological control method through the standardized space. Figure 3(a) and (b) show the original speaker's formant frequencies and hv-plane, (c) shows the standardized space, and (d) and (e) show the target speaker's hv-plane and restored formant frequencies. An appropriate male speaker is set as the original speaker, and female speaker F5 of the phoneme-balanced words (ATR 216 words [8]) is set as the target speaker F5-ATR. The squares in Figure 3 represent the gravity center position of each vowel.
For target speaker hv-planes, an area limitation process is performed to prevent mapping to unusual points such as extreme articulation states in the restoration to the actual formant space. Namely, if a point is mapped outside the bold line, it is remapped to the point where the bold line intersects the straight line connecting the mapped point and the origin.
The broken line connecting the dots in Figure 3 is an example of the normalization process in which the hv transition obtained by hv conversion of the formant trajectory of the continuous Japanese vowels /ieaou/ is input to the original speaker's hv-plane.
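
The area limitation process described in this subsection can be sketched in Python as follows. The sketch assumes that the bold line is available as a closed polygon that is star-shaped around the origin of the radial line, and it leaves that origin as a parameter because which point serves as the origin is an assumption here. The function names and the square boundary in the usage lines are hypothetical.

```python
def cross(a, b):
    """z-component of the 2-D cross product."""
    return a[0] * b[1] - a[1] * b[0]


def limit_to_area(point, boundary, origin=(0.0, 0.0)):
    """Pull a point lying outside the boundary back onto it along the line to the origin."""
    x = (point[0] - origin[0], point[1] - origin[1])
    best = None
    n = len(boundary)
    for i in range(n):
        p = (boundary[i][0] - origin[0], boundary[i][1] - origin[1])
        q = (boundary[(i + 1) % n][0] - origin[0], boundary[(i + 1) % n][1] - origin[1])
        d = (q[0] - p[0], q[1] - p[1])
        denom = cross(x, d)
        if denom == 0.0:
            continue  # the radial line is parallel to this edge
        # Solve s * x = p + t * d: t locates the hit on the edge, s scales the point.
        t = -cross(x, p) / denom
        s = cross(p, d) / denom
        if 0.0 <= t <= 1.0 and s > 0.0:
            best = s if best is None else min(best, s)
    if best is None or best >= 1.0:
        return point  # already inside (or on) the boundary
    return (origin[0] + best * x[0], origin[1] + best * x[1])


# Hypothetical square boundary centered on the origin.
SQUARE = [(1.0, 1.0), (-1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]
outside = limit_to_area((2.0, 0.0), SQUARE)
inside = limit_to_area((0.5, 0.5), SQUARE)
```

Points inside the boundary are returned unchanged, so the function can be applied to every mapped point without a separate inside/outside test.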
In the original speaker's hv-plane, the standardized space, and the target speaker's hv-plane, each vowel centroid point has a corresponding relation. Therefore, an hv transition that is continuous on the original speaker's hv-plane is also continuous in the other spaces.
Figure 4 compares hv trajectories on the target speaker's hv-plane when various speakers are set as the target speaker. The same input and original speaker's hv-plane as in Figure 3 were used. Figure 4(a) and (b) show the hv transition results when female speaker F3 of the ATR 216 words is set as the target speaker, and Figure 4(c) and (d) show the results when male speaker M6 is set. For each speaker, the gravity centers of the five vowels represented by squares have a slightly different distribution, and the restricted area represented by the bold line has a unique range. In each target speaker's hv-plane, the hv transition of the original speaker's hv-plane is converted according to the positional relationship between the five vowel centroids.
Figure 5 shows a processing simulation using a dummy speaker as the original speaker. The dummy speaker imitates dysarthric speech. Figure 5 confirms the effect of normalization even for speech that imitates dysarthria.
These results suggest that when mapping from the original speaker's hv-plane to the target speaker's hv-plane through the standardized space, the vowel phonology can be controlled by setting the target speaker.
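
The two-stage mapping of Sections III-B and III-C can be summarized in a hedged Python sketch. It is one plausible reading of the procedure, not the authors' exact formulas: it assumes that the standardized space places the five vowel centroids at equal angular steps of 2π/5 at unit radius, that r_n = r_tx / r_p, that the argument is interpolated linearly with β, and that the target point is obtained by scaling (h_s, v_s) by r_n about the target speaker's gravity center. Relative declinations are used in place of an explicit rotation, which has the same effect. The sketch also assumes that each gravity center lies inside the pentagon of basic points and that both speakers share the same angular vowel order. All coordinate values and function names are hypothetical.

```python
import math

TWO_PI = 2.0 * math.pi


def centroid(points):
    pts = list(points)
    return (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))


def cross(a, b):
    return a[0] * b[1] - a[1] * b[0]


def to_standardized(x, basic):
    """Map x on the original hv-plane to (r_n, theta_n); also return the vowel order."""
    g = centroid(basic.values())
    ang = {k: math.atan2(p[1] - g[1], p[0] - g[0]) for k, p in basic.items()}
    # Declination relative to /a/: same effect as rotating /a/ onto 0 rad.
    rel = {k: (ang[k] - ang["a"]) % TWO_PI for k in basic}
    order = sorted(basic, key=rel.get)
    n = len(order)
    xc = (x[0] - g[0], x[1] - g[1])
    r_tx = math.hypot(xc[0], xc[1])
    if r_tx == 0.0:
        return 0.0, 0.0, order  # the gravity center maps to the origin
    theta_tx = (math.atan2(xc[1], xc[0]) - ang["a"]) % TWO_PI
    m = n - 1  # the last segment wraps back to /a/
    for i in range(n - 1):
        if rel[order[i]] <= theta_tx < rel[order[i + 1]]:
            m = i
            break
    p1, p2 = basic[order[m]], basic[order[(m + 1) % n]]
    b1 = (p1[0] - g[0], p1[1] - g[1])
    d = (p2[0] - p1[0], p2[1] - p1[1])
    # Intersection P of the ray through X with the line through the two basic points:
    # scale * xc = b1 + beta * d.
    denom = cross(xc, d)
    beta = -cross(xc, b1) / denom
    r_p = (cross(b1, d) / denom) * r_tx
    r_n = r_tx / r_p
    theta_n = TWO_PI * (m + beta) / n  # assumed equal-angle standardized layout
    return r_n, theta_n, order


def to_target(r_n, theta_n, order, target):
    """Map (r_n, theta_n) from the standardized space onto the target hv-plane."""
    n = len(order)
    step = TWO_PI / n
    theta = theta_n % TWO_PI
    m = min(int(theta // step), n - 1)
    alpha = (theta - m * step) / step
    t1, t2 = target[order[m]], target[order[(m + 1) % n]]
    hs = (1.0 - alpha) * t1[0] + alpha * t2[0]
    vs = (1.0 - alpha) * t1[1] + alpha * t2[1]
    g = centroid(target.values())
    return (g[0] + r_n * (hs - g[0]), g[1] + r_n * (vs - g[1]))


def control_vowel(x, original, target):
    r_n, theta_n, order = to_standardized(x, original)
    return to_target(r_n, theta_n, order, target)


# Hypothetical vowel gravity centers on the two hv-planes.
ORIGINAL = {"a": (1.00, 1.40), "e": (1.35, 1.10), "i": (1.50, 0.70),
            "u": (0.90, 0.60), "o": (0.65, 0.95)}
TARGET = {"a": (1.05, 1.50), "e": (1.45, 1.15), "i": (1.60, 0.65),
          "u": (0.85, 0.55), "o": (0.60, 1.00)}
mapped = control_vowel((1.10, 1.05), ORIGINAL, TARGET)
```

Under these assumptions, each vowel gravity center of the original speaker lands on the corresponding gravity center of the target speaker, and the mapping is continuous across segment boundaries, which matches the correspondence described for Figure 3.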

IV. CONCLUSION
We proposed a vowel phonological control method for a speech compensation system for dysarthria using a standardized space. In this method, the hv distributions that differ for each dysarthric speaker are normalized by mapping from the original speaker's hv-plane to the standardized space. Therefore, even if the original speaker changes, as long as the gravity centers of the speaker's five vowels can be obtained, the mapping from the original speaker's hv-plane to the target speaker's hv-plane can be performed with the same processing. Using the proposed method, we set an original speaker and a target speaker and performed a processing simulation, confirming from the hv transitions the correspondence among the original speaker's hv-plane, the standardized space, and the target speaker's hv-plane. The results suggest that vowel phonology can be controlled by setting an arbitrary target speaker.
Topics for future study include implementation of speech synthesis using the restored formant frequency, as well as