Robust Speech Coding Algorithm

Speech communication in an Electronic Warfare (EW) environment should be resistant to interception and masquerade, and tolerant of communication channel errors. In this paper, we describe an algorithm that provides speech compression, strong encryption, error tolerance and speaker authentication. This Robust Speech Coder (RSC) is backward compatible with existing codecs, with the ability to opt in to the additional features as and when required.


INTRODUCTION
Speech coders are classified into waveform coders, parametric coders and hybrid coders. Waveform coders such as Pulse Code Modulation (PCM) and Adaptive Differential PCM (ADPCM) attempt to preserve the original shape of the input signal and work at bit rates of 32 kbps and above. In parametric coders, parameters of the input speech signal are estimated and these parameters are used to synthesize the speech signal; this class of coders typically operates in the range of 2 to 5 kbps, with Linear Prediction Coding (LPC) and Mixed Excitation Linear Prediction (MELP) as examples. Hybrid coders combine the strengths of waveform coders with those of parametric coders and typically operate between 5 and 32 kbps; the Code Excited Linear Prediction (CELP) algorithm and its variants, and the mixed excitation linear prediction algorithm and its variants, belong to this class.
In parametric speech coding, 256 samples of the input speech signal are buffered into a frame and passed through a linear prediction filter. The frame can then be represented by ten filter coefficients plus a scale factor: the 4096 bits corresponding to the 256 sixteen-bit samples of the original speech frame are reduced to 45 bits per frame.
The speech coding procedure is summarised as under:
• Encoding
Derive the filter coefficients from the speech frame.
Derive the scale factor from the speech frame.
Transmit the filter coefficients and scale factor to the decoder.
• Decoding
Generate a white noise sequence.
Multiply the white noise samples by the scale factor.
Construct the filter using the coefficients received from the encoder and filter the scaled white noise sequence. The output speech is the output of the filter.
Electronic copy available at: https://ssrn.com/abstract=3625459
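The encode/decode loop above can be sketched in a few lines; the following is a minimal illustration only (NumPy/SciPy, a 10th-order filter, autocorrelation-based Levinson-Durbin analysis and Gaussian white noise are assumptions of the sketch, not the exact federal-standard procedure).

```python
import numpy as np
from scipy.signal import lfilter

ORDER = 10    # ten filter coefficients per frame, as in the text
FRAME = 256   # samples per frame

def levinson_durbin(r, order):
    """Levinson-Durbin recursion: autocorrelation r -> LPC coefficients a
    (with a[0] == 1) and residual prediction-error energy err."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / err                  # reflection coefficient
        new_a = a.copy()
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err

def encode(frame):
    """Encoder: derive the filter coefficients and scale factor from a frame."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + ORDER]
    a, err = levinson_durbin(r, ORDER)
    gain = np.sqrt(max(err, 1e-12) / len(frame))  # scale factor for unit-variance noise
    return a, gain

def decode(a, gain, rng):
    """Decoder: scaled white noise through the all-pole synthesis filter."""
    excitation = gain * rng.standard_normal(FRAME)
    return lfilter([1.0], a, excitation)
```

Because the autocorrelation method is used, the resulting synthesis filter is guaranteed stable, which is why the decoder can safely filter white noise through it.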
The LPC coder uses a fully parametric model and produces intelligible speech at 2.4 kbps. However, it generates annoying artefacts such as buzzes, thumps and tonal noises. MELP utilizes additional parameters to capture the underlying signal dynamics; the MELP voice encoder is reviewed in this paper. In a symmetric encryption system, both sender and receiver use the same key. If the sender and receiver use different keys, the system is referred to as an asymmetric or public-key encryption system. A block cipher processes the plaintext input in fixed-size blocks and produces a block of ciphertext of equal size for each plaintext block.
Block symmetric encryption is suitable for use with parametric speech coders because both buffer the input data into frames and process it frame by frame.

Block Diagram of MELP [1]
A block diagram of the MELP model of speech production is shown in Figure 1; the model is an attempt to improve upon the LPC model. The MELP decoder utilizes a sophisticated interpolation technique to smooth out inter-frame transitions. A randomly generated period jitter is used to perturb the value of the pitch period so as to generate an aperiodic impulse train. The MELP coder extends the voicing classification to three classes: unvoiced, voiced, and jittery voiced. The latter state corresponds to the case when the excitation is aperiodic but not completely random, which is often encountered in voicing transitions. This jittery voiced state is controlled in the MELP model by the pitch jitter parameter, essentially a random number. A period jitter uniformly distributed up to +/-25% of the pitch period produced good results. The short isolated tones, often encountered in LPC-coded speech due to misclassification of the voicing state, are reduced to a minimum.

Figure 1. The MELP model of speech production
The shape of the excitation pulse for periodic excitation is extracted from the input speech signal and transmitted as information on the frame. The shape of the pulse contains important information and is captured by the MELP coder through Fourier magnitudes of the prediction error. These quantities are used to generate the impulse response of the pulse generation filter (Figure 1), responsible for the synthesis of periodic excitation.
Periodic excitation and noise excitation are first filtered using the pulse shaping filter and noise shaping filter, respectively; with the filters' outputs added together to form the total excitation, known as the mixed excitation, since portions of the noise and pulse train are mixed together.
In Figure 1, the frequency responses of the shaping filters are controlled by a set of parameters called voicing strengths, which measure the amount of ''voicedness.'' The responses of these filters are variable with time, with their parameters estimated from the input speech signal, and transmitted as information on the frame.

Shaping Filters
The MELP speech production model makes use of two shaping filters (Figure 1) to combine pulse excitation with noise excitation so as to form the mixed excitation signal. The responses of these filters are controlled by a set of parameters called voicing strengths; these parameters are estimated from the input signal. By varying the voicing strengths with time, a pair of time-varying filters results. These filters decide the amount of pulse and the amount of noise in the excitation at various frequency bands.
In FS MELP, each shaping filter is composed of five filters, called the synthesis filters, since they are used to synthesize the mixed excitation signal during decoding. Each synthesis filter controls one particular frequency band, with pass bands defined by 0-500, 500-1000, 1000-2000, 2000-3000, and 3000-4000 Hz. The synthesis filters connected in parallel define the frequency responses of the shaping filters. Figure 2 shows the block diagram of the pulse shaping filter, exhibiting the mechanism by which the frequency response is controlled; VS 1 to 5 are the voicing strengths. Thus, the two filters complement each other in the sense that if the gain of one filter is high, then the gain of the other is proportionately lower, with the total gain of the two filters remaining constant at all times.
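The complementary five-band mixing can be sketched as follows. This is an illustration under stated assumptions: 31-tap FIR approximations of the five synthesis bands (the standard specifies its own filter designs), with per-band pulse gain equal to the voicing strength and noise gain equal to its complement, so the two shaping filters always sum to unit gain.

```python
import numpy as np
from scipy.signal import firwin, lfilter

FS = 8000  # sampling rate in Hz
BANDS = [(0, 500), (500, 1000), (1000, 2000), (2000, 3000), (3000, 4000)]

def synthesis_filters(numtaps=31):
    """One linear-phase FIR filter per band: lowpass for the first band,
    highpass for the last, bandpass in between."""
    filters = []
    for lo, hi in BANDS:
        if lo == 0:
            filters.append(firwin(numtaps, hi, fs=FS))                      # lowpass
        elif hi >= FS / 2:
            filters.append(firwin(numtaps, lo, fs=FS, pass_zero=False))     # highpass
        else:
            filters.append(firwin(numtaps, [lo, hi], fs=FS, pass_zero=False))  # bandpass
    return filters

def mixed_excitation(pulse, noise, voicing_strengths, filters):
    """In each band the pulse is weighted by the voicing strength vs and the
    noise by 1 - vs, so the band gains of the two filters sum to one."""
    out = np.zeros_like(pulse, dtype=float)
    for h, vs in zip(filters, voicing_strengths):
        out += vs * lfilter(h, [1.0], pulse) + (1.0 - vs) * lfilter(h, [1.0], noise)
    return out
```

Setting a band's voicing strength to 1 makes that band purely periodic; setting it to 0 makes it pure noise, matching the complementary-gain behaviour described above.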

1.2 Kbps / 2.4 Kbps MELP Speech Coders [2]
MELPe, or enhanced MELP (Mixed Excitation Linear Prediction), is a United States Department of Defense speech coding standard used mainly in military applications, satellite communications, secure voice and secure radio devices. In 2002, MELPe was adopted as the NATO standard known as STANAG-4591, enabling the same quality as the old 2400 bit/s MELP at half the rate.
The 2.4 Kbps MELP algorithm divides the 8 kHz-sampled speech signal into 22.5 ms frames for analysis, whereas the 1.2 Kbps MELP algorithm groups three 22.5 ms frames into a 67.5 ms super frame for analysis. Depending on the type of speech present in the signal, inter-frame redundancy can be exploited to quantize the parameters efficiently.

Bit Allocation
The bit allocation scheme of FS MELP [1] is summarised in Figure 3. A total of 54 bits is transmitted per frame, at a frame length of 22.5 ms, which requires a bit rate of 2.4 kbps.

Data Encryption Algorithm (DEA)
An encryption scheme is computationally secure if the cost of breaking the ciphertext generated by the scheme exceeds the value of the encrypted information, and the time required to break the cipher exceeds the useful lifetime of the information. On the battlefield, the lives of soldiers depend on information and therefore the value of the information is incalculable. However, the useful lifetime of the information is known, and the time required to break the cipher can be estimated.
Assuming there are no inherent mathematical weaknesses in the algorithm, the brute-force approach gives reasonable estimates of that time. A brute-force attack involves trying every possible key until an intelligible translation of the ciphertext into plaintext is obtained. Assuming a machine that performs one million decryptions per microsecond, it takes 10.01 hours on average [3] to break DES with its 56-bit key, and on the order of 5.9 x 10**30 years to break triple DES with a 168-bit key.
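The estimate can be reproduced directly. On average a brute-force search tries half the keyspace; the figures quoted above correspond to an assumed rate of one million decryptions per microsecond (10**12 decryptions per second), as in the reference table [3].

```python
# Assumed attack speed: one million decryptions per microsecond.
RATE = 1e6 * 1e6  # decryptions per second

def avg_break_time_seconds(key_bits):
    """Average brute-force time: half of the 2**key_bits keyspace."""
    return 2 ** (key_bits - 1) / RATE

hours_des = avg_break_time_seconds(56) / 3600                  # 56-bit DES
years_3des = avg_break_time_seconds(168) / (3600 * 24 * 365)   # 168-bit 3DES
```

With these assumptions, `hours_des` comes out at roughly 10 hours and `years_3des` on the order of 10**30 years, matching the text.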
In 1999, Triple DES (3DES) was incorporated into the Data Encryption Standard and published as FIPS PUB 46-3. 3DES uses three keys and three executions of the DES algorithm; it is highly resistant to cryptanalysis and makes the system robust.
3DES processes the input data in 64-bit blocks. 54 bits are required to encode one 22.5 ms frame of input speech; the remaining 10 bits are utilised for error protection and authentication to make the coder more robust.

Authentication
In terms of communication security, a masquerade is an attack in which the attacker pretends to be an authorized user of a system in order to gain access to it, or to gain greater privileges than they are authorized for. For example, an enemy who has gained access to net-centric communication links can remain in listening mode while tactical operations progress and, at a critical time, take control of the connection and pass operational orders favourable to him, which could result in losing a battle. A security alert raised by the communication system can avert such a defeat.
Using speaker recognition algorithms, the speaker is identified from the original speech input frames.
An index for the speaker is obtained from the data store. The secure hashing algorithm SHA-512 is applied to this speaker index and a 512-bit message digest is obtained. These 512 bits are combined with the MELP-encoded speech frames at a rate of 1 bit per frame.
At the receiving end, the message digest is recovered. A new speaker ID is calculated at the receiving end from the synthetic speech, and the index for the speaker is obtained from the local database there. The hashing algorithm is applied to the index and a new message digest is obtained. The new message digest is compared with the received one; if there is a mismatch, the user at the receiving end is alerted.
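The embed-and-verify round trip can be sketched as follows. The serialisation of the speaker index before hashing (4 big-endian bytes here) and the representation of a frame are assumptions of the sketch; only the SHA-512 digest and the one-bit-per-frame spreading come from the text.

```python
import hashlib

def digest_bits(speaker_index):
    """SHA-512 of the speaker index as a list of 512 bits, MSB first.
    (Serialising the index as 4 big-endian bytes is an assumption.)"""
    md = hashlib.sha512(speaker_index.to_bytes(4, "big")).digest()
    return [(byte >> (7 - i)) & 1 for byte in md for i in range(8)]

def embed(frames, bits):
    """Attach one digest bit to each compressed speech frame."""
    return list(zip(frames, bits))

def verify(received, local_speaker_index):
    """Reassemble the received digest and compare it with one recomputed
    from the locally determined speaker index."""
    reassembled = [bit for _, bit in received]
    return reassembled == digest_bits(local_speaker_index)
```

If the receiver's speaker identification yields a different index than the sender's, the recomputed digest no longer matches and the mismatch triggers the alert.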

Error Protection [4]
In 1950, Hamming introduced the (7,4) code, which encodes 4 data bits into 7 bits by adding three parity bits. Hamming (7,4) can detect and correct single-bit errors; with the addition of an overall parity bit, it can also detect (but not correct) double-bit errors.
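A minimal Hamming (7,4) encoder/decoder illustrates the single-bit correction; this is the textbook construction with parity bits in positions 1, 2 and 4, not code taken from the MELP standard.

```python
def hamming74_encode(d):
    """Encode 4 data bits as the 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct up to one flipped bit, then return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the error, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]
```

The syndrome directly encodes the position of the flipped bit, which is what makes decoding a single table-free step.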
In the MELP algorithm, Forward Error Correction (FEC) is implemented in the unvoiced mode only. The parameters that are not transmitted in the unvoiced mode are the Fourier magnitudes, band-pass voicing and the aperiodic flag; FEC replaces these 13 bits with parity bits from three Hamming (7,4) codes and one Hamming (8,4) code. However, no error correction is provided for the voiced mode. The DES/3DES encryption algorithms process input data in 64-bit blocks. 54 bits are allocated for the MELP-encoded speech frame and 1 bit per frame is added for authentication. The remaining 9 bits are utilised as FEC parity bits for the voiced mode, from one Hamming (31,26) code and one Hamming (15,11) code. Figure 4 gives the block diagram of the RSC voice coder with the 3DES encryption scheme, SHA-512 authentication and 9-bit FEC incorporated in its algorithm.

Step 1: Data Compression
The original speech is buffered into 22.5 ms frames and passed through the MELP coding filter. Each 22.5 ms frame is coded into a 54-bit compressed speech frame.

Step 2: Authentication
The original speech corresponding to 512 frames is buffered and, using a speaker recognition algorithm, the speaker is identified. All the authorised speakers' names are recorded in the local database, and a replica of this database is loaded at all destination receiving stations. All the names are indexed, and the index number corresponding to the speaker is retrieved from the local database. The index is hashed using SHA-512, the resulting 512-bit Message Digest (MD) is buffered, and 1 bit per frame is added to each 54-bit compressed speech frame.
If an unauthorized person speaks, the database will return a special code corresponding to an unknown speaker and the destination recipient will be alerted.
One bit per frame (22.5 ms) is added to the 54-bit compressed speech frame. If the SHA-1 hashing algorithm is used, the message digest is 160 bits long and it takes 22.5 x 160 = 3600 ms (3.6 seconds) to buffer the original speech, so the speaker is authenticated every 3.6 seconds. With a stronger secure hashing algorithm like SHA-512, the authentication period is approximately 12 seconds.
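The authentication period follows directly from the digest length, since exactly one digest bit travels per 22.5 ms frame (SHA-1 produces a 160-bit digest, SHA-512 a 512-bit one):

```python
FRAME_MS = 22.5  # one compressed speech frame every 22.5 ms

def auth_period_seconds(digest_bits):
    """At one digest bit per frame, a full digest spans this many seconds."""
    return digest_bits * FRAME_MS / 1000.0

sha1_period = auth_period_seconds(160)    # 3.6 s
sha512_period = auth_period_seconds(512)  # 11.52 s, roughly 12 s
```

The trade-off is explicit: a stronger (longer) digest means a longer interval between successive speaker authentications.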

Step 3: Forward Error Correction
In the MELP algorithm, Forward Error Correction (FEC) is implemented in the unvoiced mode only. The RSC algorithm uses 9 parity bits to provide error correction, from one Hamming (31,26) code and one Hamming (15,11) code. The LPC parameters are coded with 25 bits (refer Figure 3 above). The Hamming (31,26) code is applied to the 25 LPC parameter bits and the MSB of the band-pass voicing parameter. The Hamming (15,11) code is applied to the 5 bits of the second gain parameter, the 3 bits of the first gain parameter and the three LSBs of the band-pass voicing parameter. Up to two bit errors, one per code, can thus be corrected over 37 bits of data covering four parameters: LPC, first gain, second gain and band-pass voicing. The 9 parity bits therefore protect 37 of the 55 bits, covering the most critical parameters in the RSC algorithm.
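The parity budget can be checked arithmetically: a Hamming code of length n = 2**r - 1 carries r parity bits, so (31,26) contributes 5 and (15,11) contributes 4, for the 9 parity bits available in the 64-bit block.

```python
def hamming_parity_bits(n, k):
    """A Hamming (n, k) code has r = n - k parity bits and requires
    2**r - 1 == n, so every nonzero syndrome addresses one position."""
    r = n - k
    assert 2 ** r - 1 == n, "not a valid Hamming code length"
    return r

parity_total = hamming_parity_bits(31, 26) + hamming_parity_bits(15, 11)  # 5 + 4
data_covered = 26 + 11  # data bits protected by the two codes together
```

Each code corrects one error within its own codeword, which is why the scheme corrects up to two errors in total, one per code.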
Step 4: Encryption
The 54 bits of compressed speech, 9 bits of forward error correction and 1 authentication bit are assembled into a 64-bit frame. This 64-bit block is encrypted with 3DES and the resulting 64 bits of encrypted compressed speech are transmitted to the receiving end.
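The 54 + 9 + 1 assembly can be sketched as straightforward bit packing. The field order within the block is an assumption of the sketch, and the cipher step is omitted: the Python standard library has no 3DES, so in practice the 8-byte block produced here would be fed to a 3DES implementation such as the one in pycryptodome.

```python
def pack_frame(speech54, fec9, auth1):
    """Pack 54 speech bits, 9 FEC parity bits and 1 authentication bit
    into one 64-bit block (field order is an assumed layout)."""
    assert speech54 < 2 ** 54 and fec9 < 2 ** 9 and auth1 < 2
    block = (speech54 << 10) | (fec9 << 1) | auth1
    return block.to_bytes(8, "big")   # exactly one 64-bit cipher block

def unpack_frame(block):
    """Inverse of pack_frame, used after decryption at the receiver."""
    v = int.from_bytes(block, "big")
    return v >> 10, (v >> 1) & 0x1FF, v & 1
```

Because the packed frame is exactly 64 bits, each speech frame maps onto exactly one 3DES block with no padding.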

Step 5: Decryption
The 64 bits of encrypted speech are input to 3DES decryption. The resulting 64-bit decrypted frame is passed to the next stage.
Step 6: Application of FEC and Reassembly of MD
The 9 parity bits are used to correct errors, if any. The 54 bits of compressed speech are separated and given to the MELP decoding filter. The 1 authentication bit per frame is buffered and the original 512-bit message digest (MD) is reassembled. The original MD is used for comparison with the new MD calculated from the synthesized speech.
Step 7: Speech Synthesis
The 54-bit compressed speech frame is passed through the MELP decoder, which produces a synthesized speech frame of 22.5 ms.
Step 8: Calculation of New MD and Alert Generation
512 frames of synthetic speech are buffered and given as input to the speaker identifier. The index of the speaker is retrieved from the local database, SHA-512 hashing is applied to the index, and a new MD is produced. The new MD is compared with the original MD reassembled from the received speech frames. If both MDs are the same, there is no masquerade; if they do not match, an alert is given to the receiver that the speaker at the sending end has changed.

CONCLUSIONS
The RSC voice processor uses a 64-bit allocation per 22.5 ms frame, which translates to a 2844 bps bit rate. However, the RSC voice processor is interoperable with existing MELP-based communication systems in non-encryption mode; the secure mode can optionally be switched on at the extra cost of 444 bps.
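The rate figures follow from the frame timing alone:

```python
FRAME_S = 0.0225  # 22.5 ms frame length

def bit_rate(bits_per_frame):
    """Bits per second at one frame every 22.5 ms."""
    return bits_per_frame / FRAME_S

melp_rate = bit_rate(54)       # plain 54-bit MELP payload: 2400 bps
rsc_rate = bit_rate(64)        # full 64-bit encrypted RSC frame: ~2844 bps
overhead = rsc_rate - melp_rate  # cost of secure mode: ~444 bps
```
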
The encryption algorithm introduces only a very small processing delay. Advances in microelectronics and the wide availability of low-cost programmable processors and dedicated chips have enabled rapid technology transfer to product development. Assuming one microsecond for encryption/decryption, the added delay is negligible: the overall speaker-to-listener delay remains below the acceptable 150 ms, and conversation is not impaired after switching to encryption mode.
Net-centric communications are accessed by a large number of users; there is therefore a need to provide protection against security attacks, and suitable security systems should be introduced to match the speed of migration to net-centric communications.
In this paper, we described the RSC algorithm, which provides speech compression, encryption, authentication against masquerade, and forward error correction against the errors produced by the harsh, noisy conditions of the battlefield. The RSC algorithm described in this paper is suitable for secure net-centric communications in the battlefield: a robust, secure voice processor with explicit encryption and error correction features.