Voice Biometric Identity Authentication Model for IoT Devices

Behavioral biometric authentication is considered a promising approach to securing the Internet of Things (IoT) ecosystem. In this paper, we investigate the need for and suitability of voice recognition systems for user authentication in the IoT. Tools and techniques used to build voice recognition systems are reviewed, and their appropriateness to the IoT environment is discussed. Finally, a voice recognition system is proposed for IoT ecosystem user authentication. The proposed system has two phases. The first is the enrollment phase, which consists of a pre-processing step where noise is removed from the voice, a feature extraction step where feature traits are extracted from the user's voice, and a model training step where the voice model is trained for the IoT user. The second is the verification phase, which verifies whether the identity claimer is the owner of the IoT device. Given the limited resources of IoT technologies, the suitability of text-dependent voice recognition systems is promoted. Likewise, the use of Mel-Frequency Cepstral Coefficient (MFCC) features is considered in the proposed system.


INTRODUCTION
Biometrics-based authentication is the automatic verification of an identity claimer using his or her physiological or behavioral traits. Using biometric authentication to secure the IoT ecosystem is a promising approach [1]. In general, biometric authentication systems involve two steps: enrollment and verification of the user. The two steps are discussed in Section 2. Owing to the portability, stability, and privacy of voice features, voice recognition authentication has attracted extensive attention and application in recent years [2]. Voice recognition systems are versatile, simple to use, and non-intrusive by nature. They are considered accurate and do not require specialized tools; a smartphone is enough for remote authentication to various services. Likewise, among biometric authentication modalities, voice is the simplest and easiest single modality to acquire and use for user authentication [3].
As a result, voice recognition has in recent years attracted various leading technology companies. For example, Google has provided the Android-based Trusted Voice feature to allow users to unlock their smartphones. Saypay's mPayment consumers use a voice password to conduct transactions [2]. In addition, Google has promoted the use of automatic speaker recognition for authenticating users in the IoT [4]. Given the nature of the IoT ecosystem, especially its mobile remote control, the use of voice recognition for user authentication may offer a significant advantage [5]. In addition, the advantages of voice biometrics for the IoT ecosystem include small storage requirements, ease of transmission, and non-intrusiveness [6].
In this paper, a voice recognition authentication system for use in the IoT ecosystem is proposed. The resource limitations of IoT devices and the need for remote access are taken into consideration. For example, the proposed system uses MFCC to extract features and Support Vector Machines (SVMs) for user verification, which is fundamental to remote speaker identification [7].
Voice features extracted from acquired voice data can be high-level or low-level attributes. Low-level attributes, related to the vocal tract, are derived from spectral measurements, while high-level attributes are derived from behavioral cues such as dialect, word usage, and conversation patterns. High-level attributes are less sensitive to noise but are difficult to extract [8,9]. In this light, low-level cues are the more practical choice for IoT user authentication.
The rest of the paper is organized as follows: Section 2 gives the background, Section 3 presents the related work, Section 4 discusses the research gap, Section 5 presents the proposed system, Section 6 presents the limitations and assumptions, and Section 7 concludes the paper.

BACKGROUND
Interconnected environments such as machine-to-machine (M2M), machine-to-individual, and individual-to-individual make up the IoT ecosystem. In the IoT ecosystem, smart objects can communicate among themselves, things can detect each other, and everything can interact with each other and with the local environment. These interconnections are facilitated by remote sensing and tracking capabilities, and every entity is provided with data transfer through the internet, Wi-Fi, ZigBee, or Bluetooth. In particular, organizations may need such data for business, social, or research analytics [10]. For this reason, a vast variety of information is stored, managed, and processed. Access to these private data must be protected by secure access control. Conventional authentication mechanisms such as passwords are reported to fall short in the IoT ecosystem. Thus, biometric technology is considered a better substitute for the protection of IoT private data [11].
Biometrics are either physiological or behavioral. The voice recognition authentication mechanism belongs to the behavioral biometric schemes [12]. Based on the personal characteristics of voices, researchers have proposed a number of authentication schemes that employ voice recognition. Voice recognition is "the process of automatically recognizing who is speaking based on the signals of the voice" [13]. However, voice recognition schemes commonly suffer from two issues: changes in the owner's voice and the use of a recording of the owner's voice. That is, the voice of the owner may change because of conditions such as fatigue, a cold, or flu. Likewise, attackers could record the legitimate owner's voice and later use the recording for illegitimate authentication [14].
There are two main steps in the voice recognition process: voice enrollment and voice verification, as depicted in Fig. 1. The former registers a speaker by creating and storing a voice template in the database, and the latter determines whether a presented voice matches a stored template. Some papers describe the voice enrollment process as consisting of four steps: data collection, feature extraction, feature template creation, and template storage. Other researchers add one more step, called pre-processing, which comes after data collection and before feature extraction and aims at removing noise from the collected data. Likewise, the verification process consists of data collection, feature extraction, template matching, and the matching decision. These steps are discussed in the following subsections:

Data Collection/Acquisition
Collecting voice amounts to digitizing the speaker's voice. This is usually accomplished with a microphone that captures the voice at a given sampling rate. These data are then sent to a computing device for processing. Some researchers refer to this process as dataset generation or data sample collection [15]. There are two main ways of collecting voice: fixed text and random number strings. Systems usually use the latter, where each string consists of eight Arabic digits, each in the range 0 to 9 [16]. This process may include a pre-processing step where noise is removed from the original voice [17].
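As an illustration of the random-number-string prompt described above, the following minimal Python sketch generates a string of eight digits, each drawn from 0 to 9. The function name `make_digit_prompt` is hypothetical, and drawing the digits with the standard `secrets` module is our assumption rather than a detail reported in [16].

```python
import secrets

def make_digit_prompt(length: int = 8) -> str:
    """Return a prompt of `length` digits, each drawn uniformly from 0 to 9."""
    return "".join(secrets.choice("0123456789") for _ in range(length))

# The speaker would be asked to read this string aloud during data collection.
prompt = make_digit_prompt()
print(prompt)  # a different 8-digit string on each run
```

Because the prompt changes on every call, a previously recorded utterance is less likely to match the requested digits, although this sketch by itself does not verify what was actually spoken.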

Feature Extraction
Features are extracted from the voice data collected and pre-processed in the preceding step. These features must be robust to the intrinsic variability that distorts a user's voice, for example due to stress or disease. A number of techniques are used to extract features from the user's voice. These include, but are not limited to, Linear Prediction Cepstral Coefficients (LPCCs) and Mel-Frequency Cepstral Coefficients (MFCC) [16,17]. MFCC has been employed in some research to overcome the constrained resources and uncontrolled operating conditions that are similar to those of IoT technologies [18].
After this step, the enrollment and verification processes take different routes. In the enrollment process, template creation and template storage follow feature extraction, while in the verification process, template matching and the matching decision follow it.

Template Creation and Storage
This process creates templates from the common features that correspond to their owner. The templates are then stored in a voice recognition database. Several databases exist, such as the VidTIMIT database and the MEEI database.

Template Matching and Analysis
This process tries to find an exact or near-exact match between the identity claimer's voice and the previously stored voice templates. This can be accomplished using Fourier transforms or linear predictive coding (LPC) [19]. After the templates are created, the system is trained on them. Training methods include vector quantization (VQ), which is based on the Linde-Buzo-Gray (LBG) algorithm. In addition, the Hidden Markov Model (HMM) and the Gaussian Mixture Model (GMM) are also used for feature training [17].

Matching Decision
This process determines whether the identity claimer corresponds to the claimed identity based on the similarity of the two voices. The match is then either rejected or accepted. In this process, two kinds of errors may occur: false negatives and false positives. A false negative means the system has failed to identify a genuine claimer, while a false positive means access was granted to a non-authorized user [20].

Text-Dependent
Based on the textual content of the speech data, voice recognition systems can also be classified into two categories. In text-dependent systems, the identity claimer is expected to produce the same words as those pronounced during enrollment; the speaker therefore has to satisfy two conditions, knowing the words and being the rightful owner of the voice [2]. In text-independent systems, the user can speak freely during the enrollment and verification phases [21]. Most of the research claims that text-dependent recognition systems perform better and are simpler than text-independent systems [22]. Thus, given the limited resources of smart devices, the text-dependent voice recognition approach is a better fit for IoT authentication.

Evaluation Metrics
To measure the effectiveness of voice recognition systems, a number of parameters are studied. These include the false acceptance rate (FAR), which is the proportion of attacks incorrectly labelled as authentic by the system, and the false rejection rate (FRR), which is the proportion of authentic interactions incorrectly rejected as attacks. The receiver operating characteristic (ROC), also called the relative operating characteristic, represents the trade-off between FAR and FRR and helps in choosing an operating point that keeps both low [23].
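To make the two rates concrete, the sketch below computes FAR and FRR at a given decision threshold and sweeps the threshold to find the point where the two rates are closest. The score lists are hypothetical similarity scores invented for illustration (higher means more similar to the enrolled voice), and the function names are ours.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR) when a claim is accepted if its score >= threshold."""
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    return (false_accepts / len(impostor_scores),
            false_rejects / len(genuine_scores))

def roc_points(genuine_scores, impostor_scores):
    """Use every observed score as a threshold; return (threshold, FAR, FRR) tuples."""
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    return [(t, *far_frr(genuine_scores, impostor_scores, t)) for t in thresholds]

# Hypothetical scores, not measurements from any of the reviewed systems.
genuine = [0.91, 0.85, 0.78, 0.66, 0.95]
impostor = [0.30, 0.42, 0.70, 0.15, 0.55]

far, frr = far_frr(genuine, impostor, threshold=0.68)
print(far, frr)  # 0.2 0.2

# Operating point where FAR and FRR are closest (an equal-error-style choice).
best = min(roc_points(genuine, impostor), key=lambda p: abs(p[1] - p[2]))
print(best)  # (0.7, 0.2, 0.2)
```

Raising the threshold lowers FAR at the cost of a higher FRR, and lowering it does the opposite, which is the compromise the ROC describes.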

RELATED WORK
In general, biometric-based authentication systems employed in the IoT ecosystem are of two types: those based on human physiology, for instance, face, eyes, fingerprint, or electrocardiogram, and those based on behavioral features such as signature, voice, gait, or keystroke. For example, the researchers in [24] introduced a gaze-feature-based model which is secure against iterative and side-channel attacks. Likewise, in [25] the researchers used the electrocardiogram to develop a method with which they demonstrated the good candidacy of biometric features for authentication of IoT devices. The results of implementing this scheme reveal that it has a 1.41% FAR and an 81.82% true acceptance rate (TAR) for 4 seconds of signal acquisition. One of the main strengths of the scheme is that it conceals the biometric features during authentication, but a privacy preservation mechanism is not taken into consideration.
The researchers in [26] argue for the suitability of signature-based authentication systems for IoT devices and present three categories of signature-based schemes, namely offline, online, and behavior. Some gait-recognition-based authentication systems proposed for IoT devices are also in the literature [27]. A touch-screen-based authentication scheme is proposed in [28].
In [29], a keystroke-dynamics-based authentication scheme with three steps (enrollment, classifier, and user authentication) is proposed. Similarly, in [30] a fingerprint-based authentication system is provided. In [31], the researchers introduced an authentication and authorization scheme that uses face recognition and can be used for the IoT ecosystem. An iris-based authentication system for unlocking mobile IoT devices is proposed in [32].
To the best of our knowledge, only two studies have adopted voice biometrics as an authentication mechanism for the IoT ecosystem. Shin and Jun [33] employed voice recognition technology to verify authorized users for controlling and monitoring an automated home environment. The researchers proposed a voice recognition system that is divided into server and device parts. The server part handles user preregistration, user recognition, and control command analysis. The device part handles device command reception, device control, and response. The types of models and techniques employed in this research are not discussed. Likewise, the implementation of the model is not reported.
Oscar et al. [1] proposed a multimodal biometric approach for the IoT based on face and voice modalities. The researchers designed their system to scale to the limited resources of IoT technologies. For the voice recognition part of the system, the researchers extracted MFCC features from voice using the Fourier transform. The filter bank outputs are then decorrelated by applying a discrete cosine transform. This system does not rely fully on voice recognition. Although it has been implemented in a case study, its end result cannot be compared to that of a system that relies fully on voice recognition.
The overall advantages of such biometric-based schemes are that biometric traits cannot be lost, are very difficult to copy, are hard to distribute, and cannot be easily guessed. Conversely, conventional password-based authentication methods suffer from a number of drawbacks: passwords can be easily guessed, hacked, and cracked. The performance of the reviewed biometric systems is shown in Table 1.

RESEARCH GAP
Only a few studies have addressed the deployment of voice recognition systems in the IoT ecosystem for access control and user authentication. Work on building a working voice recognition system or integrating one into the IoT ecosystem is lacking in the literature. However, a sufficient number of projects have been done in the area of voice recognition in general, and some have been adapted to the mobile and cloud computing paradigms. The restricted computational, storage, and power resources of IoT devices hinder the development of sophisticated authentication systems. Hence, a novel biometric approach has to be proposed [11,34].

OUR WORK
We envision an automatic voice biometric authentication system that is suitable for managing and monitoring IoT devices remotely. As discussed previously, our model has a training or enrollment phase and a verification or authentication phase (Fig. 2). The following sections broadly discuss the different components of the model.

Enrollment Phase

Sound capture
This step captures the sound or voice of the IoT device owner for training. This is expected to be done on the smartphone where the owner uses the control apps of the IoT devices. The output of this step is a set of files converted to a suitable file format.
Electronic copy available at: https://ssrn.com/abstract=3667519

Pre-processing
In this step, the collected voice data is checked for defects. This is accomplished by decomposing the data into different frequencies at different scales, and the resulting wavelets are checked for any clipping. The identified noise wavelets are then removed to obtain noise-free data. There are two ways of removing noise from collected sound data: the threshold-based de-noising method [35,36,37] and the recursive least squares adaptive filtering method [38,39]. The first is adopted in our work for its suitability for smart devices.
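The threshold-based de-noising idea can be sketched as follows: decompose the signal into wavelet coefficients over several scales, shrink the small detail coefficients (assumed to be mostly noise) toward zero, and reconstruct. This is a minimal illustration only. The Haar wavelet, soft thresholding, the threshold value, and the number of levels are our assumptions and are not taken from [35,36,37]; the sketch also assumes the signal length is a multiple of 2 raised to the number of levels.

```python
import math
import random

def haar_forward(signal):
    """One-level Haar transform: returns (approximation, detail) coefficients."""
    s = 1 / math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal) - 1, 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal) - 1, 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Invert one Haar level by interleaving the reconstructed sample pairs."""
    s = 1 / math.sqrt(2)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * s, (a - d) * s])
    return out

def soft_threshold(coeffs, t):
    """Shrink each coefficient toward zero by t; magnitudes below t become zero."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]

def denoise(signal, threshold, levels=3):
    """Multi-level Haar decomposition, soft-threshold the details, reconstruct."""
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = haar_forward(approx)
        details.append(soft_threshold(detail, threshold))
    for detail in reversed(details):
        approx = haar_inverse(approx, detail)
    return approx

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Synthetic example: a slow sine wave stands in for voice, plus Gaussian noise.
random.seed(0)
n = 256
clean = [math.sin(2 * math.pi * 2 * i / n) for i in range(n)]
noisy = [x + random.gauss(0, 0.2) for x in clean]
denoised = denoise(noisy, threshold=0.4)
print(mse(noisy, clean), mse(denoised, clean))  # error before and after de-noising
```

With a threshold of zero, `denoise` returns the input unchanged up to rounding, which is a convenient sanity check. A real deployment would need a threshold chosen from the estimated noise level rather than the fixed value used here.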

Feature extraction
Voice features deemed important for the system are extracted in this step. The extraction and selection of such feature vectors adds to the quality of the voice recognition system. Feature traits extracted from the owner's voice are expected to differ from those of others, must be robust to noise and distortion, should be easily extractable, should be difficult to reproduce in playback attacks, and should not change with the environment or the health of the owner. As such, the most appropriate features for this model are MFCC features. MFCC is selected for its computational simplicity, which suits the resource-constrained characteristics of the IoT ecosystem, and because it mimics the human auditory system. MFCC extraction divides the voice signal into frames and then applies a Hamming window to every frame [40].
Two well-known signal analysis tools are used in existing voice recognition systems: the discrete cosine transform and the Hidden Markov Model Toolkit (HTK). In the proposed system, these tools will take care of the details of obtaining the cepstral features of each frame.
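The MFCC steps named above (framing, Hamming window, spectrum, mel filter bank, logarithm, discrete cosine transform) can be sketched in plain Python as follows. This is not the HTK implementation, and every parameter value (8 kHz sampling rate, 256-sample frames, 128-sample hop, 20 filters, 13 coefficients) is an assumption chosen for illustration. The naive DFT is used only to keep the sketch dependency-free; a real system would use an FFT.

```python
import math

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (O(N^2), acceptable for a sketch)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        spec.append((re * re + im * im) / n)
    return spec

def hz_to_mel(f):
    return 2595 * math.log10(1 + f / 700)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centre frequencies are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    mels = [top * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mels]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):   # rising edge of the triangle
            filt[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):   # falling edge of the triangle
            filt[k] = (hi - k) / (hi - mid)
        bank.append(filt)
    return bank

def dct2(values, n_out):
    """First n_out coefficients of the (unnormalized) DCT-II."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
            for k in range(n_out)]

def mfcc(signal, sample_rate=8000, frame_len=256, hop=128, n_filters=20, n_ceps=13):
    """Return one list of n_ceps cepstral coefficients per frame."""
    window = hamming(frame_len)
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [s * w for s, w in zip(signal[start:start + frame_len], window)]
        spec = power_spectrum(frame)
        energies = [sum(f * p for f, p in zip(filt, spec)) for filt in bank]
        log_e = [math.log(max(e, 1e-10)) for e in energies]  # floor avoids log(0)
        features.append(dct2(log_e, n_ceps))
    return features

# Synthetic 440 Hz tone standing in for a captured voice signal.
tone = [math.sin(2 * math.pi * 440 * i / 8000) for i in range(1024)]
feats = mfcc(tone)
print(len(feats), len(feats[0]))  # 7 13
```

Each frame yields a short fixed-length vector, which is what keeps the storage and transmission cost low on constrained devices.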

Model Training
After extraction of the MFCC features, the voice model is trained for the IoT owner. The HMM is adopted for this system because HMMs are considered very effective on phones and the system app is to be used on a smartphone. Finally, the voiceprints are stored in a database.

Verification/recognition phase
Once the user enrollment phase is complete, the system is expected to verify whether the identity claimer is the owner of the IoT device. The same steps of voice data collection and feature extraction are applied to the claimer's voice via the smartphone. The extracted MFCC features are then tested against the trained model for verification. Support vector machines (SVMs) are used in this step to train classifiers that generalize well and automatically distinguish the verification data from the enrollment data. Finally, the decision is made to either reject or accept. The authentication is rejected if the claimer's voice features fail the test against the trained model.
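The accept or reject decision at the end of this phase can be illustrated with a deliberately simplified stand-in. The sketch below does not implement the HMM or SVM proposed in this work; it replaces them with a cosine similarity between averaged feature vectors and a fixed threshold, purely to show the shape of the decision step. The feature values, the threshold of 0.9, and all function names are hypothetical.

```python
import math

def mean_vector(frames):
    """Average per-frame feature vectors into one fixed-length vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def verify(enrolled_template, claim_frames, threshold=0.9):
    """Return (accepted, score); the claim is accepted if score >= threshold."""
    score = cosine_similarity(enrolled_template, mean_vector(claim_frames))
    return score >= threshold, score

# Hypothetical 3-dimensional feature frames (real MFCC vectors would be longer).
enrolled = mean_vector([[1.0, 2.0, 3.0], [1.2, 1.8, 3.1]])
genuine_claim = [[1.1, 2.1, 2.9], [0.9, 1.9, 3.2]]
impostor_claim = [[3.0, -1.0, 0.5], [2.8, -0.8, 0.2]]

print(verify(enrolled, genuine_claim)[0])   # True
print(verify(enrolled, impostor_claim)[0])  # False
```

In the proposed system the score would instead come from testing the MFCC features against the trained model, but the final comparison against a threshold and the binary outcome have the same form.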

LIMITATIONS AND ASSUMPTIONS
One of the main limitations of this work is that the model is conceptual and not yet implemented in its intended environment. Most of the tools and algorithms proposed or promoted for use in the system have not been technically evaluated either. The authors focused on the resource-constrained nature of IoT technology and proposed different tools for that aspect. The robustness and resilience of the tools have not been thoroughly studied either. Nevertheless, all these limitations will be addressed in our upcoming research contributions.

CONCLUSIONS
To prevent unauthorized users from accessing the IoT ecosystem, behavioral biometric authentication systems are most often considered. Through voice recognition, it is believed that IoT user authentication will be more secure, accurate, and robust. Hence, in this paper we proposed a text-dependent voice recognition system for the IoT ecosystem. The system consists of two phases: the enrollment phase, where the user enrolls his or her voice, and the verification or authentication phase, where the identity claimer utters the enrolled phrase and the utterance is compared with the enrolled one. In the future, we plan to develop and test the system for its security and performance in comparison with other biometric schemes proposed for the same area. Furthermore, by combining various techniques that have been reviewed in this paper, we will optimize the usability of voice feature extraction and recognition. We will also consider running them in the cloud to meet computational requirements.