Federated Learning for Network Intrusion Detection in Ambient Assisted Living Environments

Given the Internet of Things' rapid expansion and widespread adoption, establishing secure interaction between devices without degrading their performance is of great concern. Machine learning techniques have been shown to improve the detection of anomalous behavior in these types of networks, but their centralized implementation leads to poor performance and compromised privacy. To better address these shortcomings, federated learning (FL) has been introduced. FL enables devices to collaboratively train and evaluate a shared model while keeping personal data on site (e.g., smart homes, intensive care units, hospitals, and so on), thus minimizing the possibility of an attack and fostering real-time distribution of models and learning. This article investigates the performance of FL in comparison to deep learning (DL) with respect to network intrusion detection in ambient assisted living environments. The results demonstrate that FL performs comparably to DL while achieving improved data privacy and security.

The increase in sensors, the cloud, and big data analytics, as well as the need to automate and ease processes, has contributed to the fast-paced development of Internet of Things (IoT) networks. The IoT facilitates many diverse functions that are provided for the household, industry, infrastructure, and transportation by a massive number of unique devices from diverse manufacturers. It has also received attention from the medical community as a promising way for early diagnosis, prevention, treatment, and administration of drugs while patients remain in the comfort of their own homes. 1 Nevertheless, diverse sensors and devices create a number of issues, primarily related to security and privacy. As health and personal information become remotely obtainable, the risk of violating the privacy of patients' data and their electronic health records increases substantially. Also, these devices can be reconfigured or turned off, 2 which may contribute to severe consequences for the patient's health. As most network intrusion detection systems (NIDSs) cannot fully provide the necessary protection for IoT networks because of the ever-increasing pace of new attack types and methods, 3 novel ways to detect potential anomalies must be sought.
Recently, machine learning (ML) algorithms have been applied to NIDSs and have shown good results in detecting such anomalies. Because IoT devices are limited in storage and power and cannot run complex artificial neural networks (NNs), a central server is needed to process the data. This centralized approach introduces several limitations, most notably sharing patients' data and compromising their privacy. Federated learning (FL) has shown great potential as a distributed alternative that can speed up the detection and handling of network anomalies while keeping patient data local and maintaining privacy. However, FL-based solutions lack accuracy, robustness, and ubiquity compared to their centralized learning counterparts. Moreover, works that focus on FL-based network intrusion detection in ambient assisted living (AAL) environments are not prevalent in the literature.
This article analyzes the aspect of anomaly detection in AAL environments regarding network intrusion by utilizing FL. It also performs parameter characterization and attack grouping to improve the model's accuracy and robustness. The structure of the article is as follows. The "Related Work" section discusses the state of the art in FL and deep learning (DL) approaches for anomaly detection in IoT networks. The "Dataset and Methodology" section presents the used dataset and experimental and evaluation setup as well as the performance metrics of interest. The "Results and Discussion" section gives an overview of the experiments and obtained results and compares the FL and DL models. It also discusses how grouping attacks and parameter characterization can improve overall accuracy of the models. The "Conclusion" section summarizes the article and presents possible future directions and improvements regarding the given problem.

RELATED WORK
Conventional signature-based techniques focus on detecting already-known, established attack patterns, whereas anomaly-based network intrusion detection techniques can detect both known and unknown attacks. The tradeoff is that anomaly-based detection demands more computational power and achieves lower overall accuracy. Recent research has shown that leveraging different ML and DL algorithms for network anomaly detection is highly beneficial for building more adaptable and accurate IDSs.
Saheed et al. 4 suggest a supervised ML algorithm-based IDS for IoT networks. After performing normalization and dimensionality reduction on the UNSW-NB15 dataset, six different ML models were trained, all achieving a detection accuracy of 99%. Gao and Thamilarasu 5 also evaluated the possibility of using different ML algorithms to detect security attacks on medical devices. Their results show that decision tree-based algorithms achieve the highest detection accuracy (90%). Intelligent and dynamic ransomware spread detection in medical cyber-physical systems was the topic of interest in the work of Fernández Maimó et al. 6 In this research, two different ML models proved successful in detecting and classifying these types of attacks, with naive Bayes obtaining an accuracy of 99.99%. Otoum et al. 7 and Kathamuthu et al. 8 present DL-based solutions that tackle IDSs for IoT networks. The former uses a spider monkey optimization algorithm and a stacked deep polynomial network, achieving an overall accuracy of 99.02%, while the latter uses a deep Q-learning-based NN with a privacy-preservation method, achieving an accuracy of 93.74%.
However, these algorithms also have their drawbacks, mainly because of the centralized approach. Keeping the entire dataset on one server can be computationally expensive and time consuming. In addition, data signatures can be very large, making it difficult to collect the data in an efficient and real-time manner; in most network intrusion detection scenarios, swift detection is of utmost importance. Moreover, transferring data from IoT nodes to the server and vice versa can compromise security and privacy. As such, FL, 9 which enables distributed training of models, has emerged as a potential and adaptable strategy that can address these drawbacks.
The authors in Huong et al. 10 designed LocKedge, an FL-based IDS for IoT networks that detects anomalies at the edge layer. Nevertheless, when evaluated on the BoT-IoT dataset, the FL model achieved lower performance than the DL model. Rahman et al. 11 tried to keep data privacy intact while proposing a new FL-based system for IoT intrusion detection. However, evaluation on the NSL-KDD dataset shows an oscillating accuracy of 83.09%, which is insufficient for real-time IDS purposes. Li et al. 12 utilize homomorphic encryption as well as a convolutional NN to develop a distributed IDS based on FL. The model is tailored to analyze and block only distributed denial-of-service (DDoS) traffic on satellite-terrestrial networks. Mothukuri et al. 13 also develop an FL model, based on the gated recurrent unit concept, to deliver real-time anomaly detection of DDoS attacks in IoT networks. Both models in Li et al. 12 and Mothukuri et al. 13 exhibit high accuracies but are tailored to only a specific type of attack and lack ubiquitous applicability.
The related FL works focus primarily on a limited number of attacks, such as DDoS, which significantly limits their applicability to real-world scenarios. This is highly detrimental for classification purposes, as anomaly detection systems require diverse and updated data to foster high accuracy and robustness. This work presents a novel FL solution based on anomaly detection that addresses the weaknesses of state-of-the-art works. It achieves satisfactory performance for a wide range of IoT-based attacks, comparable to the results achieved by DL. Also, to the best of the authors' knowledge, this work is the first to focus on anomaly detection-based NIDSs in AAL. Additionally, the article presents a novel idea for grouping attacks based on their similarity, which can significantly improve the performance of FL-based network intrusion detection (above 98%) while preserving the detection capabilities for different types of attacks.

DATASET AND METHODOLOGY
This section provides insight related to the dataset of interest. It also gives a thorough explanation of the system architecture as well as the design of the DL model. Moreover, it introduces the specific performance metrics of interest.

Used Dataset
For the purpose of this research, we used a publicly available dataset called IoTID20. 14 The testbed is a typical AAL environment, which includes a camera, a smartphone, a home speaker (artificial intelligence speaker), and several computers. By simulating network traffic and monitoring it at different time periods, researchers were able to create the dataset and then extract 83 network features from the pcap files using Wireshark. 15 The dataset contains one normal network traffic class and nine different types of anomalies.

System Architecture and NN Model
The proposed FL system architecture is given in Figure 1(a). It consists of two main components: 1) the FL clients (AAL environments) and 2) the central server. The FL clients train a local model on site using their local data. After a specific number of local epochs, the FL clients send the trained models (i.e., the model weights) to the central server. The central server aggregates the received models by averaging each model's weights across all clients, known as the federated averaging (FedAvg) strategy. The updated model is then sent back to the clients, completing one FL round. The process is repeated for a number of rounds until the required performance or model convergence is achieved. As such, the FL approach neither shares nor exposes any information from the AAL dataset and environment, fostering a high level of privacy.
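The FedAvg aggregation step described above can be sketched as a plain elementwise average of the clients' layer weights (a minimal illustration; the variable names and toy shapes are ours, not from the article):

```python
import numpy as np

def fedavg(client_weights):
    """Average each layer's weights elementwise across all clients (FedAvg)."""
    # client_weights: one entry per client; each entry is a list of per-layer arrays
    num_clients = len(client_weights)
    num_layers = len(client_weights[0])
    return [
        sum(cw[layer] for cw in client_weights) / num_clients
        for layer in range(num_layers)
    ]

# Toy example: three clients, each holding two layers of weights
clients = [[np.ones((2, 2)) * c, np.ones(2) * c] for c in (1.0, 2.0, 3.0)]
global_weights = fedavg(clients)
# Each averaged layer now holds the mean value 2.0
```

In a full FL round, the server would send `global_weights` back to the clients, which load them as the starting point for the next five local epochs.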
It is assumed that all clients have the same NN model [ Figure 1(b)] and use the same number of local epochs. The model consists of a feedforward NN (FFNN) with two fully connected layers with 64 and 32 neurons, respectively. Both layers utilize the rectified linear unit activation function. The two layers are followed by a dropout layer with a 0.2 rate. The output layer is a softmax layer consisting of 10 neurons, which represent the classes of attacks in the dataset. In the experiments where attack grouping is performed, the number of neurons in the output layer is reduced to seven.
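The FFNN described above can be sketched as a plain forward pass (weights are randomly initialized here purely for illustration, and the 83-feature input width is our assumption based on the feature extraction described earlier; the article's actual trained model may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=-1, keepdims=True)

# Layer sizes from the article: input -> 64 -> 32 -> 10 output classes
n_features, n_classes = 83, 10
W1, b1 = rng.normal(size=(n_features, 64)) * 0.05, np.zeros(64)
W2, b2 = rng.normal(size=(64, 32)) * 0.05, np.zeros(32)
W3, b3 = rng.normal(size=(32, n_classes)) * 0.05, np.zeros(n_classes)

def forward(x, train=False):
    h = relu(x @ W1 + b1)              # fully connected, 64 neurons, ReLU
    h = relu(h @ W2 + b2)              # fully connected, 32 neurons, ReLU
    if train:                          # dropout (rate 0.2) is active only in training
        mask = rng.random(h.shape) >= 0.2
        h = h * mask / 0.8
    return softmax(h @ W3 + b3)        # softmax over the 10 traffic classes

probs = forward(rng.normal(size=(4, n_features)))
# Each row of probs is a probability distribution over the 10 classes
```

For the attack-grouping experiments, `n_classes` would be reduced to seven, shrinking only the output layer.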

Evaluation Setup and Metrics of Interest
The dataset consists of one normal traffic data class and nine different types of attacks, resulting in a total of 10 classes. The data are further split into a training and test subset. The training subset contains 80% of the data, while the remaining 20% forms the test subset. The evaluation does not incorporate any tuning of the NN hyperparameters, so no validation dataset is necessary.
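The 80/20 split can be sketched as a simple shuffled index split (illustrative only; the article does not specify the splitting procedure beyond the ratio):

```python
import numpy as np

rng = np.random.default_rng(7)

def train_test_split_indices(n, test_frac=0.2):
    """Shuffle record indices and carve off the last test_frac share as the test set."""
    idx = rng.permutation(n)
    cut = int(n * (1 - test_frac))
    return idx[:cut], idx[cut:]

train_idx, test_idx = train_test_split_indices(1000)
# 800 training records, 200 test records, with no overlap between the two
```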
For performance comparison, we use a DL baseline. From an information-theoretical perspective, no FL model can achieve higher accuracy than a centralized DL model when the FL uses the same underlying NN. The reason lies in how the dataset is handled. Specifically, the DL model is trained on the whole dataset, while the FL trains local models on portions of the dataset and then aggregates them into a global model, hence losing valuable information through the partitioning and averaging. The DL model is based on the same FFNN from Figure 1(b). In the DL experiments, the training dataset is used for training, and the test dataset is used for evaluation. We used a maximum of 35 epochs to train the DL model.
For the FL, the experiments are executed with a different dataset distribution because of the nature of the FL itself: no data leave the device. Therefore, the complete data are split among 50 clients (in our case, each client refers to an AAL environment), where every client holds a different portion of the test and training datasets [see Figure 1(a)]. The training subset of each client is used to train the local models. The global model is evaluated (in each round) using the combined test subsets from all the clients.
In each round of the FL, a subset of random clients is selected for local training, controlled by the fraction fit parameter. Each client uses only five epochs for training its local DL model. As mentioned previously, the FedAvg optimizer, a simple yet effective solution, is used to aggregate the local DL models into the global FL model. After each round, the aggregated global model weights are distributed to the clients and used as the starting point for local model training in the next round. In the experiments, we use a maximum of 35 rounds for training the global FL model.
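The client partitioning and per-round client sampling described above can be sketched as follows (the 50-client setup and the fraction fit semantics come from the article; the even, shuffled partitioning scheme is our assumption, as the article does not detail how records are assigned to clients):

```python
import numpy as np

rng = np.random.default_rng(42)

def partition(num_records, num_clients=50):
    """Shuffle record indices and split them roughly evenly across clients."""
    idx = rng.permutation(num_records)
    return np.array_split(idx, num_clients)

def sample_clients(num_clients, fraction_fit):
    """Pick the random subset of clients that trains in this FL round."""
    k = max(1, round(num_clients * fraction_fit))
    return rng.choice(num_clients, size=k, replace=False)

shards = partition(10_000)               # each shard stays on its own client
round_clients = sample_clients(50, 0.2)  # fraction fit 0.2 -> 10 of 50 clients
```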
Careful analysis of the dataset shows that many attacks are highly related according to their type and inherent family features. For example, there are several distinct Mirai attacks that exhibit very similar network intrusion behavior. As the primary goal of NIDSs is to detect attacks accurately and in a timely manner, it can be highly beneficial if the system groups similar attacks to improve its detection capabilities. Because the grouping is done within the same family of attacks, the system can still identify the type of attack, albeit at a coarser granularity.
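In practice, the grouping amounts to a label remapping applied before training; a minimal sketch (the label strings here are hypothetical placeholders, not IoTID20's actual class names, which the article does not list):

```python
# Hypothetical label names; the article merges four similar Mirai
# subclasses into a single coarser "Mirai" family class.
GROUPING = {
    "Mirai_A": "Mirai",
    "Mirai_B": "Mirai",
    "Mirai_C": "Mirai",
    "Mirai_D": "Mirai",
}

def group_labels(labels):
    """Map fine-grained attack labels to coarser family labels; others pass through."""
    return [GROUPING.get(lbl, lbl) for lbl in labels]

labels = ["Normal", "Mirai_A", "Scan", "Mirai_C", "DoS"]
grouped = group_labels(labels)
# The 10 original classes collapse to 7 once the four Mirai variants share one label
```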
The performance metric of interest in this study is the models' accuracy as a function of the number of epochs/rounds required to finish training. Specifically, the evaluation focuses on the FL's accuracy as a function of the number of FL rounds as well as the fraction fit parameter (i.e., the percentage of FL clients used in each round).

RESULTS AND DISCUSSION
In this study, we conducted three experiments to investigate the capabilities and limitations of the FL model for anomaly detection in AAL environments. The first experiment focuses on training, testing, and comparing the FL model with the baseline DL model. In this experiment, we focus on classification performances for all 10 available classes in the dataset. The second experiment serves to examine the benefits of attack grouping with the aim to improve detection performances of the FL model. The final experiment concentrates on the parameter characterization of the FL models.
In the head-to-head comparisons between the DL and FL models concerning the convergence, we associate training epochs for the DL model with training rounds for the FL model. Even though this may appear to be unfair because the FL additionally uses five epochs for training of the local models, the local models are trained on significantly smaller dataset portions (1/50).

Model Accuracy
The first experiment compares the accuracy and convergence performance of the FL model with the baseline DL model. In this experiment, the fraction fit parameter is fixed to 1, meaning that all the clients participate in each round of the FL training process. Figure 2(a) depicts the models' accuracy as a function of the number of epochs (for DL) and rounds (for FL). It can be seen that the FL model achieves a slightly lower accuracy (84%) than the DL model (86%) on the test dataset. Furthermore, after the 20th round, the FL model appears to converge. In contrast, the DL model still tends to improve its accuracy as the number of epochs increases, but it exhibits slow convergence and higher performance variability (model instability). This result clearly shows the benefits of using FL for anomaly detection in AAL scenarios. At the price of a slight decrease in classification performance, one can preserve user privacy in these scenarios, as the FL model does not share or expose the AAL dataset, only the model's weights. Furthermore, the FL provides better stability (mostly due to the FFNN's weight averaging) and faster convergence. Figure 2(c) and (e) shows the confusion matrices for the DL and the FL models, respectively. The results show that most of the misclassifications of both models occur among classes 1, 4, 5, and 6, which correspond to similar types of attacks, i.e., the Mirai attacks. Intuitively, this indicates that grouping the Mirai types of attacks into one class would improve the anomaly classification performance.

Attack Grouping
The second experiment groups the four Mirai classes into one class, leaving the dataset with seven distinct classes. This also decreases the number of neurons in the final layer of the FFNN. Similar to the first experiment, the fraction fit parameter for the FL model is set to 1. Figure 2(b) shows the head-to-head comparison of the DL and FL models in terms of accuracy and convergence when applying the Mirai attack grouping, while Figure 2(d) and (f) shows the DL and FL confusion matrices, respectively, for this case.
The results clearly show that attack grouping substantially improves the accuracy of both the DL and FL models, with the FL model benefiting more from the grouping. Specifically, the performance difference between the DL and FL models is smaller compared to the case without attack grouping. Furthermore, the FL model continues to improve its performance even after the 20th round. The reason behind this behavior can be found by reconsidering the confusion matrices in Figure 2(c) and (e) (without the Mirai grouping). It is evident that the FL model is more affected by misclassifications among the Mirai types of attacks. Specifically, due to the similarity between these attacks and the substantially smaller local datasets (1/50), the local models fail to learn the differences between the Mirai classes. Therefore, grouping the multiple Mirai classes into one results in a more substantial performance gain for the FL model.

FL Parameter Characterization
The final experiment focuses on parameter characterization of the two FL models, i.e., the FL model using all classes and the FL model using the Mirai grouping. Besides the number of rounds, this experiment also investigates the fraction fit parameter and its contribution to the accuracy and convergence of the models. The fraction fit is an important parameter in FL, as it controls client selection and the stochasticity of the learning process. Randomly choosing a subset of clients in each training round reduces the computation and communication overhead and can reduce overfitting in the resulting global FL model. The results are obtained for three fraction fit values: 0.2, 0.6, and 1. In particular, a fraction fit of 0.2 means that in each round of the FL, only 20% of randomly chosen clients participate in the FL training (i.e., 10 randomly selected clients out of 50 in our case). Figure 3(a) and (b) shows the convergence and accuracy results for both FL models with respect to the fraction fit parameter. Similar behavior can be observed for both FL models, i.e., the fraction fit does not significantly impact the accuracy and convergence for the chosen problem of anomaly classification. Only minor differences (<0.3%) can be seen among the different choices of fraction fit; however, there are a few important considerations to note. A smaller fraction fit provides a slightly better convergence rate for a small number of FL rounds (<10). A fraction fit of 0.6 provides the best accuracy when the number of FL training rounds exceeds 15; e.g., the FL model with Mirai grouping achieves an accuracy of 98.3% at round 25. A fraction fit of 1 seems to exhibit some minor model overfitting. In conclusion, the results clearly show that there is an optimal fraction fit in the trade-off among accuracy, convergence, and FL overhead.

CONCLUSION
This article discussed the applicability of FL to network intrusion detection in AAL environments. The article also introduced the concept of attack grouping to improve the overall detection performance of FL models. The analysis showed that FL achieves performance very similar to that of its DL counterpart without sharing any personal or patient data. Additionally, the results show that attack grouping significantly improves the detection accuracy of both DL and FL, with FL benefiting more from the grouping process. Further work will include implementing security mechanisms (e.g., differential privacy) into the FL models and evaluating the trade-off between privacy and accuracy. New FL optimizers can also be tested and evaluated on the same and new AAL datasets. Another potential avenue for future exploration is the system-level specifics of FL with respect to bandwidth efficiency, noisy data, and computational overhead.