Exploring Artificial Intelligence methods for recognizing human activities in real time by exploiting inertial sensors

The aim of this work is to present two different algorithmic pipelines for real-time human activity recognition (HAR) exploiting inertial measurement unit (IMU) sensors. Various learning classifiers have been developed and tested across different datasets. The experimental results provide a comparative performance analysis based on accuracy and on the latency of fine-tuning, training and prediction. The overall accuracy of the proposed pipeline reaches 66% on the publicly available dataset and 90% on the in-house one.


I. INTRODUCTION
Over the last decade, increasing interest in human activity recognition (HAR) has emerged. Several methods have been leveraged to tackle this challenging classification problem. It is a crucial yet open issue that needs to be addressed to support eldercare and healthcare. Due to technological advances, HAR may be performed by utilizing the sensory ecosystem of wearable devices and smartphones, captured images or a video data stream [1], [2].
HAR aims to predict different activities based on sensor data derived from human motion. It is a time series classification problem that maps human movements or actions to the associated activities. Movements and actions are performed during walking, jogging, talking and sitting, but more specific movements can occur while cooking or handcrafting [3]. Sensors translate these movements and actions into numerical time series data. The mapping can be established either between features extracted from the sensor data and activities, or between raw sensor data and activities. The former is employed by machine learning (ML) methods, while the latter is used by deep learning (DL) ones.
This work provides a comparative analysis based on time series data acquired from Inertial Measurement Unit (IMU) sensors. From the literature, Zhan et al. [2] proposed an activity recognition system that collects time series data and video from an accelerometer and a first-person camera, respectively. Features were extracted from both the sensor signal and the frame data in order to train two ML classifiers. A local and a structured model were implemented to provide predictions on human activities. The first depends on the raw sensor data, while the other considers temporal dependencies to capture contextual relations. Possas et al. [4] implemented a Recurrent Neural Network (RNN) architecture, namely the Long Short-Term Memory (LSTM), to classify human activities based on motion sensor data, and the Long-term Recurrent Convolutional Network (LRCN) on vision data. In an attempt to balance these two different techniques, reinforcement learning policies were exploited.
In the work presented by Song et al. [5], the Fisher Kernel framework was applied to fuse features from sensors and video recordings. Later, Song et al. [6] developed a multi-stream Convolutional Neural Network (CNN) to capture characteristics from videos. In parallel, a multi-stream LSTM was integrated to learn features from the accelerometer and gyroscope. The two networks were then fused along with various pooling techniques. Arabaci et al. [7] focused on an adaptive framework that weighted visual, audio and sensor features according to their ability to distinguish human activities. They compared Kernel SVM, MKBoost and SimpleMKL algorithms, with the polynomial SVM yielding the best performance.
In this paper, a comparison between conventional ML and a DL technique is presented. Not only the performance in widely acclaimed accuracy metrics is taken into account, but also the parameter tuning, training and prediction (inference) times. Therefore, both aspects should be weighed according to the needs of the implementation. Finally, a publicly available dataset and a newly collected in-house one have been utilized to evaluate the classifiers.

A. Dataset
Three datasets have been exploited in order to develop the proposed scheme: DataEgo, DataEgo* and an in-house dataset.
DataEgo is an egocentric dataset composed of visual and inertial data. It consists of 20 activities which can be further split into 5 groups: mobility, office work, kitchen related, exercising and general. The data was captured with the Vuzix M300 Smart Glasses. Each session lasts around 5 minutes and mimics a real-world scenario corresponding to a sequence of 4 to 6 different activities. By analyzing the sensor information, some variability can be noted for a given activity. The data includes activity annotation per sample. Each sample consists of acceleration and gyroscope information, with each sensor providing values from its three-axis coordinate system. The sampling rate of the accelerometer and gyroscope is 15 Hz. This dataset is highly imbalanced: the minority class contains 796 observations, while the majority has more than 31,000.
DataEgo* is a subset of DataEgo which contains the classes related to high physical motion, such as walking, cycling, doing push-ups, running and doing sit-ups, as well as some lower physical motion activities such as eating and reading. DataEgo* was generated in order to overcome the high class imbalance of DataEgo.
To tackle the limited amount of public datasets and further assess the generalization of the developed models, a newly collected dataset was needed. After approval from the ethical committee, volunteers at the IMBB-FORTH, Ioannina, Greece, were asked to take part in a data collection study and perform a set of activities while wearing a portable sensor ecosystem. The ecosystem comprises a pair of glasses with a mounted IMU sensor. The sensor utilized was the MetaMotionR, from which the accelerometer and gyroscope data were collected. The dataset consists of 10 participants. Each participant performed a sequence of activities, and some performed a second one. The activity sequence mimics a real-world scenario of activities of daily living (ADL). The performed activities are sitting, standing, walking, walking upstairs and walking downstairs. The participants were asked to maintain their personal style and speed when performing an activity and to act as naturally as possible. The time spent per activity was predetermined to provide a balanced dataset. The sampling rate of the accelerometer and gyroscope data is 50 Hz. The accelerometer and gyroscope ranges are ±16 g and ±2000 dps, respectively. The total number of collected samples is 649,301, which corresponds to 3.6 hours of annotated data. Each sample contains six values, representing the accelerometer and gyroscope information, and the annotated class. Annotation was based on a timestamp application that captured activity changes and on the graph representation of the inertial data.

B. Hardware
To develop the in-house dataset, a 3D-printed pair of glasses was utilized. A 9-axis inertial measurement unit sensor (MetaMotionR) was mounted on top of it. The processing unit (LattePanda A864s) and the sensor communicate via Bluetooth Low Energy. Additionally, a portable 30,000 mAh power bank was attached as the power supply. As a result, the glasses became a portable wearable system, which is shown in Fig. 1.

C. Methodology
For the classification of ADL, conventional ML and DL techniques have been developed. The classifiers utilized are Random Forests (RFs), Support Vector Machines (SVMs), k-Nearest Neighbors (k-NN) and an artificial RNN architecture, the Long Short-Term Memory (LSTM). SVMs exploit the ability of a dataset to be separated by a hyperplane. k-NN predicts the class based on the k nearest neighbors according to a distance function. RFs combine several decision trees in order to provide the most common inference. Lastly, LSTM exploits both feedforward and feedback connections in its neural network architecture, along with its ability to learn and forget.
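As a rough illustration of the conventional ML side, the three classifiers can be instantiated with scikit-learn as below. The hyperparameter values shown are illustrative placeholders, not the tuned ones reported in the paper:

```python
# Sketch: the three conventional ML classifiers used for ADL classification.
# Hyperparameters are illustrative defaults, not the paper's tuned values.
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

classifiers = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "RBF-SVM": SVC(kernel="rbf", C=1.0, gamma="scale"),
    "Poly-SVM": SVC(kernel="poly", degree=3),
    "k-NN": KNeighborsClassifier(n_neighbors=5, metric="euclidean"),
}

# All classifiers expose the same interface:
#   clf.fit(X_train, y_train); y_pred = clf.predict(X_test)
```

Sharing the fit/predict interface makes it straightforward to evaluate every classifier under the same cross-validation protocol.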
The developed ML scheme employs a sliding-window method to augment the data. The window size is 200 data points with an overlap ratio of 80%, which corresponds to 4 seconds. For each window, a feature vector is extracted. It includes features calculated not only per axis but also from combinations of axes. Mean, kurtosis, energy and signal magnitude area are some of the features calculated. In total, 208 features were extracted. To support the performance of the algorithms, min-max normalization is applied, which scales the feature values into [0, 1]. However, some features do not hold the same level of importance as others. For that reason, analysis of variance (ANOVA) and chi-squared feature ranking algorithms have been employed. To further refine the models, grid search was utilized to find the best parameter combination. For example, in RBF-SVM, C and gamma were determined within [10^-7, 10^7].
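The windowing and normalization steps above can be sketched as follows. The window length, step and the named features follow the text, but this is a minimal subset of the 208 features, and the helper names are ours:

```python
# Sliding-window feature extraction: 200-sample windows with 80% overlap,
# a small subset of the per-axis features (mean, kurtosis, energy) plus
# signal magnitude area, followed by min-max scaling into [0, 1].
import numpy as np
from scipy.stats import kurtosis

WIN, STEP = 200, 40  # 80% overlap -> step = 200 * (1 - 0.8) = 40 samples

def windows(signal, win=WIN, step=STEP):
    """Yield overlapping windows over a (n_samples, n_axes) array."""
    for start in range(0, len(signal) - win + 1, step):
        yield signal[start:start + win]

def features(win):
    """Per-axis mean, kurtosis and energy, plus signal magnitude area."""
    feats = np.concatenate([
        win.mean(axis=0),
        kurtosis(win, axis=0),
        (win ** 2).sum(axis=0),          # energy per axis
    ])
    sma = np.abs(win).sum() / len(win)   # signal magnitude area over all axes
    return np.append(feats, sma)

def feature_matrix(signal):
    """Feature vectors for all windows, min-max scaled per feature."""
    X = np.array([features(w) for w in windows(signal)])
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)
```

For a six-axis signal (3 accelerometer plus 3 gyroscope channels) this toy version yields 19 features per window; the full pipeline extends the same pattern to 208 features, including cross-axis combinations.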
The LSTM architecture is able to learn and recall over long, multiple parallel sequences of input data provided by the axes of the accelerometer and gyroscope. The model learns to extract features from sequences of observations and then maps the internal features to the different activity types. The benefit of using LSTM for time series classification is its ability to learn directly from raw data and, by design, retain information over long periods of time.

The classification results are summarized in Table I. Due to its high performance, RBF-SVM is considered the best choice among the ML algorithms. For the in-house dataset, RBF-SVM reached 90.82% on SCV and 90.72% on LOSO. LSTM achieved 84.83% on SCV and 85.31% on LOSO, which further supports the observation that more data are needed for such a model. The classifier results from the median accuracy of the folds are depicted in Fig. 3.
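The learn-and-forget behavior of the LSTM cell can be made concrete with a single forward step written out in NumPy. This is a didactic sketch of the standard LSTM equations, not the trained network from the paper; the weights below are random placeholders:

```python
# One LSTM time step in NumPy, showing how the gates let the cell
# retain (forget gate) or admit (input gate) information over time.
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """W, U, b stack the input, forget, output and candidate gates
    along the first axis (4 * hidden units)."""
    hidden = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[:hidden])                # input gate: admit new info
    f = sigmoid(z[hidden:2 * hidden])      # forget gate: decay old memory
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate
    g = np.tanh(z[3 * hidden:])            # candidate cell update
    c = f * c_prev + i * g                 # cell state: long-term memory
    h = o * np.tanh(c)                     # hidden state: short-term output
    return h, c
```

Because the cell state `c` is updated additively rather than overwritten, gradients and information can survive many time steps, which is what lets the network learn directly from long raw sensor sequences.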
The most notable error is the misclassification between "sitting" and "standing". During these activities, no distinctive movements occur; a motion IMU sensor placed on the human head therefore cannot perceive them, and the corresponding numerical data for the two activities end up similar. Another misinterpretation occurs between "walking" and "upstairs", which might be due to the type of stairs used during the data collection protocol: the stairs had a walking section on each floor.
The models are trained and tested on the datasets described in Section II.A. Ten-fold stratified cross-validation (SCV) is utilized as the evaluation method for all datasets, based on the mean accuracy of the folded classifiers. Additionally, leave-one-sequence-out validation (LOSO) is implemented for the in-house dataset. It was also observed that precision, recall and F1 scores all lie within 0-3% of the accuracy score in all experiments. Parameter tuning through grid search is used only for RBF-SVM, while the parameters for LSTM were selected based on one-parameter-at-a-time tests.
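A minimal sketch of the two evaluation protocols, assuming scikit-learn's splitters; `sequence_ids`, an array marking which recording sequence each window came from, is our assumed bookkeeping variable:

```python
# Two evaluation protocols: 10-fold stratified cross-validation (SCV)
# and leave-one-sequence-out (LOSO), where each recording sequence is
# held out in turn.
import numpy as np
from sklearn.model_selection import StratifiedKFold, LeaveOneGroupOut

def scv_accuracies(clf, X, y, n_splits=10):
    """Per-fold accuracies under stratified k-fold cross-validation."""
    accs = []
    splitter = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for tr, te in splitter.split(X, y):
        clf.fit(X[tr], y[tr])
        accs.append((clf.predict(X[te]) == y[te]).mean())
    return np.array(accs)

def loso_accuracies(clf, X, y, sequence_ids):
    """Per-sequence accuracies, holding out one recording at a time."""
    accs = []
    for tr, te in LeaveOneGroupOut().split(X, y, groups=sequence_ids):
        clf.fit(X[tr], y[tr])
        accs.append((clf.predict(X[te]) == y[te]).mean())
    return np.array(accs)
```

LOSO is the stricter of the two for wearable data, since no windows from the held-out recording ever appear in the training folds.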
For the DataEgo dataset, the accuracy of the SVM classifier reached 33.32%, while SVM with a polynomial kernel of third degree (Poly-SVM), RF and k-NN achieved 40.34%, 44.26% and 43.25%, respectively. SVM with a Radial Basis Function kernel (RBF-SVM) achieved 66.75%, reporting the highest performance. When the ANOVA and chi-squared feature selection techniques were applied, the above-mentioned classifiers provided a lower accuracy score by a margin of 4-7% in all cases. Despite the lower scores, the models were simpler and converged faster. It should also be noted that the parameter tuning time for RBF-SVM was larger compared to the rest: the grid search took 18 hours to complete, while the tuning of the others was performed in a couple of minutes. LSTM achieved 45.75%, which could be further increased if more data were available. Most misclassifications concerned low-motion activities, such as reading or working on a PC, that do not involve any distinctive movement for a head-mounted motion sensor. As a result, accuracy increased for DataEgo*, which includes mostly high-motion activities. To the same extent, accuracy increased significantly for the in-house dataset, reaching up to 90%. This comparison is depicted in Table III.

Tuning demands high computational power; thus, the experimentation environment consisted of an Intel i9 CPU with an Nvidia GeForce RTX 2060 GPU. Training is a far less time-consuming task, and, in contrast with LSTM, which requires 4.5 seconds, the prediction time of SVM is nearly instantaneous. Overall, both approaches achieved good accuracy levels. Despite the superiority of RBF-SVM in training and prediction time, its parameter tuning prohibits online training. LSTM provides a higher level of automation but slower inference (Table II).

V. CONCLUSION
In this work, two algorithmic pipelines have been designed and developed for HAR in real time. The proposed activity recognition models utilize sensor data streams, derived from the accelerometer and gyroscope, to differentiate high-motion ADL, and are validated using stratified cross-validation and leave-one-sequence-out validation. The pipelines are part of the personalized profile of the See Far solution. More specifically, they are implemented in the See Far augmented reality smart glasses, which assist the visually impaired population with services and recommendations in real time. The developed models reach up to 90% accuracy on the in-house dataset while yielding good results on the publicly available datasets. As future work, a wider variety of activities will be explored without influencing the overall performance.