Human Action Recognition Using Deep Learning Methods on Limited Sensory Data

In recent years, owing to the widespread use of various sensors, action recognition has become increasingly popular in fields such as person surveillance and human-robot interaction. In this study, we aimed to develop an action recognition system using only limited accelerometer and gyroscope data. Several deep learning methods, such as the Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM), together with classical machine learning algorithms and their combinations, were implemented and a performance analysis was carried out. Data balancing and data augmentation methods were applied, and accuracy rates increased noticeably. We achieved a new state-of-the-art result on the UCI HAR dataset, a 97.4% accuracy rate, using a 3-layer LSTM model. We also applied the same model to our collected dataset (ETEXWELD) and obtained a 99.0% accuracy rate. Moreover, the performance analysis is based not only on accuracy but also on the precision, recall and f1-score metrics. Additionally, a real-time application was developed using the 3-layer LSTM network to evaluate how robustly the best model classifies activities.


I. INTRODUCTION
HUMAN action recognition is still a challenging task because actions can be carried out by different persons in various places. Moreover, the duration can vary both within the same type of action and across different types. Since actions consist of sequential data, extracting spatio-temporal information from the data becomes a challenge [1].
Wearable devices have started to change our lives, just as smartphones have changed our habits and behavior within ten years. Wearable devices have appeared in many areas, e.g. health care, fitness tracking and entertainment, and they are expected to spread rapidly in the near future. For instance, [2] proposes measuring sweat rate with wearable sensors. [3] uses wearable sensors to monitor kinematic changes during physically demanding sports activities. [4] proposes tracking the progress of children under rehabilitation and assessing the activities that accelerate the treatment process. [5] reviews wearable sensor usage for monitoring the health of users in the mining industry, where wearable devices are already in use to detect hazardous gas, to monitor the brain activity of equipment drivers and to monitor fatigue. [6] presents a system that uses wearable sensors to detect obstacles for people who have lost their sight. [7] emphasizes the future usage of wearable devices in tourism.
In recent years, deep learning techniques outperformed hand-crafted feature extraction methods [8] and numerous deep architectures were developed particularly for wearable devices. They have great capability to extract meaningful and high-level features from data [9].
Because accelerometer and gyroscope sensors are wearable, human actions can be recognized via these sensors [10]. These sensors can be integrated into textile products, and they benefit from improvements in both textile and electronics technology [11]. Accelerometer-based systems are used for different purposes [12].
This study focuses on activity recognition based on state-of-the-art deep learning methods using wearable sensors. The products were designed and implemented with an accelerometer, a gyroscope and a wireless radio-frequency module called ZigBee. Moreover, they are equipped with temperature, heart rate and humidity sensors.
In this work, data were collected with these sensors at a 5 Hz sampling frequency while subjects performed 7 different activities. The dataset was generated in an experimental study carried out under a European Union project (ETEXWELD H2020: RISE. 644268), and it is therefore called the ETEXWELD dataset. In the literature, several works have used accelerometers and gyroscopes for activity recognition and authorization. UCI HAR is the first large dataset of daily activities measured via the inertial sensors of a mobile phone [13]. It includes 6 different activities: "walking", "sitting", "standing", "lying", "up-stairs" and "down-stairs". These are similar to the activities used in this project, to which we have also added a "falling" activity. Therefore, the UCI HAR dataset is used in this study as a benchmark against which the ETEXWELD results are compared.
In the literature, there are numerous strategies for performance analysis in sensor-based activity recognition, comprising supervised, unsupervised and semi-supervised techniques. Since this is a classification problem, we used supervised learning methods. The UCI HAR dataset, which we use as the benchmark, has also been used in several studies [14], [15]. In these works, mostly classical machine learning algorithms such as Support Vector Machines (SVM) [16], k-Nearest Neighbour (kNN), Linear Discriminant Analysis (LDA) [17] and the Multilayer Perceptron (MLP) [18], as well as deep learning methods, are applied to this dataset. [19], [20] recognize daily activities using wearable sensors and, additionally, environmental sound with a Deep Neural Network (DNN). [21] uses the S transform and Gaussian windows to extract features of the activities, but its success depends on the length of the activity, which is very dataset specific. Many studies use the kNN algorithm to classify activities. However, kNN is not practical for real-time applications because its computational cost delays recognition. [22] demonstrates that combining accelerometer and gyroscope sensors allows human activities to be detected properly using feature selectors and the kNN algorithm. While classical machine learning algorithms reach around 85% accuracy, deep learning algorithms including CNN and LSTM [23] give better results, above 90% [24]. While a CNN reaches 95% accuracy, [25] uses the histogram of gradient (HG) and the Fourier descriptor (FD) to extract features and deploys kNN and SVM for recognition. Such hand-crafted feature extraction techniques require special feature engineering for every dataset. They achieved the latest state-of-the-art result of 97.1% on the UCI HAR dataset using these methods. By using deep learning methods, however, we have obtained a new state-of-the-art result on this dataset.
Moreover, in order to compare the results of the deep learning methods, we have implemented several methods on the UCI HAR dataset in this paper: Dynamic Time Warping (DTW) [26] with kNN, one-dimensional CNN (1D CNN) [27], two-dimensional CNN (2D CNN) [28], 2-layer LSTM [29], 3-layer LSTM [30], bidirectional LSTM [31], and 1D CNN with LSTM. Additionally, we have implemented all these methods on the ETEXWELD dataset to compare their results for sensor-based activity recognition across the UCI HAR and ETEXWELD datasets. In this study, we have worked on improving the accuracy, precision, recall and f1-score metrics [32] by using data augmentation and data balancing methods. After balancing the dataset, we observed much higher accuracy and precision. Data augmentation was then applied after balancing, and it also improved the results. All these interventions are applied only to the training set. For all methods, 10-fold cross-validation was used to check how well the model is optimized on the training dataset.
This study makes several contributions to the literature: original deep learning architectures, improvement algorithms, a new dataset, a comparison of several methods on different metrics, and an improvement of the state-of-the-art results on a public dataset. First of all, our results beat the state-of-the-art results reported in the literature on the public UCI HAR dataset. To achieve this, we developed new deep learning networks together with a novel data balancing algorithm. In addition, a new dataset was designed and created specifically for this project, and it can be used in other projects in the future. One of the most challenging parts was creating the new dataset with integrated sensors and working on real sampled data acquired from those embedded sensors. It is not feasible to collect thousands of data samples from different subjects. Therefore, a data augmentation approach was also designed within the study to improve the generalization ability of the models. This is another novel step, because it is unique and improves the results. For recognition, deep learning and machine learning algorithms trained from scratch were deployed on both datasets. Finally, the learning methods are compared for each dataset in terms of accuracy, precision, recall, and time and space consumption.

II. DESIGN ARCHITECTURE OF SENSORS EMBEDDED TO A TEXTILE STRUCTURE
The end-to-end system design, including the human, sensors, monitoring, and communication modules integrated into textile products, is shown in Fig. 1. Several setups, composed of electronic modules and a flexible covering band, were prepared for various parts of the body, as seen in Fig. 2. However, in this study only the chest module is used, because [33] shows that the chest is the best position for placing a single wearable device to distinguish human activities. Therefore, a single wearable device was placed on the chest with its covering band, and all activities were performed with this setup. The wearable sensor device can be seen clearly on the chest in Fig. 3.
All data from each sensor are tracked via a desktop application, as shown in Fig. 4. Some of the activities, such as walking, falling, climbing upstairs and jogging, can be seen in Fig. 5.

III. DATASETS
In the scope of this study, two similar datasets are used: the UCI HAR [13] benchmark dataset and the ETEXWELD dataset. Both were created for activity recognition via accelerometer and gyroscope sensors. These datasets include data for the actions "walking", "sitting", "lying", "standing", "up-stairs" and "down-stairs". Although the datasets are similar, they also differ in sampling rate, number of actions and hardware platform.
First, while the UCI HAR dataset has a 50 Hz sampling rate, the ETEXWELD dataset has a 5 Hz sampling rate, which makes detecting activities highly challenging. The low sampling rate is required by the power consumption constraint of the system. Therefore, both the sampling rate and the number of samples are lower in the ETEXWELD dataset. Second, while the UCI HAR dataset has 6 different actions, the ETEXWELD dataset has 7 different activities: the "falling" and "jogging" actions were added and the "sitting" action was discarded, because these activities are more relevant to our user stories.
Finally, while the first dataset was collected from the accelerometer and gyroscope sensors of a mobile device, the ETEXWELD dataset was needed because the new platform is a custom hardware board with different accelerometer and gyroscope sensors. Table I shows all physical activities in the scope of the project and the number of collected data samples. As seen in Table I, this can be considered a small dataset, and some activities have many more samples than others.
In order to balance the dataset, data augmentation algorithms were implemented, and a comparison between these different types of datasets was made.

IV. THE DESIGN OF THE AUTOMATION SYSTEM
The automated system consists of data collection, pre-processing with data augmentation, machine learning algorithms and a real-time application.

A. Data Acquisition
The data are collected with an automation system that uses a ZigBee module on both the clothing architecture and the ground station. This module works at a frequency of 2.4 GHz, and its transmission distance is between 10 and 20 m indoors and up to 200 m outdoors. For instance, Fig. 6 and Fig. 7 show the collected 3-axis accelerometer and 3-axis gyroscope data, respectively. 7 different activities were carried out by 10 different subjects during data acquisition. The volunteers were 5 males and 5 females aged between 22 and 40 years. All experiments were conducted with the consent of the participants. After acquisition, the collected data must be pre-processed before deep learning algorithms can be applied.

B. Data Preparation
Data preparation consists of pre-processing, data balancing and data augmentation. Pre-processing includes parsing the data into the input shape of each model with 15% to get samples throughout the activity. It also separates the dataset into train and test data, as shown in Fig. 8. The train set consists of 3 male and 3 female subjects and corresponds to 75% of the whole dataset. The remaining 25%, which comes from 2 male and 2 female subjects who are completely different from those in the train set, was left as the test set to determine model performance.
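A subject-wise split of the kind described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout (`subject_id`, `sex`, `label`), the helper name `split_by_subject` and the toy subject ids are all assumptions. The share of samples that ends up in each subset depends on how much data each subject recorded, so the toy proportions below do not reproduce the 75%/25% figures.

```python
def split_by_subject(records, test_subjects):
    """Split records so that no subject appears in both train and test.

    records: iterable of dicts with a (hypothetical) 'subject_id' key.
    test_subjects: set of subject ids held out for testing.
    """
    train, test = [], []
    for rec in records:
        if rec["subject_id"] in test_subjects:
            test.append(rec)
        else:
            train.append(rec)
    return train, test


# Toy data: 10 subjects (ids 0-4 male, 5-9 female), 3 records each.
records = [
    {"subject_id": s, "sex": "M" if s < 5 else "F", "label": "walking"}
    for s in range(10)
    for _ in range(3)
]

# Hold out 2 male and 2 female subjects for the test set.
test_subjects = {3, 4, 8, 9}
train, test = split_by_subject(records, test_subjects)

train_ids = {r["subject_id"] for r in train}
test_ids = {r["subject_id"] for r in test}
```

Splitting by subject rather than by sample keeps windows from the same person out of both subsets, so the test score reflects performance on unseen people.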
The distribution of activities in the native train and test subsets is shown in Fig. 9.
As seen in Fig. 8, the native dataset has an unbalanced distribution in both the train and test sets. Because of the unbalanced number of collected samples for different actions, we obtained high accuracy but low precision. To improve precision, data balancing algorithms were used. Sufficiency is based on a random stop. The only condition we checked before stopping the balancing was that the ratio between the sample counts of any two actions should be less than 2. After balancing, the number of samples increased, as seen in Fig. 10.
Algorithms 1 and 2 (excerpt):
NewData ← RandomShift(r, data)
if action.len * 2 < maxAction then
    realSample ← getDtFrmThtAction(action)
    for i <= randomness do
        newSample ← AugmntDt(realSample, n, m)
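The balancing rule described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' algorithm: the function names `balance` and `jitter`, the toy data and the deterministic stopping point are assumptions (the paper stops at a random point), and the synthetic-sample generator is a plain noise placeholder.

```python
import random


def jitter(window, rng, noise_std=0.01):
    """Placeholder generator: copy a real window and add small Gaussian noise."""
    return [v + rng.gauss(0.0, noise_std) for v in window]


def balance(dataset, make_sample=jitter, seed=0):
    """Oversample minority actions until no class is less than half the largest.

    dataset: dict mapping action name -> list of windows (lists of floats).
    """
    rng = random.Random(seed)
    balanced = {action: list(windows) for action, windows in dataset.items()}
    max_count = max(len(windows) for windows in balanced.values())
    for action, windows in balanced.items():
        real = dataset[action]
        # Same condition as the pseudo-code: action.len * 2 < maxAction.
        while len(windows) * 2 < max_count:
            windows.append(make_sample(rng.choice(real), rng))
    return balanced


# Toy dataset with invented sizes: one large class and one very small class.
toy = {
    "walking": [[0.1 * i] * 4 for i in range(40)],
    "falling": [[1.0, 0.5, 0.2, 0.0]] * 3,
    "jogging": [[0.3, 0.6, 0.9, 0.6]] * 25,
}
balanced = balance(toy)
counts = {action: len(windows) for action, windows in balanced.items()}
```

Only the classes that violate the ratio condition receive synthetic windows, so the real samples of the larger classes are left untouched.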
The core code of the balancing algorithm is also used in data augmentation. The data augmentation step was necessary because of the lack of sampled data for each action. The idea behind the augmentation algorithm is to take real data for an action and create similar data from it to enlarge the dataset. To do that, we add noise to the raw data and shift it randomly left and right. The correlation matrix was taken into account when determining the noise and shift amounts. Correlation matrices were created for each action in order to preserve the similarity level of each action type in the collected dataset. The correlation matrices for the balanced and native train datasets can be seen in Fig. 11. The high-level pseudo-code is shown in Algorithm 1 and Algorithm 2.
After balancing and augmenting the dataset, similar values were observed in the correlation matrices. This was our benchmark when determining the randomness bounds for balancing and augmentation. The distribution of the augmented dataset is shown in Fig. 12.
Figure caption: General training process for these machine learning approaches.
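The noise-and-shift idea described above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the edge-padding used when shifting, the noise level and the use of a single Pearson coefficient between a real and a synthetic signal (instead of full per-action correlation matrices) are all assumptions made for illustration.

```python
import math
import random


def shift_and_noise(signal, max_shift, noise_std, rng):
    """Shift a 1-D signal left or right by a random amount and add Gaussian noise.

    Samples shifted in from outside the window repeat the edge value.
    """
    shift = rng.randint(-max_shift, max_shift)
    n = len(signal)
    shifted = [signal[min(max(i - shift, 0), n - 1)] for i in range(n)]
    return [v + rng.gauss(0.0, noise_std) for v in shifted]


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(42)
# Toy accelerometer axis: one sine period over 15 samples (3 s at 5 Hz).
real = [math.sin(2 * math.pi * i / 15) for i in range(15)]
synthetic = shift_and_noise(real, max_shift=1, noise_std=0.02, rng=rng)

# A high correlation suggests the synthetic window still resembles the real one.
similarity = pearson(real, synthetic)
```

In practice the shift and noise bounds would be tightened until such similarity measures stay close to those of the real data, which is the role the correlation matrices play in the text.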

C. Creation of Deep Learning Models
The research problem in this study is handled as a supervised classification process. All of the evaluated supervised deep learning [34] methods in this study are trained and tested as seen in Fig. 13 and Fig. 14.
These methods are 1D CNN, 2D CNN, 2-layer LSTM [30], 3-layer LSTM, bidirectional LSTM and 1D CNN+LSTM. Each network is designed and implemented with a unique architecture and is trained from scratch. In addition to these state-of-the-art deep learning methods, DTW+kNN is also implemented with the parameter k equal to 5. Moreover, all methods were applied to both the UCI HAR dataset and the newly created dataset.
Algorithm 3 (excerpt):
while k ≤ TotalActivities do
    Activity(k) ← Convolutions
    Activity(k) ← MaxPooling
    Activity(k) ← Classifier
    Loss ← TrueActivity − Activity(k)
return Error

D. CNN Models
A convolutional neural network (CNN) is applied to the signal data to extract features automatically and to classify them [35], [36]. CNNs are able to capture dependencies in both the time and signal dimensions. Weight sharing in a CNN improves recognition performance while decreasing the number of trainable weights. We used two different types of CNN model. Algorithm 3 shows the general algorithm for the CNN models.
1D CNN Model: The input to the 1D CNN trained from scratch is a 1x90 vector, in which consecutive samples of accelerometer and gyroscope data are arranged as a 1D vector. For feature extraction, the architecture consists of 2 convolution layers, with 2x1 kernels and stride 2, and 1x1 kernels and stride 1, respectively. 32 filters are used in each layer. After this step, a 1x1440 flattened feature vector is obtained. For classification, these features are fed to a fully connected layer of 100 neurons with the ReLU activation function. To determine the activity class, an output layer of 7 neurons with the softmax activation function is deployed. The learning rate is set to 0.01, and Stochastic Gradient Descent is chosen as the optimization method with the categorical cross-entropy loss function. The best performance under 10-fold cross-validation was obtained after 250 epochs. Fig. 15 shows the 1D CNN model.
2D CNN Model: Before being fed to the 2D CNN, the data were converted into an image format. Viewing the image as a matrix, the 3 accelerometer axes occupy the first 3 columns and the 3 gyroscope axes the next 3 columns. Each sample forms one row, and rows are stacked until the activity ends. Every activity is carried out within 3 seconds. The data input is a 15x6 activity map for both models. In the 1D model, the convolution operator works on just one time sample at a time, unlike in the 2D model. Notice that the kernel height is responsible for extracting temporal patterns, while the kernel width extracts the correlation between neighboring axes.
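The sizes quoted above for the 1D and 2D inputs can be checked with a little arithmetic. The sketch below assumes unpadded convolutions and a sample-major ordering of the flat vector (six channel values per time sample); both are assumptions, since the text does not state the padding mode or the exact ordering, and the helper names are hypothetical.

```python
def conv_out_len(length, kernel, stride):
    """Output length of an unpadded 1-D convolution."""
    return (length - kernel) // stride + 1


samples_per_window = 5 * 3   # 5 Hz sampling, 3 s per activity
channels = 6                 # 3 accelerometer axes + 3 gyroscope axes
input_len = samples_per_window * channels    # length of the 1D CNN input

filters = 32
after_conv1 = conv_out_len(input_len, kernel=2, stride=2)    # 45
after_conv2 = conv_out_len(after_conv1, kernel=1, stride=1)  # 45
flattened = after_conv2 * filters                            # 1440


def to_activity_map(flat, rows=15, cols=6):
    """Reshape a flat vector into a rows x cols map, one row per time sample."""
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


# The same 90 values viewed as the 15x6 activity map used by the 2D CNN.
activity_map = to_activity_map(list(range(input_len)))
```

Under these assumptions the 1x90 input, the 15x6 activity map and the 1x1440 flattened feature vector are mutually consistent.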
The model trained from scratch consists of 2 convolution layers, 1 max-pooling layer, 1 fully connected layer and 2 dropout layers. The learning rate was set to 0.0001, and the Adam [37] optimization method is used to learn the weights. Data were fed in batches of 24, and the highest performance under 10-fold cross-validation was achieved after 300 iterations.
To learn the relations between features, 1000 neurons were used in the hidden layer for classification. No pre-trained network was available to reduce the training effort, so we developed the network completely from scratch. Fig. 16 shows the 2D CNN network model.

E. LSTM Models
Many studies take advantage of the ability of LSTMs to interpret sequential data for activity recognition with wearables [38], [39]. The most commonly used architecture for implementing LSTM models is described in [9]. An LSTM memory block consists of three gates: an input gate (i), a forget gate (f), and an output gate (o), which overwrite, keep, or retrieve the memory in the memory cell (c) at time t.
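First, the input gate ($i_t$) and forget gate ($f_t$) are computed by the following equations, written here in the standard peephole form that matches the weight definitions below:
\[ i_t = \sigma(W_i x_t + H_i h_{t-1} + C_i c_{t-1}) \tag{1} \]
\[ f_t = \sigma(W_f x_t + H_f h_{t-1} + C_f c_{t-1}) \tag{2} \]
Afterwards, the current memory cell ($c_t$) is updated by forgetting part of the previous contents ($c_{t-1}$) and including the new memory ($\tilde{c}_t$). They can be calculated by the following equations:
\[ \tilde{c}_t = \tanh(W_c x_t + H_c h_{t-1}) \tag{3} \]
\[ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{4} \]
Finally, the final activation at the current position ($h_t$) is calculated with the output gate ($o_t$), which regulates the amount of information to output:
\[ o_t = \sigma(W_o x_t + H_o h_{t-1} + C_o c_t) \tag{5} \]
\[ h_t = o_t \odot \tanh(c_t) \tag{6} \]
Here $x, i, f, c, \tilde{c}, o, h \in \mathbb{R}^T$, where $T$ is the length of the input. $x_t$ is the input activation at the current time ($t$), and $h_{t-1}$ is the output activation from the previous time ($t-1$). $W$, $H$, and $C$ are the weight matrices for input-to-gate, recurrent, and cell-to-gate connections, respectively; $\sigma$ is the logistic sigmoid and $\odot$ denotes element-wise multiplication.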
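The gate computations described above can be sketched as a single time step in plain Python. This is an illustrative sketch of the common peephole form, not the authors' network: the parameter names, the sizes, the random initialization and the choice to apply the cell-to-gate weights element-wise are assumptions, and bias terms are omitted.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step. p holds W_* (input), H_* (recurrent), C_* (cell-to-gate)."""
    def pre_activation(gate, cell=None):
        a = matvec(p["W_" + gate], x_t)
        b = matvec(p["H_" + gate], h_prev)
        z = [ai + bi for ai, bi in zip(a, b)]
        if cell is not None:
            # Cell-to-gate (peephole) term, applied element-wise here.
            z = [zi + wi * ci for zi, wi, ci in zip(z, p["C_" + gate], cell)]
        return z

    i_t = [sigmoid(z) for z in pre_activation("i", c_prev)]   # input gate
    f_t = [sigmoid(z) for z in pre_activation("f", c_prev)]   # forget gate
    c_new = [math.tanh(z) for z in pre_activation("c")]       # candidate memory
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_new)]
    o_t = [sigmoid(z) for z in pre_activation("o", c_t)]      # output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def random_params(n_in, n_hidden, rng):
    p = {}
    for gate in "ifco":
        p["W_" + gate] = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                          for _ in range(n_hidden)]
        p["H_" + gate] = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
                          for _ in range(n_hidden)]
        if gate != "c":
            p["C_" + gate] = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
    return p


rng = random.Random(0)
params = random_params(n_in=6, n_hidden=32, rng=rng)
h, c = [0.0] * 32, [0.0] * 32
# Toy sequence: 15 time steps of 6-channel sensor readings.
sequence = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(15)]
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, params)
```

The final `h` is the hidden state after the last time step; in a classifier it would typically be passed on to the next layer or to an output layer.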
In the study of Zebin et al. [40], only 6 different activities were classified with a 2-layer LSTM network. In our work, besides LSTM networks with two different numbers of layers, a bidirectional LSTM network [41] is also used. A standard LSTM network is restricted in that future input information cannot be reached from the current state. In contrast, a bidirectional LSTM network does not require its input data to have the same dimension. Moreover, future input information can be reached from the current state. The main idea of the bidirectional LSTM is to connect two hidden layers of opposite directions to the same output. With this structure, the output layer can access information from both past and future states and interpret it better.
After the pre-processing step, the data must be converted to a fixed form before being fed to the LSTM network model. Data constructed with the appropriate dimensions are fed to the successive LSTM blocks as vectors in batches of 32. The overall flow chart of the LSTM model, with different numbers of hidden layers, is shown in Fig. 17. To capture the sequential relationships in the spatio-temporal data, the window size and overlap rate hyper-parameters are very important; they were set to 32 and 50%, respectively. 2-layer and 3-layer LSTM models were deployed in two different experiments. Every layer in the network consists of 32 neurons. In the experiments, 10-fold cross-validation, a learning rate of 0.0025, a loss coefficient of 0.0015 and 15 epochs are the common hyper-parameters for both LSTM models. Binary cross-entropy was chosen as the loss function for optimization. A dropout layer with 20% probability is applied to prevent overfitting. The bidirectional LSTM network model consists of 2 hidden layers with 100 and 32 units, respectively.
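The window size and overlap rate mentioned above can be illustrated with a short windowing sketch. The function name `sliding_windows`, the toy stream and the decision to drop trailing samples that do not fill a whole window are assumptions; the paper does not describe how incomplete windows are handled.

```python
def sliding_windows(samples, window_size=32, overlap=0.5):
    """Cut a sequence of samples into fixed-length, overlapping windows.

    Trailing samples that do not fill a whole window are dropped.
    """
    step = max(1, int(window_size * (1.0 - overlap)))
    return [
        samples[start:start + window_size]
        for start in range(0, len(samples) - window_size + 1, step)
    ]


# Toy stream: 100 time steps, each with 6 channels (3 accel + 3 gyro axes).
stream = [[float(t)] * 6 for t in range(100)]
windows = sliding_windows(stream, window_size=32, overlap=0.5)
```

With a 50% overlap, each window shares its second half with the first half of the next one, which roughly doubles the number of training windows compared with non-overlapping segmentation.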

F. Hybrid Models
CNN + LSTM Model: In this hybrid model, the CNN and LSTM models work together on the recognition task. First, a 1D CNN was used at different time steps as a feature extractor. After 2 convolutional layers, each with 64 filter maps, and 2 max-pooling layers with stride 2, 1536 features were obtained. For classification, 3 LSTM layers were deployed, each hidden layer consisting of 128 neurons. To capture the sequential relationships between samples and time intervals, data were fed to the network as 1x124 vectors. The Adam [37] optimization method was used, and categorical cross-entropy was selected as the loss function during training. Fig. 18 illustrates the overall scheme of the model.
KNN + DTW Model: [42] used the KNN algorithm on the UCI dataset. Although KNN is a classical machine learning algorithm for classification, it is very simple and effective. Determining the optimal k value is very important, since it plays a great role in classification accuracy. The algorithm was run with different k values, and k = 4 was chosen for the best performance. DTW [43] is used to calculate the similarity between two temporal sequences and is combined with the KNN algorithm for the classification task. It finds the optimum mapping by using the minimum distance between signal samples of arbitrary length, rather than the customary Euclidean distance used as the similarity measure in the KNN algorithm. Equation 8 shows the minimum distance calculation, which uses the Euclidean distance formula in Equation 7. Here, Euclidean distance was used with neighbourhood of k = 4 nearest training samples with the initial condition D(1,
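The combination of DTW and KNN described above can be sketched as follows. This uses the textbook DTW recurrence with the absolute difference as the point-wise distance on 1-D signals; it is an assumption that this matches the paper's Equations 7 and 8, which are not recoverable here. The function names and the toy sequences are invented for illustration.

```python
import math
from collections import Counter


def dtw_distance(a, b):
    """DTW cost between two 1-D sequences of possibly different lengths."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])   # Euclidean distance in 1-D
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


def knn_dtw_predict(query, train_x, train_y, k=4):
    """Majority vote among the k training sequences closest to query under DTW."""
    neighbours = sorted(
        (dtw_distance(query, x), y) for x, y in zip(train_x, train_y)
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy training set with sequences of different lengths.
train_x = [
    [0, 0, 0, 0], [0, 0.1, 0, 0], [0, 0, 0.1], [0, 0, 0, 0.1, 0],
    [0, 1, 2, 3], [0, 1, 1, 2, 3], [0, 2, 3], [1, 2, 3, 3],
]
train_y = ["standing"] * 4 + ["up-stairs"] * 4

prediction = knn_dtw_predict([0, 1, 2, 2, 3], train_x, train_y, k=4)
```

Every prediction compares the query with all training sequences at quadratic cost per pair, which is consistent with the computational cost concerns raised for kNN elsewhere in the paper.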

G. Real-Time Application
In this work, a real-time desktop application that uses the deep learning method as an analyzer was implemented to track firefighters' activity. The interface of the application can be seen in Fig. 19.

V. EXPERIMENTAL TEST RESULTS
All results were obtained on the same hardware, which includes an Intel i7 2.6 GHz CPU and 8 GB of RAM. The evaluation metrics for classification used in this work are given in Table II. Training time and the number of updated parameters are also measured, to compare the time and space consumption of the algorithms. Training time does not affect the accuracy rate, but with a large amount of data the training procedure costs much more computation time. All of the designed deep learning networks were tested on two different datasets, UCI HAR and ETEXWELD. On the UCI HAR dataset, two different approaches were deployed. First, experiments were carried out using only the accelerometer and gyroscope data. Then, in further experiments, the accelerometer, gyroscope and total data, consisting of accelerometer and gyroscope data together, were used. Moreover, the ETEXWELD dataset is used with three different forms of training set: native, balanced and augmented. The most important metric in this study is accuracy. Comparative accuracy results are obtained on the UCI HAR and ETEXWELD datasets. For the UCI HAR dataset with only accelerometer and gyroscope data, the test accuracy, time and space consumption are reported in Table III. The UCI HAR dataset also comes with one more parameter, called "total". After adding this input to the training model, the results are summarized in Table IV. As can be seen, the accuracy improved and a new state-of-the-art result was obtained with a 93.7% accuracy rate.
The same approach was also applied to the ETEXWELD train dataset. The test accuracies for the native dataset are given in Table V.
As seen in Table V, from the accuracy point of view the best result was obtained with the 3-layer LSTM. However, accuracy alone may not be enough to determine whether a model is good. Therefore, the precision, recall, f1-score and support metrics for the 3-layer LSTM were also checked in Table VI. While accuracy is relatively high, the precision, recall and f1-score are very low for several activities. Examining the results shows that the activities with lower precision are also those with few samples. The reason for this low precision is the unbalanced dataset used for training. The confusion matrix is shown in Fig. 20.
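The gap between overall accuracy and per-class metrics discussed above can be reproduced with a small worked example. The confusion matrix below contains invented toy numbers, not results from this study, and the function name `per_class_metrics` is hypothetical; the formulas are the usual definitions of precision, recall, f1-score and support.

```python
def per_class_metrics(confusion, labels):
    """Compute accuracy and per-class precision, recall, f1 and support.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(labels)))
    report = {}
    for k, name in enumerate(labels):
        tp = confusion[k][k]
        predicted = sum(row[k] for row in confusion)   # column sum
        support = sum(confusion[k])                    # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[name] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": support,
        }
    return correct / total, report


labels = ["walking", "standing", "falling"]
# Invented toy counts: "falling" has far fewer samples than the other classes.
confusion = [
    [90, 5, 5],   # true walking
    [4, 95, 1],   # true standing
    [2, 2, 6],    # true falling
]
accuracy, report = per_class_metrics(confusion, labels)
```

In this toy case the overall accuracy is about 91%, yet the precision of the small "falling" class is only 0.5, which mirrors the effect of class imbalance described in the text.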
To solve this problem, data augmentation and dataset balancing were carried out. After the balancing process, the balanced train set was used again for all machine learning methods. As seen in Table VII, accuracy improved considerably for all machine learning models, and the best performance was again obtained with the 3-layer LSTM.
After this observation, the other metrics of the model trained on the balanced dataset were re-checked. As shown in Table VIII, there is an important improvement.
The confusion matrix in Fig. 21 also clearly shows how much the results improved with balanced data.
After this large improvement not only in accuracy but also in precision, recall and f1-score, the dataset and model became more trustworthy. The training process on balanced data was also tracked over epochs for the 3-layer LSTM model. The accuracy and loss changes for both the train and test sets can be seen in Fig. 22.
In addition to balancing, all models were tested on the augmented dataset to see how data augmentation affects classification accuracy. The accuracy results for all models are given in Table IX, and the confusion matrix can be seen in Fig. 23.
The accuracy still improves, although only slightly. The training time increases because of the larger number of iterations. The other metrics for the augmented data are reported in Table X.

VI. CONCLUSION
In this study, a unique dataset based on real experiments and acquired from wearable devices was used for human activity recognition. We achieved state-of-the-art results on two different datasets using deep learning techniques, including CNN and LSTM networks designed and trained from scratch. Our results indicate that the 3-layer LSTM model is the best of the evaluated solutions for sensor-based activity recognition in real-time applications, particularly for wearable devices. Additionally, kNN may not be an efficient option for classifying activities in big datasets like UCI HAR because of its lower accuracy rate and its computational cost. We therefore contribute to the literature by comparing different deep learning results in terms of accuracy, time and space complexity. Moreover, we have shown that data augmentation on small datasets and balancing of unbalanced datasets are critical for obtaining higher scores. They improve not only test accuracy but also other metrics such as precision, recall and f1-score dramatically. To achieve that, we designed and implemented our own original algorithms specifically for time-series sensory data, as one of the novel steps. This study can be applied to detect the activities of firefighters in order to determine their health condition and performance. In future studies, we plan to use multiple sensors and recognize more complicated activities.