WeMoD: A Machine Learning Approach for Wearable and Mobile Physical Activity Prediction

It is indisputable that physical activity (PA) is vital for an individual’s health and well-being. However, globally, one in four adults do not meet the recommended levels of PA, with substantial personal and socioeconomic implications. In recent years, a significant amount of work has explored the potential of pervasive computing and self-tracking for increasing PA. Adaptive and personalized goal-setting has proven to be one of the most efficient methods in this direction. To this end, we propose a Machine Learning (ML) approach, WeMoD, which can be used to predict a user’s future daily step count for setting challenging yet achievable goals. To develop WeMoD, we utilize heterogeneous, multimodal human data collected unobtrusively in the wild. Additionally, we use a novel fusion of physiological, behavioral, and contextual features, which, according to the experimental results, has a positive effect on the predictive ability of the models. Specifically, we can predict a user’s step count with a mean absolute error (MAE) of 1930 steps and further improve this performance through personalization to a MAE of 1908 steps, paving the way for future work in this field.


I. INTRODUCTION
Physical Activity (PA) is any movement that sets our bodies into motion. PA has substantial benefits for human health and well-being, including a lower risk of all-cause mortality, coronary heart disease, type 2 diabetes, certain types of cancer, depression, and Alzheimer's disease. It also has several socioeconomic benefits, such as reduced usage of fossil fuels, safer roads, less air pollution, and generally higher quality of life [1]. Despite these considerable benefits, one in four adults and three in four adolescents do not meet the recommended guidelines for PA [2]. According to the World Health Organization, globally, physical inactivity's cost is estimated at $54 billion in direct health care (57% of which is covered by the public sector), and an additional $14 billion from productivity loss [1].
A significant challenge in fighting this inactivity pandemic is encouraging individuals who are not sufficiently active to include more PA in their daily routine. Simplifying the guidelines into step count recommendations makes them easy to understand and follow, and is therefore of great importance for promoting public health [3]. To this end, self-tracking devices have proven successful in increasing individual PA levels by providing a personalized self-monitoring and coaching experience [4,5]. The most common way in which wearable devices are used to increase PA is by tracking the user's daily step count and setting a daily step goal to achieve [6]. The most crucial aspect of setting successful step count goals is the algorithm used to calculate them. The simplest and most common algorithms follow a "Fixed Goal Approach", in which the device sets a predefined fixed goal each day for the user, such as the recommended 10,000 steps or a goal that the user has selected for themselves. However, the "Adaptive, Personalized Goal Approach", in which the system personalizes and adapts the user's goal by taking into account various aspects of their behavior and context, has proven to be more effective in increasing adherence and PA levels [4,5].
Nevertheless, the field of PA prediction for adaptive goal-setting is still in its infancy. State-of-the-art approaches suffer from various limitations, such as subjective, lab-based experimental data, inability to tackle heterogeneous, real-world data sources, a limited feature space, and a lack of transparently evaluated, end-to-end, ML-based solutions.
Motivated by the issues above, this work proposes the WeMoD approach for PA prediction, consisting of a series of ML models trained on real-life, heterogeneous, multimodal user data collected in the wild. Specifically, our contributions are as follows:
• C1 - In-the-wild Data Collection: To capture individuals' actual daily routines outside of an ongoing experiment, we utilize past data, ranging from a few days to more than five years, donated by existing wearable users to build a naturalistic, multi-device dataset that enables us to build more robust models for the real world.
• C2 - Wearable Data Integration & Preprocessing: To take advantage of data originating from different devices, we design and implement a data integration and preprocessing component that merges and analyzes multi-manufacturer, multi-device data. This way, we enable researchers and experts in this field to expand their sample population and diversify their available activity data sources.
• C3 - Extended Feature Space: We fuse physiological, psychological, and contextual features, including COVID-19-related data, to forecast a user's upcoming daily step count. We incrementally add and evaluate various feature sets to assess their effect on the model's predictive ability.
• C4 - Open-source, ML-based Approach for PA Prediction: We develop WeMoD (footnote 1), an open-source, end-to-end approach for the collection, integration, preprocessing, and ML-based forecasting of PA from a set of rich features. In this process, we experiment with various ML algorithms and paradigms. Among others, we compare the performance of traditional generalized ML models with personalized models, directly linked to individuals, as a herald of individualized goal-setting interventions.
We organize the remainder of this paper as follows: Section II discusses prior work in ML-based PA prediction and goal-setting. Section III outlines our data collection, preprocessing, analysis, and evaluation methodology, while Section IV presents our experimental results and highlights our findings. Finally, Section V concludes this paper and introduces future work directions.

II. RELATED WORK
In the following, we briefly overview two core research topics that our work touches upon, namely goal-setting interventions and ML-based PA prediction.

A. PA Prediction & Goal-setting Interventions
Several works have studied the impact of appropriate goal-setting on increasing an individual's PA levels, measured in daily step count data. Zhou et al. [7] conducted a Randomized Controlled Trial (RCT), where the intervention group received personalized goals adapted to their previous activity data, whereas the active control group received a fixed goal of 10,000 steps. Their results showed a statistically significant difference favoring the intervention group. Similarly, van Dantzig et al. [8] evaluated the impact of a context-aware, personalized goal-setting approach through an RCT, and Phatak et al. [9] presented an idiographic (person-specific) approach establishing the efficiency of dynamic models of PA in the context of goal-setting and positive reinforcement interventions. These findings indicate that personalized goal-setting, which considers the user's context, is more effective in increasing PA levels, as expressed through daily step counts. However, these studies focus on the RCTs rather than on the prediction models themselves and provide limited information in this regard, as opposed to our work. Additionally, in contrast to these works, we have created and utilized a real-world, heterogeneous dataset that is not limited to a specific experiment duration or lab setting and represents the participants' natural behavior. Overall, we adopt a broader view of human sensing by incorporating a combination of features, including physiological, psychological, and contextual data, rather than past measurements alone, and by utilizing ML for behavioral analysis from human sensing data.

B. ML-based PA Prediction & Goal-setting Approaches
Closest to our work, prior literature has also focused on building ML models for PA prediction and adaptive goal-setting. Zhou et al. [7] developed an ML model utilizing users' historical data regarding daily step count and goal achievement rate and generated challenging yet realistic goals to maximize future PA. Dijkhuis et al. [10] developed an ML model that predicts a user's achievement of a daily step count goal using activity and time-related features up to the prediction moment. A similar approach is presented by Li et al. [4], where the authors used ML to develop a model that calculates, hour by hour, the probability of a user achieving their daily step goal. The model considers past activity patterns and the current PA target to deliver the desired prediction. A limitation of the above works is that they do not report the performance of the step count prediction models, rendering it impossible to evaluate their effectiveness. Also, they do not consider contextual and psychological features that may enhance the accuracy of the suggested goals, an approach that we explore in this work. To cover this gap, Mohammadi et al. [11] developed a neural network model that considered several contextual features derived from questionnaires (personal, social, and environmental features), as well as physiological data, to predict the average weekly step count of an individual. However, this approach does not consider the different characteristics of the days in a week (weekends vs. weekdays, holidays, day-specific activity patterns). In contrast, our work explores in depth the importance of such feature sets (e.g., time-related, COVID-19-related) for the performance of ML models and calculates more demanding daily (instead of weekly) PA predictions.
Footnote 1: https://github.com/BasdekD/adaptive-goal-setting

III. METHODOLOGY
This section presents our methodology regarding data collection and preprocessing, feature engineering, and model building and evaluation, an overview of which we present in Figure 1.
Fig. 1. The WeMoD approach outline: In the dataset curation phase, we perform necessary data preprocessing steps, and feature and window size selection. In the algorithm selection and optimization phase, we experiment with different ML algorithms, and we proceed with dimensionality reduction.

A. Data Collection
A critical aspect for accurately predicting the daily step count of a person is finding the appropriate dataset for training and evaluating the ML model. As mentioned in Section I, there are two major open issues regarding such datasets. First, previous studies mainly focused on physiological measurements, overlooking other factors that may affect PA, such as behavioral traits or environmental factors. Second, there is a lack of relevant in-the-wild data, which may lead to a distorted reflection of an individual's actual PA levels [7]. To this end, we unobtrusively constructed an in-the-wild dataset consisting of data from users with heterogeneous activity trackers. Furthermore, we collected multimodal data concerning historic PA levels combined with the user's psychology and personality, time-relevant features, and features related to COVID-19 movement-restrictive government policies, as described in detail in Section III-B.
To collect the necessary data for our final dataset, we asked twenty-one participants (10 females, 11 males), all over 18 years old (18-24 years: 5, 25-34 years: 8, 35-44: 7, 45-54: 1), owners of Xiaomi and Apple activity trackers and mobile phones, to provide us with their data without any monetary compensation. The average number of participants' steps ranged from 793 to 9244 (µ = 4527 and σ = 2062). The study followed the guidelines of the EU General Data Protection Regulation (GDPR) (2016/679) [12]. We pseudonymized the collected data, using a one-way cryptographic hash function [13] (SHA256) to make it impossible to match a piece of information to the specific participant that has submitted it. Finally, all participants provided informed consent, and ethical approval was obtained from the Aristotle University of Thessaloniki (AUTH) Ethics Committee (254324/2020).
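The pseudonymization step can be illustrated with a minimal sketch using Python's standard `hashlib` module. The text above only states that SHA256 was used; the secret salt, the function name `pseudonymize`, and the example identifiers are assumptions added for illustration and are not taken from the WeMoD code base.

```python
import hashlib
import secrets


def pseudonymize(participant_id: str, salt: bytes) -> str:
    """Return a SHA-256 hex digest that stands in for the participant identifier."""
    return hashlib.sha256(salt + participant_id.encode("utf-8")).hexdigest()


# Assumption: one random salt per study, stored separately from the data.
salt = secrets.token_bytes(16)

# Hypothetical identifiers, used only to show the mapping.
participants = ["alice@example.org", "bob@example.org"]
pseudonyms = {pid: pseudonymize(pid, salt) for pid in participants}
```

Salting is a common precaution when the set of possible identifiers is small enough to be enumerated; whether the original pipeline salts its hashes is not stated in the paper.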

B. Feature Engineering
The WeMoD feature-rich dataset incorporates data collected from three different data sources: (1) activity trackers, (2) questionnaires, and (3) a COVID-19 dataset, totaling 53 features. Specifically:
Personality and identity features: We extracted 33 personality and identity features from four questionnaires: demographics, the Big Five personality trait scale [14], and the Processes and Stages of Change scales [15,16] of the Transtheoretical Model (TTM). With regard to demographics, we built seven features concerning gender, age, family status, educational level, and career status. From the Big Five questionnaire, we evaluated each participant's personality in terms of openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism and built five related features. Lastly, from the TTM questionnaires, we assessed the user's cognitive and affective experiential processes (e.g., consciousness-raising, dramatic relief, environmental reevaluation, self-evaluation, social liberation) and behavioral processes (e.g., self-liberation, counter conditioning, helping relationships, reinforcement management, stimulus control), and built 21 related features.
Activity and Date features: We obtained the daily number of steps as the target variable, as well as past physical activity features. It is worth mentioning that days with fewer than 500 steps were considered no-wear days and discarded from our dataset, similarly to other works [9,17]. Regarding the date features, we extracted helpful meta-information for predicting steps, namely holiday, weekend, day of the week, month, and day of the month.
COVID-19 features: In response to the COVID-19 outbreak, governments worldwide applied a wide range of measures that may have impacted PA (e.g., curfews, movement restrictions). To this end, and since the period of recorded data coincides with the global pandemic, we utilized a public dataset [18] containing indicators for different policy responses (e.g., economic, health, containment and closure, miscellaneous). We focused on the records for Greece, in line with our sample, and included 15 features regarding containment and closure policies.
After completing the feature extraction process, each day of a participant in our final dataset consists of 53 features (excluding physical activity window-based features), converted to a numerical format with appropriate encoding techniques.
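As a rough illustration of the date features and the no-wear rule described in this section, the following sketch derives the five date fields from a calendar date and drops days below 500 steps. The holiday set, the function names, and the sample step counts are hypothetical; the original pipeline may encode these fields differently.

```python
from datetime import date

NO_WEAR_THRESHOLD = 500  # days with fewer steps are treated as no-wear days

# Assumption: a small illustrative holiday calendar, not the one used in the study.
HOLIDAYS = {date(2021, 1, 1), date(2021, 3, 25)}


def date_features(day: date) -> dict:
    """Derive the five date features named in the text from a calendar date."""
    return {
        "holiday": int(day in HOLIDAYS),
        "weekend": int(day.weekday() >= 5),  # Saturday = 5, Sunday = 6
        "day_of_week": day.weekday(),        # Monday = 0
        "month": day.month,
        "day_of_month": day.day,
    }


def build_rows(daily_steps: dict) -> list:
    """Attach date features to each day and discard no-wear days."""
    rows = []
    for day, steps in sorted(daily_steps.items()):
        if steps < NO_WEAR_THRESHOLD:
            continue
        rows.append({"date": day, "steps": steps, **date_features(day)})
    return rows


# Hypothetical step counts for three consecutive days.
daily_steps = {
    date(2021, 1, 1): 3200,  # a Friday that is in the holiday set
    date(2021, 1, 2): 120,   # below the threshold, dropped
    date(2021, 1, 3): 7450,  # a Sunday
}
rows = build_rows(daily_steps)
```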

C. Data Preprocessing
Following the data collection and feature engineering processes described above, we took a few necessary preprocessing steps to ensure the robustness of the dataset we created. First, we transformed the data through the manufacturer integration component, merging data from heterogeneous data sources. Then, we applied several techniques for data cleaning (outlier detection, handling of missing days, and no-wear days), and we evaluated the importance of the different feature groups through an extended experimentation process with different subsets of features. To obtain the final format of the dataset, we had to decide on the window size n, i.e., the number of previous days used as input for WeMoD to predict the step count of the next day. We tested four window sizes (n = 5, 7, 14, and 20 days) and compared the performance of the corresponding prediction models. These values cover an adequate range of experimental window sizes for this research field while keeping the training of the ML models computationally feasible.
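The windowing idea can be sketched as follows, assuming a gap-free sequence of daily step counts for a single user. The function name `make_windows` and the sample values are illustrative; the sketch uses only step counts as window features, whereas the actual dataset also carries the other feature groups.

```python
import numpy as np


def make_windows(steps: np.ndarray, n: int):
    """Pair each run of n consecutive days with the step count of the following day."""
    X = np.stack([steps[i : i + n] for i in range(len(steps) - n)])
    y = steps[n:]
    return X, y


# Hypothetical step counts for eight consecutive days.
steps = np.array([4000, 5200, 3100, 6100, 4800, 7000, 2500, 5600])

for n in (5, 7):
    X, y = make_windows(steps, n)
    print(n, X.shape, y.shape)
```

On the same sequence, a larger window leaves fewer training samples (three for n = 5, one for n = 7 in this toy case), which is one practical reason to prefer a small window when performance is comparable.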
After the initial data preprocessing phase, each day of recorded activity in the dataset is described by 53 different features plus additional window-based physical activity features. It becomes evident that the task at hand is characterized by high dimensionality, which may lead to increased training time and even reduced accuracy of the obtained predictions [19]. Hence, for the dimensions of the dataset to be reduced, two distinct approaches were evaluated, namely feature selection and dimensionality reduction through dataset projection into a lower dimension. Our approach towards feature selection was to apply recursive feature elimination (RFE), specifically RFECV, due to the automated and interpretable manner by which the most important features are identified. Regarding dimensionality reduction, we utilized Principal Component Analysis (PCA). RFE reduces the dimensions of the input data by eliminating the least essential features from the dataset. On the other hand, PCA accomplishes the task by applying transformations from the field of linear algebra to the data, making the resulting dataset a projection of the original one.
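The contrast between the two approaches can be seen in a short scikit-learn sketch. The data is synthetic, and the choice of estimator inside RFECV, the cross-validation splitter, and the number of principal components are assumptions made for illustration; the paper names RFECV and PCA but does not list these settings here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFECV
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in for the feature matrix: only the first two columns carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=120)

# Feature selection: repeatedly drop the least important feature and keep the
# feature count with the best cross-validated MAE.
selector = RFECV(
    estimator=GradientBoostingRegressor(n_estimators=20, random_state=0),
    step=1,
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_mean_absolute_error",
)
selector.fit(X, y)
X_selected = selector.transform(X)  # a subset of the original columns

# Projection: replace the original columns with a few principal components.
pca = PCA(n_components=3)
X_projected = pca.fit_transform(X)  # linear combinations of all columns
```

After RFECV, each retained column is still one of the original features (`selector.support_` marks which ones), which is what makes the result interpretable; PCA components, by contrast, mix all input features.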
Overall, our data preprocessing methodology had a positive impact on the performance of the ML prediction models, indicating the high quality of the final dataset, as presented in Section IV.
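The outlier-handling step mentioned in this section, which Section IV-A specifies as removing a fixed percentage of the days with the highest step counts, might look like the sketch below. The function name, the way ties are handled, and the sample data are assumptions.

```python
import numpy as np


def drop_high_step_outliers(steps: np.ndarray, fraction: float) -> np.ndarray:
    """Remove the given fraction of days with the highest step counts."""
    n_remove = int(len(steps) * fraction)
    if n_remove == 0:
        return steps.copy()
    highest = np.argsort(steps)[-n_remove:]  # indices of the n_remove largest values
    keep = np.ones(len(steps), dtype=bool)
    keep[highest] = False
    return steps[keep]


# 100 hypothetical days, two of which have implausibly high step counts.
steps = np.arange(1000, 1100)
steps[-2:] = [45000, 52000]

cleaned = drop_high_step_outliers(steps, fraction=0.02)
```

With 100 days and a 2% threshold, exactly the two extreme days are removed and the remaining 98 days keep their original order.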

D. Model Building & Evaluation
Following the creation of the final version of the dataset, the next step towards developing WeMoD is selecting the ML algorithm that will be used for the prediction model.
Generalized ML Models. Since the target variable, the daily step count, is a numeric value, we formulate the task at hand as a regression problem. Hence, the algorithms used in our work are supervised ML regression algorithms, specifically Ridge Regression (RI), Decision Tree (DT), Random Forest (RF), and Gradient Boosting Regressor (GBR). These algorithms are chosen to evaluate the performance of different approaches representing linear, tree-based, ensemble, and boosting methodologies used in ML for PA prediction. To obtain the experimental results and compare the different algorithms, we first conducted hyperparameter tuning and determined their optimal configuration using GridSearchCV.
Personalized ML Models. On top of the traditional, generalized models, we also adopt a personalized ML paradigm. Personalized ML refers to the creation of ML models that exploit a single individual's data instead of assuming that "one-size-fits-all", building upon previous promising work in the domain of personalized health and well-being analytics [20,21]. In our work, we conduct a proof-of-concept experiment to evaluate WeMoD's performance on an unknown, new user and lay the foundation for further research on this topic. Specifically, we utilize two versions of the dataset with identical features. The first one, the personalized dataset, is based on data from a single participant. Ten percent of this user's data are held out as a test set for evaluation purposes. The second dataset, used for the generalized model, contains data from all the other participants, excluding that user.
In the process of evaluating the performance of the different models and paradigms, the dataset to be used each time is separated into a train/validation (90%) and a test set (10%). Firstly, the performance of each model is evaluated through a time-series-oriented CV process on the training data. Next, each model is assessed on the test set to evaluate its performance on completely unknown data and ensure the robustness of the evaluation results by excluding the possibility of overfitting. The corresponding results of the algorithm selection and the generalized versus personalized experiments are presented in Section IV.
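The evaluation protocol can be sketched with scikit-learn as follows. The data is synthetic, and the parameter grid, the number of CV folds, and the use of `TimeSeriesSplit` as the time-series-oriented splitter are assumptions; the paper specifies GridSearchCV, a 90/10 split, and MAE, but not these exact settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic, chronologically ordered stand-in for the dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 4500 + 800 * X[:, 0] + rng.normal(scale=200, size=200)

# Chronological split: first 90% for training/validation, last 10% for testing.
split = int(len(X) * 0.9)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Hyperparameter search scored by MAE, with folds that never train on the future.
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [20, 50], "max_depth": [2, 3]},
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_mean_absolute_error",
)
search.fit(X_train, y_train)

cv_mae = -search.best_score_  # scikit-learn reports the negated MAE
test_mae = mean_absolute_error(y_test, search.predict(X_test))
```

Comparing `cv_mae` with `test_mae` gives a quick indication of overfitting: a test error far above the cross-validated error suggests the tuned model does not generalize to unseen days.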

IV. RESULTS & DISCUSSION
This section presents our experimentation results and a commentary on our findings regarding data preprocessing techniques' effectiveness in the field of PA prediction and the differences in performance between generalized and personalized ML.

A. Effects of Data Preprocessing on PA Prediction
As discussed in Section III-C, several different approaches regarding feature-group selection, outlier handling, and window size selection were tested in order to obtain the optimal version of the dataset. This section presents our results regarding the experimentation with various data preprocessing techniques for PA prediction from multimodal, heterogeneous data.
Window Size: The window size used in creating the dataset did not have much impact on the PA prediction models' performance. Thus, we used five days (n = 5) of activity as input, leading to the least complex models and the largest pool of available data for training, totaling approximately 10,000 days of user data.
Feature Selection: Concerning feature selection, we organized the various features into four categories, Activity (A), Date (D), Personality (P), and COVID-19 (C19), and tested the chosen regression algorithms' performance on datasets containing different subsets of features. The results (see Table I) indicated that all feature groups contributed positively to the predictive ability of the ML models, reaching a MAE of 2138 steps. Thus, although the exclusion of certain features through feature selection positively impacted the model's efficiency, no feature group as a whole should be excluded from the dataset. In other words, the novel feature fusion used in the WeMoD dataset leads to the best performance of the prediction models. Additionally, we notice that GBR has consistently yielded the best results (the lower the MAE, the better) in the task of PA prediction, and hence we utilize it for the remaining experiments.
Outlier Handling: Outlier handling is vital, especially when in-the-wild data are considered. In our approach, days with exceptionally high step counts are removed from the dataset. The number of days to be removed is defined by a threshold value indicating the percentage of the total number of recordings in the dataset that should be considered outliers. After experimentation, we concluded that the optimal threshold, for which the model's performance is improved while overfitting is avoided, is 2%, with a MAE of 1940 steps in the training set.
Dimensionality Reduction: For a window size of n = 5, the total number of features is 317, and RFE reduced these features to 247. We evaluated our GBR model on the reduced version of a training and a test set containing 247 features and achieved a MAE of 1930 steps in the training set and 1951 steps in the test set. In Figure 2, we present the MAE as obtained through CV for different numbers of features chosen by RFE, where we can see that even fewer than 100 features can yield satisfactory performance. PCA yielded slightly worse performance, with a MAE of 1975 and an optimal number of 57 components. Recalling the previous best performance of the prediction model (Table II), we conclude that feature elimination significantly reduces model complexity without compromising PA prediction performance.

B. Generalized vs. Personalized ML Models
Through this experiment, we assess the performance of the generalized WeMoD model on a new, unknown user and compare this performance to a personalized ML approach. The two models were trained on their respective datasets, as described in Section III-D, and evaluated on their predictions for an unknown test set of 160 days of user data. The personalized model yields a better MAE of 1908 steps, compared to 2282 steps for the generalized model. Having said that, the generalized WeMoD can still identify patterns in the user's step count, even though it has no previous knowledge of any information related to this specific participant. Most likely, with a greater number of users whose data can be utilized for the generalized model's training, the quality of its predictions, even for completely unknown individuals, would be further improved.
Figure 3 presents a plot with the predictions of the two models alongside the actual step count values for 20 days of the unknown test set, which supports these observations.
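The structure of the generalized versus personalized comparison can be reproduced on toy data, as in the sketch below. Everything here is synthetic: the simulated users, their baselines, and the default GBR settings are assumptions, and the target user is deliberately simulated with a different baseline from the others, so the outcome illustrates the protocol rather than the reported result.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)


def simulate_user(n_days: int, baseline: float):
    """Generate hypothetical features and step counts for one user."""
    X = rng.normal(size=(n_days, 4))
    y = baseline + 600 * X[:, 0] + rng.normal(scale=150, size=n_days)
    return X, y


# Users available to the generalized model, and the held-out target user.
others = [simulate_user(150, b) for b in (3000, 4500, 6000)]
X_target, y_target = simulate_user(200, 8000)

# The last 10% of the target user's days form the shared test set.
split = int(len(X_target) * 0.9)
X_test, y_test = X_target[split:], y_target[split:]

# Personalized model: trained only on the target user's earlier days.
personalized = GradientBoostingRegressor(random_state=0)
personalized.fit(X_target[:split], y_target[:split])

# Generalized model: trained on everyone except the target user.
generalized = GradientBoostingRegressor(random_state=0)
generalized.fit(
    np.vstack([X for X, _ in others]),
    np.concatenate([y for _, y in others]),
)

mae_personalized = mean_absolute_error(y_test, personalized.predict(X_test))
mae_generalized = mean_absolute_error(y_test, generalized.predict(X_test))
```

Both models are scored on the same held-out days, which is what makes the two MAE values directly comparable.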

V. CONCLUSIONS & FUTURE WORK
This work has demonstrated the feasibility of an end-to-end approach for step count prediction. The purpose of our research has been two-fold: first, to work with an in-the-wild, heterogeneous dataset that includes a combination of activity-related, personal, and contextual features; second, to develop a model that utilizes this dataset to forecast a user's future daily step count. WeMoD serves as a proof of concept for the feasibility of PA prediction in adaptive goal-setting. Suggesting an appropriate intervention strategy is out of the scope of this research. Nevertheless, such a model could be incorporated in the core of more complex goal-setting approaches and PA interventions to positively alter a user's behavior regarding PA. Specifically, to fulfill C1, we utilized in-the-wild data originating from user activity unrelated to their participation in the ongoing research, which ensured an accurate reflection of an individual's PA levels. For C2, we designed and applied a data integration and preprocessing component to pool and analyze activity data originating from various ubiquitous devices. For C3, we considered a novel combination of physiological, psychological, and contextual features, showing that they can positively contribute to PA prediction. Finally, for C4, we developed and open-sourced a series of ML prediction models capable of forecasting a user's future step count, given a set of features for a sequence of previous days. Our best model achieved a MAE of 1908 steps.
Based on the above, we believe that the objectives of this research have been achieved, although there is still room for improvement through future work. An important issue that we would like to address in the future is the relatively small sample size. A larger population sample would provide more days of recorded activity to train more robust and generalizable ML models. Another future direction is incorporating more feature categories (e.g., user location, weather conditions, or other behavioral traits) in the PA prediction task and assessing their impact on the predictive ability of the respective ML models. In reality, the available feature space that may be directly or indirectly related to a user's daily step count is vast. Thus, any future research that attempts to identify the impact of different features on PA prediction models' performance can be considered a contribution to this field. Additionally, while a thorough experimentation on the topic of personalized vs. generalized approaches is out of the scope of this research, our results provide useful insight and indicate that there is ample room for further work regarding the pros and cons of each approach in the health and well-being domain. Finally, we propose incorporating our model in an intervention application to increase an individual's PA levels. The WeMoD prediction model was designed with this goal in mind and with the hope of contributing to the promotion of a healthier lifestyle. We hope that by open-sourcing our code, other researchers will be encouraged to experiment with the WeMoD approach and adapt it to their needs for a wide variety of problems within the mHealth domain.