First-person activity recognition from micro-action representations using convolutional neural networks and object flow histograms

A novel first-person human activity recognition framework is proposed in this work. Our methodology is inspired by the central role that moving objects play in egocentric activity videos. Using a deep Convolutional Neural Network we detect objects and build discriminant object flow histograms in order to represent fine-grained micro-actions during short temporal windows. Our framework is based on the assumption that large-scale activities are composed of fine-grained micro-actions. We gather all the micro-actions and perform Gaussian Mixture Model clustering, so as to build a micro-action vocabulary that is later used in a Fisher encoding scheme. Results show that our method reaches a 60% recognition rate on the benchmark ADL dataset. The capabilities of the proposed framework are further showcased by a thorough evaluation over a wide range of hyper-parameters and a comparison with other State-of-the-Art works.


Introduction
The continuous rise of the video format as a medium for communication has brought a digital video revolution to the modern connected world. It is safe to say that it has now surpassed the popularity of image and text formats, judging by the countless online multimedia platforms that support it and the amount of video clips that web pages are filled with daily. The use cases are endless: from do-it-yourself tutorials to marketing and live event broadcasting, many popular public video repositories contain massive amounts of video content. It is not only the attractive combination of auditory and visual content that makes the medium popular, but also the technology of modern wearables, which lets seemingly every single person carry a tiny video camera at all times, plus the convenient ways that exist for videos to end up posted online for immediate consumption on social media.
In most of the videos uploaded online, humans are the center of attention and the thematic content is in one way or another moving around the activities that they perform. Multimedia processing and computer vision researchers have shown much interest in the exploitation of those huge databases. The proposed solutions can address the needs of several real life applications, such as video surveillance and security applications, human behavior understanding, video indexing and retrieval, human-machine interaction, etc.
In this work, we process videos captured by wearable devices and, more precisely, we focus on the recognition of the human activities such videos contain. Besides the use of wearables as entertainment devices, mostly in outdoor environments, this technology can also be used effectively to monitor the indoor activities of patients. Patients with a critical disease are often required to live in their own homes, as nursing homes and hospitals cannot accommodate them on their premises for too long. However, for some of them it is essential that a doctor or a carer continues to monitor their health and keeps a log of their behavior over time. Thus, this work is mostly motivated by the need to efficiently recognize human activities of daily living that are captured by wearable cameras in indoor environments.
Much attention has been drawn to generic human activity recognition, which captures the human subjects from distant cameras, or third-person viewpoints, but the first-person activity recognition field is relatively understudied. There are several challenges when dealing with the task of first-person activity recognition, the main one being the lack of human actors in the field of view. The severe distortions that may also appear in egocentric videos, like field-of-view distortions and ego-motion from the user's movements, may negatively impact the process of extracting meaningful representations of activities using the State-of-the-Art methods proposed for third-person activity recognition. First-person video datasets have recently emerged [31,36,38], as well as first-person activity recognition challenges [9], calling for more interest in the subject.
With human movements out of the field of view, most of the related work focuses on either human hand movements that may be still in the frame, or hand-object interactions [3,57]. Many of the recent works also employ Deep Convolutional Neural Networks (CNNs) to study those interactions, as well as multiple CNNs that are tailored to complete a certain task in the whole framework [51,57]. Our contribution lies close to the object-centric approaches, but we also focus on efficiency in all the steps of our methodology in order to recognize activities while maintaining feasible computational times. To this aim, we build upon a previous work of ours [18] by exploring a more appropriate dimensionality reduction scheme that exploits the sparsity of our representations to reduce the computational complexity during training whilst achieving comparable results. Additionally, we explore in this work the impact of motion compensation to our low-level descriptors.
In contrast with most of the previously published works, we not only detect relevant objects but also extract their individual motion patterns using object flow histograms. Moreover, aggregating the motion features by class over short temporal windows allows us to build discriminative representations that relate directly to the object manipulation patterns. In addition, we encode those patterns in a binning framework to understand their usage in short-term actions, which are fundamental building blocks of long-term activities. Unlike other State-of-the-Art (SoA) works, our current implementation does not rely on any hand movement information at all or other modalities, like sensory equipment or gaze information.
The rest of this paper is structured as follows. In Section 2, we present the state-of-the-art related work, while in Section 3, we show our proposed methodology. Section 4 describes our experimental work and evaluation results are included. Finally, conclusions are drawn in Section 5.

Related work
In this section we examine the previous work on activity recognition, and briefly on object detection, as it is an integral part of our approach.

Activity Recognition
Previous works on activity recognition can be categorized based on the appearance of human actors that perform the activities, or the lack thereof. For the first category, where human actors appear in the activity clips, most of the pre-CNN era works dealt with motion analysis using optical flow and the analysis of dense trajectories [1,24,25,49,50], which involved extracting classic low-level descriptors such as Histograms of Oriented Gradients (HOG), Histograms of Optical Flow (HOF) and Motion Boundary Histograms (MBH) to represent the visual and motion features around keypoints. Towards more temporally-aware methods, others have chosen to model activities as sequences of sub-actions, focusing on the temporal structure of visual patterns [17,39]. Shortly after the CNN impact, the direction was steered naturally towards deep learning approaches. Amongst the most popular were multi-stream CNNs [15,45,51], which work by feeding various modes of video frames, mainly an RGB channel and an optical flow channel, in order to extract deep CNN visual and motion features and represent activities based on a fusion of the two. In a later work, information from the actor's pose was more effectively captured using pose-based CNN features [4]. Other methods that appeared later rely on RNNs to more accurately model the temporal dynamics of activities. More specifically, the modern technique of deep visual attention was combined with RNNs in [42] and [11]. More recent works focused on refining existing techniques, as in [44], where optical flow derivation features (OFF) were plugged into existing CNN-based action recognition schemes. In [14] the two-stream CNN approach was modernized by injecting residual connections, while the more recent TSN framework [52] works on sampled video segments with the aim of modeling long-range temporal structures more efficiently.
Other approaches have been proposed for first-person activity recognition where, in contrast to the previous methods, human actors cannot be seen performing the activities. Human hands or lower body parts are naturally the only information we can get as far as the actor's movements are concerned. Therefore, in the context of representing activities of daily living, object manipulation and hand movements are the main sources of visual information. As such, many of the works in this category propose to describe activities in an object-centric manner, following the information that derives from the existence of specific objects in the scene [13,34,38,57]. Moreover, scene understanding is also used in [47]. In [54] a multi-task clustering framework tailored to first-person view (FPV) activity recognition is presented. Another more recent approach is to use deep CNN architectures [53] to learn deep appearance and motion cues. Deep CNNs are also used to learn hand segmentations in order to understand the activities that a user performs and their interactions with other users that might also appear in the video frame [2,3,57]. More recent works focus on multi-modal analysis of egocentric cameras and information from other wearable sensor equipment with the deployment of early or late fusion schemes [5,6,35].

Object detection
Since our method is heavily dependent on object detection, we provide here a brief overview of the literature on this subject as well. Several works have been proposed to solve this task with outstanding results on challenging datasets [10,12,30,32]. Deep CNNs were thoroughly examined for this task after the breakthrough of the seminal work of [19], which deployed a proposal generation step [46] to feed the network. Later, the bounding box proposal network was incorporated into an end-to-end deep architecture in Faster R-CNN [41], achieving better performance and faster prediction during testing. Others have focused on deep end-to-end single-shot detectors that predict classes and box coordinates directly from the last convolutional feature maps, like the SSD [33] and YOLO [40] detectors. Since then, more works have been proposed that focused on faster detection times, as in [7] and [28], based on sharing convolutions across multiple layers. The most recent high-performance detectors were based on minor tweaks to vanilla models, but have nevertheless produced significant performance boosts [26,43,55,56,58].

Methodology
In this section we take a closer look at how our proposed framework processes the activity clips. The motivation behind the modeling of micro-actions is first discussed, and then a detailed description for each processing stage is given.
Activities of daily living such as "book reading", "hand washing" or "preparing breakfast" usually take place in long segments inside an egocentric video, lasting on average a couple of minutes. Instead of trying to capture long-term dynamics directly, it is preferable to get a deep understanding of the lower-level actions the actors perform in order to accomplish the large-scale activities. For example, the activity "preparing breakfast" involves the fine-grained actions "opening the fridge", "grabbing butter", "closing the fridge", "taking a knife", "spreading the butter", etc. This group of micro-actions, as we call them, does not always need to form a complicated sequence for every activity. For example, the activity "reading a book", apart from the actual "reading" activity, usually involves one micro-action performed repeatedly, i.e., "turning the page". For those reasons, we seek a way of extracting a representation of the full duration of an activity clip which will be informative about the set of micro-actions that are involved and have a strong ability to uniquely describe the activity.
It is very well established in the literature [13,34,38,57] that every activity is related closely to a group of active objects and a group of passive objects. The first group contains objects that are handled by the person during the activity, and the second group contains objects that are simply within the view of the camera when the activity is performed. Objects are good indicators of certain activities such as the TV in the "watching television" activity or the book in the "reading a book" activity. We further elaborate this notion by hypothesizing that not only the presence, but also the characteristic motions of the objects in the scene are powerful enough to discriminate between active and passive ones. For example, motion information from dishes that are being washed combined with the presence of a tap in the scene can uniquely describe the "washing dishes" activity.
The overall framework is shown in Fig. 1. The above assumptions are taken into account in our activity recognition method. First, we detect objects using a deep CNN architecture that combines a deep feature extraction network and a bounding box coordinate regression network, which predicts object classes and locations in the video frames. We combine the powerful detector with a tracking algorithm, eliminating the need to run the deep architecture for every frame, in order to achieve near real-time object detection. Then, every detected object's motion is processed using HOF or MBH [8] features, so as to form the lower-level micro-action representations that appear in short time windows over the full activity sequence. The resulting micro-action descriptors go through a dimensionality reduction step, which keeps the representations compact with minimum loss of information. Finally, Gaussian Mixture Modeling (GMM) is used for clustering in order to extract prototype micro-actions, finding the most discriminative of the full set. Given a set of micro-action descriptors extracted for a single activity sequence and the GMM cluster centers, a Fisher encoding schema is used to yield the final descriptor of the full activity sequence in a Bag-of-Micro-Actions type of representation.

Object detection
In order to detect the activity-related objects in the egocentric videos, we chose to extract deep image representations and predict pixel coordinates of bounding boxes using a deep CNN object detector (Fig. 1 shows the block diagram of the proposed methodology). To this end, we adopt a modification of the Faster R-CNN which was originally proposed in [41]. A thorough evaluation of this model and comparisons with other SoA deep object detectors, presented in [23], reveal that the Faster-RCNN-resnet101 architecture achieves a good trade-off between speed and accuracy. This model incorporates the resnet101 [21] deep feature extractor and a region proposal network, along with a bounding box classifier and coordinate regressors. In order to make the object detection procedure more efficient during inference time, we find it useful to track the detected objects found in a frame over the next T frames of the video, instead of running the detector for each single frame. Since we do not expect dramatic cuts during an activity clip, we still manage to get good quality detections at far greater speeds. By assigning a detection rate of T = 15 our combined detector and tracker algorithm achieves near real-time performance. We manually set the detection rate parameter to 15 following empirical evaluation after trials with other values ranging from 5 to 30. Intuitively, the detection rate defines the temporal resolution of the continuous object detection function. A lower detection rate means higher temporal resolution of the detector, and vice versa. For a usual 30 fps video, setting the detection rate to 15 means the detector only operates at half-second intervals and the tracker works the rest of the time. This is expected to yield adequate temporal resolution, considering that it is very unlikely that an object will appear and disappear in less than half a second. The core functionality of our tracker is based on the KCF tracking algorithm that was proposed in [22].
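The detect-every-T-frames scheme can be sketched as follows. This is a minimal sketch with stub detector and tracker objects of our own devising: a real implementation would run the Faster R-CNN model inside `detect()` and, for instance, OpenCV's KCF tracker (`cv2.TrackerKCF_create`) inside `update()`; those stub names and the toy box format are illustrative assumptions, not the authors' code.

```python
# Sketch: run the (expensive) detector every T frames and a cheap
# tracker in between. Stub classes stand in for Faster R-CNN and KCF.

class StubDetector:
    def detect(self, frame):
        # Placeholder for a Faster R-CNN forward pass.
        return [("mug", (10, 10, 50, 50))]

class StubTracker:
    def __init__(self):
        self.boxes = []
    def init(self, frame, boxes):
        # Placeholder for initializing one KCF tracker per box.
        self.boxes = boxes
    def update(self, frame):
        # Placeholder for propagating each box to the current frame.
        return self.boxes

def process_video(frames, detector, tracker, T=15):
    """Return per-frame boxes; also record which frames ran the detector."""
    per_frame, detect_frames = [], []
    for i, frame in enumerate(frames):
        if i % T == 0:                      # detector fires every T frames
            boxes = detector.detect(frame)
            tracker.init(frame, boxes)
            detect_frames.append(i)
        else:                               # tracker fills the gap
            boxes = tracker.update(frame)
        per_frame.append(boxes)
    return per_frame, detect_frames
```

For a 30 fps clip, `T = 15` makes the deep detector run only twice per second, which is where the reported near real-time throughput comes from.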

Micro-action representation
Our method builds representations of short, fine-grained actions, of fixed temporal window W , from the motion patterns of the objects that are found in this window. More specifically, we first compute a dense optical flow field, to extract the full scene's motion between two consecutive frames. We use the OpenCV implementation of the dense inverse search algorithm proposed in [29]. In addition, doing the calculation every other frame inside the window, instead of every frame, leads to W/2 calculations which yields faster computation times. Having already detected the objects in a particular frame we take each bounding box as our region of interest and crop the dense optical flow map accordingly, taking only the portion that belongs to the object. Consequently, we can calculate HOF (histograms of optical flow) descriptors that represent an object's motion.
To calculate an object's HOF descriptor we apply a 2×2 uniform grid on top of the bounding box region. For each one of the 4 cells, flow orientations are quantized into an 8-bin histogram weighted by their magnitude values. In addition, we chose to apply a soft binning method that distributes the votes between adjacent bins, based on the distances of the values from the adjacent bin centers. This procedure results in a 32-dimensional motion descriptor that is extracted for every object in the scene. If multiple objects from the same class appear in one frame, we aggregate the vectors and divide by the number of objects, so as to get the average motion descriptor of that particular class. In the case of absence of any objects from a particular class, the corresponding HOF descriptor is set to the zero vector. Let C be the number of classes the detector can predict and N_c the number of objects found for class c. The early object class motion descriptors are formed as follows:

h_c = (1 / N_c) Σ_{i=1}^{N_c} h_{c,i},   (1)

where h_{c,i} is the 32-dimensional motion descriptor of the i-th detected object of class c. By concatenating the L2-normalized motion descriptors for each class we get a complete description for a pair of consecutive frames in the window W:

d = [h_1 / ||h_1||_2, . . . , h_C / ||h_C||_2].   (2)

Finally, we concatenate those descriptors throughout the W/2 frame pairs to get a complete representation of a micro-action composed of the objects' movement patterns that appeared in the window:

M = [d_1, d_2, . . . , d_{W/2}].   (3)

One problem with the accurate extraction of object motion from egocentric videos is that very frequently the wearable camera moves along with the person wearing it. As a result, ego-motion may overpower the delicate dynamics of the objects' motion that we are trying to capture. Therefore, we consider an alternative to the HOF descriptor, the MBH descriptor, where the optical flow field is first separated into its x and y components and spatial derivatives are computed for each one of them.
This time we obtain a 32-dimensional descriptor for each component (64-dimensional after concatenation) following the same procedure to obtain the final descriptor as in the previous case. Because MBH is the gradient of the optical flow, any motion that is happening constantly (global motion) is suppressed and only information about changes in the flow field (i.e., motion boundaries) is kept [48].
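The per-cell soft-binned histogram, the 2×2-grid HOF descriptor, and the MBH variant can be sketched in numpy as follows. This is an illustrative implementation under our own assumptions (bin layout, linear vote splitting, gradient computation via `np.gradient`), not the authors' exact code.

```python
import numpy as np

def soft_hof_cell(flow_cell, n_bins=8):
    """8-bin orientation histogram of a flow patch, magnitude-weighted,
    with each vote split linearly between the two nearest bins."""
    fx = flow_cell[..., 0].ravel()
    fy = flow_cell[..., 1].ravel()
    mag = np.hypot(fx, fy)
    ang = np.arctan2(fy, fx) % (2 * np.pi)       # orientation in [0, 2*pi)
    pos = ang / (2 * np.pi) * n_bins             # fractional bin index
    lo = np.floor(pos).astype(int) % n_bins
    frac = pos - np.floor(pos)
    hist = np.zeros(n_bins)
    np.add.at(hist, lo, mag * (1.0 - frac))      # soft binning: split the
    np.add.at(hist, (lo + 1) % n_bins, mag * frac)  # vote between neighbors
    return hist

def hof_descriptor(flow_box, n_bins=8):
    """32-dim HOF of a cropped flow map: 2x2 grid, 8 bins per cell."""
    h, w = flow_box.shape[:2]
    cells = [flow_box[i * h // 2:(i + 1) * h // 2,
                      j * w // 2:(j + 1) * w // 2]
             for i in range(2) for j in range(2)]
    return np.concatenate([soft_hof_cell(c, n_bins) for c in cells])

def mbh_descriptor(flow_box, n_bins=8):
    """64-dim MBH: HOF-style histograms of the spatial gradients of the
    x and y flow components, which suppresses constant (global) motion."""
    parts = []
    for comp in (flow_box[..., 0], flow_box[..., 1]):
        gy, gx = np.gradient(comp)               # spatial derivatives
        grad = np.stack([gx, gy], axis=-1)       # treat gradient as a field
        parts.append(hof_descriptor(grad, n_bins))
    return np.concatenate(parts)
```

A constant flow field has zero spatial gradients, so its MBH vanishes while its HOF does not, which is exactly the ego-motion suppression argued above.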
Given that there are C possible object classes, the dimensionality of a micro-action descriptor is given by (W/2) × C × 32 for HOF and (W/2) × C × 64 for MBH. It is expected that the dimensionality increases dramatically for a high number of object classes or longer windows W. Moreover, the descriptors can be very sparse because of the total absence of certain object classes, for which the respective values are set to zeros. Therefore, we proceed with two alternative approaches to the dimensionality reduction stage, while at the same time we exploit the sparsity in a way that yields lower computational complexity.

Dimensionality reduction
The high dimensionality of our micro-action descriptors severely affects the computational burden which we intended to alleviate through dimensionality reduction approaches. For dimensionality reduction two approaches were adopted, i.e., Principal Component Analysis (PCA) and random projections (RP).
PCA projects the data onto a lower-dimensional orthogonal subspace that captures as much of the variation of the data as possible. Because the PCA approach is quite expensive to compute for high-dimensional data sets, we also investigate a computationally simpler method of dimensionality reduction that does not introduce a significant distortion in the dataset, i.e., the RP approach. In RP, the original high-dimensional data X ∈ R^{n×D} is projected onto a lower-dimensional subspace using a random matrix whose rows have unit lengths. More formally, using matrix notation, where X ∈ R^{n×D} is the original set of n D-dimensional observations,

Y = XR,   (4)

where R ∈ R^{D×d} is the random matrix and Y ∈ R^{n×d} is the projection of the data onto the lower d-dimensional space. The fundamental idea of random projection arises from the well-celebrated Johnson-Lindenstrauss lemma [27], which states that if point instances in a vector space are projected onto a randomly selected subspace of appropriate dimension, then distances are approximately preserved [16]. In this work we use a random matrix whose elements are Gaussian distributed with zero mean and unit variance.
Due to its computational simplicity and our sparse feature vectors, RP is ideal for the dimensionality reduction task in this work. In particular, the aforementioned random projection procedure is of order O(Ddn) and, taking into account that X is in our case sparse (assuming l nonzero entries per row), the complexity is of order O(ldn) [37].
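A Gaussian random projection of a sparse descriptor matrix can be sketched in a few lines of numpy. The dimensions below are shrunk for brevity (the paper's descriptors are 30720-dimensional or more, projected to d = 1000 and up) and the 1/sqrt(d) scaling is one common normalization convention, so treat the numbers as illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_projection(X, d):
    """Project n x D data onto d dimensions with a Gaussian random matrix.
    Scaling by 1/sqrt(d) approximately preserves Euclidean distances
    (Johnson-Lindenstrauss); cost is O(l*d*n) for l nonzeros per row."""
    D = X.shape[1]
    R = rng.standard_normal((D, d)) / np.sqrt(d)
    return X @ R

# Sparse toy data: 50 samples, 5000 dims, about 1% nonzero entries.
n, D, d = 50, 5000, 500
X = np.zeros((n, D))
mask = rng.random((n, D)) < 0.01
X[mask] = rng.standard_normal(mask.sum())

Y = random_projection(X, d)
# Pairwise distances survive the projection up to a small relative error.
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
rel_err = abs(proj - orig) / orig
```

The typical distortion shrinks roughly as 1/sqrt(d), which is why the larger RP settings (d = 2500 to 5000) close most of the gap to PCA.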

Activity recognition
For a given activity sequence, the extraction of micro-action descriptors, each representing a small sequence of W frames, takes place with a stride of S frames. We chose that value to correspond to exactly 1 second in all our experiments. This simply means that after every micro-action descriptor M we skip 1 second into the video before we begin extracting the next micro-action descriptor. Contrary to using overlapping windows, the stride parameter was introduced to give our method a speed boost. The micro-action descriptors are extracted from fixed-length temporal windows W. In contrast, the lengths of the activity clips are not expected to be constant. Therefore, the number of micro-action descriptors that are formed can vary, depending on an activity's duration and the length of W. Given that the micro-action window W is chosen sufficiently small, it can be guaranteed that enough micro-actions will be formed for an activity sequence to be adequately represented.
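Under our reading of the stride description (each W-frame window is followed by a 1-second skip before the next window starts; the paper's wording admits other readings), the window start positions and the per-clip micro-action count can be sketched as:

```python
def micro_action_starts(num_frames, W, S):
    """Start frames of micro-action windows: each W-frame window is
    followed by an S-frame skip before the next one begins."""
    starts, t = [], 0
    while t + W <= num_frames:       # window must fit inside the clip
        starts.append(t)
        t += W + S                   # advance past the window plus the skip
    return starts

# A 2-minute clip at 30 fps, with W = 90 (3 s) and S = 30 (1 s skip):
starts = micro_action_starts(120 * 30, W=90, S=30)
```

This makes concrete why longer clips yield proportionally more micro-action descriptors while short clips still produce several.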
All micro-action descriptors extracted from all the training activity sequences are fed into a Fisher encoding schema. This way, a micro-action vocabulary based on the most discriminating ones is constructed. The computation of the most discriminating samples is performed by applying unsupervised clustering, using Gaussian Mixture Modeling, in the micro-action representation hyperspace.
Let {μ_j, Σ_j, π_j ; j = 1, . . . , L} be the set of parameters of L Gaussian models, with μ_j, Σ_j and π_j standing respectively for the mean, the covariance and the prior probability weight of the j-th Gaussian. Assuming that the D-dimensional early descriptor is represented as M_i ∈ R^D, i = {1, . . . , N}, with N denoting the total number of descriptors, Fisher encoding is then built upon the first and second order statistics:

f_{1j} = (1 / (N √π_j)) Σ_{i=1}^{N} q_{ij} (M_i − μ_j) / σ_j,
f_{2j} = (1 / (N √(2π_j))) Σ_{i=1}^{N} q_{ij} [((M_i − μ_j) / σ_j)^2 − 1],   (5)

where σ_j^2 is the diagonal of Σ_j and q_{ij} is the Gaussian soft assignment of descriptor M_i to the j-th Gaussian, given by:

q_{ij} = π_j N(M_i; μ_j, Σ_j) / Σ_{k=1}^{L} π_k N(M_i; μ_k, Σ_k).   (6)

The statistics calculated by (5) are next concatenated to form the final 2LD-dimensional Fisher vector, F_X = [f_{11}, f_{21}, . . . , f_{1L}, f_{2L}], that characterizes each activity sequence. The final Fisher encoding for a specific activity sequence can now be classified using an SVM or a Neural Network classifier.
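A minimal numpy sketch of this Fisher encoding, assuming diagonal-covariance Gaussians and our own variable names (the GMM parameters below are random stand-ins for a trained vocabulary):

```python
import numpy as np

def fisher_vector(M, means, variances, priors):
    """Fisher-encode N descriptors M (N x D) against an L-component GMM
    with diagonal covariances (L x D) and priors (L,). Returns 2*L*D dims."""
    N, D = M.shape
    L = means.shape[0]
    sigma = np.sqrt(variances)                                # (L, D)
    # Soft assignments q_ij via log-domain Gaussian responsibilities.
    diff = (M[:, None, :] - means[None]) / sigma[None]        # (N, L, D)
    log_p = -0.5 * (diff ** 2 + np.log(2 * np.pi * variances)[None]).sum(-1)
    log_w = np.log(priors)[None] + log_p                      # (N, L)
    log_w -= log_w.max(axis=1, keepdims=True)                 # stabilize exp
    q = np.exp(log_w)
    q /= q.sum(axis=1, keepdims=True)                         # (N, L)
    # First and second order statistics per Gaussian.
    f1 = (q[..., None] * diff).sum(0) / (N * np.sqrt(priors)[:, None])
    f2 = ((q[..., None] * (diff ** 2 - 1)).sum(0)
          / (N * np.sqrt(2 * priors)[:, None]))
    # Interleave as [f_11, f_21, ..., f_1L, f_2L].
    return np.concatenate([np.r_[f1[j], f2[j]] for j in range(L)])

rng = np.random.default_rng(1)
L, D, N = 32, 256, 40            # vocabulary size, reduced dims, windows
fv = fisher_vector(rng.standard_normal((N, D)),
                   rng.standard_normal((L, D)),
                   np.ones((L, D)),
                   np.full(L, 1.0 / L))
```

With L = 32 Gaussians over 256-dimensional reduced descriptors, the resulting activity vector has 2 × 32 × 256 = 16384 dimensions, which the downstream classifier consumes.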

Experimental evaluation
In this section, we first describe the experiments that we conducted so as to select the best hyper-parameters for our activity recognition algorithm, while also comparing the performance of different descriptor options (HOF, MBH). The performance of our object detection and tracking algorithm is presented as well. Additionally, we extend our experimental work by studying alternative dimensionality reduction techniques in order to examine the validity of our assumptions. Furthermore, we applied camera ego-motion compensation, as in [20], to examine the improvement it may bestow upon our best models for both descriptors. We accumulated and present activity recognition results for each class, in the form of confusion matrices, and examine how each class performs depending on object detection performance. Finally, we present a comparison of our framework with other SoA works in terms of performance on the ADL dataset in order to prove the applicability of our method.

Dataset
We performed our experiments on the ADL dataset [38]. It is composed of videos recorded with a wearable camera by 20 different persons. The videos contain realistic scenes of daily living and the benchmark is challenging due to the existence of global camera motion. The objects are also in many cases occluded. From the 48 different classes of objects that are available, we select the 34 most frequently annotated ones to train our object detector. We also select a subset of 18 activity classes, as in [57], to train our activity recognition framework, so as to present comparable results with previous works. The taxonomy of the activity classes is given in Fig. 2. The activity classes can be divided into three major sub-categories.

Object detection
To train our object detector we used only the first 6 videos, since this is the typical way of splitting the dataset and is reported in previous works. During testing we set the detection rate to 15 frames and track the detected boxes, managing to achieve detection inference time at a rate of 12 fps on average. Our object detector achieves an overall 26.9% mAP on the 14 remaining test videos. A detailed performance evaluation per object category is shown in Table 1. The detector performs very well on several classes for which many annotated samples are provided in the training set (over 1000). However, it performs poorly on small objects like "towel" or "pills" and it even yields 0% mAP on three object classes. Small items that are handled by the actors are expected to be heavily occluded in comparison with big static objects such as a TV, an oven or a microwave. Figure 3 shows qualitative detection results in test frames of the ADL Dataset.

Hyper-parameter selection
We experimented with two different durations for the temporal window: W = 90 and W = 60 frames. Those two values correspond to 3 seconds and 2 seconds respectively for the videos of the ADL Dataset, which were recorded at 30 fps. Considering that the average activity duration is in the order of minutes in this dataset, we manage to get enough micro-action descriptors assigned to each activity and simultaneously capture more complex object motions through time. Furthermore, we show that micro-actions of 3 or 2 seconds are long enough for our method to perform close to SoA levels. The two choices for our temporal window W prove to be convenient for algorithmic speed considerations as well.
The length of the micro-action descriptors before dimensionality reduction is (W/2) × 34 × 32 for the HOF descriptor and (W/2) × 34 × 64 for the MBH descriptor. At this stage, we use only two options for dimensionality reduction, PCA with 80 or 256 components, so as to focus on evaluating the other hyper-parameters. Later, we perform an extended study of various modes of dimensionality reduction on the most promising model configurations.
We also experiment here with two different vocabulary sizes, using 32 or 64 Gaussians. For the final stage, we deploy as our classifier, a fully connected neural network (NN1) with a depth of two layers, of width 512 and 256 accordingly, using RELU activations, 50% chance of dropout between layers and softmax activation in the output layer. Another similar architecture (NN2) was also deployed with half the amount of neurons for each layer (256 in the first layer and 128 in the second) and a linear SVM classifier as a third option for the sake of classifier comparison.
To evaluate the action recognition performance as in [57], we performed the leave-one-person-out cross-validation strategy for every hyper-parameter combination and we report the mean average precision (mAP) and standard deviation. Tables 2 and 3 present our scores for every experiment analytically. As shown, choosing 256 components in PCA results in a performance boost when combined with a larger temporal window. Choosing 80 components resulted in better performance in some cases of the shorter temporal window. However, as shown later, those two thresholds are limiting and more PCA or RP components lead to better performance overall. Increasing the size of the GMM vocabulary from 32 to 64 failed to improve our results, especially when using the shorter temporal window. This shows that a smaller vocabulary consisting of 32 words is enough to get good coverage of the most discriminant micro-actions of the entire dataset. Finally, we can see that the MBH descriptor almost entirely outperformed the HOF descriptor for every experiment with a temporal window of 60 frames and that the performance of the two was comparable for a window of 90 frames. This is an indication that MBH has more to offer when micro-action extraction is more refined in time. Overall, the best models came from the combination of 256 PCA components coupled with a GMM vocabulary of size 32 and the neural network architecture with the most learnable parameters (NN1).

Dimensionality reduction
In this section, we keep the best models' parameters fixed, i.e., a GMM vocabulary of size 32 and the NN1 classifier, and experiment with different dimensionality reduction techniques, for all the different descriptors and window sizes. The same evaluation scheme is applied in this section as well, i.e., leave-one-person-out cross-validation. First, we investigate selecting 1000 PCA components instead of 80 or 256, in order to explore the capabilities of our method for higher-dimensional micro-action vectors. When reducing from tens of thousands of dimensions to only 256 components, it is possible that only a small portion of the dataset variance can be explained; thus the reduction step can become a bottleneck to the overall performance. In Table 4 the results indeed show significant improvements in performance for all settings, ranging from 3% to 5% mAP.
However, as discussed earlier, PCA is rather expensive to compute for high-dimensional data, mainly during training time. Especially in our case, the micro-action descriptors, depending on the setting, are at least 30720-dimensional and up to 92160-dimensional vectors before the reduction stage. For those reasons, we chose to experiment with RP using 4 different settings: d = 1000, d = 2500, d = 3500 and d = 5000. Table 5 shows the results. We can see that with Random Projections close to 3500 components, the method scores either comparably to or surpasses PCA's lower-component settings (80, 256). Considering that all RP experiments took less time to run, there exists a performance/speed trade-off when choosing one or the other during training. For full performance gain, it is evident that the 1000 PCA component setting is the ideal one.

Motion compensation
Heavy ego-motion may appear between video frames of the ADL Dataset, as a result of the person moving around while performing the actions. As such, the real movement of objects may be overpowered by camera motion. We have already established the MBH descriptor as the best choice over HOF for activity recognition on the ADL dataset, along with the hyper-parameters that lead to the best performance. In this section, we apply an ego-motion compensation technique before the calculation of the descriptors, so as to determine its impact on the overall performance.
Ours - Bag-of-Micro-Actions with HOF (best): 58.3%
Ours - Bag-of-Micro-Actions with MBH (best): 60.1%

Best performance indicated by bold entries
Having already computed the dense optical flow field, we randomly select 1000 points in the image along with their displacement vectors and feed them to a RANSAC estimator of a projective transformation (a 3 × 3 homography) between consecutive frames. The set of inlier samples can then determine the camera displacement: we take the mean displacement of the inlier samples in each direction (x and y) and subtract it from the optical flow field. The compensated optical flow field is then used to calculate compensated versions of the HOF and MBH descriptors. The performance comparison of the ego-motion-compensated versions of our best performing HOF and MBH descriptors is shown in Table 6. The impact of motion compensation on the HOF descriptor is positive, yielding an additional performance improvement of 1.5% mAP; however, it still cannot surpass the best performing MBH descriptor. Conversely, a slight drop in performance for the compensated MBH indicates that it may not benefit from motion compensation.

Fig. 4 Confusion matrix of our activity recognition method with HOF descriptors
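The compensation step can be sketched as follows. Note this is a simplified illustration: we RANSAC-fit a pure-translation model instead of the 3 × 3 homography used in our pipeline (which can be estimated with, e.g., cv2.findHomography), and the flow field below is a synthetic toy example.

```python
import numpy as np

def compensate_flow(flow, n_points=1000, iters=100, thresh=2.0, seed=0):
    """Subtract the dominant camera motion from a dense flow field.

    Simplified sketch: RANSAC with a pure-translation model, then
    subtraction of the mean inlier displacement, as described in the text.
    flow: (H, W, 2) dense optical flow field.
    """
    rng = np.random.default_rng(seed)
    h, w, _ = flow.shape
    ys = rng.integers(0, h, n_points)
    xs = rng.integers(0, w, n_points)
    vecs = flow[ys, xs]                          # sampled displacement vectors
    best_inliers = np.zeros(n_points, dtype=bool)
    for _ in range(iters):
        cand = vecs[rng.integers(0, n_points)]   # hypothesis: camera shift
        inliers = np.linalg.norm(vecs - cand, axis=1) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    cam = vecs[best_inliers].mean(axis=0)        # mean inlier displacement
    return flow - cam, cam

# Toy field: a uniform camera shift of (3, -2) plus one small moving object
flow = np.tile(np.array([3.0, -2.0]), (120, 160, 1))
flow[40:60, 50:80] += np.array([5.0, 4.0])       # object motion on top
comp, cam = compensate_flow(flow)
```

After compensation, the background displacement vanishes while the object's own motion survives, which is exactly the signal the HOF and MBH descriptors are meant to capture.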

Comparison with State-of-the-Art
In Table 7, we compare the accuracy rates of our best models, namely the HOF and MBH variants with 32 GMM words and 1000 PCA components, to those reported in the literature. As already mentioned, we followed the evaluation procedure of [57] in order to present comparable results. As we can see, the MBH version of our method outperforms every other approach. The motion-compensated HOF descriptor is also highly ranked.

Per-class evaluation of our activity recognition framework
Next, we select our top two models (one for each descriptor) and train them on the first 6 videos of the dataset. We present the test-set confusion matrices in Figs. 4 and 5. As we can see, MBH performed better than HOF in most of the classes where heavy camera motion is expected, such as the "washing dishes" or "drinking water" activities, because it approximates a compensated motion and proves more appropriate when wearable cameras are used. As expected, our framework is highly dependent on the performance of the object detector. Performance drops in instances that involve interactions with smaller objects, usually in hygiene activities, while activities that involve the better performing object classes, like "watching tv" or "using computer", have higher recognition rates. Confusion exists between the "making tea" and "making coffee" classes because they almost always involve interactions with the same object classes. A similar example is the confusion between the "combing hair", "brushing teeth", and "dental floss" classes, which all take place inside a bathroom with the same objects visible from the camera. This indicates the need to directly address active vs. passive object recognition.
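For reference, a per-class confusion matrix of the kind shown in Figs. 4 and 5 can be computed as below; the labels are hypothetical and only illustrate the "making tea" vs. "making coffee" confusion discussed above.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Row i counts how often true class i was predicted as each class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Hypothetical labels: "making tea" (0) often predicted as
# "making coffee" (1), while "watching tv" (2) is well recognized
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 0, 1, 1, 2, 2, 2]
cm = confusion_matrix(y_true, y_pred, 3)
```

Off-diagonal mass concentrated between two rows, as in the first two classes here, is the signature of systematically confused activity pairs.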

Conclusions
In this paper, we introduced a new approach for activity recognition from wearable cameras by detecting objects and then incorporating their motion patterns into low-level micro-action descriptors. We represented activities using a Bag-of-Micro-Actions schema, based on GMM clustering and Fisher vector encoding. Comparison with SoA techniques on the ADL dataset validates the competitiveness of our approach.
Our future steps will be to develop an object detection algorithm that discriminates between active and passive objects so as to weight them differently and to leverage hand movements and include gesture patterns into the overall framework. Additionally, we plan to incorporate Deep Neural Networks at another stage in our framework, so as to replace the Fisher encoding schema and model temporal dependencies of active object movements using LSTMs. Finally, evaluation on newer activity recognition datasets is also a future target.
Ioannis Kompatsiaris is a Senior Researcher (Researcher A') with the Information Technologies Institute / Centre for Research and Technology Hellas and the Head of the Multimedia Knowledge and Social Media Analytics Laboratory. His research interests include semantic multimedia analysis, indexing and retrieval, social media and big data analysis, knowledge structures, reasoning and personalization for multimedia applications, eHealth, security and environmental applications. He received his Ph.D. degree in 3-D model based image sequence coding from the Aristotle University of Thessaloniki in 2001. He is the co-author of 129 papers in refereed journals, 46 book chapters, 8 patents and more than 420 papers in international conferences. He has been the co-organizer of various international conferences and workshops and has served as a regular reviewer, associate and guest editor for a number of journals and conferences.