Modeling Human Motor Skills to Enhance Robots’ Physical Interaction

The need for user safety and technology acceptability has grown considerably with the deployment of co-bots that physically interact with humans in industrial settings and in assistive applications. A well-studied approach to meeting these requirements is to ensure human-like robot motions and interactions. In this manuscript, we present a research approach that starts from the understanding of human movements and derives useful guidelines for the planning of arm movements and for the learning of skills for the physical interaction of robots with the surrounding environment.


Deriving a basis of Human movements
Many examples in the literature have highlighted the importance of human likeness (HL) in ensuring safe and effective Human-Robot Interaction (HRI) and Environment-Robot Interaction [8]. This aspect has gained increasing attention, since it could open interesting perspectives for the control of artificial systems that closely interact with humans, as is the case for assistive, companion and rehabilitative robots. For the latter category, for example, human-inspired movement profiles could be used as reference trajectories for rehabilitation exoskeletons (see [10] for a review), as an alternative to, and/or in association with, classic rehabilitation procedures [9]. Similarly, human likeness of movements is of paramount importance for robots that interact with the surrounding environment in an unstructured scenario shared with humans.
Indeed, in these cases the motion of a robot can be more easily predicted, and hence accepted, by the user if its movements are designed taking inspiration from actual human movements [14], leading to a general enhancement of system usability and effectiveness. However, the design of control laws that effectively ensure human-like behavior in robotic systems is not straightforward, and it represents an important topic within the general framework of robot motion planning. The solution we implemented exploits functional analysis to derive a basis of eigenfunctions of human movements, which encode the characteristics of typical physiological motions.
To this end, we recorded the motion of 33 healthy subjects performing a list of 30 different actions of daily living (see Fig. 2).
Then, functional Principal Component Analysis (fPCA) was used to identify a basis of principal functions (or eigenfunctions), which are ordered by importance. More specifically, let us assume, without loss of generality, a 7-DoF kinematic model representing upper limb joint trajectories $q(t) : \mathbb{R} \to \mathbb{R}^7$, where $t \in [0, 1]$ is the normalized time. A generic upper limb motion $q(t)$ can then be decomposed as a weighted sum of basis elements $S_i(t)$, or functional Principal Components (fPCs):

$$ q(t) \simeq \bar{q} + S_0(t) + \sum_{i=1}^{s_{\max}} \alpha_i \circ S_i(t), \tag{1} $$

where $\alpha_i \in \mathbb{R}^n$ is a vector of weights, $S_i(t) \in \mathbb{R}^n$ (in this case $n = 7$) is the $i$-th basis element or fPC, and $s_{\max}$ is the number of basis elements. The operator $\circ$ is the element-wise (Hadamard) product, $\bar{q} \in \mathbb{R}^7$ is the average posture of $q$, and $S_0 : \mathbb{R} \to \mathbb{R}^7$ is the average trajectory, also called the zero-order fPC. The output of fPCA, which is calculated independently for each joint, is a basis of functions $\{S_1, \dots, S_{s_{\max}}\}$ that maximizes the explained variance of the movements in the collected dataset. Given a dataset of $N$ elements collecting the trajectories $q_j^k(t)$, $k = 1, \dots, N$, recorded for a given joint $j$ (after removal of the average terms in (1)), the first fPC $S_{j,1}(t)$ is the function that solves the following problem:

$$ \max_{S_{j,1}} \sum_{k=1}^{N} \left( \int_0^1 S_{j,1}(t)\, q_j^k(t)\, dt \right)^2 \quad \text{subject to} \quad \|S_{j,1}\|_2^2 = \int_0^1 S_{j,1}^2(t)\, dt = 1. \tag{2} $$

Subsequent fPCs $S_{j,i}(t)$ are the functions that solve the following:

$$ \max_{S_{j,i}} \sum_{k=1}^{N} \left( \int_0^1 S_{j,i}(t)\, q_j^k(t)\, dt \right)^2 \quad \text{subject to} \quad \|S_{j,i}\|_2^2 = 1, \quad \int_0^1 S_{j,i}(t)\, S_{j,p}(t)\, dt = 0 \;\; \forall p \in \{1, \dots, i-1\}. \tag{3} $$

A detailed implementation of this method, which bypasses the explicit solution of this optimization problem, is presented in [2]. The core idea is that the output of this process is an ordered list of functions, organised according to the importance that each function has in reconstructing the whole dataset (see Fig. ??). Note that this formalization of human trajectories is very compact and rich in information, and can enable several practical implementations. For example, one can observe that the higher the number of fPCs required to reconstruct a specific movement, the more complex (or jerky) the motion is.
This can have a direct impact on the evaluation of motion impairment, for example as a consequence of a stroke event. Indeed, pathological motions are typically jerky, and an assessment of the level of impairment can rely on the fPCA characterization, as we proposed in [15].
Moreover, the hierarchy in the definition of subsequent fPCs is a key characteristic, since it can be exploited to design incremental motion planning algorithms [1], as presented in the remainder of this manuscript.
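As an illustration of how an ordered basis of this kind can be obtained numerically, the following is a minimal sketch of a discretized fPCA for a single joint. It is not the implementation of [2]: it assumes that all trajectories are sampled on a common uniform grid of normalized time, and it replaces the variational problem with a singular value decomposition of the centered data matrix, which yields orthonormal functions ordered by explained variance. The function name and array layout are hypothetical.

```python
import numpy as np

def fpca_single_joint(trajectories, n_components):
    """Discretized fPCA for one joint (sketch, not the method of [2]).

    trajectories: array of shape (N, T), i.e. N recorded motions sampled at
    T uniformly spaced instants of normalized time in [0, 1].
    Returns (postures, S0, S, explained):
      postures  (N,)   average posture of each motion
      S0        (T,)   average trajectory (zero-order fPC)
      S         (n_components, T) sampled fPCs, ordered by importance
      explained (n_components,)   fraction of variance explained by each fPC
    """
    X = np.asarray(trajectories, dtype=float)
    dt = 1.0 / (X.shape[1] - 1)
    # Remove the average posture of each motion, then the average trajectory.
    postures = X.mean(axis=1)
    X = X - postures[:, None]
    S0 = X.mean(axis=0)
    X = X - S0
    # Rows of vt are orthonormal directions ordered by singular value.
    _, sing, vt = np.linalg.svd(X, full_matrices=False)
    # Rescale so that sum(S_i**2) * dt = 1, a Riemann-sum version of unit L2 norm.
    S = vt[:n_components] / np.sqrt(dt)
    explained = sing**2 / np.sum(sing**2)
    return postures, S0, S, explained[:n_components]
```

Under these assumptions, the number of components needed to reach a given cumulative explained variance gives a simple indication of how complex the recorded motions are.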

Planning Robots' Movements with fPCs
As previously discussed, typical approaches used in the literature to achieve human likeness [12] in robotic motions rely on the strong assumption that human movements are generated by optimizing a known cost function $J_{hl}(q) : C^1_7[0, 1) \to \mathbb{R}^+$, where $C^1_7[0, 1)$ is the space of smooth functions from $[0, 1)$ to the joint space $\mathbb{R}^7$, and 1 stands for the final normalized time. The function $J_{hl}$ is used to produce natural-looking artificial motions by solving the problem

$$ \min_{q \in C^1_7[0, 1)} J_{hl}(q). \tag{4} $$

How to choose $J_{hl}$ is not obvious, and it is indeed a much-debated topic in the literature. Moreover, achieving human likeness alone is meaningless without also specifying a task to be accomplished. For this reason, a model of the task should be added to (4). This can be formulated as the minimization of an additional cost function $J_{task} : C^1_7[0, 1) \to \mathbb{R}^+$. As soon as the need to minimize $J_{task}$ is introduced, (4) becomes a multi-objective optimization, which is very difficult to formulate and solve, except for very simple cases [12].
The solution we proposed bypasses this issue. Indeed, instead of using data to guess a reasonable $J_{hl}(\cdot)$ and then explicitly optimizing it, our solution directly embeds human likeness in the choice of the functional subspace where the optimization occurs. More specifically, the problem moves from the infinite-dimensional function space $C^1_7[0, 1)$ to its finite-dimensional subspace containing all the functions constructed as

$$ q(t) = \bar{q} + S_0(t) + \sum_{i=1}^{M} \alpha_i \circ S_i(t), \tag{5} $$

with $\bar{q}$, $S_i$, $\alpha_i$ defined as in the previous section. Since $t$ is the normalized time, the principal components can be used to generate motions over any time horizon $[0, t_{fin})$ by rescaling time. $M \le s_{\max}$ is the number of functional Principal Components considered in the optimization (with $s_{\max}$ as in the previous section). According to the preliminary results presented in [2] and further extended in [3], it is plausible to expect that a low number of functional Principal Components is sufficient to implement most human-like motions at the joint level. Therefore, the multi-objective and unconstrained optimization can be reformulated as the following constrained optimization problem:

$$ \min_{\bar{q},\, \alpha_1, \dots, \alpha_M} J_{task}(q) \quad \text{subject to} \quad q(t) = \bar{q} + S_0(t) + \sum_{i=1}^{M} \alpha_i \circ S_i(t). \tag{6} $$

In this manner, the search space is narrowed, with the twofold purpose of ensuring human likeness and strongly simplifying the control problem (indeed, the search space is now of dimension $M + 1$ per joint).
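To make the structure of this subspace concrete, the sketch below evaluates a trajectory from the average posture, the zero-order fPC and M weighted fPCs. It is only an illustration under stated assumptions: the basis functions are taken to be sampled on a uniform grid of normalized time, linear interpolation is used between samples, and a time horizon [0, t_fin) is handled by rescaling time to the normalized interval. The function name and array layout are hypothetical.

```python
import numpy as np

def synthesize_trajectory(q_bar, alphas, S0, S, t_query, t_fin=1.0):
    """Evaluate q(t) = q_bar + S0(t/t_fin) + sum_i alphas[i] * S[i](t/t_fin).

    q_bar:   (n,)      average posture
    alphas:  (M, n)    one weight vector per enrolled fPC
    S0:      (T, n)    zero-order fPC sampled on a uniform grid over [0, 1]
    S:       (M, T, n) sampled fPCs on the same grid
    t_query: scalar or (Q,) time instants in [0, t_fin]
    Returns an array of shape (n,) or (Q, n).
    """
    q_bar = np.asarray(q_bar, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    S0 = np.asarray(S0, dtype=float)
    S = np.asarray(S, dtype=float)
    grid = np.linspace(0.0, 1.0, S0.shape[0])
    # Hadamard product per joint, summed over the M components.
    samples = q_bar + S0 + np.einsum("mn,mtn->tn", alphas, S)
    # Rescale the query times to normalized time (assumption of this sketch).
    tau = np.clip(np.asarray(t_query, dtype=float) / t_fin, 0.0, 1.0)
    columns = [np.interp(tau, grid, samples[:, j]) for j in range(samples.shape[1])]
    return np.stack(columns, axis=-1)
```

With this parametrization, an optimizer only needs to search over `q_bar` and `alphas`, that is, over M + 1 scalars per joint.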

Point-to-Point Free Motions
Point-to-point motion can be generated by solving the following optimization problem, an instance of the more general formulation (6):

$$ \min_{\bar{q},\, \alpha_1, \dots, \alpha_M} \|q(0) - q_0\|^2 + \|q(1) - q_{fin}\|^2, \tag{7} $$

where $q(0)$ and $q(1)$ are the initial and final poses of the calculated trajectory, while $q_0$ and $q_{fin}$ are the desired initial and final poses, respectively. In this simple case, a single functional Principal Component (i.e. $M = 1$) is already sufficient to solve (7) with zero error, and the solution can be written in closed form (see [3]).
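The zero-error solution with a single fPC can be obtained by imposing the two boundary conditions joint by joint, which gives two linear equations in the two unknowns (average posture and weight) of each joint. The sketch below follows this reasoning directly. It is not necessarily the form given in [3], and it assumes sampled basis functions whose first and last samples correspond to normalized times 0 and 1, with the first fPC taking different values at the two ends for every joint.

```python
import numpy as np

def point_to_point_m1(q0, q_fin, S0, S1):
    """Closed-form weights for a point-to-point motion with one fPC (sketch).

    Solves, for each joint independently:
        q_bar + S0(0) + alpha * S1(0) = q0
        q_bar + S0(1) + alpha * S1(1) = q_fin
    q0, q_fin: (n,) desired initial and final poses
    S0, S1:    (T, n) sampled zero-order and first fPC (row 0 is t=0, row -1 is t=1)
    Returns (q_bar, alpha), both of shape (n,).
    """
    q0 = np.asarray(q0, dtype=float)
    q_fin = np.asarray(q_fin, dtype=float)
    S0 = np.asarray(S0, dtype=float)
    S1 = np.asarray(S1, dtype=float)
    span = S1[-1] - S1[0]
    if np.any(np.isclose(span, 0.0)):
        raise ValueError("S1 must differ between t=0 and t=1 for every joint")
    # Subtracting the two boundary equations eliminates q_bar.
    alpha = (q_fin - q0 - (S0[-1] - S0[0])) / span
    q_bar = q0 - S0[0] - alpha * S1[0]
    return q_bar, alpha
```

The resulting trajectory `q_bar + S0 + alpha * S1` matches both desired poses exactly, which is consistent with the zero-error statement above.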
Obstacle avoidance. Let us consider the case in which one or more obstacles must also be avoided while performing the point-to-point motion. The problem can be generalized as:

$$ \min_{\bar{q},\, \alpha_1, \dots, \alpha_M} \|q(0) - q_0\|^2 + \|q(1) - q_{fin}\|^2 + P(q, P_O). \tag{8} $$

Two terms can be distinguished in this cost function. The first contribution guarantees that the desired initial and final poses are achieved, as in the free-motion case (7). The second term takes into account the distance from the obstacles. For the sake of conciseness, and without loss of generality, we consider here $N_O$ spherical obstacles. Given the set $P_O = \{P_{O_1}, \dots, P_{O_{N_O}}\}$ containing the Cartesian coordinates of the centers of these obstacles, $P(q, P_O)$ is a potential-based function that sums up, for each obstacle, a term inversely proportional to the minimum distance between the obstacle and the arm along the trajectory, i.e.

$$ P(q, P_O) = \sum_{i=1}^{N_O} \frac{1}{m_i\big(q([0, 1]), P_{O_i}\big)}, $$

where $m_i$ is the distance between the arm and the $i$-th obstacle, defined as

$$ m_i\big(q([0, 1]), P_{O_i}\big) = \min_{t \in [0, 1],\, k} d_k\big(q(t), P_{O_i}\big). $$

The distance between the $k$-th point of contact, with forward kinematics $h_k$, and the $i$-th sphere is

$$ d_k\big(q(t), P_{O_i}\big) = \|h_k(q(t)) - P_{O_i}\| - R_{O_i}, $$

with $R_{O_i}$ the radius of the sphere.
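A possible numerical evaluation of a potential of this kind is sketched below. The sketch assumes that the forward kinematics has already been applied, so that the Cartesian positions of the points of contact along the sampled trajectory are available as an array. It uses a unit proportionality constant, and it returns an infinite potential when a point touches or penetrates a sphere. These choices, like the function name, are assumptions made for illustration.

```python
import numpy as np

def obstacle_potential(points, centers, radii):
    """Potential summing 1 / (minimum clearance) over spherical obstacles (sketch).

    points:  (T, K, 3) positions of K points of contact at T time samples,
             i.e. h_k(q(t)) evaluated on a time grid
    centers: (N_O, 3) sphere centers
    radii:   (N_O,)   sphere radii
    Returns (potential, clearances), where clearances has shape (N_O,).
    """
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    radii = np.asarray(radii, dtype=float)
    # Distance of every point, at every time sample, from every sphere surface.
    diff = points[:, :, None, :] - centers[None, None, :, :]
    d = np.linalg.norm(diff, axis=-1) - radii          # shape (T, K, N_O)
    # Minimum over time samples and points of contact, one value per obstacle.
    clearances = d.min(axis=(0, 1))
    if np.any(clearances <= 0.0):
        # Contact or penetration: treated as infinite cost (assumption).
        return float("inf"), clearances
    return float(np.sum(1.0 / clearances)), clearances
```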
Incremental optimization procedure. The problem of motion generation with obstacle avoidance does not have a closed-form solution, hence the optimal trajectory is calculated via numerical optimization. One way to do this is to exploit the hierarchy of the fPC basis elements, ordered by descending explained variance, and to implement an incremental procedure (see [3] for the implementation of the algorithm). Given a fixed number of enrolled fPCs, the proposed approach calculates the optimal trajectory that minimizes the error in the starting and final positions while maximizing the distance from the obstacles. If the corresponding solution is sufficiently far from the obstacles, it already defines the globally optimal solution. If the obstacles are not very close to this trajectory, solving (8) with M = 1 fine-tunes the initial guess and achieves good results. If obstacles are very close to, or even intersect, the free-motion trajectory, at least one more fPC should be enrolled to suitably solve the problem. The more basis elements are enrolled, the more complex the final trajectories that can be implemented (see e.g. ?? for the generation of a "drinking" task).
Fig. 3. In this example our approach is used to generate a "drinking" task, with and without obstacles along the trajectory.
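The control flow of an incremental procedure of this kind can be sketched as follows. This is not the algorithm of [3]: the solver is passed in as a hypothetical callable that optimizes the trajectory for a given number of enrolled fPCs starting from the previous solution, and the stopping rule (a minimum clearance from the obstacles) is an assumption introduced only to make the loop concrete.

```python
from typing import Callable, List, Sequence, Tuple

def incremental_plan(
    solve_with: Callable[[int, Sequence[float]], Tuple[List[float], float]],
    initial_guess: Sequence[float],
    s_max: int,
    min_clearance: float,
) -> Tuple[int, List[float], float]:
    """Enroll fPCs one at a time until the trajectory clears the obstacles (sketch).

    solve_with(m, warm_start) is a hypothetical numerical solver: it optimizes
    the cost using m fPCs, starting from warm_start, and returns the optimized
    parameters together with the minimum clearance of the resulting trajectory.
    Returns (number of fPCs used, parameters, clearance).
    """
    params = list(initial_guess)
    clearance = float("-inf")
    for m in range(1, s_max + 1):
        # Reuse the previous solution as the initial guess for the richer basis.
        params, clearance = solve_with(m, params)
        if clearance >= min_clearance:
            return m, params, clearance
    # All available fPCs were enrolled without reaching the requested clearance.
    return s_max, params, clearance
```

Because the fPCs are ordered by importance, the loop tries the simplest trajectories first and adds complexity only when the obstacles require it.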

Learning from Humans How to Grasp: Enhancing the Reaching Strategy
Deriving useful information from humans can be pushed even further through the use of machine learning techniques. However, it is important to recall that learning-based techniques can only achieve solutions that are close enough to the desired ones, rather than exact. This uncertainty can be naturally compensated by the ability of soft hands to adapt locally to unknown environments. Following this approach, part of our effort has been devoted to the development of a human-inspired, multi-modal, multi-layer architecture that combines feedforward components, predicted by a Deep Neural Network, with reactive sensor-triggered actions (more details in [5]). Humans are able to accomplish very complex grasps by employing a vast range of different strategies [7]. This comes with the challenging problem of finding the right strategy to use for a given scenario. It is commonly suggested that the animal brain addresses this challenge by first constructing representations of the world, which are used to make a decision, and then by computing and executing an action plan [11]. Rather than learning a monolithic end-to-end map, we built the proposed architecture as a combination of interpretable basic elements, organized as in Fig. 4.
Fig. 4. High-level organization of the proposed architecture, which combines anticipatory actions and reactive behavior. A deep classifier looks at the scene and predicts the strategy that a human operator would use to grasp the object. This output is employed to select the corresponding robotic primitive. These primitives define the posture of the hand over time, to produce a natural, human-like motion. The IMUs placed on the fingers of the hand detect the contact with the items and trigger a suitable reactive grasp behavior.
The intelligence is here distributed over three levels of abstraction: i) high level: a classifier that selects the correct action among all the available ones; ii) medium level: a set of human-inspired low-level strategies implementing both the approaching phase and the sensor-triggered reaction; iii) low level: a soft hand whose embodied intelligence mechanically manages local uncertainties. All three levels are human-inspired.
The classifier was realized through a deep neural network, trained to predict the object-directed grasp action chosen among nine human-labeled strategies, using as input only a first-person RGB image extracted from a video. These actions were implemented on the robotic side to reproduce the motions observed in the videos. A reactive component was then introduced, following the philosophy of [4]. This component takes as input the accelerations coming from six IMUs placed on the soft hand to generate the desired evolution of the hand pose. The lower level of intelligence consists of the soft hand itself, which can take care of local uncertainties by relying on its intrinsic compliance. Any robotic hand that is soft and anthropomorphic, both in its motions and in its kinematics, can serve this purpose (for example, the Pisa/IIT SoftHand).

Deep Classifier
The aim of this deep neural network is to associate each object detected in the scene with the correct primitive (i.e. hand pose evolution) that humans would perform to grasp it. The deep learning model consists of two stages, depicted in Fig. 4: the first detects the object, and the second performs the actual association with the required motion.
Dataset creation and human primitive labeling. The network was trained on 6336 first-person RGB videos (single-object, table-top scenario) of 11 right-handed subjects grasping 36 objects. The list of objects was chosen to span a wide range of possible grasps, taking inspiration from [6]. During the experiments, subjects were comfortably seated in front of a table, where the object was placed. They were asked to grasp the object starting from a rest position (hand on the table, palm down). Each task was repeated 4 times from 4 points of view (the four central points of the table edges). To extract and label the strategies, videos were visually inspected to identify ten main primitives (power, pinch, sliding, lateral and flip grasps in different relative orientations). The choice of these primitives was made taking inspiration from the literature [6,7], with the aim of providing a representative yet concise description of human behavior, without any claim of exhaustiveness. The first frame of each video, showing only the object in the environment, was extracted and processed by the object detection part of the network (see next subsection). The cropped image was then labeled with the strategy used by the subject in the remaining part of the video. This is the dataset that we used to train the network.
Object detection and primitive classification. Object detection is implemented using the state-of-the-art detector YOLOv2 [13]. Given the RGB input image, YOLOv2 produces as output a set of labeled bounding boxes containing all the objects in the scene. Assuming that the target is located close to the center of the image, we select the bounding box closest to the scene center. Then, a modification of Inception-v3 [16], trained on the ImageNet data set, was used to classify objects from images and extract high-level semantic descriptions that can be applied to objects with similar characteristics.
Fig. 6. Photosequence of a grasp produced by the proposed architecture during validation: slide grasp of a flat plate. Panels (a-c) depict the approaching phase. In panels (d-e) the environment is exploited to guide the object to the table edge. In panels (f-g) the hand changes its relative position w.r.t. the object so as to favor the grasp, which is established in panels (h-i). In panel (j) the item is firmly lifted.
Technical details on training and validation are omitted here; the interested reader is invited to refer to [5].
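The selection of the bounding box closest to the scene center can be illustrated with the short sketch below. It does not use the YOLOv2 interface: it assumes that the detector output has already been converted to a list of boxes in (x_min, y_min, x_max, y_max) pixel coordinates, and it compares the box centers with the image center. The function name and box format are hypothetical.

```python
def closest_box_to_center(boxes, image_width, image_height):
    """Return the index of the box whose center is nearest to the image center.

    boxes: non-empty list of (x_min, y_min, x_max, y_max) tuples in pixels
           (assumed format, not the native output of any specific detector).
    """
    if not boxes:
        raise ValueError("at least one bounding box is required")
    cx, cy = image_width / 2.0, image_height / 2.0

    def squared_distance(box):
        x_min, y_min, x_max, y_max = box
        bx, by = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
        return (bx - cx) ** 2 + (by - cy) ** 2

    # Ties are resolved in favor of the first box in the list.
    return min(range(len(boxes)), key=lambda i: squared_distance(boxes[i]))
```

The selected box would then be used to crop the image before it is passed to the classification stage.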

Robotic Grasping Primitives
The output of the network introduced in the previous section is a direction of approach, described in terms of a high-level description of the human preference for the specific object shape and orientation. For each primitive, a human-like approaching trajectory needs to be planned (following, for example, the approach presented in section II).
As a trade-off between performance and complexity, the approaching phase is associated with an additional reactive behavior. The role of the latter is to introduce a feedback control that leverages measurements recorded through IMUs at the fingertip level, with the ultimate goal of locally and precisely arranging the relative configuration between hand and object (see [4]). The transition between the first and the second phase is triggered by a contact event, detected as an abrupt acceleration of the fingertips (as read by the IMUs). In [4], a subject was asked to reach and grasp a tennis ball while maneuvering a Pisa/IIT SoftHand. The grasp was repeated 13 times, from different approaching directions. The user was instructed to move the hand until contact with the object, and then to react by adapting the hand/wrist pose w.r.t. the object. Poses of the hand were recorded through a PhaseSpace motion tracking system. From the hand pose evolution recorded between the contact and the grasp (a time interval of duration $T$), we subtract the posture of the hand at contact. The resulting function $\Delta_i : [0, T] \to \mathbb{R}^7$ describes the rearrangement performed by the subject to grasp the object. Acceleration signals $\alpha_1, \dots, \alpha_{13} : [0, T] \to \mathbb{R}^5$ were also measured through the IMUs. To transform these recordings into a local adaptation strategy, we considered the acceleration patterns as a characteristic feature of the interaction with the object. When the Pisa/IIT SoftHand touches the object, the IMUs read an acceleration profile $a : [0, T] \to \mathbb{R}^5$. The triggered sub-strategy is defined by the local rearrangement $\Delta_j$, with

$$ j = \arg\max_i \int_0^T a^{\top}(\tau)\, \alpha_i(\tau)\, d\tau. $$
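The selection rule for the reactive sub-strategy amounts to correlating the measured acceleration profile with each recorded one and taking the best match. A minimal discretized sketch is given below, assuming that all profiles are sampled on the same uniform time grid over [0, T] and approximating the integral with a Riemann sum. The function name and array layout are hypothetical.

```python
import numpy as np

def select_rearrangement(a, alphas, dt):
    """Index of the recorded profile that best correlates with the measured one.

    a:      (T, d)    acceleration profile measured at contact
    alphas: (P, T, d) recorded acceleration profiles, one per demonstration
    dt:     sampling period used to approximate the integral
    Returns (j, scores), where scores[i] approximates the integral of
    a(tau)^T alphas[i](tau) over [0, T] and j = argmax_i scores[i].
    """
    a = np.asarray(a, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    # Inner product over channels, summed over time samples (Riemann sum).
    scores = np.einsum("td,ptd->p", a, alphas) * dt
    return int(np.argmax(scores)), scores
```

The returned index would then be used to pick the corresponding recorded rearrangement and replay it before closing the hand.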
When this motion has been completely executed, the hand starts closing until the object is grasped. We extensively tested the proposed architecture with 20 objects, different from the ones used for training the network. The approach proved reliable, achieving a success rate of 81.1% over 111 tested grasps. This suggests that taking inspiration from humans can provide very interesting solutions for classic and novel problems, toward a new generation of anthropomorphic robots.