Towards Generalization in Target-Driven Visual Navigation by Using Deep Reinforcement Learning

Among the main challenges in robotics, target-driven visual navigation has gained increasing interest in recent years. In this task, an agent has to navigate in an environment to reach a user specified target, only through vision. Recent fruitful approaches rely on deep reinforcement learning, which has proven to be an effective framework to learn navigation policies. However, current state-of-the-art methods require to retrain, or at least fine-tune, the model for every new environment and object. In real scenarios, this operation can be extremely challenging or even dangerous. For these reasons, we address generalization in target-driven visual navigation by proposing a novel architecture composed of two networks, both exclusively trained in simulation. The first one has the objective of exploring the environment, while the other one of locating the target. They are specifically designed to work together, while separately trained to help generalization. In this article, we test our agent in both simulated and real scenarios, and validate its generalization capabilities through extensive experiments with previously unseen goals and unknown mazes, even much larger than the ones used for training.


I. INTRODUCTION
T ARGET-DRIVEN visual navigation is a longstanding goal in the robotics community. A robot able to navigate and reach user specified targets, by using only visual inputs, would have a great impact on many robotic applications, such as people assistance, industrial automation, and transportation (see Fig. 1).
A naive way to approach this problem is to combine a classic navigation system with an object detection module. For example, it is possible to couple a map-based navigation algorithm [1]-[4] or simultaneous localization and mapping (SLAM) systems [5]-[8] with one of the state-of-the-art object detection or image recognition models [9], [10]. However, map-based approaches assume the availability of a global map of the environment, while SLAM algorithms are still not specifically designed for target-driven visual navigation. A geometric map and the distinction between mapping and planning are not necessary in this task, and can make the whole system unnecessarily fragile. Furthermore, deep learning object detection models and classic navigation algorithms are not originally developed to work together, and combining them is not trivial.
For these reasons, map-less methods [11]-[14] have proven to be much more suitable for target-driven visual navigation. A widespread approach is to combine deep convolutional neural networks (CNNs) with reinforcement learning (RL). Deep reinforcement learning (DRL), indeed, makes it possible to manage the relationship between vision and motion in a natural way, and it has shown impressive results for map-less visual navigation and many other robotic tasks [15]-[20].
The common objective in most RL problems is to find a policy which maximizes the reward w.r.t. a specific goal. In such tasks, the target is unique and fixed from the beginning of the training phase. On the contrary, in target-driven visual navigation, a possibly different goal is specified for each run of the algorithm. The typical approach is to embed into the policy both the target goal and the current state. Theoretically, in this way, it is possible to train just one algorithm to find multiple targets, without the need of learning new model parameters for every possible goal. However, current methods are limited to considering as goals only the specific scenes or objects with which the model is trained [21], [22]. Therefore, in practice, it is still necessary to train, or at least fine-tune, the agent for every specific object it has to find and for every new environment it has to explore. If the environment is a real one, the training procedure for such a DRL agent can be very complex and in some cases dangerous.

In order to address this problem, we propose a new architecture, trained exclusively in simulation, which transfers to real environments and real objects, without the need of fine-tuning. RL algorithms, however, show a great tendency to overfit to the domain in which they are trained [23], [24]. To avoid this, we make sure that the agent's navigation policy is not biased by the objects seen during training, and that the localization of such objects is not affected by the exploration strategies. To this end, we introduce a novel framework composed of two networks. The first one aims to develop exploration strategies in unknown environments, while the other one to locate the target object in the image. We design our framework in such a way that the two parts of the architecture are independent of each other, helping generalization, but at the same time perfectly suited to work together. We show that by completely decoupling these two components, we can effectively apply domain-transfer techniques to the two networks separately and also control their complexity according to their respective tasks.
To train such an agent, we build two simple simulated environments, one for each subtask, using the photorealistic graphics engine Unreal Engine 4 (UE4, https://www.unrealengine.com). To evaluate our model's performance, we analyze its ability to explore unknown environments and locate target objects in different simulated scenarios. Through our experiments, we show that with our approach it is possible to achieve surprising results even in much more complex scenarios compared to the ones used during training. Finally, we verify the generalization capability of our algorithm in a complex real setting with a real robot.
The supplementary video can be found at the following link: https://www.youtube.com/watch?v=gZzP0y4AnRY.

II. RELATED WORK
Target-driven visual navigation is a relatively new task in the field of robotics research. Only recently, end-to-end systems have been specifically developed to address this problem. A possible naive approach could be to use a classic map-based navigation algorithm along with an image or object recognition model, such as YOLO [9] or ResNet [10]. However, map-based methods cannot be used in unknown or dynamic environments. To cope with this restriction, they can be replaced with SLAM systems [5]-[8], but other limitations still remain, in particular: first, SLAM systems build and use geometric maps to navigate, which are not necessary for target-driven visual navigation; second, classic navigation algorithms separate mapping from planning, which is again not needed for this task and can potentially compromise the robustness of the whole system; third, additional efforts are required to combine SLAM systems and object detection modules, as they are not designed to work together.
To overcome these limits, map-less methods, which try to solve the problem of navigation and target approaching jointly, have been proposed [12], [21], [22]. These systems, like ours, do not build a geometric map of the area; instead, they implicitly acquire the minimum knowledge of the environment necessary for navigation. This is done by directly mapping visual inputs to motion, i.e., pixels to actions. In particular, given its recent successes, the DRL framework proves very promising for this purpose.

A. DRL for Robotic Applications and Visual Navigation
RL has a long history in the artificial intelligence research field [25]. However, it is only in recent years, with the adoption of deep neural networks (DNNs), that the first great results have been achieved. In [26] and [27], human-level performance is reached and surpassed in the games of Backgammon and Chess, respectively. Silver et al. [28] prove how a combination of DNNs and tree search can master the game of Go, beating the world champion Lee Sedol. In 2015, Mnih et al. [29] presented the first algorithm based on RL and CNNs able to reach human-level performance in a variety of Atari videogames, playing directly from pixels. That work encouraged further research in many other visually complex tasks and videogames.
These successes in DRL have also inspired several works in the robotics field. Thuruthel et al. [19] proposed a DRL-based closed-loop predictive controller for soft robotic manipulators. Andrychowicz et al. [20] used RL to learn policies for complex dexterous in-hand manipulation. Sampedro et al. [30] introduced an image-based visual servoing controller for object following applications with multirotor aerial robots. Sadeghi and Levine [31] proposed an algorithm, trained with 3-D CAD models only, which is able to perform collision-free indoor flight in real environments. In [32], a system to control torques at the robot's motors directly from raw image observations is designed. Jaritz et al. [33] presented an A3C [34] based algorithm for end-to-end driving in the rally game TORCS, using only RGB frames as input. They run open-loop tests on real sequences of images, demonstrating some domain adaptation capability. Bruce et al. [35] introduced the interactive replay to show how they perform zero-shot learning under real-world environmental changes. Ye et al. [36] proposed a new method for object approaching in indoor environments. Jaderberg et al. [13] and Mirowski et al. [14] showed that an agent can be successfully trained to explore complex unknown mazes to find a specific target, using only RGB frames as input. However, in their experiments, the appearance of the goals is fixed during training. As a consequence, the target is embedded in the model parameters, meaning that during the test phase, the target must necessarily be the same.
Other approaches [37]-[39] focus on visual navigation to find objects given the corresponding class labels. Mousavian et al. [37] try to learn navigation policies by using state-of-the-art (SotA) computer vision techniques to acquire semantic segmentation and detection masks. They show that such additional inputs are domain independent and allow joint training in simulated and real scenes, reducing the need for common sim-to-real transfer methods, like domain adaptation or domain randomization. However, we argue that segmentation and detection masks are less beneficial in maze-like environments, where floor and walls are the predominant elements, which our method instead addresses. Zhu et al. [38] introduced a new deep learning based approach for visual semantic planning. They use both RL and imitation learning (IL) to train a deep predictive model based on successor representations, demonstrating good cross-task knowledge transfer results. Although bootstrapping RL with IL proves to be an effective way to increase sample efficiency in many other applications, one of the main purposes of this article is to study how artificial agents learn complex navigation policies in complete autonomy, i.e., without human influence. Yang et al. [39] addressed generalization in visual semantic navigation by managing prior knowledge with graph convolutional networks [40]. While they prove that, in their settings, prior knowledge improves generalization to unseen scenes and targets, it has limited applicability in maze-like environments and in target-driven visual navigation, where target objects are not specified semantically but with images.
To the best of our knowledge, the only two works which tackle the task of target-driven visual navigation, by using images to specify goals and employing DRL only (i.e., without IL), are [21] and [22]. Zhu et al.'s work [21] is the first one that addresses this problem. The authors propose a novel algorithm whose policy is a function of the goal as well as the current state. In this way, it is possible to specify a new arbitrary target without the need of retraining or fine-tuning the model. However, the images they use to specify the goal are scenes taken from the same area where the agent is placed. This means that with this approach, a robot cannot navigate in an environment of which we have no image.
One further step toward generalization is taken by [22], which introduces a framework that integrates a DNN based object recognition module. With this module, the agent can identify the target object regardless of where the photo of that object is taken. However, it is still trained or fine-tuned in the same environments where it is tested. Therefore, it is still not able to generalize to unseen scenarios.

B. Contribution and Overview
In all the aforementioned works [21], [22], it is always necessary to retrain or at least fine-tune the model for new environments and objects. In real scenarios, this is not only an inefficient approach due to the high cost of producing samples with physical robots, but it can also be dangerous.
To avoid that, we design a novel architecture composed of two main DNNs: the first, the navigation network, with the goal of exploring the environment and approaching the target; the second, the object localization network, with the aim of recognizing the specified target in the robot's view. They are exclusively trained in simulation, and not a single real image is used. Finally, we show that our algorithm directly transfers to new unknown environments, even much larger than the ones used during training, and, most importantly, also to real ones with real targets.
This article proceeds as follows. Section III describes the DRL framework used and the proposed two-network architecture; Section IV focuses on the training procedure; Section V presents the environment setup used to train our algorithm; Section VI shows the experiments and the results; finally, Section VII concludes this article.

III. FRAMEWORK
In the following, the proposed framework is presented. In particular, this section is divided into two parts: 1) Problem Formulation, in which the problem is presented and our design choices are described; and 2) Network Architecture, where our two-network architecture is detailed.

A. Problem Formulation
The target-driven visual navigation problem consists in finding the shortest sequence of actions to reach a specified target, using only visual inputs. Our goal is to design an RL agent able to find that sequence directly from pixels.
We consider the standard RL setting where the agent interacts with the environment over a number of discrete time steps. The environment can be seen as a partially observable Markov decision process (POMDP) in which the main task of the agent is to find a policy π that maximizes the expected sum of future discounted rewards

R_t = E[ Σ_{k=0}^{∞} γ^k r_{t+k} ]     (1)

where γ ∈ [0, 1) is the discount factor, r_t = r(x_t, a_t) is the reward at time t, x_t is the state at time t, and a_t ∼ π(·|x_t) is the action generated by following some policy π. The MDP is clearly partially observable in our case, because at every step the agent has no access to the true state x_t of the environment, but only to an observation o_t of it. The observation o_t is composed of the current RGB frame from the agent's point of view and the image of the target to be reached. These two inputs are both fed into the architecture, which consists of two different networks. The first, i.e., the object localization network, has the objective of comparing the two images and locating the target. The second, i.e., the navigation network, is used to learn exploration strategies to solve complex mazes. In particular, both inputs are processed by the object localization network, which compares them and outputs a vector indicating the target relative position in the field of view of the agent. This vector and the current RGB frame are then fed into the navigation network that selects the next action (see Fig. 2).
From the results in [13] and [14], it emerges that even a small and fast CNN can achieve great results in synthetic maze solving, and we argue that such a network could be sufficient in real ones too. Following this reasoning, we employ a lightweight CNN for the navigation task and a larger one for the object localization task, where a powerful model for feature extraction is needed. In this way, the object localization network can be much more efficiently trained offline by supervised learning, and the navigation network can be trained faster with RL.

Fig. 2. Overall system architecture. We can see the object localization and the navigation networks with the yellow and blue background, respectively. The first network takes two 224 × 224 RGB images as input, the goal image and the current frame. All the weights are shared among the two branches (red dotted square). The network outputs a six-element one-hot vector encoding the relative target position in the current frame. This vector is then fed into the navigation network together with the current frame, resized to 84 × 84. The network finally produces the policy and the value function.
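To make the data flow of Fig. 2 concrete, the following minimal Python/PyTorch sketch shows a single control step; the module handles localization_net and navigation_net, their signatures, and the resize helper are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def resize(img, size):
        # img: (1, 3, H, W) float tensor; bilinear resize to size x size.
        return F.interpolate(img, size=(size, size), mode="bilinear", align_corners=False)

    def control_step(frame, goal, localization_net, navigation_net, lstm_state):
        # Object localization: compare goal and current frame, then encode the most
        # likely of the six position classes as a one-hot vector.
        loc_probs = localization_net(resize(goal, 224), resize(frame, 224))        # (1, 6)
        one_hot = F.one_hot(loc_probs.argmax(dim=1), num_classes=6).float()
        # Navigation: 84 x 84 frame + one-hot target position -> policy and value.
        logits, value, lstm_state = navigation_net(resize(frame, 84), one_hot, lstm_state)
        action = torch.distributions.Categorical(logits=logits).sample()           # right / forward / left
        return action.item(), lstm_state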
Among the several DRL algorithms, e.g., deep Q-Learning [41], A3C [34], batched A2C [42], GA3C [43], we choose the recent IMPALA (Importance Weighted Actor-Learner Architecture), which, in [44], is used to simultaneously learn a large variety of complex visual tasks. For our purposes, it offers two main advantages. First, it leverages parallel CPU computations for efficient trajectory generation, and GPUs for faster backward computation. Second, it implements V-trace targets, in place of the standard value function targets, which allow sample-efficient off-policy learning.
The architecture of the two networks is explained in detail in the following paragraphs, while the training approach is described in Section IV.

B. Networks Architecture
The proposed model is composed of two different networks: the object localization network and the navigation network. In Fig. 2, the overview of the architecture is shown.

1) Object Localization Network:
The object localization network has the objective of locating the target object in the visual field of the agent. We discretize the position of the goal in the current frame into 5 + 1 different classes: "extreme right," "right," "center," "left," "extreme left," and "no target object" [see Fig. 3(a)]. The network takes two 224 × 224 RGB images as input: the current frame and the goal image. To extract robust features, they are first preprocessed by a ResNet-50 [10] (pretrained on ImageNet [45], with the last two fully connected layers dropped). Then, they are fed into five convolutional layers with (512, 128, 16, 16, 16) filters with 3 × 3 kernels and stride 1, all of them with rectified linear units (ReLU) as activation and followed by a GroupNorm layer. Finally, the two representations are concatenated together and processed by a fully connected layer with 256 hidden units and ReLU activation. The final output is a probability vector of six elements, one for each possible position. The position with the highest probability is chosen as input to the navigation network.
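A minimal PyTorch sketch of this branch, under the assumption that the ResNet-50 backbone is truncated before its pooling and classification layers and that the two streams are fused by flattening and concatenation (the exact fusion and flattening details are not specified in the text), could look as follows.

    import torch
    import torch.nn as nn
    import torchvision

    class ObjectLocalizationNet(nn.Module):
        def __init__(self):
            super().__init__()
            resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
            self.backbone = nn.Sequential(*list(resnet.children())[:-2])   # 224x224 -> (2048, 7, 7)
            layers, chans = [], [2048, 512, 128, 16, 16, 16]
            for c_in, c_out in zip(chans[:-1], chans[1:]):
                layers += [nn.Conv2d(c_in, c_out, 3, stride=1, padding=1),
                           nn.ReLU(), nn.GroupNorm(4, c_out)]
            self.convs = nn.Sequential(*layers)                            # shared between branches
            self.head = nn.Sequential(nn.Linear(2 * 16 * 7 * 7, 256), nn.ReLU(),
                                      nn.Linear(256, 6))                   # six position classes

        def forward(self, goal, frame):
            g = self.convs(self.backbone(goal)).flatten(1)
            f = self.convs(self.backbone(frame)).flatten(1)
            return torch.softmax(self.head(torch.cat([g, f], dim=1)), dim=1)

During training, the two branches are fused earlier, after the first three convolutional layers, as described in Section IV-A.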
2) Navigation Network: The navigation network has the main objective of exploring the environment until the object localization network locates the target object. Its inputs are the current 84 × 84 RGB frame and the estimated target position provided by the object localization network. The image is first processed by two convolutional layers: the first with 16 8 × 8 filters with stride 4, and the second with 32 4 × 4 filters with stride 2. Both layers are followed by a ReLU activation and a GroupNorm layer. Then, the extracted image features are concatenated with the estimated target position and processed by a fully connected layer followed by a long short-term memory (LSTM) [46] network, both with 256 hidden units and ReLU activations. It is important to specify that, during the training phase, the features extracted by the CNN are also fed into two deconvolutional layers, symmetrical to the convolutional ones, to perform depth estimation [see Fig. 4(b) and Section IV-B for details]. The final output is a probability vector p of three elements, which represents the probability of choosing an action, "turn right," "move forward," or "turn left," and a scalar value, which is the estimated expected sum of future discounted rewards.
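Likewise, a compact sketch of the navigation trunk could be the following; it assumes an LSTMCell, omits the deconvolutional depth head used only at training time, and does not reproduce the ReLU described for the recurrent layer.

    import torch
    import torch.nn as nn

    class NavigationNet(nn.Module):
        def __init__(self, num_actions=3):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(3, 16, 8, stride=4), nn.ReLU(), nn.GroupNorm(4, 16),   # 84x84 -> 20x20
                nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(), nn.GroupNorm(4, 32))  # 20x20 -> 9x9
            self.fc = nn.Sequential(nn.Linear(32 * 9 * 9 + 6, 256), nn.ReLU())
            self.lstm = nn.LSTMCell(256, 256)
            self.policy = nn.Linear(256, num_actions)    # turn right / move forward / turn left
            self.value = nn.Linear(256, 1)

        def forward(self, frame, target_pos, state):
            x = self.conv(frame).flatten(1)
            x = self.fc(torch.cat([x, target_pos], dim=1))  # fuse one-hot target position
            h, c = self.lstm(x, state)
            return self.policy(h), self.value(h).squeeze(-1), (h, c)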

IV. TRAINING APPROACH
The training is divided into two different phases: the learning phase of the navigation network and the learning phase of the object localization network. These two phases are completely independent of each other, hence each network is trained separately.

Fig. 4. Training of the two networks. (a) Object localization network: the triplet loss (2) is computed on the extracted features, which are then further processed by the last two convolutional layers and the fully connected one to produce the two probability vectors. (b) Navigation network: it inputs the current frame and the one-hot vector from the engine; the features extracted by the two convolutional layers are used to make depth estimation, and concatenated to the vector to produce the policy and the value function.

A. Object Localization Training Phase
The object localization network training is posed as a similarity metric learning problem [see Fig. 4(a)]. We use the dataset collected in the capture-level (see Section V-B), whose samples consist of triplets of images. Each triplet contains: the picture of the goal, an image in which the goal is visible, and another one in which it is not. These three 224 × 224 pictures are first preprocessed by a ResNet-50, and then fed into the first three convolutional layers. At this point, a triplet loss function is computed as follows [47]:

l_t = max(0, ||g − f⁺||² − ||g − f⁻||² + m)     (2)

where m is a positive constant which controls the margin, g represents the features extracted from the goal image, and f⁺ and f⁻ the features extracted from the picture where the target is visible and where it is not, respectively. This loss allows the model to develop a concept of similarity. The three descriptors are then concatenated together two by two:

r₁ = [g, f⁺],   r₂ = [g, f⁻].

Finally, r₁ and r₂ are separately processed by the last two convolutional layers and the fully connected one, producing the two probability vectors, p₁ and p₂. The loss we use is a weighted cross entropy (CE) loss, adapted to our specific task. In particular, in the classical CE loss formulation, every classification error weighs the same. On the contrary, we argue that for our specific task, the value of the loss should increase proportionally to the distance between the estimated and the true object location. For that reason, we modify the CE formulation for our localization loss:

l_c = −(1 + d) log(p*_n),   d = |n − k|     (3)

where p*_n is the nth element of the probability vector (either p₁ or p₂), n corresponds to the true location of the target object, and k corresponds to the most likely position according to the network, i.e., k = argmax_j p*_j. So, the overall loss for a single triplet is given by the sum of the components in (2) and (3):

l = l_t + l_c(p₁) + l_c(p₂).
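For illustration, the two training losses, in the forms reconstructed above, can be written in a few lines of PyTorch (batched tensor shapes and the small numerical epsilon are assumptions).

    import torch

    def triplet_loss(g, f_pos, f_neg, m=0.1):
        # Eq. (2): goal/positive features should be closer than goal/negative by margin m.
        d_pos = (g - f_pos).pow(2).sum(dim=1)
        d_neg = (g - f_neg).pow(2).sum(dim=1)
        return torch.clamp(d_pos - d_neg + m, min=0.0).mean()

    def localization_loss(p, n_true):
        # Eq. (3): cross entropy weighted by the distance between predicted and true class.
        k = p.argmax(dim=1)
        d = (k - n_true).abs().float()
        p_n = p.gather(1, n_true.unsqueeze(1)).squeeze(1)
        return (-(1.0 + d) * torch.log(p_n + 1e-8)).mean()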

B. Navigation Training Phase
The navigation network is trained via RL using IMPALA. It requires two main entities: the actor, which runs on the CPU, and the learner, which works on the GPU. Both actor and learner share the same network parameters. The task of the actor is to collect trajectories of experiences through the interaction with the environment, while the learner processes them to update the network. In our specific implementation, we use 16 actors and one learner.
1) Actor: During the learning phase, the navigation network is completely separated from the object localization network [see Fig. 4(b)]. Each IMPALA actor is placed in a different maze (see Section V-A for details), and for every action a_t, it receives from the environment a reward r_t and a new observation o_t. r_t is always 0, except when the agent reaches the goal, in which case it receives +1. o_t is composed of the target position and the current RGB frame, which, during learning, are both generated by the game engine itself.
Once an actor has performed a predefined number of steps, it sends its trajectory to the learner, which processes it to update the network.
2) Learner: The learner is responsible for loss computation and parameter updates. The first loss is relative to depth estimation. According to [14], if this loss shares model weights with the policy, which is true for our architecture, this auxiliary task can be used to speed up learning and to achieve higher performance. Since only the ceiling and floor are located in the upper and lower areas of the frame, we compute the depth loss only on the central 80 × 40 pixels. The loss we use for that is a pixelwise mean squared error between the predicted depth d_p and the one provided by the engine d_e:

l_d = (1/|C|) Σ_{(i,j)∈C} (d_p(i,j) − d_e(i,j))²     (4)

where C denotes the central 80 × 40 pixel region. The features extracted by the convolutional layers are concatenated with the one-hot input vector provided by UE4 [see Fig. 3(b)], and processed by the network, which finally outputs the policy and the estimated value, as described in Section III-B2.
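A possible implementation of (4), restricted to an assumed centered 80 × 40 crop of the predicted and ground-truth depth maps, is sketched below.

    def depth_loss(depth_pred, depth_gt, crop_h=40, crop_w=80):
        # depth_pred, depth_gt: (B, H, W) tensors; MSE over a central crop_h x crop_w window.
        _, H, W = depth_gt.shape
        top, left = (H - crop_h) // 2, (W - crop_w) // 2
        p = depth_pred[:, top:top + crop_h, left:left + crop_w]
        g = depth_gt[:, top:top + crop_h, left:left + crop_w]
        return (p - g).pow(2).mean()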
To speed up learning, we use an experience replay [48], which is a buffer of trajectories shared among actors. Therefore, at every update, two trajectories are randomly picked from the experience replay, batched together with the actor's current one, and processed in parallel in a single pass.
All the trajectories are used to compute (4) and the following three different losses. Before describing them, it is important to clarify that the value IMPALA assigns to a state is different from the one computed with the classical Bellman equation [see (1)]. Due to the lag between the time when trajectories are generated by the actors and when the learner estimates the gradient, they actually follow different policies, μ (behavior policy) and π (target policy), respectively. Therefore, the learning takes place off-policy and this must be taken into account when estimating the value. IMPALA uses V-trace targets v_t for that purpose, which are a generalization of the Bellman equation for off-policy learning. The first loss we report needs to fit those V-trace targets v_t:

l_V = (v_t − V_θ(o_t))²     (5)

where V_θ(o_t) is the estimated value, parameterized by θ, for the current observation o_t. The second is relative to the policy π:

l_π = −ρ_t log π(a_t|o_t) (r_t + γ v_{t+1} − V_θ(o_t))     (6)

in which ρ_t = min(ρ̄, π(a_t|x_t)/μ(a_t|x_t)) is one of the truncated importance sampling weights. Importance sampling is a well-known technique for estimating properties of a particular distribution, while only having samples generated from a different distribution (μ in our case) than the distribution of interest (π in our case). However, it may suffer from instability because, in some conditions, the policies can diverge from each other, resulting in extremely high weights π(a_t|x_t)/μ(a_t|x_t). To reduce the variance of the gradient estimate, the weights are clipped at ρ̄ = 1.
It should be noticed that with our architecture π_θ and V_θ share the same parameters θ; however, in general, they can be different. The third loss consists of a bonus for entropy in action selection, and is employed to avoid premature convergence:

l_H = Σ_a π(a|o_t) log π(a|o_t)     (7)

i.e., the negative entropy of the policy. This third component is especially useful during the first training steps, because it balances exploration and exploitation, allowing the agent to sufficiently explore the MDP before converging. For a detailed description of the V-trace algorithm and the last three losses, see the Appendix and [44].
All the aforementioned losses (4)-(7) define the overall loss function for the parameter update:

l = l_d + l_π + b l_V + c l_H

where b and c are the baseline and entropy costs, respectively. The new weights are then returned to the actor, which starts a new trajectory.
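Putting the pieces together, a sketch of the learner update under the loss combination written above might look as follows; the V-trace targets, advantages, and truncated IS weights are assumed to be precomputed (e.g., as in the Appendix), and the values of b and c are illustrative, not taken from the paper.

    import torch
    import torch.nn.functional as F

    def learner_loss(policy_logits, values, actions, v_targets, pg_advantages, rhos,
                     depth_pred, depth_gt, b=0.5, c=0.01):
        # (4) depth loss (the central 80 x 40 crop restriction is omitted here for brevity).
        l_d = (depth_pred - depth_gt).pow(2).mean()
        # (5) fit the V-trace targets with the estimated values.
        l_v = (v_targets.detach() - values).pow(2).mean()
        # (6) policy-gradient loss weighted by the truncated IS weights,
        #     with pg_advantages = r_t + gamma * v_{t+1} - V(o_t).
        log_pi = F.log_softmax(policy_logits, dim=-1).gather(-1, actions.unsqueeze(-1)).squeeze(-1)
        l_pi = -(rhos.detach() * log_pi * pg_advantages.detach()).mean()
        # (7) negative entropy, discouraging premature convergence when minimized.
        probs = F.softmax(policy_logits, dim=-1)
        l_h = (probs * torch.log(probs + 1e-8)).sum(dim=-1).mean()
        return l_d + l_pi + b * l_v + c * l_h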

V. ENVIRONMENT
To train our DRL agent on the target-driven visual navigation task, we use 3-D virtual environments only. One of our main focuses is to design an algorithm capable of generalizing to real-world scenarios. For this reason, we choose the UE4 graphics engine, which is one of the most recent and best known. It is widespread in the game industry, and many companies use it for its photorealism and customizability.
We have designed two different levels: 1) Maze-level: level where the navigation network has been trained; 2) Capture-level: level used to collect the images to train the object localization network. In the following sections, we describe both levels in detail.

A. Maze-Level
In Fig. 5, a top-down view of this level is shown. It is composed of several separate 3 × 3 mazes, one for each IMPALA actor. Every time an actor hits its target or reaches the maximum number of allowed steps, its corresponding maze is regenerated and the actor and its goal are respawned. To avoid overfitting on a set of specific maze configurations and actor-to-target paths, we choose to generate mazes and to spawn actors/goals completely at random.
Most importantly, contrary to the approaches in [21] and [22], in which the DRL agent is trained on the same scenes where the objects it has to find are located, we decouple navigation from target localization. To this aim, we use a goal invisible to the navigation network. The only way it has to reach the target is to follow the six-element one-hot vector, generated by the UE4 engine itself, which indicates its relative position. In this way, the agent can be trained with no bias w.r.t. the objects.
In order to make the navigation network directly transferable from a simulated environment to a real one, we use domain randomization (DR) [49]. It is a simple yet effective way to achieve sim-to-real transfer, which has been successfully applied in other robotic tasks [20], [50]. It consists in randomizing the training environment settings in order to make the system more robust to domain changes. To do that, every time an actor hits its target, we change at random the following parameters: "maze wall heights," "maze wall textures," "maze floor textures," "light color," "light intensity," and "light source angle." It should be noted that applying domain randomization to the SotA architectures in [21] or [22] is, in theory, possible. However, as we show in Section VI-B, this makes training less efficient without improving the performance. As detailed in the following sections, the reasons behind those results can be mainly ascribed to the high complexity of their architectures, which do not decouple the navigation and the target recognition tasks. On the contrary, our approach relies upon two totally separable components (i.e., the navigation and the target recognition networks). Therefore, the navigation network can be much smaller, considerably reducing the training time and increasing its effectiveness. It is also important to emphasize that this in no way compromises the accuracy in locating objects, as the object recognition network can be arbitrarily complex.
Our actors can perform three possible actions: "turn right," "move forward," and "turn left." To simulate the uncertainty of a real robot's motion, we inject uniform noise into the speed and angle of our actors' movements.
As the training of the two networks takes place separately, the input vector of the navigation network is not the one produced by the object localization network, but is generated by UE4 itself. In order to make the navigation network more robust against possible classification errors of the localization network, we ensure that the one-hot vectors generated by UE4 have a 10% probability of being wrong (in that case, the position is picked uniformly from all classes).
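The corruption of the ground-truth position can be reproduced with a few lines (the helper name is illustrative):

    import random

    def corrupted_position(true_class, num_classes=6, p_error=0.10):
        # With 10% probability, replace the true class with one drawn uniformly from all classes.
        if random.random() < p_error:
            return random.randrange(num_classes)
        return true_class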

B. Capture Level
Contrary to the navigation network, which is trained via reinforcement learning, the object localization network is trained in a supervised manner. We collect a dataset of synthetic images from the UE4 game engine in the following way. We place a camera in a fixed position and spawn randomly generated objects in its field of view. Every time the camera takes a picture, the current objects are replaced with randomly generated ones. Like for the Maze level, we follow the approach in [49] to achieve good generalization capabilities across domains.
For each image, we get the relative position of each object with respect to the camera from the engine. We discretize the position into five classes: "extreme right," "right," "center," "left," and "extreme left." Then, we download from the web pictures depicting goal objects, and associate each of them with two images. Each entry of the dataset is then composed of: an image of the target, a capture where the target is not present, a capture where it is, and its relative position in that capture.
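Schematically, one entry of the resulting dataset can be thought of as the following record (field names are illustrative, not the authors'):

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class CaptureSample:
        goal_image: np.ndarray        # 224x224x3 picture of the target, downloaded from the web
        positive_capture: np.ndarray  # 224x224x3 synthetic capture containing the target
        negative_capture: np.ndarray  # 224x224x3 synthetic capture without the target
        position_class: int           # 0..4: extreme right, right, center, left, extreme left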

VI. EXPERIMENTS AND RESULTS
In the experiments, we measure the performance of our target-driven visual navigation system in unseen environments. In particular, we analyze our agent's ability to explore the surrounding environment and to reach the designated target. To do that, we design three kinds of tests, which are described in the following sections. Furthermore, we propose an ablation study to examine the benefit of the auxiliary depth estimation loss on the agent performance. Finally, to verify the generalization capability of our algorithm, we use it in a complex real setting with a real robot.

A. Training Details
We train the navigation network with 16 actors for 70 million steps using SGD without momentum, with learning rate 0.0005 and batch size 8. The specific parameter settings of IMPALA and the Maze-level that are used throughout our experiments are given in detail in Tables I and II, respectively. The object localization network is trained with the first 540 000 samples of our synthetic dataset for 50 epochs with learning rate 0.0025, batch size 128, and margin constant m equal to 0.1. We use another 90 000 samples to implement early stopping and to select the best model parameters. All the details on the parameters used for the capture-level are reported in Table III.
The source code is publicly available and can be found at 2 .

B. Simulated Experiments
To measure the performance of our target-driven visual navigation system in unseen environments, we run three types of tests in simulation: one to check the agent's ability to explore the surrounding environment, and the other two to verify whether it is able to locate and reach the designated target.
Validation on target-driven tasks is achieved by comparing our approach against different strategies. First, we show that our method learns effective navigation policies in contrast to a random agent (RA) that picks actions by following a uniform random probability distribution. Furthermore, we show comparisons against the target-driven navigation model (TDNM), presented in [21], and the active object perceiver (AOP), introduced in [22]. We consider two different versions for each of the two SotA baselines, which differ from each other in the way they are trained. 1) Without DR: In these experiments, we evaluate the performance of the two baselines as originally presented in [21] and [22]. Specifically, both models are trained in 16 3 × 3 mazes, as for our method, but with no domain randomization (i.e., maze configurations, textures, lights, targets, etc., are all fixed from the beginning of training). A different scene-specific layer is used for each one of the 16 mazes, following the training protocol proposed by the authors. 2) With DR: This second set of tests aims to analyze the effect of domain randomization on the SotA approaches. However, since they both employ scene-specific layers, DR cannot be straightforwardly applied. As originally conceived in [21] and [22], each scene-specific layer is trained in a single fixed scene. By applying DR on each of the 16 mazes, the scenes change at the end of each episode, which raises two issues: 1) a huge number of scene-specific layers would be generated throughout the training process and 2) each scene-specific layer would be trained for just one episode, preventing the whole network from learning properly. To overcome these issues and evaluate the SotA baselines with DR, we decided to share the scene-specific layer across all the agents and scenes, as for the generic layers. Nevertheless, it should be highlighted that such a strategy differs from the original methods proposed in [21] and [22]. For a fairer comparison, we avoid training the object recognition network of the AOP (both with and without DR) from scratch, preferring to directly feed it with the ground truth bounding boxes provided by the engine.
Finally, at the end of this section, we present an ablation study to evaluate the effects of the auxiliary depth estimation loss l d on the agent performance.

1) Exploration Experiment:
In this experiment, we place the agent in the center of a 20 × 20 maze, which is much larger than the 3 × 3 mazes in which it is trained. We give it 180 s to explore it as much as possible. At the end of the episode, we measure the percentage of the maze it has discovered. It is important to say that with this amount of time, it is impossible to explore the entire maze.
In Fig. 6, four sample trajectories in a 20 × 20 maze are shown. In these pictures, we can see that the agent usually follows the right wall of the maze. The wall follower is a well-known algorithm to solve mazes: for simply connected ones, it guarantees that the agent does not get lost and does not walk the same path more than twice. The fact that the agent, even if trained in small 3 × 3 mazes only, has been able to develop this algorithm is remarkable. We argue that this can be explained by the reward function we designed. Specifically, the agent receives a positive reward only when it reaches the target (which is unknown to the agent at the beginning of the exploration). Therefore, it is encouraged to explore the maze as fast as possible. This implies that revisiting already inspected locations should be avoided, which is exactly what the wall-following policy does. Having developed such a strategy suggests that our model extracts the fundamental features to explore mazes of any size. As proof of this, we follow the approach in [51] to visualize the saliency map of the network. In Fig. 7, as expected, it can be seen that the gradient is much stronger on the right wall, especially on the margins and edges.
Fig. 6. Four examples of agent trajectories in a 20 × 20 maze with some variants of floor and wall textures. It should be noticed that the agent learns to follow the right wall, which is an optimal rule to solve simply connected mazes.

Another piece of evidence that the agent has developed a certain awareness of the task and a concept of exploration comes from the analysis of the value function. In Fig. 8, we can see sample frames with the corresponding estimated values. When the agent finds a corner, it knows that it has no visibility behind it, and that there is an unexplored area that could hide the object it has to find. This is clearly visible from its estimate of the value function, which shows peaks precisely in correspondence of the corners [see Fig. 8(a) and (c)]. On the other hand, there is an opposite behavior in correspondence of dead ends, to which the agent assigns very low values [see Fig. 8].

We measure our agent performance in four different mazes, each of them generated by a different seed, and with four levels of light intensity (see Fig. 9). For each maze-light setting, we randomly pick three floor and wall textures. For each of the 48 possible combinations, we average the results over 3 runs. In Table IV, the model scores are reported as percentage of explored area, w.r.t. the maze random seed and the light intensity levels, in comparison with human and RA performances. As can be seen, the agent exploration ability is highly affected by the light conditions; in particular, the performance drops for the very low value of 0.5, for which the wall contours are barely visible. It should be noticed that the two extreme values are both outside the range used during training, but only the lower one caused poor performance. We suppose that the main problem is not overfitting to the training distribution, but the difficulty of distinguishing the wall contours in dark frames. Indeed, also the score for intensity equal to 3, which is still in the training range, is quite low.
The comparison between our method and RA (see Table IV) shows that our agent learns a navigation policy (i.e., the wall follower strategy) that proves to be far superior and more efficient than a random explorer, as expected.
In Fig. 10, very large score ranges can be seen for every light intensity level. The agent performance oscillates considerably, and with some particular wall textures it reaches rather low levels. However, in the same figure, we can also see that in some other settings our agent can get close to human performance, which is impressive, considering that it is trained in very small 3 × 3 random mazes only.
2) Target-Driven Experiment (5 × 5): In this second experiment, we place our agent in a 5 × 5 maze, which ends in a room with three different objects, including the target. This experiment is naturally divided into two phases for the agent: it has first to explore the maze to find the room, then it has to distinguish the target from the other objects, locate it, and approach it. An episode ends when the agent reaches the target (i.e., when they collide) or when 90 s have elapsed. In Fig. 12, an example of our agent's trajectory is shown; in that case, the goal is the red chair.
We test all the nine different objects used to train the object localization network, together with three previously unseen object classes ("Can," "Extinguisher," and "Boot"), averaged over six runs each. To measure the agent performance, we use three metrics: the time (in seconds) needed to reach the goal, the success rate (in percentage), and the Success weighted by (normalized inverse) Path Length (SPL), introduced in [52]. The SPL is calculated as follows:

SPL = (1/N) Σ_{i=1}^{N} S_i · ℓ_i / max(p_i, ℓ_i)

where N is the number of test episodes, S_i is a binary indicator of success in episode i, ℓ_i the shortest-path distance from the agent starting point to the target in episode i, and p_i the length of the path actually taken by the agent in episode i.

[Figure caption: Plots show that the agent can achieve high scores with good light intensity values. However, the performance drops rapidly when the brightness is reduced. The wide score ranges that can be seen in all the figures are caused by particular wall textures with which our agent produces poor performance.]
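For reference, the SPL metric can be computed as in the following sketch (inputs are per-episode lists; S_i is the binary success indicator):

    def spl(successes, shortest_paths, actual_paths):
        # successes: S_i in {0, 1}; shortest_paths: l_i; actual_paths: p_i (same length N).
        n = len(successes)
        return sum(s * l / max(p, l) for s, l, p in zip(successes, shortest_paths, actual_paths)) / n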
The results of all the models are summarized in Table V, in comparison to human and RA performance. As shown by the results, neither the TDNM [21] nor the AOP [22] is able to generalize to the test environment. Since the test environment is much harder than the training scenarios, an adequate exploration policy is essential to complete the task. Without such a policy, it is extremely unlikely to ever reach the room with the objects, as confirmed by the RA results.
We argue that the SotA baselines are not able to effectively explore the mazes, even if trained with domain randomization, for the following reasons. First, both baselines are originally conceived to be quickly fine-tuned on previously unseen environments, thanks to the scene-specific layers they implement. Therefore, they are not specifically designed to directly generalize to new scenarios, contrary to our approach. Second, TDNM and AOP rely upon a single complex architecture to simultaneously address exploration and target recognition. Hence, their optimization is more difficult and it is harder to achieve navigation capabilities that generalize over test scenarios that are wider and more complex than those used for training. On the contrary, our approach decouples the two subtasks and favors the generalization of exploration policies over other environments.
In addition, as shown in Fig. 11, our approach is also more efficient. The plot represents the average episode reward obtained by our method and the other SotA baselines (with domain randomization) during training, as a function of time. In particular, our model can complete its 70 million steps of training in about 156 h, while TDNM and AOP require more than 380 h to perform roughly 40 million steps (we decide not to continue further since the curves do not show any signs of improvement for several million steps). Hence, our approach is considerably faster than [21] and [22], and this further suggests that those approaches are not particularly suited for direct sim-to-real transfer. On the contrary, as can be observed in Table V, our model produces reasonable results, successfully reaching the targets most of the time. There are particular object meshes, however, for which the agent is not able to complete the task before the end of the episode. In those cases, the agent finds the room but is not able to approach the target. Most of the time, this is due to the uncertainty of the object localization network, which is unable to locate some object meshes continuously and consistently over time.

Fig. 11. Average episode reward during training as a function of time, expressed in hours. Our model (blue line) is trained for 70 million steps, while the other approaches considered ([21] and [22], trained with domain randomization) for roughly 40 million. The figure proves that our method can reach superior final performance and also be much more efficient to train (more than four times faster compared to the TDNM). Notice that the difference in terms of training time between the TDNM [21] and the AOP [22], which employ very similar architectures, is related to the need to generate ground truth bounding boxes with the simulation engine.

Fig. 12. Example of a trajectory followed by our agent in the target-driven (5 × 5) experiment. The environment is a 5 × 5 maze, which ends in a room with three different objects, including the target (a chair, in this case).
Since our object localization network is trained by using a similarity metric, we expect a certain ability to generalize to previously unseen objects classes. From Table V (see "Can," "Extinguisher," and "Boot"), it can be seen that the agent successfully reaches two ("Extinguisher" and "Boot") of the three new object classes. In particular, the poor results with the "Can" are due to frequent false positives of the object localization network that, misleading the navigation component, prevent the agent from exploring the environment properly.
3) Target-Driven Experiment (20 × 20): Unlike the previous experiment, in this case, the test maze is a much larger 20 × 20 maze and the time limit is increased to 300 s. The purpose is to assess the ability of the object localization network and the navigation network to collaborate in situations where the distance to be covered is much longer. Considering the poor performance of the two baselines in the 5 × 5 maze, in Table VI, the agent results are reported in comparison with human performance only. As expected, it can be noticed that the time needed to reach the target increases significantly, while both the success rate and the SPL considerably decrease. In particular, the score highly depends on the first turns the agent chooses to take, since the maze is so large that it is rather difficult to get back on the right path within the time limit. This consideration is also valid for humans who, as can be seen, sometimes fail to complete the task. The size of the maze also implies a rather low SPL, for both agent and human. In fact, while the shortest path from the starting position to the target is not particularly long, the actual distance covered by the agent/human can be extremely large.

4) Ablation Study on Depth Estimation:
In this paragraph, we aim to evaluate the effects of the auxiliary depth estimation loss, proposed in [14], on the overall performance of the agent. To this end, we train a second version of our model (using the same training protocol as in the other experiments) without the auxiliary depth estimation loss. In Table VII, the comparison results between the two models in the test 5 × 5 maze are shown. As can be seen, the agent trained with the depth loss shows better performance in terms of success rate and, on average (last column of Table VII), appears to be more than twice as fast in completing the task. By analyzing the navigation trajectories computed by the agent trained without the depth loss, we notice that it does not seem to follow any specific strategy and the exploration appears much more random.

Therefore, the results suggest that depth estimation could be helpful to develop a robust navigation policy. In fact, Fig. 13 shows some depth images produced by the deconvolutional network of the model. Since the main task is navigation and our aim is to teach the agent only the basic concept of depth, the images produced are not very accurate. Nevertheless, it can be clearly seen that the agent is able to distinguish the right wall from the rest of the image, and we suppose that this may have encouraged the development of the wall-following policy.

C. Real Experiments
To test our model performance in real settings, we build several 4 × 4 mazes, both indoor and outdoor. In particular, we use seven different maze configurations where we place our robot and the target objects (see Figs. 14 and 15). Due to the high complexity of the background and light conditions, we argue that the performance of the algorithm in a specific maze is probably affected by the position of the maze itself. To check this hypothesis, we test our agent in one of the seven maze configurations several times, in two different orientations [see Figs. 14(a), (b), and 16].

Fig. 13. Three examples of depth images produced by our agent. Although depth estimation is only an auxiliary loss, and hence the images produced are not accurate, it is still possible to identify the right wall.
As for the experiments in simulation, we test both the exploration and the target detection capabilities of our model. In real settings, we measure our agent performance in three different types of tests: 1) The robot and the target are both placed randomly in the maze, and the goal of the agent is to reach the target as fast as possible. The run ends when the robot approaches the object or when the maximum number of 1000 steps is reached; 2) The robot is placed randomly in the maze, and its objective is just to explore as much as possible. In this case, the episode ends when the maximum number of 1000 steps is reached; 3) The robot is placed inside a simple maze, together with three objects, including the target (see Fig. 15). The purpose of this experiment is to verify the agent's ability to distinguish partially occluded objects, locate the goal, and approach it. As in the first test, the run ends when the robot reaches the target or after 1000 steps. Considering all the maze configurations, locations, robot-target positionings, and test types, we run a total of 84 experiments.
The robot we use in all the experiments has a substantially different shape from the avatar used during training, but performs the same actions as the agent in simulation: "turn right," "move forward," "turn left."

1) Indoor Experiments: We measure our agent performance in all three kinds of tests. In the top row of Fig. 14, the maze configurations used for the first two indoor tests are shown. We can see in Tables VIII and IX that, on average, the robot succeeds 46% of the time, while exploring roughly half of the maze. We expect a strong correlation between the success rate in the first experiment and the exploration performance. However, the results show that the former is slightly inferior to the latter. This is probably caused by the robot/target positioning. In fact, we ensure that the agent and the target are at a reasonable distance, and, in particular, the latter is placed in areas less likely to be explored. As a result, the success rate is slightly lower.

In the third experiment, we consider one maze configuration and five targets to reach: "Monitor," "Trash," "Microwave," "Bottle," and "Lamp." We make four different configurations (C1-C4) with three objects each (see Fig. 15). For each of them, we run an episode, hence 12 in total.
From the results, reported in Table X, we can say that our agent is able to recognize and reach the objects 75% of the time. It is important to note that our model has never seen any real object, not even the ones we use in the experiments. Interestingly, it is able to approach the "Microwave" every time it is specified as a goal, contrary to what happens in simulation, where it always fails. In this regard, we believe that the use of a pretrained ResNet-50 plays an important role.
2) Outdoor Experiments: We repeat type 1 and type 2 tests to measure our agent performance also in outdoor mazes [see Fig. 16(b)]. The bottom row of Fig. 14 shows the configurations used. From the results, summarized in Table VIII, a slight degradation in performance can be seen with regard to the first type of test. The exploration capabilities of the agent, on the other hand, remain practically unchanged, as reported in Table IX. It appears that, despite the great difference in lighting and background between indoor and outdoor settings, the algorithm's performance is consistent.

VII. CONCLUSION
In this article, we introduced a new framework to address the challenging task of target-driven visual navigation. Through extensive experimentation, in both simulated and real mazes, we showed that a direct sim-to-real transfer for this problem is possible. The proposed two-network architecture, indeed, not only proved capable of reaching previously unseen objects in much larger mazes than those in which it was trained, but also showed a good ability to generalize in real scenarios.
However, although the results are encouraging, there are still a number of open problems and many aspects to improve. In particular, the navigation performance was fluctuating, and for some combinations of light and textures, the agent was not able to achieve satisfactory results. In this regard, we believe that the use of techniques designed to discover and analyze the weaknesses of DRL agents, such as [53] and [54], can be an effective way to prevent particularly bad behaviors and improve overall performance.
Furthermore, as reported at the end of Section VI-B, we observed temporal inconsistency in locating some instances of objects. We attribute this issue to the way the object localization module was trained. We think that a standard supervised learning on uncorrelated images is not suitable for this task, and we leave for future work the development of a different learning strategy that takes into account the frames' temporal dependence.

APPENDIX

V-trace is an off-policy actor-critic algorithm, introduced in [44]. Off-policy algorithms are applied when there is the need to learn the value function V^π of a policy π (target policy) from trajectories generated by another policy μ (behavior policy). In the case of IMPALA, although both the actors and the learner are initialized with the same neural network and same weights, the models differ from each other after the first step of the computation. In fact, due to the asynchronous nature of the framework, actors can lag behind each other and the learner even by numerous updates. For this reason, an off-policy algorithm is required.
It is important to make clear that the behavior policies are followed by the actors only, while the target policy is parameterized by the learner model. For this reason, all the following V-trace computational steps are performed by the learner only.
First of all, consider the trajectory (x_t, a_t, r_t)_{t=s}^{t=s+n} collected by an actor with its policy μ. The n-steps V-trace target for the value approximation V(x_s) at state x_s is then defined by

v_s = V(x_s) + Σ_{t=s}^{s+n−1} γ^{t−s} ( Π_{i=s}^{t−1} c_i ) δ_t V     (8)

with

δ_t V = ρ_t (r_t + γ V(x_{t+1}) − V(x_t))     (9)

where ρ_t = min(ρ̄, π(a_t|x_t)/μ(a_t|x_t)) and c_i = min(c̄, π(a_i|x_i)/μ(a_i|x_i)) are truncated importance sampling (IS) weights, and we assume that ρ̄ ≥ c̄.
The use of the V-trace target is necessary because of the lag between the actors and the learner, which makes the learning off-policy. For this reason, the Bellman equation cannot be applied as it is, but it needs to be adapted. V-trace uses the truncated IS weights to correct the estimation error caused by the discrepancy between the policies. However, it should be noticed that the V-trace target is just a generalization of the on-policy n-steps Bellman target. In fact, when π = μ and c̄ ≥ 1, every c_i = 1 and ρ_t = 1, therefore (8) reduces to the on-policy n-steps Bellman target. The main difference is the presence of the truncated IS weights ρ_t and c_i. ρ_t determines the fixed point of the update rule, since it appears in the definition of δ_t V (9). The fixed point is the value function V^{π_ρ̄} of some policy π_ρ̄, defined by

π_ρ̄(a|x) = min(ρ̄ μ(a|x), π(a|x)) / Σ_{b∈A} min(ρ̄ μ(b|x), π(b|x)).
This means that for ρ̄ = ∞ this is the value function V^π of the target policy π. Conversely, choosing ρ̄ = 0, the fixed point becomes the value function V^μ of the behavior policy μ. In general, for nonzero finite ρ̄, we obtain the value function V^{π_ρ̄} of a policy π_ρ̄ that lies between the behavior and the target policy. Hence, the larger ρ̄, the smaller the bias in the estimation, but the greater the variance. The product of the coefficients c_i in (8) determines the importance of a temporal difference δ_t V observed at time t on the update of the value function at a previous time s. The more different the two policies are, the higher the variance of this product is. To avoid that, we clip the coefficients at c̄; in this way, we reduce the variance without affecting the solution to which we converge. In summary, ρ_t controls the value function we converge to, and c_i the convergence speed to this function.
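As an illustration, the n-steps V-trace targets of (8) and (9) can be computed with the standard backward recursion v_s − V(x_s) = δ_s V + γ c_s (v_{s+1} − V(x_{s+1})); the sketch below assumes NumPy arrays and log importance ratios as inputs.

    import numpy as np

    def vtrace_targets(rewards, values, value_next, log_ratio, gamma=0.99, rho_bar=1.0, c_bar=1.0):
        # rewards, values, log_ratio: length-n arrays for steps s .. s+n-1;
        # value_next = V(x_{s+n}); log_ratio[t] = log pi(a_t|x_t) - log mu(a_t|x_t).
        rhos = np.minimum(rho_bar, np.exp(log_ratio))
        cs = np.minimum(c_bar, np.exp(log_ratio))
        values_tp1 = np.append(values[1:], value_next)
        deltas = rhos * (rewards + gamma * values_tp1 - values)   # delta_t V, Eq. (9)
        vs = np.array(values, dtype=float)
        acc = 0.0
        for t in reversed(range(len(rewards))):
            acc = deltas[t] + gamma * cs[t] * acc                 # v_t - V(x_t)
            vs[t] = values[t] + acc
        return vs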
With (8), we can describe the three parameter updates. Recalling that IMPALA is an actor-critic algorithm, it employs two entities, the actor (which produces the policy) and the critic (which estimates the value), to compute the learning updates. Considering the parametric representations of the value function V_θ and the target policy π_ω (notice that θ and ω can be shared, as in our implementation), at training time s, the parameters θ are updated in the direction of

(v_s − V_θ(x_s)) ∇_θ V_θ(x_s)

to fit the V-trace target. This loss is needed to train the critic to judge the behavior of the actor.
The ω are updated along the direction of the policy gradient:

ρ_s ∇_ω log π_ω(a_s|x_s) (r_s + γ v_{s+1} − V_θ(x_s)).
This second loss refers to the actor, which adjusts the policy according to the critic's evaluation. The critic's contribution to the policy update rule is described by the second multiplicative term in the equation. Finally, in order to avoid premature convergence, we can add a third component in the direction of

−∇_ω Σ_a π_ω(a|x_s) log π_ω(a|x_s)

that favors entropy in action selection. This loss can be crucial for a successful training, because by pushing the probabilities of actions to be similar, it induces the agent to keep on exploring the MDP.