Dialogue management using reinforcement learning

Dialogue is widely used for verbal communication in human-robot interaction, for example by assistant robots in hospitals. However, such robots are usually limited to predetermined dialogue, so it is difficult for them to understand new words for a new desired goal. In this paper, we discuss conversation in Indonesian on entertainment, motivation, emergency, and helping, together with a knowledge-growing method. We provided mp3 audio for music, fairy tale, comedy, and motivation requests; the execution time for these requests was 3.74 ms on average. In an emergency situation, the patient is able to ask the robot to call the nurse; the robot records the complaint of pain and informs the nurse. From 7 emergency reports, all complaints were successfully saved in the database. In the helping conversation, the robot walks to pick up the patient's belongings. When the robot does not understand the patient's utterance, it asks until it understands. Through this asking conversation, the knowledge expanded from 2 to 10 items, with the learning execution time growing from 1405 ms to 3490 ms. SARSA converged faster towards the steady state because of its higher cumulative rewards, and both Q-learning and SARSA reached the desired object within 200 episodes. We conclude that the proposed reinforcement learning (RL) method achieves its objective of overcoming the robot knowledge limitation in reaching new dialogue goals for a patient assistant.

end-user/human critiques, labeled as good or bad, as input for the loss function or for evaluating candidate policies.
To build natural communication between human and robot for assistantships, our research team has prior works on speech recognition [9], unclear pronunciation [10], robot walking and pattern generation [11, 12], robot path planning [13], speech and gesture recognition [14], multimodal interaction [15], and rule-based/scenario dialogue management [16]. The problem with the rule-based approach is that it cannot follow dialogue development, so the possibility of a scenario mismatch grows; once the robot does not understand, this method has no choice but to end the dialogue and give a generic answer such as "I don't understand your commands" [17]. This paper addresses overcoming the robot knowledge limitation to achieve new goals through flexible dialogue. We propose a robot-asking method to gather new knowledge from human feedback through conversation. The robot does not stop immediately when it fails to understand; instead it asks first and gathers new knowledge, with the goal of fetching the patient's belongings. We also provide entertainment and emergency requests to complement patient needs. This paper discusses natural language processing, both understanding and generation, in separate sub-sections, the dialogue management method, and the hardware set-up of the robot.

RESEARCH METHOD

Natural language understanding
This paper focuses on goal-driven dialogue. The functions of natural language understanding (NLU) are to extract the raw voice input until the system obtains the information it needs, and to provide dialogue information for dialogue management. NLU includes identification of domain and intent, as well as semantic parsing [18]. The text goes through several processes: a stopword process to remove unnecessary common words; a part-of-speech (POS) tagging process to obtain grammar tags with POStag_idn, using the Indonesian tag set from [19], where we focus on the tags VB (verb), NN (noun), and CD (cardinal number), with the Indonesian TnT-Tagger as the POS tagging method and the Indonesian IDPOSTAG corpus from [20]; a stemming process to obtain the root word by removing affixes [21]; and finally a storing process using the JavaScript Object Notation (JSON) format.
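To make the pipeline concrete, the sketch below shows one way these stages could be chained in Python. It is a minimal illustration, not the authors' implementation: the stopword list, the pos_tag tagger, and the stem function are hypothetical stand-ins for POStag_idn, the TnT-Tagger, and the stemmer used in the paper.

```python
import json

# Hypothetical stand-ins for the paper's components (POStag_idn / TnT-Tagger,
# Indonesian stemmer); a real system would load trained models and corpora.
STOPWORDS = {"yang", "di", "ke", "dan"}            # tiny illustrative stopword list

def pos_tag(tokens):
    """Toy tagger returning (token, tag) pairs; a real TnT tagger is assumed."""
    toy_lexicon = {"ambilkan": "VB", "pensil": "NN", "dua": "CD"}
    return [(t, toy_lexicon.get(t, "NN")) for t in tokens]

def stem(word):
    """Toy affix removal, e.g. 'ambilkan' -> 'ambil'; a real stemmer is assumed."""
    return word[:-3] if word.endswith("kan") else word

def understand(utterance):
    tokens = [t for t in utterance.lower().split() if t not in STOPWORDS]
    tagged = [(stem(t), tag) for t, tag in pos_tag(tokens)
              if tag in {"VB", "NN", "CD"}]        # keep verb, noun, cardinal number
    frame = {"verb": None, "object": None, "count": None}
    for word, tag in tagged:
        if tag == "VB" and frame["verb"] is None:
            frame["verb"] = word
        elif tag == "NN" and frame["object"] is None:
            frame["object"] = word
        elif tag == "CD":
            frame["count"] = word
    return json.dumps(frame, ensure_ascii=False)   # stored in JSON format

print(understand("ambilkan pensil"))  # {"verb": "ambil", "object": "pensil", ...}
```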

Dialogue management
Dialogue management (DM) consists of tracking the dialogue state and generating actions. Approaches to the dialogue management problem include graph-based dialogue, frame-based dialogue, and statistical approaches [22]. Human involvement in the DM framework has been successfully carried out in previous studies [23]. In [24], a reward as feedback from an expert (which can take the form of a negative or positive reward) was given to optimize the policy in reinforcement learning (RL). RL remains the main instrument for DM and is the current mainstream technology for solving real-world problems with a large-scale belief state space [18].
Before RL can be explained, it is necessary to understand its basic components. In RL, a learner called an agent studies its behavior by selecting actions in an environment [25]. At each time step $t$, the agent receives a representation of the state $s_t \in S$, where $S$ is the set of states. The agent picks an action $a_t \in A$, where $A$ is the set of possible actions the agent can take. In return for its action, the agent receives a reward $r_{t+1} \in R$ and moves to a new state $s'$. Here $\alpha$ is the learning rate, $\gamma$ is a discount factor, and $\pi$ is the policy that defines how the agent responds from a specific state. The aim of the agent is to select the optimal actions by maximizing its cumulative discounted reward.
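For completeness, the cumulative discounted reward (the return) that the agent maximizes can be written in its standard form; this expression is not numbered in the paper and is added here only as a reminder of the definition behind the value functions used below.

```latex
% Return: discounted sum of future rewards from time step t
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}, \qquad 0 \le \gamma \le 1
```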
In this paper, we use RL with the temporal difference (TD) learning method. TD learning fuses the benefits of Monte Carlo methods and dynamic programming, as shown in (3) and (4). On one side, Monte Carlo methods need no model of the environment's dynamics, as shown in (1), so TD can learn from raw experience. On the other side, dynamic programming (2) does not need to wait until the final outcome, so TD is able to update estimates based on other, partially learned estimates [26]. Recall Monte Carlo:

$V(s_t) \leftarrow V(s_t) + \alpha \left[ G_t - V(s_t) \right]$  (1)

Recall dynamic programming:

$V(s_t) \leftarrow \mathbb{E}_{\pi} \left[ r_{t+1} + \gamma V(s_{t+1}) \right]$  (2)

TD makes an update of the form

$V(s_t) \leftarrow V(s_t) + \alpha \left[ \mathrm{target}_t - V(s_t) \right]$  (3)

and, given a transition $(s, a, r, s')$, the TD(0) update is

$V(s) \leftarrow V(s) + \alpha \left[ r + \gamma V(s') - V(s) \right]$  (4)

The value function, usually also called the state-value function $V(s)$, is the total amount of expected reward that an agent can collect from that state to the end of the episode. The action-value function $Q(s, a)$ is the total amount of expected reward from taking an action in the state until the end of the episode. The way the agent learns the best policy is called the update policy, and the way the agent behaves is called the behavior policy. In this paper, we implement two TD learning methods, one off-policy and one on-policy (Q-learning and SARSA).

Q-learning
In Q-learning, the agent uses the absolute (greedy) policy to learn the optimal policy while it behaves according to another policy. Because the behavior policy is different from the update policy, Q-learning is categorized as off-policy TD control. The Q-value update of Q-learning is shown in (5):

$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$  (5)

From (5) we can see that the update policy, which takes $\max_{a'} Q(s', a')$, is different from the behavior policy that selected $a$ in state $s$. We use the pseudocode from [26] to implement Q-learning in our Python code, as shown in Figure 1 (a).
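Since Figure 1 (a) is not reproduced here, the following is a minimal tabular Q-learning sketch in the spirit of the pseudocode from [26]. The toy corridor environment, its size, and the hyperparameter values are our assumptions for illustration, not the paper's settings.

```python
import random
from collections import defaultdict

# Toy corridor: states 0..N-1, actions 0=left, 1=right, goal at the right end.
# All numbers here (N, alpha, gamma, epsilon, episodes) are illustrative assumptions.
N, ALPHA, GAMMA, EPSILON, EPISODES = 7, 0.1, 0.9, 0.1, 200

def step(s, a):
    s2 = max(0, min(N - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == N - 1 else -0.01), s2 == N - 1  # next state, reward, done

Q = defaultdict(float)

def epsilon_greedy(s):
    if random.random() < EPSILON:                  # behavior policy: explore sometimes
        return random.choice([0, 1])
    return max([0, 1], key=lambda a: Q[(s, a)])    # otherwise act greedily

for _ in range(EPISODES):
    s, done = 0, False
    while not done:
        a = epsilon_greedy(s)                      # behavior policy picks the action
        s2, r, done = step(s, a)
        best_next = max(Q[(s2, 0)], Q[(s2, 1)])    # update policy: greedy max (off-policy)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s2

print({s: max(Q[(s, 0)], Q[(s, 1)]) for s in range(N)})  # learned state values
```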

State-action-reward-state-action (SARSA)
In SARSA, the agent learns the optimal policy and behaves with that same policy. Because the update policy and the behavior policy are the same, SARSA is categorized as on-policy TD control. The Q-value update of SARSA is shown in (6):

$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]$  (6)
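Relative to the Q-learning sketch above, only the update target changes: the next action a' is drawn from the same epsilon-greedy behavior policy and its own Q-value is used in the target, which is what makes SARSA on-policy. Reusing the toy environment, Q table, and hyperparameters assumed above:

```python
# SARSA inner loop (same toy environment, Q table, and epsilon_greedy as above).
for _ in range(EPISODES):
    s, done = 0, False
    a = epsilon_greedy(s)                          # choose the first action on-policy
    while not done:
        s2, r, done = step(s, a)
        a2 = epsilon_greedy(s2)                    # next action from the SAME policy
        target = 0.0 if done else Q[(s2, a2)]      # on-policy target: Q(s', a')
        Q[(s, a)] += ALPHA * (r + GAMMA * target - Q[(s, a)])
        s, a = s2, a2
```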

Knowledge growing
The entertainment purpose consists of musik/playing music audio, dongeng/playing fairy tale audio, and komedi/playing comedy audio. The motivation purpose consists of motivasi/playing motivation audio. The emergency purpose consists of memanggil perawat/calling the nurse and keluhan/reporting a complaint of pain. The helping purpose is taking an object (the patient's belongings). Researchers have emphasized implementing robots that imitate owning memory/knowledge to mitigate many social-robot challenges [27], and some studies have exploited data based on the user profile [28, 29] to make memory-based adaptations. We implement robot asking during the interaction to gather new information from human feedback; Figure 2 (a) is an example of additional knowledge. Entertainment, motivation, and emergency need back-end intervention from an admin to add appropriate content manually. In an emergency, the robot behavior (calling the nurse at a fixed place, where the robot moves, and what the robot says) cannot be changed by the user/patient. Meanwhile, helping is a moving action of the robot that depends on the user's/patient's habit of locating his or her belongings, so it is useful to exploit the end-user experience. Only for this kind of action does the robot grow its knowledge.
The helping conversation can be seen in Figure 2 (a). Grey shades show what the robot does not know; from the conversation we propose, the words in cyan, yellow, and orange shades then appear, which constitute the new knowledge. The new knowledge is saved in the Q-table shown in Figure 2 (b).
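As an illustration of how such acquired knowledge could be persisted, the snippet below stores a verb-object-place triple as a JSON record, matching the JSON storage mentioned in the NLU section; the exact schema and field names are our assumption, not the paper's format.

```python
import json

# Hypothetical schema for one learned "helping" entry; field names are assumptions.
new_knowledge = {
    "verb": "tonton",        # watch
    "object": "remot",       # remote
    "place": {"direction": "kanan", "iterations": 3},  # right, 3 steps
}

with open("knowledge.json", "a", encoding="utf-8") as f:
    f.write(json.dumps(new_knowledge, ensure_ascii=False) + "\n")  # one record per line
```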

Natural language generation
Natural language generation (NLG) is responsible for generating the linguistic realization of the system's dialogue. The goal of NLG is to produce spoken output that is easy for humans to understand. In this paper we have three response systems: rejection, asking, and aborting. Once the system finds that a sentence contains no verb (listed in the corpus) and no unique word, it rejects the sentence and requests other new words until the sentence contains a verb or a unique word. The asking response starts by searching for the verb in the system's knowledge database; if there is no matching verb, the system categorizes it as a new verb with no relation to any object yet. The system then asks for the object and searches for the word in the corpus; if the object is in the corpus, the system searches for it in the database. This is why several verbs can share the same object. After the system has the new verb and new object, it asks for the place; once the system can fill in the direction and iteration, it saves the result as new knowledge. The aborting response allows the system to abort the mission if the user says terima kasih/thank you in the middle of an asking conversation.
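The decision flow just described could look roughly like the sketch below; the helper names (respond, ask, known_verbs, corpus) are hypothetical, and the real system's checks are certainly richer.

```python
# Hedged sketch of the rejection / asking / aborting flow; helper names are hypothetical.
ABORT_PHRASE = "terima kasih"   # thank you

def respond(frame, known_verbs, corpus, ask):
    """frame: NLU output dict; ask(prompt) returns the user's next utterance."""
    if frame["verb"] is None:                       # no verb and no unique word
        return "rejection"                          # ask the user to rephrase
    if frame["verb"] not in known_verbs:            # new verb: start asking
        obj = ask("Objek apa? (Which object?)")
        if obj.lower().startswith(ABORT_PHRASE):
            return "aborted"
        if obj not in corpus:
            return "rejection"
        place = ask("Di mana? (Where? direction and steps)")
        if place.lower().startswith(ABORT_PHRASE):
            return "aborted"
        known_verbs[frame["verb"]] = {"object": obj, "place": place}  # save new knowledge
        return "learned"
    return "execute"                                # known verb: carry out the task
```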

Humanoid robot
Bioloid Grand Prix (GP) is a humanoid robot equipped with a CM-530 controller and a lithium battery for the power supply [30]. We use the modified Bioloid GP from our previous project [13], with an additional speaker mounted on top of the robot. Analog voltages from the Arduino Mega 2560 [31] are converted to digital values by an analog-to-digital converter (ADC) as reference commands for the CM-530, which are then translated into robot movement. The robot movement consists of forward, backward, left, and right, each with its iteration count.
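Movement commands of this kind are typically sent as short serial messages. The sketch below uses pyserial over a Bluetooth serial port, but the port name, baud rate, and one-letter command encoding are all our assumptions, since the paper does not specify the wire protocol.

```python
import serial  # pyserial

# Assumed Bluetooth serial settings and a hypothetical one-letter command encoding.
PORT, BAUD = "COM5", 9600
COMMANDS = {"forward": b"F", "backward": b"B", "left": b"L", "right": b"R"}

def move(direction, iterations):
    """Send e.g. b'R3' meaning 'right, three steps' to the Arduino."""
    with serial.Serial(PORT, BAUD, timeout=1) as link:
        link.write(COMMANDS[direction] + str(iterations).encode())

move("right", 3)  # walk right for three iterations
```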

System implementation
The hardware needed for this system is a microphone input (Kinect 2.0), a processor (laptop), controllers (Arduino and CM-530), and outputs in the form of speakers and the robot, as shown in Figure 3 (a). In the hardware implementation, the robot is able to move anywhere without a wired connection, as shown in Figure 3 (b). The speech output and robot movement control are sent from the laptop to the Arduino via Bluetooth. We used Google speech recognition with id-ID (Indonesian language) to recognize speech and adjust for ambient noise. The laptop is powered by an Intel Core i5 processor with 8 GB of memory. We use the Python language and build the RL algorithm on it. We equipped the robot voice with the Windows speech registry voice Microsoft Andika to give it an Indonesian voice and accent, and to pronounce cardinal numbers in Indonesian. We set the robot to talk at 150 words per minute (WPM); the average conversational speech rate is 120-150 WPM [31].
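In Python, this input/output chain is commonly wired up with the speech_recognition package for Google speech recognition and pyttsx3 for the Windows SAPI voice. The paper does not name its libraries, so the sketch below is our assumption of one workable setup (the Microsoft Andika voice must be installed on the machine).

```python
import speech_recognition as sr
import pyttsx3

recognizer = sr.Recognizer()
with sr.Microphone() as source:                      # Kinect exposed as a microphone
    recognizer.adjust_for_ambient_noise(source)      # calibrate against room noise
    audio = recognizer.listen(source)
text = recognizer.recognize_google(audio, language="id-ID")  # Indonesian recognition

engine = pyttsx3.init()                              # SAPI5 backend on Windows
engine.setProperty("rate", 150)                      # 150 words per minute
for voice in engine.getProperty("voices"):           # pick Microsoft Andika if present
    if "Andika" in voice.name:
        engine.setProperty("voice", voice.id)
engine.say("Kamu bilang: " + text)                   # "You said: ..."
engine.runAndWait()
```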

RESULTS AND ANALYSIS
We conducted several experiments to see the performance of the system; the system configuration and environment are shown in Figure 3 (b). The performance represents how fast the robot executes requests, how accurate it is, and how the knowledge grows. The experiments consist of entertainment execution, emergency execution, helping conversation with knowledge growing, policy behavior, and reward convergence.

Entertainment execution
This experiment gave us insight into the execution time for a single request. We ran it at a fixed distance of 1 m. The time was counted right after the translation from speech to text, because the length of the dialogues and the speed of people's speech rates varied. The average time was 3.74 ms; the slowest time, 8.23 ms, occurs on the purple shade, and the fastest time, 0.86 ms, on the pink shade, as shown in Table 1.
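A measurement of this kind is typically taken with a monotonic timer around the request handler, starting only after the speech-to-text result is available. The sketch below shows the assumed scheme; handle_request is a hypothetical placeholder for the entertainment dispatcher.

```python
import time

def handle_request(text):
    """Hypothetical dispatcher that maps text to an mp3 entertainment action."""
    return "play musik.mp3" if "musik" in text else "play motivasi.mp3"

text = "putar musik"                  # timing starts AFTER speech-to-text finishes
start = time.perf_counter()
action = handle_request(text)
elapsed_ms = (time.perf_counter() - start) * 1000.0
print(f"{action} ({elapsed_ms:.2f} ms)")
```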

Emergency execution
In an emergency situation, we asked the robot to call the nurse by speaking the unique word "perawat (nurse)". We use the sentence "panggilkan perawat (call the nurse)", after which the robot asks for the complaint of pain. After the conversation, the robot walks to the place where the nurse usually stands by and describes the complaint of pain to the nurse. At the same time, the complaint is recorded in a report, as shown in Table 2.

Knowledge growing
At the beginning there were only 2 verbs and 2 objects; then, during this experiment, a human gave unknown knowledge to the robot. From the conversations, the knowledge expanded to 10 verbs and 8 objects. We also tried different verbs related to the same object: tonton/watch and lihat/see share the object remote, while tulis/write and catat/record share the object pencil.

Policy behavior
To observe the movement of the policy and the actions taken in a certain state to reach the appropriate object, we also plotted the reward values at the end of the run (the 200th episode). As shown in Figure 5, the reward shifts towards remot/remote in the middle. The red shade marks the lowest reward, while the green shade marks the highest reward (the goal); the yellow shades describe the transition of the reward value from the lowest to the highest.

Figure 5. Left and right policy direction to the object "remot/remote"

Reward convergence
In this implementation, we want to know the performance of Q-learning and SARSA for every object in terms of the final reward over 200 episodes, ranging from 1 to 7 objects. It can be seen in Figure 6 that the cumulative reward of SARSA was slightly higher than that of Q-learning, which means its algorithm moved faster towards the steady state: SARSA's policy does not explore all actions at each step, so it stays focused on reaching the goal.

CONCLUSION
From the experiments in the previous section, it can be shown that the proposed system has the ability to expand its knowledge from 2 to 10 items. The additional knowledge affected the learning execution time, which grew from 1405 ms to 3490 ms. SARSA was faster towards the steady state because of its higher cumulative rewards; nevertheless, both the off-policy and on-policy variants can be implemented, and in both cases the policy moves the actions appropriately to achieve the desired object within 200 episodes. The system is equipped with an entertainment feature to play music, fairy tale, motivation, and comedy requests with a fast average execution time of 3.74 ms. During emergency situations, the system was able to call the nurse and save 7 complaints of pain. It can be concluded that the method proposed in this paper successfully achieved the objective of overcoming the robot knowledge limitation in reaching new dialogue goals for a patient assistant. For further research, the dialogue classification and knowledge growing can be extended to chit-chat or non-goal-driven dialogue.