Inhibition of Addictive Behaviors by Peer-Group Effect in a Conditional Delayed Reward Task

In recent years, especially among adolescents, reliance on games and the Internet has become a problem that interferes with daily life. This problem can be regarded as a kind of delayed reward task. In this study, we propose a conditional delayed reward task in which players can get small rewards by continuing to play a game instead of studying, but a large negative reward is given to them later. In order to investigate the causes of and countermeasures against this problem, we adopt a reinforcement learning model and focus on the peer-group effect (PGE), in which positive or negative influences are exerted by a group on its individuals, as one way of inhibiting the problem. Our simulation results showed that, in the condition without the PGE, individuals with a small gamma value, a parameter of the reinforcement learning model suggested to correspond to the amount of serotonin in the brain, continue to play the game; namely, they fall into a kind of addicted state. We also found that, in the condition with the PGE, players who were game-dependent in the absence of the PGE come to choose studying over gaming.


I. INTRODUCTION
In recent years, especially among adolescents, reliance on games and the Internet has become a problem that interferes with daily life. In 2013, the American Psychiatric Association (APA) listed Internet Gaming Disorder (IGD) as a condition for further study in the DSM-5 [1], and the World Health Organization (WHO) officially recognized IGD as an international disease, adding it in 2018 to the latest edition of the ICD-11 [2]. What is IGD, concretely? The WHO defined IGD as "a pattern of gaming behavior characterized by impaired control over gaming, increasing priority given to gaming over other activities to the extent that gaming takes precedence over other interests and daily activities, and continuation or escalation of gaming despite the occurrence of negative consequences [3]." The relationship between the brain and IGD has also been investigated. fMRI studies indicated that adolescents with IGD showed reduced white matter in several brain regions involved in decision-making, behavioral inhibition, and emotional regulation, as well as decreased functional connectivity of prefrontal-striatal circuits and increased risk-taking choices [4]. Other recent studies on the closely related internet addiction disorder (IAD) suggest that IAD is associated with structural abnormalities in brain gray matter, and that these brain abnormalities probably contribute to chronic dysfunction in subjects with IAD [5]. As an approach to treating people with IGD and IAD, the Kurihama Medical and Addiction Center, a specialized hospital in Japan, uses a mix of cognitive behavioral therapy, social skills development, and treatment programs emphasizing physical activity. Higuchi, the director of the center, claimed that "in some ways addiction to gaming is harder to treat than addiction to alcohol or drugs because the internet is everywhere [6]."
In this study, we consider that most people mutually influence other individuals, because they have smartphones that can connect to the Internet and play games, and further their behavior can be observed by those around them. In such a situation, the "Peer-Group Effect (PGE) [7][8][9]," which people in a group exert on one another, is well known to operate. We regard the PGE as a potential approach to inhibiting addictive behaviors such as IGD and IAD. The PGE can be expressed by the proverb "He who touches pitch shall be defiled." That is to say, we think that if some people watch many individuals conducting the same things, then, affected by those individuals, they also come to conduct the same things. This, however, means that the PGE can have either positive or negative influences on many people.
The purpose of this study is to clarify whether or not the PGE can actually have an inhibiting influence on addictive behaviors. In order to realize this purpose, we propose a conditional delayed reward task (CDRT) as a model of a situation that can express an addictive behavior like IGD, and adopt a reinforcement learning agent as a player of the CDRT. The reason the situation can be regarded as a kind of delayed reward task is that game players gain positive rewards, the fun of playing a game, but get negative rewards, bad grades, by continuing only to play.

A. Situation of the Task: Selecting between Two Types of Behaviors, Playing a Game or Studying

We are constantly making choices. In many cases, we try to select good choices that will yield profits, through predictions based on our various experiences and past information. There are situations in which selecting such good choices is difficult, but also ones in which we can predict the correct choice easily in advance. However, for instance, in a school there are many students who cannot select a good choice such as studying and continue to select a bad choice such as playing a game, even though they correctly know there are regularly scheduled tests one week later.
In this study, we model the task faced by students in such a situation as a conditional delayed reward task (CDRT). In the CDRT, the agents can be regarded as the students, and there are three states: s0-s2. The state s0 is the initial state. The state transits each time all agents have acted, and at each state an agent can get a small positive immediate reward by playing a game (indicated by G) or a small negative immediate reward by studying (indicated by S). A large negative delayed reward, however, is given to agents after they have continued to play a game. On the other hand, a large positive delayed reward is earned after they have kept studying. Therefore, in order to avoid the large negative delayed reward, they have to continue to select studying with its small negative immediate reward. The CDRT is depicted as a state transition diagram in Fig. 1. As can be seen in Fig. 1, we define four conditions for giving the delayed rewards. One of the four delayed rewards is added to the immediate reward given at state s2, according to these conditions.
In the CDRT, continuing to play a game can be regarded as an addictive behavior, despite the big penalties incurred by doing so. We confirm whether or not the PGE can inhibit this addictive behavior by using the CDRT.

B. Reinforcement Learning Agents as Players of the Task
We adopt reinforcement learning agents as the players of the task and Q-learning [10] as the learning method. Q-learning is a reinforcement learning method based on trial and error. In accordance with (1), an agent updates the Q-values, which represent the worth of its actions in each state of the environment, and thereby obtains more appropriate behaviors:

Q(s(t), a(t)) ← Q(s(t), a(t)) + α [ r(t+1) + γ max_{a'} Q(s(t+1), a') − Q(s(t), a(t)) ],   (1)

where a(t) is the action conducted at time t, Q(s(t), a(t)) is the Q-value of action a(t) under state s(t), α is the learning rate, r(t+1) is the reward given by the environment after conducting action a(t) in state s(t), and γ is the discount rate. The softmax policy [10] is adopted for action selection.
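As a concrete illustration, the Q-learning update and softmax action selection described above can be sketched in Python as follows. The temperature parameter `tau` is an assumption, since its value is not specified here:

```python
import math
import random

def softmax_policy(q_values, tau=1.0):
    """Select an action index with probability proportional to exp(Q / tau)."""
    prefs = [math.exp(q / tau) for q in q_values]
    total = sum(prefs)
    r, cum = random.random(), 0.0
    for action, pref in enumerate(prefs):
        cum += pref / total
        if r < cum:
            return action
    return len(prefs) - 1  # guard against floating-point rounding

def q_update(q, state, action, reward, next_state, alpha, gamma):
    """One Q-learning step: Q <- Q + alpha * (r + gamma * max_a' Q(s', a') - Q)."""
    target = reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])
```

For example, with q = {0: [0.0, 0.0], 1: [1.0, 0.0]}, calling q_update(q, 0, 0, 0.5, 1, alpha=0.5, gamma=0.9) moves q[0][0] halfway toward the target 0.5 + 0.9 * 1.0 = 1.4, giving 0.7.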

C. Peer-Group Effect in the Task
The Peer-Group Effect (PGE) [7][8][9] is the influence that people in a group exert on one another, shaped strongly by the majority. For example, it is known that the achievement of low-performing students can be raised by the good influence of high-ranking students in a class. The PGE, however, has not only such good impacts but also bad ones: peer pressure from many people who have succumbed to temptation, namely, an undesirable influence. We define the PGE as follows:

p'_a(t) = (p_a(t) + R_a(t)) / 2,   (2)

R_a(t) = N_a(t) / N,   (3)

where p'_a(t) is the probability of selecting action a at time t with the PGE, p_a(t) is the probability of selecting action a at time t without the PGE, as determined by the softmax policy, R_a(t) is the rate of action a selected by all agents at time t, N_a(t) is the number of agents that selected action a at time t, and N is the number of agents. The range of each of these values is 0 ≤ p_a(t), p'_a(t), R_a(t) ≤ 1.
As can be understood from (2) and (3), action choice probabilities influenced by the PGE depend on the rate of each action among all agents. Therefore, even if an agent would select action S with high probability, it selects the other action G when many agents have selected action G, and vice versa.
Conversely, a situation without the PGE can be regarded as a circumstance in which the agents are not mutually affected, like staying alone in one's own room.
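The PGE adjustment can be sketched as follows. Because the exact form of the blending rule could not be fully recovered from this copy, the equal-weight average of the individual softmax probability and the group rate used below is an illustrative assumption; only the dependence on the group selection rate N_a(t)/N is taken from the text:

```python
def pge_probabilities(individual_probs, counts):
    """Blend an agent's own action probabilities with the group-wide rates.

    individual_probs: per-action probabilities from the softmax policy, p_a(t).
    counts: number of agents that selected each action, N_a(t).
    NOTE: the 50/50 average below is an assumption for illustration.
    """
    n = sum(counts)
    rates = [c / n for c in counts]  # group rate R_a(t) = N_a(t) / N
    return [0.5 * p + 0.5 * r for p, r in zip(individual_probs, rates)]
```

For instance, an agent that prefers S with probability 0.9 (probabilities [G, S] = [0.1, 0.9]) in a group where 95 of 100 agents chose G ends up slightly favoring G: pge_probabilities([0.1, 0.9], [95, 5]) gives [0.525, 0.475].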

A. Settings of the Simulations
The parameters used in the simulation experiments are as follows. The small positive immediate reward given to an agent that selects action G at any state is +0.1, and the small negative immediate reward given to an agent that selects action S at any state is -0.1. After the agent gains the immediate reward at state s2, the delayed reward corresponding to one of the conditions is added to it. As illustrated in Fig. 1, we set four conditions: a large positive reward of +10.0 is added to the immediate reward at state s2 when the agent selected action S three times, a small positive reward of +5.0 when the agent selected action S twice, a small negative reward of -5.0 when the agent selected action S once, and a large negative reward of -10.0 when the agent never selected action S from s0 to s2, i.e., selected only action G.
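Putting the immediate and delayed rewards together, the total payoff of one s0-to-s2 episode can be sketched as follows. In the simulation the immediate rewards are given step by step; summing them into a single episode total here is only for illustration:

```python
def cdrt_episode_reward(actions):
    """Total reward of one s0 -> s2 episode of the CDRT.

    actions: three choices, each 'G' (play a game) or 'S' (study).
    Immediate rewards: +0.1 for G and -0.1 for S at every state.
    The delayed reward added after acting at s2 is keyed by how often
    S was chosen during the episode.
    """
    assert len(actions) == 3 and all(a in ("G", "S") for a in actions)
    immediate = sum(0.1 if a == "G" else -0.1 for a in actions)
    delayed = {3: 10.0, 2: 5.0, 1: -5.0, 0: -10.0}[actions.count("S")]
    return immediate + delayed
```

Always studying yields -0.3 + 10.0 = 9.7 in total, while always gaming yields 0.3 - 10.0 = -9.7, so the delayed rewards dominate the small immediate ones.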

B. Results of the Simulations
We investigate impacts of the PGE on addictive behaviors by comparing the CDRT without the PGE to the one with the PGE.
First, we confirm the selection probabilities of action G at each state for each of the 121 types of agents, in order to investigate the effects of their parameters on selecting action G without the PGE. Fig. 2 (a)-(c) show examples of the dynamics of the selection probabilities of action G at the three states s0-s2 for the 121 agent types, respectively. As can be seen in Fig. 2 (a), there are three clusters: the first is the group of agents who learn to avoid selecting action G, the second is the group whose selection probabilities of action G are about 20-60%, and the third is the group who cannot learn to inhibit selecting G, i.e., whose selection probabilities of action G are 100%. In Fig. 2 (b), there are also some agents whose selection probabilities of action G are 100%, along with many kinds of agents with a variety of selection probabilities between 0-60%. Fig. 2 (c), however, shows that all agents learn to avoid selecting action G.
Similar results were also confirmed in other simulations with different random seeds.
On the other hand, Fig. 2 (d) illustrates the ensemble-averaged rates of actions G and S selected at each state by each of the 121 types of agents over the last 1000 steps. This ensemble average is obtained by averaging the results of simulations independently conducted with twenty different random seeds. Each bar graph in Fig. 2 (d) consists of a set of three bars per agent type: the left bar shows the rate of actions selected at state s0, the middle bar the rate at state s1, and the right bar the rate at state s2. Therefore, 363 bars are represented along the x-axis of Fig. 2 (d). On the x-axis, the 121 parameter combinations are basically arranged in ascending order as follows: α=0.01 and γ=0.01, α=0.01 and γ=0.1, ..., α=0.01 and γ=0.99, then α=0.1 and γ=0.01, ..., α=0.1 and γ=0.99, then α=0.2 and γ=0.01, and so on. Agents with a high discount rate γ appreciate actions from which a positive reward may be gained in the future. Therefore, even if such agents continuously receive small negative immediate rewards, once they earn a large positive reward later, they come to readily select the actions that lead to that large positive reward. From the ensemble-averaged results shown in Fig. 2 (d), we found that the rates of action G selected at states s0 and s1 decrease as the learning and discount rates of the agents increase. As can be seen in Fig. 2 (d), although agents with a high discount rate, except those with the small learning rate 0.01, can learn to select action S, note that for most agents with low discount rates the rates of action G selected at states s0 and s1 are more than 50%. The reason most agents with high discount rates can learn to select action S at state s2 is that s2 is the state that gives the largest positive reward to agents who satisfy the conditions.

[Fig. 2: (a)-(c) dynamics of the selection probabilities of action 'G' at states s0-s2 for the 121 agent types; (d) ensemble-averaged rates of actions 'G' and 'S' selected at each state over the last 1000 steps.]
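The parameter grid behind this x-axis ordering can be reproduced as follows. The exact member values are an assumption consistent with the sequence described above (0.01, then 0.1 through 0.9, then 0.99, i.e., eleven values per parameter):

```python
# Eleven assumed values per parameter: 11 * 11 = 121 (alpha, gamma) pairs.
values = [0.01] + [round(0.1 * k, 1) for k in range(1, 10)] + [0.99]

# Ascending order as on the x-axis of Fig. 2 (d): gamma varies fastest.
grid = [(alpha, gamma) for alpha in values for gamma in values]
```

With this ordering, grid[0] is (0.01, 0.01), grid[10] is (0.01, 0.99), and grid[11] starts the next alpha block at (0.1, 0.01), giving 121 pairs in total.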
Next, we investigate whether the PGE has the potential to treat or inhibit an addictive behavior. As in the above simulation experiments, we examine the dynamics of the selection probabilities of action G for the 121 agent types with the PGE, and the ensemble-averaged rates of actions G and S selected by the agents with the PGE at each state. The results of our simulation with the PGE, in which the agents influence each other, are illustrated in Fig. 3. As can be seen in Fig. 3 (a)-(c), all selection probabilities range between about 20% and 50%. Of particular importance is that there are no agents with addictive behaviors, i.e., no agents whose selection probability of action G is 100%. These results, shown in Fig. 3 (a)-(c), suggest that addictive behaviors can be inhibited by the PGE. As can also be understood from Fig. 3 (a)-(c), however, no agents emerge who learn to completely avoid selecting action G. The minority of agents who would otherwise select action S get into the mood of playing a game when many agents select action G. That is to say, this can be interpreted as a result of the negative impact of the PGE on the agents. Fig. 3 (d) shows the ensemble-averaged rates of actions G and S selected at each state by the 121 types of agents with the PGE over the last 1000 steps. An important difference between Fig. 2 (d) and Fig. 3 (d) is that the selection probabilities of action G are less than 50% at states s0 and s1, even for agents with a low learning rate. In Fig. 2 (d), there is little difference in the rates of action G at states s0 and s1 between agents with low and high learning rates when the discount rates are low. In Fig. 3 (d), however, the rates of action G at states s0 and s1 decrease as both the learning and discount rates of the agents increase. Moreover, in contrast to Fig. 2 (d), the rate of action G at state s2 in Fig. 3 (d) increases as the learning rates of the agents increase.

IV. DISCUSSIONS
We discuss the results of our simulations from the viewpoint of the relations between the parameters of reinforcement learning and our brains. In this paper, we focus on the learning rate α and the discount rate γ among the important parameters of reinforcement learning.
The learning rate α determines the amount by which the action value is updated. If the learning rate α is large, the agent drastically changes its action values according to the rewards it receives. On the other hand, an agent with a small learning rate α hardly changes its action values. In the brain, acetylcholine, one of the neurotransmitters, is suggested to play an important role as a gate signal that selects between keeping a past memory and exchanging it for a new one [11]. The learning rate α can thus be said to correspond to acetylcholine.

Adolescent individuals with high plasticity are suggested to easily show addictive behaviors. In our simulation results without the PGE, however, the agents with a high learning rate showed relatively appropriate behaviors only when they also had a high discount rate. This result suggests that, in a situation without influences from other surrounding individuals, people can learn to avoid an addictive behavior by drastically changing their action values in response to the large negative reward. In contrast, the agents with the PGE, like people in a classroom, influence each other and can learn to avoid extreme behaviors, but most of the agents become individuals who select inappropriate behaviors by receiving the negative impact of the PGE, even though the big delayed reward is given.
The discount rate γ determines which of immediate and delayed rewards is emphasized. As brain research related to the discount rate, Okamoto et al. [16] showed that serotonin, one of the neurotransmitters, plays an important role in controlling a time-scale parameter for reward prediction. Therefore, the discount rate γ can be regarded as representing serotonin. In fact, our simulation results without the PGE corresponded to the results obtained by Okamoto et al.: the agents with low discount rates relatively easily select the action that yields an immediate positive reward. Furthermore, Miyazaki et al. [17][18] clarified that, in a delayed reward task using rats, increased release of serotonin and activation of the serotonergic nervous system are related to waiting for the delayed reward. The experimental results by Miyazaki et al. can be interpreted as showing two modes of the discount rate, in which a large γ gives high value to the delayed reward while a small γ does not, and the two modes are switched by the amount of serotonin. Our simulation results showed that the PGE can inhibit addictive behaviors to some extent, but that completely avoiding the addictive behaviors is not realized. In the situation with the PGE, in which the agents could not completely learn to avoid the addictive behavior, we think it is necessary not only to increase their discount rate, namely, the amount of serotonin, but also to produce only the positive PGE.

V. CONCLUSION
In this paper, we investigated whether or not the Peer-Group Effect (PGE) can have an inhibiting influence on addictive behaviors. For this, we proposed a conditional delayed reward task (CDRT) as a model of a situation that can express an addictive behavior, and adopted a reinforcement learning agent as a player of the CDRT.
From our simulation results comparing the CDRT with and without the PGE, we found that, on the whole, the PGE has an impact that inhibits addictive behaviors to some extent. However, we also confirmed that agents who learn to completely avoid inappropriate actions do not emerge, owing to the negative influences of the PGE.
In order to completely inhibit addictive behaviors in a space like a classroom where the PGE can be exerted, we conclude that an approach that only increases the amount of serotonin, which contributes to gaining delayed rewards, is not enough; an idea for producing only the positive PGE is also needed.