Multi-Agent Reinforcement Learning for Cooperative Coded Caching via Homotopy Optimization

Introducing cooperative coded caching into small cell networks is a promising approach to reducing traffic loads. By encoding content via maximum distance separable (MDS) codes, coded fragments can be collectively cached at small-cell base stations (SBSs) to enhance caching efficiency. However, content popularity is usually time-varying and unknown in practice. As a result, cache contents are anticipated to be intelligently updated by taking into account limited caching storage and interactive impacts among SBSs. In response to these challenges, we propose a multi-agent deep reinforcement learning (DRL) framework to intelligently update cache contents in dynamic environments. With the goal of minimizing long-term expected fronthaul traffic loads, we first model dynamic coded caching as a cooperative multi-agent Markov decision process. Owing to MDS coding, the resulting decision-making falls into a class of constrained reinforcement learning problems with continuous decision variables. To deal with this difficulty, we custom-build a novel DRL algorithm by embedding homotopy optimization into a deep deterministic policy gradient formalism. Next, to empower the caching framework with an effective trade-off between complexity and performance, we propose centralized, partially and fully decentralized caching controls by applying the derived DRL approach. Simulation results demonstrate the superior performance of the proposed multi-agent framework.

resources at SBSs, the studies in [13], [14] investigated the joint design of SBS beamforming and clustering. Caching strategies in these studies were designed to store either entire content items or uncoded fragments, which are referred to as uncoded caching.
To further improve caching efficiency, coded caching has recently gained considerable research attention. The study in [15] proposed a novel coded caching scheme, which provides a global caching gain relating to the cumulative storage over all caching units. The research in [16], [17] investigated cooperative coded caching by utilizing maximum distance separable (MDS) codes to reduce traffic loads. MDS coded caching was also examined in [18], [19] to augment SBS collaboration, offering significant advantages in lowering latency and reducing power consumption compared with uncoded caching. The above-mentioned studies mainly investigated offline caching policies by assuming time-invariant content popularity distributions.
To exploit dynamic features in wireless networks, extant works have been devoted to designing caching policies by using RL. The study in [20] utilized Q-learning to find an optimal caching policy to minimize network cost. To counter the curse of dimensionality in conventional RL, DRL-based caching policies were advocated in [9], [21]-[23] by using deep neural networks (DNNs) as function approximators. Moreover, the study in [24] proposed a multi-agent DRL framework to maximize cache hit ratios in centralized and decentralized settings. The authors in [25] investigated cache placement by using cooperative multi-agent multi-armed bandit learning in small cell networks (SCNs). A decentralized caching scheme was proposed in [26] by utilizing federated deep reinforcement learning. Nevertheless, the research in [9], [20]-[26] focused on uncoded caching. That is, each content item is cached in its entirety, without allowing SBSs to cooperatively cache coded fragments of each content item.

B. Contributions
Indeed, cooperatively pre-fetching MDS coded fragments has been proven to significantly alleviate traffic loads and thus reduce latency and transmission cost compared with storing uncoded fragments in SCNs [17]-[19]. Specifically, since one can distribute MDS coded fragments of a content item to multiple SBSs, mobile users can access a cluster of SBSs simultaneously to download coded fragments of the desired content item. To date, very few works have investigated how to "intelligently" update MDS coded content items under dynamic environments (e.g., time-varying content popularity). Under centralized control, a prior study [27] utilized deep deterministic policy gradient (DDPG) to explore MDS coded caching under dynamic content popularity with the aid of predicting user requests; and the study in [28] utilized Q-learning with function approximation to investigate coded caching.
It is worth noting that, unlike uncoded caching schemes [7]-[9], [20]-[25] addressing binary decision-making, coded caching essentially entails continuous caching decisions that are subject to storage constraints. Accordingly, the resulting decision-making for dynamic coded caching is a constrained RL problem with a continuous action space. Quantizing actions into discrete values or directly applying conventional DRL algorithms may not efficiently handle this constrained RL problem. In addition, a centralized control could lead to excessive communication overhead, because the cloud processor (CP) needs frequent communications with SBSs to aggregate information and inform SBSs of their caching decisions. As the number of SBSs increases, the dimension of the continuous states and actions in a centralized control becomes very large, making optimal caching computationally prohibitive. As a consequence, it is very challenging, yet essential, to design efficient DRL algorithms for cooperative coded caching.
To bridge the research gap identified above, in this work, we investigate cooperative coded caching design in SCNs with temporally evolving content popularity. In particular, we address the following fundamental issues for empowering "intelligent" caching: i) how to design efficient DRL algorithms for a constrained RL problem with continuous decision variables; and ii) how to develop a multi-agent DRL-based framework with different levels of control to obtain a favorable trade-off between performance and complexity.
The main contributions of this work are summarized as follows: • To the best of our knowledge, this is the first work to investigate a multi-agent DRL framework for MDS coded caching under time-varying content popularity. Specifically, we model cache updating for MDS coded caching as a cooperative multi-agent Markov decision process (MDP). With the goal of minimizing long-term expected cumulative fronthaul traffic loads, we judiciously define the system state, the local observations and action space of each agent, as well as the caching reward. The formulated problem is a continuous RL problem with action constraints.
We also characterize optimal decisions in a closed form.
• As a core technical contribution, we reformulate a general constrained RL problem, whose action-space constraints are difficult to satisfy directly through DNN design, into a tractable form that can be handled by utilizing homotopy optimization. Then, we custom-build a novel DRL algorithm, i.e., homotopy deep deterministic policy gradient (HDDPG), by recasting the basic elements of RL and unfolding the iterative process of homotopy optimization. The novelty of this approach lies in introducing a reasonable cumulative penalty into the objective of RL, and then properly manipulating it by using homotopy optimization.
• To endow the proposed DRL caching framework with different levels of control, we generalize the proposed HDDPG from centralized control to partially and fully decentralized controls.
Specifically, in the centralized control, the CP coordinates SBSs to conduct cache updating by using global information. To reduce complexity and communication overhead, we then propose a partially decentralized control by allowing SBSs to make decisions locally, while their policies are learned in a centralized manner. In the fully decentralized control, each SBS works as an independent learner and trains its caching policy independently based on local observations. The proposed decentralized controls can obtain a desirable trade-off between complexity and performance, and thus have the potential to handle large-scale wireless networks.
The remainder of this paper is organized as follows. Sec. II presents the problem statement.
Sec. III introduces the proposed DRL. Sec. IV develops a centralized cooperative coded caching design. Sec. V proposes a partially decentralized caching design, and Sec. VI proposes a fully decentralized caching design. Sec. VII presents performance evaluations, and Sec. VIII concludes the paper.

A. MDS Coded Caching at SCN
As illustrated in Fig. 1, we consider an SCN in which a total of B SBSs are densely deployed and are thus capable of cooperatively providing communication services for users. Each SBS is endowed with a cache unit, which can cache popular content from the CP through a capacity-limited fronthaul. The CP is further connected to the core network through a backhaul. Suppose that a catalog of F content items is available at the CP. For ease of discussion, all of the content items are of the same size s bits, and each cache unit has a storage of L × s bits. Let $\mathcal{B} = \{1, \cdots, B\}$ and $\mathcal{F} = \{1, \cdots, F\}$ denote the indices of SBSs and content items, respectively.
To reduce traffic loads on the capacity-limited fronthaul and provide better services for mobile users, SBSs can proactively cache popular content. By applying MDS codes, each content item of size s bits can be encoded into a sufficiently long sequence of parity bits, and any s parity bits are sufficient to reconstruct the original content item [17], [18], [29]. Moreover, in practice, MDS coding can be implemented by Raptor codes with only a very small redundancy [29], [30]. Therefore, SBSs with limited caching storage can cooperatively and collectively cache these coded parity bits, so as to satisfy user requests locally as much as possible. More precisely, we define the cache allocation matrix as $\mathbf{L} = [l_{f,b}] \in \mathbb{R}^{F \times B}$, where element $l_{f,b} \in [0, 1]$ denotes the proportion of parity bits encoding content item $f$ that are stored at SBS $b$, $\forall b \in \mathcal{B}, f \in \mathcal{F}$.
Owing to the storage limit at SBSs, the cache allocation needs to satisfy $\sum_{f \in \mathcal{F}} l_{f,b} \le L$, $\forall b$. It is worth noting that the parity bits of a content item available at SBS $b$ should be independent of those cached at other SBSs; thus, users can always download distinct coded fragments from multiple SBSs [17]. To guarantee this caching diversity among SBSs, the encoded sequence of parity bits of every content item should be sufficiently long, e.g., longer than $Bs$. In what follows, we introduce how SBSs cooperatively transmit coded content items to mobile users. Specifically, the operation cycle of the SCN is slotted into a series of epochs, indexed by $t = 0, 1, \cdots$. For each epoch $t$, a set of active users $\mathcal{K}^t$ are randomly distributed in the horizontal plane. We also assume that the duration of each epoch is relatively short, such that active users are considered quasi-static during a single epoch. Each user $k \in \mathcal{K}^t$ can be served by a local SBS cluster, which is specified by the communication radius [31]. Therefore, $e^t_{k,b} = 1$ if $d^t_{k,b} \le r_0$, where $r_0$ denotes the communication radius of each SBS and $d^t_{k,b}$ denotes the distance between user $k$ and SBS $b$; otherwise, $e^t_{k,b} = 0$. For notational convenience, we collect all active users served by SBS $b$ at epoch $t$ as the set $\mathcal{K}^t_b = \{k \in \mathcal{K}^t \mid e^t_{k,b} = 1\}$, $\forall b \in \mathcal{B}$. Each user $k \in \mathcal{K}^t$ is assumed to request one content item per epoch. Evidently, when content item $f$ is not fully stored at user $k$'s neighboring SBSs, the missing part $\big(1 - \sum_{b \in \mathcal{B}} l_{f,b} e^t_{k,b}\big)s$ needs to be transmitted by the CP via the fronthaul. This event is referred to as a cache miss, which introduces additional fronthaul traffic loads. We summarize all of the key notations in Table I. As a result, to mitigate the traffic burden on the fronthaul, caching decisions need to be temporally updated based on historical observations in order to provide better download services for future requests. We thereby introduce a dynamic cooperative coded caching problem in the following subsection.
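To make the cache-miss mechanism concrete, the following is a minimal NumPy sketch of how a per-SBS MDS cache allocation and user-SBS connectivity determine the fronthaul traffic due to cache misses. All numerical values and variable names here are illustrative assumptions, not the paper's settings.

```python
# A minimal sketch (NumPy) of fronthaul traffic under MDS coded caching.
import numpy as np

F_items, B_sbs, s_bits = 5, 3, 1.0           # catalog size, number of SBSs, content size (normalized)
L_cap = 2.0                                   # per-SBS storage, in units of content items

# Cache allocation l[f, b]: fraction of parity bits of item f stored at SBS b.
rng = np.random.default_rng(0)
l = rng.uniform(0.0, 1.0, size=(F_items, B_sbs))
l *= np.minimum(1.0, L_cap / l.sum(axis=0))   # enforce per-SBS storage: sum_f l[f, b] <= L

# Connectivity e[k, b] = 1 if user k is within the communication radius of SBS b.
e = np.array([[1, 1, 0],
              [0, 1, 1]])                     # two active users, each covered by two SBSs
requests = np.array([2, 4])                   # content item requested by each user

# A user recovers item f once it has collected s parity bits in total from its local
# SBS cluster; the missing fraction must be fetched from the CP via the fronthaul.
miss = 0.0
for k, f in enumerate(requests):
    cached_fraction = np.minimum(1.0, (l[f, :] * e[k, :]).sum())
    miss += (1.0 - cached_fraction) * s_bits

print(f"fronthaul traffic due to cache misses: {miss:.3f} (in units of s bits)")
```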

B. Cooperative Multi-Agent MDP
In the cooperative coded caching, SBSs are anticipated to collaboratively cache the coded content items, which can be specified by optimizing the continuous variables $\{l^t_{f,b}\}$. Therefore, we formulate the considered cooperative coded caching problem as a cooperative multi-agent MDP. Let $\mathcal{A}_b$ denote the action space of agent (SBS) $b$, and let $\mathcal{A} = \cup_{b \in \mathcal{B}} \{\mathcal{A}_b\}$ collect the action spaces of all agents. $\mathcal{P}$ collects all of the transition probabilities $\Pr\{S'|S, A\}$ for $\forall S, S' \in \mathcal{S}$, $A \in \mathcal{A}$. All of the agents share a common reward $R$ after they cooperatively take actions $\{A_b\}_{b \in \mathcal{B}}$, and $\gamma \in [0, 1)$ denotes a discount factor.
As aforementioned, user requests are expected to be satisfied by SBSs locally as much as possible; otherwise, the missing fragments could introduce additional traffic burden and transmission delay on the fronthaul. Therefore, in this paper, our goal is to minimize the expected fronthaul traffic loads. Accordingly, the basic elements in a cooperative multi-agent MDP are defined as follows.
State: We assume that a user request can be observed by his or her neighboring SBSs only.
Consequently, SBS $b$ has a local observation of the environment, which is defined as $S^t_b = \big[\{f^t_k, E^t_k\}_{k \in \mathcal{K}^t_b}, \{l^t_{f,b}\}_{f \in \mathcal{F}}\big]$, where $f^t_k \in \mathcal{F}$ denotes the index of the content item requested by user $k$ at epoch $t$, and $E^t_k = \{e^t_{k,b}\}_{b \in \mathcal{B}}$ implies the strategy of SBS collaboration for satisfying user $k$'s request, which can be acquired by knowing the user location. By aggregating the observations of all SBSs, the system state is defined as $S^t = \{S^t_b\}_{b \in \mathcal{B}}$.

Action: By the end of each epoch $t$, all SBSs need to update their cached content. Accordingly, we define the action of SBS $b$ at the current epoch as $A^t_b = \{a^t_{f,b}\}_{f \in \mathcal{F}}$, where element $a^t_{f,b} = l^{t+1}_{f,b}$; the corresponding action space is given in (4) by the storage constraint, i.e., $a_{f,b} \in [0, 1]$, $\forall f$, and $\sum_{f \in \mathcal{F}} a_{f,b} \le L$. As such, a joint action can be given by $A^t = \{A^t_b\}_{b \in \mathcal{B}}$.

Reward: After executing the joint action $A^t$, the system state turns into $S^{t+1}$ with transition probability $\Pr\{S^{t+1}|S^t, A^t\}$. In this cooperative task, all of the agents receive a common reward $R(S^{t+1}, S^t, A^t)$, which indicates how good a joint action $A^t$ is; therefore, it should be consistent with the goal of reducing fronthaul traffic loads. The total traffic loads in (5) consist of the loads for updating caching resources and for satisfying user requests in the coming epoch. Accordingly, we design the reward as the negative of these traffic loads normalized by the number of requests, which indicates how much traffic load is imposed to satisfy each content request in the coming epoch after performing cache updating.
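The following is a minimal sketch of a caching reward of the flavor just described: the (negative) fronthaul traffic needed to update the caches and to serve the requests of the coming epoch, normalized by the number of requests. The precise expressions are given by (5) and the reward definition in the paper; the particular form below is an illustrative assumption.

```python
# A minimal NumPy sketch of a per-epoch caching reward (illustrative assumption).
import numpy as np

def caching_reward(l_old, l_new, e, requests, s_bits=1.0):
    """l_old, l_new: (F, B) cache allocations before/after updating;
    e: (K, B) user-SBS connectivity; requests: (K,) requested item indices."""
    # Traffic for pushing newly added parity bits to the SBSs over the fronthaul.
    update_traffic = np.maximum(l_new - l_old, 0.0).sum() * s_bits
    # Traffic for the fragments missing at the local SBS clusters.
    miss_traffic = sum(
        (1.0 - min(1.0, float((l_new[f, :] * e[k, :]).sum()))) * s_bits
        for k, f in enumerate(requests)
    )
    return -(update_traffic + miss_traffic) / max(len(requests), 1)
```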
Toward this end, the goal of this study is to find a cooperative caching policy $\pi^*$ that maximizes the total expected cumulative caching reward, i.e., $\pi^* = \arg\max_{\pi \in \Pi} \mathbb{E}\left[V \mid \pi\right]$ in (7), where the cumulative reward is defined as $V = \sum_{t=0}^{+\infty} \gamma^t R(S^{t+1}, S^t, A^t)$; $\pi$ denotes a mapping from the state space to the action space; $\Pi$ denotes the set of feasible caching policies; and the expectation is taken over all of the rewards $\{R(S^{t+1}, S^t, A^t)\}$. Furthermore, a characterization of optimal decisions is presented in the following proposition.
Proposition 1 Consider that all SBSs are fully loaded at the initial epoch, i.e., $\sum_{f \in \mathcal{F}} l^0_{f,b} = L$, $\forall b \in \mathcal{B}$. Then there exists an optimal decision sequence for problem (7) under which every caching unit remains fully loaded, i.e., $\sum_{f \in \mathcal{F}} l^t_{f,b} = L$, $\forall b \in \mathcal{B}$, for any $t \ge 1$.

Proof. See Appendix A.
Remark 1 Proposition 1 implies that optimal caching decisions are likely to keep the caching units fully loaded. This result is reasonable and provides further insight for algorithm design. Nevertheless, calculating an optimal cooperative caching policy offline depends on knowledge of the network dynamics (e.g., the transition probability $\Pr\{S'|S, A\}$), which is generally difficult to obtain in real applications. Even if this knowledge could be obtained, problem (7) is still intractable because it has no closed-form expression. In view of this, one can resort to DRL to handle this problem by utilizing historical experiences without knowing exact dynamic information. On the other hand, as previously mentioned, the decision variables $\{a^t_{f,b}\}$ for cooperative coded caching are continuous. Simply quantizing these variables into discrete values may lead to performance loss as well as an exponentially large number of actions. For example, consider a very small scenario in which three SBSs store parity bits of 10 content items, and each continuous decision variable $a^t_{f,b}$ is coarsely quantized into five discrete values within $[0, 1]$; the resulting number of joint actions is up to $5^{30}$, rendering value-based RL algorithms (e.g., deep Q-learning) intractable.

III. A NOVEL HOMOTOPY DDPG
To develop a working DRL algorithm for the considered problem, we first introduce policy-based RL and identify the arising challenges. Then, we recast a general policy-based RL problem with constraints into a tractable form, which can be addressed by leveraging homotopy optimization. Finally, we custom-build a novel DRL algorithm by embedding homotopy optimization into DDPG.

A. Fundamentals of DDPG
DDPG is one of the policy-based RL algorithms, which is widely used to handle continuous decision-making [32]. Built upon actor-critic architectures, this algorithm employs DNNs as function approximators to learn a deterministic policy that can map high-dimensional states into feasible continuous actions. Typically, a DDPG-based RL framework consists of two networks, i.e., critic and actor, which are detailed as follows.
Actor: The actor network corresponds to a deterministic policy, which can generate an action A under a given system state S, i.e., A = π θ (S), and θ is the parameter of the associated DNN.
This parametrized policy $\pi_\theta(\cdot)$ aims to maximize the expected cumulative reward, i.e., $\max_\theta J(\theta) = \mathbb{E}[V \mid \pi_\theta]$ in (8).

Critic: The critic network $Q_\phi(S, A)$ serves as an estimator to predict the action-value function (also termed the Q-function), i.e., $\mathbb{E}[V \mid S, A, \pi_\theta]$, where $\phi$ denotes the parameter of the associated DNN. In general, the critic is designed to fine-tune the actor. By recalling (8), the Q-function is expected to satisfy the following recursive equation: $Q^{\pi_\theta}(S, A) = \mathbb{E}\big[R + \gamma Q^{\pi_\theta}(S', \pi_\theta(S'))\big]$, where $S'$ denotes the subsequent state after taking $A$ under state $S$; $R$ denotes the corresponding instant reward; and the expectation is over all of the possible occurrences of $(S', R)$.
Learning Algorithm: As a category of policy gradient approaches, the actor parameter $\theta$ is updated by stochastic gradient methods, where the policy gradient is given by the Deterministic Policy Gradient Theorem [32]; concerning the critic network, parameter $\phi$ is updated according to temporal-difference learning. Readers are referred to [32] for further details.
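To fix ideas, the following is a minimal PyTorch sketch of the actor-critic pair underlying DDPG. The layer sizes and the sigmoid output squashing are illustrative assumptions, not the paper's network design.

```python
# A minimal PyTorch sketch of a DDPG actor-critic pair (illustrative architecture).
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Sigmoid(),  # proto-action in [0, 1]^dim
        )

    def forward(self, state):
        return self.net(state)          # deterministic policy mu_theta(S)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),       # scalar estimate Q_phi(S, A)
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```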
Although DDPG has achieved great success in addressing many continuous decision-making tasks, the action space in our problem (defined by (4)) could prevent it from working efficiently.
Specifically, to confine the output of the actor to be feasible, a simple idea is to use the SoftMax activation function to normalize the output of the last hidden layer, which is then scaled by a factor (e.g., L). This idea has been used in [27]. We point out that the resulting elements (i.e., $a^t_{f,b}$, $\forall f, b$) could surpass 1 when $L > 1$; directly clipping such an element to 1 may lead to a very poor caching decision, e.g., when an element $a^t_{f,b} = L$ exists. This practice contradicts Proposition 1 and could degrade the performance of DDPG. In the following subsections, we formally analyze this issue and propose an efficient approach to overcome this challenge.

B. A Homotopy Optimization Based Approach
For a class of RL problems, the corresponding action space $\mathcal{A}$ involves constraints that are difficult to satisfy directly through DNN design, i.e., via $\mu_\theta$. More specifically, we consider the following situation: let the set $\mathcal{A}_\theta$ collect all of the proto-actions produced by $\mu_\theta(S)$, $\forall S \in \mathcal{S}$; the feasible actions $\mathcal{A}$ may only lie in a subset of $\mathcal{A}_\theta$, i.e., $\mathcal{A} \subseteq \mathcal{A}_\theta$. To deal with this issue, a straightforward approach is to use a mapping function $\sigma_{\mathcal{A}}(\cdot)$, which projects a proto-action onto the feasible action space. Thus, a feasible policy function can be given by $\pi_\theta(\cdot) = \sigma_{\mathcal{A}}[\mu_\theta(\cdot)]$. Accordingly, the associated policy-based RL problem, i.e., problem (12), takes the following form: $\max_\theta \, J(\theta \mid \sigma_{\mathcal{A}}) = \mathbb{E}\big[V \mid \pi_\theta = \sigma_{\mathcal{A}}[\mu_\theta(\cdot)]\big]$. Nevertheless, for many constrained RL applications, poor actions are likely to be generated after projection. As in the example of the previous subsection, mapping a proto-action with a dominant element $a^t_{f,b} = L$ into a (nearly) one-hot vector yields a very sparse caching vector, which implies that very little caching resource is made available at the SBSs. When this instance is frequently encountered during training, mapping methods may not guarantee that the network parameters are efficiently updated, thus leading to a suboptimal policy.
To remedy this issue, a natural idea is to seek a proper way to penalize the performance loss caused by mapping a proto-action $\mu_\theta(S^t)$ into a feasible one given any state $S^t$. Let $g(\cdot \mid \sigma_{\mathcal{A}})$ be a general penalty function, which needs to be designed according to the corresponding problem.
In the coded caching problem, inspired by Proposition 1, a penalty function can be given by $g(S^t \mid \sigma_{\mathcal{A}}) = BL - \|\sigma_{\mathcal{A}}(A^t)\|_1$, where $\|\cdot\|_1$ denotes the $\ell_1$-norm; this penalty indicates the remaining storage over all caching units after taking action $\sigma_{\mathcal{A}}(A^t)$.
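A minimal sketch of this storage-slack penalty follows: it measures the total caching storage left unused after the clipped action is applied. The array shapes and the per-SBS accounting are assumptions for illustration.

```python
# A minimal NumPy sketch of the storage-slack penalty g(. | sigma_A).
import numpy as np

def storage_penalty(proto_action, L_cap):
    """proto_action: (F, B) proto caching decision; L_cap: per-SBS storage budget."""
    feasible = np.minimum(1.0, proto_action)           # mapping sigma_A(.) = min{1, .}
    slack_per_sbs = L_cap - feasible.sum(axis=0)        # unused storage at each SBS
    return float(np.maximum(slack_per_sbs, 0.0).sum())  # g >= 0; equals 0 when fully loaded
```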
The proposed approach is then built upon maximizing the homotopy function in (14), i.e., $J_{\text{hom}}(\theta \mid \lambda, \sigma_{\mathcal{A}}) = J(\theta \mid \sigma_{\mathcal{A}}) + \lambda G$, where the discounted cumulative penalty $G = \mathbb{E}\big[\sum_{t=0}^{+\infty} \gamma^t g(S^t \mid \sigma_{\mathcal{A}})\big]$ is finite due to the discount factor $\gamma \in [0, 1)$; and $\lambda \le 0$ is a homotopy variable, such that $J_{\text{hom}}(\theta \mid 0, \sigma_{\mathcal{A}}) = J(\theta \mid \sigma_{\mathcal{A}})$. At this stage, we introduce the following lemma to address problem (12) through a typical homotopy optimization method [33].
Lemma 1 On the basis of homotopy optimization, one can initialize a sequence of positive values $\delta_i$, $i = 1, \cdots, I$, subject to the equality $\sum_{i=1}^{I} \delta_i = -\lambda_{\min}$ in (16), and also initialize a point $(\theta_0, \lambda_0)$, where $\theta_0$ denotes a (local) optimizer of $J_{\text{hom}}(\theta \mid \lambda_0, \sigma_{\mathcal{A}})$ and $\lambda_0 = \lambda_{\min}$. One then iterates the update $\lambda_i = \lambda_{i-1} + \delta_i$ and calculates a local optimizer $\theta_i$ of $J_{\text{hom}}(\theta \mid \lambda_i, \sigma_{\mathcal{A}})$ by gradient methods starting from $\theta_{i-1}$. Eventually, this homotopy approach results in the point $(\theta_I, 0)$, where $\theta_I$ is a local optimizer of problem (12) [33].
The motivation of the homotopy optimization approach is as follows: starting with a sufficiently small value $\lambda_{\min} < 0$, a very large cumulative penalty $|\lambda_{\min} G|$ forces the corresponding policy (parametrized by $\theta_0$) to generate the intended actions, e.g., caching decisions that fully exploit the available caching storage in the considered problem. Thereafter, by using homotopy optimization, we carefully tune the policy parameter from $\theta_0$ toward a (local) optimizer $\theta_I$ of the original problem (12), which is likely to produce good decisions despite the applied mapping function $\sigma_{\mathcal{A}}(\cdot)$.
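The following is a toy NumPy sketch of the homotopy iteration in Lemma 1: start from a strongly penalized objective (lambda = lambda_min < 0), then relax lambda toward 0 in I steps while warm-starting gradient ascent from the previous solution. The toy objective J and penalty G below are illustrative assumptions and stand in for the RL objective and cumulative penalty.

```python
# A toy NumPy sketch of the homotopy-optimization outer loop (Lemma 1 flavor).
import numpy as np

def grad_ascent(theta, lam, steps=200, lr=0.05):
    for _ in range(steps):
        # d/d theta of [ J(theta) + lam * G(theta) ] for toy choices of J and G
        grad_J = -2.0 * (theta - 1.0)          # J(theta) = -(theta - 1)^2
        grad_G = 2.0 * theta                   # G(theta) = theta^2 (a "penalty")
        theta = theta + lr * (grad_J + lam * grad_G)
    return theta

lam_min, I = -0.5, 10
deltas = np.full(I, -lam_min / I)              # sum_i delta_i = -lam_min, so lam_I = 0
lam, theta = lam_min, 0.0                      # theta_0: optimizer under lam = lam_min
theta = grad_ascent(theta, lam)
for delta in deltas:
    lam = lam + delta                          # slowly relax the homotopy variable
    theta = grad_ascent(theta, lam)            # warm start from the previous theta
print(f"final lambda = {lam:.2f}, theta_I = {theta:.3f}")
```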

C. Proposed DRL Algorithm
In each iteration of homotopy optimization, computing an optimizer (e.g., $\theta_i$) of problem (14) offline is impractical under an unknown, temporally evolving environment. For this reason, we custom-build HDDPG for problem (12) by recasting the basic elements of DRL and unfolding the iterative procedure of homotopy optimization introduced in Lemma 1.
Specifically, as an actor-critic approach, HDDPG maintains a parametrized critic $Q_{\text{hom},\phi}(S, A)$ and actor $\mu_\theta(S)$, in addition to a mapping function $\sigma_{\mathcal{A}}(\cdot)$, where a feasible policy is given by $\pi_\theta(\cdot) = \sigma_{\mathcal{A}}[\mu_\theta(\cdot)]$. Following the sketch of a plain DDPG, we introduce the proposed algorithm as follows. Evidently, the objective $J_{\text{hom}}(\theta \mid \lambda, \sigma_{\mathcal{A}})$ can be equivalently reformulated as $J_{\text{hom}}(\theta \mid \lambda, \sigma_{\mathcal{A}}) = \mathbb{E}\big[\sum_{t=0}^{+\infty} \gamma^t \big(R^{t+1} + \lambda g(S^t \mid \sigma_{\mathcal{A}})\big)\big]$. Accordingly, we define the homotopy reward after taking action $A^t$ as in (19), i.e., $R^{t+1}_{\text{hom}} = R^{t+1} + \lambda g(S^t \mid \sigma_{\mathcal{A}})$, which is known at epoch $t+1$. Then, the homotopy Q-function can be given by $Q_{\text{hom}}(S, A) = \mathbb{E}\big[\sum_{t=0}^{+\infty} \gamma^t R^{t+1}_{\text{hom}} \,\big|\, S^0 = S, A^0 = A, \pi_\theta\big]$, which represents the discounted cumulative homotopy reward after taking action $A$ under state $S$ and thereafter following the policy $\pi_\theta(\cdot) = \sigma_{\mathcal{A}}[\mu_\theta(\cdot)]$. As a direct deduction of the Bellman optimality equation [34], we have the following Homotopy Bellman Optimality Equation.
Lemma 2 An optimal $Q^*_{\text{hom}}(S, A)$ satisfies the following recursive equality: $Q^*_{\text{hom}}(S, A) = \mathbb{E}\big[R_{\text{hom}} + \gamma \max_{A' \in \mathcal{A}} Q^*_{\text{hom}}(S', A')\big]$, where $S'$ denotes the subsequent state after taking action $A$ under state $S$; and $R_{\text{hom}}$ denotes the associated homotopy reward.
Accordingly, to estimate an optimal homotopy Q-function, the critic $Q_{\text{hom},\phi}(S, A)$ can be learned by using Lemma 2. Specifically, we update $\phi$ by minimizing the loss function $\mathcal{L}(\phi) = \mathbb{E}_{\xi_{\text{hom}}}\big[(y - Q_{\text{hom},\phi}(S, A))^2\big]$, where $\xi_{\text{hom}} = (S, A, R_{\text{hom}}, S')$; and $y$ denotes the target value $y = R_{\text{hom}} + \gamma Q_{\text{hom},\phi}(S', \sigma_{\mathcal{A}}[\mu_\theta(S')])$. Regarding the update of the actor, it depends on the gradient of the objective $J_{\text{hom}}(\theta \mid \lambda, \sigma_{\mathcal{A}})$, which brings us to the following Deterministic Policy Gradient Theorem for HDDPG.

Lemma 3 Consider a homotopy deep deterministic policy with a continuous action space $\mathcal{A}$ and a homotopy variable $\lambda$, as well as a mapping function $\sigma_{\mathcal{A}}$. Suppose that $\sigma_{\mathcal{A}}(\cdot)$ is continuous. Then the policy gradient $\nabla_\theta J_{\text{hom}}(\theta \mid \lambda, \sigma_{\mathcal{A}})$ is given by (24).
Finally, we leverage inexact gradient methods to update $\{\theta, \phi, \lambda\}$ [35]. In particular, the updates of $\phi$ and $\theta$ occur at each epoch, i.e., $\phi \leftarrow \phi - \alpha_c \nabla_\phi \mathcal{L}(\phi)$ and $\theta \leftarrow \theta + \alpha_a \nabla_\theta J_{\text{hom}}(\theta \mid \lambda, \sigma_{\mathcal{A}})$, where $\alpha_c$ and $\alpha_a$ are the learning rates of the critic and the actor, respectively; and $\lambda$ is updated on a slower timescale, i.e., after every $I_0$ epochs, one executes $\lambda \leftarrow \lambda + \delta_i$ as in (27), where the sequence $\{\delta_i\}_{i=1}^{I}$ should meet the equality in (16).
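The following is a minimal sketch of the HDDPG update quantities just described, using the homotopy reward in (19) and a standard Bellman backup as assumptions: the penalty is folded into the instant reward, the critic target is a one-step backup on that reward, and lambda is relaxed toward 0 every I_0 epochs. The epoch budget and schedule values are illustrative.

```python
# A minimal sketch of HDDPG update quantities and the slow lambda schedule.
gamma, lam_min, I_steps, I0 = 0.99, -0.005, 10, 1000
deltas = [-lam_min / I_steps] * I_steps        # satisfies sum_i delta_i = -lam_min
lam, delta_iter = lam_min, iter(deltas)

def homotopy_reward(reward, penalty, lam):
    return reward + lam * penalty              # R_hom = R + lambda * g(S | sigma_A)

def critic_target(r_hom, q_next):
    return r_hom + gamma * q_next              # y = R_hom + gamma * Q_hom(S', pi_theta(S'))

for epoch in range(30000):
    # ... interact with the environment; update phi (critic) and theta (actor)
    # by stochastic gradient steps with learning rates alpha_c and alpha_a ...
    if (epoch + 1) % I0 == 0:                  # slow timescale for the homotopy variable
        lam = min(lam + next(delta_iter, 0.0), 0.0)
```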

Remark 2
In contrast with plain DDPG, which constitutes a special case of the proposed HDDPG (i.e., λ = 0), properly introducing a penalty term into the objective function helps infer which actions are better to take and avoids becoming stuck in suboptimal solutions.
More importantly, we unfold the homotopy optimization approach in Lemma 1 into a DRL algorithm, which learns through interacting with the environment.
In the ensuing sections, we apply HDDPG to the cooperative coded caching problem and propose a centralized caching design, and further generalize HDDPG to decentralized settings to reduce complexity and communication cost.

IV. CENTRALIZED HDDPG-BASED COOPERATIVE CODED CACHING
In this section, we introduce a centralized HDDPG (C-HDDPG) design for multi-agent cooperative coded caching. As illustrated in Fig. 2(a), the CP maintains the critic, the actor, and the mapping function, and coordinates all SBSs to conduct cache updating by using global information.

A. Proposed Centralized HDDPG-based Design
The system operation includes two procedures, i.e., network training and network evaluation.
In general, during network evaluation, the CP simply leverages the actor and the mapping function to make caching decisions, while the critic is only needed during the training procedure to fine-tune the actor. The details of the network design and training procedure are introduced as follows.
Network Design: In general, both networks, i.e., the critic and the actor, can be implemented by fully connected DNNs, where each hidden layer has a batch of neurons and an activation function to perform nonlinear transformations [36]. The output of the critic should be a scalar, which corresponds to the estimated value of the Q-function. To generate feasible actions, we elaborate on how to design the actor network $\mu_\theta$ and the mapping function $\sigma_{\mathcal{A}}$. It is evident that the number of neurons in the output layer of $\mu_\theta$ should match the dimension of a joint action, i.e., $F \times B$ (these neurons output a long vector $\mathbf{z} = [z_{1,1}, z_{2,1}, \cdots, z_{f,b}, \cdots, z_{F,B}]$). Then, we refine $\mathbf{z}$ by an activation function realized by scaling and SoftMax, whose output is thereafter filtered by a mapping function $\sigma_{\mathcal{A}}(\cdot) = \min\{\mathbf{1}_{F \times B}, \cdot\}$. Accordingly, any proto-action $\mu_\theta(S)$ can be mapped into a feasible action, i.e., $\min\{\mathbf{1}_{F \times B}, \mu_\theta(S)\}$.
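A minimal PyTorch sketch of this actor output stage follows. Here the scaling-and-SoftMax step is assumed to normalize per SBS (column-wise) so that each SBS's proto-decisions sum to L, and the mapping min{1, .} clips any element exceeding 1; the paper's exact activation may differ, so treat this as an assumption.

```python
# A minimal PyTorch sketch of the actor output activation and mapping function.
import torch

def actor_output_to_action(z, F_items, B_sbs, L_cap):
    """z: raw last-layer output of shape (batch, F_items * B_sbs)."""
    z = z.view(-1, F_items, B_sbs)
    proto = L_cap * torch.softmax(z, dim=1)       # assumed per-SBS SoftMax: sums over f equal L
    return torch.clamp(proto, max=1.0)            # mapping function min{1_{FxB}, .}

# Usage: a proto-action may have entries above 1 when L > 1; clipping keeps it feasible.
z = torch.randn(2, 4 * 3)                          # batch of 2, F = 4 items, B = 3 SBSs
action = actor_output_to_action(z, 4, 3, L_cap=2.0)
```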
Update: The technique of a Replay Buffer (RB) $\Xi$ is introduced to store historical experiences $\xi^t = (S^t, A^t, R^{t+1}_{\text{hom}}, S^{t+1})$, which serve as the data set for network training. The buffer size $|\Xi|$ is usually finite, and thus the most outdated experience is replaced by the current one once $\Xi$ is full. Subsequently, at each epoch, we randomly sample a mini-batch of $N$ experiences (i.e., a set $\Xi_N$) from the RB to update the parameters of the critic and actor networks. More concretely, parameter $\phi$ of the critic network can be updated by minimizing the loss function $\mathcal{L}(\phi) = \frac{1}{N}\sum_{\xi^t \in \Xi_N} \big(y^t_{\phi^-} - Q_{\text{hom},\phi}(S^t, A^t)\big)^2$, where the sum is over all of the sampled experiences and $y^t_{\phi^-}$ denotes the target value $y^t_{\phi^-} = R^{t+1}_{\text{hom}} + \gamma Q_{\text{hom},\phi^-}\big(S^{t+1}, \sigma_{\mathcal{A}}[\mu_{\theta^-}(S^{t+1})]\big)$; here, $Q_{\text{hom},\phi^-}(S, A)$ and $\mu_{\theta^-}(S)$ denote the target critic and the target actor with parameters $\phi^-$ and $\theta^-$, respectively. To stabilize training [32], the target networks are slowly updated, i.e., $\phi^- \leftarrow \tau \phi + (1-\tau)\phi^-$ and $\theta^- \leftarrow \tau \theta + (1-\tau)\theta^-$, where $\tau$ is a very small step size. With regard to updating parameter $\theta$, the corresponding homotopy deterministic policy gradient $\nabla_\theta J_{\text{hom}}$ can be estimated by (24). In addition, the homotopy variable is updated according to (27).
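A minimal PyTorch sketch of the mini-batch critic update and the slow target-network update just described follows; it assumes critic/actor modules of the kind sketched in Sec. III-A and treats sigma_A as a callable clipping function, so all names here are illustrative.

```python
# A minimal PyTorch sketch of the critic step and soft target update.
import torch
import torch.nn as nn

def soft_update(target_net, net, tau=0.001):
    # phi_minus <- tau * phi + (1 - tau) * phi_minus (and likewise for theta_minus)
    for p_t, p in zip(target_net.parameters(), net.parameters()):
        p_t.data.mul_(1.0 - tau).add_(tau * p.data)

def critic_step(batch, critic, critic_tgt, actor_tgt, sigma_A, optimizer, gamma=0.99):
    S, A, R_hom, S_next = batch                          # tensors sampled from the RB
    with torch.no_grad():
        A_next = sigma_A(actor_tgt(S_next))              # target policy pi = sigma_A[mu_theta-]
        y = R_hom + gamma * critic_tgt(S_next, A_next).squeeze(-1)
    loss = nn.functional.mse_loss(critic(S, A).squeeze(-1), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```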
Exploration: To avoid becoming stuck in suboptimal policies, exploration is usually needed during network training. The purpose of this process is to gather sufficient experiences, which are then used to infer what actions should be adopted under different states. In continuous decision-making applications, a typical method is to add Ornstein-Uhlenbeck (OU) random noise to the action generated by the actor [32], i.e., $\pi^{\text{explore}}_\theta(S^t) = \sigma_{\mathcal{A}}\big[\mu_\theta(S^t) + \beta^t \Delta^t\big]$, where $\sigma_{\mathcal{A}}$ is a simple mapping function applied if the noise-perturbed action violates $\mathcal{A}$; $\Delta^t = [\delta^t_{f,b}] \in \mathbb{R}^{F \times B}$, with each element $\delta^t_{f,b}$ denoting a sample drawn from a continuous OU process [32]; and $\beta^t \ge 0$ is a diminishing parameter.
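A minimal NumPy sketch of the OU exploration noise added to the proto-action follows; the OU parameters and the simple elementwise clipping that stands in for sigma_A are illustrative assumptions.

```python
# A minimal NumPy sketch of Ornstein-Uhlenbeck exploration for the actor output.
import numpy as np

class OUNoise:
    def __init__(self, shape, theta_ou=0.15, sigma_ou=0.2, dt=1.0, seed=0):
        self.shape, self.theta, self.sigma, self.dt = shape, theta_ou, sigma_ou, dt
        self.rng = np.random.default_rng(seed)
        self.x = np.zeros(shape)

    def sample(self):
        # dx = -theta * x * dt + sigma * sqrt(dt) * dW  (zero-mean OU process)
        self.x += (-self.theta * self.x * self.dt
                   + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.shape))
        return self.x

noise, beta = OUNoise((4, 3)), 0.9                 # F = 4, B = 3; beta is the diminishing weight
proto_action = np.full((4, 3), 0.5)
explored = np.minimum(1.0, np.maximum(0.0, proto_action + beta * noise.sample()))
```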
To this end, an entire implementation of this centralized control is shown in Algorithm 1.

B. Fronthaul Communication Complexity
In the proposed centralized caching design, the CP needs to frequently communicate with SBSs during network training and evaluation. Herein, we briefly analyze the fronthaul communication complexity of this centralized control, which is described by the total dimension of the variables transmitted between the CP and SBSs. We consider the worst case, in which each SBS is fully loaded and serves the maximum number of users, e.g., $|\mathcal{K}^t_b| = K$. Specifically, during network training, the CP needs to obtain information about the system state $[\{f^t_k\}, E^t, \mathbf{L}^t]$ at each epoch. The dimension of the user requests is $BK$. Network connectivity $E^t$ can be computed by knowing the coordinates of the active users; denoting each coordinate as a two-dimensional vector, the total dimension of the user positions is $2BK$. Clearly, the CP has exact information about the cache allocation $\mathbf{L}^t$, which is stored as $A^{t-1}$ in its RB; thus, no fronthaul cost is involved. Afterwards, the CP uses the fronthaul to inform SBSs of their caching decisions $\{A^t_b\}$; the total dimension of the involved variables is $BF$. Regarding the reward $R^{t+1}$, it can be inferred from the state $S^{t+1}$. Hence, the overall fronthaul communication complexity during network training is $\mathcal{O}(3BK + BF)$. When the system runs in the evaluation procedure, the CP again needs to know the system state and inform each SBS of its caching decision; consequently, the corresponding fronthaul communication complexity is also $\mathcal{O}(3BK + BF)$.
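A tiny sketch of this signaling count follows, evaluated for illustrative values of B, K, and F (not the paper's simulation settings).

```python
# A tiny sketch of the per-epoch fronthaul signaling dimension 3BK + BF.
B_sbs, K_users, F_items = 7, 10, 50
requests_dim  = B_sbs * K_users            # content indices observed per epoch
positions_dim = 2 * B_sbs * K_users        # 2-D coordinates of the active users
decisions_dim = B_sbs * F_items            # caching decisions pushed to SBSs
print("signaling per epoch:", requests_dim + positions_dim + decisions_dim)  # = 3BK + BF
```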
Moreover, the critic and actor are built upon system states and joint actions, i.e., $(S^t, A^t)$, whose scale is on the order of $\mathcal{O}(B^2)$ in terms of the local observation-action pairs $(S^t_b, A^t_b)$. For this reason, the computational complexity would become excessively high as the number of agents increases in a continuous RL problem [37]. To address this issue, we focus on developing efficient decentralized algorithms in the following sections.

V. PARTIALLY DECENTRALIZED HDDPG-BASED COOPERATIVE CODED CACHING
In this section, to circumvent excessive communication cost and high complexity in the centralized design, we develop a partially decentralized (PD)-HDDPG-based cooperative coded caching design. This scheme operates at the level of PD control, in the sense that a (centralized) critic is used to train (local) actors that separately approximate the caching policy of each SBS.

A. Partially Decentralized Multi-Agent HDDPG
In a PD multi-agent framework, each agent maintains an actor and a mapping function to produce its actions. To augment collaboration among multiple agents, these actors are trained with the aid of a (centralized) critic. Specifically, agent $b$ has an actor $\mu_b$ (parametrized by $\theta_b$) and a mapping function $\sigma_b$, which maps a (local) proto-action $\mu_b(S_b)$ into the corresponding action space $\mathcal{A}_b$. Accordingly, the policy function for agent $b$ can be expressed by $\pi_b(\cdot) = \sigma_b[\mu_b(\cdot)]$.

Algorithm 1 (Proposed C-HDDPG-based Cooperative Coded Caching): Initialize $\tau$, $\alpha_c$, $\alpha_a$, $\gamma$, $N$, $I_0$, $\lambda = \lambda_{\min}$; initialize parameter $\phi$ for the critic network and $\theta$ for the actor network; initialize $\phi^- \leftarrow \phi$, $\theta^- \leftarrow \theta$ for the target critic and target actor networks; initialize the RB $\Xi$, the mapping function $\sigma_{\mathcal{A}}$, and the sequence $\delta_1, \delta_2, \cdots, \delta_I$. For $t = 0, 1, 2, \cdots$: input $S^t$ to the actor and output $A^t = \pi^{\text{explore}}_\theta(S^t)$; take action $A^t$ and observe $S^{t+1}$, $R^{t+1}$; calculate $R^{t+1}_{\text{hom}}$ by (19) and push the experience into $\Xi$; then (procedure TrainHDDPG) randomly sample a mini-batch of $N$ experiences $\Xi_N$ from the RB and update the critic, the actor, and the homotopy variable as described in Sec. III-C.
On the basis of homotopy optimization, all agents cooperatively seek policies to jointly maximize the following homotopy function: $J_{\text{hom}}(\theta \mid \lambda, \sigma_{\mathcal{A}}) = \mathbb{E}\big[\sum_{t=0}^{+\infty} \gamma^t R^{t+1}_{\text{hom}}\big]$, where we define $\mu_\theta \triangleq \{\mu_1, \cdots, \mu_B\}$ and $\sigma_{\mathcal{A}} \triangleq \{\sigma_1, \cdots, \sigma_B\}$, and the homotopy reward $R^t_{\text{hom}}$ is given by (19). Next, a (centralized) critic $Q_{\text{hom},\phi}(S, A)$ is maintained to evaluate the joint action under the global state. Similar to C-HDDPG, parameter $\phi$ can be learned by minimizing a loss function of the same form as in Sec. IV-A over mini-batches sampled from the RB.

Algorithm 2 (Proposed PD-HDDPG-based Cooperative Coded Caching): At each epoch $t$, each SBS $b$ generates its local action; the CP forms the joint action $A^t = \{A^t_1, \cdots, A^t_B\}$ and calculates $R^{t+1}_{\text{hom}}$ by (19); the CP pushes $\xi^t = (S^t, A^t, R^{t+1}_{\text{hom}}, S^{t+1})$ into the RB; then (procedure TrainHDDPG) it randomly samples a mini-batch of experiences $\Xi_N$ and updates the centralized critic and the local actors.

B. Implementation
As depicted in Fig. 2(b), we propose a PD-HDDPG-based cooperative coded caching design. Specifically, the CP maintains a centralized critic, while each SBS has a local actor and mapping function. The actors and mapping functions are designed in the same manner as in the centralized scheme to ensure that their outputs are feasible with respect to (4). During the training procedure, the critic and the (local) actors are learned at the CP. We again adopt the techniques of exploration and the RB, and the entire procedure is similar to that presented in Algorithm 1; the detailed implementation is shown in Algorithm 2. Notably, after all actors are fine-tuned, the CP needs to send the actor parameters (e.g., $\theta_b$, $\forall b$) to the SBSs, which thereafter can compute actions locally.
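To illustrate this structure, the following is a minimal PyTorch sketch of the partially decentralized arrangement: one centralized critic evaluates the global state and joint action, while each SBS keeps a local actor fed by its local observation. The dimensions and layer sizes are assumptions for illustration only.

```python
# A minimal PyTorch sketch of the PD-HDDPG structure (centralized critic, local actors).
import torch
import torch.nn as nn

B_sbs, obs_dim, act_dim = 3, 20, 8                       # per-SBS observation/action sizes

local_actors = nn.ModuleList([
    nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                  nn.Linear(64, act_dim), nn.Sigmoid())
    for _ in range(B_sbs)
])
central_critic = nn.Sequential(
    nn.Linear(B_sbs * (obs_dim + act_dim), 128), nn.ReLU(),
    nn.Linear(128, 1)
)

local_obs = [torch.randn(1, obs_dim) for _ in range(B_sbs)]          # local observations S_b
joint_action = torch.cat([actor(o) for actor, o in zip(local_actors, local_obs)], dim=-1)
global_state = torch.cat(local_obs, dim=-1)
q_value = central_critic(torch.cat([global_state, joint_action], dim=-1))
```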

C. Fronthaul Communication Complexity
During the training procedure, fronthaul communication complexity is the same as that of C-HDDPG, i.e., O(3BK + BF ). When the system runs in an evaluation procedure, each SBS computes its action locally; obviously, no fronthaul communication is incurred when observing content requests and positions of local users.

VI. FULLY DECENTRALIZED HDDPG-BASED COOPERATIVE CODED CACHING
To further reduce complexity and fronthaul signaling, we propose a fully decentralized (FD) control for cooperative coded caching. Particularly, each SBS serves as an independent learner to locally train its caching policy. Hereunder, we first present the FD-HDDPG-based caching design, and then briefly summarize the complexity of all of the proposed designs.

A. Fully Decentralized Cooperative Coded Caching Design
As shown in Fig. 2(c), each SBS has its own critic, actor, and mapping function. These basic elements are designed in the same manner as those of C-HDDPG, but are built upon local observations.

Algorithm 3 (Proposed FD-HDDPG-based Cooperative Coded Caching): At each epoch, each SBS $b \in \mathcal{B}$ calculates its local homotopy reward $R^{t+1}_{b,\text{hom}}$ by (37) and stores the experience in its local RB; then (procedure TrainHDDPG), for each $b \in \mathcal{B}$, a mini-batch of experiences $\Xi_{b,N}$ is randomly sampled to update the local critic and actor.

Here, the local homotopy reward $R^{t+1}_{b,\text{hom}}$ in (37) is defined analogously to (19), but based on local observations. Subsequently, each agent is envisioned to independently train its critic and actor. In addition, the training procedure follows the same workflow as C-HDDPG, and is presented in Algorithm 3 in greater detail.

B. Fronthaul Communication Complexity
During network training, although fronthaul communications are not necessary for SBSs to obtain local observations, each SBS still needs to know the homotopy reward; SBSs first compute the relevant quantities locally and then exchange a small amount of information over the fronthaul to obtain the common reward.

Remark 3 As a comparison, we summarize the fronthaul communication complexity of all algorithms in Table II. It can be observed that C-HDDPG incurs the highest fronthaul communication complexity in both the training and evaluation procedures, since it manipulates the system operation under centralized control. PD-HDDPG requires the same order of signaling as C-HDDPG during training but operates in a decentralized manner during network evaluation; thus, its fronthaul communication complexity during the evaluation procedure is as low as that of FD-HDDPG. Indeed, FD-HDDPG has the lowest fronthaul communication complexity during both procedures, which might compromise performance. Therefore, by developing different levels of caching control, the proposed framework is envisioned to possess the advantages of superior performance as well as scalability to large-scale systems.

VII. PERFORMANCE EVALUATIONS
In this section, we present performance evaluations of the proposed DRL algorithms for cooperative coded caching under different scenarios. Specifically, we first provide simulation setup and then compare the proposed algorithms with baselines. Subsequently, we investigate the impacts of system parameters on the proposed algorithms.

A. Simulation Setup
Unless stated otherwise, we consider the following default settings: each SBS has a fractional caching capacity of L/F = 0.2, which indicates that each SBS can cache 20% of the total content catalog.

B. Convergence Behavior
To analyze the proposed caching framework, we consider the following baselines: • Centralized Optimization-Cache Updating (CO-CU): This is a centralized optimization-based design performed at the CP [17], [18]. Specifically, one first estimates the probability that each content item is requested by users under the coverage of the SBSs, and then updates the cached content accordingly.

To implement the proposed algorithms, each critic is designed with three hidden layers, each of which contains 512 neurons. Each actor consists of three hidden layers with 256, 128, and 64 neurons, respectively. All of the networks are trained by the Adam optimizer with a polynomial learning-rate decay (readers are referred to [38] for additional details), where the initial learning rates for the actors and critics are set to 0.01 and 0.001, respectively, and the power factor for the decay is 0.9. A mini-batch of 100 experiences is randomly sampled every time from the RB, which can store 5000 past experiences. Every target critic or target actor is updated with a step size τ = 0.001, and the discount factor is γ = 0.99. To perform policy exploration, we use an OU process with mean 0 and variance 1; the associated diminishing parameter β is initialized as 0.9 and decreased at a rate of 0.995 every epoch until it reaches 0.0001. Finally, we initialize the following sequence to update the homotopy variable λ: {δ_i = −0.1 × λ_min, i = 1, 2, · · · , 10}, where λ_min = −0.005, and we update λ every I_0 = 1000 epochs.
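For readability, the hyperparameters listed above can be gathered into a single configuration object, as in the sketch below; the values follow the description in the text, while the dictionary layout itself is only an organizational choice.

```python
# A compact summary of the stated simulation hyperparameters (values from the text).
config = {
    "critic_hidden_layers": [512, 512, 512],
    "actor_hidden_layers": [256, 128, 64],
    "optimizer": "Adam with polynomial learning-rate decay (power 0.9)",
    "lr_actor_init": 0.01,
    "lr_critic_init": 0.001,
    "minibatch_size": 100,
    "replay_buffer_size": 5000,
    "target_update_step_tau": 0.001,
    "discount_factor_gamma": 0.99,
    "ou_noise": {"mean": 0.0, "variance": 1.0},
    "beta_init": 0.9, "beta_decay": 0.995, "beta_min": 0.0001,
    "lambda_min": -0.005,
    "delta_schedule": "delta_i = -0.1 * lambda_min, i = 1, ..., 10",
    "lambda_update_period_I0": 1000,
}
```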
As shown in Fig. 3, we first illustrate the learning curves of the proposed algorithm under centralized control. In particular, we vary the parameter λ_min to investigate the impact of the penalty.
Each result is averaged over N = 5000 epochs, i.e., $\sum_{\tau=t-N+1}^{t} R^{\tau} / N$. It can be observed that in the first $10^4$ epochs, the curves of the HDDPG-based algorithms rise markedly, and notable gaps can be observed between plain DDPG and the HDDPG-based algorithms. In subsequent epochs, the learning curves of these DRL algorithms increase gradually until convergence. Clearly, when λ_min is -0.005 or -0.015, the HDDPG-based algorithms achieve higher caching rewards than plain DDPG; when λ_min goes down to -0.1, the curve increases fairly slowly and converges to a level that is very close to that of plain DDPG. Therefore, if $\lambda_{\min} G$ is significantly large compared with the objective, it could dominate the actual objective, eventually leading to a suboptimal policy.
These observations demonstrate that properly introducing the penalty term into RL (e.g., $\lambda_{\min} G$ in (14)) can assist agents in inferring better actions and speed up convergence. Furthermore, we propose to implement HDDPG (with $\lambda_{\min} = -0.005$) by initializing the RB with 10% warm-up experiences obtained via optimization baselines (e.g., CO-CU), rather than relying only on the conventional exploration method used in plain DDPG, i.e., utilizing OU random noise to explore the action space. As can be seen, with a few warm-up experiences, the proposed implementation further improves performance compared with the conventional exploration under the same $\lambda_{\min} = -0.005$. This result implies that taking advantage of a good baseline improves the efficiency of exploration in DRL, yet at the cost of additional computational complexity. The DRL-based designs also exhibit more concentrated reward distributions, while the rewards achieved by the optimization baseline spread out over quite a broad range. This observation demonstrates the effectiveness of HDDPG in tracking and adapting to the dynamic features of wireless networks.

C. Impacts of System Parameters
In this subsection, we study the impacts of system parameters on the proposed caching framework. All of the results are obtained by averaging over $10^4$ epochs. We first investigate the impacts of caching capacity under a larger content catalog size (e.g., F = 50). Clearly, as shown in Fig. 6, C-HDDPG is always superior to the other algorithms. When the fractional caching capacity (i.e., L/F) is 10%, C-HDDPG achieves the lowest fronthaul traffic loads, e.g., 0.47, in contrast with 0.52 for PD-HDDPG and FD-HDDPG. The superiority of C-HDDPG demonstrates the effectiveness of using global information to enhance SBS collaboration. As the fractional caching capacity grows larger, the gap between PD-HDDPG and FD-HDDPG becomes larger. Indeed, with the aid of a centralized critic to train local policies, PD-HDDPG allows SBSs to collaborate more tightly than in the fully decentralized scheme. Although FD-HDDPG depends on local observations only, it still outperforms CO-CU and LO-CU by 6.11% and 11.51%, respectively, under the scenarios being studied. This observation demonstrates the remarkable advantages of using DRL algorithms to learn policies under dynamic environments over conventional optimization-based algorithms.
Hereunder, we conduct experiments to investigate the impacts of content popularity by varying the skewness factor of the Zipf distribution. Under each scenario being investigated, the corresponding skewness factor is fixed as a constant, where a larger skewness factor indicates a more concentrated content popularity. As can be seen, the fronthaul traffic loads decrease as the skewness factor becomes larger for all of the algorithms except RCU. The reason is that user requests are more likely to be served by local SBSs when user preferences are more concentrated. Furthermore, PD-HDDPG achieves fronthaul traffic loads comparable to those of C-HDDPG when the skewness factor is smaller than 1; beyond that, centralized control produces only a marginal performance gain over decentralized control, yet with a significant implementation cost. This finding demonstrates that PD-HDDPG can efficiently obtain a satisfactory trade-off between complexity and performance.
To investigate the scalability of the proposed algorithms, we carry out experiments by varying the content catalog size. As depicted in Fig. 8, PD-HDDPG obtains a performance comparable to C-HDDPG when the content catalog size is smaller than 50; as more content items are considered, C-HDDPG achieves better performance due to the utilization of global information. It is worth noting that vast gaps can be observed between the proposed algorithms and the baselines under either centralized or decentralized scenarios. More specifically, over the entire horizontal axis, C-HDDPG and PD-HDDPG can decrease fronthaul traffic loads by 10.54% and 7.68%, respectively, in comparison to CO-CU, whereas FD-HDDPG also reduces fronthaul traffic loads compared with the baselines.

We further investigate how the number of agents (i.e., SBSs) impacts the proposed multi-agent algorithms. In these settings, we vary the number of SBSs and set the distance between two adjacent SBSs to 300 m. As shown in Fig. 9, traffic loads exhibit a decreasing trend as more SBSs are available to participate in cooperative coded caching. In addition, all of the curves decrease relatively slowly when more than 15 SBSs are deployed, which implies that most users may already be able to access the maximum number of local SBSs, which is usually limited by the communication coverage. More importantly, when the number of agents becomes large, PD-HDDPG always achieves results comparable to C-HDDPG with significant reductions in signaling overhead and complexity. Concerning FD-HDDPG, it achieves slightly larger traffic loads than C-HDDPG and PD-HDDPG, by at most 3.23% and 2.34%, respectively.
These results demonstrate the potentials of utilizing decentralized controls as a large number of SBSs are deployed.

VIII. CONCLUSION
We have proposed a deep multi-agent reinforcement learning framework for dynamic cooperative coded caching in small cell networks. Particularly, we have developed a novel deep reinforcement learning algorithm, i.e., homotopy DDPG, to address the challenges arising from the resultant continuous decision-making with constraints. From an engineering perspective, we have proposed centralized, partially decentralized, and fully decentralized controls to balance complexity and performance. Simulation results have confirmed that the proposed DRL outperforms plain DDPG under different levels of control; and the proposed decentralized designs also achieve satisfactory performance compared with the centralized design.

APPENDIX
A. Proof of Proposition 1

Consider an optimal decision sequence $\{A^t\}$, which results in the optimal value $J^* = \sum_t \gamma^t R^{t+1}$. Suppose that there exist $t'$ and $b'$ such that $\sum_f a^{t'}_{f,b'} = L'$ with $L' < L$; the corresponding decision $A^{t'}$ affects the rewards $R^{t'+1}$ and $R^{t'+2}$. To proceed, we construct another decision $\tilde{A}^{t'}$ that fully loads SBS $b'$, i.e., $\sum_f \tilde{a}^{t'}_{f,b'} = L$, while keeping all other entries unchanged. By recalling (5), one can verify that $\tilde{R}^{t'+1} \ge R^{t'+1}$ and $\tilde{R}^{t'+2} \ge R^{t'+2}$. We thereby claim that $\{\tilde{A}^t\}$, where $\tilde{A}^t = A^t$ for $\forall t \ne t'$, is also an optimal decision sequence. Hence, Proposition 1 holds.

B. Proof of Lemma 3
This proof follows similar procedures to the Deterministic Policy Gradient Theorem in [39].