Decentralized Consensus Optimization Based on Parallel Random Walk

The alternating direction method of multipliers (ADMM) has recently been recognized as a promising approach for large-scale machine learning models. However, very few results study ADMM from the perspective of communication costs, especially jointly with running time. In this letter, we investigate the communication efficiency and running time of ADMM in solving the consensus optimization problem over decentralized networks. We first review random walk ADMM (W-ADMM), which reduces communication costs at the expense of running time. To accelerate the convergence of W-ADMM, we propose the parallel random walk ADMM (PW-ADMM) algorithm, where multiple random walks are active at the same time. Moreover, to further reduce the running time of PW-ADMM, the intelligent parallel random walk ADMM (IPW-ADMM) algorithm is proposed by integrating the Random Walk with Choice with PW-ADMM. Numerical results from simulations demonstrate that the proposed algorithms can be both communication efficient and fast in running speed compared with state-of-the-art methods.


I. INTRODUCTION
CONSIDER a network $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = \{1, ..., N\}$ is the set of agents and $\mathcal{E}$ is the set of links. The agents aim to solve the following consensus optimization problem,

$$\min_{x} \sum_{i=1}^{N} f_i(x), \tag{1}$$

where $f_i : \mathbb{R}^n \to \mathbb{R}$ is the local loss function held by agent $i$, and all the agents share a common optimization variable $x \in \mathbb{R}^n$. This consensus problem is applied in various areas including wireless sensor networks [1], [2] and smart grid implementations [3]. Specifically, in [4], a distributed beamforming scheme for multiple relay nodes (RNs) is designed by solving a consensus problem. More generally, consensus-based distributed linear estimation for cooperative communication in wireless networks can be formulated as (1) [5].
A few decentralized algorithms have been proposed to solve the consensus problem in (1). For low computational complexity, first-order algorithms such as decentralized gradient descent (DGD) and EXTRA are proposed in [6] and [7], respectively, where agents use their local gradients during the optimization process. Among existing decentralized algorithms, the alternating direction method of multipliers (ADMM), which is extensively applied in wireless communications [8], is shown to converge faster than DGD [9]; at every iteration, an agent needs to solve an optimization problem with information collected from neighboring agents. Besides, a variety of algorithms such as Gauss-Seidel ADMM and Jacobi-Proximal ADMM [10], based on the original work in [11], are provided to solve the consensus problem (1).
Yu Ye, Hao Chen, Zheng Ma and Ming Xiao are with the School of Electrical Engineering and Computer Science, KTH, Stockholm, Sweden (email: yu9@kth.se, haoch@kth.se, zma@kth.se, mingx@kth.se).
In practice, an ideal decentralized approach is expected to obtain the optimal solution of (1) with minimal communication and computation costs. Much research effort has been devoted to reducing computation complexity, but very few results [12]-[14] address the communication cost of ADMM. Although both distributed ADMM (D-ADMM) in [12] and communication-censored ADMM (COCA) in [13] limit the overall communication at each iteration, COCA adaptively determines whether a message is informative, whereas D-ADMM relies more on the network topology. The random walk ADMM (W-ADMM) algorithm proposed in [14] randomly activates a succession of nodes and incrementally updates the optimization variable. W-ADMM achieves a much lower communication cost, but at the expense of running time, since only one agent is active for optimization at each iteration. Moreover, all the approaches provided in [12]-[14] are synchronous ADMM methods, which may suffer from the straggler problem [15].
In what follows, we propose the parallel random walk ADMM (PW-ADMM) algorithm, which allows multiple random walks to be active in parallel. Furthermore, we integrate an intelligent agent selection scheme with PW-ADMM, which yields the intelligent parallel random walk ADMM (IPW-ADMM) algorithm. Numerical results show that the proposed approaches can be both communication efficient and fast in running time.
The remainder of this letter is organized as follows. We first introduce the parallel random walk algorithms in Section II. To demonstrate the effectiveness of the proposed approaches, we provide numerical results in Section III. Finally, we conclude the letter in Section IV.

II. PARALLEL RANDOM WALK ALGORITHMS
By defining $x = [x_1, ..., x_N] \in \mathbb{R}^{nN}$, problem (1) can be rewritten as

$$\min_{x, z} \sum_{i=1}^{N} f_i(x_i), \quad \text{s.t.} \quad \mathbf{1} \otimes z - x = 0, \tag{2}$$

where $z \in \mathbb{R}^n$, $\mathbf{1} = [1, ..., 1]^T \in \mathbb{R}^N$ and $\otimes$ is the Kronecker product. The augmented Lagrangian for problem (2) is given in (3), where $\lambda = [\lambda_1, ..., \lambda_N] \in \mathbb{R}^{nN}$ is the dual variable, and $\rho > 0$ is a constant parameter. The iterative updates of $x$, $\lambda$ and $z$ can be found in Algorithm 1 (W-ADMM) in [14]. Fig. 1(a) presents an example of W-ADMM. Ignoring the difference in communication cost, an equivalent implementation of W-ADMM is shown in Fig. 1(b), where agent $i$ updates the local variables $x_i$ and $\lambda_i$ after receiving the token $z$, while the virtual master updates $z$ with the up-to-date $x_i$ and $\lambda_i$.
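For reference, one common way to write the augmented Lagrangian of a consensus constraint of the form $\mathbf{1} \otimes z - x = 0$ is sketched below. The symbol $\mathcal{L}_\rho$, the plus sign on the multiplier term and the per-agent grouping are assumptions made here for illustration; the convention used in (3) of this letter may differ.

```latex
% Assumed convention: constraint 1 (x) z - x = 0, multiplier term with a plus sign.
\mathcal{L}_\rho(x, z, \lambda)
  = \sum_{i=1}^{N} \Big[ f_i(x_i)
      + \langle \lambda_i,\, z - x_i \rangle
      + \frac{\rho}{2}\, \| z - x_i \|^2 \Big]

% Setting the gradient with respect to z to zero, with x and \lambda fixed, gives
z = \frac{1}{N} \sum_{i=1}^{N} \Big( x_i - \frac{\lambda_i}{\rho} \Big)
```

Under this convention, the minimizer over $z$ depends only on the current $x_i$ and $\lambda_i$, which is consistent with the role of the virtual master in Fig. 1(b).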

A. Parallel Random Walk ADMM
To introduce the parallel random walk ADMM, we extend the architecture in Fig. 1 and consider problem (4), where $z_l \in \mathbb{R}^n$ ($l \in \mathcal{M}$) is the token held by the $l$-th random walk. In the constraint of problem (4), we let $x_i$ equal the average of the tokens, $\frac{1}{M}\sum_{l=1}^{M} z_l$. By doing this, we are able to update the tokens of multiple random walks in parallel.

The augmented Lagrangian for problem (4) is given in (5). Following the traditional synchronous ADMM [11], the update for the $(k+1)$-th iteration follows (6a)-(6c). Inspired by the incremental update of W-ADMM [14], we transform (6a)-(6c) into the process (7a)-(7c) by approximating $\frac{1}{M}\sum_{l=1}^{M} z_l$ with $z_m$ for the update of $x_i$ and $\lambda_i$, where $\mathcal{M}^{k+1} \subseteq \mathcal{M}$ is the set of active random walks at iteration $k$. We adopt a proximal update for $x_i$ with $q_i = \tau_i I$, where $\tau_i > 0$ is a step size penalty chosen by agent $i$ and [...] (7a) and (7b). Hence, to make (7c) satisfied, the update for token $z_m$ should follow (8).

The update of $z_m^{k+2}$ in (8) can be carried out in parallel and asynchronously since it does not require information of $z_l^{k+1}$ ($l, m \in \mathcal{M}^{k+1}$, $l \neq m$). Thus we parallelize (7a)-(7c) by the updates in (9) for the $(k'_m + 1)$-th step of the $m$-th random walk, where $k'_m$ is the clock held by the $m$-th random walk. In (9a), [...]. For agents $i \neq i_{k'_m}$, the local variables $x_i$ and $\lambda_i$ are not updated by the $m$-th random walk. The update of $z_m$ [...] instead of (6a). Defining $\mathcal{V}_i \subset \mathcal{V}$ as the set of neighbors of agent $i \in \mathcal{V}$ [...], we present PW-ADMM in Algorithm 1. Similar to W-ADMM, the transition of token $z_m$ follows the embedded Markov chain with probability matrix $P \in \mathbb{R}^{N \times N}$. When $M = 1$, PW-ADMM reduces to W-ADMM.

Algorithm 1: Parallel Walk ADMM (PW-ADMM)
1: Initialize: [...]

Algorithm 2: Intelligent PW-ADMM (IPW-ADMM)
1: Initialize: same as that of PW-ADMM.
2: Algorithm of the m-th Walk:
3: follow steps 3-9 of PW-ADMM but substitute step 9 with the following step: [...]
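The token transitions described above can be illustrated with a minimal sketch, which covers only the movement of $M$ tokens over the network and not the ADMM variable updates. The function names, the ring topology and the choice of a uniform transition over neighbors are assumptions for illustration; the letter only states that transitions follow a Markov chain with probability matrix $P$. The walks are advanced in lockstep here for simplicity, whereas in the letter each walk keeps its own clock.

```python
import random

def uniform_transition_matrix(neighbors):
    """Build a row-stochastic P in which agent i forwards a token
    to one of its neighbors uniformly at random (an assumption)."""
    n = len(neighbors)
    P = [[0.0] * n for _ in range(n)]
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            P[i][j] = 1.0 / len(nbrs)
    return P

def run_parallel_walks(P, num_walks, num_steps, rng):
    """Move num_walks tokens for num_steps hops each.
    Returns the token positions after every step and per-agent visit counts."""
    n = len(P)
    agents = list(range(n))
    positions = [rng.randrange(n) for _ in range(num_walks)]  # random start agents
    visits = [0] * n
    paths = [list(positions)]
    for _ in range(num_steps):
        for m in range(num_walks):
            # Token m hops according to the row of P at its current agent.
            positions[m] = rng.choices(agents, weights=P[positions[m]])[0]
            visits[positions[m]] += 1
        paths.append(list(positions))
    return paths, visits

# Toy topology (hypothetical): a ring of 6 agents.
N = 6
neighbors = [[(i - 1) % N, (i + 1) % N] for i in range(N)]
P = uniform_transition_matrix(neighbors)
paths, visits = run_parallel_walks(P, num_walks=3, num_steps=50, rng=random.Random(0))
```

Setting `num_walks=1` gives a single token walk, mirroring the remark that PW-ADMM reduces to W-ADMM when $M = 1$.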

B. Intelligent Parallel Random Walk ADMM
For general problems, the convergence speeds of W-ADMM and PW-ADMM are mainly determined by how frequently all of the agents are visited. Since the transition of the token is determined by the probability matrix $P$, it is possible that the variables $x_i$ and $\lambda_i$ at some agents are updated for far fewer rounds than at others, which may reduce the overall convergence speed. To prevent agents from staying inactive for a long time, we should improve the transition strategy for tokens. Inspired by [16], which introduces the Random Walk with Choice, we present IPW-ADMM in Algorithm 2. Different from PW-ADMM, IPW-ADMM requires that agent $i$ knows the number of active rounds $k_j$ of each agent $j \in \mathcal{V}_i$. Considering the $k'_m$-th step of the $m$-th random walk, the updated $z_m$ will be sent to the least visited agent $i_{k'_m+1} := \arg\min_{i \in \mathcal{V}_{i_{k'_m}}} k_i$. Note that we do not count the communication cost of sharing $\{k_i\}$ across the agents, since this cost is negligible compared with that of transmitting tokens.
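The least-visited selection rule can be sketched as follows. This is a minimal illustration, assuming that each agent can read the visit counts $k_i$ of its neighbors and that ties are broken uniformly at random; the tie-breaking rule, the function names and the toy ring topology are not taken from the letter.

```python
import random

def next_agent_least_visited(current, neighbors, visit_counts, rng):
    """Return the neighbor of `current` with the smallest visit count k_i.
    Ties are broken uniformly at random (an assumption)."""
    candidates = neighbors[current]
    fewest = min(visit_counts[j] for j in candidates)
    least_visited = [j for j in candidates if visit_counts[j] == fewest]
    return rng.choice(least_visited)

def walk_with_least_visited_rule(neighbors, start, num_steps, rng):
    """Run one token walk that always moves to a least-visited neighbor."""
    visit_counts = [0] * len(neighbors)
    current = start
    visit_counts[current] += 1
    for _ in range(num_steps):
        current = next_agent_least_visited(current, neighbors, visit_counts, rng)
        visit_counts[current] += 1
    return visit_counts

# Toy topology (hypothetical): a ring of 6 agents.
ring = [[(i - 1) % 6, (i + 1) % 6] for i in range(6)]

# Agent 0 has neighbors 5 and 1; agent 5 has been visited less often, so it is chosen.
chosen = next_agent_least_visited(0, ring, [3, 2, 0, 0, 0, 1], random.Random(0))

counts = walk_with_least_visited_rule(ring, start=0, num_steps=60, rng=random.Random(1))
```

Compared with a transition drawn from $P$, this rule uses only local information about neighbors, which matches the requirement stated above.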
Since all agents and parallel random walks keep individual clocks, both PW-ADMM and IPW-ADMM are asynchronous algorithms. However, our proposed algorithms differ from existing work [15], [17], where only one master updates the variable $z$. Moreover, the updated $z$ is only sent to the agents just active.

C. Convergence Analysis
We present some results on the convergence of PW-ADMM and IPW-ADMM with the following assumption.

Assumption 1. The objective function f
Although (I)PW-ADMM is an asynchronous algorithm, we prove the convergence from the synchronous point of view. Without loss of generality, we denote each synchronous iteration as the update for only one token, where [...].

Theorem 1. Under Assumption 1 and $\tau_i = 0$, the sequence $(x^k, \lambda^k, z^{k+1})$ generated by (I)PW-ADMM satisfies (11).

Proof. By substituting $z^{k+1}$ with $\frac{1}{M}\sum_{l=1}^{M} z_l^{k+1}$ in the proofs of Lemma 1 and Theorem 1 in [14], the result (11) can be obtained.

III. NUMERICAL RESULTS
In this section, we provide numerical results from simulations to demonstrate the communication efficiency and running speed of PW-ADMM and IPW-ADMM compared with the state-of-the-art methods in [6], [7], [9], [13], [14], [18] with respect to the accuracy, which is defined in (12), where $x^* \in \mathbb{R}^n$ is the optimal solution of (2). For a fair comparison, the parameters of each algorithm are tuned for best performance and kept the same across experiments. The connected network $\mathcal{G}$ is generated randomly with $N$ agents and $|\mathcal{E}| = \frac{N(N-1)}{2}\eta$ links, so that $\eta$ is the fraction of all possible links that are present. Besides, the dimension of $x_i$ is set to $n = 2$. We consider unicast among agents, and the resultant communication cost for each transmission of an $n$-dimensional vector is 1 unit. The running time includes both computing time and communication time. Without loss of generality, we assume that each agent has multi-process capability to update the tokens for multiple random walks in PW-ADMM and IPW-ADMM. Moreover, the consumed time for each communication is assumed to follow $\mathcal{U}(10^{-5}, 10^{-4})$ (s). The simulation is carried out on a laptop with an Intel i7 processor and 8 GB of memory. The programming environment is Matlab R2016a.
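The network setup above can be sketched as follows. The letter only states that the graph is connected, random, and has $\frac{N(N-1)}{2}\eta$ links; the construction used here (a random spanning tree followed by uniformly sampled extra links), the function names and the toy size $N = 20$ are assumptions for illustration.

```python
import random
from collections import deque

def random_connected_graph(num_agents, eta, rng):
    """Return a set of undirected links (i, j), i < j, of size round(eta * N(N-1)/2).
    Assumed construction: random spanning tree first, then random extra links."""
    target = round(eta * num_agents * (num_agents - 1) / 2)
    if target < num_agents - 1:
        raise ValueError("eta is too small for a connected graph")
    order = list(range(num_agents))
    rng.shuffle(order)
    edges = set()
    for idx in range(1, num_agents):
        # Attach each new agent to a random earlier one, which keeps the graph connected.
        a, b = order[idx], order[rng.randrange(idx)]
        edges.add((min(a, b), max(a, b)))
    remaining = [(i, j) for i in range(num_agents)
                 for j in range(i + 1, num_agents) if (i, j) not in edges]
    edges.update(rng.sample(remaining, target - len(edges)))
    return edges

def is_connected(num_agents, edges):
    """Breadth-first search from agent 0 over the undirected links."""
    nbrs = {i: set() for i in range(num_agents)}
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    seen, queue = {0}, deque([0])
    while queue:
        node = queue.popleft()
        for nxt in nbrs[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return len(seen) == num_agents

def communication_delay(rng):
    """Time of one transmission in seconds, drawn from U(1e-5, 1e-4)."""
    return rng.uniform(1e-5, 1e-4)

rng = random.Random(0)
N, eta = 20, 0.5
edges = random_connected_graph(N, eta, rng)
delays = [communication_delay(rng) for _ in range(5)]
```

With unicast and a cost of 1 unit per transmitted vector, the communication cost of a run is the number of token or message transmissions, and the communication part of the running time is the sum of the sampled delays.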

A. Decentralized least square problem
The decentralized least square problem, such as in [19], aims at solving problem (1) with a least-squares local cost function over $\mathcal{D}_i = \{o_{i,j}, t_{i,j} \mid j = 1, ..., b_i\}$, the local dataset of agent $i$. The entries of the input $o_{i,j} \in \mathbb{R}^2$ and the target $t_{i,j} \in \mathbb{R}$ are drawn i.i.d. from $\mathcal{U}(0, 1)$. The number of data samples is kept identical across agents with $b_i = 30$. The accuracy in (12) over communication cost and running time for different network settings is shown in Fig. 2. It is clear that W-ADMM is the most efficient in communication cost but is slow in running speed. The proposed parallel random walk algorithms PW-ADMM and IPW-ADMM significantly reduce the running time relative to W-ADMM, and consume much less communication resources than DGD, D-ADMM, EXTRA and COCA. In addition, IPW-ADMM further reduces the running time relative to PW-ADMM. Especially when the network is large and highly connected, i.e., $N = 200$ and $\eta = 0.5$, PW-ADMM and IPW-ADMM achieve the best performance in running time with almost the same communication cost as W-ADMM. This is because the inherent asynchronous mechanism of PW-ADMM and IPW-ADMM outperforms the synchronous methods.
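The data model of this experiment can be sketched as follows. The local cost is assumed here to be $f_i(x) = \frac{1}{2}\sum_{j}(o_{i,j}^T x - t_{i,j})^2$; the exact scaling used in the letter is not recoverable from the text, and the function names and the toy number of agents are hypothetical. The centralized solution is obtained from the normal equations, solved in closed form for $n = 2$.

```python
import random

def make_local_dataset(num_samples, rng):
    """b_i samples with o in R^2 and t in R, all entries drawn from U(0, 1)."""
    return [((rng.random(), rng.random()), rng.random()) for _ in range(num_samples)]

def local_cost(x, dataset):
    """Assumed least-squares cost: 0.5 * sum_j (o_j^T x - t_j)^2."""
    return 0.5 * sum((o[0] * x[0] + o[1] * x[1] - t) ** 2 for o, t in dataset)

def local_gradient(x, dataset):
    """Gradient of the assumed local cost."""
    g0 = g1 = 0.0
    for o, t in dataset:
        residual = o[0] * x[0] + o[1] * x[1] - t
        g0 += residual * o[0]
        g1 += residual * o[1]
    return (g0, g1)

def centralized_solution(datasets):
    """Solve (sum o o^T) x = sum t o over all agents via the 2x2 inverse."""
    a11 = a12 = a22 = b1 = b2 = 0.0
    for dataset in datasets:
        for o, t in dataset:
            a11 += o[0] * o[0]
            a12 += o[0] * o[1]
            a22 += o[1] * o[1]
            b1 += t * o[0]
            b2 += t * o[1]
    det = a11 * a22 - a12 * a12
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

rng = random.Random(0)
datasets = [make_local_dataset(30, rng) for _ in range(20)]  # 20 agents, b_i = 30
x_star = centralized_solution(datasets)
grads = [local_gradient(x_star, d) for d in datasets]
total_grad = (sum(g[0] for g in grads), sum(g[1] for g in grads))
```

At `x_star` the local gradients sum to zero up to rounding, which is the optimality condition of problem (1) for this cost; a reference point of this kind is what an accuracy measure relative to $x^*$ needs.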
In Fig. 3, we present the impact of $M$, the number of active random walks, on the convergence behavior. With increasing $M$, both PW-ADMM and IPW-ADMM require a larger communication cost to achieve the same accuracy, while the running time is shortened. Hence, there exists a trade-off between the communication cost and the running time over $M$. Besides, for a larger $M$, e.g., $M = 90$, the accuracy gap between PW-ADMM and IPW-ADMM shrinks compared with the case where $M = 10$. This shows that the advantage of intelligently choosing the update path of each walk according to how often agents have been updated, over random transitions, is weakened when more random walks are active.

B. Decentralized logistic regression problem
In the decentralized logistic regression, agent $i$ holds a logistic local loss function with labels $t_{i,j} \in \{-1, 1\}$ and $b_i = 30$ samples. Each sample feature $o_{i,j}$ follows $\mathcal{N}(0, I)$. To generate $t_{i,j}$, we first generate a random vector $x_0 \in \mathbb{R}^2 \sim \mathcal{N}(0, I)$. Then, for each sample, we generate $v_{i,j}$ according to $\mathcal{U}(0, 1)$, and if $v_{i,j} \leq (1 + \exp(-x_0^T o_{i,j}))^{-1}$, we set $t_{i,j}$ to 1, otherwise to $-1$. Since it is difficult to solve the optimization problem exactly, e.g., (9a) in PW-ADMM, we instead use a first-order approximation [...]. For fairness, we also adopt the first-order approximation for IPW-ADMM, W-ADMM, D-ADMM and COCA. Fig. 4 presents the accuracy over communication cost and running time. Compared with the other benchmarks in [6], [7], [9], [13], [14], only the proposed parallel random walk algorithms PW-ADMM and IPW-ADMM can guarantee both
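The label generation procedure of Section III-B can be sketched as follows, together with the gradient that a first-order approximation would rely on. The loss is assumed to take the standard form $\sum_j \log(1 + \exp(-t_{i,j}\, o_{i,j}^T x))$; the letter's exact expression is not recoverable from the text, and the function names are hypothetical.

```python
import math
import random

def generate_logistic_samples(num_samples, x0, rng):
    """Draw o ~ N(0, I) in R^2 and set t = 1 if v <= sigmoid(x0^T o), else -1."""
    samples = []
    for _ in range(num_samples):
        o = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        prob = 1.0 / (1.0 + math.exp(-(x0[0] * o[0] + x0[1] * o[1])))
        v = rng.random()  # v ~ U(0, 1)
        samples.append((o, 1 if v <= prob else -1))
    return samples

def logistic_loss(x, samples):
    """Assumed standard logistic loss: sum_j log(1 + exp(-t_j * o_j^T x))."""
    return sum(math.log1p(math.exp(-t * (o[0] * x[0] + o[1] * x[1])))
               for o, t in samples)

def logistic_gradient(x, samples):
    """Gradient of the assumed loss: sum_j -t_j * o_j / (1 + exp(t_j * o_j^T x))."""
    g0 = g1 = 0.0
    for o, t in samples:
        margin = t * (o[0] * x[0] + o[1] * x[1])
        coeff = -t / (1.0 + math.exp(margin))
        g0 += coeff * o[0]
        g1 += coeff * o[1]
    return (g0, g1)

rng = random.Random(0)
x0 = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))  # ground-truth vector x_0 ~ N(0, I)
samples = generate_logistic_samples(30, x0, rng)  # b_i = 30
grad = logistic_gradient((0.2, -0.1), samples)
```

A first-order scheme replaces the exact local minimization with a step that uses `logistic_gradient` at the current iterate, which avoids solving a nonlinear subproblem at every visit.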

Fig. 1. (a) An example of W-ADMM; (b) the equivalent architecture of W-ADMM; (c) the equivalent architecture of parallel random walk ADMM algorithms.