Bivariate Hermitian Polynomial Coding for Efficient Distributed Matrix Multiplication

Coded distributed computing is an effective framework to improve the speed of distributed computing systems by mitigating stragglers (temporarily slow workers). In essence, coded computing allows the computation assigned to a straggling worker to be replaced by that of a faster worker through redundant computations. Coded computing techniques proposed so far are mostly based on univariate polynomial coding. These codes are not very effective if the storage and computation capacities across workers are heterogeneous, and they completely lose the work done by the straggling workers. For the particular problem of distributed matrix-matrix multiplication, we show how bivariate polynomial coding addresses these two issues.


I. INTRODUCTION
Matrix multiplication has a major role in many machine learning and signal processing tasks. The matrices to be multiplied can be very sizable in some tasks, such as training machine learning models over large data sets or running inference over large parametric models. Multiplication of massive matrices, especially under latency constraints, requires a substantial amount of computational resources, which are usually not available at a single machine. This motivates distributing the matrix multiplication task to external workers. However, due to unpredictable delays in their service times, some workers, called stragglers, may complete their assigned tasks much more slowly than others, leading to serious delays. Thus, combating the adverse effects of stragglers is an important problem in distributed computing.
To address this problem, one needs to provide some form of computation redundancy to the system, e.g., by replicating tasks at other available workers. Recent results have demonstrated that error-correcting codes can be applied to this setting to obtain the needed computation redundancy very efficiently. In this context, polynomial codes are proposed in [1]. These codes are optimal in terms of the download communication cost, i.e., the number of bits that the workers need to send to communicate their results. MatDot codes are proposed in [2]. They require more computation and communication resources at the workers, but they achieve a smaller recovery threshold, which is defined as the minimum number of non-straggling workers required for completing the computation task. PolyDot codes [2] and entangled polynomial codes [3] provide a tradeoff between the recovery threshold and the communication-computation costs. However, common to all these approaches is completely ignoring any computations completed by straggler nodes, referred to as "under-utilization" in [4]. Specifically, the work assigned to a worker is communicated back to the master only if it is finished completely. This is sub-optimal, especially if the workers' speeds are close to each other, in which case the ignored workers have probably completed a significant portion of the work assigned to them [5], [6]. To exploit the partially completed work done by stragglers, a multi-message approach is considered in [4], [6], [7], where workers' tasks are divided into smaller subtasks, and the result of each subtask is communicated to the master as soon as it is completed. The multi-message approach is applied to matrix-matrix multiplication in [7] using product codes [8].
In all of the aforementioned works, univariate polynomials are used. In [9], we show that for univariate polynomial coding, dividing a task into subtasks by a given factor reduces the computation capacity of the workers by the same factor. In the case of limited storage at the workers, this approach is not efficient in terms of memory exploitation and upload costs. The product codes proposed in [7], which are basically a combination of two univariate polynomial codes, partially address this issue. However, for product codes, the computations at the workers are no longer one-to-any replaceable. This results in poor performance in worst-case scenarios. Moreover, univariate polynomial codes, as well as product codes, require homogeneous workloads across workers.
In this work, we propose bivariate Hermitian polynomial coding for distributed matrix-matrix multiplication. The proposed scheme allows us to exploit stragglers in highly heterogeneous settings and to reduce the upload communication cost, while guaranteeing (almost surely) that subtasks at workers are one-to-any replaceable. In contrast to univariate coding, the computation capacity of the workers is not affected by dividing tasks into subtasks. However, decodability, which requires bivariate polynomial interpolation, is highly dependent on the order in which the computation subtasks are completed at the workers. In [9], [10], we proved almost-sure decodability for specific storage configurations at the workers. Here, we significantly extend those results by relaxing the constraints on the stored matrices at the workers. We also show numerically that the new approach outperforms the univariate polynomial codes in [1], the product codes in [7], and our previous schemes in [9], [10] in terms of average computation time.
We organize the paper as follows. In Section II, we present the system model and the problem formulation. We introduce and discuss our coding scheme in Section III. In Section IV, we compare our scheme with previously proposed schemes, and in Section V, we conclude the paper. We give the proof of Theorem 1 in the Appendix.

II. SYSTEM MODEL AND PROBLEM FORMULATION
We study the problem of distributed multiplication of two matrices A ∈ R^{r×s} and B ∈ R^{s×c}, r, s, c ∈ Z^+. A master, with access to both matrices, is interested in multiplying them by offloading partial computations to N workers with, possibly, heterogeneous data storage and computation capacities. To distribute the multiplication task, the master partitions matrix A horizontally into K submatrices A_1, ..., A_K, and B vertically into L submatrices B_1, ..., B_L. Worker i stores m_{A,i} coded matrices and m_{B,i} coded matrices, which are encoded by linearly combining the partitions of A and B, respectively.
Every worker multiplies its coded sub-matrices in a special order, to be specified later, and communicates the results one-by-one to the master as soon as each computation is completed. As soon as the minimum number of computations necessary to decode AB are collected by the master, it starts decoding.
The maximum number of (potentially useful) results that can be sent to the master from worker i, without updating its local storage, is referred to as the maximum computation capacity, and denoted by η_i. For univariate schemes with task partitioning [1]-[3], η_i is limited to min(m_{A,i}, m_{B,i}), whereas we will show that the proposed bivariate scheme achieves η_i = m_{A,i} m_{B,i}. A higher maximum computation capacity is key to minimizing the expected computation time, as it allows workers to provide more computations with the same storage capacity.
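As a toy illustration of the storage model and of the computation-capacity gap between the two families of schemes, consider the following minimal sketch (all dimensions and variable names are hypothetical choices for illustration):

```python
import numpy as np

# Hypothetical dimensions: A is r x s, B is s x c, partitioned into K and L blocks.
r, s, c = 8, 6, 8
K, L = 4, 4

rng = np.random.default_rng(0)
A = rng.standard_normal((r, s))
B = rng.standard_normal((s, c))

# A is partitioned horizontally into K row blocks, B vertically into L column blocks.
A_parts = np.split(A, K, axis=0)   # each block is (r/K) x s
B_parts = np.split(B, L, axis=1)   # each block is s x (c/L)

# A worker storing m_A coded blocks of A and m_B coded blocks of B can return
# m_A * m_B distinct products under the bivariate scheme, but only
# min(m_A, m_B) under univariate schemes with task partitioning.
m_A, m_B = 3, 2
eta_bivariate = m_A * m_B       # 6
eta_univariate = min(m_A, m_B)  # 2
```

With the same storage footprint (3 + 2 coded blocks), the bivariate scheme triples the number of useful partial results this worker can contribute.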
III. PROPOSED SCHEME

A. Encoding

The master partitions the matrices as described in Section II. Coded matrices are generated by evaluating the derivatives of the following univariate polynomials:

A(x) = Σ_{k=1}^{K} A_k x^{k−1},  B(y) = Σ_{l=1}^{L} B_l y^{l−1}.

Specifically, for worker i, all the derivatives of A(x) up to order m_{A,i} − 1 are evaluated at some x_i ∈ R, and, similarly, all the derivatives of B(y) up to order m_{B,i} − 1 are evaluated at some y_i ∈ R. We require the evaluation points to be distinct across the workers. The master computes these evaluations and sends them to the i-th worker. For brevity, in the remainder of the paper, we denote the sets of coded matrices stored at worker i by A_i and B_i, and write (A_i)_k and (B_i)_l for the k-th and l-th order derivative evaluations they contain.
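The encoding step can be sketched as follows, assuming the encoding polynomials A(x) = Σ_{k=1}^{K} A_k x^{k−1} and B(y) = Σ_{l=1}^{L} B_l y^{l−1}; the matrix sizes, evaluation points, and function names are our own illustrative choices:

```python
import numpy as np

def poly_derivative_eval(parts, x, order):
    """Evaluate the order-th derivative of P(z) = sum_k parts[k] * z**k at z = x."""
    total = np.zeros_like(parts[0], dtype=float)
    for k, P_k in enumerate(parts):
        if k >= order:
            # d^order/dz^order of z^k is k (k-1) ... (k-order+1) z^(k-order)
            coeff = 1.0
            for j in range(k - order + 1, k + 1):
                coeff *= j
            total = total + coeff * (x ** (k - order)) * P_k
    return total

# Worker i stores all derivatives of A(x) up to order m_A - 1 evaluated at x_i,
# and all derivatives of B(y) up to order m_B - 1 evaluated at y_i.
K, L, m_A, m_B = 4, 4, 3, 2
rng = np.random.default_rng(1)
A_parts = [rng.standard_normal((2, 3)) for _ in range(K)]
B_parts = [rng.standard_normal((3, 2)) for _ in range(L)]
x_i, y_i = 0.7, -1.3
A_coded = [poly_derivative_eval(A_parts, x_i, k) for k in range(m_A)]
B_coded = [poly_derivative_eval(B_parts, y_i, l) for l in range(m_B)]
```

Each coded block is a linear combination of the partitions of A (or B), with coefficients given by derivatives of the monomials at the worker's evaluation point.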

B. Computation
Worker i computes the Cartesian product of the sets A_i and B_i, i.e., it multiplies all the elements in A_i with all the elements in B_i, one by one, in an order referred to as the zig-zag order, and sends the result of each individual computation to the master as soon as it is completed.

Definition 2. A set of derivatives of A(x)B(y) evaluated at the same point (x, y) constitutes a node.

Thus, the computations from each worker constitute a node. The set of derivatives of a node can be depicted in the interpolation space. We provide an example of such a depiction of two nodes in the Appendix. Hereafter, we refer to a node by its evaluation point, i.e., node (x, y) is the node with evaluation point (x, y). We refer to the evaluations forming the node as the derivative set of the node.
Partial computations at each worker are done in increasing order of their priority scores. The priority score of partial computation (A_i)_k (B_i)_l at worker i is denoted by S_i(k, l).

Definition 3. In the zig-zag computation order, the interpolation space is divided row-wise into L/m_B equal horizontal blocks of consecutive rows; within each block, all derivative orders l belonging to the block are computed for a given k before moving to k + 1, and the blocks are processed in increasing order, i.e., S_i(k, l) = ⌊l/m_B⌋ m_{A,i} m_B + k m_B + (l mod m_B) for all the workers.

For a visualization of the zig-zag order in the 2-D interpolation space, see Fig. 1. Observe that a completed partial computation (A_i)_k (B_i)_l at the i-th worker can be mapped to the derivative order (k, l) in the 2-D interpolation space.
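One consistent instantiation of the zig-zag priority order can be sketched as follows (the function name and parameters are ours; it enumerates derivative orders block by block, column-first within each block):

```python
def zigzag_order(m_A_i, m_B_i, m_B):
    """Enumerate derivative orders (k, l) in zig-zag priority: the l-axis is
    split into horizontal blocks of m_B consecutive orders; within a block,
    all l of the block are computed for a given k before moving to k + 1,
    and blocks are processed in increasing order."""
    order = []
    for block in range(m_B_i // m_B):
        for k in range(m_A_i):
            for l in range(block * m_B, (block + 1) * m_B):
                order.append((k, l))
    return order

# Example: m_A,i = 3, m_B,i = 4, common factor m_B = 2.
schedule = zigzag_order(3, 4, 2)
```

For this example the worker completes the lower 3 x 2 block of derivative orders before starting the upper one, which is exactly the property the decodability analysis relies on.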

C. Decoding
The master receives results from the workers, and decodes AB by solving a bivariate Hermite interpolation problem; the polynomial to be interpolated is A(x)B(y). However, unlike in univariate polynomial interpolation, there is no guarantee that any KL evaluations received from the workers define a unique polynomial. We show, in the next theorem, that if the workers follow the computation order defined in Section III-B, the number of evaluations needed is not more than KL + max{0, (m_B − 2)(L/m_B − 1)}.

Theorem 1. If the workers follow the zig-zag computation order, then, almost surely, the master can decode AB after receiving at most KL + max{0, (m_B − 2)(L/m_B − 1)} partial computations.

Remark 1. A symmetric variant of the scheme is also possible. In that case, the computation order at the workers needs to follow a zig-zag type order in which the interpolation space is partitioned into vertical blocks and the workers go first in the right-horizontal direction. The number of computations that need to be received from the workers in that case is not more than KL + max{0, (m_A − 2)(K/m_A − 1)}. Due to space restrictions, we do not include this approach in this work, but the extension is straightforward. For a discussion on the choice between these two approaches, we refer to [9].

Remark 2. We emphasize that by setting m_B = L or m_B = 1, the zig-zag order reduces to the vertical or horizontal orders considered in [9], respectively. Similarly, in the horizontal extension of the zig-zag order, setting m_A = K or m_A = 1 reduces to the horizontal and vertical orders in [9], respectively.

Remark 3. Instead of Hermite interpolation, we could also consider bivariate interpolation coding as in [9], [10], in which a worker multiplies the evaluations of A(x) and B(y) at different evaluation points instead of different derivatives evaluated at the same point. As shown in [9], to prove the almost regularity of such bivariate interpolation coding, we first need to transform it into a bivariate Hermitian interpolation problem, which is possible under an almost regularity condition. Here, we directly describe the Hermitian coding that results from such a transformation.
Bivariate polynomial coding improves the upload communication cost C_u, defined as the minimum number of bits the master needs to upload to the workers to recover AB. Suppose a bits are required to represent a row or column of the matrices. In univariate polynomial coding, each subtask carries an upload cost of a(r/K + c/L) bits, and since the full task requires KL subtasks, C_u = a(rL + cK). In bivariate polynomial coding, however, each evaluation of A(x) and B(y) may contribute to up to m_B and m_A subtasks, respectively, assuming homogeneous storage capacity. Thus, each subtask has an upload communication cost of a(r/(K m_B) + c/(L m_A)), resulting in C_u = a(rL/m_B + cK/m_A).

In many works in the literature, the storage capacity at the workers is assumed to be specified separately for each of the two matrices. In practice, however, it is more reasonable to assume a total storage capacity at each worker, which can be freely allocated between the partitions of the two matrices. Assume that the rows of A and the columns of B require the same storage at the workers, and let m_i ∈ N^+ be the total number of rows/columns the i-th worker can store. The computation time improves with a higher maximum computation capacity η_i and fewer discarded computations. In bivariate coding schemes, the computation capacity of a worker is given by η_i = m_{A,i} m_{B,i}. Accordingly, for given K and L, we choose m_{A,i} and m_{B,i} to maximize m_{A,i} m_{B,i} subject to the storage constraint m_{A,i} r/K + m_{B,i} c/L ≤ m_i.

In the scheme we propose, in terms of memory allocation, we significantly improve upon [9], [10], especially in low-memory regimes, and come close to the optimum. We restrict m_{B,i} to be a multiple of a common factor m_B with m_B | L; we do not impose any conditions on m_{A,i} if m_{B,i} = m_B, and we set m_{A,i} = K otherwise. Moreover, we allow heterogeneous storage capacities across the workers. In a worst-case situation, we may need to ignore up to (m_B − 2)(L/m_B − 1) computations.
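The memory-allocation rule and the upload costs discussed above can be sketched as a brute-force search (a minimal sketch under the stated assumption that each coded partition of A occupies r/K stored rows and each coded partition of B occupies c/L stored columns; the function names are ours):

```python
def best_allocation(m_i, K, L, r, c):
    """Maximize the computation capacity m_A * m_B subject to the storage
    constraint m_A * (r / K) + m_B * (c / L) <= m_i, where m_i is the total
    number of rows/columns worker i can store."""
    best = (1, 1)
    for m_A in range(1, K + 1):
        for m_B in range(1, L + 1):
            if m_A * (r / K) + m_B * (c / L) <= m_i:
                if m_A * m_B > best[0] * best[1]:
                    best = (m_A, m_B)
    return best

def upload_cost_univariate(a, r, c, K, L):
    # KL subtasks, each costing a(r/K + c/L) bits.
    return a * (r * L + c * K)

def upload_cost_bivariate(a, r, c, K, L, m_A, m_B):
    # Each evaluation is reused across up to m_B (resp. m_A) subtasks.
    return a * (r * L / m_B + c * K / m_A)
```

For instance, with K = L = 10, r = c = 10, and a budget of m_i = 10 rows/columns, the search returns the balanced allocation (5, 5), and the bivariate upload cost is a factor m_A = m_B = 5 smaller than the univariate one.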
The bivariate extension of [7], which is presented in [9], [10], achieves the optimal memory allocation, as there are no additional constraints on m_{A,i} or m_{B,i}. However, it works only in the case of homogeneous storage capacity at the workers. Moreover, in the worst case, there are (n_A m_{B,i} − L)(K − 1) + (n_B m_{A,i} − K)(L − 1) useless computations, where n_A and n_B are design parameters such that N = n_A n_B. This is far larger than the number of computations that may be discarded by our scheme; this quantity can be significant and increases the average computation time.
On the other hand, in [9], [10], by forcing a vertical or a horizontal computation order, every computation is made useful. Moreover, these schemes allow heterogeneous storage capacities across the workers. However, the horizontal computation order requires m_{A,i} = K or m_{B,i} = 1, and the vertical computation order requires m_{B,i} = L or m_{A,i} = 1. These constraints incur a loss of optimality in the memory allocation between the partitions of A and B, which may result in a considerable increase in the average computation time in low-memory regimes. However, not discarding any computation dominates if the storage at the workers is sufficiently large, and then these schemes highly improve the average computation time.
IV. NUMERICAL RESULTS

We compare the expected computation times of the proposed scheme, the scheme in [9], [10], the bivariate extension of the scheme in [7], which is presented in [9], [10], and the univariate polynomial coding scheme in [1] with partial computations. We assume that all the workers start their computations simultaneously and define the computation time as the time required to obtain sufficiently many responses from the workers to interpolate A(x)B(y). We focus only on the computation time since the bivariate polynomial to be interpolated has the same number of coefficients in all three schemes, and thus the encoding and decoding complexities are very similar. We assume the communication time is negligible. We model the computation speed of the workers by the shifted exponential model [1], which is commonly used in the literature to analyze coded computation schemes. In this model, the probability that a worker finishes at least p computations by time t is F(p, t) = 1 − e^{−λ(t/p − ν)} if t ≥ pν, and 0 otherwise. Thus, P(p, t) ≜ F(p, t) − F(p + 1, t) gives the probability that a worker completes exactly p computations by time t, where F(0, t) = 1 and F(p_max + 1, t) = 0, with p_max = η_i = η, ∀i ∈ [N]. Here, ν is the minimum duration in which a worker completes a partial computation, and a smaller λ means a higher variance, and thus more heterogeneous computation speeds across the workers. To cover more heterogeneous cases, we choose ν = 0.01 and λ = 0.1. Note that we use the same parameters for every worker. This means that, on average, the workers behave similarly, but in any given experiment their speeds can differ considerably.
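The shifted exponential model can be transcribed directly into code as follows (a minimal sketch; the parameter defaults match the values used in the experiments):

```python
import math

def F(p, t, lam=0.1, nu=0.01):
    """Probability that a worker finishes at least p computations by time t."""
    if p == 0:
        return 1.0
    return 1.0 - math.exp(-lam * (t / p - nu)) if t >= p * nu else 0.0

def P(p, t, p_max, lam=0.1, nu=0.01):
    """Probability that a worker completes exactly p computations by time t;
    F(p_max + 1, t) is taken to be 0, so P(p_max, t) = F(p_max, t)."""
    upper = 0.0 if p >= p_max else F(p + 1, t, lam, nu)
    return F(p, t, lam, nu) - upper
```

Since the P(p, t) values telescope, they sum to F(0, t) = 1 over p = 0, ..., p_max, i.e., they form a valid probability distribution over the number of completed computations.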
We run Monte Carlo simulations to find the expected computation times of the schemes for different memory sizes at the workers. We assume the sizes of the partitions of A and B are equal. We further assume that the workers have the same storage capacity, i.e., the homogeneous case, as this is required by the scheme proposed in [7]. We partition both matrices into K = L = 10 pieces, and employ N = 15 workers. For each memory value, we run 10^4 experiments and take the average. The results are presented in Fig. 2. For each scheme, the minimum memory required to complete KL = 100 computations with N = 15 workers is different. Thus, we plot each scheme starting from a different minimum memory value.
We observe that the proposed scheme results in a much lower expected computation time than the previously proposed schemes for low storage capacities. Even though we allow partial computations, the univariate polynomial coding scheme in [1] performs far worse than all the others due to its inefficient use of memory, which results in a very limited computation capacity. Despite the optimality in the memory allocation between m_{A,i} and m_{B,i}, we see that the high number of useless computations aggravates the average computation time of the bivariate extension of [7]. For this code, increasing the storage capacity does not improve the average computation time beyond a certain point. Note that, while simulating [7], we use a random computation order at the workers, which is reported to perform well in [7]. That is, instead of assuming the worst case, for every computation received by the master, we decide whether it is useful or not. On the other hand, for our proposed scheme, we assume the worst case, i.e., we always ignore (m_B − 2)(L/m_B − 1) computations. Hence, we expect the improvement to be even larger than what we observe in Fig. 2. We also observe that the scheme in [9], [10], also a bivariate polynomial coding scheme, performs well for intermediate and large memory values. However, when the memory is limited, there is significant performance degradation. The proposed scheme solves this problem for small memory values and demonstrates the effectiveness of bivariate polynomial coding. To better illustrate this, we also plot a lower bound on the average computation time of any bivariate polynomial-based coding scheme, assuming no computation is ever wasted and the memory allocation between m_{A,i} and m_{B,i} is optimal. Observe that the average computation time of our scheme is quite close to this lower bound.
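A simplified version of this Monte Carlo experiment can be sketched as follows. It ignores the few discarded computations and the decodability order, drawing each worker's speed once, which reproduces the shifted-exponential F(p, t); all names and the capacity value η = 9 are our own illustrative choices:

```python
import random

def sample_completion_time(N, needed, eta, lam=0.1, nu=0.01, rng=None):
    """One Monte Carlo sample of the time until `needed` partial results
    arrive, with N workers each capped at eta computations. Worker i's p-th
    computation completes at time p * (nu + X_i) with X_i ~ Exp(lam), which
    matches F(p, t) = 1 - exp(-lam (t/p - nu))."""
    rng = rng or random
    finish_times = []
    for _ in range(N):
        per_comp = nu + rng.expovariate(lam)
        finish_times.extend(p * per_comp for p in range(1, eta + 1))
    finish_times.sort()
    return finish_times[needed - 1]

# Averaging over many trials approximates the expected computation time for
# K = L = 10 (needed = KL = 100) and N = 15 workers.
rng = random.Random(42)
avg = sum(sample_completion_time(15, 100, 9, rng=rng) for _ in range(1000)) / 1000
```

Sweeping η (i.e., the storage per worker) in this sketch reproduces the qualitative trend of Fig. 2: larger computation capacities per worker sharply reduce the time until the fastest 100 results arrive.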
V. CONCLUSION

We proposed a bivariate Hermitian polynomial coding strategy for the memory-efficient exploitation of stragglers in distributed matrix-matrix multiplication. The non-trivial problem in bivariate polynomial interpolation is showing the invertibility of the interpolation matrix. We showed that if the workers follow a predetermined order for their partial computations, which we refer to as the zig-zag order, and we allow for a few additional redundant computations, then decodability is guaranteed almost surely. The proposed strategy's ability to exploit the workers' computation capacity is close to optimal for highly heterogeneous storage and computation capacities at the workers, while keeping the number of wasted computations relatively small. We numerically showed that, in terms of the average computation time, our scheme outperforms its rivals in the literature.
APPENDIX: PROOF OF THEOREM 1

We first introduce some useful concepts and results.
Definition 4. The polynomial interpolation problem can be formulated as a system of linear equations whose unknowns are the coefficients of A(x)B(y), namely A_p B_q, where p ∈ [K], q ∈ [L]. As such, it has a unique solution if and only if its coefficient matrix, referred to here as the interpolation matrix, is invertible. We denote the interpolation matrix associated with a set of nodes Z by M(Z). Since A(x)B(y) has KL coefficients, the interpolation matrix is a square matrix of size KL × KL.
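Definition 4 can be made concrete for a toy case (a sketch; we take K = L = 2, so A(x)B(y) has the four monomial coefficients of x^p y^q with p, q ∈ {0, 1}, and a single node whose derivative set covers the whole interpolation space):

```python
import numpy as np
from math import factorial

def monomial_deriv(p, q, k, l, x, y):
    """Evaluate the mixed derivative d^{k+l}/(dx^k dy^l) of x^p y^q at (x, y)."""
    if k > p or l > q:
        return 0.0
    cp = factorial(p) // factorial(p - k)
    cq = factorial(q) // factorial(q - l)
    return cp * cq * x ** (p - k) * y ** (q - l)

def interpolation_matrix(nodes, K, L):
    """One row per received derivative order (k, l) of each node (x, y, derivs);
    one column per unknown coefficient A_p B_q, p < K, q < L."""
    rows = []
    for x, y, derivs in nodes:
        for k, l in derivs:
            rows.append([monomial_deriv(p, q, k, l, x, y)
                         for p in range(K) for q in range(L)])
    return np.array(rows)

# A single node with the full 2 x 2 derivative set yields a 4 x 4 matrix that
# is invertible for any evaluation point, as in classical Hermite interpolation.
M = interpolation_matrix([(0.3, -1.1, [(0, 0), (0, 1), (1, 0), (1, 1)])], 2, 2)
```

For this particular node the matrix is triangular with unit diagonal, so det M = 1; the difficulty addressed in the remainder of the Appendix is that, when the KL rows come from several nodes with partial derivative sets, invertibility holds only almost surely.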
Definition 5. Given the interpolation matrix M, the Taylor series expansion of det(M) with respect to the evaluation point (x_j, y_j) around (x_i, y_i) is given by

det(M(x_j, y_j)) = Σ_{α_1, α_2 ≥ 0} D_{α_1,α_2}(x_i, y_i) (x_j − x_i)^{α_1} (y_j − y_i)^{α_2}.  (1)

We refer to node (x_i, y_i) as the pivot node and node (x_j, y_j) as the variable node.
Since the bivariate monomials (x_j − x_i)^{α_1} (y_j − y_i)^{α_2} are linearly independent for different values of α_1 and α_2, the determinant is zero for all (x_j, y_j) only if D_{α_1,α_2}(x_i, y_i) = 0, ∀α_1, α_2. For specific values of x_i, y_i, x_j, y_j, there might be cases in which the determinant is zero even if D_{α_1,α_2}(x_i, y_i) ≠ 0 for some α_1, α_2, i.e., at the zeros of the polynomial in the Taylor expansion of det(M(x_j, y_j)) in (1). However, the Lebesgue measure of the zero set of a multivariate polynomial on R^2 is 0 if the polynomial is not identically zero. Thus, if we can prove that D_{α_1,α_2}(x_i, y_i) ≠ 0 for some α_1, α_2 and almost all choices of the evaluation points (x_k, y_k), ∀k ∈ [N], then M(Z) is invertible almost surely. This relaxed invertibility notion is known as almost regularity [11], and our objective is to prove the almost regularity of M(Z).
Note that, by definition, D α1,α2 (x i , y i ) does not depend on the variable node's evaluation point (x j , y j ) anymore. This procedure of eliminating the variable node's evaluation point is known as coalescence of a pivot and a variable node.
In Lemma 1, stated below, we compute the Taylor coefficients D_{α_1,α_2}(x_i, y_i) in terms of derivatives of det(M(x_j, y_j)). First, however, we need to introduce the concept of a regular shift.

Definition 6. Let M(x_j, y_j) ∈ R^{n×n} be an interpolation matrix as a function of the evaluation point of node (x_j, y_j), and denote by r_i its i-th row. We define ∂_{i,k,l} M(x_j, y_j) as the matrix obtained from M(x_j, y_j) by taking the k-th order derivative with respect to x_j and the l-th order derivative with respect to y_j of the i-th row, i.e., by replacing r_i with ∂^{k+l} r_i / (∂x_j^k ∂y_j^l). We also refer to the operator ∂_{i,k,l} as a shift in the i-th row of M(x_j, y_j). We say that the shift is simple in the x direction if (k, l) = (1, 0) and simple in the y direction if (k, l) = (0, 1).
Notice that only the rows of M associated with node (x_j, y_j) actually depend on (x_j, y_j). Let the row r_i correspond to the derivative ∂^{k_i} A(x_j) ∂^{l_i} B(y_j) of node (x_j, y_j); then the derivative set of node (x_j, y_j) in ∂_{i,k,l} M(x_j, y_j) contains ∂^{k_i + k} A(x_j) ∂^{l_i + l} B(y_j).
Definition 7. Given a pair of length-n vectors (k, l) with non-negative integer entries k(i), l(i), we define the shift ∇_{k,l} as the composition of the row-wise shifts applied to every row, i.e., ∇_{k,l} M(x_j, y_j) ≜ ∂_{n,k(n),l(n)} ··· ∂_{1,k(1),l(1)} M(x_j, y_j). Thus, the i-th entries of k and l denote the orders of the derivatives taken on the i-th row of M with respect to x_j and y_j, respectively. Let α_1 = Σ_{i=1}^{n} k(i) and α_2 = Σ_{i=1}^{n} l(i). We say that (α_1, α_2) is the order of the shift.

Definition 8. We say that a shift (k, l) is regular if it can be achieved by at least one sequence of simple shifts such that, after each simple shift, the shifted matrix contains neither two equal rows nor a zero row. We denote the number of such sequences of simple shifts for the regular shift (k, l) by C_{k,l}, and the set of all regular shifts of order (α_1, α_2) by R(α_1, α_2).
Lemma 1. The Taylor coefficients in (1) satisfy

D_{α_1,α_2}(x_i, y_i) = Σ_{(k,l) ∈ R(α_1,α_2)} C_{k,l} det( ∇_{k,l} M(x_j, y_j) |_{x_j = x_i, y_j = y_i} ).  (2)

Definition 9. A shift order (α_1, α_2) is called unique if R(α_1, α_2) contains a single shift (k, l); in that case, we also call the shift (k, l) unique.

Let us move back to the problem of proving D_{α_1,α_2}(x_i, y_i) ≠ 0. If a unique shift (k, l) exists, then the problem reduces to proving that ∇_{k,l} M(x_j, y_j)|_{x_j = x_i, y_j = y_i} is invertible. To that end, we repeat the procedure.
Definition 10. Coalescence procedure on a fixed pivot. Choose a pivot node and consider a sequence of coalescences in which, after each coalescence, the resultant coalesced node is used as the pivot for the next coalescence in the sequence, until all other nodes are coalesced into one whose derivative set covers the full interpolation space.
Following the procedure in Definition 10, if in every intermediate step we can find a unique shift, then M is nonsingular for almost all choices of the evaluation points. Unfortunately, in our coding scheme, finding a unique shift in every step may not be possible. Next, we give a relaxation of the notion of a unique shift, which is useful for proving the almost regularity of our scheme.
Definition 11. A shift order (α_1, α_2) is called quasi-unique if the RHS of (2) evaluated at (x_j, y_j) = (x_i, y_i) can be written as the sum of the determinant of one main interpolation matrix ∇_{k_0,l_0} M(x_j, y_j)|_{x_j = x_i, y_j = y_i}, for which the derivative set of node (x_i, y_i) obeys the zig-zag order, and, possibly, the determinants of some other secondary interpolation matrices, for which the derivative set associated with node (x_i, y_i) does not obey the zig-zag order. We call the shift (k_0, l_0) quasi-unique.
In the next lemma, we extend the successive coalescence procedure from unique shifts to quasi-unique shifts.

Lemma 2. If, in every step of the coalescence procedure on a fixed pivot, we can find a quasi-unique shift whose main interpolation matrix is the one resulting from the previous quasi-unique shift, then the last coalescence returns only one interpolation matrix, defined by one node covering the complete interpolation space, and thus the associated interpolation matrix is invertible.
The proof of Lemma 2 will be included in a longer version of this paper. Now, the problem reduces to showing whether we can always find a series of quasi-unique shifts to coalesce all the nodes into one under the conditions of Theorem 1.

Definition 12. Recall that, in the zig-zag order, we divide the interpolation space into L/m_B consecutive equal horizontal blocks. In a coalescence procedure, consider the occupation of the derivative sets of the variable and pivot nodes within the horizontally partitioned interpolation space. We denote by z the number of elements in the uppermost partially filled horizontal block of the variable node's derivative set, and by e the number of empty spaces in the uppermost partially filled horizontal block of the interpolation space once filled with the derivative set of the pivot node.

Example 1. Let us assume K = 5, L = 6, and m_B = 3. Given that they follow the zig-zag order, a variable node having 20 computations and a pivot node having 9 computations are depicted in the interpolation space in Fig. 3a and Fig. 3b, respectively. We represent all possible 2-D derivative orders with circles; if a derivative order is present in the node's derivative set, the circle is filled. In the figure, we can observe that the interpolation space is divided into two horizontal blocks. According to Definition 12, z = 5 and e = 6. These elements are shown within a dashed-lined polygon in the corresponding figures.

Lemma 3. Given a variable node and a pivot node, both zig-zag ordered, if at least one of the following conditions is satisfied:
(a) z ≤ e;
(b) z > e and the uppermost block of the variable node consists only of fully occupied columns;
(c) z > e and the number of elements in the rightmost partially occupied column in the uppermost block of the variable node is equal to the number of empty spaces in the leftmost partially occupied column in the uppermost block of the pivot node;
then there exists a quasi-unique shift for the coalescence of these nodes.
The proof of Lemma 3 will be included in a longer version of this paper. Lemma 3 lists the situations in which a quasi-unique shift exists. If we are in a situation not covered by Lemma 3, we can always remove elements from the derivative set of the variable node, simply ignoring the associated partial computations at the master, until the conditions of Lemma 3 are satisfied. This is possible since we are applying a sequence of coalescences on a fixed pivot node.
To obtain Theorem 1 from Lemma 3, we need to compute the number of computations that must be ignored along the successive coalescence procedure in a worst-case situation. Observe that if z > e, the resultant pivot node occupies the next horizontal block. Given that there are L/m_B horizontal blocks, the number of coalescences for which z > e is at most L/m_B − 1. For this to happen, all of the nodes from the workers must have strictly fewer than K m_B elements, and therefore more than L/m_B workers are needed. Each such coalescence requires ignoring at most m_B − 2 computations; thus, in the worst case, the total number of ignored computations is (m_B − 2)(L/m_B − 1). This completes the proof of Theorem 1.