Nonsmooth Newton’s Method: Some Structure Exploitation

Abstract. We investigate real nonsymmetric linear systems arising in the generation of search directions within a nonsmooth Newton's method. This applies to constrained optimisation problems via reformulation of the necessary conditions into an equivalent nonlinear and nonsmooth system of equations. We propose a strategy to exploit the problem structure. First, based on the sub-blocks of the original matrix, some variables are selected and ruled out for a posteriori recovery; then, a smaller and symmetric linear system is generated; finally, from the solution of the latter, the remaining variables are obtained. We prove that the method is applicable whenever the original linear system is well-posed. We propose and discuss different selection strategies. Finally, numerical examples are presented to compare this method with the direct approach without exploitation, for full and sparse matrices, over a wide range of problem sizes.


Introduction
In this paper, we consider the real square nonsymmetric, possibly large and sparse, linear system

\[
\begin{bmatrix} Q & A^\top & C^\top \\ A & 0 & 0 \\ -SC & 0 & T \end{bmatrix}
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
=
\begin{bmatrix} f \\ g \\ h \end{bmatrix},
\tag{1}
\]

where Q ∈ R^{n_x×n_x}, A ∈ R^{n_a×n_x}, C ∈ R^{n_c×n_x} and S, T ∈ R^{n_c×n_c} are given matrices and f ∈ R^{n_x}, g ∈ R^{n_a}, h ∈ R^{n_c} are given vectors (n_x, n_a, n_c being some positive integers); S and T are non-zero diagonal matrices. The contribution of this paper is the exploitation of the structure in problem (1) and its transformation into a smaller symmetric linear system, with a saddle-point structure, from which the solution to (1) can be easily recovered. In particular, two stages are discussed. First, a reduction step generates a smaller linear system together with a rule to recover the eliminated variables from the solution of this reduced system. This step exploits the fact that matrix T is diagonal and symbolically solves for (some of) the variables in z. Several different reduction strategies are discussed and compared. The second step rewrites the linear system in a symmetric form, making it possible to adopt solvers for symmetric systems, which are usually more time- and memory-efficient. Despite these advantages, some computational overhead is incurred, especially in the reduction step, which introduces a break-even point: the exploitation may pay off, e.g. in terms of computational time, only under certain conditions, e.g. for large instances. In fact, an optimal reduction strategy might exist, depending on the specific properties of the problem; indeed, it may even depend on the specific values of the entries. Throughout the paper, we investigate the influence of the reduction strategy on the performance of the aforementioned two-step exploitation; however, a detailed optimization of the reduction policy is beyond the scope of this paper. We point out that the proposed method could be combined with constraint-reduction approaches such as those presented in [5,12].
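To make the blockwise structure concrete, the following sketch assembles V and r from the stated sub-blocks and solves the system directly. It is written in Python with NumPy purely for illustration (the paper's experiments use MATLAB); all dimensions and data are hypothetical, and the sign convention of the third block row is the one assumed in this note.

```python
import numpy as np

# Hypothetical small instance; the block shapes follow the text (nx=4, na=1, nc=2).
rng = np.random.default_rng(0)
nx, na, nc = 4, 1, 2
Q = rng.standard_normal((nx, nx))
A = rng.standard_normal((na, nx))
C = rng.standard_normal((nc, nx))
s = rng.uniform(0.5, 1.5, nc)          # diagonal of S
t = rng.uniform(0.5, 1.5, nc)          # diagonal of T
f = rng.standard_normal(nx)
g = rng.standard_normal(na)
h = rng.standard_normal(nc)

# Assemble V blockwise under the sign convention assumed here:
#       [ Q     A^T  C^T ]
#   V = [ A     0    0   ]
#       [ -S C  0    T   ]
V = np.block([
    [Q, A.T, C.T],
    [A, np.zeros((na, na)), np.zeros((na, nc))],
    [-np.diag(s) @ C, np.zeros((nc, na)), np.diag(t)],
])
r = np.concatenate([f, g, h])

d = np.linalg.solve(V, r)              # direct solve, no structure exploitation
assert np.allclose(V @ d, r)
```

The blockwise assembly mirrors how V and r are typically produced in practice, which is the starting point of the structure-exploiting approach discussed below.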
Once the original problem (1), say V d = r for brevity, has been transformed, a reduced symmetric linear system, say V̂ d̂ = r̂, is to be solved. To this end, any method can be adopted. The choice may depend on the problem, in particular on its size, fill-in, sparsity pattern, accuracy requirements, memory constraints, availability of good preconditioners, and so on. Within this work, we compare the effectiveness and the limitations of the proposed structure-exploitation method when a direct solver is adopted to tackle the linear system; see Section 3.

Motivation
Linear systems with the form (1) arise, e.g., from nonlinear complementarity problems [6], nonlinear optimization problems with inequality constraints [7] and discretized optimal control problems with state and control constraints [8,9]. Usually, these are reformulated through the Karush-Kuhn-Tucker (KKT) necessary optimality conditions, then equivalently transformed into a nonlinear system of equations via so-called NCP-functions [16], and finally solved with a nonsmooth version of Newton's method [14]. Some globalization strategies [11,9] and results in function spaces [18] have been reported. It has been shown that, for some NCP-functions, this approach is equivalent to a primal-dual active set strategy [10]. Indeed, different NCP-functions exhibit different properties and might affect the convergence behaviour [16,1]. With reference to problem (1), the matrices Q, A and C can be considered as iterate-dependent linear-quadratic approximations of an underlying nonlinear problem, while the diagonal matrices S and T originate from the NCP-function adopted, and the vectors f, g and h are the residuals of the aforementioned nonlinear system of equations.
Example. Let us consider a quadratic program (QP) with linear equality and inequality constraints: we seek an x ∈ R^{n_x} minimizing ½ xᵀQx + qᵀx, subject to the constraints Ax = a, Cx ≤ c, where Q, A, C and q, a, c are given matrices and vectors, respectively. Here the inequalities are understood componentwise. Linear constraints ensure that regularity conditions are met, so the KKT conditions are necessary for optimality; these read

\[Qx + q + A^\top \lambda + C^\top \mu = 0, \tag{2a}\]
\[Ax - a = 0, \tag{2b}\]
\[c - Cx \ge 0, \quad \mu \ge 0, \quad \mu^\top (c - Cx) = 0, \tag{2c}\]

where λ and µ denote the Lagrange multipliers. In (2c), inequality and complementarity constraints hold componentwise. Let us consider an NCP-function ϕ : R² → R, e.g. the original or the penalized Fischer-Burmeister function [7,1], which by definition satisfies

\[\varphi(a, b) = 0 \iff a \ge 0,\; b \ge 0,\; ab = 0 \tag{3}\]

for any pair (a, b). Thanks to this property, the KKT system (2) is equivalently rewritten as a nonlinear system ψ(z) = 0, collecting the vector z = (x, λ, µ) ∈ R^{n_z}, n_z := n_x + n_a + n_c, and with the vector-valued function ψ : R^{n_z} → R^{n_z} defined by

\[\psi(z) := \begin{bmatrix} Qx + q + A^\top \lambda + C^\top \mu \\ Ax - a \\ \varphi(c - Cx, \mu) \end{bmatrix}. \tag{4}\]

Here the NCP-function ϕ applies componentwise. A (globalized) possibly nonsmooth Newton's-type method generates a sequence {z^k} through the recurrence z^{k+1} = z^k + α_k d^k, k = 0, 1, 2, ..., where the step length α_k > 0 is determined, e.g., by a line-search procedure of Armijo's type, and the search direction d^k is the solution of the linear equation V_k d = −ψ(z^k) [7,14,11,8,9]. The matrix V_k is an element of Clarke's generalized Jacobian (that is, the convex hull of the Bouligand differential [2]) of ψ at z^k, namely V_k ∈ ∂ψ(z^k) [11,8]. The NCP-function ϕ is the only element in (4) which can possibly make the function ψ nonsmooth. Hence, from (4), one obtains

\[V_k = \begin{bmatrix} Q & A^\top & C^\top \\ A & 0 & 0 \\ -S_k C & 0 & T_k \end{bmatrix} \tag{5}\]

with diagonal matrices S_k = diag(s^k_1, s^k_2, ..., s^k_{n_c}) and T_k = diag(t^k_1, t^k_2, ..., t^k_{n_c}), whose entries are pairwise coupled via the (possibly generalized) differential of the NCP-function ϕ. Defining v^k := c − Cx^k, the inequality constraint violation at the k-th iterate, the coupling for the i-th inequality constraint reads [7,1,8]

\[(s^k_i, t^k_i) \in \partial \varphi(v^k_i, \mu^k_i), \qquad i = 1, 2, \ldots, n_c. \tag{6}\]

We remark that the matrices S_k and T_k are diagonal because the NCP-function ϕ applies componentwise in (4). Then, the linear system V_k d = −ψ(z^k) used to compute the search direction d^k corresponds exactly to problem (1).
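As a concrete illustration of the NCP-function property and of the pairwise coupling of the diagonals of S and T, the following Python sketch (helper names are ours) evaluates the unpenalized Fischer-Burmeister function φ(a, b) = a + b − √(a² + b²) and, away from the kink at the origin, its gradient, whose two components play the roles of s_i and t_i.

```python
import math

def fischer_burmeister(a, b):
    # phi(a, b) = a + b - sqrt(a^2 + b^2): zero iff a >= 0, b >= 0, a*b = 0
    return a + b - math.hypot(a, b)

def fb_differential(a, b):
    # For (a, b) != (0, 0) the function is smooth, with gradient entries
    #   s = 1 - a / ||(a, b)||,   t = 1 - b / ||(a, b)||,
    # which satisfy (s - 1)^2 + (t - 1)^2 = 1: they lie on the circle centered
    # at (1, 1) with unit radius, hence (s, t) != (0, 0).
    n = math.hypot(a, b)
    return 1.0 - a / n, 1.0 - b / n

# Complementary pairs (a >= 0, b >= 0, a*b = 0) are exactly the roots of phi:
assert abs(fischer_burmeister(0.0, 3.0)) < 1e-12
assert abs(fischer_burmeister(2.0, 0.0)) < 1e-12
# Violations yield nonzero values:
assert fischer_burmeister(1.0, 1.0) > 0.0   # complementarity violated
assert fischer_burmeister(-1.0, 0.5) < 0.0  # sign constraint violated

s, t = fb_differential(3.0, 4.0)
assert abs((s - 1.0) ** 2 + (t - 1.0) ** 2 - 1.0) < 1e-12
```

The fact that the pair (s, t) never vanishes for the Fischer-Burmeister function is exactly the property invoked in Remark 1 below.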

Outline
This work is organized as follows. Section 2 outlines the structure exploitation strategy and introduces the underlying assumption which permits it. In Sections 2.1 and 2.2 the two main steps are developed and discussed. Section 3 validates the proposed approach numerically, for both full and sparse matrices, with different reduction strategies, showing effectiveness and limitations of the proposed algorithm. Section 4 concludes the paper and presents ideas for future research.

Structure exploitation
Problem (1), also denoted V d = r for brevity, can be solved directly via any linear algebra package, e.g., MA48 [4], PARDISO [15], SuperLU [13], mldivide in MATLAB [17]. However, we aim at exploiting our knowledge of the structure of matrix V; computational effort and achieved accuracy might benefit from this, especially for large-scale and sparse linear systems. Firstly, we notice that matrix V and vector r are often computed blockwise and then assembled. Hence, the matrices Q, A, C and the vectors s, t, f, g and h are here considered as the starting ingredients for solving (1). Overall, two directions are explored, mainly exploiting the diagonal structure of S and T. In Section 2.1 a reduction step is discussed, eliminating some variables and introducing a smaller nonsymmetric linear system. Then, in Section 2.2, this is transformed into an equivalent symmetric one. Nonetheless, these operations for reorganizing the linear system and subsequently recovering variables constitute a computational overhead. This means that these procedures might pay off only for certain problems, likely large instances with many inequality constraints. In Section 3, numerical tests show that this break-even point corresponds to relatively small problem instances (for the tested implementation). We point out that the proposed method relies on the following assumption on the diagonals s and t of S and T.

Assumption 1. For every i = 1, 2, ..., n_c, it holds (s_i, t_i) ≠ (0, 0).

This is a mild requirement, in that it corresponds to problem (1) being well-posed. One can show the following result:

Lemma 1. If problem (1) admits a unique solution, then Assumption 1 holds.
Proof (by contradiction). Let us assume there exist a unique solution d to problem (1) and an index i such that (s_i, t_i) = (0, 0). Then the row of V corresponding to the i-th inequality constraint consists of zeros only, so the matrix V is rank deficient and the system is not uniquely solvable. Two cases are possible, depending on the value of h_i on the right-hand side. If h_i = 0, then problem (1) admits infinitely many solutions, hence the solution d is not unique. If h_i ≠ 0, then the linear system is unsolvable and d cannot be a solution. In both cases we reach a contradiction. □
Remark 1. Assumption 1 requires a mild condition to be satisfied by the NCP-function ϕ. For instance, a sufficient condition is that for any given pair (a, b) ∈ R² there exists a pair (s, t) ∈ ∂ϕ(a, b) such that (s, t) ≠ (0, 0); this makes it always possible to choose suitable entries for S and T. The Fischer-Burmeister function and the max function, among other NCP-functions, have this property.
Let us denote by I := {1, 2, ..., n_c} the index set for the inequality constraints, and by I⁰_re := {i ∈ I | t_i ≠ 0} and I⁰_sy := {i ∈ I | s_i ≠ 0} the largest index sets that allow, respectively, the reduction step and the symmetrization step discussed below. Thanks to Assumption 1, these satisfy I⁰_re ∪ I⁰_sy = I. Let us consider an index subset I_re ⊆ I⁰_re, sufficiently large to satisfy I_re ∪ I⁰_sy = I. Then, for the associated complement Ī_re := I \ I_re, it holds Ī_re ⊆ I⁰_sy. With this construction, it is possible to apply the reduction step, ruling out the given set I_re of variables, and subsequently the symmetrization step on the linear system in the remaining variables, namely those in Ī_re.
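A small Python sketch of this construction (with purely illustrative data): it forms I⁰_re and I⁰_sy from hypothetical diagonals s and t satisfying Assumption 1, and exhibits the smallest and largest admissible choices of I_re.

```python
import numpy as np

s = np.array([0.0, 0.7, 1.3, 0.0])  # hypothetical diagonal of S
t = np.array([1.1, 0.0, 0.4, 2.0])  # hypothetical diagonal of T
# Assumption 1 holds: no index i has (s_i, t_i) == (0, 0).

I = set(range(len(s)))
I0_re = {i for i in I if t[i] != 0.0}  # reduction possible:      t_i != 0
I0_sy = {i for i in I if s[i] != 0.0}  # symmetrization possible: s_i != 0
assert I0_re | I0_sy == I              # guaranteed by Assumption 1

# Any admissible I_re must cover the indices where symmetrization is
# impossible (the complement of I0_sy) and stay inside I0_re:
I_re_min = I - I0_sy                   # smallest admissible choice
I_re_max = I0_re                       # largest admissible choice
assert I_re_min <= I_re_max            # here: {0, 3} <= {0, 2, 3}
```

In exact arithmetic the comparisons against zero suffice; in floating point a small tolerance is used instead, as in the definitions compared below.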
Remark 2. We stress that in general I⁰_re ∩ I⁰_sy ≠ ∅ and hence the choice of I_re is not unique. This suggests that there could be an optimal reduction strategy, possibly dependent on V and with some degree of computational awareness. However, this issue is beyond the scope of this paper.
In Section 3, we compare through numerical investigations the following definitions of I_re:

\[I^t_{re} := \{ i \in I \mid |t_i| \ge \varepsilon \}, \tag{7a}\]
\[I^s_{re} := \{ i \in I \mid |s_i| \le \varepsilon \}, \tag{7b}\]
\[I^{ts}_{re} := \{ i \in I \mid |t_i| \ge |s_i| \}, \tag{7c}\]

where ε > 0 is a given, sufficiently small value, introduced as a numerical tolerance in (7a)-(7b). For ε → 0⁺, these sets approach the largest and the smallest possible reduction sets, respectively, namely reducing the most and the least of the variables. Instead, the set defined in (7c) represents an arbitrary trade-off, introduced for the sake of comparison; see Fig. 2.

Remark 3.
One could think of performing only one of the two steps, either reduction or symmetrization. However, (i) under Assumption 1, once the system is reduced, the symmetrization step is straightforward, inexpensive and likely effective; (ii) the symmetrization step might be impossible without a preliminary reduction, depending on the invertibility of S.

Fig. 2. The reduction sets (7) in the |s|-|t| plane: I^t_re (left), I^s_re and I^ts_re (right).

Reduction
The idea behind the reduction step stems from the observation that problem (1) may be separable, i.e. it may be possible to compute the value of some variables a posteriori, namely once the others are given. In fact, a solution to problem (1) must satisfy its third block row,

\[-SCx + Tz = h, \tag{8}\]

where matrix T is diagonal. Given an index set I_re ⊆ I⁰_re, it is possible to compute z_i, for every i ∈ I_re, from (8) once the solution vector x is known. To this end, let us build the matrices T_re := diag(t_i | i ∈ I_re) and T̄_re := diag(t_i | i ∈ Ī_re), and denote by z_re and z̄_re the corresponding vectors of unknown variables which can and cannot be reduced, respectively. Then, partitioning the linear system (1) according to these definitions yields

\[
\begin{bmatrix}
Q & A^\top & \bar C_{re}^\top & C_{re}^\top \\
A & 0 & 0 & 0 \\
-\bar S_{re} \bar C_{re} & 0 & \bar T_{re} & 0 \\
-S_{re} C_{re} & 0 & 0 & T_{re}
\end{bmatrix}
\begin{bmatrix} x \\ y \\ \bar z_{re} \\ z_{re} \end{bmatrix}
=
\begin{bmatrix} f \\ g \\ \bar h_{re} \\ h_{re} \end{bmatrix},
\tag{9}
\]

where the matrices C_re, C̄_re, S_re, S̄_re and the vectors h_re and h̄_re are constructed analogously, based on I_re. The matrix T_re is nonsingular by definition; then, from (9), one can formally solve for z_re, obtaining

\[z_{re} = T_{re}^{-1} \left( h_{re} + S_{re} C_{re} x \right), \tag{10}\]

whose evaluation is straightforward because T_re is diagonal. Plugging (10) back into (9) leads, after rearrangements, to a smaller linear system without the reduced variables z_re, namely

\[
\begin{bmatrix}
\tilde Q & A^\top & \bar C_{re}^\top \\
A & 0 & 0 \\
-\bar S_{re} \bar C_{re} & 0 & \bar T_{re}
\end{bmatrix}
\begin{bmatrix} x \\ y \\ \bar z_{re} \end{bmatrix}
=
\begin{bmatrix} \tilde f \\ g \\ \bar h_{re} \end{bmatrix},
\tag{11}
\]

where the matrix Q̃ and the vector f̃ are defined by

\[\tilde Q := Q + C_{re}^\top T_{re}^{-1} S_{re} C_{re}, \tag{12a}\]
\[\tilde f := f - C_{re}^\top T_{re}^{-1} h_{re}. \tag{12b}\]

The larger the set I_re, the more variables are reduced and the smaller the obtained linear system (11). In turn, the computation of Q̃ may be costly, involving a matrix-matrix multiplication, Eq. (12a). Also, for sparse problems, the fill-in of matrix Q̃ may become significant. These drawbacks suggest there might be a trade-off in the reduction step, and hence an optimal reduction strategy, as pointed out in Remark 2.
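The reduction step above can be sketched in a few lines of Python/NumPy (illustrative only; the paper's implementation is in MATLAB, and the sketch assumes the third block row of (1) reads −SCx + Tz = h). It builds the reduced system, solves it, recovers z_re a posteriori, and cross-checks the result against solving the unreduced system.

```python
import numpy as np

# Hypothetical instance (all data random); shapes follow the text.
rng = np.random.default_rng(1)
nx, na, nc = 5, 2, 3
Q = rng.standard_normal((nx, nx)); Q = Q + Q.T        # symmetric Hessian block
A = rng.standard_normal((na, nx))
C = rng.standard_normal((nc, nx))
s = rng.uniform(0.5, 1.5, nc)                          # diagonal of S
t = rng.uniform(0.5, 1.5, nc)                          # diagonal of T (all nonzero)
f, g, h = rng.standard_normal(nx), rng.standard_normal(na), rng.standard_normal(nc)

re = np.array([0, 2])   # I_re: indices reduced a posteriori (t_i != 0 there)
ke = np.array([1])      # complement of I_re: indices kept in the system

# From -S C x + T z = h, solve symbolically for z_re:
#   z_re = T_re^{-1} (h_re + S_re C_re x)
# Substituting into the first block row modifies the (1,1) block and the rhs:
Qt = Q + C[re].T @ ((s[re] / t[re])[:, None] * C[re])  # Q + C' T^{-1} S C
ft = f - C[re].T @ (h[re] / t[re])                     # f - C' T^{-1} h

# Reduced (still nonsymmetric) system in the unknowns (x, y, z_ke):
nk = len(ke)
Vred = np.block([
    [Qt, A.T, C[ke].T],
    [A, np.zeros((na, na)), np.zeros((na, nk))],
    [-np.diag(s[ke]) @ C[ke], np.zeros((nk, na)), np.diag(t[ke])],
])
sol = np.linalg.solve(Vred, np.concatenate([ft, g, h[ke]]))
x, y, z_ke = sol[:nx], sol[nx:nx + na], sol[nx + na:]
z_re = (h[re] + s[re] * (C[re] @ x)) / t[re]           # a posteriori recovery

# Cross-check against the full, unreduced system V d = r:
z = np.empty(nc); z[re] = z_re; z[ke] = z_ke
V = np.block([
    [Q, A.T, C.T],
    [A, np.zeros((na, na)), np.zeros((na, nc))],
    [-np.diag(s) @ C, np.zeros((nc, na)), np.diag(t)],
])
d = np.linalg.solve(V, np.concatenate([f, g, h]))
assert np.allclose(np.concatenate([x, y, z]), d)
```

Note that the modification of the (1,1) block is computed as a row scaling followed by one matrix-matrix product, reflecting the fill-in concern discussed above.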

Symmetrization
Linear systems with a symmetric matrix can be solved more efficiently, in terms of both time and memory. In order to obtain a symmetric matrix from (11), it suffices to left-multiply the rows associated with the inequality constraints, namely with z̄_re, by the inverse of −S̄_re. As discussed above, Ī_re ⊆ I⁰_sy, hence the matrix S̄_re is nonsingular by construction; moreover, its inversion is straightforward, since it is diagonal. Then, the reduced symmetric linear system V̂ d̂ = r̂ reads

\[
\begin{bmatrix}
\tilde Q & A^\top & \bar C_{re}^\top \\
A & 0 & 0 \\
\bar C_{re} & 0 & -\bar S_{re}^{-1} \bar T_{re}
\end{bmatrix}
\begin{bmatrix} x \\ y \\ \bar z_{re} \end{bmatrix}
=
\begin{bmatrix} \tilde f \\ g \\ -\bar S_{re}^{-1} \bar h_{re} \end{bmatrix}.
\tag{13}
\]

The matrix V̂ is symmetric and smaller than V; the vector d̂ collects the unknowns corresponding to the optimization variables (x), the equality constraints' multipliers (y) and the multipliers of the non-reduced inequality constraints (z̄_re).
Remark 4. In (12)-(13), the matrix-matrix products T_re^{-1} S_re, S̄_re^{-1} T̄_re and the matrix-vector products T_re^{-1} h_re, S̄_re^{-1} h̄_re can be evaluated as entry-wise vector-vector products. In fact, this is possible because the matrices S_re, S̄_re, T_re and T̄_re are diagonal. Furthermore, one can exploit this feature by choosing a specific multiplication ordering, aiming at the lowest possible computational complexity.
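Continuing in Python/NumPy (again purely illustrative, with hypothetical data), the symmetrization amounts to entrywise divisions by the kept diagonal entries of S. The sketch below scales the kept inequality rows of a reduced system and verifies that the resulting matrix is symmetric, assuming the modified (1,1) block (here called Qt) is symmetric.

```python
import numpy as np

# Hypothetical reduced-system data; Qt stands in for the modified (1,1) block
# produced by the reduction step and is symmetric by construction.
rng = np.random.default_rng(2)
nx, na, nk = 4, 1, 2
Qt = rng.standard_normal((nx, nx)); Qt = Qt + Qt.T
A = rng.standard_normal((na, nx))
C_ke = rng.standard_normal((nk, nx))   # rows of C kept in the reduced system
s_ke = rng.uniform(0.5, 1.5, nk)       # nonzero: the kept set lies in I0_sy
t_ke = rng.uniform(-1.0, 1.0, nk)
h_ke = rng.standard_normal(nk)

# Scaling the kept inequality rows [-S C_ke, 0, T_ke | h_ke] by (-S)^{-1} is an
# entrywise division because S and T are diagonal:
Vsym = np.block([
    [Qt, A.T, C_ke.T],
    [A, np.zeros((na, na)), np.zeros((na, nk))],
    [C_ke, np.zeros((nk, na)), np.diag(-t_ke / s_ke)],
])
rhs_tail = -h_ke / s_ke                # scaled right-hand side of the kept rows

assert np.allclose(Vsym, Vsym.T)       # saddle-point structure, now symmetric
```

No matrix inverse is ever formed: the divisions `-t_ke / s_ke` and `-h_ke / s_ke` are the entrywise products mentioned in Remark 4.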
Numerical tests

A problem instance consists of the matrices Q, A, C and the vectors s, t (the diagonals of S and T, respectively), f, g and h. In the case of full matrices, starting from given values of N, α and γ, an instance is generated by random sampling, where N(µ, σ) denotes the normal probability distribution with mean value µ and standard deviation σ, and U(a, b) the uniform distribution with support in [a, b]. The entries of S and T are pairwise coupled in that they are sampled, with uniform probability distribution, from the disk in the s-t plane centered at (1, 1) with unit radius. This setting is motivated by, and mimics, the generalized differential of the Fischer-Burmeister function [8]. Both the direct and the structure-exploiting method set up the linear system starting from these inputs. Notice that the reduced approach builds neither V nor r, but their reduced and symmetric counterparts V̂ and r̂. In our implementation, the direct method builds V and r and then adopts the mldivide routine to solve V d = r. Instead, for the reduced approach (with full matrices), the linsolve routine is adopted and explicitly informed that the matrix V̂ is symmetric. The problem size N varies between 10 and 10⁴ for full and between 10³ and 2·10⁴ for sparse matrices. For each problem size, a set of 100 problem instances is generated (only 10 if N > 10⁴), checked for ill-conditioning and then solved. The composition of constraints is chosen as (α, γ) ∈ {(0, 1), (0, 1.5), (0.5, 1), (0.5, 1.5)} (colored in blue, red, green and violet, respectively). The index sets defined in (7) are adopted and compared, with the tolerance ε = 10⁻³. Sparse matrices are generated in such a way that they have approximately 10 entries per row; this makes the number of nonzero entries increase linearly, and not quadratically, with the problem size N. The computation times for the direct and the reduced case are depicted in Fig. 3, considering full matrices and the (large) index set I^t_re.
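The coupled sampling of s and t can be reproduced with a standard polar-coordinates draw (Python/NumPy sketch with a hypothetical size): taking the square root of a uniform variable as the radius makes the distribution uniform over the disk's area.

```python
import numpy as np

rng = np.random.default_rng(3)
nc = 10_000  # hypothetical number of inequality constraints

# Uniform sampling on the disk centered at (1, 1) with unit radius: drawing the
# radius as the square root of a uniform variable gives a density uniform in area.
theta = rng.uniform(0.0, 2.0 * np.pi, nc)
rho = np.sqrt(rng.uniform(0.0, 1.0, nc))
s = 1.0 + rho * np.cos(theta)
t = 1.0 + rho * np.sin(theta)

assert np.all((s - 1.0) ** 2 + (t - 1.0) ** 2 <= 1.0 + 1e-12)  # inside the disk
# The disk stays away from the origin (its closest point is at distance
# sqrt(2) - 1), so (s_i, t_i) != (0, 0) for every i and Assumption 1 holds:
assert np.all(s ** 2 + t ** 2 >= (np.sqrt(2.0) - 1.0) ** 2 - 1e-12)
```

This mirrors the circle on which the gradient components of the Fischer-Burmeister function lie, which is what motivates the disk in the first place.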
This gives an idea of the adopted implementation and computing hardware; also, one can guess the computational complexity of the underlying algorithm for solving a linear system. As expected from Section 2, the overhead due to partitioning, reducing and recovering introduces a break-even point, at around N = 60 (for I^t_re and I^ts_re, but not for I^s_re); hence, the reduced approach fails to be beneficial only for small-sized problems. This and other considerations can be drawn from Fig. 4, which depicts the median value of the ratio of the execution time of the reduced approach, t_red, with the different reduction strategies, over that of the direct one, t_dir. Therein, the break-even point corresponds to a unit ratio; also, the additional computational burden is significant for low values of N. For large N, instead, the ratio decreases to approximately one-half for I^t_re and I^ts_re, while for I^s_re it stays around one. As one could expect, the reduction set I^s_re is not as effective as I^t_re and I^ts_re because it benefits very little from the reduction step, in that it eliminates only a few variables. The relative number of constraints has an impact on the execution time but does not drastically affect the overall behaviour (in both the full and the sparse case). In fact, all else being equal, either decreasing the number of equalities or increasing the number of inequalities reduces the execution time ratio, meaning that the reduced approach is more effective and worthwhile for (large) problems with many inequality constraints. For the case of sparse matrices, similar observations hold; see Fig. 5. In order to show the distribution of the results obtained from the executed tests, along with the median value, the 9% and 91% quantiles are also reported. For full matrices, Fig. 4, the distribution is relatively narrow, while for sparse matrices, Fig. 5, the results are relatively scattered.
Thus, we argue that the sparsity pattern greatly affects the computation time. Nevertheless, for relatively large sparse matrices, the ratio t_red/t_dir approaches one-half and promisingly keeps decreasing. These numerical results suggest that the set I^t_re defined in (7a) is the most effective reduction strategy among those tested. In fact, it generates the smallest linear system and recovers the most variables a posteriori. However, as argued in Section 2.1, we expect this might not be the case for (much) larger problem instances, because of the overhead required by the reduction step.

Conclusions
This paper proposed and studied a structure-exploiting approach for solving the linear systems arising in the context of nonsmooth Newton's methods. The applicability of this method was established under well-posedness of the original problem. Numerical examples showed that the developed approach reduces the computational time, for both full and sparse linear systems. Some of the tested reduction strategies resulted in halving the execution time.
Analogous ideas apply when an iterative linear solver is the method of choice, e.g. for very large systems; tailored preconditioners are the subject of future research. It remains to assess the effectiveness and the drawbacks of the method when embedded into larger routines for numerical optimization. Moreover, it would be interesting to investigate an optimal reduction strategy, possibly computationally aware.