Memory and I/O optimized rectilinear Steiner minimum tree routing for VLSI

Received Jun 11, 2019; Revised Dec 16, 2019; Accepted Jan 7, 2020

As device sizes scale down at a rapid pace, interconnect delay plays a major part in the performance of IC chips. Minimizing delay and wire length is therefore a primary objective. FLUTE (Fast Look-Up Table) presented a fast and accurate RSMT (Rectilinear Steiner Minimum Tree) construction, together with an optimization technique that reduces the time complexity of RSMT construction for both smaller and larger degree nets. For larger degree nets, however, this technique induces memory overhead, because it does not consider the memory requirement of constructing the RSMT. Since memory is limited and expensive, it is desirable to utilize it more efficiently, which in turn reduces I/O time (i.e., the number of disk I/O accesses). The proposed work presents a Memory Optimized RSMT (MORSMT) construction that addresses the memory overhead for larger degree nets. A depth-first search and a divide and conquer approach are adopted to build a memory optimized tree. Experiments are conducted to evaluate the performance of the proposed approach against the existing model on varied benchmarks in terms of computation time, memory overhead, and wire length. The experimental results show that the proposed model is scalable and efficient.


INTRODUCTION
A Rectilinear Steiner Minimal Tree (RSMT) connects a given set of pins, possibly through additional Steiner nodes, with minimal total edge length measured in Manhattan distance. The construction of RSMTs is a major issue in Very Large Scale Integration (VLSI) design tasks such as interconnect design, placement, and floor planning. It has been adopted in computing transmission delay, interconnect delay, and workload. It is also adopted in some global routing strategies to build a routing topology for all nets.
The construction of an RSMT is an NP-complete problem [1]; as a result, the rectilinear minimum spanning tree (RMST) was adopted in some earlier designs. RMST construction needs only a fast tree computation strategy, but since the RMST does not allow Steiner nodes, its length is greater than that of the RSMT. The work in [2] showed that the RMST length is at most one and a half times that of the RSMT, i.e., an error of at most 50%, which was tolerable in earlier designs. Later designs, however, require good wire length accuracy, for which RSMT construction is needed. A wide range of characteristics of RSMT construction was presented in [3]. An optimal strategy for RSMT construction, which is reported to have the lowest computation time, was presented in [4,5], and a near-optimal solution was presented in [6]. However, these methods are computationally very heavy and are not suitable for practical applications, specifically VLSI design. Many approaches have been presented to reduce the time complexity of constructing an RSMT. The work in [7] adopted the spanning graph [8] to build an initial spanning tree and to obtain the best candidate sets for each edge, which are processed iteratively to eliminate the longest edge. A greedy batched technique that improved efficiency and reduced computation time was presented in [9]. The Single Trunk Steiner Tree (STST) connects all pins to a single trunk that runs vertically or horizontally through the set of pins, but it is not efficient for medium-degree nets. A refined single-trunk tree for nets of degree up to 5, which is accurate for medium-degree nets with fair runtime complexity, was presented in [10]. Spanning tree based approximation algorithms that produced optimal solutions were presented in [31,34].
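The gap between the RMST and the RSMT can be illustrated with a small worked example. The sketch below is not taken from the paper: it assumes a hypothetical four-pin net arranged as a cross, computes the RMST length with Prim's algorithm under Manhattan distance, and compares it with a tree that routes all pins through one Steiner node at the centre. The function names and the pin coordinates are illustrative assumptions.

```python
def manhattan(p, q):
    """Manhattan (rectilinear) distance between two pins."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def rmst_length(pins):
    """Length of a rectilinear minimum spanning tree (Prim's algorithm).

    No Steiner nodes are allowed: every edge joins two of the given pins.
    """
    pins = list(pins)
    # best[i] = cheapest known connection cost from pin i to the growing tree
    best = {i: manhattan(pins[0], pins[i]) for i in range(1, len(pins))}
    total = 0
    while best:
        nxt = min(best, key=best.get)      # closest pin not yet in the tree
        total += best.pop(nxt)
        for i in best:                     # relax distances through the new pin
            best[i] = min(best[i], manhattan(pins[nxt], pins[i]))
    return total


def star_length(pins, steiner):
    """Length of a tree that joins every pin to one Steiner node."""
    return sum(manhattan(steiner, p) for p in pins)


# Hypothetical cross-shaped net (an assumption for illustration only).
pins = [(0, 1), (1, 0), (2, 1), (1, 2)]
spanning = rmst_length(pins)               # three edges of length 2
steiner = star_length(pins, (1, 1))        # four edges of length 1
print(spanning, steiner, spanning / steiner)
```

For this particular net the Steiner tree length equals the half-perimeter of the bounding box, so it cannot be improved, and the spanning tree is exactly one and a half times longer.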
FLUTE, a lookup table based, fast, and accurate solution for RSMT construction, was presented in [11,12]. In this technique, the nets are recursively broken into subsets of nets. FLUTE was evaluated for low degree nets and is suitable for VLSI design. FLUTE is also efficient for high degree nets, with a runtime complexity of $O(n \log n)$ for a net of degree $n$. However, for higher degree nets the accuracy of RSMT construction is severely affected, owing to the error induced by the net breaking technique. To address this issue, [13] presented a scalable net partitioning technique in which the nets are broken into smaller subsets of nets and merged again by adding Steiner nodes. This technique can handle both smaller and larger degree nets with a slight reduction in accuracy, but it incurs a runtime complexity of $O(n \log^2 n)$. A fast lookup table based RSMT construction that offers a good tradeoff between accuracy and runtime complexity was presented in [14]. As noted in [28-30,32], memory is gaining prominence, and efficient use of this resource is very important. Neither [13] nor [14] considered the memory constraint in building the lookup table. Future VLSI designs will contain fixed blocks such as IP blocks and macros, and FLUTE has been adopted in such work [15,16]. In such designs, minimizing wire length and reducing memory overhead are the most desired objectives. To address these issues, the proposed work presents a memory optimized RSMT construction that reduces wire length and computational overhead. The contributions of this research work are as follows:
- No prior work has considered the memory constraint in RSMT construction. The proposed work presents a memory optimized RSMT construction.
- The proposed model reduces the wire length and computation time of constructing the RSMT.
- The proposed model is evaluated on different benchmarks [25], and the results show that it is efficient on all benchmarks in terms of memory overhead, computation time, and wire length.
The paper is organized as follows. Section 2 presents an extensive literature survey. The proposed memory optimized RSMT model is presented in Section 3. The experimental study on various benchmarks is presented in the penultimate section. Concluding remarks and future work are discussed in the last section.

LITERATURE SURVEY
VLSI is the technique of integrating hundreds of thousands of transistors into a single Integrated Circuit (IC) chip. As the number of transistors increases, the interconnecting wire length also increases. It is challenging to minimize the resistive and capacitive effects, which have an impact on delay. The interconnect wires have fixed width and area, making length the only parameter that can be optimized. As a result, many routing techniques have been proposed for VLSI design; they are surveyed below.
The work in [17] showed that global routers generally decompose nets through RSMTs; reducing congestion and providing flexibility therefore depends mainly on RSMT construction. FLUTE is a widely adopted technique for fast RSMT construction with minimal wire length. However, it does not take congestion into account. To provide flexibility and congestion optimization for nets, [17] presented a model named Fthu, a two-phase approach built on FLUTE. In the first phase, it decreases congestion and increases flexibility by applying reformed edge shifting and edge shrinking techniques without changing the Steiner tree topology. In the second phase, the congested Steiner tree is broken and reconnected using an MST-based approach. The outcomes show better performance in terms of reduced congestion time. However, there is no improvement in wire length.
Models to solve the global routing problem were presented in [18,19,26,27]. The model presented in [18], named GRIP (Global Routing Technology via Linear Programming), is an integer programming model for current large-scale networks. It obtains high quality solutions by adopting FLUTE for the initial RSMT construction. The outcome shows improvement in cost and wire length. However, CPU and memory performance were not exploited, and linear programming models are prone to getting stuck in local optima. To overcome this, [19] presented a fast congestion-driven Steiner tree construction. A game theory approach was adopted in [20,21,33] to improve the runtime complexity of the clustering approach for VLSI routing and placement design.
Various clustering based placement tools were studied in [22]. An efficient clustering approach can help reduce wire length or cycle time, or optimize a design for both objectives. However, clustering can be time consuming. To address this time constraint, [22] exploited heterogeneous computing and presented a parallel clustering approach for placement. Their model runs on both CPU and GPU and utilizes the CPU and GPU cores to the full extent. The outcome shows that it achieves a good speedup compared with a serial execution strategy. However, adopting a GPU for processing incurs a high deployment cost, and their model did not consider the memory constraint. As a result, it increases I/O access time.
The literature survey shows that minimizing time complexity (runtime) and wire length is critical for designing an efficient routing technique in VLSI design. Some existing approaches minimize either wire length or runtime, and some optimize both.
To improve runtime, a few approaches have considered a parallel implementation that utilizes CPU and GPU cores. However, none of the approaches has considered memory performance. Utilizing memory efficiently can help reduce the time complexity (i.e., I/O access time). The proposed work presents a memory optimization based RSMT construction to improve wire length, runtime, and memory performance. The proposed memory optimized RSMT (MORSMT) model is presented in the next section.

PROPOSED MEMORY OPTIMIZED RSMT MODEL
Here we present a memory optimized RSMT construction that reduces wire length, memory usage, and computation time. Similar to [14], the size of each subtree is determined by the memory optimized tree, which takes the available memory and a spanning tree as input. The method first computes the least-overhead edges (using the memory optimized spanning graph) and selects one of the nodes as the root. Child-parent relationships are then established along each edge: of the two endpoints of an edge, the node closer to the root is the parent. A depth-first search combined with a divide and conquer approach is then adopted to optimize memory for larger nets. Consider a graph $H(V, E)$, where $E$ and $V$ denote the set of edges (ordered pairs of nodes) and the set of nodes, respectively, and let $|E|$ and $|V|$ denote the number of edges and the number of nodes. We first construct an initial spanning graph $G$ by adding Steiner nodes, which are considered to be connected to all nodes in $H$. The divide and conquer approach is then applied to build a memory optimized tree of graph $H$. The table below shows the notations and symbols used in the paper.
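The child-parent step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the spanning tree is given as a list of undirected edges, and the function name `orient_tree` and the sample node labels are hypothetical. An iterative depth-first traversal from the chosen root assigns each node the neighbour closer to the root as its parent.

```python
from collections import defaultdict


def orient_tree(edges, root):
    """Return a {child: parent} map for an undirected tree rooted at `root`.

    For every edge, the endpoint reached first from the root (the one
    closer to the root) becomes the parent of the other endpoint.
    """
    adjacency = defaultdict(list)
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    parent = {root: None}          # the root has no parent
    stack = [root]                 # explicit stack: depth-first, no recursion
    while stack:
        node = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour not in parent:    # not visited yet
                parent[neighbour] = node
                stack.append(neighbour)
    return parent


# Hypothetical five-node spanning tree (labels are assumptions).
edges = [("a", "b"), ("b", "c"), ("b", "d"), ("d", "e")]
parent = orient_tree(edges, "a")
print(parent)
```

An explicit stack is used instead of recursion so that very deep trees, which large nets can produce, do not hit the interpreter's recursion limit.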

Memory optimized divide and conquer approach
The memory optimized divide and conquer approach takes the memory $S$, a spanning graph $G$ of $H$, and the graph $H$ as input, and outputs a tree $G$ that is a depth-first search tree of $H$, where $G$ is retained in memory and $H$ is kept on disk. The algorithm first tests whether $H$ satisfies the memory requirement, i.e., whether $H$ can be loaded into $S$. If so, it computes the memory optimized tree $G$ of $H$ directly using the available memory. Otherwise, the algorithm computes the memory optimized tree $G$ of $H$ by dividing the problem using the divide and conquer approach.
To obtain an efficient memory optimized tree, a legal division of $H$ must be computed; the flag indicating a legal division is initially set to false, as shown in the flowchart in Figure 1. The current spanning graph $G$ is then optimized with respect to $H$ until either $G$ is a memory optimized tree of $H$ or a legal division of $H$ on the spanning tree is obtained. Here the division is obtained by invoking the division optimization technique, which produces a graph division $H_0, H_1, H_2, \ldots, H_d$ of $H$ with the resultant spanning graphs $G_0, G_1, G_2, \ldots, G_d$. The division optimizer also evaluates a memory optimized graph $\mu$, which is used during the merge operation.
The division is said to be legal only if $d > 1$, as shown in the flowchart in Figure 1. Once a legal division is obtained, the memory optimized tree $G_q$ is computed for every subgraph $H_q$ recursively using the divide and conquer approach. Then, by combining all memory optimized trees $G_q$ of $H_q$ based on $\mu$, the tree $G$ is obtained as the memory optimized tree of $H$. The overall flow of the proposed memory optimized RSMT construction is shown in Figure 1.

Memory optimized division algorithm
The objective of the memory optimized divide and conquer approach is to maximize the number of divided subgraphs. In the existing model, for a given spanning tree $G$ and graph $H$, the division is obtained using a structure $G_0$ with the same parent as $G$. This leads to the following problems. Firstly, $\mu$ is obtained on top of $G_0$, where $G_0$ is generated from only one level of nodes in $G$. When the relationship among the subgraphs induced by the subtrees rooted at the leaf nodes of $G_0$ is complex, or when the parent of $G_0$ has a limited number of child nodes, evaluating the division by contracting all SCCs (Strongly Connected Components) may yield only a few divided subgraphs. Secondly, $\mu$ is obtained by scanning graph $H$ on disk once and evaluating a set of edges whose endpoints are leaf nodes of $G_0$ in $G$. When the number of leaf nodes of $G_0$ is small, this edge set may be much smaller than the edge set available in the graph. As a result, a large amount of edge information is computed but not utilized while scanning the edges. This reduces I/O efficiency, which increases computation time.
Our proposed model overcomes these problems by enlarging $G_0$ and its corresponding $\mu$ with respect to the memory size (i.e., as long as they fit in main memory). To satisfy the memory constraint, the model considers multiple levels of nodes in $G$ to generate $G_0$ and its corresponding $\mu$. The multi-level subtree is defined as a partitioned tree. Consider a tree $G$ with a given parent node. A partitioned tree, which is a subtree of $G$, must satisfy the following conditions. Firstly, the parent of the partitioned tree must be the parent node of $G$. Secondly, every node of $G$ that belongs to the partitioned tree is either a node of the partitioned tree with the same leaf nodes as it has in $G$, or a leaf node of the partitioned tree.

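To satisfy these constraints, given a tree $G$ with its parent node and a memory constraint, the partitioned tree is generated as follows. Initially, the partitioned tree is composed of the parent node only. Nodes are then selected iteratively from the partitioned tree, and all leaf nodes of the selected node in $G$ are added as its leaf nodes in the partitioned tree. Note that $\mu$ with respect to the partitioned tree comprises at most $|V|^2$ edges, where $|V|$ here is the number of nodes of the partitioned tree. As a result, the execution is stopped when adding a node would make $|V|^2$ exceed the memory constraint.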
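The growth of the partitioned tree under a memory constraint can be sketched as below. This is one possible reading of the procedure, not the authors' code. It assumes the full tree is stored as a dictionary from each node to its list of children, that nodes are expanded in breadth-first order, and that expansion stops just before the squared node count would exceed the budget. The names `partitioned_tree`, `children`, and `memory_budget` are hypothetical.

```python
from collections import deque


def partitioned_tree(children, root, memory_budget):
    """Grow a subtree from `root` while (node count) ** 2 fits the budget.

    A node is expanded with ALL of its children or with none of them, so
    every node of the result is either fully expanded or a leaf of the
    partitioned tree. Returns the set of nodes in the partitioned tree.
    """
    nodes = {root}
    frontier = deque([root])               # leaves that may still be expanded
    while frontier:
        node = frontier.popleft()
        kids = children.get(node, [])
        if not kids:
            continue                       # a true leaf of the full tree
        # Assumed stop rule: halt once the squared size would exceed memory.
        if (len(nodes) + len(kids)) ** 2 > memory_budget:
            break
        nodes.update(kids)
        frontier.extend(kids)
    return nodes


# Hypothetical tree: r has children a and b, a has c and d, and so on.
children = {"r": ["a", "b"], "a": ["c", "d"], "b": ["e"], "c": ["f"]}
small = partitioned_tree(children, "r", memory_budget=10)
large = partitioned_tree(children, "r", memory_budget=25)
print(sorted(small), sorted(large))
```

With a budget of 10 only the first level fits (3 nodes, 9 at most); with a budget of 25 the children of `a` are also included (5 nodes, 25 at most), and expansion stops before `b` is expanded.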
The memory optimized division model is presented in Figure 2. Here $G_0$ is constructed in a top-down fashion based on $\mu$. The algorithm first evaluates the partition tree $\bar{G}_0$ of $G$ using the method discussed above and initializes $\mu$ to be $\bar{G}_0$. It then scans all edges of $H$ on disk and adds the corresponding edge into $\mu$ if both of its endpoints belong to $V(\bar{G}_0)$. Next, the model finds all SCCs in $\mu$, and a top-down methodology is used to generate $G_0$. For this, $G_0$ and a FIFO (First In, First Out) queue are initialized. The parent of $G$ is pushed into the queue first. Edges are then iteratively added into $G_0$ until the queue becomes empty. In every round, the algorithm retrieves the node at the front of the queue and pushes all of its leaf nodes into $G_0$ (i.e., if the node is in the tree and is not a Steiner node). Each node pushed into $G_0$ in this way is also pushed into the queue for further expansion. Once $G_0$ is computed, $\mu$ is updated by deleting all nodes that are not in $V(G_0)$ from $\mu$. Lastly, the divided subgraphs $H_1, H_2, H_3, \ldots, H_d$ and the subtrees $G_1, G_2, G_3, \ldots, G_d$ are evaluated.

Memory optimized merging algorithm
The merge algorithm presented in Figure 3 takes as input the divided trees $G_0, G_1, G_2, \ldots, G_d$ and the corresponding $\mu$, and outputs a graph $G$. To perform the merge operation, two issues must be solved. The first issue is how to arrange $G_0, G_1, G_2, \ldots, G_d$ in the merged tree $G$ such that $G$ is a tree of graph $H$. The second issue is how to handle the Steiner nodes in the trees $G_0, G_1, G_2, \ldots, G_d$. To solve the first issue, we use the information in $\mu$ (i.e., $\mu$ is a graph that preserves the topology of the edges of all partitioned subgraphs). We sort all the nodes in $\mu$, rearrange the nodes in $G_0$ based on the reverse topological order of the corresponding nodes in $\mu$, and then merge all $G_q$ ($1 \le q \le d$) with $G_0$ to obtain the tree $G$ of $H$. For this, it must be ensured that $\mu$ is a DAG and that $V(\mu) = V(G_0)$. To solve the second issue, we merge all trees $G_0, G_1, G_2, \ldots, G_d$ to obtain a tree $G$. Then, for each Steiner node in $V(G_0)$, the edge between the Steiner node and its root node in $G$ is deleted from $G$, and for each leaf node of the Steiner node in $G$, the edge from the Steiner node to that leaf node is eliminated from $G$ and an edge from the root node to that leaf node is added into $G$. This step helps ensure that the resultant tree is a tree of $H$. The performance study of the proposed approach is presented in the next section.
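The handling of Steiner nodes during merging can be sketched as a splice operation on a parent map. This is an illustrative reading of the step rather than the authors' implementation: it assumes the merged tree is stored as a `{child: parent}` dictionary and that no Steiner node is the root of the tree. The function name `splice_out_steiner` and the sample labels are hypothetical.

```python
def splice_out_steiner(parent, steiner_nodes):
    """Remove Steiner nodes from a tree given as a {child: parent} map.

    For each Steiner node s with parent p, the edge (p, s) is deleted and
    every edge (s, c) is replaced by (p, c), so the children of s are
    re-attached to p. Assumes no Steiner node is the root of the tree.
    """
    result = dict(parent)                  # leave the caller's map untouched
    for s in steiner_nodes:
        p = result.pop(s)                  # delete edge (p, s)
        for child in list(result):
            if result[child] == s:
                result[child] = p          # replace (s, child) by (p, child)
    return result


# Hypothetical merged tree: r -> s -> {a, b}, r -> c, where s is a Steiner node.
tree = {"r": None, "s": "r", "a": "s", "b": "s", "c": "r"}
merged = splice_out_steiner(tree, ["s"])
print(merged)
```

Because each splice re-attaches children to the current parent of the removed node, a chain of consecutive Steiner nodes collapses correctly regardless of the order in which they are processed.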

RESULT AND ANALYSIS
The MORSMT algorithm is implemented in the C++ object oriented programming language and compiled with the GCC compiler. The Eclipse Kepler IDE is used to run the algorithm. The system environment is the CentOS 7.0 Linux operating system on a 3.2 GHz Intel i5 quad-core processor with 16 GB of RAM. The IBM benchmark [25] is considered for evaluation, as shown in Table 2. The experiments evaluate the performance of MORSMT against the existing approach [14] in terms of wirelength, memory utilization, and computation time.

Wirelength performance
Experiments are carried out to evaluate wirelength performance using the 18 IBM circuits of the ISPD98 benchmark. The benchmark information is shown in Table 2; there are 1.57 million nets in total. The proposed MORSMT approach is compared with FLUTE [14] at its default accuracy setting of 3 in terms of wirelength, as shown in Table 3. The outcome shows that MORSMT performs better than the existing approach in terms of wirelength reduction in all cases. An average reduction of 0.026% is achieved by MORSMT over the existing approach.

Memory utilization performance
Valgrind [23,24] is used to evaluate memory usage. The proposed MORSMT approach is compared with FLUTE [14] at its default accuracy setting of 3 in terms of memory utilization, as shown in Table 5. The outcome shows that MORSMT performs better than the existing approach in terms of memory consumption in all cases. An average reduction of 77.71% is achieved by MORSMT over the existing approach. The outcome also shows that memory usage depends directly on wire length and net degree.

CONCLUSION
This work presented a memory efficient RSMT construction. The proposed model is an improvement on the original FLUTE. FLUTE does not consider memory optimization in RSMT construction and adopts breadth-first search to find the minimum spanning tree, which induces memory overhead. To address this problem, the proposed work adopts divide and conquer and depth-first search to find the minimum spanning tree. Experiments are conducted to evaluate the performance of the proposed approach against the existing approach on varied benchmarks. The outcome shows improvements of 0.026%, 76.3%, and 32.62% over the existing approach in terms of wirelength, memory overhead, and computation time reduction, respectively. Future work will consider a parallel memory optimized RSMT construction, with experiments carried out in multi-core environments such as CPUs or GPUs to improve speedup.