
# Design and implementation of low power testing using advanced razor based processor

Article in International Journal of Applied Engineering Research · January 2017


## Design and Implementation of Low Power Testing Using Advanced Razor Based Processor

#### **R.Karthick**

Research Scholar, Department of Electronics and Communication Engineering, Bharath University, Chennai, India.

Orcid ID: 0000-0002-3222-0185

#### Dr. M.Sundararajan

Dean-Research, Department of Electronics and Communication Engineering, Bharath University, Chennai, India.

Orcid ID: 0000-0003-2317-5228

#### Abstract

To meet functional operation criteria, this work concentrates on the percentage of unspecified values in the tests applied. A low-power broadside test set is obtained from a functional broadside test set, together with the derivation of skewed-load test cubes in BIST circuits. The combined effect of programmable truncated multiplication and fault-tolerant digital signal processing (DSP) design is used to reduce the supply voltage beyond the critical timing level. The timing-modulation properties of truncated multiplication are examined in order to improve fault-tolerant designs, reduce the error-correction burden, and extend the system operating voltage range. The low-power test schemes, along with the advanced Razor technique, are implemented on the original digital signal processor. The only drawback is a degradation of the output SNR.

Keywords: Digital Signal processing, BIST, Razor technique.

#### INTRODUCTION

In very-large-scale integrated (VLSI) circuits, voltage scaling is used to reduce dynamic power consumption and to achieve stable power management. If the supply voltage is scaled down by a factor M, dynamic power consumption falls by a factor of $M^2$. Advances in complementary metal-oxide-semiconductor (CMOS) technology exploit scaling while coping with process, voltage and temperature (PVT) variations.
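The quadratic dependence can be made explicit with the standard dynamic power model for CMOS logic. The symbols $\alpha$ (switching activity), $C_L$ (switched capacitance), $V_{DD}$ (supply voltage) and $f$ (clock frequency) are not used in the paper and are introduced here only for illustration, under the assumption that $\alpha$, $C_L$ and $f$ stay fixed while the supply is scaled:

$$P_{dyn} = \alpha \, C_L \, V_{DD}^2 \, f, \qquad \frac{P_{dyn}(V_{DD})}{P_{dyn}(V_{DD}/M)} = \frac{V_{DD}^2}{(V_{DD}/M)^2} = M^2$$

For example, a scaling factor of $M = 1.25$ would give a reduction factor of $1.25^2 \approx 1.56$, that is, roughly 36% less dynamic power under these assumptions.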

In digital signal processing [20], voltage over-scaling (VOS) is used to reduce energy consumption while retaining the DSP function. The main approaches to tolerating the resulting timing violations are (i) adding an estimation subsystem that supplies an estimate when a fault is detected, and (ii) techniques that alter data capture by supplementing the latches or flip-flops on the critical path, at the cost of extended execution time. Combined, these features promise low-power systems with moderate functionality, with a compromise on signal degradation and processing time.

The incompletely specified broadside test cubes are separated from the completely specified ones and then collected for the low-power test set. The resulting target faults are grouped as CBRD. Merging the test cubes from CBRD forms a low-power broadside test set. Skewed-load test cubes derived from functional broadside tests are used in the same way to form a skewed-load test set. A mixed low-power test set can also be formed by combining broadside and skewed-load tests. The main effort goes into the extraction of the skewed-load test cubes.

#### TRUNCATED MULTIPLICATION

A truncated multiplier omits the least significant part of the partial-product matrix and still delivers the required outputs. In general, the products generated by fixed-width " $N \times N$ " bit multipliers are truncated or rounded back to the original bit width in later stages of the algorithm flow. Truncation replaces the lower part of the partial-product matrix with a smaller compensation circuit. Variants range from aggressively truncated designs to faithfully rounded truncated multipliers.
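A minimal behavioural sketch of this idea is given below. It assumes unsigned operands and a constant compensation term equal to the expected value of the dropped partial-product bits for uniformly random inputs; the function name `truncated_multiply`, the parameter `t` (number of truncated columns) and the compensation scheme are illustrative assumptions, not the circuit used in the paper.

```python
def truncated_multiply(a: int, b: int, n: int = 16, t: int = 8) -> int:
    """Unsigned n x n multiply returning the n most significant product bits.

    The t least significant columns of the partial-product matrix are never
    formed; a constant (an assumed compensation scheme) stands in for them.
    """
    if not (0 <= a < (1 << n) and 0 <= b < (1 << n)):
        raise ValueError("operands must be unsigned n-bit integers")
    if not 0 <= t <= n:
        raise ValueError("t must lie between 0 and n")

    total = 0
    for i in range(n):                  # row i is selected by bit i of b
        for j in range(n):              # bit j of a contributes to column i + j
            if i + j < t:
                continue                # truncated column: not implemented
            if (b >> i) & 1 and (a >> j) & 1:
                total += 1 << (i + j)

    # Column k < t holds k + 1 partial-product bits of weight 2**k. For
    # uniformly random operands each bit is 1 with probability 1/4, so the
    # expected value of the dropped part is the sum below divided by 4.
    compensation = sum((k + 1) << k for k in range(t)) // 4

    # Keep only the upper n bits, as in a fixed-width multiplier.
    return (total + compensation) >> n


if __name__ == "__main__":
    exact = (40000 * 50000) >> 16
    approx = truncated_multiply(40000, 50000, n=16, t=8)
    print("exact:", exact, "truncated:", approx)
```

With `t = 0` the function reduces to an exact multiplication followed by truncation to the upper `n` bits. With `n = 16` and `t = 8` the dropped columns and the constant are small enough that the result differs from the exact upper half by at most one least significant bit.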

#### A. RAZOR IMPLEMENTATION

The programmable truncated multiply-and-accumulate (PT-MAC) architecture was designed as a means of applying programmable truncated multiplication (PTM) in low-power biomedical applications that need modest DSP, such as ECG filtering or fall detection [19]. In our work, it is used as a platform for combining programmable truncation and fault tolerance.

The testing-based low-power PT-MAC (LT-PT-MAC) is introduced as an extension of BIST to support a general DSP architecture.

International Journal of Applied Engineering Research ISSN 0973-4562 Volume 12, Number 17 (2017) pp. 6384-6390 © Research India Publications. http://www.ripublication.com

LT-PT-MAC is an inexpensive way of implementing PTM in low-power applications.

The proposed DSP is structured as shown in Fig.6. Its control unit operates a five-stage pipeline, with program and memory blocks in a multi-bus Harvard configuration. The other units are I/O connectivity interfaces and an arithmetic unit built as a MAC structure with a 16-bit PTM, a 40-bit accumulator, and a barrel shifter for scaling and rotating the accumulated value. The proposed circuit under test (CUT) for the testing-based, Razor-based PT-MAC architecture (LT-PT-MAC) is structured as shown in Figure 1.

Fault tolerance in the LT-PT-MAC is handled in the accumulator unit, which is replaced by a fault-tolerant version named the Razor Accumulator. The original flip-flops are substituted by a newly implemented version of the Razor registers.

The augmented cells are designed and stored as library cells for post-synthesis insertion. They correspond to a custom Razor implementation in which the shadow latch of the Razor register is replaced by a shadow flip-flop, which avoids combinational issues. The meta-stability sensor required in Razor implementations is modeled as the delay of an inverter, which constrains the hold time of the Razor accumulator.

All delay violations that could lead to meta-stability are therefore detected as timing errors, which gives a conservative (lower-bound) estimate of the performance of the Razor circuit. Static timing analysis of LT-PT-MAC shows that the registers on potentially critical paths are located in the accumulator. The multiplication and accumulation of the input data is completed within one clock cycle.
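The detection principle of a Razor register with a shadow flip-flop can be illustrated with a small behavioural model. This is a sketch under simplifying assumptions, not the library cell described above: time is continuous, the class name `RazorRegister` and its parameters are hypothetical, and meta-stability is not modeled.

```python
from dataclasses import dataclass


@dataclass
class RazorRegister:
    """Behavioural model: a main flip-flop plus a delayed shadow flip-flop."""

    shadow_delay_ns: float  # extra time granted to the shadow flip-flop
    q: int = 0              # value currently held by the register

    def clock(self, new_value: int, arrival_ns: float, period_ns: float) -> bool:
        """Capture new_value; return True if a timing error is detected.

        arrival_ns is the time, measured from the launching clock edge, at
        which new_value becomes stable at the register input.
        """
        old = self.q
        # The main flip-flop samples at the clock edge: late data is missed.
        main_sample = new_value if arrival_ns <= period_ns else old
        # The shadow flip-flop samples shadow_delay_ns later.
        shadow_deadline = period_ns + self.shadow_delay_ns
        shadow_sample = new_value if arrival_ns <= shadow_deadline else old
        # A mismatch between the two samples flags a timing error.
        error = main_sample != shadow_sample
        # On error the register is restored from the shadow value.
        self.q = shadow_sample
        return error


if __name__ == "__main__":
    reg = RazorRegister(shadow_delay_ns=0.5)
    print(reg.clock(5, arrival_ns=0.8, period_ns=1.0), reg.q)  # data on time
    print(reg.clock(9, arrival_ns=1.2, period_ns=1.0), reg.q)  # data late
```

In this model, data arriving later than the shadow deadline is missed by both flip-flops and goes undetected, which is why the scaled supply voltage has to keep the worst-case delay inside the shadow window.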



#### B. LOW POWER TESTING BASED PT-MAC (LT-PT-MAC)

Figure 1: CUT for LT-PT-MAC top level diagram

In a Razor implementation the architecture has to be modified so that an error can be corrected by allocating an extra clock cycle, during which the erroneous data is replaced with the correct data. Of the pipeline management schemes recommended in the Razor literature, architectural replay was found to be the most suitable for PT-MAC. All instructions perform write/read or arithmetic operations, so a simple repeat strategy is used: a stall flag is issued when a Razor error occurs, and the faulty result is corrected with a small area overhead while avoiding issues with post-incrementing address pointers. The execution cycle of an instruction on the sleep mode Low Power Razor-augmented PT-MAC (SL-RPT-MAC) can thus be divided into four possible stages.
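The repeat strategy can be sketched at cycle level as follows. The function `run_mac`, its parameters and the per-operation delay list are hypothetical and serve only to show how a stall cycle preserves the result at the cost of extra cycles; the 40-bit accumulator width is taken from the architecture description.

```python
from typing import Sequence, Tuple

ACC_MASK = (1 << 40) - 1  # 40-bit accumulator


def run_mac(
    samples: Sequence[int],
    coeffs: Sequence[int],
    delays_ns: Sequence[float],
    period_ns: float,
    shadow_delay_ns: float,
) -> Tuple[int, int, int]:
    """Accumulate samples[i] * coeffs[i]; return (acc, cycles, stalls).

    delays_ns[i] is the assumed critical-path delay of operation i at the
    scaled supply voltage.
    """
    acc = 0
    cycles = 0
    stalls = 0
    for x, c, delay in zip(samples, coeffs, delays_ns):
        cycles += 1
        if delay > period_ns:
            if delay > period_ns + shadow_delay_ns:
                raise RuntimeError("delay outside the shadow window: undetectable")
            # Razor error: raise the stall flag and spend one extra cycle
            # repeating the operation. The operand pointers are not advanced
            # during the stall, so the same x and c are used again.
            stalls += 1
            cycles += 1
        acc = (acc + x * c) & ACC_MASK
    return acc, cycles, stalls


if __name__ == "__main__":
    acc, cycles, stalls = run_mac(
        samples=[1, 2, 3],
        coeffs=[4, 5, 6],
        delays_ns=[0.9, 1.1, 0.8],
        period_ns=1.0,
        shadow_delay_ns=0.5,
    )
    print(acc, cycles, stalls)
```

In this sketch the accumulated value is the same as in an error-free run, and each detected error costs exactly one additional cycle.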

The proposed DSP includes a control unit operating a five-stage pipeline, program and memory blocks in a multi-bus Harvard configuration, input and output connectivity, and an arithmetic unit consisting of a MAC structure with a 16-bit PT-MAC, a 40-bit accumulator, and a 40-bit barrel shifter for scaling and rotating the accumulated value.

Fault tolerance is achieved by altering the PT-MAC unit. Its accumulator was replaced by a fault-tolerant version, named the Razor Accumulator, in which the original flip-flops were substituted by a version of the Razor registers.

Functional broadside test sets are larger than nonfunctional broadside test sets for the same set of target faults. By avoiding the inclusion of functional broadside tests in the low-power test set, and using test cube merging to form tests, the procedure is able to generate compact low-power test sets.

A functional broadside test cube that detects a fault f creates signal transitions that can also occur during functional operation in a sub-circuit around the site of f. Thus, f is detected under functional operation conditions in a sub-circuit around it when the test cube is merged with other test cubes to form a test. In general, after test cube merging, the tests create functional operation conditions in the sub-circuits that are defined by the test cubes. The tests thus satisfy a stricter constraint on their switching activities than a constraint that considers only their total switching activities. The maximum switching activity of a functional broadside test is used to bound the switching activities of the tests that are obtained by merging. Thus, the switching activity is prevented from being unnecessarily low or too high.
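Test cube merging itself can be sketched in a few lines. The example below assumes that a test cube is written as a string over `0`, `1` and `x` (unspecified), and it uses a simple greedy order; the function names are hypothetical. The switching-activity bound mentioned above is left out, since checking it would require simulating the circuit under test.

```python
from typing import List, Optional


def merge_cubes(a: str, b: str) -> Optional[str]:
    """Merge two test cubes, or return None if they conflict.

    Two cubes are compatible when no position is specified to opposite
    values; the merged cube keeps every specified value of both.
    """
    if len(a) != len(b):
        raise ValueError("test cubes must have the same length")
    merged = []
    for p, q in zip(a, b):
        if p == "x":
            merged.append(q)
        elif q == "x" or p == q:
            merged.append(p)
        else:
            return None  # one cube requires 0 where the other requires 1
    return "".join(merged)


def greedy_compact(cubes: List[str]) -> List[str]:
    """Merge each cube into the first compatible test built so far."""
    tests: List[str] = []
    for cube in cubes:
        for i, test in enumerate(tests):
            combined = merge_cubes(test, cube)
            if combined is not None:
                tests[i] = combined
                break
        else:
            tests.append(cube)  # no compatible test: start a new one
    return tests


if __name__ == "__main__":
    print(greedy_compact(["1x0x", "x10x", "0xxx"]))
```

Here the first two cubes are compatible and merge into `110x`, while `0xxx` conflicts with it in the first position and therefore starts a second test.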

Two classes of efficient simulation-based procedures exist for the generation of functional broadside tests. The procedure used in this work extracts functional broadside tests from functional test sequences. When the functional test sequences satisfy functional constraints on primary input sequences, the functional broadside tests that are extracted from them satisfy the same constraints. For the discussion here, the primary input sequences are assumed to be unconstrained. Functional test sequences are generated by a low-complexity sequential test generation process that does not consider target faults.
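The extraction step can be illustrated as follows, assuming the usual formulation in which a broadside test consists of a scan-in state and two consecutive primary input vectors, and the state is one that the circuit reaches while the functional test sequence is applied from a reset state. The toy next-state function (a 2-bit counter with an enable input) and all names are hypothetical stand-ins for a real circuit model.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[int, ...]
Vector = Tuple[int, ...]
BroadsideTest = Tuple[State, Vector, Vector]


def extract_broadside_tests(
    next_state: Callable[[State, Vector], State],
    reset_state: State,
    inputs: Sequence[Vector],
) -> List[BroadsideTest]:
    """Extract one broadside test per time unit of a functional sequence.

    The state at time u is reachable by construction, so the test
    (state at u, inputs[u], inputs[u + 1]) is a functional broadside test.
    """
    tests: List[BroadsideTest] = []
    state = reset_state
    for u in range(len(inputs) - 1):
        tests.append((state, inputs[u], inputs[u + 1]))
        state = next_state(state, inputs[u])  # advance the fault-free circuit
    return tests


def toy_next_state(state: State, vector: Vector) -> State:
    """Hypothetical circuit: 2-bit counter (q1, q0) with an enable input."""
    value = ((state[0] << 1) | state[1]) + vector[0]
    value %= 4
    return (value >> 1, value & 1)


if __name__ == "__main__":
    sequence = [(1,), (1,), (0,), (1,)]
    for test in extract_broadside_tests(toy_next_state, (0, 0), sequence):
        print(test)
```

Because every scan-in state is taken from the simulated trajectory, the extracted tests inherit whatever constraints the input sequence satisfies.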

#### **RESULT AND DISCUSSION**

The results are obtained by applying the BIST technique to the Razor-based PT-MAC discussed in the previous section. The aim is to reduce area, power and delay. The results for PT-MAC, I-PT-MAC and LT-PT-MAC are shown in Table 1 and in the bar charts below.

Table 1: Comparison Results of Various PT-MAC Architectures

| Metric                                        | PT-MAC  | I-PT-MAC | LT-PT-MAC |
|-----------------------------------------------|---------|----------|-----------|
| **Area**                                      |         |          |           |
| Number of slice registers                     | 228     | 119      | 96        |
| Number of slice LUTs                          | 534     | 421      | 86        |
| **HDL Synthesis Report**                      |         |          |           |
| Adders/subtractors                            | 24      | 22       | 15        |
| Registers                                     | 36      | 18       | 11        |
| Latches                                       | 29      | 22       | 11        |
| Comparators                                   | 8       | 6        | 3         |
| Multiplexers                                  | 120     | 110      | 87        |
| Tristates                                     | 1       | 0        | 0         |
| XORs                                          | 145     | 122      | 89        |
| **Advanced HDL Synthesis Report**             |         |          |           |
| Adders/subtractors                            | 1       | 1        | 1         |
| Counters                                      | 2       | 1        | 1         |
| Registers                                     | 198     | 145      | 34        |
| Comparators                                   | 1       | 2        | 1         |
| Multiplexers                                  | 89      | 73       | 43        |
| XORs                                          | 73      | 73       | 0         |
| **Power**                                     |         |          |           |
| Logic                                         | 0.926   | 0.848    | 0.712     |
| IOs                                           | 84.6816 | 84.6816  | 65.4896   |
| **Speed**                                     |         |          |           |
| Minimum period (ns)                           | 2.432   | 2.101    | 1.140     |
| Minimum input arrival time before clock (ns)  | 1.912   | 1.499    | 1.302     |
| Maximum output required time after clock (ns) | 3.408   | 3.408    | 0.822     |
| Maximum combinational path delay (ns)         | 3.003   | 3.003    | 0.145     |
| Total real time to XST completion (s)         | 18.38   | 17.42    | 8.02      |
| Total CPU time to XST completion (s)          | 19.06   | 13.24    | 8.13      |
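The relative savings implied by Table 1 can be computed directly from the tabulated values. The short script below is only a convenience for the reader: the selection of rows and the helper names are ours, and the power figures are used exactly as listed in the table, whose units are not stated.

```python
from typing import Dict, Tuple

# Values copied from Table 1, ordered (PT-MAC, I-PT-MAC, LT-PT-MAC).
TABLE1: Dict[str, Tuple[float, float, float]] = {
    "slice registers": (228, 119, 96),
    "slice LUTs": (534, 421, 86),
    "logic power": (0.926, 0.848, 0.712),
    "IO power": (84.6816, 84.6816, 65.4896),
    "minimum period (ns)": (2.432, 2.101, 1.140),
}


def reduction_percent(baseline: float, value: float) -> float:
    """Relative reduction of value with respect to baseline, in percent."""
    return 100.0 * (baseline - value) / baseline


def summarize(table: Dict[str, Tuple[float, float, float]]) -> Dict[str, Tuple[float, float]]:
    """Return {metric: (I-PT-MAC %, LT-PT-MAC %)} reductions versus PT-MAC."""
    return {
        name: (reduction_percent(pt, i_pt), reduction_percent(pt, lt_pt))
        for name, (pt, i_pt, lt_pt) in table.items()
    }


if __name__ == "__main__":
    for name, (r_i, r_lt) in summarize(TABLE1).items():
        print(f"{name:22s} I-PT-MAC: {r_i:5.1f}%   LT-PT-MAC: {r_lt:5.1f}%")
```

For instance, going from 228 to 96 slice registers corresponds to a reduction of about 57.9%, and going from a minimum period of 2.432 ns to 1.140 ns corresponds to about 53.1%.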

From the experimental analysis, results were obtained on area, number of circuits, power, energy, time and speed for the three architectures. The results are compared in the bar charts that follow.



Figure 2: Area of Slice register analysis

PT-MAC requires the largest area in both slice registers and slice LUTs, whereas LT-PT-MAC requires the smallest. According to Table 1, the number of slice LUTs exceeds the number of slice registers in PT-MAC and I-PT-MAC, but in LT-PT-MAC the number of slice registers (96) is slightly higher than the number of slice LUTs (86).



Figure 3: Adder/subtraction Multiplier and register analysis

PT-MAC requires more adders/subtractors and registers than the other two architectures. According to Table 1, it also uses the most latches (29) and comparators (8), while LT-PT-MAC uses the fewest of each (11 latches and 3 comparators). Only in PT-MAC does the number of registers (36) exceed the number of adders/subtractors, latches and comparators; in I-PT-MAC and LT-PT-MAC the adders/subtractors outnumber the registers.



Figure 4: Multiplexer and Tristates analysis



Figure 5: Multiplexer and register analysis

Among the three architectures, PT-MAC needs the most multiplexers. There are no tri-states in I-PT-MAC and LT-PT-MAC. The same number of XORs is used in PT-MAC and I-PT-MAC, while LT-PT-MAC has no XORs in its architecture.

Comparing the required numbers of multiplexers and registers in the advanced HDL synthesis report shows that PT-MAC and I-PT-MAC use more registers than multiplexers, whereas LT-PT-MAC uses fewer registers (34) than multiplexers (43).





More XORs than multiplexers are used in PT-MAC and I-PT-MAC.


Although LT-PT-MAC has no XORs, it still uses a small number of multiplexers.



Figure 7: Counter and comparator analysis



Figure 8: Power analysis of Logic power

The counter and comparator analysis shows that PT-MAC uses fewer comparators than counters, whereas I-PT-MAC uses more comparators than counters. LT-PT-MAC uses one of each, so the number of circuits required is lowest in LT-PT-MAC.



Figure 9: Power analysis of IOs

Considering the logic power, the highest value corresponds to PT-MAC and the lowest value corresponds to LT-PT-MAC.



Figure 10: Delay analysis

PT-MAC and I-PT-MAC use equal IO power, while the IO power of LT-PT-MAC is lower. Among the three architectures, LT-PT-MAC has the lowest logic power and the lowest IO power.

From the graph it can be seen that the minimum period, minimum input arrival time, maximum output required time and maximum combinational path delay are all lowest for LT-PT-MAC. This shows that the delay is very low in LT-PT-MAC.



Figure 11: Time and Speed analysis

The total real time and CPU time needed for XST completion are highest for PT-MAC and lowest for LT-PT-MAC.

#### CONCLUSION

The use of the advanced Razor technique on a PT-MAC structure has been tested at post-synthesis simulation level to study the effects and interactions of both energy-minimizing techniques on a previously tested DSP design. The timing and power effects of VOS with error correction, combined with the application of PT-MAC, resulted in significant power savings. The paper also briefly describes a test pattern generation procedure that produces a compact low-power skewed-load test set by merging skewed-load test cubes derived from functional broadside tests. Such test cubes create functional operation conditions in sub-circuits around the sites of detected faults. These conditions are preserved when a test cube is merged with other test cubes. For testable circuits, it is favourable to use lower values in order to benefit from higher levels of test compaction. Test cube merging was implemented in a way that ensures that the fault coverage of the final test set is not limited by the fault coverage of functional broadside tests.

Thus, the proposed method shows that the delay-modulation properties of PT-MAC and BIST using testable circuits can be exploited to reduce the energy consumption of fault-tolerant DSP architectures in which multipliers lie on the critical path of the circuit.

#### REFERENCES

- [1] R. Hegde & N. R. Shanbhag. "Soft digital signal processing." Very Large Scale Integration (VLSI) Systems, Vol. 9, No. 6, pp: 813-823, 2001.
- [2] B.Shim, S. R. Sridhara, & N. R. Shanbhag. "Reliable low-power digital signal processing via reduced precision redundancy." Very Large Scale Integration (VLSI) Systems, Vol. 12, no. 5, pp: 497-510,2004.
- [3] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler. "Razor: A low-power pipeline based on circuit-level timing speculation." Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003, pp. 7-18.
- [4] S. Das, C. Tokunaga, S. Pant, M.Wei-Hsiang, S. Kalaiselvan, K. Lai, D.M. Bull & D.T. Blaauw. "RazorII: In situ error detection and correction for PVT and SER tolerance." Solid-State Circuits Vol.44, No. 1, pp: 32-48, 2009.
- [5] W.N.Paul, S.Das & D.M. Bull. "A low-power 1-ghz razor fir accelerator with time-borrow tracking pipeline and approximate error correction in 65-nm cmos." Solid-State Circuits, Vol.49, no. 1, pp: 84-94,2014.
- [6] Kidambi, S.Sunder, F.E.Guibaly & A.Antoniou. "Area-efficient multipliers for digital signal processing applications." Analog and Digital Signal Processing, Vol. 43, No. 2, pp: 90-95, 1996.
- [7] Garofalo, Valeria, N. Petra, D. D. Caro, A. M. Strollo, and E. Napoli. "Low error truncated multipliers for DSP applications." In Electronics, Circuits and Systems, 15th IEEE International Conference on, pp. 29-32. 2008.
- [8] Tu, J. Hao & L. D. Van. "Power-efficient pipelined reconfigurable fixed-width Baugh-Wooley multipliers." Transactions on computers, Vol.58, No. 10, pp: 1346-1355, 2009.
- [9] Kuang, S. Rong, & J. P. Wang. "Design of power-efficient configurable booth multiplier." IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 57, No. 3, pp: 568-580, 2010.
- [10] D.L.GuiaSolaz, Manuel & R. Conway. "Comparative study on wordlength reduction and truncation for low power multipliers." In MIPRO, Proceedings of the 33rd International Convention, pp. 84-88, 2010.
- [11] Petra, Nicola, D. D. Caro, V. Garofalo, E. Napoli & A.G. Strollo. "Truncated binary multipliers with variable correction and minimum mean square error." Transactions on Circuits and Systems I: Regular Papers, Vol. 57, No. 6, pp: 1312-1325, 2010.

- [12] D. L. GuiaSolaz, Manuel, W. Han & R. Conway. "A flexible low power DSP with a programmable truncated multiplier." IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 59, No. 11, pp: 2555-2568, 2010.
- [13] Chandrakasan, P. Anantha , M. Potkonjak, R. Mehra, J. Rabaey, and R. W. Brodersen. "Optimizing power using transformations." Computer-Aided Design of Integrated Circuits and Systems 14, no. 1, pp: 12-31, 1995.
- [14] Sakurai, Takayasu, and A. R. Newton. "Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas." Solid-state circuits, Vol. 25, No. 2, pp: 584-594, 1990.
- [15] Fojtik, Matthew, D. Fick, Y. Kim, N. Pinckney, D. Harris, D. Blaauw, & D. Sylvester. "Bubble Razor: An architecture-independent approach to timing-error detection and correction." International Solid-State Circuits Conference, pp. 488-490, 2012.
- [16] I. Pomeranz. "Low-power test generation by merging of functional broadside test cubes." IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 22, No. 7, pp: 1570-1582, 2007.
- [17] I. Pomeranz. "Low-power skewed-load tests based on functional broadside tests". ACM Transactions on Design Automation of Electronic Systems (TODAES), Vol.19, No.2, pp.18, 2014
- [18] I. Pomeranz. "Static test compaction for delay fault test sets consisting of broadside and skewed-load tests." 29th VLSI Test Symposium, 2011.
- [19] D. L. GuiaSolaz & R. Conway. "Razor based programmable truncated multiply and accumulate, energy-reduction for efficient digital signal processing". Very Large Scale Integration (VLSI) Systems, Vol.23, No.1 pp.189-193, 2015.
- [20] R. Karthick & M. Sundararajan. "A Reconfigurable method for Time correlated MIMO channels with decision Feedback Receiver." International Journal of Applied Engineering Research, ISSN 0973-4562, Vol. 12, No. 15, pp: 5234-5241, 2017.