A Dependable High Performance Wafer Scale Architecture for Embedded Signal Processing

A high performance, programmable, floating point multiprocessor architecture has been specifically designed to exploit advanced two- and three-dimensional hybrid wafer scale packaging to achieve low size, weight, and power, and to improve reliability for embedded systems applications. Processing elements composed of a 0.8 micron CMOS dual processor chip and commercial synchronous SRAMs achieve more than 100 MFLOPS/Watt. This power efficiency allows up to 32 processing elements to be incorporated into a single 3D multichip module, eliminating multiple discrete packages and thousands of wirebonds. The dual processor chip can dynamically switch between independent processing, watchdog checking, and coprocessing modes. A flat SRAM memory provides predictable instruction set timing and independent and accurate performance prediction.
Index Terms: Computer reliability, embedded processing, wafer scale integration, parallel architectures, memory hierarchies, high-speed integrated circuits.


INTRODUCTION
The Wafer Scale Signal Processor (WSSP) is a high performance architecture for embedded computing and signal processing designed for fault tolerance, power efficient floating point performance (measured in MFLOPS/Watt), and low size, weight, and power (SWAP). Computing elements of the WSSP combine commercial synchronous SRAMs with a dual processor chip, using a multichip module (MCM) technology supporting area interconnect, such as hybrid wafer scale integration [1]. The WSSP computing element is sufficiently compact and power efficient to allow as many as eight computing elements to be placed within a single layer MCM. Multiple MCMs are interconnected with commercial PCI [2] buses to form a hierarchical multiprocessor environment.
Dependability, low size, weight, and power, and high sustained performance are all important to embedded signal processing applications. The metric of MFLOPS/Watt is important to floating point intensive applications, such as signal processing, embedded military applications, and supercomputing, but not as important to the broader marketplace of data and desktop processing, where many important applications involve little floating point arithmetic. Studies of instruction mixes averaged across data processing applications show floating point instructions consume less than 5 percent of the overall instruction mix [3]. Data processors seek the highest possible performance on these mixes. However, these mixes are not representative of signal processing and scientific computing application spaces.
This paper first defines the architecture of the single processing element in terms of memory, processor, and input/output capabilities and features for reliability. Next, the impact of wafer scale packaging is discussed. Wafer scale multichip modules are used to densely and reliably package numerous processing elements. Finally, the software development environment is described.

PROCESSING ELEMENT ARCHITECTURE
A processing element consists of a dual processor chip capable of eight single precision floating point operations per clock cycle and two variable depth banks of commercial synchronous SRAMs. Fig. 1 shows an example of a processing element with 16 SRAMs before interconnecting the chips with hybrid wafer scale integration. The two processors, denoted A and B, can each access either memory bank, as shown in Fig. 2. Each of the processors on the dual processor chip has a 72 bit (64 bits plus byte parity) bus to memory. The dual processor chip also includes an interface to an external 72 bit I/O bus supporting direct memory access to and from the element memory.
Each memory bank can be up to four layers deep, with each layer 72 bits wide. In the element of Fig. 1, Memory Bank A is built from eight commercially available 2 Mbit synchronous SRAM dice, each organized as 64K × 36, which together form a bank organized as 256K × 72. Another MCM fits two processing elements into the same area, but each with only a single layer of four memory chips. As SRAM densities continue to improve, the depth of each memory layer can expand up to 512K (32 bit) words, giving a maximum memory configuration of 2M words, or 16 megabytes per processor.
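The bank geometry of Fig. 1 follows from simple arithmetic, sketched below. The function name is illustrative, and the arrangement of two 36 bit dice side by side in each of four layers is an assumption inferred from the 72 bit bank width and the eight dice per bank; it is not stated explicitly in the text.

```python
def bank_geometry(die_words, die_bits, dice_per_layer, layers):
    """Return (words, width_bits, data_bytes) for one memory bank.

    Assumes each layer places `dice_per_layer` dice side by side to widen
    the word, and that layers stack to deepen the address space.
    One bit in nine is byte parity, so 8/9 of the width carries data.
    """
    words = die_words * layers
    width_bits = die_bits * dice_per_layer
    data_bits = width_bits * 8 // 9
    data_bytes = words * data_bits // 8
    return words, width_bits, data_bytes


K = 1024
words, width, data_bytes = bank_geometry(64 * K, 36, dice_per_layer=2, layers=4)
print(f"{words // K}K x {width}, {data_bytes // (K * K)} MiB of data per bank")
```

Under these assumptions the eight dice of Memory Bank A give the 256K × 72 organization quoted above.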
Memory has become a driving factor in terms of both power dissipation and area for embedded signal processing applications. This is particularly true where size, weight, and power are very costly, such as in satellites. Note the relative proportion of surface area devoted to the processor and memory in Fig. 1. To avoid memory dominating the surface area of the WSSP, an innovation was devised whereby short stacks of thinned memory chips, with an overall thickness approximating that of a single chip, could multiply the memory density by factors of four to eight, yet keep element size small, less than one square inch, with the additional benefit of reducing the capacitance of processor to memory interconnects [4]. This increased density improves speed and power dissipation by reducing interconnect parasitics, while greatly improving size and weight. In the context of Fig. 1, two short stacks, each with four 64K × 36 components, can be used to form each memory bank. Since the stacked chips are "layers" of memory, at most one is selected and actively dissipating power at any time. This controls the watts dissipated per square inch, which is critical to allowing these multichip modules to be stacked in turn and yet cooled from the top or bottom surface [5].
Perhaps the most important feature of the element architecture from a user's viewpoint is its return to flat, SRAM memory. There is no cache within the processor, which, when coupled with the absence of DRAM, yields a simplified memory hierarchy with predictable timing and a simplified processor. However, large DRAM stores can be placed on the interelement bus and accessed with message passing. Long vector lengths and nonunit stride memory access patterns typically employed by signal processing applications present several difficulties to a DRAM/cache based memory architecture. The long vector lengths cause data to fall out of the cache before they are reused. The nonunit stride memory access patterns cause entire cache lines to be loaded, even though only a single datum in the line may be used, and operations on vectors located in different portions of memory may severely reduce performance due to DRAM page opening penalties. As a result, signal processing applications requiring high performance on DRAM/cache based architectures often require explicit management of data movement into and out of the cache [6].
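The cost of nonunit stride access on a cache based machine can be illustrated with a small model. This is a sketch under stated assumptions, not a description of any particular cache: the function name, the 8 word line size, and the idealized policy (each distinct line touched is fetched exactly once, in full) are all hypothetical.

```python
def cache_line_utilization(n_elements, stride_words, line_words):
    """Fraction of the words fetched into the cache that a strided loop uses.

    Idealized model: every distinct cache line touched by the loop is
    fetched once in full, and each element is used exactly once.
    """
    touched_lines = {(i * stride_words) // line_words for i in range(n_elements)}
    fetched_words = len(touched_lines) * line_words
    return n_elements / fetched_words


LINE_WORDS = 8  # hypothetical cache line size, in words
for stride in (1, 2, 8, 64):
    used = cache_line_utilization(1024, stride, LINE_WORDS)
    print(f"stride {stride:2d}: {used:.3f} of fetched words are used")
```

In this model, once the stride reaches the line size only one word per fetched line is used, which is the effect the paragraph describes. A flat SRAM has no line fill, so it avoids this overhead by construction.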
The dual processor chip devotes approximately 30 percent of the silicon area to the floating point units. From a perspective of MFLOPS/Watt, the remainder represents overhead to provide programmability, reliability, multiprocessor scalability, and high sustained performance. Each of the two processors has a floating point ALU and a floating point multiplier, each of which can perform either two IEEE single precision operations per clock cycle or one IEEE double precision operation per clock cycle. Supporting both single and double precision reaches beyond signal processing applications to address some traditional supercomputer applications (e.g., computational fluid dynamics) that require 64 bit precision.
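The peak operation counts per element follow directly from the unit counts in this paragraph, as the short calculation below shows. The constant and function names are illustrative, and the 50 MHz clock in the usage line is a hypothetical value chosen only to show the conversion to MFLOPS; the text does not give a clock frequency.

```python
PROCESSORS_PER_ELEMENT = 2   # processors A and B on the dual processor chip
FP_UNITS_PER_PROCESSOR = 2   # one floating point ALU and one multiplier
OPS_PER_UNIT_PER_CLOCK = {"single": 2, "double": 1}


def peak_ops_per_clock(precision):
    """Peak floating point operations per clock cycle for one element."""
    return (PROCESSORS_PER_ELEMENT
            * FP_UNITS_PER_PROCESSOR
            * OPS_PER_UNIT_PER_CLOCK[precision])


def peak_mflops(precision, clock_mhz):
    """Peak MFLOPS for one element at an assumed clock frequency in MHz."""
    return peak_ops_per_clock(precision) * clock_mhz


print(peak_ops_per_clock("single"), peak_ops_per_clock("double"))
print(peak_mflops("single", clock_mhz=50))  # 50 MHz is a hypothetical clock
```

The result is eight single precision or four double precision operations per clock cycle per element.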
A superscalar-vector assembly language instruction set has been written into the Very Long Instruction Word (VLIW) ROM control store. The scalar portion of this assembly language instruction set resembles a typical RISC load/store instruction set. The addition of vector instructions, such as Fast Fourier Transform (FFT), dot products, and vector add/subtract/multiply, allows the processors to fully utilize the on-chip computing resources. The VLIW approach improves the power efficiency of the processor and provides the fine grain control necessary to orchestrate addressing and arithmetic operations.
Each processing element can operate in any of three modes: independent mode, coprocessor mode, and watchdog mode. In independent mode, processors A and B operate completely independently on separate programs, each with its own private memory bank. In coprocessor mode, processors A and B work in a client-server relationship to finish a single assignment, such as a long FFT, with lower latency. In this case, they communicate and share data and results through shared memory access. In watchdog mode, also known as a self-checking pair configuration, both processors perform exactly the same calculations with lockstep timing, and compare results each clock cycle, to detect errors. If the watchdog processor detects a discrepancy, it will interrupt the "active" processor and halt.
Since the mode of operation is not determined in hardware, but left for software control, the processing elements can change between the modes of operation as the applications are executed. Thus, the processor can move from independent to coprocessor mode when latency becomes critical. Alternatively, the processing element can transition into and out of watchdog mode when critical sections of code are executed with concurrent error detection, similar to the V.cmp [7]. However, unlike voting based error correcting architectures, this approach inserts no logic into the critical path of the processing chain and does not require additional buffering or reduce the operating frequency. Switching modes requires less than 100 clock cycles, similar to a context switch.
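The watchdog (self-checking pair) behavior can be pictured with a small software model. This is only an illustrative sketch: the real comparison is done in hardware each clock cycle, whereas here two ordinary functions stand in for processors A and B and the comparison happens after each step. All names are hypothetical.

```python
class WatchdogMismatch(Exception):
    """Raised when the two lockstep results disagree."""


def run_self_checking_pair(step_a, step_b, state, n_steps):
    """Run two step functions in lockstep, comparing results after every step.

    step_a and step_b stand in for processors A and B; each maps a state to
    the next state. A disagreement halts the run, detecting (not correcting)
    the error.
    """
    state_a = state_b = state
    for step in range(n_steps):
        state_a = step_a(state_a)
        state_b = step_b(state_b)
        if state_a != state_b:
            raise WatchdogMismatch(f"results diverged at step {step}")
    return state_a


def accumulate(x):
    return x + 3


def faulty_accumulate(x):
    # Inject a single fault when the running value reaches 6.
    return x + 4 if x == 6 else x + 3


print(run_self_checking_pair(accumulate, accumulate, 0, 5))
try:
    run_self_checking_pair(accumulate, faulty_accumulate, 0, 5)
except WatchdogMismatch as err:
    print("halted:", err)
```

As in the text, the pair only detects a discrepancy and stops; no voting logic sits in the computation path.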
The processing elements within the MCM communicate with one another over a multiprocessing bus called the IOBUS (as shown in Fig. 3). The IOBUS is a derivative of the FutureBus+ standard [8] and the PCI local bus [2]. The FutureBus+ low level messaging format is used by the IOBUS; however, at the signaling level, the IOBUS is synchronous, more like the PCI bus. The IOBUS is 64 bits wide with byte parity checking, and it operates at the speed of the processors. The IOBUS may transfer one 64 bit word per clock cycle.
Transactions on the IOBUS take precedence over the computation on the processors. This is accomplished without delay on the IOBUS by inhibiting the clocks of the processors on the receiving element. The logic throughout the dual processor chip avoids dynamic logic to allow the processor clocks to be disabled for indefinite periods of time without losing state information. The result is that an N-word write within the MCM requires only N + 2 cycles to complete from the clock cycle the bus is granted.
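The N + 2 cycle figure implies that the fixed two cycle overhead is quickly amortized for longer transfers, as the sketch below shows. The function names are illustrative, and the 50 MHz bus clock in the usage line is a hypothetical value; the text states only that the bus runs at the processor clock rate.

```python
def write_cycles(n_words):
    """Cycles for an N-word write, counted from the cycle the bus is granted."""
    return n_words + 2


def bus_efficiency(n_words):
    """Fraction of the transfer's cycles that carry a data word."""
    return n_words / write_cycles(n_words)


def effective_mbytes_per_s(n_words, clock_mhz):
    """Sustained data rate for back-to-back N-word writes.

    Each 64 bit word carries 8 data bytes; parity bits are not counted.
    """
    return 8 * clock_mhz * bus_efficiency(n_words)


for n in (2, 14, 62):
    print(n, write_cycles(n), round(bus_efficiency(n), 3))
print(effective_mbytes_per_s(62, clock_mhz=50))  # 50 MHz is hypothetical
```

A 2 word write uses the bus at 50 percent efficiency, while a 62 word write exceeds 96 percent.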
The external PCI interface to the MCM can be implemented as an FPGA or ASIC depending on speed and logic requirements. This interface device can also connect to local DRAM, as shown in Fig. 3. A set of eight state-of-the-art DRAMs can be attached to a controller within the MCM interface for a reasonable percentage of the overall MCM area. This overhead can be further reduced by using two short stacks of DRAM in lieu of eight discrete components.

WSI AND PACKAGING FOR RELIABLE EMBEDDED SYSTEMS
The challenge for dense packaging approaches has traditionally been to get the power into the MCM and the heat out. While approaches to three-dimensional packaging were demonstrated in the early 1990s [9], [10], a power efficient processor is the key to exploiting these packaging advances. Multichip module technology is ideal for packaging many processors and their memories into one package. Area high density interconnect (HDI) allows bare dice to be positioned very close together. This can be achieved because of the fine lithography of the interconnect, the lack of wire bonding, and the elimination of individual packages.
Dense packing is critical for embedded applications where size, weight, and power are constrained. A floating point intensive embedded application with a fixed computational requirement and size constraint necessarily implies a minimum acceptable MFLOPS/in³ computational density. A rack mounted system typical of many embedded systems consists of a number of boards, each with one or more processors. The application's MFLOPS requirement divided by the sustainable MFLOPS on a board determines the minimum number of boards required in the system. Reducing the number of boards in a system reduces size and weight and, typically, increases its reliability, since the connector between the board and the motherboard is a frequent failure mechanism in a rack mount chassis.
The number of processing elements on a board may be limited by either the board's power budget or its available circuit area. Assume a board with available circuit area A and power budget P (excluding the power and area required for I/O circuitry), and that each processing element consumes W Watts and occupies S area (chips plus routing area). The maximum number of processors on the board is $N = \min\left(\frac{P}{W}, \frac{A}{S}\right)$. The total performance is equal to the number of processing elements (N) times the MFLOPS sustained by an element (M).
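The sizing rule in this paragraph (a board holds as many elements as the tighter of its power and area limits allows, and performance scales with that count) can be sketched as follows. The function names are illustrative, and rounding each quotient down to a whole element is an added assumption. In the usage lines, 2.5 Watts and 1 square inch per element are taken from the WSSP figures quoted in this section, while the 60 Watt budget, 40 square inch area, and 200 sustained MFLOPS per element are hypothetical values.

```python
import math


def max_elements(power_budget_w, board_area_in2, watts_per_element, area_per_element_in2):
    """N = min(P / W, A / S), with each quotient rounded down to whole elements."""
    by_power = math.floor(power_budget_w / watts_per_element)
    by_area = math.floor(board_area_in2 / area_per_element_in2)
    return min(by_power, by_area)


def board_mflops(n_elements, sustained_mflops_per_element):
    """Total sustained performance: N elements times M MFLOPS each."""
    return n_elements * sustained_mflops_per_element


# Hypothetical board: P = 60 W and A = 40 in^2 available for processing elements.
n = max_elements(60, 40, watts_per_element=2.5, area_per_element_in2=1.0)
print(n, board_mflops(n, sustained_mflops_per_element=200))
```

With these illustrative numbers the board is power limited: the power budget admits 24 elements even though the area would hold 40.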
Multichip module technology can reduce the area of a processing element by a factor of 10, but can dissipate only limited power. An MCM may only accommodate one or two 30 Watt data processors. However, eight WSSP elements dissipate only 20 Watts, and, at only 1 in² per element, all eight elements fit on a single MCM layer. Therefore, far more processors can fit in a given area, increasing the number of processors on the board.
In addition to a single layer, planar arrangement of dice, multiple layers can be stacked on top of one another to yield even higher computational density. Effectively stacking multiple layers of processors and memories requires controlling the Watts/in² to overcome the thermal resistance of the stack. At only 2.5 Watts/in² (20 Watts per layer), a four layer stack of eight element MCMs dissipates 80 Watts and occupies the board area of a single MCM. The low power architecture that permits the integration of so many processors onto an MCM brings more of the interconnect structure into the MCM, thus minimizing board and backplane interconnections. This yields savings in size, weight, and power compared to a system of similar computational capability that cannot achieve as high a computational density. As discussed below, bringing the interconnections into the MCM also improves the reliability of the system.
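The stack figures quoted above can be checked with a few lines of arithmetic. The function and key names are illustrative. The 8 square inch layer area is inferred from 20 Watts per layer at 2.5 Watts per square inch; it is not stated directly. The heat flux per square inch of board footprint is a derived quantity added here for illustration.

```python
def stack_summary(elements_per_layer, watts_per_layer, layer_area_in2, layers):
    """Totals for a stack of identical MCM layers sharing one board footprint."""
    total_watts = watts_per_layer * layers
    return {
        "elements": elements_per_layer * layers,
        "total_watts": total_watts,
        "watts_per_in2_per_layer": watts_per_layer / layer_area_in2,
        # All heat leaves through the shared footprint (top or bottom surface).
        "watts_per_in2_of_footprint": total_watts / layer_area_in2,
    }


summary = stack_summary(elements_per_layer=8, watts_per_layer=20,
                        layer_area_in2=8.0, layers=4)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Under these assumptions a four layer stack holds 32 elements and dissipates 80 Watts, matching the figures in the text.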
The number of pads within an eight element MCM (processors and memories) exceeds 8,000, but the number of signal, power, and ground leads coming off the MCM with wire bonds is less than 400, a reduction of 20 to 1. Because wire bonds are susceptible to stress failure, reducing the number of wire bonds can improve the reliability of the system.
The area HDI allows signals on the interior of the chip to be brought out to the HDI where it is convenient to do so. The dual processor in Fig. 1 has three columns of pads down the center of the die. The outer columns are the data bits to and from the memory chips for the two memory banks. Because each processor can access either memory bank, the most convenient location for the pads is in between the processors. Using only pads around the periphery of the chip would require far more routing on the chip and its associated capacitance. The perimeter of the processor is insufficient to support the 458 pads on the chip.
The area HDI removes the constraint that the pads be at the periphery of the chip, because wire bonds are not used. This is particularly important in routing power. With the power available only at the periphery, the voltage rails at the center of the chip may collapse during periods of high current demand because of the inductance and resistance of the power network [11]. Area HDI allows the power signals to be distributed via a grid throughout the chip, and the copper metalization used in HDI routing is many times thicker than the aluminum metalization commonly used in VLSI. Because the HDI routing is thicker and may be much wider without sacrificing chip area, it provides lower impedance power distribution.
In addition to reducing board area required for processors and their memories, close die-to-die spacing and reduced capacitance due to fine lithography allow HDI to reduce the overall power dissipation. The lower power, in turn, may allow more chips to be packaged in one MCM because the power dissipation from driving interdie signals may be significant.
One drawback of relying upon area interconnect is that the bare dice are difficult to test with wafer probes. Therefore, the WSSP bare-die testing is performed using Joint Test Action Group (JTAG) [12] boundary scan and internal scan testing.
Hybrid wafer scale MCMs allow different types of components to be integrated into a single MCM. One of the most important components to place inside the MCM is decoupling capacitors to reduce simultaneous switching noise (see Fig. 1) [13]. The hybrid also allows chips to be fabricated in different processes, each optimized to the function and manufacturing requirements of that component. For example, a fabrication process for a microprocessor might emphasize fast transistors, while a DRAM process emphasizes area efficient capacitors. The hybrid wafer scale integration permits the optimal process to be used for each component without making a compromise. In addition, the yield may be increased because parts may be individually tested prior to packaging in the MCM. The WSSP uses commercial memories to reduce cost.
Embedded systems may be subjected to severe vibration, mechanical and thermal shock, and (in a military system) excessive continuous g-forces that can substantially reduce the reliability of the system. High density interconnect does not rely upon wire bonding to connect dice on a layer. The HDI interconnect has been shown [1] to survive 1,500 g drop shock, 100,000 g gas gun shock, 178,000 g centrifuge, and 85 percent relative humidity at high temperature and pressure.

SOFTWARE ENVIRONMENT
The WSSP processor is a fully programmable digital signal processor. Its highly optimized vector routines complement a general-purpose RISC-like assembly language. The supporting software includes the GNU tool set (C compiler, assembler, linker, and libraries), a graphical debugger, and a real-time, POSIX-compliant operating system. To access the high performance vector instructions, the Basic Linear Algebra Subprograms (BLAS) have been supplemented with additional common vector operations and encapsulated in C callable library functions for single and double precision arithmetic. The Message Passing Interface (MPI) library implements standard interprocessor communications.

CONCLUSIONS
The Wafer Scale Signal Processor provides unique support for dealing with the throughput, latency, and reliability challenges of embedded systems development. Optimized to the metric of MFLOPS/Watt, the small, low power processing elements are able to fully exploit recent advances in dense, two- and three-dimensional, hybrid wafer scale packaging. These packaging techniques have been shown to be extremely reliable in harsh environments.
Processing elements containing dual processors can dynamically switch between independent processing, concurrent error detection, and a coprocessing mode of operation to balance needs for throughput, reliability, and low latency.
To allow developers and tools to understand, model, and predict system performance more accurately, the dual processor chip eliminates cache and DRAM in favor of flat, SRAM memory banks providing predictable instruction timing. This tightens the worst case performance prediction critical to ensuring that real-time system deadlines are met.
Devoting 30 percent of processor area to floating point circuitry and sustaining eight single precision or four double precision IEEE floating point computations per clock cycle per element reflect the emphasis on floating point performance in the processor architecture and provide the computational engine for both low latency and high throughput.