Parallel Computation for Electronics Design

The use of parallel computational techniques for the design of vacuum electronic devices is presented with applications. The capabilities of parallel machines are discussed, with an overview of the Connection Machine environment at NRL. An overview of particle-in-cell numerical techniques is presented, with application to both one-dimensional and three-dimensional codes.


I. INTRODUCTION
DUE to the increasing complexity of vacuum electronics design, the Vacuum Electronics Branch at the Naval Research Laboratory has been developing an increased computational support base. The extension of this support to parallel architectures is in development on the NRL Connection Machine. A distinction is made between vector machines, such as the Cray series, and many-processor machines, such as Goodyear's Massively Parallel Processor and Thinking Machines' Connection Machine series.
The use of parallel computational processing for the study of the motion of electrons began before the use of digital computers, with Hartree's differential analyzer, an analog system. By dividing the workload among the processors, i.e., his students, the effective throughput was greatly increased. The use of computers to study the motion of electrons in vacuum electron devices dates back to work on magnetrons by Hartree and Buneman, and on the traveling-wave tube by Nordsieck, in the 1940's. These simulations were only able to include a small fraction of the system, with greatly constrained physics, but provided a significant improvement in our understanding of the nonlinear behavior of these devices. Today, available simulation tools are on the verge of becoming fully three dimensional with the complete set of Maxwell's equations satisfied. The computational resources required by these codes include a significant amount of computer memory, speed, and cost. On a Cray-class machine, with information organized in typically 64-bit words and computational speed measured in floating-point operations per second (flops), problems of interest with 10^5-10^7 operations require megawords (10^6 words) of memory and hundreds of millions of flops (Mflops), and can cost hundreds to thousands of dollars to provide answers within an acceptable time. These constraints upon resources arise from both the computer architectures and the available simulation codes that are constrained to work on those architectures. One possible approach to improving the simulation's usefulness in design is the implementation of three-dimensional, fully electromagnetic design codes on parallel-architecture machines.
In the following, the advances in parallel computer architecture are summarized and compared with the advances in vector architectures in Section II. A discussion of particle-in-cell (PIC) simulation techniques, with examples from a one-dimensional parallel electrostatic code, is presented in Section III. Section IV presents possible future directions in the use of parallel computational techniques for vacuum electronics design.

II. COMPUTER ARCHITECTURE ADVANCES
As one tries to improve the performance of the computational hardware upon which simulations are run, several problems appear immediately in the traditional serial-machine approach. Circuit switching speeds are limited by materials and fabrication techniques. The compactness of design required by the finite speed of signal propagation creates heat-dissipation difficulties. In practical devices, the rate at which heat can be removed places a limit on circuit density that, for a given material, is an important factor in determining the net computational speed available. These problems are largely avoided by using many less densely packed processors, substituting instead the problems of defining a solution methodology, resolving memory-access conflicts, and managing interprocessor communications.
The vendors in the vector supercomputer arena include Fujitsu, Hitachi, NEC, ETA, CDC, and Cray. Consider the advancements in the Cray systems, for example. Considering processor speeds alone, the improvement has been from roughly 10^8 flops in the Cray-1 to a projected 10^12 flops in the proposed Cray-4. The first Cray-1 [1] was delivered to Los Alamos Scientific Laboratory in early 1976. Its increase in speed over the then-current computers was made possible to a large extent by the inclusion of vector pipelining, a technique to keep the CPU busy. To further increase performance in the Cray-2, the removal of heat in the more compact geometry was addressed by immersing the boards in liquid refrigerant. The Cray-3 uses GaAs technology to improve switching speeds, as does the proposed Cray-4, which is to contain 64 processors. The vector supercomputers all use relatively few high-speed CPU's, but the instructions to each CPU are pipelined so that parallel-type instructions may be completed rapidly with bus-based communications.
The opposite approach is to marshal the power of many less powerful processors to attack the computations in parallel. A seminal paper for parallel architectures is "The SOLOMON Computer" by Slotnick et al. [2]. They proposed a design with 1024 processors under the control of a central control unit with a single instruction stream. The SOLOMON concept forms the basis of several parallel machines, including the ILLIAC IV [3], International Computers Ltd.'s Distributed Array Processor (ICL-DAP) [4], Goodyear Aerospace Corporation's Massively Parallel Processor (MPP) [5], and Thinking Machines Corporation's Connection Machine (CM) [6].
To orchestrate the computations among the processing elements, a communications network is established. To be useful, this network must be relatively efficient and cost effective. The approach of connecting all of the processors together directly is seen by most as prohibitively expensive. The ILLIAC IV computer was delivered to NASA Ames Research Center in 1972, but was not really operational until 1975, because the state-of-the-art techniques incorporated into its manufacture were too ambitious at the time. The delivered machine had 64 processing elements, with each processor connected to its nearest neighbors. Fig. 1 illustrates nearest-neighbor connectivity with a two-dimensional grid. The ICL-DAP was introduced in the 1970's with 4096 processors in a 64 by 64 array with nearest-neighbor communication. The Goodyear MPP, with 16 384 simple processors, each with 1 kbit of local memory, also uses nearest-neighbor communication. A compromise between the direct connection of all processors and the nearest-neighbor approach is to connect each of 2^n processors to n others. This Boolean n-cube, or hypercube, connectivity is more widely used and is the communication network used in the Connection Machine (CM). The Boolean n-cube connectivity is perhaps best illustrated recursively, with the connections per processor increased by one for every doubling of the total number of processors, as seen in Fig. 2.
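The hypercube addressing just described can be made concrete with a short sketch (not from the paper; the function name is ours): in an n-cube of 2^n processors, the neighbors of a processor are exactly the addresses that differ from it in one bit.

```python
# Sketch of Boolean n-cube connectivity: in a machine of 2**n processors,
# processor p is wired to the n processors whose binary addresses differ
# from p's in exactly one bit, so every doubling of the machine adds one
# link per processor.

def hypercube_neighbors(p: int, n: int) -> list[int]:
    """Addresses directly connected to processor p in an n-cube."""
    return [p ^ (1 << i) for i in range(n)]

# A 3-cube has 8 processors with 3 links each.
assert hypercube_neighbors(0, 3) == [1, 2, 4]
assert all(len(hypercube_neighbors(p, 3)) == 3 for p in range(8))
```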
The CM series has up to 65 536 bit-sliced processors, each with 64 kbit of local memory, with reconfigurable data routing among the hypercube connection paths as well as nearest neighbors. A bit-sliced processor works with data 1 bit at a time rather than with words of, for example, 64 bits as in a Cray. The design complexity and cost of such a processing element can thus be relatively small, a necessary attribute for machines with so many of them. Well-established techniques were used for the construction of the CM, avoiding the ILLIAC IV difficulties.
A comparison between a typical coarse division of computation, the two-processor Cray X-MP/24, and the most finely divided machine, the CM-2, may be instructive. The Cray performance ranges from about 5 Mflops for an unvectorized code to around 480 Mflops for highly vectorized cases. The CM-2 peaks at 10 000 Mflops for an essentially fully parallel dot product. The word size on the X-MP is 64 bits, while the CM-2 uses a bit-sliced approach, although 64-bit floating-point coprocessing chips can be added. The X-MP/24 has 4 Mwords of 64-bit memory, and the CM-2 can have the equivalent of 64 Mwords of 64-bit memory, distributed among its bit-sliced processors. The cost of these machines is roughly $8.5 million for the X-MP/24 and roughly $1 million for the CM-2.
The cost per instruction per second on the CM-2 approaches 0.1 cent, compared with about 8.5 cents for the X-MP/24. For FFT calculations on NRL's CM-1 and X-MP/24, the crossover point in CPU performance comes at about 4000 processors operating simultaneously [7]. The ability to develop algorithmic techniques to distribute the computational load (load balancing) and to access memory efficiently across processors determines the applicability of such architectures.

III. PARTICLE-IN-CELL SIMULATION TECHNIQUE
The particle-in-cell (PIC) method [8] is particularly effective for vacuum electronic device simulation because the fundamental Maxwell and Newton-Lorentz equations are used to calculate the particle-field interactions self-consistently. The principal bottleneck for sequential computer architectures (including vector architectures) is the massive number of particles required for realistic PIC simulations. The use of multiple CPU's to perform particle calculations simultaneously could alleviate this difficulty. Buzbee [9] examined fusion-directed particle codes and found a large degree of parallelism. The parallel-architecture CHI system of Dawson et al. [10] showed that a significant performance improvement at reasonable cost could be achieved in practice with a parallel PIC model. More recently, in a hypercube geometry, Liewer et al. [11] implemented a one-dimensional electrostatic PIC code on the JPL Mark III Hypercube. The particle calculations were done in parallel, with an efficiency (speedup/number of processors) for the particle push of 84 percent, resulting in an overall code performance of 1/7 the speed of a Cray-2 for a model problem.
The principal algorithmic areas in a PIC simulation code are: 1) conversion of particle quantities, such as charge and current, into quantities defined on a computational grid; 2) solution of the field quantities at these grid positions; 3) interpolation of these fields back to the particle locations, so that 4) the particles may be advanced forward in time.
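As an illustration, the four areas can be sketched as one serial time step of a 1-D electrostatic PIC cycle (a sketch under our own assumptions: periodic unit-spaced grid, nearest-grid-point weighting, unit mass, FFT field solve; the paper's actual code is parallel *Lisp):

```python
import numpy as np

def pic_step(x, v, q, ng, dt=0.1):
    """One 1-D electrostatic PIC step on a periodic grid of ng unit cells."""
    # 1) particle -> grid: accumulate charge with nearest-grid-point weighting
    cells = np.trunc(x).astype(int) % ng
    rho = np.zeros(ng)
    np.add.at(rho, cells, q)
    # 2) field solve: d2(phi)/dx2 = -rho via FFT (the k = 0 mode is dropped)
    k = 2.0 * np.pi * np.fft.fftfreq(ng)
    rho_k = np.fft.fft(rho)
    phi_k = np.zeros_like(rho_k)
    phi_k[1:] = rho_k[1:] / k[1:] ** 2
    phi = np.real(np.fft.ifft(phi_k))
    e_grid = -(np.roll(phi, -1) - np.roll(phi, 1)) / 2.0  # E = -d(phi)/dx
    # 3) grid -> particle: NGP gather of the field at each particle's cell
    e_part = e_grid[cells]
    # 4) advance the particles forward in time
    v = v + q * e_part * dt
    x = (x + v * dt) % ng
    return x, v, rho
```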
With the advent of grid-based particle simulation techniques, the number of operations required to advance the system at each time step went from approximately N_particle^2 to approximately N_grid log N_grid. For parallel architectures, when the communication pattern is limited, as in the nearest-neighbor scheme, extreme programming care is required, since the interprocessor communication time can easily eliminate any gains from distributing the computation over multiple processors. Alternatively, one could dispense with the grid entirely and use discrete Fourier transform techniques, with a resulting operation count of approximately N_particle N_k for N_k Fourier modes. Since N_particle >> N_grid, the gridless system requires more operations, but if each particle can be assigned to its own processor, the operation count drops to approximately N_k. When connections are more readily established between CPU's, the grid-based approach becomes attractive again, as the grid is the element with which the geometry of the device can be mapped onto the computer.
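The ordering of these operation counts can be checked for representative sizes (the numbers below are hypothetical, chosen only to illustrate the scalings):

```python
import math

# Hypothetical problem sizes, for illustration only.
n_particle, n_grid, n_k = 100_000, 1_024, 1_024

direct   = n_particle ** 2             # direct particle-particle interactions
gridded  = n_grid * math.log2(n_grid)  # grid-based (FFT) field solve
gridless = n_particle * n_k            # discrete Fourier modes, no grid

# With n_particle >> n_grid, the gridless count sits between the other two.
assert gridded < gridless < direct
```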
On traditional serial and vector machines, the codes are often dominated by the particle calculations. When the number of processors becomes of the order of the number of particles, this is completely reversed: the bottleneck becomes the translation between particle and grid quantities. Thus, areas 1 and 3 above, which depend heavily upon the details of memory access, define the new bottleneck. In area 2, the solution for the field quantities on the grid may be implemented with each grid node processed by a different CPU using a local formulation of the relevant equations. Area 4 may also be treated as an algorithmically local task. Some first steps toward implementing a vacuum electronic device design code were accomplished with the construction of a simple one-dimensional electrostatic code that addresses these four areas.
The one-dimensional electrostatic code developed for the Connection Machine environment was written in a variant of Common Lisp denoted "*Lisp", with a Symbolics 3600-series Lisp machine front end. The user interface resides on this front end, and the serial computations are done there. (The Connection Machine was originally envisioned as a computer for parallel artificial-intelligence applications, so Lisp was the first high-level language for it.) Some of the most powerful constructs on the CM are involved in the distribution of information from one parallel variable into another. A parallel variable, such as the set of x positions for all particles, is distributed among the processors, with each particle's attribute (x position) at one processor (multiplied by the number of virtual-processor layers if virtual processors are used). We examine the four areas, starting with the particle-to-grid gathering. For each processing element that has a particle associated with it, that processor is turned on for the operation of selecting that particle's charge and position. The particle's position is the parameter that determines which processor receives the charge of that particle into its charge-accumulation grid variable. Although the target code was to use linear interpolation in each dimension, the feasibility of the overall approach was first examined using a zeroth-order scheme. For the zeroth-order nearest-grid-point (NGP) interpolation scheme on a unit-spaced grid, this all occurs in one line of *Lisp code: (*pset :add (!! charge) charge-density (truncate!! x-particle)).
An explanation of the various elements in this statement is as follows. The "*pset" is a *Lisp function that takes one parallel variable (pvar) and writes it into another. The ":add" qualifier to "*pset" indicates that the values written into the destination pvar are to be added to the values already in place. The value pvar "(!! charge)" is composed of the parallelizing operator !! operating on the normal variable charge; this forms a pvar with the value "charge" at each processor. The uniform syntactic form between data and program in Lisp allows the substitution of this program, i.e., distribute charge into a pvar, for data, i.e., a value pvar. The destination pvar, "charge-density", is written into based upon the address pvar "(truncate!! x-particle)". The value pvar is written into the destination pvar of the processors addressed by the address pvar. Each processor can contain elements of many pvars, so that processors can contain both particle elements and grid elements, one or more particles and/or grid nodes at each processor.
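A serial Python analogue of that one-line deposit may clarify it (our sketch, not the author's code; np.add.at plays the role of *pset with the :add qualifier, and integer truncation of the position supplies the address):

```python
import numpy as np

def deposit_ngp(x_particle, charge, n_grid):
    """Serial analogue of (*pset :add (!! charge) charge-density
    (truncate!! x-particle)) on a unit-spaced periodic grid."""
    charge_density = np.zeros(n_grid)
    cells = np.trunc(x_particle).astype(int) % n_grid  # address = truncated x
    np.add.at(charge_density, cells, charge)           # additive scatter (:add)
    return charge_density

# Two particles land in cell 0 and one in cell 3.
rho = deposit_ngp(np.array([0.2, 0.7, 3.9]), 1.0, 4)
assert rho.tolist() == [2.0, 0.0, 0.0, 1.0]
```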
The process of scattering the field quantities back to the particles, area 3, is similar to the gathering step, area 1, and the corresponding *Lisp statement is only slightly more complicated. The "*set" is a *Lisp function that takes a pvar and transfers its contents to another pvar in a purely local sense, i.e., the element of the source pvar in processor n is transferred into the element of the destination pvar also in processor n. The "elx-particle" is the destination pvar. The source pvar is the remainder of the expression, and it is here that communication between processors is used. A finite-difference approximation to the electrostatic field from the electrostatic potential is used, with the particular CPU address locations referenced via "pref!!" located to the right and left of the self-address of the processor associated with the particle in NGP fashion via the "(truncate!! x-particle)" construct.
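Since the *Lisp statement itself did not survive in this copy, a serial sketch of what the text describes may help (our reconstruction, under the assumptions of a periodic unit-spaced grid): for each particle, the potential is read at the cells to the left and right of its nearest-grid-point address, as pref!! would do, and differenced to give the field.

```python
import numpy as np

def gather_field_ngp(x_particle, phi, n_grid):
    """For each particle, form E = -(phi[i+1] - phi[i-1]) / 2 at its
    nearest-grid-point address i, on a periodic unit-spaced grid."""
    i = np.trunc(x_particle).astype(int) % n_grid
    left = phi[(i - 1) % n_grid]    # plays the role of pref!! at i - 1
    right = phi[(i + 1) % n_grid]   # plays the role of pref!! at i + 1
    return -(right - left) / 2.0

phi = np.array([0.0, 1.0, 2.0, 1.0])
# A particle at x = 1.3 sits in cell 1; E = -(phi[2] - phi[0]) / 2 = -1.
assert gather_field_ngp(np.array([1.3]), phi, 4).tolist() == [-1.0]
```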
The solution of the field at the grid positions, area 2, may be accomplished in any of several ways. These may be broken down into the broad classes of local and nonlocal algorithms. In the nonlocal class, a fast Fourier transform technique may be used as an example. Here, Poisson's equation ∂²φ/∂x² = -ρ is transformed into the wavenumber domain for solution, φ(k) = k⁻²ρ(k), and then an inverse transform on φ yields the desired result. Since the ratio of the speed of operations within a single processing element to the speed when processor-to-processor communications must be used is on the order of 1 μs to 1 ms, algorithms that do not rely heavily on global communication may be preferred. This local class of algorithms must still account for the boundary conditions of the device simulation, however. For local sampling, k² is replaced by k² dif²(kΔx/2), where dif(x) = sin(x)/x, which reduces to the use of three-point finite differencing of the Laplacian. For the one-dimensional electrostatic code, a relatively simple direct tridiagonal solver for periodic boundaries was implemented for the solution of Poisson's equation. If one lets the potential at processor 0 be 0 (Lisp starts counting at 0) and uses charge neutrality to obtain a potential at processor 1, the serial approach steps through each of the remaining processors in turn to solve φᵢ₊₁ = -ρᵢ + 2φᵢ - φᵢ₋₁.
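The serial sweep just described can be sketched as follows (our Python rendering; the phi1 argument stands in for the charge-neutrality step, whose details are not given in the text):

```python
import numpy as np

def poisson_sweep(rho, phi0=0.0, phi1=0.0):
    """Serial solve of phi'' = -rho on a unit grid: given phi[0] and
    phi[1], step through the remaining nodes with
    phi[i+1] = -rho[i] + 2*phi[i] - phi[i-1]."""
    phi = np.zeros(len(rho))
    phi[0], phi[1] = phi0, phi1
    for i in range(1, len(rho) - 1):
        phi[i + 1] = -rho[i] + 2.0 * phi[i] - phi[i - 1]
    return phi

# The result satisfies the three-point difference form of Poisson's equation.
rho = np.array([0.0, 1.0, -1.0, 0.0, 0.0])
phi = poisson_sweep(rho)
for i in range(1, 4):
    assert np.isclose(phi[i - 1] - 2.0 * phi[i] + phi[i + 1], -rho[i])
```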
Area 4, the particle advancement step, which is the principal bottleneck on serial and vector machines, is the most efficient step on the parallel-processor machine, since all of the required quantities are local to each processor and no communication is required.
One of the initial test cases is the electrostatic two-stream instability. The simulation length was 128 λ_D with 2000 particles, ω_p1 = ω_p2 = 1, and streaming velocities of ±0.2. Timing information was not taken for these runs, as the implementation was not optimized; the wall-clock time was almost completely determined by the output of graphics on the Symbolics front end. The simulation proceeded as anticipated, with Fig. 3 showing the expected phase space for the instability.
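For concreteness, the loading of this test case might look as follows (a sketch: the text gives only the parameters above, so the loading details, uniform positions and alternating cold beams, are our assumptions):

```python
import numpy as np

# Two-stream test parameters from the text: 2000 particles on a periodic
# system 128 Debye lengths long, with streaming velocities of +/-0.2.
rng = np.random.default_rng(0)
n_particles, length = 2000, 128.0
x = rng.uniform(0.0, length, n_particles)
v = np.where(np.arange(n_particles) % 2 == 0, 0.2, -0.2)  # two cold beams

assert np.isclose(v.mean(), 0.0)        # equal and opposite streams
assert x.min() >= 0.0 and x.max() < length
```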
IV. DISCUSSION AND FUTURE DIRECTIONS

Experience with the design and development of the one-dimensional electrostatic code was so encouraging, with respect to the ease of establishing flexible communication patterns, that the construction of a three-dimensional code is in progress. The overall structure of this code is similar to that described above. The major differences arise from the use of three-dimensional generalized geometries. To accommodate the wide variety of geometries that differentiate the various devices that might be considered, a metric is defined locally at each grid node, allowing the use of conformal mapping of the grid to the device. A local Cartesian space is used for particle pushing to avoid some numerical difficulties. The solution of the field quantities uses local convolution operators to solve the equations. The boundary conditions for the fields are then used as a global condition that is propagated through the system in an iterative manner. The efficiency with which the boundary information is propagated through the computation, and the locality of these conditions, are expected to significantly influence the potential usefulness of the parallel code implementations. The difficulties of load balancing and memory management have been significantly less important than previously expected. The assignment of particles uniformly among processors distributes the loading, while the combination of nearest-neighbor and hypercube communication routing in a large-bandwidth path minimizes the penalty. To improve the numerical noise and accuracy, digital filtering techniques will be utilized, with grid discreteness and aliasing to be minimized with the same approach as the convolution techniques already in place.
From the user's perspective, the code should make it easy to describe the device to be simulated and should provide a clear exposition of the device characteristics. A variety of diagnostic and device-specific algorithms remains to be implemented for the code described herein. Perhaps the most difficult parts of the code development are the design of the interface for specifying the system to be simulated and the design of the diagnostics that inform the user of the progress and results of the simulation. The design and use of an interactive graphical and textual interface is expected to be necessary to achieve the required flexibility. The use of standardized files from computer-aided design packages is expected to ease the input burden, and a fast frame buffer interfaced to video output devices is expected to allow the display of a significant fraction of the information that can be generated in the three-dimensional, fully electromagnetic simulation environment.
It now appears that the use of parallel computers can make a contribution to the computational needs of electronics design.

The author acknowledges helpful discussions with Dr. R. H. Jackson.
Manuscript received April 7, 1988; revised August 4, 1988. This work was supported by the Office of Naval Research. The author is with the Naval Research Laboratory, Washington, DC. IEEE Log Number 8823880.

Fig. 1. A nearest-neighbor connection pattern in two dimensions connects processor P to the nearest four processors, i.e., to the north (N), east (E), south (S), and west (W).

Fig. 2. A hypercube connection pattern is obtained by doubling the number of processors in the system and adding one connection to each processor. This recursive definition is demonstrated with the new connection shown as a light connecting line for (a) the 0-cube, (b) the 1-cube, (c) the 2-cube, (d) the 3-cube, and so forth.

Fig. 3. Typical output from the one-dimensional electrostatic code for the two-stream test case.