A 100 Gbps LDPC Decoder for the IEEE 802.11ay Standard

IEEE 802.11ay is the amendment to the 802.11 standard that enables Wi-Fi devices to achieve 100 Gbps using the unlicensed mm-Wave (60 GHz) band at ranges comparable to today’s commercial 60 GHz devices based on the 802.11ad standard. In this paper, we propose a full row-based layered LDPC decoder supporting all the coding rates of 802.11ay. Exploiting the structure of the 802.11ay parity check matrix, multiple layers are combined into a single layer, which improves hardware utilization and hence increases throughput. Throughput is further increased by interleaving multiple frames to improve the utilization of each pipeline stage. The decoder is synthesized in both 28 nm and 16 nm CMOS technology, and power is estimated with stimuli at 7 dB and 3.5 dB. The 28 nm implementation runs at 600 MHz and achieves a throughput of 67 Gbps for coding rate 13/16 at 4 iterations, with an area efficiency of 160 Gbps/mm² and an average power consumption of 408 mW and 141 mW, yielding an energy efficiency of 6.05 pJ/bit and 2.1 pJ/bit at 3.5 dB and 7 dB, respectively. The 16 nm implementation runs at 1 GHz and achieves a throughput of 112 Gbps at 4 iterations, with an area efficiency of 589 Gbps/mm² and an average power consumption of 408 mW and 163 mW, yielding an energy efficiency of 3.64 pJ/bit and 1.45 pJ/bit at 3.5 dB and 7 dB, respectively.


I. INTRODUCTION
Low-density parity-check (LDPC) codes [1] [2] [3] have played an important role in the wireless communication domain since their first adoption in the DVB-S2 standard in 2003. This is due to their near-Shannon-limit error correction capability and their mature hardware implementations.
The 60 GHz spectrum is known as the oxygen absorption band. That means radio waves at those frequencies are attenuated by the oxygen in the air. For this reason, 60 GHz was until recently considered appropriate only for point-to-point outdoor applications using highly directional antennas (e.g., wireless links between two networks). IEEE 802.11ad for the gigabit wireless local area network (WLAN) and IEEE 802.15.3c for the wireless personal area network (WPAN) are multi-gigabit data rate standards working in the 60 GHz band.
While enabling multi-Gbps wireless local communications was a significant achievement, the throughput and reliability requirements of new applications, such as augmented reality/virtual reality (AR/VR) and wireless fronthauling/backhauling, exceed what 802.11ad can offer. To meet the demanding requirements of such diverse applications, the IEEE 802.11 Task Group ay (802.11ay) was formed in 2015 to define PHY and MAC amendments to the 802.11 standard that enable Wi-Fi devices to achieve 100 Gbps using the unlicensed mm-Wave (60 GHz) band at ranges comparable to today's commercial 60 GHz devices based on the 802.11ad standard. Where 802.11ad uses a maximum of 2.16 GHz bandwidth, 802.11ay bonds four of those channels together for a maximum bandwidth of 8.64 GHz. MIMO is added with a maximum of 8 streams. The link rate per stream is 44 Gbps with 256-QAM; with four streams this goes up to 176 Gbps.
In recent years, many multi-Gbps LDPC decoder designs have been published. However, designing an LDPC decoder with 100+ Gbps throughput, low power consumption and low area cost while maintaining good BER performance remains a big challenge.
Deeply scaled CMOS technology brings significant implementation gains for digital circuit design [4] in terms of speed, area and power consumption. It serves as a major enabler of wireless communication at higher data rates with improved energy and area cost. The commercial 16 nm FinFET technology from TSMC [5] claims around 50% speed gain at the same power consumption when compared to the 28 nm node. However, technology scaling alone is not enough to cover the increased throughput requirements. Code innovations and new architectures are the key drivers for ultra-high-speed LDPC decoders.
The design space for high-speed LDPC decoders is multi-dimensional [6], covering decoding scheduling, optimization of the check node calculation, parallelism extensions, pipeline stage optimization, frame-level pipelining, etc. To design a high-throughput decoder, the designer should consider all of these dimensions. The optimization requires a trade-off between performance, power and area, depending on the application.
A fully parallel LDPC decoder [7] can reach the highest throughput. State-of-the-art (SOA) CMOS technology is able to support this kind of architecture for short code lengths (<1000). However, as the code length increases, long interconnects create routing congestion that prevents an efficient design and results in a large decoder with low area utilization, poor timing and poor power consumption.
For architecture-aware (AA) LDPC codes, a partial row-based parallel architecture [8] is a good alternative. It reduces the routing congestion problem while increasing the throughput when compared to block-serial decoding [9].

IEEE 10th International Symposium on Turbo Codes & Iterative Information Processing
row-based scheme, extending the bit node parallelism to the code length while keeping the check node parallelism the same as the size of the sub-matrix. This achieves high throughput while reducing complexity compared to a fully parallel architecture. The unrolling architecture [12] is based on a fully parallel architecture and further increases the decoding parallelism by unrolling the iteration loop. Dedicated hardware is instantiated N times to unroll the iterations in a two-phase decoding architecture, where N equals the maximum number of iterations. The unrolled architecture is pipelined to allow processing one block per clock cycle. This architecture also reduces the routing congestion compared to classical fully parallel architectures. The unrolling architecture significantly increases the throughput and opens the door to 100+ Gbps FEC [13] [14]. However, this architecture lacks flexibility in coding rates and early stop capability.
In this work, we propose a full row-based layered LDPC decoder with frame-interleaved scheduling for the 802.11ay standard. Ultra-high throughput is achieved using frame-interleaved and highly compressed decoding that makes the best use of hardware resources. For example, it takes only 3 cycles to finish message passing for 4 layers for rate 1/2. In contrast to the unrolling architectures, the decoding parallelism is increased by changing the message update scheduling and making the best use of a row-based layered LDPC decoder; hence early stop and flexibility in rates are inherited from the row-based layered LDPC decoder. The proposed architecture is verified in 28 nm CMOS as well as in 16 nm FinFET technology. With these advanced technologies, the proposed decoder can reach 100 Gbps throughput at a clock rate of 1 GHz and shows strong implementation results in both area and energy efficiency.
The rest of the paper is organized as follows. In Section II, the LDPC codes in the 802.11ay standard are introduced, followed by a brief introduction of the layered decoding algorithm. Section III details the proposed frame-level pipelined full row-based layered architecture. The implementation results are shown in Section IV. Section V concludes the paper.

II. BACKGROUND
A. LDPC codes in IEEE 802.11ay
An LDPC code is a linear block code defined by an $M \times N$ sparse parity check matrix $H$, where $N$ is the code length and $M$ is the number of parity checks. The LDPC codes in IEEE 802.11ay are architecture-aware (AA) LDPC codes. The parity check matrix is described by an $M_b \times N_b$ base matrix $H_b$, with $M_b = M/z$ and $N_b = N/z$, where $z$ is the size of the sub-matrix. A sub-matrix is either a zero matrix or a cyclic permutation of the rows of an identity matrix. In our case, the code length is equal to 1344 or 672 and the size of the sub-matrix is 42. Seven coding rates, 1/2, 5/8, 2/3, 3/4, 13/16, 5/6 and 7/8, are supported for each code length. The LDPC codes for 802.11ay are derived from the LDPC codes in 802.11ad by lifting [16]. Rates 2/3 and 5/6 are created by shortening rates 3/4 and 7/8, respectively, and rate 7/8 is created by puncturing rate 13/16. There are thus two different code lengths. The base matrices of rate 1/2 for the short code length and the lifted code length are shown in Fig. 1 and Fig. 2, respectively. The number written in each block denotes the cyclic right-permutation value for the identity sub-matrix. An empty block denotes a null sub-matrix of size 42 × 42. The lifted parity check matrix has the same check node and bit node degrees as the base parity check matrix.
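The quasi-cyclic structure described above can be illustrated with a minimal Python sketch that expands a base matrix into a binary parity check matrix. The base matrix entries below are a hypothetical toy example and are not taken from the 802.11ay standard; the function names, the use of -1 to mark a null sub-matrix, and the exact shift convention are assumptions made for this sketch.

```python
Z = 42  # sub-matrix size used by the codes discussed in the text

def shifted_identity(z, shift):
    """z x z identity matrix with its columns cyclically shifted right by `shift`."""
    return [[1 if (r + shift) % z == c else 0 for c in range(z)] for r in range(z)]

def expand_base_matrix(base, z=Z):
    """Expand a base matrix (shift value per block, -1 for a null block) into H."""
    h = []
    for base_row in base:
        blocks = [shifted_identity(z, s) if s >= 0 else [[0] * z for _ in range(z)]
                  for s in base_row]
        for r in range(z):
            row = []
            for blk in blocks:
                row.extend(blk[r])
            h.append(row)
    return h

# Hypothetical toy base matrix (NOT from the standard): 2 x 4 blocks.
TOY_BASE = [
    [0, -1, 5, 41],
    [7, 3, -1, 0],
]

H = expand_base_matrix(TOY_BASE)
print(len(H), len(H[0]))  # 84 168
```

Every row of the expanded matrix has as many ones as there are non-null blocks in the corresponding base row, which is why lifting preserves the check node degree.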

Algorithm 1 Layered decoding
Fig. 3. BER for all coding rates over AWGN channel with maximum 14 iterations (5 bits).

With layered scheduling, a group of information can be calculated and passed between bit nodes and check nodes in one clock cycle by applying a permutation. For a block-serial LDPC decoder, the a posteriori information vector $Q_n$ of size $z$ is initialized with the input log-likelihood ratio $llr_n$, and the extrinsic information vector $r_{mn}$ is initialized with zeros. Each iteration consists of $M_b$ sub-iterations. During each sub-iteration, the a posteriori information $Q_n$ passes through a permutation network with a shifting value equal to $P(m, n)$, where $P(m, n)$ is the shifting value indicated in the base matrix. The check nodes use the intrinsic information $q_{mn}$, which is obtained by subtracting the extrinsic information $r_{mn}$ from the a posteriori information $Q_n$, to calculate the updated extrinsic information using the SPA or the simplified min-sum [24] algorithm. Afterwards, the updated extrinsic information $r_{mn}$ is added to the a priori information. Finally, the a posteriori information passes through the permutation network again with a shifting value $\tau_{m,n}$ equal to $z - P(m, n)$ and is saved back to memory. Fig. 3 and Fig. 4 show the performance comparison between floating point and fixed point under different maximum iteration numbers for the five coding rates with code length 1344 using QPSK modulation over an AWGN channel. The 5-bit quantization shows negligible performance loss when compared with floating-point decoding with a maximum of 14 iterations, which gives optimal BER performance. The decoder described in the following is configured with 4-bit quantization and a maximum of 4 iterations. Compared to the optimal performance in Fig. 3, the performance degradation is around 0.3 to 1 dB for the different coding rates.
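The schedule described above can be sketched in a few lines of Python. This is a floating-point illustration under stated assumptions, not the fixed-point hardware of this paper: the function names are hypothetical, plain min-sum without scaling or offset is assumed, positive LLRs are taken to mean bit 0, and the base matrix is a toy example rather than an 802.11ay matrix.

```python
def rotate(vec, shift):
    """Cyclic left rotation by `shift`, standing in for the permutation network."""
    shift %= len(vec)
    return vec[shift:] + vec[:shift]

def min_sum(q):
    """For each input: product of the other signs times the minimum other magnitude."""
    out = []
    for i in range(len(q)):
        others = q[:i] + q[i + 1:]
        sign = -1.0 if sum(v < 0 for v in others) % 2 else 1.0
        out.append(sign * min(abs(v) for v in others))
    return out

def syndrome_ok(base, z, bits):
    """Check every parity equation of the expanded matrix on hard-decision bits."""
    for row in base:
        for k in range(z):
            parity = 0
            for n, s in enumerate(row):
                if s >= 0:
                    parity ^= bits[n * z + (k + s) % z]
            if parity:
                return False
    return True

def layered_decode(base, z, llr, max_iter=4):
    """Layered min-sum decoding; base entries are shift values, -1 marks a null block."""
    n_b = len(base[0])
    Q = [list(llr[n * z:(n + 1) * z]) for n in range(n_b)]  # a posteriori, init with LLRs
    R = [[[0.0] * z for _ in range(n_b)] for _ in base]     # extrinsic, init with zeros
    bits = [1 if v < 0 else 0 for v in llr]
    for _ in range(max_iter):
        for m, row in enumerate(base):                      # one sub-iteration per layer
            cols = [n for n, s in enumerate(row) if s >= 0]
            # Permute Q_n by P(m, n) and subtract the old extrinsic information.
            q = {n: [a - b for a, b in zip(rotate(Q[n], row[n]), R[m][n])] for n in cols}
            for k in range(z):                              # z check nodes per layer
                for n, r in zip(cols, min_sum([q[n][k] for n in cols])):
                    R[m][n][k] = r
            for n in cols:
                # Add the new extrinsic information, then undo the permutation (z - P).
                updated = [a + b for a, b in zip(q[n], R[m][n])]
                Q[n] = rotate(updated, z - row[n])
        bits = [1 if v < 0 else 0 for vec in Q for v in vec]
        if syndrome_ok(base, z, bits):                      # early stop
            break
    return bits

# Hypothetical toy code (NOT an 802.11ay matrix): 3 layers, 4 block columns, z = 4.
TOY_BASE = [[0, 1, 2, -1], [-1, 0, 3, 1], [2, -1, 0, 3]]
llr = [2.0] * 16   # all-zero codeword, positive LLR means bit 0
llr[5] = -0.5      # one unreliable, wrongly signed input
decoded = layered_decode(TOY_BASE, 4, llr)
print(decoded)
```

Because each layer immediately writes its updated a posteriori values back, later layers in the same iteration already see the corrected values, which is the reason layered scheduling typically needs fewer iterations than a two-phase schedule.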

III. PROPOSED ARCHITECTURE
In this section, we detail a full row-based layered LDPC decoder with frame-interleaved scheduling for the 802.11ay standard. The architecture is shown in Fig. 5. There are three pipeline stages in the architecture, and it takes 3 clock cycles to finish the decoding of one layer. In the first cycle, the a posteriori information $Q_n$ is read from 32 storage elements and passed through the barrel shifter, and the extrinsic information is subtracted from it to produce the a priori information $q_{mn}$. Syndrome detection is also performed in the first pipeline stage. In the second cycle, the a priori information is saturated to a smaller bit width and fed into the CFU (check node function unit). The CFU finds the minimum absolute value and the overall sign of all the a priori information fed into it. In the last cycle, the output module generates the extrinsic information, adds it to the pre-stored a priori information and writes the result directly back to the storage elements.
Fig. 6. Min-sum search tree with 16 inputs.

In the parity check matrices of code rate 1/2 and code rate 5/8, there are groups of two layers that have no overlapping connection with any bit node, and the degree of each combined layer is not larger than 8. Based on this property, it is feasible to perform the message exchange of two layers in one layer decoding step. This is detailed in [10]. Every two consecutive layers in the lifted parity check matrix likewise have no overlapping connection with any bit node. Hence processing two consecutive layers in one layer decoding step is feasible. We call this kind of layer a compressed layer. Taking coding rate 1/2, whose parity check matrix is shown in Fig. 2, as an example, it is even feasible to process 4 specific layers in one layer decoding step of 3 clock cycles, which boosts the throughput by a factor of 2 compared to a full row-based layered LDPC decoder for the 802.11ad standard. We call this kind of layer a super compressed layer. Taking rate 5/8 with code length 1344 as an example, it has 2 compressed layers and 2 super compressed layers. Each compressed and super compressed layer is one effective layer in the hardware implementation, as shown in Table I.
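The condition for combining layers can be made concrete with a short Python sketch. It checks that a set of layers shares no bit-node group and that the combined degree fits the check node unit. The greedy grouping, the function names, the `max_degree` parameter and the toy base matrix are assumptions for illustration only; the groupings used in this paper follow from the structure of the standard's matrices and the reordering in [10].

```python
def support(row):
    """Set of block columns (bit-node groups) connected to a layer."""
    return {n for n, s in enumerate(row) if s >= 0}

def can_combine(rows, max_degree):
    """True if the layers share no bit node and the combined degree fits the CFU."""
    seen = set()
    for row in rows:
        cols = support(row)
        if seen & cols:          # overlapping connection with some bit node
            return False
        seen |= cols
    return len(seen) <= max_degree

def greedy_group(base, max_degree):
    """Greedily merge layers into combined layers; returns lists of layer indices."""
    groups = []
    for m, row in enumerate(base):
        for g in groups:
            if can_combine([base[i] for i in g] + [row], max_degree):
                g.append(m)
                break
        else:
            groups.append([m])
    return groups

# Hypothetical toy base matrix (NOT from the standard); -1 marks a null block.
TOY_BASE = [
    [3, 1, 0, 2, -1, -1, -1, -1],
    [-1, -1, -1, -1, 0, 2, 1, 3],
    [1, -1, 2, -1, 3, -1, 0, -1],
    [-1, 0, -1, 3, -1, 1, -1, 2],
]

groups = greedy_group(TOY_BASE, max_degree=8)
print(groups)  # [[0, 1], [2, 3]]
```

Each resulting group corresponds to one effective layer, so the four toy layers are processed in two layer decoding steps instead of four.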
A min-sum search architecture with 16 inputs is shown in Fig. 6. In the proposed architecture, two of these are instantiated. The resulting 32-input min-sum search tree is able to create the check node information for a normal layer, a compressed layer and a super compressed layer. Mapping the a priori information from the bit nodes of a combined layer to the min-sum tree is not straightforward. In 802.11ad, column and layer reordering [10] guarantees that every pair of bit-node vectors has a one-to-one mapping to the first half and the second half of the min-sum search tree. In 802.11ay, an additional two-level routing network is required. The first level switches the input order of every two inputs, which together are called a cell. The second level switches the input order of every two cells. One cell consists of 4 elements in the parity check matrix and has two possible orders, as shown in Fig. 2 (top); a combined parity check matrix with column and layer switching for coding rate 1/2 is shown at the bottom.
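A tree-structured minimum search of the kind mentioned above can be sketched in Python as follows. The sketch tracks the two smallest magnitudes, the position of the smallest one and the overall sign, which is the usual min-sum practice; that the CFU in this paper works exactly this way is an assumption, and all names are hypothetical.

```python
def leaf(index, value):
    """Wrap one input as (min1, min2, index of min1, sign bit)."""
    return (abs(value), float("inf"), index, value < 0)

def merge(a, b):
    """Combine two sub-tree results, as one node of the search tree would."""
    if a[0] <= b[0]:
        min1, idx, min2 = a[0], a[2], min(a[1], b[0])
    else:
        min1, idx, min2 = b[0], b[2], min(b[1], a[0])
    return (min1, min2, idx, a[3] != b[3])  # XOR accumulates the overall sign

def tree_search(values):
    """Pairwise reduction; the number of inputs must be a power of two."""
    level = [leaf(i, v) for i, v in enumerate(values)]
    while len(level) > 1:
        level = [merge(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def extrinsic(values):
    """Min-sum outputs: each input gets the minimum and sign of all the others."""
    min1, min2, idx, neg = tree_search(values)
    out = []
    for i, v in enumerate(values):
        mag = min2 if i == idx else min1   # exclude the input's own magnitude
        is_neg = neg != (v < 0)            # remove the input's own sign
        out.append(-mag if is_neg else mag)
    return out

values = [3, -1, 4, 2, -5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
print(tree_search(values))  # (1, 2, 1, False)
print(extrinsic(values)[:5])  # [1, -2, 1, 1, -1]
```

With 16 inputs the reduction takes four levels of `merge` nodes; unused inputs of a lower-degree layer could be tied to a large positive value so that they affect neither the minima nor the sign.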
In the architecture shown in Fig. 5, a 3-stage pipeline performs the message exchange between bit nodes and check nodes, as defined in the compressed parity check matrix. To achieve high throughput and optimal hardware utilization, a frame-interleaved scheduling architecture is used. This ensures that all pipeline stages are active in every clock cycle, working on 3 frames at once, as shown in Fig. 7.
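The effect of frame interleaving on pipeline occupancy can be illustrated with a small Python model. The stage labels paraphrase the three cycles described earlier; the round-robin issue order and all names are assumptions for this sketch, not a description of the actual control logic.

```python
STAGES = ("read/shift/subtract", "check node (CFU)", "output/write-back")

def interleaved_schedule(n_frames, n_layers, n_cycles):
    """Per clock cycle, list the (frame, layer) job occupying each pipeline stage."""
    schedule = []
    for cycle in range(n_cycles):
        row = []
        for stage in range(len(STAGES)):
            issue = cycle - stage             # cycle in which this job entered stage 0
            if issue < 0:
                row.append(None)              # pipeline is still filling
            else:
                frame = issue % n_frames      # frames take turns entering the pipeline
                layer = (issue // n_frames) % n_layers
                row.append((frame, layer))
        schedule.append(row)
    return schedule

schedule = interleaved_schedule(n_frames=3, n_layers=4, n_cycles=6)
for cycle, row in enumerate(schedule):
    print(cycle, row)
```

With three frames and three stages, a frame issues its next layer exactly one cycle after its previous layer has been written back, so no stage is idle after the initial fill and no frame reads a posteriori values that are still being updated.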

IV. IMPLEMENTATION RESULTS
The LDPC decoder is implemented in 16 nm FinFET as well as 28 nm CMOS technology. Synthesis is performed with Cadence Encounter using worst-case PVT settings. The design is synthesized at 1 GHz in 16 nm and 600 MHz in 28 nm, resulting in an area of 0.19 mm² and 0.42 mm², respectively. For a fixed number of iterations $N_{iter}$, the minimum decoding throughput of the frame-interleaved decoder is calculated as

$$T = \frac{N_{ldpc} \cdot N_{frame} \cdot f_{clk}}{N_{iter} \cdot N_{layer} \cdot C_{layer}}$$

where $N_{ldpc}$ is 1344, $N_{frame}$ is 3, $f_{clk}$ is the clock speed, $C_{layer}$ is 3 and $N_{layer}$ is as listed in Table I. For 16 nm, running at 1 GHz, this gives a throughput of 112 Gbps for code rate 13/16 at 4 iterations and 67 Gbps throughput for the other rates.
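As a rough sanity check on the throughput figures, the Python sketch below assumes that throughput equals the bits of all interleaved frames divided by the time needed to run every iteration over every effective layer. Both this form of the relation and the value of 3 effective layers for rate 13/16 are assumptions of the sketch (Table I is not reproduced here); they are chosen because they reproduce the reported numbers.

```python
def throughput_bps(n_ldpc, n_frame, f_clk_hz, n_iter, n_layer, c_layer):
    """Bits decoded per second for a frame-interleaved layered decoder (assumed model)."""
    cycles_per_batch = n_iter * n_layer * c_layer   # cycles to finish n_frame frames
    return n_ldpc * n_frame * f_clk_hz / cycles_per_batch

# Assumed: 3 effective layers for rate 13/16, 4 iterations, 3 cycles per layer.
t_16nm = throughput_bps(1344, 3, 1.0e9, n_iter=4, n_layer=3, c_layer=3)
t_28nm = throughput_bps(1344, 3, 600e6, n_iter=4, n_layer=3, c_layer=3)
print(t_16nm / 1e9, t_28nm / 1e9)  # 112.0 67.2
```

Under these assumptions the model gives 112 Gbps at 1 GHz and 67.2 Gbps at 600 MHz, in line with the 16 nm and 28 nm figures quoted in the abstract.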
Table II gives a comparison between this work and the SOA high-speed LDPC decoders for 802.11ad. Results reported for this work use a netlist from front-end logical synthesis. Power numbers are obtained through simulation using input LLR stimuli created at $E_b/N_0$ of 3.5 dB without early stop and 7.0 dB with early stop. The advanced 16 nm technology node brings a gain of a factor of 2.2 in area and a factor of 1.6 in speed and energy efficiency when compared to 28 nm.
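The efficiency metrics follow directly from power, throughput and area. The short Python sketch below recomputes them from the 16 nm figures quoted in this paper; the helper names are hypothetical.

```python
def energy_pj_per_bit(power_mw, throughput_gbps):
    """mW / Gbps = 1e-3 W / 1e9 bit/s = 1e-12 J/bit, i.e. pJ/bit."""
    return power_mw / throughput_gbps

def area_efficiency(throughput_gbps, area_mm2):
    """Throughput per unit area in Gbps/mm^2."""
    return throughput_gbps / area_mm2

# 16 nm figures from the text: 112 Gbps, 0.19 mm^2, 408 mW (3.5 dB), 163 mW (7 dB).
e_3p5db = energy_pj_per_bit(408, 112)
e_7db = energy_pj_per_bit(163, 112)
a_eff = area_efficiency(112, 0.19)

# Technology scaling ratios between the 28 nm and 16 nm implementations.
area_gain = 0.42 / 0.19
speed_gain = 1000 / 600
print(e_3p5db, e_7db, a_eff, area_gain, speed_gain)
```

The results are close to the reported 3.64 pJ/bit, 1.45 pJ/bit and 589 Gbps/mm², and the area and clock ratios are consistent with the stated factors of about 2.2 and 1.6.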

V. CONCLUSION
A full row-based layered LDPC decoder with frame-interleaved scheduling is proposed for 802.11ay and implemented in 16 nm FinFET as well as 28 nm CMOS. The proposed architecture is flexible enough to support all coding rates as well as early stop. The 16 nm design has an area of 0.19 mm² with a power consumption of 408 mW and 163 mW at 3.5 dB and 7 dB, respectively, running on a 0.8 V supply at a 1 GHz clock. It achieves a throughput of 112 Gbps with a maximum of 4 iterations at the highest coding rate, resulting in an area efficiency of 589 Gbps/mm² and an energy efficiency of 3.64 pJ/bit and 1.45 pJ/bit. The implementation results show significant improvement in throughput, area and energy efficiency when compared to the SOA. These performance gains result from the architecture improvements and from moving to a more advanced technology node.

Hard decision according to $Q_n$.

B. Layered decoding schedule
This kind of architecture-aware (AA) LDPC code can be decoded by a partially parallel Sum-Product Algorithm (SPA).

2018 IEEE 10th International Symposium on Turbo Codes & Iterative Information Processing