A 506Gbit/s Polar Successive Cancellation List Decoder with CRC

Polar codes have recently attracted significant attention due to their excellent error-correction capabilities. However, efficient decoding of Polar codes for high throughput is very challenging. Beyond 5G, data rates towards 1 Tbit/s are expected. Low-complexity decoding algorithms like Successive Cancellation (SC) decoding enable such high throughput but suffer in error-correction performance. Polar Successive Cancellation List (SCL) decoders, with and without Cyclic Redundancy Check (CRC), exhibit much better error-correction performance but imply higher implementation cost. In this paper, we investigate in depth and quantify various trade-offs of these decoding algorithms with respect to error-correction capability and implementation costs in terms of area, throughput and energy efficiency in a 28 nm CMOS FD-SOI technology. We present a framework that automatically generates decoder architectures for throughputs beyond 100 Gbit/s. This framework includes various architectural optimizations for SCL decoders that go beyond the State-of-the-Art. We demonstrate a 506 Gbit/s SCL decoder with CRC that was generated by this framework.


I. INTRODUCTION
Channel coding is an essential part of baseband processing and enables reliable transmission. It has a long history going back to Shannon's noisy channel coding theorem in 1948 [1]. Advanced channel decoding schemes exploit soft information to improve the error-correction capabilities. The State-of-the-Art (SOA) coding schemes for soft decoding known today are Turbo codes, Low Density Parity Check (LDPC) codes and Polar codes. These codes are adopted in many communications standards like LTE, WiMAX, Wi-Fi, DVB-S2 and Ethernet, to name but a few. Polar codes are relatively new [2]. In 2009, Erdal Arıkan proved that these codes achieve channel capacity for the Binary Symmetric Memoryless Channel (BSMC). Due to their excellent error-correction performance, Polar codes have attracted significant attention and became part of the new 5G standard [3]. Future beyond-5G use cases are expected to require data rates in the Tbit/s range, which is one to two orders of magnitude higher than the throughput of 5G.
Successive Cancellation (SC) and Successive Cancellation List (SCL) [4] are the most prominent decoding algorithms for Polar codes. SC comes with a low algorithmic complexity but a limited error-correction performance for finite code lengths. SCL applies list decoding on the SC algorithm which significantly improves the error correction at the cost of higher algorithmic complexity. SC and SCL decoding algorithms traverse the Polar Factor Tree (PFT) in a depth-first manner [5] resulting in a sequential decoding procedure. Hence, it is common belief that Polar code decoding with SC and SCL cannot compete with LDPC codes in terms of very high throughput since LDPC codes are decoded with the Belief Propagation (BP) algorithm. The BP has an inherent parallelism, which allows a very high throughput in a natural way. Due to this behavior, BP was also adopted for Polar code decoding [2]. However, BP applied to Polar code decoding needs a large number of decoding iterations to approach the error-correction performance of SC [6]. This large number of iterations decreases the throughput and increases the latency. Moreover, even for a very high number of iterations, BP cannot compete with the error-correction performance of SCL decoding [7]. Hence, in this paper, we focus on advanced SCL decoder architectures for throughput towards Tbit/s due to their good error-correction capabilities.
To reach very high throughput without relinquishing error-correction performance, improvements on the algorithmic and architectural level were investigated in the past. On the algorithmic level, pruning the PFT for SC [5], [8] and SCL [9] can largely reduce the implementation complexity. On the architectural level, unrolling the PFT traversal and pipelining it accordingly mitigates the sequential data dependencies and, thus, enables very high throughput. The principle of unrolling, applied to LDPC decoders in [10], was later adopted for Polar code decoding with SC [11] and SCL [12]. A coded throughput of 644 Gbit/s is reported for an SC-based decoder in a 28 nm technology in [13]. In [14], even 1274 Gbit/s coded throughput is presented for the same technology, but these results are based on synthesis only. After placement and routing, a much lower throughput is expected. As of today, only one publication of an unrolled SCL decoder exists, which achieves 12 Gbit/s coded throughput in a 28 nm technology [12].
978-1-7281-4490-0/20/$31.00 © 2020 IEEE
Concatenating Polar codes with a Cyclic Redundancy Check (CRC) notably improves the error-correction performance [4]. To the best of our knowledge, none of the previously published high-throughput architectures provide CRC support. This paper makes the following new contributions:
• We investigate in depth and quantify various trade-offs of the SC and SCL decoding algorithms with respect to error-correction performance and implementation costs in terms of area, throughput and energy efficiency in a 28 nm CMOS FD-SOI technology.
• We present a framework that automatically generates SCL decoder architectures. The framework includes various architectural optimizations for SCL decoders that go beyond the State-of-the-Art.
• Finally, we present a 506 Gbit/s SCL decoder with CRC that was automatically generated.
The remainder of this paper is structured as follows. We give a brief review of Polar codes and their decoding algorithms, targeting optimized SCL, in Section II. In Section III, we present the architectural concept of our SCL decoders, the framework and the optimized nodes. Section IV gives detailed post-place-and-route results for area, timing and power in 28 nm technology. We discuss the trade-offs between error-correction performance and implementation cost and compare our designs against a SOA SCL decoder implementation. Section V concludes the paper.

II. BACKGROUND
Polar codes are linear block codes of length $N = 2^n$ with $K$ information bits and rate $R = K/N$, denoted by $\mathcal{P}(N, K)$. They belong to the class of multilevel concatenated codes and are related to Reed-Muller codes [15]. In contrast to Reed-Muller codes, Polar codes use the phenomenon of channel polarization [2] to split bit-channels into reliable and noisy channels. Information bits are sent over reliable channels, specified in the information set $\mathcal{I}$. The unreliable channels, specified in the frozen set $\mathcal{F} = \mathcal{I}^C$ of size $N - K$, are used to include redundancy. The unreliable positions in an input vector $u$ of length $N$, called frozen bits, are set to zero, while the information bits are set in the positions of $\mathcal{I}$. Polar code construction is equivalent to finding the most reliable channels for a given code length and rate, i.e., the set $\mathcal{I}$, and depends on the underlying channel assumptions.
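As a minimal illustration of the code definition above, the following Python sketch places the information bits into the input vector u and applies the polar transform. The transform x = u · F^{⊗n} with kernel F = [[1, 0], [1, 1]] (natural bit order, no bit-reversal permutation) is the standard construction and is not spelled out in this section. The function names and the information set {3, 5, 6, 7} for a P(8, 4) code are assumptions chosen for illustration only.

```python
def build_input_vector(info_bits, info_set, n_total):
    """Place information bits at the positions in info_set; frozen bits stay 0."""
    u = [0] * n_total
    for bit, pos in zip(info_bits, sorted(info_set)):
        u[pos] = bit
    return u


def polar_encode(u):
    """Polar transform x = u * F^(kron n) over GF(2), computed with XOR butterflies."""
    x = list(u)
    n_total = len(x)
    step = 1
    while step < n_total:
        for start in range(0, n_total, 2 * step):
            for i in range(start, start + step):
                x[i] ^= x[i + step]  # upper branch accumulates the lower branch
        step *= 2
    return x


# Illustrative P(8, 4) example; the information set is an assumption.
INFO_SET = {3, 5, 6, 7}
u = build_input_vector([1, 0, 1, 1], INFO_SET, 8)
x = polar_encode(u)
```

Since the transform is its own inverse over GF(2), applying `polar_encode` twice returns the original input vector, which is a convenient sanity check.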

A. SC Decoding
The process of SC decoding [2] for Polar codes of length $N$ can be represented as message passing between the nodes of a balanced binary tree [5] with $2N - 1$ nodes, named PFT. The leaf nodes at layer $s = 0$ are the $N$ bits to be estimated by the decoder. The input to the root node with $s = n$ is a vector $\alpha$ of $N$ received channel Log-Likelihood Ratios (LLRs). In layer $s$ of the PFT, a node of size $N_v = 2^s$ receives a vector $\alpha^v$ of $N_v$ LLRs from the parent node and returns a hard decision vector $\beta^v$ of the same length. The messages to the left and the right child nodes, $\alpha^l$ and $\alpha^r$, are computed element-wise for $0 \le i < N_v/2$ by
$$\alpha_i^l = f(\alpha_i^v, \alpha_{i+N_v/2}^v), \tag{1}$$
$$\alpha_i^r = g(\alpha_i^v, \alpha_{i+N_v/2}^v, \beta_i^l), \tag{2}$$
with the hardware-efficient min-sum formulation of the f- and g-functions [16] being
$$f(a, b) = \operatorname{sign}(a)\,\operatorname{sign}(b)\,\min(|a|, |b|), \tag{3}$$
$$g(a, b, c) = (1 - 2c)\,a + b. \tag{4}$$
Finally, the message to the parent node, the partial sum vector $\beta^v$, is calculated by
$$\beta_i^v = \begin{cases} \beta_i^l \oplus \beta_i^r & \text{if } i < N_v/2, \\ \beta_{i-N_v/2}^r & \text{otherwise,} \end{cases} \tag{5}$$
where $\oplus$ denotes the XOR operation. We call (5) the h-function. In this way, the PFT is traversed depth first with priority to the left child, since its result is used in (2) to determine the message to the right child. In the leaf nodes, $\beta^v$ only has one element, the estimated bit $\hat{u}_i$, determined by
$$\hat{u}_i = \begin{cases} 0 & \text{if } i \in \mathcal{F} \text{ or } \alpha^v \ge 0, \\ 1 & \text{otherwise.} \end{cases} \tag{6}$$
The PFT has $N$ leaf nodes and $N - 1$ non-leaf nodes. For the former, in total $N$ bit decisions have to be made, while for the latter, the three vector functions (1), (2) and (5) have to be calculated. Nodes whose leaves are all frozen (called Rate-0 nodes) or nodes whose leaves are all information bits (called Rate-1 nodes) can be simplified [5]. Likewise, PFT pruning can be carried out for nodes that represent Repetition code (REP) nodes and Single Parity-Check code (SPC) nodes [8]. This PFT pruning largely reduces the number of nodes of the PFT. In unrolled decoder architectures, each vector function executed by a node, called a computational kernel, corresponds to a pipeline stage. Hence, the PFT pruning largely reduces the number of pipeline stages and therefore the complexity of such high-throughput decoders.
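The message passing described above can be sketched in a few lines of Python. This is a plain recursive software model of unpruned SC decoding with the min-sum f-function and the usual g- and h-functions, assuming that a non-negative LLR favors bit 0. It is not the unrolled hardware architecture, and the names `f`, `g` and `sc_decode` are illustrative.

```python
def f(a, b):
    """Min-sum f-function: product of the signs times the smaller magnitude."""
    sign = -1 if (a < 0) != (b < 0) else 1
    return sign * min(abs(a), abs(b))


def g(a, b, c):
    """g-function: (1 - 2c) * a + b, with c the partial sum from the left child."""
    return b - a if c else b + a


def sc_decode(alpha, frozen):
    """Decode one PFT node.

    alpha  : list of LLRs received from the parent node
    frozen : list of booleans, True where the leaf is a frozen bit
    returns (beta, u_hat): partial sums for the parent and estimated bits
    """
    n_v = len(alpha)
    if n_v == 1:  # leaf node: bit decision
        bit = 0 if (frozen[0] or alpha[0] >= 0) else 1
        return [bit], [bit]
    half = n_v // 2
    # Left child first, since its partial sums are needed by the g-function.
    alpha_l = [f(alpha[i], alpha[i + half]) for i in range(half)]
    beta_l, u_l = sc_decode(alpha_l, frozen[:half])
    alpha_r = [g(alpha[i], alpha[i + half], beta_l[i]) for i in range(half)]
    beta_r, u_r = sc_decode(alpha_r, frozen[half:])
    # h-function: combine the partial sums of both children.
    beta = [beta_l[i] ^ beta_r[i] for i in range(half)] + beta_r
    return beta, u_l + u_r
```

For a noiseless channel, the returned partial sums at the root equal the transmitted codeword, which gives a simple consistency check against an encoder model.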

B. SCL Decoding
SCL decoding was introduced in [4] to improve the error-correction performance of Polar codes. In the PFT, instead of vectors α and β, lists of vectors are passed among the nodes. At the leaves of the PFT, both possible values, 0 and 1, are considered for an information bit estimation. The number of codeword candidates (paths) therefore doubles for every bit estimation. The number of paths is limited to the list size L in order to keep the decoding complexity practical. This implies that, after a bit estimation, paths must be discarded if their number exceeds L. The reliability of each path is rated by a Path Metric (PM), which is updated at every bit estimation. The lower the PM of a path, the more likely the associated codeword; thus, paths with low PMs are allowed to survive.
For an LLR-based SCL [17], the PMs are initialized with 0 and each PM of path $l$, when estimating bit $i$, is updated by
$$\mathrm{PM}_i^l = \begin{cases} \mathrm{PM}_{i-1}^l + |\alpha^v| & \text{if } \hat{u}_i \ne \mathrm{HDD}(\alpha^v), \\ \mathrm{PM}_{i-1}^l & \text{otherwise,} \end{cases} \tag{7}$$
with $\mathrm{HDD}(\alpha^v)$ being Hard Decision Decoding (HDD) on $\alpha^v$, the leaf LLR of path $l$. The PM can be seen as a cost function that is increased by the absolute value of the corresponding LLR whenever the less likely bit decision is taken. After the last bit decision, the path with the lowest PM is chosen and output by the decoder. By including a CRC in the input vector $u$ before the encoding, the selection of candidates in SCL decoding and therefore the error-correction performance can be further improved.
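The path-metric bookkeeping at a single leaf can be sketched as follows. The sketch assumes the penalty-based update described above (the PM grows by |LLR| only when the chosen bit contradicts the hard decision) and uses a generic software sort in place of the hardware sorter. The names `update_pm` and `scl_leaf_step` and the representation of a path as a `(pm, bits)` tuple are hypothetical.

```python
def update_pm(pm, llr, bit):
    """Add |llr| as a penalty if the chosen bit contradicts the hard decision."""
    hdd = 0 if llr >= 0 else 1
    return pm + abs(llr) if bit != hdd else pm


def scl_leaf_step(paths, llrs, is_frozen, list_size):
    """Process one leaf for all paths and keep the list_size best candidates.

    paths : list of (pm, bits) tuples, one per surviving path
    llrs  : leaf LLR of each path, in the same order as paths
    """
    candidates = []
    for (pm, bits), llr in zip(paths, llrs):
        # A frozen bit is forced to 0; an information bit splits the path.
        for bit in ((0,) if is_frozen else (0, 1)):
            candidates.append((update_pm(pm, llr, bit), bits + [bit]))
    candidates.sort(key=lambda cand: cand[0])  # lowest PM is most likely
    return candidates[:list_size]


# One information bit followed by one frozen bit, list size 2.
paths = scl_leaf_step([(0, [])], [-3], is_frozen=False, list_size=2)
paths = scl_leaf_step(paths, [2, -5], is_frozen=True, list_size=2)
```

In this example the second path accumulates a penalty of 5 at the frozen bit because its LLR points towards 1, which is how frozen bits reorder the candidate list.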
On architectural level, unrolling and pipelining of the PFT traversal for the SCL algorithm can be performed in the same way as for the SC algorithm [12]. The high throughput gained by unrolling comes at the cost of flexibility, since resulting decoders are specialized for a specific code and rate.

III. POLAR DECODER FRAMEWORK
In [13], a framework was presented to automatically generate unrolled pipelined high-throughput SC and Soft Cancellation (SCAN) Polar decoders. We extended this framework to also support SCL decoders. The framework, written in C++, has the following inputs: the Polar code, the selected decoding algorithm, the target throughput, the quantization and technology information. It outputs a fully synthesizable VHDL decoder model with corresponding test bench and test data. It is guaranteed by construction that the VHDL model and the simulation model are bit-level equivalent.
The central data structure of the framework is the PFT. The framework performs an automatic optimization and simplification of the PFT as described in the previous section. The framework contains a library of optimized computational kernels for the different PFT nodes. These kernels are implemented in C++ for error-correction performance simulation and as parameterizable VHDL hardware building blocks. In detail, the library contains the following kernels:
• generic sorter with details provided in Section III-B.
An SCL decoder architecture is generated by traversing the automatically optimized PFT. Whenever a node is visited, a pipeline stage is created by using the building block associated with the corresponding computational kernel of the node. Some additional pointer management is necessary to keep track of the correct decoding paths. These pointers are determined in the leaf nodes, where the bit vectors are estimated from the LLRs, and represent the index of the input candidate each bit vector originates from. The leaf nodes also update the PMs, as described in Section II-B. In the sorter instances after each leaf node, the L best paths are selected and propagated. In parallel to the decoding process, the information bits determined in each leaf node are pushed to update the CRC register pipeline (see Section III-C).
To optimize the decoder architecture, the framework inserts and balances registers between the different pipeline stages for a given throughput constraint according to derived timing models (see Section III-D).
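The PFT pruning and the traversal that emits one pipeline stage per visited kernel can be sketched as follows. This is only an illustration of the principle, not the framework's C++ implementation: the stage labels, the functions `classify` and `unroll`, and the choice to classify a size-2 node with one frozen bit as REP are assumptions. A sorter stage is appended after every pruned leaf node that carries information bits.

```python
def classify(frozen):
    """Return the special node type for a frozen-bit pattern, or None."""
    n_v = len(frozen)
    if all(frozen):
        return "RATE0"
    if not any(frozen):
        return "RATE1"
    if n_v >= 2 and all(frozen[:-1]):
        return "REP"  # only the last leaf carries information
    if n_v >= 2 and frozen[0] and not any(frozen[1:]):
        return "SPC"  # only the first leaf is frozen
    return None


def unroll(frozen, stages=None):
    """Depth-first traversal of the pruned PFT; returns the list of stages."""
    if stages is None:
        stages = []
    n_v = len(frozen)
    kind = classify(frozen)
    if kind is not None:  # pruned leaf node: one kernel, no recursion
        stages.append((kind, n_v))
        if kind != "RATE0":  # nodes with information bits feed a sorter
            stages.append(("SORT", n_v))
        return stages
    half = n_v // 2
    stages.append(("F", n_v))
    unroll(frozen[:half], stages)
    stages.append(("G", n_v))
    unroll(frozen[half:], stages)
    stages.append(("H", n_v))
    return stages


# Illustrative frozen pattern for a length-8 code (an assumption).
pipeline = unroll([True, True, True, False, True, False, False, False])
```

For this pattern the unpruned tree would have 15 nodes, whereas the pruned traversal yields only seven stages, which illustrates why pruning shortens the pipeline.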

A. Rate-1 Node
The Rate-1 node is an important node of SCL decoding and is described in more detail in the following. Depending on the list size $L$ and the node size $N_v$, the node input consists of $L \times N_v$ LLRs with $L$ corresponding PMs, while it outputs $L_{out} \times N_v$ bits with $L_{out}$ corresponding PMs and path pointers. It is impractical to consider all $L \cdot 2^{N_v}$ possible codewords; thus, different approaches were presented in the literature. The architecture of [12] implements the algorithm of [9], where an approximation is introduced to reduce the number of candidate paths. It splits the decoding path only for the two least reliable bits in the Rate-1 node. Thus, the error-correction performance sustains a loss. It was proved in [21] that the number of path splits needed to avoid any error-correction performance degradation is
$$P = \min(L - 1, N_v). \tag{8}$$
However, the decoder architecture based on this algorithm presented in [20] needs $P + 1$ time steps to process a Rate-1 node. This approach is not suitable for our framework, since the limit for the processing of one stage is one clock cycle. The same holds for the procedure presented in [22], where multiple clock cycles are needed as well.
To generate candidates in the Rate-1 nodes, we apply the following approach: The input LLRs are decoded via HDD in a first step. Then, according to (8), the $P$ lowest absolute values of the LLRs of each input path are identified. This is done according to the first two steps of the sorting procedure described in Section III-B. $L$ bit vectors of length $N_v$ store the estimated positions. Each of these vectors is then used to build $2^P$ bit vectors of length $N_v$ representing split flags, which are XOR-ed with the corresponding HDD estimate of the input. For $L$ inputs of $N_v$ LLRs, $L \cdot 2^P$ partial sums $\beta^v$ are generated in this way. The PMs are calculated according to (7) for each candidate. To keep the complexity manageable, we introduced a restriction of $N_v = 4$ for $L > 4$.
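The candidate generation for one input path of a Rate-1 node can be sketched as below. The sketch assumes P = min(L − 1, N_v) path splits as in [21], identifies the P least reliable positions with a generic software sort instead of the hardware comparison matrix, and accumulates the PM penalty of every flipped position, which is an assumed vector extension of the per-bit PM update. The function name `rate1_candidates` is hypothetical.

```python
from itertools import product


def rate1_candidates(llrs, pm, list_size):
    """Generate the 2**P candidates of one input path of a Rate-1 node.

    llrs      : the N_v input LLRs of this path
    pm        : path metric of this path before the node
    list_size : list size L of the decoder
    returns a list of (new_pm, beta) tuples
    """
    n_v = len(llrs)
    p = min(list_size - 1, n_v)  # number of path splits
    hdd = [0 if a >= 0 else 1 for a in llrs]  # hard decision on every LLR
    # Positions of the P least reliable LLRs (smallest magnitudes).
    weakest = sorted(range(n_v), key=lambda i: abs(llrs[i]))[:p]
    candidates = []
    for flips in product((0, 1), repeat=p):  # all 2**P split-flag patterns
        beta = list(hdd)
        new_pm = pm
        for pos, flip in zip(weakest, flips):
            if flip:
                beta[pos] ^= 1  # XOR the split flag onto the HDD estimate
                new_pm += abs(llrs[pos])  # penalty for leaving the HDD bit
        candidates.append((new_pm, beta))
    return candidates


cands = rate1_candidates([5, -1, 3, -7], pm=0, list_size=4)
```

The candidate with all split flags cleared is the HDD estimate itself and keeps the incoming PM, so it is always the most likely output of the path.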
Decoders without CRC do not need to further explore different candidates in the rightmost Rate-1 nodes of the PFT since there are no more frozen bits which can change the order in the candidate list. The most probable path at this stage will stay the most probable one until the end of the decoding process. Hence, the right-most Rate-1 nodes only have to process the most probable input by HDD. In this way, especially for high code rates, the area and power requirements of the whole decoder can be reduced significantly. This is, to the best of our knowledge, the first implementation that decodes a complete Rate-1 node in one clock-cycle without any error-correction performance loss.

B. Sorter
The sorter unit is a further essential building block in the SCL decoder. The outputs of optimized leaf nodes with information bits, e.g., Rate-1, are connected to sorters. They reduce the number of candidates from $L_{in}$ to the list size $L$ before the next node can start its calculations. Hence, as already mentioned in [17], sorting is a bottleneck for SCL decoders. In [23], an iterative approach is proposed that reduces the sorting complexity by comparing the path metrics bit by bit. Depending on the numerical distribution of the PMs, a reduced number of iterations is required. However, a pipelined architecture requires building blocks with fixed latency. Hence, the advantage of such a dynamic iterative sorting approach cannot be exploited in our fully pipelined and unrolled decoder architecture. Other optimized sorting algorithms like pairwise sorting [24], simplified bubble sort or pruned bitonic sort [25] compare pairs of inputs in a list and switch positions, if required. The optimizations in these sorters focus on reducing the number of comparisons to minimize the pairwise sorting cascades. For example, in the case of bitonic sort, the number of cascaded comparisons, corresponding to the latency, increases with $O(\log^2(L_{in}))$. Since our architecture requires the sorting result in one clock cycle, such an approach is also infeasible.
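To mitigate these drawbacks, in this work, the sorting is performed in a 4-step approach:

1) Compare path metrics
All $L_{in}$ PMs are compared pairwise and the results are stored in a binary matrix $M$, whose entry $m_{i,j}$ indicates whether $PM_j$ is smaller than $PM_i$. Only $L_{in}(L_{in} - 1)/2$ comparisons are needed to fill $M$, since after comparing $PM_i$ with $PM_j$, the values for both $m_{i,j}$ and $m_{j,i}$ are known. The memory to store $M$ can be reduced by omitting the entries for $i = j$. Thus, $M$ is of size $L_{in} \times (L_{in} - 1)$, where the number of ones in row $i$ represents how many PMs are smaller than $PM_i$.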

2) Rate paths
Each row of $M$ represents the corresponding decoding path. To rate them, the row weight $p_i$ is calculated by
$$p_i = \sum_{j \ne i} m_{i,j}$$
for all rows $i \in [0, ..., L_{in} - 1]$ of $M$.

3) Candidate mapping
To realize the input-to-output mapping of the candidates, a multiplexer architecture is required. A vector $c$ of length $L_{in}$ holds the indices of the paths $i \in [0, ..., L_{in} - 1]$ in sorted order and is assigned according to
$$c_{p_i} = i.$$
It is guaranteed by the construction of $p_i$ that the assignment from $i$ to $p_i$ is bijective.

4) Connect sorted candidates
In the last step, the best input paths are assigned in sorted order to the output. This mapping has to be done for all PMs, bit vectors and path pointers for all i ∈ [0, ..., L − 1]. This algorithm enables a single clock cycle implementation of the sorter unit with flexible L in to L configurations instead of restricted 2L to L sorting as used in [17]. The flexibility is necessary to sort the different number of candidates generated by the kernels. Up to list sizes of 8 the sorter unit is not the dominant critical path.
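The four steps above can be sketched in software as a rank-based sort. The sketch keeps the diagonal of M as zeros instead of omitting it and breaks ties in favor of the lower input index so that the ranks form a permutation; the tie-breaking rule and the function name `rank_sort` are assumptions. In hardware, all comparisons and row sums are evaluated in parallel, whereas the loops here only model the result.

```python
def rank_sort(pms, list_size):
    """Return the input indices of the list_size smallest PMs in sorted order."""
    l_in = len(pms)
    # Step 1: pairwise comparisons; each pair (i, j) fills m[i][j] and m[j][i].
    m = [[0] * l_in for _ in range(l_in)]
    for i in range(l_in):
        for j in range(i + 1, l_in):
            # On a tie the lower index i is ranked first (assumption).
            i_after_j = pms[j] < pms[i]
            m[i][j] = int(i_after_j)
            m[j][i] = int(not i_after_j)
    # Step 2: the row weight p[i] is the rank of path i.
    p = [sum(row) for row in m]
    # Step 3: candidate mapping c[p[i]] = i.
    c = [0] * l_in
    for i, rank in enumerate(p):
        c[rank] = i
    # Step 4: connect the best list_size inputs to the outputs.
    return c[:list_size]


best = rank_sort([7, 2, 9, 2, 5], list_size=2)
```

Because every rank is computed from independent comparisons, the latency does not grow with a cascade of compare-and-swap stages, at the price of a number of comparators that grows quadratically with the number of inputs.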

C. CRC On-The-Fly
To improve the Frame Error Rate (FER) performance, Tal and Vardy [4] proposed the concatenation of Polar codes with a short CRC to assist the selection of the best candidate at the end of the decoding process. To avoid an increase in latency due to the additional calculation of the L CRCs, the CRC update has to be carried out in parallel to the decoding. We transferred the CRC update process of [20], which was used in a sequential decoder architecture, to our pipelined decoders. The CRC registers are integrated in and synchronized to the overall memory structure of the decoder. For a given CRC polynomial, the framework automatically generates the corresponding VHDL building blocks. In this paper, we selected a 6-bit CRC defined by the CRC polynomial $g(x) = x^6 + x^5 + 1$, due to its good performance for the chosen codes. For a fair comparison of FER performance with respect to the code rate, the CRC bits are placed in the positions of the best frozen bit channels. Thus, the code rate of the Polar code increases, but the overall code rate, including the CRC, does not change.
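A bit-serial model of the CRC update for g(x) = x^6 + x^5 + 1 is sketched below. It uses the standard shift-register division with a zero initial state and MSB-first bit order, both of which are assumptions, as are the names `crc_update` and `crc_bits`. In an SCL decoder one such register would be kept per path and updated whenever new information bits are decided; how the registers follow the path pointers is not modeled here.

```python
WIDTH = 6
POLY = 0b100001  # x^5 + 1; the leading x^6 term is implicit


def crc_update(reg, bit):
    """Shift one information bit into the CRC register."""
    feedback = ((reg >> (WIDTH - 1)) & 1) ^ bit
    reg = (reg << 1) & ((1 << WIDTH) - 1)
    if feedback:
        reg ^= POLY
    return reg


def crc_bits(bits):
    """Remainder of bits(x) * x^6 divided by g(x), for a zero initial register."""
    reg = 0
    for b in bits:
        reg = crc_update(reg, b)
    return reg


message = [1, 0, 1, 1, 0, 0, 1]
crc = crc_bits(message)
# Append the 6 CRC bits MSB first; a valid word leaves a zero register.
checked = message + [(crc >> (WIDTH - 1 - k)) & 1 for k in range(WIDTH)]
is_valid = crc_bits(checked) == 0
```

Updating the register bit by bit as the decisions become available is what allows the check to finish together with the last bit decision instead of adding latency at the end.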

D. Optimized Register Balancing and Power Optimization
Deeply pipelined high-throughput architectures require a huge amount of memory that costs area and more particularly power. To face this issue, in the architecture of [12], registers of the delay line are removed, while the output registers of all computational elements are retained. The delay line holds α v and β l values until they are needed for the calculation of (2) and (5) respectively.
Thus, the architecture of [12] is partially pipelined with an initiation interval of I = 20, i.e., their decoder accepts a new block and outputs a result only every 20 clock cycles.
Our architectures are fully pipelined with an initiation interval I = 1. Registers are automatically inserted and balanced in the delay lines and computational units for a given frequency constraint f that is input to the framework. In this way, the number of registers is minimized and, hence, the area and power of the generated decoder are optimized for the given f. Note that f fixes the throughput, since a code block of length N is processed every clock cycle (I = 1), yielding a coded throughput of N · f.
Key to this register optimization is a technology-dependent timing characterization of the eight PFT kernels and a timing engine that automatically calculates the critical timing paths in the architecture for a given framework input, i.e., block size, decoding algorithm and list size. The delay of the kernels depends on the list size $L$ and/or the node size $N_v$. Thus, we developed parameterizable VHDL building blocks for each kernel and characterized their timing behavior by performing in total 251 synthesis runs in 28 nm technology for various parameter settings. A least-squares parameter fitting was carried out to approximate the delay of the kernels by quadratic delay functions. For all kernels except the sorter, the delay function t is a quadratic function of these parameters, with k being the corresponding weight factors. These factors were determined by the mentioned parameter fitting for each kernel. In the same way, the delay of the sorting kernel is approximated by a separate delay function. The timing engine of the framework internally uses these timing models to optimally insert and balance the registers along the pipeline stages in the architecture. To further reduce the power and clock load, the framework also supports clock gating, latch-based design and some further optimizations [13].
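The relation between clock frequency, initiation interval and coded throughput can be made concrete with a short calculation. The sketch generalizes N · f to N · f / I for an initiation interval I larger than one; the function names are illustrative.

```python
def coded_throughput_gbps(block_length, f_mhz, initiation_interval=1):
    """Coded throughput in Gbit/s when one block enters every I clock cycles."""
    return block_length * f_mhz * 1e6 / initiation_interval / 1e9


def required_frequency_mhz(block_length, target_gbps, initiation_interval=1):
    """Clock frequency in MHz needed to reach a target coded throughput."""
    return target_gbps * 1e9 * initiation_interval / block_length / 1e6


# N = 128 at 499 MHz, fully pipelined (I = 1)
full = coded_throughput_gbps(128, 499)
# Same block length and clock, but only one block every 20 cycles
partial = coded_throughput_gbps(128, 499, initiation_interval=20)
```

For the N = 128 decoder at 499 MHz from Section IV-A, this gives about 63.9 Gbit/s, which corresponds to the reported 64 Gbit/s after rounding. With an initiation interval of 20 at the same clock, the throughput would drop by the same factor of 20.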

IV. RESULTS
This section presents error-correction performance, implementation results and corresponding trade-offs for various SC and SCL decoders. All codes used in the evaluation were constructed according to [26] with a design Signal-to-Noise-Ratio (SNR) of 0 dB for an Additive White Gaussian Noise (AWGN) channel. The architectures of the decoders were automatically generated with the aforementioned framework and implemented in a 28 nm Fully Depleted Silicon on Insulator (FD-SOI) technology under worst case Process, Voltage and Temperature (PVT) conditions (125 °C, 0.9 V for timing, 1.0 V for power). Synthesis is performed with the Design Compiler, Placement & Routing is carried out with the IC-Compiler, both from Synopsys. Power numbers are calculated with back-annotated wiring data. Error-correction performance simulations are carried out with Binary Phase Shift Keying (BPSK) over an AWGN channel. Channel values and internal LLRs are quantized with 6 bits without fractional bits. The PMs for SCL are quantized with 8 bits. This quantization scheme is used for all presented error-correction performance simulations and implementations and matches floating point precision with negligible FER performance impact. Presented throughput numbers are always given as coded throughput. For the discussion of FER performance and implementation cost, we select an SNR of 4 dB as reference point.
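The quantization described above can be modeled as follows. The sketch assumes BPSK mapping of bit 0 to +1, the usual channel LLR 2y/σ² for an AWGN channel, symmetric saturation of the 6-bit LLRs at ±31, and unsigned saturation of the 8-bit PMs at 255. The saturation ranges, the absence of any additional LLR scaling and the function names are assumptions, since the text only specifies the bit widths.

```python
def channel_llr(y, sigma2):
    """LLR of a BPSK symbol (bit 0 -> +1, bit 1 -> -1) received over AWGN."""
    return 2.0 * y / sigma2


def quantize_llr(llr, bits=6):
    """Round to an integer (no fractional bits) and saturate symmetrically."""
    limit = (1 << (bits - 1)) - 1  # 31 for 6 bits
    q = int(round(llr))
    return max(-limit, min(limit, q))


def saturate_pm(pm, bits=8):
    """Path metrics are non-negative and saturate at the largest code."""
    return min(pm, (1 << bits) - 1)  # 255 for 8 bits


received = [0.9, -1.3, 0.1, -20.0]
quantized = [quantize_llr(channel_llr(y, sigma2=0.5)) for y in received]
```

Symmetric saturation keeps the magnitude range identical for both signs, which is convenient for the min-sum f-function; a two's complement range that also allows −32 would be an equally plausible choice.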

A. Impact of List Sizes
We investigated the impact of list sizes on the error-correction performance and implementation cost. For this, a Polar code of length N = 128 was selected to enable list sizes up to 8 with feasible area consumption. List sizes 2, 4 and 8 with and without CRC were considered.
The maximum achievable frequency for the SCL4-CRC6 decoder was 499 MHz, resulting in a throughput of 64 Gbit/s. Hence, we used this throughput as reference and set 64 Gbit/s, i.e., 499 MHz, as target throughput constraint for all other decoders. The corresponding implementation results are listed in Table I. Fig. 1 shows the corresponding FER performance. We can observe a gain of ~0.2 dB at an FER of $10^{-3}$ (SNR of 4 dB) for list size 2 compared to SC decoding. This gain diminishes for higher SNR. Moreover, it comes with an increase of 2.9× in area and 2.6× in power consumption compared to the SC decoder. List sizes larger than 2 do not provide additional gains for this short code, while the implementation costs increase dramatically.
The situation changes when a CRC is used in the SCL decoder. The FER gain with L = 2 compared to SC decoding is~0.
It is important to mention that, for this short code, list sizes larger than 2 without CRC do not provide much error-correction benefit. In contrast, when using CRC, the FER performance can be further improved with increasing list sizes. List sizes larger than 4, however, only provide little benefit in error correction, which is out of proportion to the large increase in implementation costs.
Fig. 2 and Table II, respectively. Our SCL decoder with list size 2 achieves a throughput of 516 Gbit/s, which is nearly the same throughput as the SC decoder. The SCL decoder with CRC achieves 506 Gbit/s. This is, to the best of our knowledge, the fastest Polar SCL decoder implementation. Fig. 2 shows the gain in error-correction performance of an SCL decoder with list size 2 compared to an SC decoder. At an FER of $10^{-5}$ (SNR of 4 dB), we observe a gain of ~0.3 dB. This gain must be paid for with an increased implementation complexity, i.e., the area increases by 3.7× and the power consumption by 2.7×. The SCL decoder with CRC further improves the FER by ~0.7 dB compared to the SC decoder at the cost of a 3.8× larger area and 2.9× higher power. Corresponding layouts are shown in Fig. 3. The additional implementation costs for the CRC in the SCL decoder are relatively small compared to the additional gain in FER performance (~0.4 dB). To be more precise, the increase in implementation costs does not stem from the CRC itself, but is caused by the need to explore multiple paths in all Rate-1 nodes (see Section III-A). All Rate-1 nodes together occupy a cell area of 0.18 mm², which is 3.2 % of the total decoder cell area, and are highlighted in dark red in Fig. 3b. In contrast, the CRC units only take 0.05 %. Furthermore, the PFT changes when using CRC, since the CRC bits replace frozen bits and thus change the rate of the Polar code (see Section III-C).

C. Comparison with State-of-the-Art
To the best of our knowledge, there is only one known high-throughput implementation of an SCL decoder [12] with list size 2, also implemented in 28 nm FD-SOI technology. A fair comparison is challenging, since [12] was designed for a systematic Polar code, which provides a slightly better BER compared to a non-systematic code. In this paper, we considered a non-systematic Polar code to enable the CRC on-the-fly calculation. To make the comparison as fair as possible, we used the same code size, code rate and internal quantization as the authors of [12]. Note that the constraints on the node sizes given in Section III differ from the ones of the reference, but this has no impact on the error-correction performance. Since this decoder was implemented in the same technology, we generated two decoders with our framework. In one, we constrained the area to the area of [12]. In the other, we constrained the frequency to the frequency of [12]. Implementation results of the two decoders are shown in Table III. According to our comparison methodology, our decoders outperform the SOA in throughput and area efficiency while providing an even better energy efficiency.

V. CONCLUSION
In this paper, we presented a detailed analysis of the trade-offs between error-correction performance and implementation costs for SC and SCL decoders. We studied the impact of list sizes and showed the advantage of CRC-supported SCL decoding. This analysis is based on 10 advanced Polar SCL decoder implementations for very high throughput whose architectures were automatically generated by a framework. The generated decoders can achieve a throughput of up to 516 Gbit/s in 28 nm CMOS FD-SOI technology under worst case PVT conditions.