Strong 8-bit Sboxes with efficient masking in hardware extended version

Block ciphers are arguably the most important cryptographic primitive in practice. While their security against mathematical attacks is rather well understood, physical threats such as side-channel analysis (SCA) still pose a major challenge for their security. An effective countermeasure to thwart SCA is using a cipher representation that applies the threshold implementation (TI) concept. However, there are hardly any results available on how this concept can be adopted for block ciphers with large (i.e., 8-bit) Sboxes. In this work we provide a systematic analysis of and search for 8-bit Sbox constructions that can intrinsically feature the TI concept, while still providing high resistance against cryptanalysis. Our study includes investigations on Sboxes constructed from smaller ones using Feistel, SPN, or MISTY network structures. As a result, we present a set of new Sboxes that not only provide strong cryptographic criteria, but are also optimized for TI. We believe that our results will provide an inspiring basis for further research on high-security block ciphers that intrinsically feature protection against physical attacks.


Introduction
Block ciphers are among the most important cryptographic primitives. Although they usually follow ad hoc design principles, their security with respect to known attacks is generally well understood. However, this is not the case for the security of their implementations. The security of an implementation is often challenged by physical threats such as side-channel analysis or fault-injection attacks. In many cases, those attacks render the mathematical security meaningless. Hence, it is essential that a cipher implementation incorporates appropriate countermeasures against physical attacks. Usually, those countermeasures are developed retroactively for a given, fully specified block cipher. A more promising approach is including the possibility of adding efficient countermeasures into the design from the very start.
For software implementations, this has been done. Indeed, a few ciphers have been proposed that aim to address the issue of protection against physical attacks by facilitating a masked Sbox by design. The first example is certainly NOEKEON [17]; other examples include Zorro [19], PICARO [34] and the LS-design family of block ciphers [20].
For hardware implementations, the situation is significantly different. Here, simple masking is less effective due to several side effects, most notably glitches (see [28]). As an alternative to simple masking, a preferred hardware countermeasure against side-channel attacks is the so-called threshold implementation (TI) [33], as used for the cipher FIDES [6]. TI is a masking variant that splits any secret data into several shares, using a simple secret-sharing scheme. Those shares are then grouped in non-complete subsets to be separately processed by individual subfunctions. All subfunctions jointly correspond to the target function (i.e., the block cipher). Since none of the subfunctions depend on all shares of the secret data at any time, it is intuitive to see that it is impossible to reconstruct the secret by first-order side-channel observations. We provide a more detailed description of the functionality of threshold implementations in Sect. 2.
Unfortunately, it is not trivial to apply the TI concept efficiently to a given block cipher. The success of this process strongly depends on the complexity of the cipher's round function and its internal components. While the linear aspects of any cipher are typically easy to convert to TI, this is not generally true for the nonlinear Sbox. For 4-bit Sboxes, it is possible to identify a corresponding TI representation by exhaustive search [10]. However, for larger Sboxes, in particular 8-bit Sboxes, the situation is very different. In this case, the search space is far too large to allow an exhaustive search. In fact, 8-bit Sboxes are far from being fully understood, from both a cryptographic and an implementation perspective.
With respect to cryptographic strength against differential and linear attacks, the AES Sbox (and its variants) can be seen as holding the current world record. We do not know of any Sbox with better properties, but those might well exist. Unfortunately, despite considerable effort, no TI representation is known for the AES Sbox that does not require any additional external randomness [7,9,32].
Our Contribution In this article we approach the problem of identifying cryptographically strong 8-bit Sboxes that provide a straightforward TI representation. More precisely, our goal is to give examples of Sboxes that come close to the cryptanalytic resistance of the AES Sbox. Also, the direct application of the TI concept to an Sbox should still lead to minimal resource and area costs. This enables an efficient and low-cost implementation in hardware as well as bit-sliced software.
In our work we systematically investigate 8-bit Sboxes that are constructed based on what can be seen as a minicipher. Concretely, we construct Sboxes based on either a balanced Feistel network (operating with two 4-bit branches and a 4-bit Sbox as the round function), an unbalanced Feistel network (operating with two branches of different size and a matching Sbox as the round function), a substitution-permutation network (SPN) or the MISTY network. This general approach has already been used and studied extensively. Sboxes constructed like this are used, for example, in the ciphers Crypton [26,27], ICEBERG [42], Fantomas [20], Robin [20] and Khazad [3]. A more theoretical study was most recently presented by Canteaut et al. [15].
Our idea extends the previous work by combining those constructions, which aim at achieving strong cryptographic criteria, with small Sboxes that are easy to share and intrinsically support the TI concept. As a result of our investigation, we present a set of different 8-bit Sboxes. These Sboxes either (a) are superior to the known constructions from a cryptographic perspective but can still be implemented with moderate resource requirements or (b) outperform all known constructions in terms of efficiency in the application of the TI concept to the Sbox, while still maintaining a level of cryptographic strength comparable to other known Sboxes. All our findings are detailed in Table 1.
Outline This article is structured as follows. Preliminaries on well-known strategies to construct Sboxes as well as the TI concept are given in Sect. 2. We discuss the applicability of TI on known 8-bit Sboxes in Sect. 3. The details and results of the search process are given in Sects. 4 and 5, respectively. We conclude with Sect. 6.

Cryptanalytic properties for Sboxes
In this subsection we recall the tools used for evaluating the strength of Sboxes with respect to linear, differential and algebraic properties. For this purpose, we consider an $n$-bit Sbox $S$ as a vector of Boolean functions: $S = (f_0, \dots, f_{n-1})$. We denote the cardinality of a set $A$ by $\#A$ and the dot product between two elements $a, b \in \mathbb{F}_2^n$ by $\langle a, b \rangle = \sum_{i=0}^{n-1} a_i b_i$.

Nonlinearity
To be secure against linear cryptanalysis [29], a cipher must not be well approximated by linear or affine functions. As the Sbox is generally the only nonlinear component in an SP-network, it has to be carefully chosen to ensure a design is secure against linear attacks. For a given Sbox, the main criterion here is the Hamming distance of any component function, i.e., a linear combination of the $f_i$, to the set of all affine functions. The greater this distance, the stronger the Sbox with respect to linear cryptanalysis. The Walsh transform $W_S(a, b)$, defined as
$$W_S(a, b) = \sum_{x \in \mathbb{F}_2^n} (-1)^{\langle a, x \rangle + \langle b, S(x) \rangle},$$
can be used to evaluate the correlation of a linear approximation $(a, b) \neq (0, 0)$. More precisely,
$$\Pr_x\left[\langle b, S(x) \rangle = \langle a, x \rangle\right] = \frac{1}{2} + \frac{W_S(a, b)}{2^{n+1}}.$$
The larger the absolute value of $W_S(a, b)$, the better the approximation by the linear function $\langle a, x \rangle$ (or the affine function $\langle a, x \rangle + 1$, in case $W_S(a, b) < 0$). This motivates the following well-known definition.
Definition 1 (Linearity) Given a vectorial Boolean function $S$, its linearity is defined as
$$\mathrm{Lin}(S) = \max_{a \in \mathbb{F}_2^n,\ b \in \mathbb{F}_2^n \setminus \{0\}} |W_S(a, b)|.$$
The smaller $\mathrm{Lin}(S)$, the stronger the Sbox is against linear cryptanalysis.
It is known that for any function $S$ from $\mathbb{F}_2^n$ to $\mathbb{F}_2^n$ it holds that $\mathrm{Lin}(S) \geq 2^{\frac{n+1}{2}}$ [16]. Functions that reach this bound are called Almost Bent (AB) functions. However, in the case $n > 4$ and $n$ even, we do not know the minimal value of the linearity that can be reached. In particular, for $n = 8$ the best known nonlinearity is achieved by the AES Sbox with $\mathrm{Lin}(S) = 32$.
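As an illustration of Definition 1, the following minimal Python sketch evaluates the linearity with a fast Walsh-Hadamard transform. It uses the plain inversion in $\mathbb{F}_{2^8}$ (with 0 mapped to 0) as a stand-in for the AES Sbox, which is affine equivalent to it; the helper names (`gf_mul`, `gf_inv`, `walsh_row`, `linearity`) and the choice of the AES reduction polynomial are assumptions made for this example only.

```python
def gf_mul(a, b, poly=0x11B):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (assumed polynomial)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r


def gf_inv(a):
    """Return a^254, i.e. the field inverse for a != 0, with 0 mapped to 0."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


def parity(v):
    """Parity of the bits of v, i.e. the dot product over F_2 after masking."""
    return bin(v).count("1") & 1


def walsh_row(sbox, b):
    """Return [W_S(a, b) for every a] using a fast Walsh-Hadamard transform."""
    v = [1 - 2 * parity(b & y) for y in sbox]  # (-1)^<b, S(x)>
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            for j in range(i, i + h):
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    return v


def linearity(sbox):
    """Lin(S): maximum |W_S(a, b)| over all a and all b != 0 (n-bit to n-bit S)."""
    return max(abs(w) for b in range(1, len(sbox)) for w in walsh_row(sbox, b))


inversion = [gf_inv(x) for x in range(256)]
identity4 = list(range(16))
print(linearity(inversion), linearity(identity4))
```

The transform needs about $n 2^n$ additions per output mask $b$ instead of $2^{2n}$, which matters once many candidate Sboxes have to be evaluated.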

Differential uniformity
A cipher must also be resistant against differential cryptanalysis [5]. To evaluate the differential property of an Sbox, we consider the set of all nonzero differentials and their probabilities (up to a factor $2^{-n}$). That is, given $a, b \in \mathbb{F}_2^n$ we consider
$$\delta_S(a, b) = \#\{x \in \mathbb{F}_2^n : S(x) \oplus S(x \oplus a) = b\},$$
which corresponds to $2^n$ times the probability of an input difference $a$ propagating to an output difference $b$ through the function $S$. This motivates the following well-known definition.

Definition 2 (Differential uniformity) Given a vectorial Boolean function $S$, its differential uniformity is defined as
$$\mathrm{Diff}(S) = \max_{a \in \mathbb{F}_2^n \setminus \{0\},\ b \in \mathbb{F}_2^n} \delta_S(a, b).$$
The smaller $\mathrm{Diff}(S)$, the stronger the Sbox regarding differential cryptanalysis.
It is known that for Sboxes $S$ that have the same number of input and output bits it holds that $\mathrm{Diff}(S) \geq 2$. Functions that reach that bound are called Almost Perfect Nonlinear (APN). While APN functions are known for any number $n$ of input bits, APN permutations are known only in the case of $n$ odd and $n = 6$. In particular, for $n = 8$ the best known case is $\mathrm{Diff}(S) = 4$, e.g., for the AES Sbox.
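Definition 2 translates directly into a table count. The sketch below is a minimal example, assuming that a hexadecimal string lists the outputs $S(0), S(1), \dots, S(15)$ in order; the helper names are hypothetical. It is applied to the 4-bit function 020B300A1E06A452, which this article quotes in the Scream V3 discussion as an APN function.

```python
def parse_sbox(hex_string):
    """Turn a string such as '020B...' into the lookup table [S(0), S(1), ...]."""
    return [int(c, 16) for c in hex_string]


def ddt_row(sbox, a):
    """Return [delta_S(a, b) for every b] for one input difference a."""
    counts = [0] * len(sbox)  # assumes outputs are smaller than len(sbox)
    for x in range(len(sbox)):
        counts[sbox[x] ^ sbox[x ^ a]] += 1
    return counts


def differential_uniformity(sbox):
    """Diff(S): the largest entry of the difference table over all a != 0."""
    return max(max(ddt_row(sbox, a)) for a in range(1, len(sbox)))


apn1 = parse_sbox("020B300A1E06A452")
identity4 = list(range(16))
print(differential_uniformity(apn1), differential_uniformity(identity4))
```

For a linear map such as the identity, every input difference leads to a single output difference, so the count reaches the trivial maximum $2^n$.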

Algebraic degree
The algebraic degree is generally considered as a good indicator of security against structural attacks, such as integral, higher-order differential or, most recently, attacks based on the division property.
Recall that any Boolean function $f$ can be uniquely represented using its Algebraic Normal Form (ANF):
$$f(x) = \sum_{u \in \mathbb{F}_2^n} a_u x^u, \quad a_u \in \mathbb{F}_2,$$
where $x^u = \prod_{i=0}^{n-1} x_i^{u_i}$, with the convention $0^0 = 1$. Now, the algebraic degree can be defined as follows.

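Definition 3 (Algebraic degree) The algebraic degree of $f$ is defined as
$$\deg(f) = \max\{\mathrm{wt}(u) : u \in \mathbb{F}_2^n,\ a_u \neq 0\},$$
where $\mathrm{wt}(u)$ denotes the Hamming weight of $u$. This definition can be extended to vectorial Boolean functions (Sboxes) as follows:
$$\deg(S) = \max_{0 \leq i < n} \deg(f_i).$$
For a permutation on $\mathbb{F}_2^n$ the maximum degree is $n - 1$. Many permutations over $\mathbb{F}_2^n$ achieve this maximal degree. Again the AES Sbox is optimal in this respect, i.e., the AES Sbox has the maximal degree of 7 for 8-bit permutations.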
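The ANF coefficients $a_u$ can be obtained from a truth table with the binary Moebius transform, after which Definition 3 is a one-line maximum. The following sketch is a minimal example with hypothetical helper names; it assumes hexadecimal strings list $S(0), \dots, S(15)$ and uses two 4-bit tables quoted later in this article (the function $G$ from the Scream V3 discussion and the CLEFIA bijection $SS_0$).

```python
def parse_sbox(hex_string):
    """Turn a string such as '0123...' into the lookup table [S(0), S(1), ...]."""
    return [int(c, 16) for c in hex_string]


def anf(truth_table):
    """Binary Moebius transform: truth table of f -> ANF coefficients a_u."""
    a = list(truth_table)
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            for j in range(i, i + h):
                a[j + h] ^= a[j]
        h *= 2
    return a


def degree_bool(truth_table):
    """Largest Hamming weight of u with a_u != 0 (0 for the zero function here)."""
    weights = (bin(u).count("1") for u, c in enumerate(anf(truth_table)) if c)
    return max(weights, default=0)


def degree_sbox(sbox, n_out):
    """Maximum degree over the n_out coordinate functions of the Sbox."""
    return max(degree_bool([(y >> i) & 1 for y in sbox]) for i in range(n_out))


g = parse_sbox("0123457689ABCDFE")    # only bit 0 is nonlinear: y0 = x0 + x1*x2
ss0 = parse_sbox("E6CA872FB14059D3")  # a 4-bit bijection, so its degree is at most 3
print(degree_sbox(g, 4), degree_sbox(ss0, 4))
```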

Affine equivalence
An important tool in our search for good Sboxes is the notion of affine equivalence. We say that two functions $f$ and $g$ are affine equivalent if there exist two affine permutations $A_1$ and $A_2$ such that $f = A_1 \circ g \circ A_2$. The importance of this definition is given by the well-known fact that both the linearity and the differential uniformity are invariant under affine equivalence. That is, two functions that are affine equivalent have the same linear and differential criteria.
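The invariance can be checked numerically on a small example. The sketch below is not a proof, only a sanity check under stated assumptions: it draws two random affine permutations of $\mathbb{F}_2^4$ (with an arbitrarily chosen seed), composes them with the ICEBERG table $S_0$ quoted later in this article, and compares both criteria. All helper names are hypothetical.

```python
import random


def parse_sbox(hex_string):
    return [int(c, 16) for c in hex_string]


def apply_matrix(columns, x):
    """Multiply the bit vector x by the binary matrix with the given columns."""
    y = 0
    for i, col in enumerate(columns):
        if (x >> i) & 1:
            y ^= col
    return y


def random_affine_permutation(n_bits, rng):
    """Lookup table of x -> M*x + c for a random invertible M and random c."""
    size = 1 << n_bits
    while True:
        columns = [rng.randrange(size) for _ in range(n_bits)]
        table = [apply_matrix(columns, x) for x in range(size)]
        if len(set(table)) == size:  # a linear map is invertible iff injective
            c = rng.randrange(size)
            return [y ^ c for y in table]


def parity(v):
    return bin(v).count("1") & 1


def linearity(sbox):
    size = len(sbox)
    return max(
        abs(sum(1 - 2 * parity((a & x) ^ (b & sbox[x])) for x in range(size)))
        for b in range(1, size) for a in range(size)
    )


def differential_uniformity(sbox):
    size = len(sbox)
    return max(
        sum(1 for x in range(size) if sbox[x] ^ sbox[x ^ a] == b)
        for a in range(1, size) for b in range(size)
    )


rng = random.Random(0)                 # seed chosen arbitrarily for repeatability
g = parse_sbox("D7329AC1F45E60B8")     # ICEBERG S_0 as quoted in this article
a1 = random_affine_permutation(4, rng)
a2 = random_affine_permutation(4, rng)
f = [a1[g[a2[x]]] for x in range(16)]  # f = A_1 o g o A_2
print(linearity(f) == linearity(g), differential_uniformity(f) == differential_uniformity(g))
```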

Construction of 8-Bit Sboxes
Apart from the AES Sbox, which is basically the inversion in the finite field $\mathbb{F}_{2^8}$, hardly any primary construction for useful, cryptographically strong, 8-bit Sboxes is known.
However, several secondary constructions have been applied successfully. Here, the idea is to build larger Sboxes from smaller Sboxes. Later, this approach was modified and extended. In particular, it was used by several lightweight ciphers to construct Sboxes with different optimization criteria, e.g., smaller memory requirements, more efficient implementation, involution, and easier software-level masking.
There are basically four known constructions (cf. Fig. 1), all of which can be seen as mini-block ciphers: Feistel networks, the MISTY construction, SP-networks, and the Lai-Massey scheme [25]. Figure 1 shows how these constructions build larger Sboxes from smaller Sboxes. Note that the MISTY construction is a special case of the SPN. Indeed, the MISTY construction is equivalent to the SPN when $F_1 = \mathrm{Id}$ and the matrix $A = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}$. For a small number of rounds, we can systematically analyze the cryptographic properties of those constructions (see [15] for the most recent results). However, for a larger number of rounds, a theoretical understanding becomes increasingly difficult in most cases.
Table 1 shows the different characteristics of 8-bit Sboxes known in the literature that are built from smaller Sboxes. We excluded the PICARO Sbox [34] from the list, since it is not a bijection. Furthermore, Zorro is also excluded since the exact specifications of its structure are not publicly known. We often refer to this table as it summarizes all our findings and achievements.

Threshold implementations
The first attempts to realize Boolean masking in hardware were unsuccessful, mainly due to glitches [28,31]. Combinatorial circuits which receive both the mask and the masked data, i.e., secret sharing with 2 shares, most likely exhibit first-order leakage. Threshold implementation (TI) has been introduced to deal with this issue and realize masking in glitchy circuits [33].
The TI concept has been extended to higher orders [8], but our target, in this work, is resistance against first-order attacks. Hence, we give the TI specifications only with respect to first-order resistance. Let us assume a $k$-bit intermediate value $x$ of a cipher as one of its Sbox inputs (at any arbitrary round) and represent it as $x = \langle x_1, \dots, x_k \rangle$. For Boolean masking of order $n - 1$, $x$ is represented by $(x^1, \dots, x^n)$, where $x = \bigoplus_{i=1}^{n} x^i$ and each $x^i$ similarly denotes a $k$-bit vector $\langle x^i_1, \dots, x^i_k \rangle$. Applying linear functions over Boolean-masked data is trivial, since $L(x) = \bigoplus_{i=1}^{n} L(x^i)$. However, realization of the masked nonlinear functions (Sbox) is generally non-trivial and is thus the main challenge for TI. According to the basic concepts of first-order TI [33], at least $n = t + 1$ shares should be used to securely mask an Sbox with algebraic degree $t$. This restriction comes from the non-completeness property of first-order TI, which we discuss in the following. For a more detailed explanation of the restriction on the number of shares, the interested reader is referred to the original publication [33]. We only briefly introduce the three core properties of TI.
Correctness The masked Sbox should provide the output in a shared form $(y^1, \dots, y^m)$ with $\bigoplus_{i=1}^{m} y^i = y = S(x)$ and $m \geq n$.
Non-completeness Each output share $y^j$, $j \in \{1, \dots, m\}$, is provided by a component function $f^j(\cdot)$ over a subset of the input shares. Each component function $f^j(\cdot)$, $j \in \{1, \dots, m\}$, must be independent of at least one input share.
Uniformity The security of most masking schemes relies on the uniform distribution of the masks. Since in this work we consider only the cases with $n = m$ and bijective Sboxes, we can define the uniformity as follows. The masked Sbox with $n \times k$ input bits and $n \times k$ output bits should form a bijection. Otherwise, the output of the masked Sbox (which is not uniform) will appear at the input of the next masked nonlinear functions (e.g., the Sbox at the next cipher round), and lead to first-order leakage.
Indeed, the challenge is the realization of the masked Sboxes with high algebraic degree. If $t > 2$, we can apply the same trick used in [33] and [35], i.e., decompose the Sbox into quadratic bijections. In other words, if we can write $S = G \circ F$, where both $G$ and $F$ are bijections with $t = 2$, we are able to implement the first-order TI of $F$ and $G$ with the minimum number of shares $n = 3$. Such a construction needs registers between the masked $F$ and $G$ to isolate the corresponding glitches.
After the decomposition, fulfilling all the TI requirements except uniformity is straightforward. As a solution, the authors of [10] proposed to find affine functions $A_1$ and $A_2$ in such a way that $F = A_2 \circ Q \circ A_1$. If we are able to represent a uniform sharing of the quadratic function $Q$, applying $A_1$ on all input shares, and $A_2$ on all output shares gives us a uniform sharing of $F$.
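The three properties can be checked exhaustively for small functions. The following sketch is a toy example under stated assumptions: it uses the standard 3-share direct sharing of a single AND gate $a \cdot b$ and of $a \cdot b + c$, and tests uniformity in the per-output sense (for every unshared input, each valid output sharing must occur equally often). All function names are hypothetical.

```python
from collections import Counter
from itertools import product


def sharings(bit):
    """All 3-share Boolean sharings (s1, s2, s3) with s1 ^ s2 ^ s3 == bit."""
    return [s for s in product((0, 1), repeat=3) if s[0] ^ s[1] ^ s[2] == bit]


def and_shared(a, b):
    """Direct sharing of a*b; output share i never touches input share i."""
    a1, a2, a3 = a
    b1, b2, b3 = b
    y1 = (a2 & b2) ^ (a2 & b3) ^ (a3 & b2)
    y2 = (a3 & b3) ^ (a1 & b3) ^ (a3 & b1)
    y3 = (a1 & b1) ^ (a1 & b2) ^ (a2 & b1)
    return (y1, y2, y3)


def and_xor_shared(a, b, c):
    """Sharing of a*b + c: each output share absorbs one share of c."""
    y1, y2, y3 = and_shared(a, b)
    return (y1 ^ c[1], y2 ^ c[2], y3 ^ c[0])


def is_correct(shared_fn, plain_fn, n_inputs):
    for bits in product((0, 1), repeat=n_inputs):
        for shares in product(*(sharings(v) for v in bits)):
            y = shared_fn(*shares)
            if y[0] ^ y[1] ^ y[2] != plain_fn(*bits):
                return False
    return True


def is_non_complete(shared_fn, n_inputs):
    """Output share i must ignore share i of every input variable."""
    for shares in product(product((0, 1), repeat=3), repeat=n_inputs):
        out = shared_fn(*shares)
        for i in range(3):
            for k in range(n_inputs):
                flipped = [list(s) for s in shares]
                flipped[k][i] ^= 1
                if shared_fn(*map(tuple, flipped))[i] != out[i]:
                    return False
    return True


def is_uniform(shared_fn, n_inputs):
    for bits in product((0, 1), repeat=n_inputs):
        counts = Counter(shared_fn(*s) for s in product(*(sharings(v) for v in bits)))
        if len(counts) != 4 or len(set(counts.values())) != 1:
            return False
    return True


print(is_uniform(and_shared, 2), is_uniform(and_xor_shared, 3))
```

Both sharings are correct and non-complete, but only the AND+XOR variant is uniform, since the fresh-looking shares of $c$ re-randomize the output.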

TI of 4-bit permutations.
In [11] the authors analyze 4-bit permutations and identify 302 equivalence classes. In the following, we use the same notation as in [11] to refer to these classes. Out of these 302, six classes are quadratic. These six quadratic functions, whose uniform TI can be achieved by direct sharing or with simple correction terms (see [11]), are listed in Table 2. We included their minimum area requirements as the basis of our investigations in the next sections. In contrast to the others, $Q_{300}$ also needs to be decomposed for uniform sharing.

TI of Feistel constructions
One easy-to-verify, but interesting and important, observation is that uniformity of a TI for a Feistel construction is actually no additional requirement. The reason for this is simply that the TI is again a Feistel construction itself and therefore a permutation. More precisely, consider a one-round Feistel construction with a round function $F$. That is, we consider (ignoring the final swapping)
$$R(x^L, x^R) = (x^L,\ x^R \oplus F(x^L)).$$
Assume we have $k$ shares $(x^L_i, x^R_i)$ for the inputs with
$$\bigoplus_{i=1}^{k} x^L_i = x^L, \qquad \bigoplus_{i=1}^{k} x^R_i = x^R.$$
Given a non-complete and correct sharing of $F$ as $k$ component functions $f_j$, i.e.,
$$\bigoplus_{j=1}^{k} f_j(x^L_1, \dots, x^L_k) = F(x^L),$$
the TI for $R$ becomes
$$R_{TI}(x^L_1, \dots, x^L_k, x^R_1, \dots, x^R_k) = \left(x^L_1, \dots, x^L_k,\ x^R_1 \oplus f_1(x^L_1, \dots, x^L_k), \dots, x^R_k \oplus f_k(x^L_1, \dots, x^L_k)\right),$$
which, as claimed, is again a Feistel construction itself. Therefore, $R_{TI}$ is in particular a permutation and as such satisfies uniformity. This is also depicted, for the case $k = 3$, in Fig. 2.
Fig. 2 Threshold implementation of a Feistel construction
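This observation can be confirmed exhaustively on a toy instance. The sketch below assumes a small unbalanced Feistel round $(a, b, c) \mapsto (a, b, c \oplus a \cdot b)$ with a 2-bit left branch and a 1-bit right branch, shared with 3 shares per bit; the round function is the direct sharing of an AND gate, which is correct and non-complete but not uniform on its own. Function names are hypothetical.

```python
from itertools import product


def f_shared(a, b):
    """Non-complete 3-share sharing of F(a, b) = a*b (not uniform by itself)."""
    a1, a2, a3 = a
    b1, b2, b3 = b
    return (
        (a2 & b2) ^ (a2 & b3) ^ (a3 & b2),
        (a3 & b3) ^ (a1 & b3) ^ (a3 & b1),
        (a1 & b1) ^ (a1 & b2) ^ (a2 & b1),
    )


def feistel_round_ti(a, b, c):
    """Shared round: left shares pass through, right shares absorb f_j."""
    y = f_shared(a, b)
    return a, b, tuple(ci ^ yi for ci, yi in zip(c, y))


all_shares = list(product((0, 1), repeat=3))
states = list(product(all_shares, repeat=3))        # 2^9 shared states
images = {feistel_round_ti(*s) for s in states}

print(len(states), len(images))
```

Because the left shares are passed through unchanged, the shared round is its own inverse, which is why it is a bijection on all 9 share bits regardless of the non-uniformity of the shared AND.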

Design architectures
Due to the high area overhead of threshold implementations (particularly the size of the shared Sbox), serialized architectures are favored, e.g., in [9,32,35,40]. Our main target in this work is a serialized architecture in which one instance of the Sbox is implemented. Furthermore, we focus on bytewise serial designs due to our underlying 8-bit Sbox target. In such a scenario, the state register forms a shift register that at each clock cycle shifts the state bytes through the Sbox and makes use of the last Sbox output as feedback. Figure 3 depicts three different architectures which we can consider. Note that extra logic is not shown in this figure, e.g., the multiplexers to enable other operations like ShiftRows. A shared Sbox with 3 shares should contain registers, e.g., PRESENT [35] and AES [9,32]. As an example, if the shared Sbox contains 4 stages (see Fig. 3a) and forms a pipeline, all of the Sbox computations can be done in $n + 3$ clock cycles, with $n$ as the number of state bytes. We refer to this architecture as raw in later sections. Note that realizing a pipeline is desirable. Otherwise, the Sbox computations would take $3n + 1$ clock cycles.
As an alternative, we can use the state registers as intermediate registers of the shared Sbox. Figure 3b shows the corresponding architecture, where more multiplexers should be integrated to enable the correct operation (as done, for example, in Skinny [4]). In this case, all $n$ shared Sboxes can be computed in $n$ clock cycles. It is noteworthy that such an optimization is not always fully possible if intermediate registers of the shared Sbox are larger than the state registers (e.g., in case of AES [9,32]).
If the Sbox has been constructed by $k$ times iterating a function $F$, it is possible to significantly reduce the area cost. Figure 3c shows an example. Therefore, similar to a raw architecture without pipeline, $(k - 1)n + 1$ clock cycles are required for $n$ Sboxes. This is not efficient in terms of latency, but is favorable for low-throughput applications, where very low area is available and in particular when SCA protection is desired. We refer to this architecture as iterative.

Threshold implementation of known 8-bit Sboxes
Among 8-bit Sboxes, the AES TI Sbox has been widely investigated while nothing about the TI of other Sboxes can be found in public literature. The first construction of the AES TI Sbox was reported in [32]. The authors made use of the tower-field approach of Canright [14] and represented the full circuit by quadratic operations. By applying second-order Boolean masking, i.e., three shares as minimum following the TI concept, all operations are independently realized by TI. On the other hand, the interconnection between (and concatenation of) uniform TI functions may violate the uniformity. Therefore, the authors integrated several fresh random masks (known as remasking or applying virtual shares [11]) to maintain the uniformity, in total 48 bits for each full Sbox. Since the AES TI Sbox has been considered for a serialized architecture, the authors formed a 4-stage pipeline design, which also increased the area by 138 registers.
Later in [9] three more efficient variants of the AES TI Sbox were introduced. The authors applied several tricks, e.g., increasing the number of shares to 4 and 5 and reducing them back to 3 in order to relax the fresh randomness requirements. Details of all different designs are listed in Table 1. In short, the most efficient design (called nimble) forms a 3-stage pipeline, where 92 extra registers and 32 fresh random bits are required.

CLEFIA
CLEFIA makes use of two 8-bit Sboxes $S_0$ and $S_1$ as depicted in Fig. 4a. The first one is formed by utilizing four different 4-bit bijections and multiplication by 2 in GF($2^4$) defined by the polynomial $X^4 + X + 1$. All four bijections, $SS_0$: E6CA872FB14059D3, $SS_1$: 640D2BA39CEF8751, $SS_2$: B85EA64CF72310D9, and $SS_3$: A26D345E0789BFC1, are cubic and, based on the classification given in [11], belong to classes $C_{210}$, $C_{163}$, $C_{160}$, and $C_{160}$, respectively. Unfortunately, all these classes belong to the non-alternating group and cannot be shared with 3 shares, i.e., no solution exists either by decomposition or remasking. We should use at least 4 shares (which is out of our focus), and a uniform sharing with 4 shares also needs to be done in at least 3 stages. Therefore, a 4-share version of TI $S_0$ can be realized in 6 stages.
The second one is constructed following the AES Sbox, i.e., inversion in GF($2^8$), but with a different primitive polynomial and affine transformations. Based on the observations in [2,37], inversion in one field can be transformed to another field by linear isomorphisms. Therefore, $S_1$ and the AES Sbox are affine equivalent and all difficulties to realize the AES TI Sbox hold true for $S_1$.

Crypton V0.5
Crypton V0.5 utilizes two 8-bit Sboxes, $S_0$ and $S_1$, in a 3-round Feistel, as shown in Fig. 4b. By swapping $P_0$ and $P_2$ the Sbox $S_0$ is converted to its inverse $S_1$. $P_1$: AF4752E693C8D1B0 belongs to the cubic class $C_{295}$. Similar to the subfunctions of CLEFIA, it belongs to the non-alternating group and cannot be shared with 3 shares. In short, at least 4 shares in 3 stages should be used. Further, $P_0$: F968994C626A135F and $P_2$: 04842F8D11F72BEF are quadratic, non-bijective functions, but that does not necessarily mean that their uniform sharing with 4 shares does not exist. We have examined this issue by applying direct sharing [11], and we could not find their uniform sharing with either 3 or 4 shares. In this case, remasking is a potential solution. However, due to the underlying Feistel structure of $S_0$ and $S_1$, the non-uniformity of the shared $P_0$ and $P_2$ does not affect the uniformity of the resulting Sbox as long as the sharing of the Sbox input is uniform. More precisely, the $P_0$ output is XORed with the left half of the Sbox input. If the input is uniformly shared, the input of $P_1$ is uniform regardless of the uniformity of the $P_0$ output. See [8] and [11], where it is shown that $a \cdot b$ (AND gate) cannot be uniformly shared with 3 shares, but $a \cdot b + c$ (AND+XOR) can be uniform if $a$, $b$, and $c$ are uniformly shared. Therefore, a 4-share version of TI $S_0$ (resp. $S_1$) can be realized in 5 stages.

Crypton V1
Crypton V1 Sboxes as shown in Fig. 4c are made of two 4-bit bijections $P_0$: FEA1B58D9327064C, $P_1$: BAD78E05F634192C and their inverses in addition to a linear layer in between. $P_0$ and its inverse $P_0^{-1}$ belong to the cubic class $C_{278}$, which can be uniformly shared with 3 and 4 shares but in 3 stages. Both $P_1$ and its inverse $P_1^{-1}$ are affine equivalent to the non-alternating cubic class $C_{295}$, which, as given above, must be shared with at least 4 shares. Therefore, in order to share each Crypton V1 Sbox, 4 shares in a construction with 6 stages should be used.

ICEBERG
The Sbox of ICEBERG as shown in Fig. 4d is formed by two 4-bit bijections $S_0$: D7329AC1F45E60B8 and $S_1$: 4AFC0D9BE6173582 in a 3-round SPN structure, where permutation $P_8$ is a bit permutation. Both $S_0$ and $S_1$ are affine equivalent to the cubic class $C_{270}$, which needs at least 3 stages to be uniformly shared with 3 shares. Therefore, a uniform sharing of the ICEBERG Sbox with 3 shares can be realized in 9 stages without any fresh randomness. Among the smallest decompositions, we suggest 3DB50E8679F14AC2, $A_4$: AC24E860BD35F971, and for $S_1$ with $A_1$: 63EB50D827AF149C, $A_2$: D159F37BC048E26A, $A_3$: 2AE608C43BF719D5, $A_4$: C5814D09E7A36F2B, and $Q_{294}$: 0123456789BAEFDC.

Fantomas
As shown in Fig. 5a, Fantomas utilizes one 3-bit bijection $S_3$: 03615427 and one 5-bit bijection $S_5$: 00, 03, 12, 07, 14, 17, 04, 11, 0C, 0F, 1F, 0B, 19, 1A, 08, 1C, 10, 1D, 02, 1B, 06, 0A, 16, 0E, 1E, 13, 0D, 15, 09, 05, 18, 01 in a 3-round MISTY construction. $S_3$ is affine equivalent to the quadratic class $Q^3_3$, which can be uniformly shared with 3 shares in at least 2 stages. As a decomposition, we considered $S_3$: The construction of $S_5$, as shown here, consists of 4 Toffoli gates and 4 XORs. The quadratic $F$ and $G$, as well as the linear parts $L_1$ and $L_2$, are correspondingly marked. Hence, we can decompose $S_5$: The uniform sharing of both $F$ and $G$ can be found by direct sharing. Therefore, the Fantomas Sbox can be uniformly shared with 3 shares in 4 stages, without any fresh mask. Figure 5b depicts the block diagram representation, and the area requirements are listed in Table 1. The Sbox cannot be implemented iteratively, and each Sbox computation has a latency of 4 clock cycles. However, a pipeline design can send out Sbox results in consecutive clock cycles, but with a 4-clock-cycle latency.

Khazad
Khazad utilizes the Anubis Sbox, which is also based on a 3-round SPN as shown in Fig. 4e. Besides a bit permutation layer, the two 4-bit bijections $P$: 3FE054BCDA967821 and $Q$: 9E56A23CF04D7B18 are utilized to form the 8-bit Sbox. Similar to ICEBERG, both $P$ and $Q$ belong to the cubic class $C_{270}$. Therefore, the uniform sharing of the Khazad (resp. Anubis) Sbox can be realized in 9 stages without fresh masks. For the decomposition, we suggest

Robin
Robin is constructed based on the 3-round Feistel, similar to Crypton V0.5, but a single 4-bit bijection $S_4$ plays the role of all functions $P_0$, $P_1$, and $P_2$. Although the swap of the nibbles in the last Feistel round is omitted, the Robin Sbox is the only known 8-bit Sbox which can be implemented in an iterative fashion. $S_4$: 086D5F7C4E2391BA has been taken from [43], known as the Class 13 Sbox. $S_4$ is affine equivalent to the cubic class $C_{223}$ and, as stated above, can be uniformly shared with 3 shares in 2 stages. As one of the smallest solutions we considered Therefore, with no extra fresh randomness we can realize uniform sharing of the Robin Sbox with 3 shares in 6 stages.
In order to implement this construction, we have four different options. A block diagram of the design is shown in Fig. 5c (the registers filled by the gray color are essential for pipeline designs). Note that extra control logic (such as multiplexers) is required for all iterative designs, which is excluded from Fig. 5c and Table 1 for the sake of clarity.

Scream V3
The structure of Scream V3 is similar to that of Crypton V0.5, i.e., 3-round Feistel. $P_0$ and $P_2$ are replaced by two almost perfect nonlinear (APN) functions $APN_1$: 020B300A1E06A452 and $APN_2$: 20B003A0E1604A25, and $P_1$ by $S_1$: 02C75FD64E8931BA. Similar to Crypton V0.5, the two APN functions are not bijective. However, they are cubic rather than quadratic. The source of these two APNs is the construction given in [15]. We can decompose both of them into two quadratic functions as $APN_1 = F \circ G$ and $APN_2 = F \circ (\oplus 1) \circ G$, with $F$: 020B30A01E06A425 and $G$: 0123457689ABCDFE. By $(\oplus 1)$ we represent an identity followed by XOR with constant 1, i.e., flipping the least significant bit. Uniform sharing of $G$ with 3 shares can be easily achieved by direct sharing. $F$, however, cannot be easily shared. $F$ consists of three 2-input AND gates which directly give three output bits. To the best of our knowledge, $F$ cannot be uniformly shared without applying remasking. However, as stated for Crypton V0.5, the non-uniformity of $F$ (in general $APN_1$ and $APN_2$) does not play any role if $S_1$ is uniformly shared.
$S_1$ is affine equivalent to the cubic class $C_{223}$, which can be uniformly shared in 2 stages with 3 shares. Therefore, the Scream V3 Sbox can be shared by 3 shares in 6 stages, without any fresh random masks. There are many options to decompose $S_1$; as one of the smallest solutions we suggest $S_1$:
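The stated decomposition of the two APN functions can be replayed on the lookup tables. The sketch below assumes that each hexadecimal string lists the outputs for inputs 0 to 15 in order and that $F \circ G$ means applying $G$ first; the helper names are hypothetical.

```python
def parse_sbox(hex_string):
    """Turn a string such as '020B...' into the lookup table [S(0), S(1), ...]."""
    return [int(c, 16) for c in hex_string]


def to_hex(table):
    return "".join(format(v, "X") for v in table)


def compose(outer, inner):
    """Table of outer o inner, i.e. x -> outer[inner[x]]."""
    return [outer[v] for v in inner]


F = parse_sbox("020B30A01E06A425")
G = parse_sbox("0123457689ABCDFE")
XOR1 = [x ^ 1 for x in range(16)]       # identity followed by flipping the LSB

apn1 = compose(F, G)                    # F o G
apn2 = compose(F, compose(XOR1, G))     # F o (xor 1) o G

print(to_hex(apn1), to_hex(apn2))
```

Both compositions reproduce the tables given in the text, which also confirms the assumed composition order.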

Whirlpool
Whirlpool employs three different 4-bit bijections $E$, $E^{-1}$ and $R$ in a Lai-Massey scheme depicted in Fig. 4f. $E$: 1B9CD6F3E874A250 and its inverse are affine equivalent to the cubic class $C_{278}$, which can be uniformly shared with 3 shares in at least 3 stages. $R$: 7CBDE49F638A2510 belongs to the cubic class $C_{270}$. As given for ICEBERG and Khazad, $C_{270}$ needs 3 stages for a uniform sharing with 3 shares. Hence, the entire Whirlpool Sbox can be uniformly shared with 3 shares in 9 stages, without any extra randomness. The decomposition of $R$ is similar to that of Khazad, i.e., $R$: However, the decompositions of $E$ and $E^{-1}$ are more costly. One of the cheapest solutions is
Except for CLEFIA, Crypton V0.5, and Crypton V1, which require a minimum of 4 shares, we have implemented TI for all the aforementioned Sboxes, and have given their area requirements as well as the number of stages (clock cycles) in Table 1. For the synthesis, we used Synopsys Design Compiler with the UMCL18G212T3 [44] ASIC standard cell library, i.e., UMC 0.18 µm technology node. It is noteworthy that among all the Sboxes we covered, the Robin Sbox is the only one which can be iteratively implemented. We should also emphasize that Midori [1] and Skinny [4] (in their 128-bit versions) make use of 8-bit Sboxes. Midori 8-bit Sboxes are made by concatenating two 4-bit Sboxes and the Skinny one by four times iterating an 8-bit quadratic bijection. In both cases their differential uniformity and linearity are 64 and 128, respectively, which is notably weaker than the strong 8-bit Sboxes listed in Table 1. Therefore, we did not consider them in our investigations.

Finding TI-compliant 8-bit Sboxes
Our goal is to find strong 8-bit Sboxes which can be efficiently implemented as threshold implementations.To this end, we incorporate the idea of building an 8-bit Sbox from smaller Sboxes in our search.In particular, we aim to construct a round function that can be easily shared and iterated to generate a cryptographically strong Sbox.Easily shareable in our context refers to functions for which an efficient uniform shared representation is known.Thus, if we find a function with these properties, the resulting sequence of round functions will be a good cryptographic Sbox which can be efficiently masked.As done previously, we concentrate on the three basic constructions mentioned above: Feistel, SPN, and MISTY.As the number of possible choices for SPN is too large for an exhaustive search, we focus on two special cases for the linear layer of the SP-network.First, instead of allowing general linear layers we focus on bit permutations only.Those have the additional advantage of being basically for free, both in hardware and in a (bit-sliced) software implementation.Second, we focus on linear layers which correspond to matrix multiplications over F 16 .Those cover the MISTY construction as a special case.
In all cases, the building blocks for our round function are 4-bit Sboxes.As described in Sect.2, those Sboxes are well-analyzed and understood regarding both their threshold implementation [11] and their cryptographic properties.To minimize the number of required shares, we mainly consider functions with a maximum degree of two.Additional shares, otherwise, may increase the area or randomness requirements for the whole circuit.In [11], six main quadratic permutation classes are identified which are listed in Table 2.All existing quadratic 4-bit permutations are affine equivalent to one of those six.However, it should be noted that permutations of class Q 4 300 cannot be easily shared with three shares without decomposition or additional randomness.Therefore, we mainly focus on the other classes from our search.Note that we include the identity function A 4 0 in the case of the SPN construction.Since the identity function does not require any area, round functions based on a combination of identity and one quadratic 4-bit permutation can result in very lightweight designs.
One important difference to all previous constructions listed in Table 1 is that we consider a higher number of iterations for our constructions. This is motivated by two observations. First, it might allow us to improve upon the cryptographic criteria. Second, it might be beneficial to use a simpler round function, in particular one that can be implemented in one stage, more often rather than a more complicated round function with a smaller number of iterations. As can be seen in Table 1, this approach of increasing the number of iterations is quite successful in many cases.
Next we describe in detail the search for good Sboxes for each of the three constructions we considered.

Feistel construction
As a first construction, we examine round functions using a Feistel network similar to Fig. 1a. By the basic approach described below, we were able to exhaustively investigate all possible constructions based on any 4-bit to 4-bit function for any number of iterations between 1 and 5. This can be seen as an extension (in the case of n = 4 and for identical round functions) to the results given in [15], where up to 3 rounds have been studied.
However, such an exhaustive search is not possible in a naive way. As there are $2^{64}$ 4-bit functions and checking the cryptographic criteria of an n-bit Sbox requires roughly $2^{2n}$ basic operations, a naive approach would need more than $2^{80}$ operations.
Fortunately, this task can be accelerated by exploiting the distinct structure of Feistel networks while still covering the entire search space.
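We recall the definition of a Feistel round for the function $F : \mathbb{F}_2^4 \to \mathbb{F}_2^4$:

$$\mathrm{Feistel}^1_F : \mathbb{F}_2^4 \times \mathbb{F}_2^4 \to \mathbb{F}_2^4 \times \mathbb{F}_2^4, \quad (L, R) \mapsto (R \oplus F(L), L).$$

We denote by $\mathrm{Feistel}^n_F$ the nth functional power of $\mathrm{Feistel}^1_F$, i.e.,

$$\mathrm{Feistel}^n_F = \underbrace{\mathrm{Feistel}^1_F \circ \cdots \circ \mathrm{Feistel}^1_F}_{n \text{ times}}.$$

To reduce the search space, we show below that if $G = A \circ F \circ A^{-1}$ for an invertible affine function $A$, then $\mathrm{Feistel}^n_F$ is affine equivalent to $\mathrm{Feistel}^n_G$. Thus, we can reduce our search space from all $2^{64}$ functions to roughly $2^{46.50}$ functions. Indeed, Brinkmann classified all 4-bit to 4-bit functions up to extended affine equivalence [13]: there are 4713 equivalence classes. The following proposition summarizes the equivalence we described above.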

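Proposition 1 Let $F$ and $G$ be such that there exists an invertible affine function $A$ such that $G = A \circ F \circ A^{-1}$. Then $\mathrm{Feistel}^n_G$ is affine equivalent to $\mathrm{Feistel}^n_F$.

Indeed, $\mathrm{Feistel}^1_G$ is obtained from $\mathrm{Feistel}^1_F$ by composing it with an affine permutation on one side and the inverse of that permutation on the other side, so in the n-fold composition the middle terms cancel each other. Thus, $\mathrm{Feistel}^n_G$ is affine equivalent to $\mathrm{Feistel}^n_F$, and both have the same cryptanalytic properties. In Fig. 6 we show the two equivalent representations of $\mathrm{Feistel}^1_G$.

Now, it is enough to consider all functions of the form $A_1 \circ F + C$, where $F$ is a representative of one of the 4713 classes, $A_1$ is an affine permutation and $C$ is any linear mapping on 4 bits. Since the resulting Sboxes only need to be considered up to affine equivalence, doing so reduces the search space to

$$4713 \cdot 322560 \cdot 2^{16} \approx 2^{46.50}. \qquad (1)$$

Fig. 6 Illustration of the two equivalent representations of $\mathrm{Feistel}^1_G$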
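The construction and the equivalence argument can be checked empirically with a minimal sketch. It assumes the round $(L, R) \mapsto (R \oplus F(L), L)$ on 4-bit halves, and for simplicity it uses a linear (not general affine) permutation `A`; the tables `F_EXAMPLE` and `A_EXAMPLE` are arbitrary illustrative choices. The last lines reproduce the estimate of roughly $2^{46.50}$ under the reading that it counts 4713 class representatives, the 16 · 20160 affine permutations on 4 bits, and $2^{16}$ linear mappings.

```python
import math

def feistel_sbox(F, n_iter):
    """8-bit Sbox from n_iter rounds (L, R) -> (R ^ F(L), L) on 4-bit halves."""
    sbox = []
    for x in range(256):
        L, R = x >> 4, x & 0xF
        for _ in range(n_iter):
            L, R = R ^ F[L], L
        sbox.append((L << 4) | R)
    return sbox

def conjugation_holds(F, A, n_iter):
    """Check Feistel^n_G = B o Feistel^n_F o B^-1 for G = A o F o A^-1.

    A must be a linear permutation of 4-bit values (lookup table);
    B applies A to both halves of the 8-bit state.
    """
    A_inv = [0] * 16
    for x, y in enumerate(A):
        A_inv[y] = x
    G = [A[F[A_inv[x]]] for x in range(16)]
    S_F = feistel_sbox(F, n_iter)
    S_G = feistel_sbox(G, n_iter)

    def B(x):
        return (A[x >> 4] << 4) | A[x & 0xF]

    def B_inv(x):
        return (A_inv[x >> 4] << 4) | A_inv[x & 0xF]

    return all(S_G[x] == B(S_F[B_inv(x)]) for x in range(256))

# Arbitrary 4-bit function (not necessarily a permutation) and a linear permutation.
F_EXAMPLE = [(7 * x * x + 3 * x + 1) & 0xF for x in range(16)]
A_EXAMPLE = [x ^ ((x << 1) & 0xF) for x in range(16)]

# Search-space estimate: classes * affine permutations * linear maps.
AFFINE_PERMUTATIONS = 16 * (15 * 14 * 12 * 8)   # 16 constants times |GL(4, 2)|
SEARCH_SPACE_LOG2 = math.log2(4713 * AFFINE_PERMUTATIONS * 2**16)
```

Because conjugation by an affine permutation preserves differential uniformity and linearity, only one function per conjugacy orbit has to be evaluated.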

Unbalanced Feistel construction
While the aforementioned description considers splitting an 8-bit input into two 4-bit parts $L$, $R$, we also conducted a smaller-scale experiment on unbalanced Feistel networks. For this, the n-bit input is split into two parts $L$, $R$ with lengths $l_L \neq l_R$, where $l_L + l_R = n$. For the common Feistel network to operate, we need an $l_L$-bit to $l_R$-bit round function.
As before, it is not practicable to search through all possible round functions for a given parameter set. Thus, we reduce our search space by considering only quadratic round functions. This allows us to investigate whether better candidates exist for the unbalanced case than for the balanced case. Also, for certain splitting parameters we can search up to significantly higher iteration counts than was feasible for the balanced case. However, for the promising (6,2) and (5,3) splits the search spaces are still $2^{44}$ and $2^{48}$, respectively, for a given number of iterations. This is similar to the search space of the balanced Feistel construction.
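A minimal sketch of the unbalanced variant follows. It assumes that after each round the state $(R \oplus F(L)) \,\|\, L$ is re-split into the top $l_L$ bits and the bottom $l_R$ bits, which is one common convention and is not spelled out in the text; `toy_F` is a hypothetical quadratic round function. The helper `quadratic_count_log2` counts coefficients of the algebraic normal form (constant, linear, and quadratic terms per output bit) and reproduces the quoted sizes $2^{44}$ and $2^{48}$.

```python
import math

def unbalanced_feistel_sbox(F, l_left, n_iter, n_bits=8):
    """Iterate (L, R) -> (R ^ F(L), L) with an l_left-bit L and (n_bits - l_left)-bit R.

    F is a lookup table with 2**l_left entries, each below 2**(n_bits - l_left).
    Assumption: the new state is the concatenation (R ^ F(L)) || L, re-split
    into the top l_left bits and the remaining low bits for the next round.
    """
    l_right = n_bits - l_left
    mask_r = (1 << l_right) - 1
    sbox = []
    for x in range(1 << n_bits):
        state = x
        for _ in range(n_iter):
            L = state >> l_right            # top l_left bits
            R = state & mask_r              # low l_right bits
            state = ((R ^ F[L]) << l_left) | L
        sbox.append(state)
    return sbox

def quadratic_count_log2(l_left, l_right):
    """log2 of the number of functions of degree <= 2 from l_left to l_right bits."""
    coefficients_per_output_bit = 1 + l_left + math.comb(l_left, 2)
    return l_right * coefficients_per_output_bit

def toy_F(L):
    """Hypothetical quadratic 6-bit to 2-bit function, for illustration only."""
    b = [(L >> i) & 1 for i in range(6)]
    return ((b[0] & b[1]) ^ b[2]) | (((b[3] & b[4]) ^ b[5]) << 1)

TOY_TABLE = [toy_F(L) for L in range(64)]
```

Under this convention every round is invertible for any round function, so the result is always a permutation of the 256 inputs.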

Implementation details
The search spaces are not infeasible but still large. As such, we used multiple GPUs to speed up the exhaustive search process. For a given search we fix the iteration count and iterate over all round functions. The final version of the implementation uses $2^{26}$ threads per GPU, and each group of 64 threads on the GPU receives a candidate round function. We then compute the Sbox in parallel within that thread group, each thread processing 4 of the 256 inputs to the Sbox. Doing so gives us the full Sbox.
A straightforward subsequent step would be to simply compute the differential uniformity and linearity of this Sbox, storing the Sboxes on the host if they meet our criteria. However, fully computing these properties is fairly expensive, especially when compared to the initial computation of the Sbox. Instead, we want to prune bad Sbox candidates as soon as possible. Note that in the differential uniformity computation we find the highest count of the values $b = S(x) \oplus S(x \oplus a)$ for $a, b \in \mathbb{F}_2^n$ by iterating over all $x$ for a specific $a$. Then we take the maximum over all $a$ to get the final differential uniformity. Because we always take maximums, for any intermediate count we know that the differential uniformity cannot be lower than this value. Thus, if the intermediate count is too high, we can simply prune the whole candidate early. Note that in practice we need to be careful about synchronizing the branching within the thread groups since we want to avoid diverging threads. Making this task easier is the reason why we compute one Sbox using multiple threads as opposed to one Sbox per thread.
Moreover, if we consider again the aforementioned computation of $b$, we see that there are always two values for $x$ that lead to the same $b$, i.e., $x$ and $x \oplus a$. This means that we can halve our input space for $x$ and simply increment each $b$ twice, taking care to iterate over $x$ in such an order that we do not have duplicate values for either $x$ or $x \oplus a$.
The last trick for the uniformity we want to mention is that, because the initial Sbox computation is also done by iterating over all Sbox inputs $x$, we can interleave the differential uniformity computation for a specific $a$ with the computation of the Sbox. This leads to even quicker short-circuiting behavior, as we do not even need to compute the full Sbox. Note that the computation of $b$ depends on only two Sbox outputs, i.e., those for $x$ and $x \oplus a$. Thus, in the same way as before, we iterate over all $x$ in such an order that we can compute the outputs for $x$ and $x \oplus a$ without duplication, thereafter counting the $b$ we receive twice.
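The pruning and halving tricks described above can be sketched sequentially as follows. This is a plain Python illustration, not the GPU implementation; the function name and the `bound` parameter are assumptions, and the interleaving with the Sbox computation is not shown.

```python
def differential_uniformity(sbox, bound=None):
    """Differential uniformity of sbox, or None as soon as it exceeds bound.

    For each nonzero input difference a, every unordered pair {x, x ^ a} is
    visited once and counted twice, since x and x ^ a yield the same b.
    """
    n = len(sbox)
    best = 0
    for a in range(1, n):
        counts = [0] * n
        for x in range(n):
            if x > x ^ a:          # pair already handled from its smaller member
                continue
            b = sbox[x] ^ sbox[x ^ a]
            counts[b] += 2
            if counts[b] > best:
                best = counts[b]
                # Intermediate counts are lower bounds on the final value,
                # so a candidate can be discarded immediately.
                if bound is not None and best > bound:
                    return None
    return best
```

Calling `differential_uniformity(sbox, bound=8)` returns `None` for any candidate whose uniformity exceeds 8, usually after inspecting only a small part of the difference distribution table.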
This short-circuiting behavior is very effective at pruning bad candidates early. In practice, relatively few Sboxes survive if we set the upper bound on the differential uniformity to 8. Therefore, we do not need the linearity computation on the GPU and can simply do that on the host machine in a separate step. To accommodate the relatively few cases where we generate many low-uniformity Sboxes, we also include the option to perform the linearity computation on the GPU.
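For the host-side step, a minimal sketch of a linearity computation is given below. It assumes the usual definition of linearity as the largest absolute Walsh coefficient over all nonzero output masks (the definition itself is not restated in this section) and uses a fast Walsh-Hadamard transform; the function name is an assumption.

```python
def linearity(sbox):
    """Largest absolute Walsh coefficient over all nonzero output masks b."""
    n = len(sbox)
    best = 0
    for b in range(1, n):
        # Sign vector of the component function x -> parity(b & S(x)).
        w = [1 - 2 * (bin(b & y).count("1") & 1) for y in sbox]
        # In-place fast Walsh-Hadamard transform over all input masks.
        h = 1
        while h < n:
            for i in range(0, n, 2 * h):
                for j in range(i, i + h):
                    u, v = w[j], w[j + h]
                    w[j], w[j + h] = u + v, u - v
            h *= 2
        best = max(best, max(abs(c) for c in w))
    return best
```

Unlike the differential uniformity, this computation has no natural early exit for good candidates, which is one reason to run it only on the few Sboxes that survive the pruning.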

SPN construction with bit permutations as the linear layer
In addition to Feistel networks, we examined round functions which are similar to Fig. 1c. However, A is replaced by an XOR with a constant followed by an 8-bit permutation. Depending on $F_1$ and $F_2$, this construction can lead to very lightweight round functions since constant addition and simple bit permutations are very efficient in hardware circuits.
For $F_1$ and $F_2$ we consider the five quadratic permutations (listed in Table 2) as well as the identity function (denoted by $A^4_0$). Obviously, we exclude the combination $F_1 = F_2 = A^4_0$. There are 8! different 8-bit permutations and 256 possibilities for the constant addition. If we looked at all combinations of all affine equivalents of the chosen functions, the number of Sboxes to test would clearly not be feasible. Therefore, we decide to restrict the number of possibilities for each of the two functions. In particular, we only consider the representative of each class as presented in [11], without affine equivalents. This reduces the search space to a size which can be completely processed. Similar to the Feistel network, it is possible to further reduce the complexity of the search. To this end, we first define the round function for this type of Sbox as

$$\mathrm{BitPerm}^1_{F_1,F_2,C,P}(L \,\|\, R) = P\big((F_1(L) \,\|\, F_2(R)) \oplus C\big),$$

where $\|$ denotes the concatenation of the two parts. Furthermore, it can be trivially seen that for every combination of an 8-bit permutation $P_1$ and an 8-bit constant $C_1$ there exists a complementary combination of an 8-bit permutation $P_2$ and an 8-bit constant $C_2$ such that $\mathrm{BitPerm}^1_{F_1,F_2,C_1,P_1}$ is the same as $\mathrm{BitPerm}^1_{F_2,F_1,C_2,P_2}$. Thus, the search can be sped up, and we only need to check

$$\#\text{Sboxes} = 256 \cdot 8! \cdot 20 \cdot 10 \approx 2^{31} \qquad (4)$$

Sboxes for this type of round function.
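A minimal sketch of this round function is given below. It assumes the order "apply $F_1$ and $F_2$, add the constant, then permute the bits", with bit 0 as the least significant bit; `toy_quadratic` is a hypothetical quadratic permutation used only for illustration and is not claimed to be one of the class representatives of [11]. The last lines reproduce the count in (4) under the reading that 20 is the number of unordered pairs $(F_1, F_2)$ without the identity pair and 10 is the number of iteration counts.

```python
import math

def toy_quadratic(x):
    """Hypothetical quadratic 4-bit permutation: flip bit 0 iff bits 3 and 2 are set."""
    return x ^ ((x >> 3) & (x >> 2) & 1)

def identity(x):
    return x

def permute_bits(x, perm):
    """Move bit i of x to position perm[i] (bit 0 = least significant, assumed)."""
    y = 0
    for i, p in enumerate(perm):
        y |= ((x >> i) & 1) << p
    return y

def bitperm_round(x, F1, F2, C, perm):
    """One round: P((F1(L) || F2(R)) XOR C) with 4-bit halves L and R."""
    L, R = x >> 4, x & 0xF
    return permute_bits(((F1(L) << 4) | F2(R)) ^ C, perm)

def bitperm_sbox(F1, F2, C, perm, n_iter):
    """8-bit Sbox obtained by iterating the round function n_iter times."""
    sbox = []
    for x in range(256):
        y = x
        for _ in range(n_iter):
            y = bitperm_round(y, F1, F2, C, perm)
        sbox.append(y)
    return sbox

# 15 unordered pairs of distinct functions plus 5 pairs (Q, Q);
# the pair (identity, identity) is excluded.
PAIRS = math.comb(6, 2) + 5
SEARCH_SPACE_LOG2 = math.log2(256 * math.factorial(8) * PAIRS * 10)
```

For example, `bitperm_sbox(toy_quadratic, identity, 0x5A, [1, 2, 3, 4, 5, 6, 7, 0], 3)` builds a three-round candidate in which only one half passes through a nonlinear function per round.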

SPN construction with $\mathbb{F}_{16}$-linear layers only
For the last type of construction, we consider another special case of the construction depicted in Fig. 1c. Here we restrict ourselves to the case where A corresponds to a multiplication with a 2 × 2 matrix with elements from $\mathbb{F}_{16}$. Additionally, a constant is again added to the outputs of $F_1$ and $F_2$. As noted before, a special case of this construction is the MISTY technique.
For $F_1$ and $F_2$ we consider the five quadratic functions and the identity function. Just like for the bit permutation round function, it is not feasible to check all affine equivalents. Therefore, we limit our search to these functions. The field multiplication is performed with the commonly used polynomial $X^4 + X + 1$ [22]. Given that the matrix needs to be invertible and provide some form of mixture between the two halves, this leaves us with 61200 possibilities for the matrix multiplication. It is further possible to apply the same optimization as for permutation-based round functions. Therefore, we need to check

$$\#\text{Sboxes} = 256 \cdot 61200 \cdot 20 \cdot 10 \approx 2^{31.5} \qquad (5)$$

Sboxes for this type of round function.
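The linear layer and the matrix count can be sketched as follows. The sketch assumes the reduction polynomial $X^4 + X + 1$ and a row-major reading of the matrix entries, i.e., the first output half is $E_1 L_M + E_2 R_M$; both are assumptions made for illustration. The brute-force count of invertible 2 × 2 matrices over $\mathbb{F}_{16}$ arrives at 61200, and the last line reproduces the estimate in (5).

```python
import math

def gf16_mul(a, b):
    """Multiply in F_16 = F_2[X] / (X^4 + X + 1) (assumed reduction polynomial)."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13          # reduce by X^4 + X + 1
    return r

def matrix_layer(x, E):
    """Apply the matrix [[E1, E2], [E3, E4]] to (L_M, R_M); layout is an assumption."""
    L, R = x >> 4, x & 0xF
    E1, E2, E3, E4 = E
    top = gf16_mul(E1, L) ^ gf16_mul(E2, R)
    bottom = gf16_mul(E3, L) ^ gf16_mul(E4, R)
    return (top << 4) | bottom

def count_invertible_matrices():
    """Count 2 x 2 matrices over F_16 with nonzero determinant E1*E4 + E2*E3."""
    count = 0
    for E1 in range(16):
        for E2 in range(16):
            for E3 in range(16):
                for E4 in range(16):
                    if gf16_mul(E1, E4) ^ gf16_mul(E2, E3):
                        count += 1
    return count

SEARCH_SPACE_LOG2 = math.log2(256 * 61200 * 20 * 10)
```

The count does not depend on which irreducible polynomial is used, only on the field size: $(16^2 - 1)(16^2 - 16) = 61200$.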

Results
We completed the search for the three aforementioned types of round functions with up to ten iterations. The search for balanced Feistel networks for all 4713 classes takes around two weeks on a machine with four NVIDIA K80s for a specific set of parameters. In particular, the performance depends on the bounds defined by cryptographic properties (differential uniformity) as well as the iteration count of the network. Note that, with respect to cryptographic criteria, our search shows that for iterations ≤ 5 no 8-bit balanced Feistel with identical round functions can have a linearity below 56 and a differential uniformity below 8.
With respect to unbalanced Feistel networks, on the same machine as above it takes approximately five days to search through all $2^{44}$ quadratic round functions for 4 iterations. The same caveat as before applies, though: performance varies depending on the iteration count and, in particular, on the likelihood of low-uniformity Sboxes. Our search yields the following observations. For splits where $l_R > l_L$, e.g., the (2,6) split, no Sboxes up to 30 iterations exist that match (or improve on) the balanced Feistel network's differential uniformity of 8 and linearity of 56. For the (7,1) split we can match these values after 8 iterations. For the (6,2) split we found candidates after 4 iterations that match these values, but no better values could be found up to 5 iterations.
Furthermore, the search for SPNs with bit permutations (resp. with $\mathbb{F}_{16}$-linear layer) required around 48 hours (resp. 54 hours) on one Intel Xeon CPU with 12 cores. It was possible to detect some very basic relations between the security, number of iterations and area of the Sbox. Figure 7 shows the smallest differential uniformity and linearity values which can be achieved for a specific number of iterations using a round function based on the $\mathbb{F}_{16}$-linear layer with constant addition. As expected, the more iterations are applied, the higher the resistance against linear and differential cryptanalysis that can be achieved. The size of each of the considered quadratic permutations is given in Table 2. Bigger functions like $Q^4_{293}$ and $Q^4_{299}$ achieve good cryptographic properties with fewer iterations than smaller functions like $Q^4_4$. For the other combinations of $(F_1, F_2)$ and types of round functions the graphs behave similarly. Depending on the remaining layers of the cipher and the targeted use case, a designer needs to find a good balance between the parameters. In the following, we present a few selected Sboxes optimized for different types of applications.

Fig. 7 The smallest achievable differential uniformity and linearity for each number of iterations for round functions with $\mathbb{F}_{16}$-linear layers. a Differential uniformity, b Linearity
In our evaluation, we only consider Sboxes with a differential uniformity of at most 16 and a linearity of at most 64. These are the worst properties among the constructed 8-bit Sboxes listed in Table 1. From the cryptographic standpoint, our Sboxes should not be inferior to these functions. We identified the following strong Sboxes that cover the most important scenarios.
- SB$_1$: This Sbox possesses a very small round function. In a serial design the round function is usually implemented only once to save area.
- SB$_2$: This Sbox is selected to enable an efficient implementation in a round-based design. For this, not only the size of the round function is important but also the number of iterations. Additional iterations require additional instantiations of the round function with a dedicated register stage. Furthermore, this Sbox requires the least number of iterations and can be implemented with a very low number of AND gates. Thus, it is also suited to masked software implementations.
- SB$_3$: This Sbox has very good cryptographic properties and requires one fewer iteration than SB$_4$.
- SB$_4$: This Sbox has very good cryptographic properties.
- SB$_8$: This Sbox is similar to SB$_4$ regarding the cryptographic properties. However, its round function is much smaller, which results in an efficient iterative threshold implementation.

Selected Sboxes
In this section, we supply the necessary information to implement the selected Sboxes. For this, we first recall the basic structure of the round functions. Table 1 shows that our selected round functions consist of bit permutations and $\mathbb{F}_{16}$-linear layers. The structure of both types is similar to Fig. 1c. We denote the most (resp. least) significant four bits as $L$ (resp. $R$). The round function $\mathrm{Round} : \mathbb{F}_2^8 \to \mathbb{F}_2^8$ is defined as

$$\mathrm{Round}(L \,\|\, R) = P\big((F_1(L) \,\|\, F_2(R)) \oplus C\big),$$

where $C$ is an 8-bit constant and $P(\cdot)$ denotes either an 8-bit permutation or an $\mathbb{F}_{16}$-linear layer. In Table 3, we describe a specific bit permutation with an eight-element vector where each element denotes the new bit position, e.g., no permutation is 01234567 whereas complete reversal is 76543210. The $\mathbb{F}_{16}$-linear layer is realized as a multiplication with a 2 × 2 matrix with elements in $\mathbb{F}_{16}$. Let us denote the most (resp. least) significant four input bits to the matrix multiplication as $L_M$ (resp. $R_M$). The multiplication is then defined by the matrix with elements $E_1, E_2, E_3, E_4 \in \mathbb{F}_{16}$ applied to $(L_M, R_M)$. To describe the linear layers of our Sboxes we give the specific $[E_1, E_2, E_3, E_4]$ for each matrix in Table 3.
These parameters combined with the number of iterations enable the realization of each Sbox. To increase the efficiency of the TI, the constant is added to only one of the shares. In some cases, the area of the design can be reduced by adding a particular constant to the two remaining shares. This is based on the fact that an additional NOT gate can turn, e.g., an AND gate into a smaller NAND gate [36]. The following linear layer still needs to be applied to all shares. Table 3 contains this condensed description of the selected Sboxes.
For SB$_4$, since it uses a Feistel network, we construct the Sbox using the round function $H(x) = A(F(x)) \oplus G(x)$, where $F$ is taken from the 4713 equivalence classes; $G$ and $A$ represent the linear and affine parts, respectively. $H$, $F$, $G$ and $A$ are all 4-bit to 4-bit functions. The full definition of the round is then simply $(L, R) \mapsto (R \oplus H(L), L)$.
For SB$_7$ and SB$_8$, we simply give the F-function and construct the Sbox using the round function $(L, R) \mapsto (R \oplus F(L), L)$. Note that $R$ and $L$ are not the same size and $F$ is a 6-bit to 2-bit (resp. 7-bit to 1-bit) function for SB$_7$ (resp. SB$_8$).

Comparison
Table 1 gives an overview of our results and we summarize the most important observations in the following. The first observation is that our proposed designs do not require fresh mask bits to achieve uniformity. This is an improvement over all TI types of the AES Sbox and some other Sboxes from Table 1, which need up to 64 bits of randomness for one full Sbox. Given that modern ciphers usually include multiple rounds with many Sboxes, this can add up to a significant amount of randomness which needs to be generated.
Furthermore, all of our proposed Sboxes can be implemented iteratively. This comes with the advantage that even the more complex designs, e.g., SB$_4$, SB$_5$, and SB$_8$, can be realized with very few gates depending on the design architecture. Of all the other Sboxes in Table 1, this is only possible for Robin, whose round function requires more area than that of any of our proposed Sboxes.
In particular, SB$_1$ and SB$_2$ require the least area in their respective target architectures (i.e., iterative and raw) out of all considered 8-bit Sboxes. The difference for the iterative architecture is especially large, where SB$_1$ needs roughly six times less area than the Robin Sbox.
SB$_2$ requires the least number of stages. Additionally, it requires only 12 AND gates for the whole Sbox, which is very close to the best number, i.e., 11 for Fantomas. This is an advantage for masked bit-sliced implementations, making SB$_2$ suitable for software and hardware designs.
For completeness, we also look at the masked bit-sliced implementation of Sboxes with a low number of AND gates (≤ 16), i.e., SB$_1$ and SB$_2$. Software implementations are not vulnerable to glitches; hence, the probing model [23] is suitable to model the security of these implementations. We use the solution for secure AND proposed in [23] and take advantage of the proof of Rivain and Prouff [39] to limit the number of shares. The results are plotted in Fig. 8. As expected, the number of AND gates is the determining factor for large masking orders, and the cost of the linear part becomes negligible. In particular, SB$_2$, Scream v3 and Robin have the same number of AND gates (12) and differ only in the linear part, so the three curves converge toward the same curve.
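The dominance of the AND count at high masking orders can be seen from a minimal sketch of a masked AND. It assumes that the secure AND of [23] is the standard gadget of Ishai, Sahai and Wagner, which uses $n^2$ bit-wise AND operations and $n(n-1)/2$ fresh random bits for $n$ shares; the function names are illustrative.

```python
import secrets

def share(bit, n):
    """Split one bit into n Boolean shares whose XOR equals the bit."""
    shares = [secrets.randbits(1) for _ in range(n - 1)]
    last = bit
    for s in shares:
        last ^= s
    shares.append(last)
    return shares

def unshare(shares):
    """Recombine Boolean shares by XOR."""
    value = 0
    for s in shares:
        value ^= s
    return value

def masked_and(a_sh, b_sh):
    """ISW-style masked AND of two shared bits (assumed to be the gadget of [23])."""
    n = len(a_sh)
    c = [a_sh[i] & b_sh[i] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = secrets.randbits(1)          # fresh randomness per share pair
            c[i] ^= r
            c[j] ^= (r ^ (a_sh[i] & b_sh[j])) ^ (a_sh[j] & b_sh[i])
    return c

def and_gate_cost(n_shares, n_and_gates):
    """Number of bit-wise ANDs for an Sbox with n_and_gates nonlinear gates."""
    return n_and_gates * n_shares * n_shares
```

Since each masked AND grows quadratically with the number of shares while a masked XOR grows only linearly, Sboxes with the same AND count end up with nearly the same cost at large masking orders.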
As expected, we did not find any Sbox with better cryptographic properties than the AES Sbox. However, SB$_3$, SB$_4$, SB$_7$, and SB$_8$ can still provide better resistance against cryptanalytic attacks than most of the other considered Sboxes. This comes at the cost of an increased area for the raw implementations. Nevertheless, the required area is still smaller than that of any AES TI, and their round function is still smaller than that of Robin for iterative designs.
As depicted in Fig. 7, a trade-off between resources and cryptographic properties is possible. If SB$_1$ and SB$_2$ do not provide the desired level of security and SB$_3$ and SB$_4$ are too large, SB$_5$ and SB$_6$ might be the best solution. Their cryptographic properties are still better than or equal to those of the competitors, while the area is significantly smaller than that of SB$_3$ and SB$_4$. For the sake of completeness, we included the area requirement of the unprotected implementation as well as the latency of different designs in Table 1.
Decryption usually requires the inverse of the Sbox. Therefore, it is important that the Sbox inverse has comparably good properties to the original Sbox. For SB$_4$, SB$_7$, and SB$_8$ this is obvious since the Feistel structure makes it straightforward to construct the inverse. Therefore, the inverses of SB$_4$, SB$_7$, and SB$_8$ have exactly the same properties as the original Sboxes. For the other cases, this is not trivial. Nevertheless, the inverse of each of our considered quadratic functions is affine equivalent to the function itself. For completeness, we constructed the inverses of the non-Feistel Sboxes and compared their efficiency with that of their original counterparts. As shown in Table 1, the inverse versions are often less efficient regarding their area and latency. Nevertheless, they still perform well compared to the other Sboxes in the upper half of the table.
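For the balanced Feistel case, the statement about the inverse can be checked with a short sketch. It assumes 4-bit halves and the round $(L, R) \mapsto (R \oplus F(L), L)$; the round function table `F_EXAMPLE` is an arbitrary illustrative choice. The check confirms that the inverse Sbox equals the original Sbox with the two halves swapped at the input and at the output, so both are affine equivalent.

```python
def feistel_sbox(F, n_iter):
    """8-bit Sbox from n_iter rounds (L, R) -> (R ^ F(L), L) on 4-bit halves."""
    sbox = []
    for x in range(256):
        L, R = x >> 4, x & 0xF
        for _ in range(n_iter):
            L, R = R ^ F[L], L
        sbox.append((L << 4) | R)
    return sbox

def invert(sbox):
    """Inverse lookup table of a permutation."""
    inverse = [0] * len(sbox)
    for x, y in enumerate(sbox):
        inverse[y] = x
    return inverse

def swap_halves(x):
    """Exchange the two 4-bit halves of an 8-bit value."""
    return ((x & 0xF) << 4) | (x >> 4)

def inverse_is_swapped_original(F, n_iter):
    """Check that S^-1(x) = swap(S(swap(x))) for the balanced Feistel Sbox S."""
    S = feistel_sbox(F, n_iter)
    S_inv = invert(S)
    return all(S_inv[x] == swap_halves(S[swap_halves(x)]) for x in range(256))

# Arbitrary 4-bit round function, for illustration only.
F_EXAMPLE = [(7 * x * x + 3 * x + 1) & 0xF for x in range(16)]
```

In hardware this means the same round function circuit can serve encryption and decryption, with only the wiring of the two halves exchanged.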

Conclusion and future work
In this work we identified a set of eight 8-bit Sboxes with highly useful properties using a systematic search on a range of composite Sbox constructions. Our findings include 8-bit Sboxes that provide comparable or even higher resistance against linear and differential cryptanalysis compared to other 8-bit Sboxes, while intrinsically supporting the TI concept without any external randomness. At the same time, our selected Sboxes come with a range of useful implementation properties, such as a highly efficient serialization option or a very low area requirement. Future work comprises extended criteria for the Sbox composition, including diffusion layers beyond permutations.

Fig. 1 a Feistel, b MISTY, c SPN, d Lai-Massey

Table 1 notes: iterations of a unique function; c number of AND gates, important for masked bit-sliced software implementations; d excluding the required extra logic, e.g., multiplexers and registers; e fully combinatorial; f including pipeline registers; g number of stages in the pipeline; h number of fresh mask bits required for each full Sbox

Fig. 3 Different serialized design architectures.a Raw, b interleaved, c iterative

Fig. 5 a Fantomas Sbox, b, c threshold implementations of the Fantomas and Robin Sboxes; each signal represents 3 shares, and the gray registers are for the pipeline variant. a Fantomas structure, b Fantomas, c Robin

Table 2
Performance figures of 4 × 4 quadratic bijections with respect to their TI cost

Table 3
Specifics of the selected Sboxes

- SB$_6$: This Sbox is similar to SB$_2$, which is optimized for raw implementations. However, it trades area for better cryptographic properties.
- SB$_7$: This Sbox is similar to SB$_4$ regarding the cryptographic properties. However, it requires one fewer iteration and, thus, its raw threshold implementation is smaller.