A Probabilistic Design for Practical Homomorphic Majority Voting with Intrinsic Differential Privacy

As machine learning (ML) has become pervasive throughout various fields (industry, healthcare, social networks), privacy concerns regarding the data used for its training have gained critical importance. In settings where several parties wish to collaboratively train a common model without jeopardizing their sensitive data, the need for a private training protocol is particularly stringent and requires protecting the data against both the model's end-users and the other actors of the training phase. In this context of secure collaborative learning, Differential Privacy (DP) and Fully Homomorphic Encryption (FHE) are two complementary countermeasures of growing interest to thwart privacy attacks in ML systems. Central to many collaborative training protocols, in the line of PATE, is majority voting aggregation. Thus, in this paper, we design SHIELD, a probabilistic approximate majority voting operator which is faster when homomorphically executed than existing approaches based on exact argmax computation over a histogram of votes. As an additional benefit, the inaccuracy of SHIELD is used as a feature to provably enable DP guarantees. Although SHIELD may have other applications, we focus here on one setting and seamlessly integrate it into the SPEED collaborative training framework from [20] to improve its computational efficiency. After thoroughly describing the FHE implementation of our algorithm and its DP analysis, we present experimental results. To the best of our knowledge, this is the first work in which relaxing the accuracy of an algorithm is constructively usable as a degree of freedom to achieve better FHE performance.


INTRODUCTION
Most works combining Differential Privacy (DP) and Fully Homomorphic Encryption (FHE) fall into two categories: either an exact FHE scheme is used and, independently, the plaintexts are adequately noised to achieve DP [21,24,37], or the post-decryption noise of an approximate FHE scheme such as CKKS is tuned to provide such guarantees [32,36]. The latter approach is presently limited to simple aggregation rules since, to the best of our knowledge, no efficient techniques for e.g. argmin/max computations are presently known for CKKS. In the present paper, we focus on the first of these two approaches and aim at avoiding explicit DP noise addition over the plaintexts by designing a homomorphic majority voting operator (which may be viewed as a stochastic argmax), called SHIELD (Secure and Homomorphic Imperfect Election via Lightweight Design), whose structural stochasticity leads to both a lighter FHE computational cost and DP protection. In fact, we prove that the inaccuracies due to the approximate behavior of our algorithm translate into consistent DP guarantees, so that explicit noise addition becomes unnecessary for DP. In doing so, and by means of a carefully crafted FHE implementation of the algorithm, we achieve a reduction of more than 25% in homomorphic computation time compared to the state of the art on majority voting based on exact argmax computations over a histogram [9,23]. To the best of our knowledge, this is the first work in which relaxing the accuracy of a homomorphic calculation at the algorithmic level is constructively usable as a degree of freedom to achieve better FHE performance.
A well-suited use case for SHIELD is the PATE protocol [33] and especially its extension SPEED [20]. In a nutshell, the PATE protocol labels a subset of a public dataset and uses this partially labeled dataset to train a student model in a semi-supervised way. The labeling is achieved by aggregating (usually by means of a majority vote) the labels, considered as votes, provided by a set of teacher models trained on private datasets. Since the teachers' labels would leak information about their training data, the PATE protocol makes use of differential privacy (DP). To get a reasonable privacy-utility trade-off, the aggregation of the votes is performed on a trusted independent server. The SPEED approach [20] builds upon the work from [33] and uses Fully Homomorphic Encryption (FHE) to blind the server by having it perform the aggregation directly over encrypted votes, thereby protecting the data against an honest-but-curious server. Still, FHE being computationally intensive, this comes at significant communication and computation costs on the server (6.5 minutes to compute the homomorphic argmax for 100 queries). We here propose that the server performs the vote aggregation using SHIELD, which is seamlessly integrated into SPEED. Our experiments show that SHIELD reduces the server's computational burden while providing DP guarantees comparable to the ones in [20] with respect to the teachers (with the exception of the server, to which the final model should not be revealed). Overall, in our experiments, the correct majority vote is output with a probability of more than 90% and, more importantly, SHIELD only incurs a model accuracy loss of 0.7% compared to an exact argmax-based aggregation with no DP.
The paper is organized as follows. First of all, we explore the related work in Section 2 and give some preliminaries about HE and DP in Section 3. Then, we introduce and describe our operator SHIELD in Section 4, and more specifically its FHE implementation in Section 5, before presenting the SPEED application case in Section 6. Section 7 develops an analysis of SHIELD from the points of view of DP and HE. Finally, our experimental results are presented in Section 8.

RELATED WORK
In [41], the authors survey recent works in which DP and cryptographic primitives take advantage of each other:
• cryptography for DP: cryptographic primitives make it possible to obtain the privacy-utility trade-off of a standard DP mechanism but without the need for a trusted server [2,10,16,19]. This is an improvement over local DP which, by making the data owners noise their data before outsourcing them, does not need a trusted server either but gives a poorer privacy-utility trade-off [26,38];
• DP for cryptography: design "leaky" cryptographic primitives that ensure DP and are more efficient than traditional primitives [5,39,40].
The approaches from [5,39,40] are tailored to specific applications, respectively SQL queries, anonymous communication systems and oblivious RAM. Our work uses exact FHE but with an approximate algorithm, in the context of elections.
In [42], the authors propose an algorithm with a similar goal, namely heavy-hitters (most frequent items) detection, which is inherently differentially private thanks to random sampling. Nevertheless, the goal of this inherent probabilistic behavior is not computational efficiency, since the method is not articulated with cryptographic primitives. Moreover, this algorithm works on sequential data. Even if this does not restrict its generality, since any data can be seen as sequential, the utility does depend on the sequential representation of the data, which may not be optimal if this representation has no semantic value. Finally, the algorithm is iterative and thus requires a lot of communication with the users. Some recent works propose emerging approaches to combine DP and approximate FHE schemes like CKKS. Among them, [29] uses a post-processing noise ensuring DP to turn an IND-CPA approximate cryptosystem into an IND-CPA^D one. Indeed, whereas IND-CPA^D is equivalent to IND-CPA for exact cryptosystems, this is not the case for approximate ones. In [32], the author analyzes the noise native to CKKS encryption and proves that it can provide DP without additional noise, yet the analysis holds for at most quadratic polynomial evaluations. The work from [36] also leverages the error induced by encryption to derive DP guarantees, but in a federated learning context. The aggregation protocol is based on the security of the LWE problem and on the Multi-Party Computation protocol of the packed Shamir secret sharing scheme [18]. Nevertheless, LWE is not used to directly encrypt the values of interest but rather to generate one-time pads of the same dimension as these values, while only needing to communicate much smaller vectors to the server. These one-time pads allow a secure aggregation, and DP guarantees are ensured by the error induced by LWE encryption.
In contrast with these latter works, we use an exact HE cryptosystem, namely BFV, and craft a majority voting algorithm which is approximate by design. It is this approximate behavior of the algorithm that we leverage to achieve DP, and not the noise native to encryption as in the aforementioned CKKS-based approaches.
Some works in the literature try to increase the speed of an argmax operation by making it stochastic, but without this stochasticity leading to provable DP guarantees. Fair comparison with such works is tricky. Among the very few works in this line, [22] provides a very efficient CKKS-based approximate argmax operator, but it is difficult to use outside of the biometric identification protocol for which it is intended and does not provide DP guarantees. For this reason we only compare our work with the state of the art on majority voting based on exact homomorphic argmax operators over a histogram [9,23].

PRELIMINARIES

Homomorphic encryption
Fully homomorphic encryption (FHE) schemes make it possible to perform arbitrary computations directly over encrypted data. That is, with a fully homomorphic encryption scheme Enc, we can compute an encryption of an addition, Enc(m_1 + m_2), and of a multiplication, Enc(m_1 × m_2), from the encrypted messages Enc(m_1) and Enc(m_2).
In this section we recall the general principles of the BFV homomorphic cryptosystem [17]. Since we know in advance the function to be evaluated homomorphically, we can stick to the somewhat homomorphic encryption (SHE) version described below. Let R = Z[X]/Φ_m(X) denote the polynomial ring modulo the m-th cyclotomic polynomial, with n = φ(m). The ciphertexts in the scheme are elements of the polynomial ring R_q, where R_q is the set of polynomials in R with coefficients in Z_q. The plaintexts are polynomials belonging to the ring R_t = R/tR for t << q. For a ∈ R, we denote by [a]_q the element of R obtained by reducing all its coefficients modulo q.
As with all SHE schemes, BFV allows us to evaluate additions and multiplications and comes with a relinearization function that helps to manage the size of the ciphertexts. As homomorphic multiplication increases the size of the ciphertexts, relinearization reduces it back to a smaller, more manageable form. This process is crucial for maintaining the efficiency of computations over encrypted data. Using Single Instruction, Multiple Data (SIMD) packing (or batching), one can significantly increase the time performance of BFV-based computation. SIMD is a technique used to perform computations on multiple values simultaneously. It makes it possible to pack multiple plaintexts into a single ciphertext and to operate on them in parallel. This method was originally described in [7,35]. For further details we refer the reader to the original paper [17].

Differential privacy
Differential privacy [14] is a gold-standard concept in privacy-preserving data analysis. It provides a guarantee that, under a reasonable privacy budget (ε, δ), two adjacent databases produce statistically indistinguishable results. In the context of SPEED, we consider, as in [20], that a database is the concatenation of the data owners' (teachers') datasets and that two databases are adjacent if they differ by one teacher.

Definition 3.1 ([14]). A probabilistic mechanism M satisfies (ε, δ)-differential privacy if, for any two adjacent databases D and D′ and any subset S of outputs, P(M(D) ∈ S) ≤ e^ε P(M(D′) ∈ S) + δ.
In the whole paper, P(E) denotes the probability of an event E. (ε, δ) constitutes the privacy cost of the mechanism: the lower ε and δ, the more private the mechanism. Definition 3.1 implies that a differentially private mechanism is necessarily probabilistic. Most often, a deterministic mechanism is made differentially private by adding random noise at some point in the computation.
Since the privacy cost increases with the number of queries to the mechanism, we determine the privacy cost of our protocol in two steps. First, we determine the privacy cost per query; then we compose the privacy costs of the queries to get the overall cost. The classical composition theorem (see e.g. [15]) states that the guarantees ε of sequential queries add up. Nevertheless, training a deep neural network, even within a collaborative framework as presented in this paper, requires a large number of calls to the databases, precluding the use of this classical composition. Therefore, to obtain reasonable DP guarantees, we keep track of the privacy cost with a more refined tool called the moments accountant, introduced in [1] and closely related to Rényi DP [31]. This technique allows for far better composition of the privacy costs along the queries.
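To illustrate why the choice of composition tool matters, the following sketch contrasts classical composition with the advanced composition theorem. These are standard textbook formulas, not the paper's accountant; the moments accountant of [1] yields tighter bounds still.

```python
import math

def basic_composition(eps, k):
    """Classical composition: the epsilons of k sequential queries add up."""
    return k * eps

def advanced_composition(eps, k, delta_prime):
    """Advanced composition (Dwork, Rothblum and Vadhan): total epsilon for k
    (eps, delta)-DP queries, at an extra failure probability delta_prime;
    the dominant term grows like sqrt(k) instead of k."""
    return math.sqrt(2 * k * math.log(1 / delta_prime)) * eps \
        + k * eps * (math.e ** eps - 1)

k, eps = 1000, 0.05
print(basic_composition(eps, k))                         # 50.0
print(round(advanced_composition(eps, k, 1e-5), 2))      # about 10.15
```

For a large number of queries, the gap between the two bounds is what makes refined accounting indispensable in deep learning settings.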
Finally, an important property of DP, widely used in DP analyses, is that it is immune to post-processing. This means, quite intuitively, that applying a function to the output of a differentially private mechanism does not disclose more information to the adversary.

Proposition 3.2 ([15]). Let M be a probabilistic mechanism, with output range R, that is (ε, δ)-differentially private, with (ε, δ) ∈ (R^+)^2. Let f : R → R′ be an arbitrary probabilistic mapping. Then f ∘ M is (ε, δ)-differentially private.

SHIELD

We propose a novel operator that can be viewed as a stochastic majority voting rule or, more or less, as a probabilistic argmax. Previous works address this matter by homomorphically building a histogram of the votes and then applying a homomorphic argmax operator. In contrast, SHIELD (Secure and Homomorphic Imperfect Election via Lightweight Design) directly selects one of the votes stochastically, thereby bypassing the complex and heavy argmax operation. What we lose in accuracy, we gain in DP guarantees and computational time.
SHIELD is an iterative algorithm that takes as input a list of n votes for K candidates. Informally, at each iteration, also called a try, some of the votes are randomly drawn and the algorithm checks whether they are all equal. If so, it outputs the common vote; otherwise, the algorithm performs another try. The output of SHIELD can be any of the candidates for whom there is at least one vote, but the more votes a candidate got, the more likely this candidate will be the output. By tuning the parameters (especially the number of tries and the number of selected votes at each try), we can increase the probability of getting the candidate with the most votes.
Let us now formally describe SHIELD. First of all, SHIELD is meant to be computed in the homomorphic domain. Here are some notations we will use to describe its homomorphic behavior. Enc and Dec respectively denote the encryption and decryption functions of some homomorphic encryption scheme defined on Z_2. ⊕ and ⊗ respectively represent the homomorphic addition and multiplication. When these operators are applied to vectors, they denote the corresponding element-wise operations. Note that the negation of b ∈ Z_2 is homomorphically performed via Enc(1) ⊕ Enc(b), and the homomorphic OR operator, denoted ∨, between a ∈ Z_2 and b ∈ Z_2 is performed via [Enc(a) ⊕ Enc(b)] ⊕ [Enc(a) ⊗ Enc(b)] and will be written Enc(a) ∨ Enc(b) in the following. In this paper, for any n ∈ N, [n] denotes the set {1, ..., n} (which is, by convention, the empty set if n = 0). Let K be the number of classes of the classification problem. Let n be the number of voters (or teachers for SPEED) and, given c ∈ [K], let n_c be the number of voters who voted for class c.
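In the clear, these Z_2 operators can be sketched as follows. This is a plaintext illustration of the boolean identities, not the homomorphic implementation, and the function names are ours:

```python
# Cleartext sketch of the Z_2 logic used by SHIELD: with bits encrypted under
# an HE scheme over Z_2, XOR is homomorphic addition and AND is homomorphic
# multiplication; the other gates derive from them.
def xor(a, b):   # homomorphic addition on Z_2
    return (a + b) % 2

def and_(a, b):  # homomorphic multiplication on Z_2
    return (a * b) % 2

def not_(a):     # Enc(1) xor Enc(a)
    return xor(1, a)

def or_(a, b):   # [Enc(a) xor Enc(b)] xor [Enc(a) and Enc(b)]
    return xor(xor(a, b), and_(a, b))

# Truth table check for OR
assert [or_(a, b) for a in (0, 1) for b in (0, 1)] == [0, 1, 1, 1]
```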
Definition 4.1. Let K ∈ N*. A vector v ∈ (Z_2)^K is said to be a one-hot encoding (vector) if there exists c_0 ∈ [K] such that v_{c_0} = 1 and, for all c ∈ [K] \ {c_0}, v_c = 0. In this case, we say that v codes for the class c_0, or that v is the one-hot encoding of the class c_0.
Let (d, N) ∈ (N*)^2, considered fixed in the remainder of this section, and let S_{d,N} denote the SHIELD operator with parameters d and N, which we define in the following. Let V = (Enc(v^(i)))_{i ∈ [n]} be a list of n encrypted one-hot K-dimensional vectors, some of these vectors being possibly equal (which is necessarily the case for some vectors when K < n). V models the list of the votes, each vote being represented as a one-hot encoding. Then S_{d,N}(V) is an encryption of one of the v^(i) and, with high probability (see Section 8 for quantitative results), S_{d,N}(V) is an encryption of the most frequent of the one-hot encoding vectors of V. S_{d,N} is formally defined in Algorithm 1 where, for the sake of clarity, we do not explicitly write the encryption function (e.g. v = v^(c_0) instead of v = Enc(v^(c_0))). S_{d,N} performs N iterations (or tries) and, at each iteration, it draws d vectors of V with replacement in a uniformly random manner and multiplies them. The resulting vector p is an encryption of the one-hot encoding of the class c_0, with c_0 ∈ [K], if all the d drawn encrypted vectors code for the same class c_0. Otherwise, p is the null vector of (Z_2)^K. If a non-null vector was already found in a previous iteration, the current p is ignored (since the bit is_found has been set to 1). Of course, since the algorithm is computed in the encrypted domain, it has to run until the end of the for loop (all N iterations), but everything works as if the algorithm repeated this operation until it got a non-null vector and then ignored the remaining product vectors. This first non-null vector is the output of S_{d,N}. If no non-null vector was produced after the N iterations, a null vector is output and we say that S_{d,N} failed.

Algorithm 1: SHIELD
Input: number of voters n, number of classes K, list of encrypted votes V, number of multiplications d, number of tries N
Output: y = S_{d,N}(V) = Enc(v^(c_0)) where c_0 ∈ [K]
1: is_found ← 0; y ← (0, ..., 0)
2: for t = 1 to N do
3:   p ← (1, ..., 1)
4:   for j = 1 to d do
5:     Draw a vector v of V uniformly at random;
6:     p ← p ⊗ v
7:   y ← y ⊕ [(1 ⊕ is_found) ⊗ p]
8:   is_found ← is_found ∨ (p_1 ∨ ... ∨ p_K)
9: return y

The parameter N being fixed, the choice of d must consider the trade-off between, on the one hand, the accuracy of the operator, i.e. the probability of getting the most frequent vector (see the accuracy metrics considered in Section 7.2) and, on the other hand, the probability of avoiding a failure and the computational complexity. Indeed, when d increases, the probability of getting a null vector (and thus failing) increases, as does the computational complexity, but the probability of getting the most frequent vector, given that the algorithm did not fail, increases too.
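The behavior of S_{d,N} can be simulated in the clear as follows. This is a plaintext sketch under the notation above, not the FHE implementation; in the encrypted domain the branch on `found` is performed obliviously:

```python
import random

def shield(votes, d, N, rng=random):
    """Plaintext sketch of S_{d,N}. votes is a list of one-hot lists over Z_2.
    Each of the N tries draws d votes with replacement and multiplies them
    coordinate-wise; the first non-null product is kept. A null output means
    the operator failed."""
    K = len(votes[0])
    out, found = [0] * K, 0
    for _ in range(N):
        p = [1] * K
        for _ in range(d):
            v = rng.choice(votes)
            p = [x * y for x, y in zip(p, v)]  # all-equal test via product
        if found == 0 and any(p):              # done obliviously in FHE
            out, found = p, 1
    return out

# 7 votes for class 0, 3 for class 1: class 0 is the most likely output
votes = [[1, 0]] * 7 + [[0, 1]] * 3
print(shield(votes, d=3, N=5))
```

Running the last call many times gives the output distribution of the operator, which is how the a priori accuracy metrics of Section 7.2 can be estimated empirically.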

Multi-degree SHIELD
We can imagine a parameter d that decreases as the iterations run, as if it adapted to the vote distribution. Indeed, on the one hand, a high d for the first iterations ensures (with high probability) that we get the most frequent vector if getting a non-null vector is easy (i.e. probable), which happens if a vast majority of the vectors code for the same class (i.e. a vast majority of voters agree on one candidate). On the other hand, if the first iterations failed, which suggests that getting a non-null vector is not so probable, the number d of multiplications decreases, making the production of a non-null vector easier. In this framework, our SHIELD operator can be represented by a polynomial ∑_{i=1}^{D} a_i X^i with non-negative integer coefficients, where D is the highest value taken by d, ∑_{i=1}^{D} a_i = N and some coefficients a_i may be null. We call this variant multi-degree SHIELD. There is indeed a bijection between the set of operators and N[X], since the order of the terms of different degrees is constrained to be that of decreasing degrees. Nevertheless, the analogy seems to stop here, since the algebraic structure of N[X] does not apply to the set of operators (think about a factorization like ∑_i a_i X^i = X^2 ∑_i a_i X^{i−2}, which would draw two vectors once and use them for all the terms/tries, whereas we here want to draw the vectors of each try independently).
Note that we can easily ensure that multi-degree SHIELD does not fail by imposing a_1 = 1. Indeed, when we draw only one one-hot encoding vector, without multiplying it with others, we cannot get a null vector. Besides, a_1 > 1 is useless since the first draw of a single vector will succeed.
It is easily seen that multi-degree SHIELD is a generalization of SHIELD and, as such, in the remainder of this article, multi-degree SHIELD will simply be referred to as SHIELD.
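Under the same plaintext-simulation assumptions as before, the multi-degree variant can be sketched by encoding the polynomial as a map from each degree i to its coefficient a_i (our own representation):

```python
import random

def multi_degree_shield(votes, coeffs, rng=random):
    """Plaintext sketch of multi-degree SHIELD. coeffs[i] = a_i is the number
    of tries performed with degree i (vectors multiplied per try); tries run
    from the highest degree down to degree 1, and the first non-null
    coordinate-wise product wins."""
    K = len(votes[0])
    out, found = [0] * K, 0
    for d in sorted(coeffs, reverse=True):   # decreasing degrees
        for _ in range(coeffs[d]):           # a_d tries at degree d
            p = [1] * K
            for _ in range(d):
                v = rng.choice(votes)
                p = [x * y for x, y in zip(p, v)]
            if found == 0 and any(p):
                out, found = p, 1
    return out

# Polynomial X^3 + 2X^2 + X: one try at degree 3, two at degree 2, one at
# degree 1; a_1 = 1 guarantees that the operator never fails.
print(multi_degree_shield([[1, 0]] * 7 + [[0, 1]] * 3, {3: 1, 2: 2, 1: 1}))
```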

Offset parameter
The SHIELD operator as defined above cannot always provide finite DP guarantees. Let us consider two adjacent databases D and D′ such that, in D, a class c was chosen by no voter and, in D′, c was chosen by one voter. Then, with input D, SHIELD will never output c because it cannot pick a one-hot encoding for c; the probability of outputting c is thus null. On the contrary, with input D′, there is a non-null probability (even if it is small) of outputting c. Hence, the ratio of the probabilities of outputting c is not bounded and we get an infinite privacy cost.
To avoid this problem, we force all the classes to have at least one vote by creating a dummy one-hot encoding for each class. More generally, λ dummy one-hot encodings can be created for each class, where λ is another parameter of SHIELD, called the offset.
Algorithm 2 gives the pseudocode of the multi-degree version of SHIELD with the offset parameter.
In our experiments, we fixed λ to 1, leaving the optimization of this parameter for future work. It is nevertheless intuitive that the greater λ, the worse the accuracy: when λ is large, the distribution of the votes is flattened and the probability of outputting the true argmax is lower.
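The offset can be sketched as a simple preprocessing of the vote list. This is a plaintext illustration; in the SPEED setting the dummy encodings are actually sent by some teachers, as discussed in Section 6:

```python
def add_offset(votes, K, offset=1):
    """Prepend `offset` dummy one-hot votes per class, so that every class has
    a non-null probability of being output by SHIELD (this is what makes the
    privacy cost finite)."""
    dummies = [[1 if c == c0 else 0 for c in range(K)]
               for c0 in range(K) for _ in range(offset)]
    return dummies + votes

# With offset 1 and K = 2, two dummy votes are added, one per class:
print(add_offset([[1, 0], [1, 0], [0, 1]], K=2))
# [[1, 0], [0, 1], [1, 0], [1, 0], [0, 1]]
```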

Exponential argmax operator
As an inherently stochastic mechanism that does not resort to noise addition but rather outputs a value with a probability that is an increasing function of its utility (if we deem that the vote frequency of a class constitutes its utility), SHIELD can be compared to the exponential mechanism (introduced in [30]), which samples its output following the softmax distribution of the utility. However, sampling in the encrypted domain constrains the shape of the probability distribution and ties the practically implementable distributions to the computational efficiency of the operator. Note that softmax has been approximately implemented in FHE through polynomial approximation [28], but this requires a rather high multiplicative depth (with a polynomial of degree 12 for approximating the exponential function and even more for approximating the inverse function) and results in a significant computational overhead. Moreover, using such an implementation would still require additional homomorphic operations, such as comparisons, to actually sample the output according to this distribution.
Rather, a method of sampling that follows the exponential distribution by construction, in the spirit of SHIELD as presented in this paper, would be more appealing. Sampling each vote independently with a fixed probability would actually yield an output distribution that depends exponentially on the vote frequencies, but it seems that the probability of failing by not outputting any class would be quite high for practical parameters. We leave this question for future work.

FHE IMPLEMENTATION OF SHIELD
Algorithm 2 is a generic version of SHIELD that actually needs to be adapted for an implementation using an HE cryptosystem.
We consider two kinds of encodings, depending on the encryption scheme that we use:
• With the BFV cryptosystem, we use batching, i.e., we encode a number of values together in a single polynomial which is then encrypted. A single operation on a ciphertext thus leads to the same operation applied to all the values encoded (slots) inside the ciphertext, in a Single Instruction, Multiple Data (SIMD) fashion. On the downside, BFV performance is very sensitive to multiplicative depth and inter-slot operations are costly.
• With the TFHE cryptosystem [12], we use a single ciphertext to encrypt a single value, i.e., we proceed in a Single Instruction, Single Data (SISD) fashion. This is a priori less efficient than using SIMD, but the performance is decoupled from the multiplicative depth of Algorithm 2 and costly operations such as slot rotations are avoided.
To sum up, we implement SHIELD with two separate methods: one uses the BFV cryptosystem with SIMD operations (with results provided in Section 5.1); the other uses the TFHE cryptosystem with SISD operations (with results provided in Section 8.3).

Implementing SIMD-SHIELD
Although using BFV allows us to speed up SHIELD considerably by batching different samples together in the same ciphertext, some constraints require adapting parts of Algorithm 2.
a. Multiplicative depth. As is the case for other similar HE schemes, we need to set the parameters of BFV according to the multiplicative depth of the computation. The higher the multiplicative depth, the larger the parameters and the less efficient the overall computation. For this reason, some parts of the algorithm, like Line 9, need to be changed. We can store all of the drawn encrypted vectors over the loop and multiply them in a classic tree-based approach (instead of multiplying them sequentially), which reduces the multiplicative depth of the computation from d to ⌈log2(d)⌉.
The same change is applied everywhere it is needed, that is to say at Lines 11 and 13 of Algorithm 2.
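The tree-based change can be illustrated on mock ciphertexts (plain integers here; with real BFV ciphertexts each `*` would be a homomorphic multiplication):

```python
def tree_product(cts):
    """Multiply a list of (mock) ciphertexts with a balanced tree: the
    multiplicative depth drops from len(cts) - 1 for a sequential product
    to ceil(log2(len(cts))) here."""
    while len(cts) > 1:
        cts = [cts[i] * cts[i + 1] if i + 1 < len(cts) else cts[i]
               for i in range(0, len(cts), 2)]  # pair up, carry odd one over
    return cts[0]

print(tree_product([2, 3, 4, 5]))  # 120, computed as (2*3) * (4*5)
```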
b. Selecting the teacher. Selecting the voter, also called teacher because of the SPEED application case (see Section 6), at Lines 8 and 9 of Algorithm 2 is easy enough when the SHIELD algorithm is called for a single sample at a time. However, in order to speed the algorithm up and fully make use of the SIMD property of the BFV cryptosystem, we actually run the SHIELD algorithm for a number of samples at a time.
For instance, if v^(s) is the vector of K values for sample s, then the actual vector encoded in the ciphertext for the packed algorithm is the concatenation

v = (v^(1) | v^(2) | ... | v^(S))     (1)

of the vectors of the S packed samples. This allows us to use the full size of the polynomials we encrypt. These polynomials have degrees in the order of ≈ 2^15 while K is usually less than 1000.
Therefore the teacher selection step has to be modified. Teacher i's vote for sample s is now placed in the K slots of the corresponding block of the packed vector of Equation 1; we call this the new encoding of the teacher's vote. Algorithm 3 presents the process for teacher selection and the creation of the v vector using this new encoding. At Step 9, a mask M_i is updated; no detail is given, for clarity.
Algorithm 3: Teacher selection. With n the total number of teachers, this algorithm describes the actual steps for selecting the teachers that get to vote in the SIMD encoding paradigm. For every teacher i, the mask M_i is a plaintext vector that contains 0s in the place of the samples for which the teacher votes and 1s in the place of the samples for which the teacher does not vote. As an example, for K = 2 and S = 4, if teacher i votes for samples 1 and 3, then M_i = (0, 0|1, 1|0, 0|1, 1).
This mask is then added to the teacher's encoded vote before the multiplication with the v vector, so that the samples that are not voted on do not impact the result: their slots are filled with 1s. If the mask were not used, all non-selected slots would be filled with 0s and would therefore set everything to 0 after the multiplication.
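The mask construction can be reproduced in the clear as follows (0-indexed samples in this sketch, whereas the example above numbers them from 1):

```python
def teacher_mask(samples_voted, S, K):
    """Plaintext mask M_i for teacher i: 0s in the K slots of each sample the
    teacher votes for, 1s elsewhere (so that non-selected slots multiply to 1
    instead of zeroing out the product)."""
    return [0 if s in samples_voted else 1
            for s in range(S) for _ in range(K)]

# K = 2, S = 4, teacher votes for samples 1 and 3 (0-indexed: 0 and 2):
print(teacher_mask({0, 2}, S=4, K=2))  # [0, 0, 1, 1, 0, 0, 1, 1]
```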
c. Summing the coordinates. One can see how, using log2(K) rotations and sums, we can obtain ∑_{c ∈ [K]} p^(s)_c in the first coordinate of the p^(s) vector. The question marks (?) represent values that are rotated over from the next slot (recall the complete form of v in Equation 1). Therefore, we cannot control the values in the rest of the coordinates. And this is not enough: for Line 13 to work, we need a vector in which all the coordinates p^(s)_c are filled with ∑_c p^(s)_c, not just the first one. To obtain this, we have to multiply by a plaintext with values (1, 0, 0, ...) to select only the first coordinate of p and then re-populate the rest of the coordinates using rotations and sums, exactly in the opposite way as for the computation of the sum of the p^(s)_c values.
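The rotate-and-sum trick can be simulated on a packed plaintext vector (K a power of two in this sketch; the uncontrolled coordinates are exactly those whose rotations cross a block boundary):

```python
def rotate(v, r):
    """Cyclic slot rotation by r positions (mock of an HE slot rotation)."""
    return v[r:] + v[:r]

def slot_sum(v, K):
    """log2(K) rotations and additions. Only the first slot of each K-block is
    guaranteed to hold that block's sum; the other slots receive values rotated
    over from the next block (the 'question marks' in the text)."""
    step = 1
    while step < K:
        v = [a + b for a, b in zip(v, rotate(v, step))]
        step *= 2
    return v

v = [1, 0, 1, 1] + [0, 1, 0, 0]   # two samples packed together, K = 4
s = slot_sum(v, 4)
print(s[0], s[4])  # 3 1: per-block sums, in each block's first slot
```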
d. Packing the polynomial rounds together. Up until now, for clarity, we presented a version of our algorithm that packed all or some of the S samples together in a single ciphertext. In practice, to speed up the computation further, we also pack the polynomial rounds together. What we mean by "polynomial rounds" is the two for loops at Lines 1 and 2 in Algorithm 3. We can remove these for loops and compute them in parallel in a single ciphertext.

AN APPLICATION CASE: SPEED

PATE workflow
Our SHIELD operator is very well adapted to a learning protocol called SPEED, from [20], itself inspired by PATE [33]. The SPEED method is illustrated in Figure 1, inspired from [20]. Assuming the existence of a public unlabeled database Δ (we will keep this notation throughout the paper), PATE enables several data owners, called teachers, to collaboratively train a classification model without outsourcing their data, which are considered private. The idea is to label Δ and use it to train the final classification model, called the student model or simply the student. To do so, each teacher is asked to beforehand train a model for the same task as the student's target task, with its own data only. For each sample of Δ to label, every teacher infers a label through its model and sends this label to an aggregation server. The server then counts the number of labels received for each class, also seen as votes, and outputs the class with the most votes, which is sent to the student for training.
As described, the baseline PATE protocol does not protect the data from the actors of the process, namely the teachers, the server, the student and the end-users. All of them are considered honest-but-curious, which means that they execute their task correctly but may use the data they have access to in order to retrieve sensitive information about the teachers' data. The teachers do not actually get any information other than their own data, unless they are also end-users of the student model, which may be the case in many real-life scenarios.

Data protection in SPEED
To prevent the student and, a fortiori, the end-users (by post-processing) from discovering sensitive information through attacks such as model inversion or membership inference, we apply DP. The teachers noise their votes before sending them to the server.
One could argue that the noise added by the teachers would also blur the sensitive information to the server. Nevertheless, the added noise is precisely scaled so that it protects the output of the aggregation, i.e. the class with the most votes, without harming the student accuracy too much. If the individual votes sent to the server were to be protected by DP before aggregation, thus achieving what is called local DP [13,25,26], this would require much more noise, too much to ensure a reasonable accuracy for the student model. As a consequence, the votes need to be protected from the server in another way. This is where homomorphic encryption makes its entrance. After noising their votes, the teachers encrypt them. The server then receives the encrypted votes and performs their aggregation (sum and argmax) in the homomorphic domain. Finally, the output of the aggregation is sent to the student, which owns the decryption key and is therefore able to decrypt it.
A real-life scenario could involve hospitals that own patients' medical data and aim at training a global model that would help the early diagnosis of a specific disease.In this case, the end-users would be the hospitals themselves.

Faster SPEED with SHIELD
Our SHIELD operator can be used to replace the sum and argmax computations on the server side in SPEED (represented by the gear wheel in Figure 1). After receiving all the votes from the teachers, without noise, the server randomly picks some vectors with replacement, as described in Section 4.1. Note that, being honest-but-curious, the server is trusted to compute SHIELD without mistake. Interestingly, the rest of the SPEED protocol remains unchanged, except for the sending of dummy one-hot encodings by some teachers, according to the offset parameter (see Section 4.3).

ANALYSIS OF SHIELD

Computational complexity of SHIELD
Compared with previous HE argmax computation methods, SHIELD is unique in that its complexity depends only linearly on the number of classes of the chosen machine learning problem. Indeed, the main impact of an increase in the number of classes is that the encoding space increases by the same amount (and therefore the time overhead is linear). A secondary impact is the logarithmic increase in the number of rotations needed for the computation of ∑_c p^(s)_c, as seen in Section 5.1. All previous works use one (or a combination) of two methods to evaluate a homomorphic argmax over a number of values: a tournament method or a league method. We refer the reader to [9,23] for specific implementation details.
Here we focus on their complexity with respect to the number of classes.
• A league is a system of comparison where every value is compared with every other value. The winner is the value that was greater than every other one. Think of a football league like the French first division ("Ligue 1") for this kind of system. The use of a league method yields a quadratic complexity in the number of classes. This leads to very high performance overheads as the number of classes increases. However, contrary to the tournament method, increasing the number of classes does not affect the multiplicative depth of the circuit to be evaluated. This is what makes this method useful in the homomorphic domain in spite of its complexity.
• A tournament is a system where values are compared two-by-two and the losers are discarded at every round. Think of the FIFA World Cup for this kind of system. Using a tournament method has a theoretical linear complexity in the number of classes. In practice, however, this does not hold: as the number of classes increases, the depth of the comparison tree used for the evaluation grows logarithmically. For leveled homomorphic schemes such as BFV or BGV (those we use in this article), as used in [23], this means an increase in parameter size to match the multiplicative depth of the new tree. In turn, this impacts the performance of the overall scheme on top of the theoretical linear increase. After a given point, the increase in parameter size becomes prohibitive and one needs to resort to finishing the computation with a league method, as done in [23].
Compared to all other existing works, ours therefore scales much better with the number of classes and fits particularly well use-cases with high numbers of classes.
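The comparison counts behind the two methods can be made concrete with a short sketch (the helper names are illustrative; the actual homomorphic circuits are those of [9,23]):

```python
import math

def league_comparisons(k):
    """League: every value is compared with every other one,
    i.e. k*(k-1)/2 comparisons, but the multiplicative depth of the
    comparison circuit does not grow with k."""
    return k * (k - 1) // 2

def tournament_comparisons(k):
    """Tournament: losers are discarded round by round, k-1
    comparisons in total, but the comparison tree has ceil(log2(k))
    levels, so the multiplicative depth grows with k."""
    rounds = math.ceil(math.log2(k)) if k > 1 else 0
    return k - 1, rounds

# 10 classes: 45 league comparisons at constant depth,
# versus 9 tournament comparisons spread over 4 rounds.
```

This is the trade-off described above: fewer comparisons but deeper circuits for the tournament, shallower circuits but quadratically many comparisons for the league.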

A priori accuracy metrics
The ultimate accuracy we want to maximize in the SPEED application case is obviously the testing accuracy of the student model. Nevertheless, it is interesting to measure the accuracy of the argmax operator itself, independently of the student training. Even though this accuracy depends on the teachers' votes and thus on the dataset used, it enables us to evaluate polynomial parameterizations without performing the student training, which is much faster and allows us to test many more parameterizations. We call such an accuracy an a priori accuracy.
The most straightforward way to define the argmax accuracy is to consider the probability of getting the exact argmax. Nevertheless, this approach treats every mistake the same way. It could be argued that outputting, say, the class that received the second greatest number of votes is better than outputting the least preferred class. Taking such a concern into account in our metric would also give a better hint about the student accuracy. Indeed, while the most preferred class (i.e. the class with the most votes, or equivalently the exact argmax) is not always the actual class of the sample, called the ground truth class, a class with a lot of votes is more likely to be the ground truth class.
We could then make the assumption that the frequency of votes for a class is proportional to the probability of this class being the ground truth class of the sample (which is not necessarily the most preferred class). This would correspond to an assumption of well-calibrated vote distributions. In this context, we introduce another a priori accuracy metric which is the probability of outputting the ground truth class of the sample. We call this metric the ground truth accuracy, since it does not focus on outputting the exact argmax but rather the ground truth class. For a given sample x, p_c being the probability of SHIELD outputting class c and gt(x) denoting the ground truth class of x⁵, the law of total probability gives the following expression for the ground truth accuracy for x, written GTA(x): GTA(x) = Σ_c p_c · P[gt(x) = c], which under the well-calibrated assumption equals Σ_c p_c · n_c / n, where n_c is the number of votes for class c and n the total number of votes. Of course, both metrics must be averaged over all the samples sent to the teachers.
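A minimal sketch of this metric, assuming the output probabilities p_c of the operator are known in the clear (the helper name is hypothetical):

```python
import numpy as np

def gta(p_out, counts):
    """Ground truth accuracy for one sample: p_out[c] is the probability
    that the operator outputs class c, counts[c] the number of teacher
    votes for c. Under the well-calibrated assumption
    P(ground truth = c) = counts[c] / n, the law of total probability
    gives GTA = sum_c p_out[c] * counts[c] / n."""
    probs_truth = np.asarray(counts, dtype=float) / np.sum(counts)
    return float(np.dot(p_out, probs_truth))

# Toy example: 3 classes, 10 votes split 6/3/1, and an operator that
# outputs the majority class 80% of the time.
g = gta(np.array([0.8, 0.15, 0.05]), [6, 3, 1])  # 0.8*0.6 + 0.15*0.3 + 0.05*0.1
```

An exact argmax would score 0.6 on this toy sample, since the majority class carries 6 of the 10 votes.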

Differential privacy analysis
Since the student model training requires many requests to the teachers and, indirectly, to their private datasets, we use, as in [20], the moments accountant technique [1] to get a better privacy cost under composition.
Unlike most DP mechanisms, the stochastic behavior of our operator does not come from additional random noise: the operator is inherently probabilistic, and it is this very property that we leverage to ensure DP. The moments accountant measures the discrepancy between the output distributions associated with two adjacent databases⁶. We here consider that two databases D and D′ are adjacent if they are the concatenations of the datasets from the same number of teachers and only one teacher differs from one database to the other. This implies that either all the counts n′_c for database D′, for c ∈ [K], are equal to the counts n_c for database D, in which case the corresponding moments accountant is null, or the n′_c differ from the n_c for exactly two values of c, say c1 and c2, such that n′_{c1} = n_{c1} - 1 and n′_{c2} = n_{c2} + 1 (i.e. the differing teacher votes for c1 in D and for c2 in D′). Computing the privacy cost of the training, as well as the a priori accuracy, thus requires knowing the probabilities of outputting each class.
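This adjacency relation on the per-query vote histograms can be sketched as follows (the helper name is hypothetical; the enumeration mirrors the definition above):

```python
def adjacent_counts(counts):
    """Enumerate the per-query vote histograms adjacent to `counts`:
    the differing teacher votes for class c1 in D and for class c2 in
    D', so n_{c1} drops by one and n_{c2} grows by one."""
    k = len(counts)
    for c1 in range(k):
        if counts[c1] == 0:   # no vote for c1 to move away
            continue
        for c2 in range(k):
            if c2 == c1:
                continue
            neighbor = list(counts)
            neighbor[c1] -= 1
            neighbor[c2] += 1
            yield neighbor

neighbors = list(adjacent_counts([5, 3, 2]))  # 3 classes -> 6 neighbors
```

Note that the total number of votes is preserved, since adjacency only changes which class one teacher votes for.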

Computing the probability distribution of the output
We compute the probability distribution of the output of the SHIELD algorithm with a given polynomial parameterization in a recursive manner. For a sample x of Δ, let M_{x,P} be the mechanism that takes the whole database (the concatenation of the teachers' datasets) as input and outputs the class sent to the student, i.e. the output of SHIELD, with the polynomial parameterization P ∈ N[X].
Let D be the database composed of the teachers' data and let c be a class of the problem.
Using these expressions, we simply compute the moments accountant for each query by taking the maximum over all pairs (D, D′) such that D is the database constituted by the concatenation of the teachers' databases and D′ is a database adjacent to D. We then derive the overall privacy cost using the moments accountant's composition properties.
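Assuming the per-query output distributions p and q of the mechanism on two adjacent databases are known, the moments-accountant machinery of [1] can be sketched as follows (function names and toy distributions are purely illustrative, not the paper's measured values):

```python
import numpy as np

def moment(p, q, lam):
    """lambda-th log-moment of the privacy loss between the output
    distributions p and q of one query on two adjacent databases:
    alpha(lam) = log E_{z~p}[(p(z)/q(z))^lam]."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.log(np.sum(p * (p / q) ** lam)))

def epsilon(alphas, n_queries, delta):
    """Compose n_queries identical queries and convert the accumulated
    moments into an (epsilon, delta) guarantee, minimizing over lambda."""
    lams = np.arange(1, len(alphas) + 1)
    return float(np.min((n_queries * np.asarray(alphas)
                         + np.log(1.0 / delta)) / lams))

# Toy output distributions for one query on two adjacent databases:
p = [0.90, 0.07, 0.03]
q = [0.88, 0.09, 0.03]
alphas = [moment(p, q, lam) for lam in range(1, 33)]
eps = epsilon(alphas, n_queries=100, delta=1e-5)
```

In our setting the maximum of each moment over all adjacent pairs (D, D′) is taken before composition, as described above.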
Note that the obtained DP guarantees are data-dependent since we explored only the pairs of adjacent databases such that one of them is the actual database given by our application. The very values ε and δ of these guarantees then reveal some information about the training data. In a real-life scenario, these values should be sanitized before being published, as in [34] for instance, but this is beyond the scope of this work.

The differential privacy analysis does not apply to the server
When we compute the probabilities of outputting a class, we do not suppose anything about whose votes are drawn, i.e. we do not condition the probabilities on some particular drawing event. This amounts to assuming that the adversary only sees the output class and does not know, in particular, which teachers were selected in the sampling. This assumption cannot apply to the server since it draws the one-hot encodings itself and knows which teachers they come from, having received the encodings one by one from the teachers. Appendix A gives an insight into why this subtlety may be problematic. This observation shows that we need to constrain the server not to see the student model once it is trained. Note that the information leakage induced by the server's knowledge may not jeopardize data privacy much in practice. We only argue here that our DP analysis does not allow us to derive DP guarantees from the point of view of the server, which might be possible with a more involved (and likely quite complex) analysis, although probably with worse guarantees.

Choice of the polynomial parameterization
We tested SPEED with SHIELD on the MNIST dataset [27]. While the offset parameter has been set to 1, a key aspect of our experiments is the choice of a polynomial parameterization that realizes a good trade-off between model accuracy, DP guarantees and computational efficiency. Since the overall computational time depends on the sum of coefficients and the degree of the polynomial parameterization, we proceeded by constraining the maximum degree and the maximum value for the sum of coefficients of the polynomials. We fixed the maximum degree to 4 to achieve a reasonable balance between the multiplicative depth, the number of tries and the argmax accuracy. For several integer values (6, 12, 17, 32), we considered all the polynomials of degree at most 4 whose sum of coefficients is less than this value. We do not go beyond a sum of coefficients equal to 32 to keep the computational time low. We then computed the DP guarantee ε, with δ = 10⁻⁵ being fixed, for each polynomial, as well as its GTA score, which acts as a proxy for the student model accuracy, the latter being impossible to determine in reasonable time for so many polynomials. Finally, we focused on the polynomials belonging to the Pareto front for these two criteria (DP guarantee ε and GTA) and picked the ones that yielded among the best DP guarantees without harming the accuracy too much. In practice, as can be seen on Figure 2, the DP guarantee guided our choice more, because the GTA, besides being only a heuristic for the actual student model accuracy, did not vary much among the polynomials of the Pareto front. Note that the GTA of the exact argmax is 72.35%. The chosen polynomials are respectively 2X³ + 3X² + X, 2X⁴ + 6X³ + 3X² + X, 6X⁴ + 6X³ + 4X² + X and 8X⁴ + 6X³ + 4X² + X for a sum of coefficients of at most 6, 12, 17 and 32⁷. We did not display the Pareto front for a sum of coefficients of at most 6 because it only contains one polynomial.
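The Pareto-front selection step can be sketched as follows (the candidate (ε, GTA, polynomial) triples below are made up for illustration, not the paper's measured values):

```python
def pareto_front(points):
    """Keep the parameterizations that are not dominated: a point
    (eps, gta) is dominated if some other point has an epsilon at most
    as large (better or equal privacy) and a GTA at least as large
    (better or equal accuracy), with at least one strict improvement."""
    front = []
    for eps, gta_v, name in points:
        dominated = any(e <= eps and g >= gta_v and (e, g) != (eps, gta_v)
                        for e, g, _ in points)
        if not dominated:
            front.append((eps, gta_v, name))
    return front

# Hypothetical candidates: (epsilon, GTA, polynomial)
cands = [(2.1, 0.70, "2X^3+3X^2+X"), (2.5, 0.71, "X^4+X"),
         (2.0, 0.68, "3X^2+X"), (2.6, 0.69, "X^3+2X")]
front = pareto_front(cands)
```

Among the surviving points one then favors low ε, since, as noted above, the GTA varies little along the front.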
Table 1 displays the GTA, the student model accuracy and the DP guarantee ε for the chosen polynomial parameterizations, δ = 10⁻⁵ being fixed. The GTA and DP guarantee are averaged over the whole set of 8000 samples used for semi-supervised training, the DP guarantee being multiplied by 100, the number of actual queries to the teachers. The student model accuracy is averaged over ten runs, each of which used a different random subset of 100 samples as labeled samples. The table also displays the number of correctly labeled samples (compared to the ground truth label) out of the 8000 samples. Note that the observed non-monotonic relationship between the model and argmax accuracies may deserve more investigation from an ML point of view. The variance of the model accuracy among the runs is quite important and may explain why the accuracy surprisingly does not increase when the polynomial is better in terms of both GTA and number of correct labels. The polynomial 2X⁴ + 6X³ + 3X² + X being the one that yields the best model accuracy, we shall focus on this parameterization when comparing our computational efficiency with the state of the art (see Section 8.2).

⁷ The chosen polynomial among the ones with a sum of coefficients at most 32 has a sum of coefficients equal to 19 only. This is good news for computational complexity because it allows us to batch all samples into a single ciphertext and therefore optimize the computation.

SIMD SHIELD with BFV
For our implementation of the SIMD SHIELD algorithm, we use the BFV cryptosystem from the OpenFHE library [4]. The chosen parameters are: log₂(q) = 540; t = 65537; m = 65536; n = 32768. They achieve a security level of λ = 128 bits with a noise standard deviation of 3.2. Our implementation was tested on a machine with an AMD Opteron(tm) Processor 6172 using a single thread.
We achieve the performances presented in Table 2 for a set of different polynomial parameterizations. Although we tested using the MNIST dataset, the performance of an HE algorithm does not depend on the underlying data by construction; otherwise one could infer something about the data from seeing the computation happen in the encrypted domain. For our implementation, we need to run the SHIELD algorithm over 100 samples. In the table, however, we also present computation times for the case where we optimize the batching space with a higher number of samples, to give an idea of what computation times could be achieved by optimizing parameters further. For now, these optimizations are not possible in keeping with the Homomorphic Encryption Security Standard [3], which recommends the use of power-of-two cyclotomic polynomials. A new standard is reported to be in the works which would open applications to the secure use of non-power-of-two cyclotomic polynomials. Table 2 also compares our method with previous existing methods for non-stochastic argmax computations. Among these methods, the one presented in [20] as well as its later improvement in [9] are outperformed by ours for all parameterizations that we tested. It is important to note that these methods do not (and cannot) use batching by construction; therefore their time per sample is fixed and does not depend on the number of samples processed. Times in Table 2 for [23] are taken from their Table 4 because it most closely matches our use-case. However, important differences remain: we report their timings (amortized over 100 samples) extrapolated from 8 classes (1.52 s) to 10 classes (1.9 s), which is conservative as the complexity of their method is at least linear in the number of items; additionally, their timings are for a minimum computation, which is less time-consuming than an argmin computation, but no times are given for an argmin in [23]. On top of that, [23] makes heavy use of batching and also reports additional amortized timings. However, by construction, they are constrained to batching sizes much higher than ours for security purposes, therefore the amortized time of 0.03 s per sample reported in [23] cannot be obtained over 100 samples.

Bitwise SHIELD with Cingulata
To show the interest of the batching approach, we also implemented the basic version of SHIELD, as described in Algorithm 2, with the Cingulata crypto-compiler and its TFHE backend.
Let us recall that Cingulata, formerly known as Armadillo [8], is a toolchain and run-time environment (RTE) for implementing applications running over HE. Cingulata provides high-level abstractions and tools to facilitate the implementation and execution of privacy-preserving applications expressed as Boolean circuits, a representation which is natural for SHIELD.
Table 3 shows the execution times of SHIELD for different polynomial parameterizations when performed in a SISD fashion with TFHE and Cingulata. The experiments were performed with a single thread on an Intel Xeon processor with 16 GB of memory and the Ubuntu 20.04 operating system. As shown in the table, the execution time of SHIELD increases with the degree of the polynomial and the sum of the polynomial coefficients. As expected, the overall performances are well below the ones obtained when using BFV and its batching capabilities.

CONCLUSION AND PERSPECTIVES
We proposed SHIELD, a homomorphic stochastic operator whose design, necessary for fast homomorphic computations, yields DP as a natural "by-product". This work reconciles two complementary but usually independent (or even mutually constraining) privacy tools in an all-in-one operator whose inaccuracy is a crucial feature. We hope this work will encourage further research on the design of private algorithms where FHE (or other cryptographic primitives) and DP leverage the advantages of each other. For instance, developing algorithms that would be useful in settings other than an election, so as to broaden the scope of machine learning applications, seems promising. In this perspective, an argmax algorithm that takes a histogram of the votes as input rather than the "physical" votes represented as vectors would have more general applicability.
Testing SHIELD on more difficult datasets, especially datasets with numerous classes, could reveal its full potential. Besides, a more thorough theoretical study yielding results that could guide the choice of the parameters (polynomial, offset) is desirable. Other variants, including sampling without replacement (Appendix A) or an exponential version of SHIELD (Section 4.4), would also deserve theoretical and experimental analysis. Studying SHIELD in terms of strategy-proofness and fairness could be interesting too and would extend the added value of SHIELD to the area of computational social choice and voting rules [11].

Definition 3.1. A randomized mechanism M with output range R satisfies (ε, δ)-DP if, for any two adjacent databases D and D′ and for any subset of outputs S ⊆ R, one has P[M(D) ∈ S] ≤ e^ε · P[M(D′) ∈ S] + δ.

Figure 1: SPEED learning protocol (PATE uses the same protocol but in the plaintext domain)

Figure 2: Pareto fronts of the polynomials for a fixed maximum sum of coefficients. The polynomials we chose for the student model training are indicated by red-edged diamonds.

Table 1: Accuracy and DP guarantee (with δ = 10⁻⁵) obtained with several polynomial parameterizations.

Table 2: Performance of the SIMD implementation of SHIELD (for 10 classes) for different polynomial parameterizations, compared with previous work implementing exact argmax computations.

Table 3: Performance of the Cingulata (with TFHE) implementation of SHIELD.