Cryptographic Solutions for Credibility and Liability Issues of Genomic Data

In this work, we consider a scenario that includes an individual sharing his genomic data (or results obtained from his genomic data) with a service provider. In this scenario, (i) the service provider wants to make sure that the received genomic data (or results) in fact belong to the corresponding individual (and are computed correctly), (ii) the individual wants to provide a digital consent along with his data specifying whether the service provider is allowed to further share his data, and (iii) if his data is shared without his consent, the individual wants to determine the service provider that is responsible for this leakage. We propose two schemes based on homomorphic signatures and aggregate signatures that link the information about the legitimacy of the data to the consent and the phenotype of the individual. Thus, to verify the data, each party also needs to use the correct consent and phenotype of the individual who owns the data.


INTRODUCTION
With the rapid decrease in the cost of whole genome sequencing and genotyping, today, genomic data is widely used in healthcare, research, and even in recreational genomics. However, the benefits of this wide use of genomic data come along with potential threats against individuals' privacy. Genomic data of an individual includes privacy-sensitive data about him such as his physical characteristics, predisposition to diseases, and family members. Therefore, it is crucial to protect the privacy of an individual's genomic data while allowing him to utilize his data to receive certain healthcare or recreational services. As a result, there has been a significant amount of research effort on privacy-preserving processing and secure storage of genomic data. However, the credibility and liability issues on genomic data have not been widely considered in the literature.
Many individuals share their (anonymized) genomic data for research purposes. Such donations are very important for the research community as researchers need large amounts of genomic data samples to increase the statistical power of their studies. Similarly, some service providers make computations on genomic data of individuals and they are only interested in the results of such computations (rather than the raw genomic data). However, researchers (or service providers) want to make sure that either (i) a donated genome indeed belongs to a particular individual, or (ii) the results of a genetic test are indeed computed from the correct data of the particular individual. In this work, we study this credibility issue and propose cryptographic techniques that would enable a researcher (or a service provider) to verify the credibility of a donated genome (or a computed genetic test). Furthermore, as an individual donates his genomic data for research (to a particular entity) or undergoes a genetic test from a service provider, he would like to make sure that neither his genomic data nor his genetic test results are going to be observed by other individuals. Privacy leakage occurs when genomic data of the individual or his genetic test results are publicly shared by the service providers that collect such data in the first place. In such incidents, it is important to understand whom to hold liable for such a leakage. Thus, (i) the individual wants to provide a digital consent along with his data specifying whether the service provider is allowed to further share his data, and (ii) if his data is shared without his consent, the individual wants to determine the service provider that is responsible for this leakage.
Our main assumption is that the service provider (which receives genomic data or genetic test results from an individual) should prove the legitimacy of the data when sharing it with other entities. Otherwise, the credibility of the shared data is not guaranteed, and hence the data is not valuable. Under this assumption, if the service provider makes the data public (without the consent of the individual), it will be detected by the individual. Similarly, if the service provider tries to share the data offline with another (non-malicious) entity, that entity will understand that the corresponding data is being shared without the consent of the data owner. Note, however, that if the unauthorized offline sharing of genomic data is between a malicious service provider and other malicious service providers, there is no technical solution to detect this leakage.
A real-life example highlighting the use of the proposed technique may be described as follows. Alice obtains her sequenced genomic data from a certified institution. At some point, Alice wants to share a part of her genomic data with a research institution or a pharmaceutical company (e.g., in order to enrol in a research endeavour in return for some compensation). The research institution, both for the accuracy of the research and for the sake of the compensation paid, wants to make sure that the data received from Alice indeed belongs to Alice (with a certain phenotype). One contribution of our proposed system is to prove to the research institution that the data really belongs to the individual who provides it (either anonymously or by revealing the identity). Furthermore, Alice, after she provides her data to the research institution, would still want to have control over her data. In other words, Alice wants to control further re-sharing of her data by the research institution and she wants to detect a malicious research institution in case her data is re-shared without her consent. Another contribution of our proposed system is to make sure that such unconsented re-sharing of data will be detected and the corresponding malicious research institution will be held liable for this behavior.

Contribution
In a nutshell, we propose two schemes to share genomic data and genetic test results, respectively. The proposed schemes are based on both homomorphic signatures and aggregate signatures that link the information about the legitimacy of the data to the consent and the phenotype (or the identity) of the individual. Thus, in order to verify the data, a party also needs to use the correct consent and phenotype of the individual who owns the data.
One proposed scheme allows the service providers to check the validity of individuals' genomic data. The other proposed scheme allows service providers to conduct genetic tests on individuals' data and be assured that the test is conducted accurately. The adoption of homomorphic signatures enables the individual to honestly share any subset of the authenticated data or the test results without interacting with the authority. Moreover, it guarantees that the individual does not leak unnecessary information when sharing the test results. The adoption of aggregate signatures efficiently prevents illegal (or unauthorized) sharing of genomic data by the service providers. In such a case, either the entity which receives the data understands that the data is shared without the consent of the data owner, or the data owner can understand which service provider leaked his data without his consent, and hence he can hold that party liable for the leakage.
We note that the main novelties of the proposed work are the proposed system, the combination of homomorphic and aggregate signatures, and the application of the proposed system to genomic data. We use existing cryptographic primitives to build the proposed system (namely homomorphic and aggregate signatures); however, the proposed system is not a straightforward use of such cryptographic tools. In general, sharing privacy-sensitive data between entities is an emerging research area. The main differences of genomic data with respect to other types of sensitive data can be summarized as follows: (i) it includes privacy-sensitive information such as predisposition to serious diseases, (ii) it includes information about the family members, (iii) it is not revocable (and hence, it is crucial to make sure that it is not leaked), (iv) it is typically shared partially (as different parts of it or different computations on it are requested by different parties), and (v) its credibility is very important for the parties that use it (e.g., for research). The proposed system brings solutions for many of the aforementioned unique characteristics of genomic data. That is, we bring a solution for the liability and credibility issues that may arise during sharing of genomic data by developing a novel application of both homomorphic and aggregate signatures.
We emphasize that the proposed schemes can be easily adopted by existing works on privacy-preserving processing of genomic data in order to have a complete pipeline. The rest of the paper is organized as follows. In the next section, we discuss the related work on security/privacy of genomic data and content ownership techniques. In Section 3, we briefly provide background information on homomorphic signatures, aggregate signatures, and genomics. In Section 4, we introduce our system and threat models. In Section 5, we provide the details of the solution for sharing genomic data along with the security analysis. In Section 6, we describe the protocol for sharing the results of a genetic test. In Section 7, we discuss the security properties of the solution and evaluate the practicality of the proposed scheme. Finally, in Section 8, we conclude the paper.

RELATED WORK
There have been several works on security and privacy of genomic data. However, as mentioned, credibility and liability issues of genomic data have not been considered in previous work. We briefly summarize the existing efforts on security/privacy of genomic data in the following.
One line of investigation is represented by works focusing on private clinical genomics. Baldi et al. presented efficient algorithms for privacy-preserving testing on full genomes, including paternity and ancestry testing, and the testing of point mutations (single nucleotide polymorphisms, SNPs) for partner compatibility and personalized medicine [1]. Ayday et al. proposed a scheme to protect the privacy of users' genomic data yet enable medical units to access the genomic data in order to conduct medical tests or to develop personalized medicine methods [2]. Karvelas et al. proposed using oblivious RAM mechanisms to access genomic data (that is stored at a third party) and secure two-party computation protocols to compute various functionalities on the data [3]. Recently, Wang et al. proposed private edit distance protocols to find similar patients (e.g., across several hospitals) [4]. To provide secure storage and retrieval of genomic data, Ayday et al. proposed techniques for the privacy-preserving storage and retrieval of raw genomic data [5], and Huang et al. proposed a scheme that would guarantee long-term security (in an information-theoretical sense) for genomic data [6].
Another area of interest addresses the problem of protecting genomic privacy and still allowing for both basic and translational medical research on the data. It has been shown that standard anonymization techniques are ineffective on genomic data [7]. It has also been shown that the identity of a participant of a genomic study can be revealed by using a second sample, that is, part of the DNA information from the individual and the results of the corresponding clinical study [8]. Furthermore, Humbert et al. evaluated the genomic privacy of an individual threatened by his or her relatives revealing their genomes [9]. As a response to these threats, a few solutions have been proposed. These can be put in three main categories: (i) techniques based on differential privacy, in which a controlled noise is added to the result of a query (to a genomic database) [10], (ii) techniques based on cryptography, in which the use of homomorphic encryption, secure hardware, or secure multiparty computation is proposed for privacy-preserving genomic research [11], [12], and (iii) techniques based on optimization, in which the goal is to maximize the amount of publicly shared genomic data while complying with the privacy preferences of individuals.
There have also been many attempts to prove the credibility (or authenticity) of a given message or document. The most common tools to provide this functionality are digital signatures [13]. Digital signatures are widely used for software distribution, financial transactions, and in other cases in which it is important to detect forgery or tampering. However, using a digital signature to prove the credibility of a genome has two main disadvantages: (i) a digital signature can reveal the identity of the genome donor, and (ii) genomic data is usually shared or donated partially, but the signature is typically computed over the whole data at the data generator side (e.g., the sequencing facility). On the other hand, liability issues of digital content are typically addressed by using a watermarking technique on the document [14]. However, (i) digital watermarking techniques have proved functional for multimedia content, but not for informative text, (ii) watermarking techniques typically inject some level of noise into the data, which might not be tolerated for health-related data, and (iii) a watermark is typically applied to the whole file (e.g., an image), but genomic data can be partially shared.

PRELIMINARIES
In this section, we provide background information for homomorphic and aggregate signatures (which are the main building blocks of our proposed schemes) and genomics in general.

Signature Schemes
Homomorphic signatures. Similar to a homomorphic encryption scheme, which enables computation on encrypted data, a homomorphic signature scheme enables computation on signed data. Suppose a user Alice has a set of messages $\{m_1, \dots, m_k\}$. She can (independently) sign each data element and store the signatures at a cloud server. Later, Alice can ask the server to compute authenticated functions of the signed data (e.g., a signature for the mean value of the messages), solely based on the individual signatures. Given the mean value and the signature from the server, any user can verify the signature. Many homomorphic signature schemes have been proposed in the literature, as surveyed in [15]. Next, we briefly introduce the Boneh-Freeman linearly homomorphic signature scheme (Setup, Sign, Verify, Evaluate) from [16] that we will use in this work$^1$. The scheme is detailed in Appendix A.
• Setup($1^n$, $k$). On input a security parameter $n$ and a dataset size $k$, this algorithm outputs a public/private key pair $(pk^h, sk^h)$. The parameter $k$ defines how many signatures can be involved in the homomorphic operation. The message space is $\mathbb{F}_p^n$, where $p$ is a prime number, and signatures are short vectors in $\mathbb{Z}^n$. The set of admissible functions $\mathcal{F}$ includes all $\mathbb{F}_p$-linear functions on $k$-tuples of messages in $\mathbb{F}_p^n$.
• Sign($sk^h$, $\tau$, $m$, $i$). On input a secret key $sk^h$, a tag $\tau \in \{0, 1\}^n$, a message $m \in \mathbb{F}_p^n$, and an index $i$, this algorithm outputs a signature $\sigma$. Note that $\tau$ can be considered as an identifier of the dataset that $m$ belongs to, while $i$ is the index of $m$ in this dataset.
• Verify($pk^h$, $\tau$, $m$, $\sigma$, $f$). On input a public key $pk^h$, a tag $\tau \in \{0, 1\}^n$, a message $m \in \mathbb{F}_p^n$, a signature $\sigma \in \mathbb{Z}^n$, and a function $f \in \mathcal{F}$, this algorithm outputs 1 (accept) or 0 (reject).
Two security properties are defined for homomorphic signatures: unforgeability and context-hiding. Informally, the unforgeability property implies that an attacker will not be able to forge a signature for a new message with an existing tag or for any message under a new tag $\tau$ (generated by the attacker himself). Moreover, the attacker will not be able to forge a signature for a message which is not equal to the evaluation of $f$ on the existing signed messages. Suppose that $m = f(m_1, \dots, m_k)$ and that $\sigma$ is the corresponding output of Evaluate. The context-hiding property implies that the signature $\sigma$ does not leak more information about the original signed messages than what is revealed by $m$ itself.
Aggregate signatures. To improve the efficiency of cascaded sharing of SNPs and test results, we also use aggregate signatures. Suppose there are $N$ users, denoted as $\{U_1, \dots, U_N\}$, of an aggregate signature scheme (Setup, KeyGen, Sign, Verify, Aggregate). Suppose that each user $U_i$, with the key pair $(pk^a_i, sk^a_i)$, generates a signature $\sigma_i = \mathrm{Sign}(sk^a_i, m_i)$ for message $m_i$. Then, given the $\sigma_i$ ($1 \le i \le N$) values from all users, any entity can run Aggregate to aggregate them into a single signature $\sigma_{agg}$. With $pk^a_i$, $m_i$ ($1 \le i \le N$) and $\sigma_{agg}$, any entity can verify whether these signatures are valid or not. In this paper, we use the Boneh-Lynn-Shacham aggregate signature scheme [17], which achieves the standard unforgeability property. The scheme is detailed in Appendix B.
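To make the Sign / Evaluate / Verify interplay concrete, the following is a minimal sketch of a toy linearly homomorphic authenticator over $\mathbb{F}_p$. It is not the Boneh-Freeman scheme: it is a secret-key stand-in (verification needs the signing key), it is not meant to be secure, and the modulus `P`, the message length `N`, and all function names are assumptions chosen for illustration. It only shows the property the text relies on: combining signatures with the coefficients of $f$ yields a value that verifies against the same linear combination of the messages.

```python
import hashlib
import secrets

P = 2**61 - 1  # illustrative prime modulus standing in for p
N = 4          # illustrative message length n


def pad(key, tag, index):
    """Pseudorandom pad vector for (tag, index), derived with SHA-256."""
    out = []
    for coord in range(N):
        data = key + tag + index.to_bytes(4, "big") + coord.to_bytes(4, "big")
        out.append(int.from_bytes(hashlib.sha256(data).digest(), "big") % P)
    return out


def setup():
    """Return a toy secret key: a nonzero scalar and a pad-derivation key."""
    return secrets.randbelow(P - 1) + 1, secrets.token_bytes(32)


def sign(sk, tag, m, i):
    """Authenticate message m (index i in the dataset identified by tag)."""
    a, key = sk
    return [(a * mj + rj) % P for mj, rj in zip(m, pad(key, tag, i))]


def combine(f, vectors):
    """Apply the linear function with coefficients f to a list of vectors."""
    return [sum(c * v[j] for c, v in zip(f, vectors)) % P for j in range(N)]


def evaluate(f, sigs):
    """Derive an authenticator for f(m_1, ..., m_k) from the individual ones."""
    return combine(f, sigs)


def verify(sk, tag, m, sigma, f):
    """Check that sigma authenticates m as the output of f on the signed data."""
    a, key = sk
    pads = [pad(key, tag, i) for i in range(1, len(f) + 1)]
    expected = [(a * m[j] + sum(c * r[j] for c, r in zip(f, pads))) % P
                for j in range(N)]
    return expected == list(sigma)
```

In this sketch, messages are signed with indices 1 to k, and a verifier who is given only the combined message and the combined authenticator can still check it against the coefficients of f.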

Genomics Background
The human genome is encoded in double stranded DNA molecules consisting of two complementary polymer chains. Each chain consists of simple units called nucleotides (A, C, G, T). Even though most of the DNA sequence is conserved across the whole human population, around 0.5% of each person's DNA (which corresponds to several millions of nucleotides) is different from the reference genome, owing to genetic variations. Single nucleotide polymorphism (SNP) is the most common DNA variation. A SNP is a position in the genome holding a nucleotide that varies between individuals and there are approximately 4 million SNPs in each individual. Multiple Genome Wide Association Studies (GWAS) performed in recent years have shown that a patient's susceptibility to particular diseases can be (partially) predicted from sets of his SNPs. Thus, leakage of SNPs often poses a significant threat to individual privacy.
1. We note that other similar homomorphic signature schemes can also be used.
Each SNP position includes two alleles (i.e., two nucleotides) and everyone inherits one allele of every SNP position from each of his parents. If an individual receives the same allele from both parents, he is said to be homozygous for that SNP position. If, however, he inherits a different allele from each parent (one minor and one major), he is called heterozygous. Depending on the alleles the individual inherits from his parents, the content of a SNP position can be simply represented as the number of minor alleles it possesses, i.e., 0, 1, or 2. A service provider may run various linear tests on the SNPs of an individual. For example, a service provider may compute the predicted susceptibility of patient $P$ for disease $X$, $S^X_P$, by using weighted averaging [2], where $\varphi_X$ includes the indices of SNPs that are relevant for disease $X$ and $w^i_j(X)$ represents the contribution of the different states of SNP $j$ (i.e., 0, 1, or 2) for disease $X$.
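The 0/1/2 encoding described above can be sketched as follows. The helper names, the per-SNP `weights`, and the `relevant` index set are hypothetical and serve only as an illustration of a generic linear test on encoded SNPs; this is not the susceptibility formula of [2], which uses per-state contributions.

```python
def encode_snp(allele_1, allele_2, minor_allele):
    """Return the number of minor alleles (0, 1, or 2) at one SNP position."""
    return int(allele_1 == minor_allele) + int(allele_2 == minor_allele)


def linear_test(snps, weights, relevant):
    """Weighted sum of encoded SNP values over the indices in `relevant`.

    `snps` and `weights` are dictionaries keyed by SNP index.
    """
    return sum(weights[j] * snps[j] for j in relevant)
```

For instance, an individual with alleles A and G at a position whose minor allele is G is heterozygous and is encoded as 1.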

SYSTEM AND SECURITY MODELS
Here we describe the system model, threat model, and the initialization for the proposed scheme. Frequently used symbols and notations are presented in Table 1.

The System Model
We assume the existence of multiple certified institutions (CIs), individuals, and service providers (SPs) in the system. For the sake of simplicity, we will describe the proposed scheme using a single CI, individual (Alice), and SP. Our proposed system model is also illustrated in Fig. 1.
The CI is mainly responsible for sequencing, encrypting, and signing the sequenced data. In this work, we do not consider encryption at the CI, as it is not the main focus of the paper. However, there have been several works in the literature that cover such encryption techniques. Our proposed scheme can easily be adopted by one of such schemes to provide a complete pipeline. Furthermore, it is worth noting that a certified institution for sequencing has been proposed in many existing works on genomic privacy [2]. Having such a CI is also unavoidable in today's sequencing technology. In practice, the SP can be a medical institution, a genetic researcher, or a direct-to-consumer (DTC) service provider. The SP is mainly interested in receiving a portion of Alice's genomic data or the result of a genetic test computed on it. It has been shown that the results of such genetic tests are particularly important to determine (i) the predisposition of an individual for different diseases, or (ii) the exact dose of a drug that will be prescribed to an individual. Alice, on the other hand, is interested in either (i) enrolling in a genetic research initiative by donating a part of her genome (e.g., a subset of her SNPs), (ii) sharing a part of her genome with a medical institution for treatment, or (iii) receiving a service based on the result of a genetic test that will be run on her genome. In all these scenarios, Alice wants to share her data either anonymously (without her real identity) or with her real identity. Furthermore, she also wants to provide a consent denoting whether the SP can further share the genomic data it received from Alice with other entities (either anonymously or with the real identity of Alice). When the system is set up, we assume the following keys have been generated and certified by a certificate authority (CA).

• The CI generates a key pair $(sk^h_{CI}, pk^h_{CI})$ for the Boneh-Freeman homomorphic signature scheme. During the key generation, the CI should set the parameters according to the pre-defined sequencing tasks. Suppose the set of SNPs for Alice is $G$ with the size $|G|$; then the $k$ parameter (number of signatures that can be involved in the homomorphic operation) should be $|G| + 2$, as required by the proposed protocols. The parameter $p$ (in Section 3.1) should be selected such that equality (3), defined in Section 5.1, holds only with very small probability.
• Alice generates a key pair $(sk^a_A, pk^a_A)$ for the Boneh-Lynn-Shacham aggregate signature scheme.
• The SP generates a key pair $(sk^a_{SP}, pk^a_{SP})$ for the Boneh-Lynn-Shacham aggregate signature scheme.
As a standard practice, we assume the CA generates a certificate for every public key and is responsible for all maintenance issues. For simplicity, we omit the details here. With respect to Alice's public key $pk^a_A$, we assume the associated certificate $Cert_A$ does not contain Alice's real identity $ID_A$ because we want to allow Alice to anonymously share her data (when desired). However, we require the CA to issue a specific certificate $Cert_{pk-id-A}$ to link $ID_A$ and $pk^a_A$.
TABLE 1: Frequently used symbols and notations.
$ID_{SP}$: The identity of the SP
$C_{A,SP}(t)$: The actual consent vector ({"do not share", "share anonymously", "share non-anonymously"})
$M^c$: Message format for the consent
$P_A$: Vector representing Alice's phenotype
$R_A$: Anonymization factor for anonymous sharing

Threat Model
To be realistic and avoid a single point of failure, we assume there are two trust anchors in the system. First, all parties trust the CA(s) to certify the public keys used to protect genomic data, as shown in Fig. 2. In reality, the CA(s) can be government agencies or entities endorsed by such agencies. We could even require the CI to be certified by more than one CA. For simplicity, we assume there is only one CA in our discussion. Second, all parties trust the CI to generate genomic data (via sequencing) and link the generated data to individual users, as shown in Fig. 3. That is, the CI does the sequencing of the individual by taking a biological sample from the individual when the individual is physically present at the CI. The sequencing part of the pipeline is the less secure part, as acknowledged in many existing works. Thus, one has to be physically present at the CI for sequencing. If physical presence were not needed for sequencing, anyone could send anyone else's sample, which is not desired at all. Thus, this physical presence requirement at the CI guarantees that the user cannot provide incorrect data (that does not belong to herself) during the protocol. Currently, sequencing centers do not sequence anyone without physical presence. One exception is the direct-to-consumer (DTC) service providers, but (i) DTC providers do not do full sequencing, and (ii) the reliability of their data is questionable. Once the CI takes the sample for sequencing, it also verifies the phenotype of the sequenced individual.
Since we want to focus on the credibility and liability issues, we simply assume there are secure communication channels between all parties. Therefore, an outside attacker will neither learn the genomic data and test results (confidentiality) nor modify them (integrity). Under these assumptions, we mainly consider two types of attacks in our security evaluation.
• Credibility attack. A malicious individual may provide incorrect genomic data or test results to a research institution (or a pharmaceutical company), and a malicious SP may forward modified genomic data or test results to another SP to mislead the latter.
• Liability attack. A malicious party (e.g., an SP or CI) may try to forge a user's consent in order to share his/her genomic data or test results with another honest party. As mentioned before, if two malicious parties want to share a user's data in their hands, we have no technical way to stop it and should resort to other countermeasures.
We note that neither of our proposed schemes requires the SP to play by the book. That is, the SP can be a malicious institution that wants to (i) modify Alice's genomic data and share it with other parties, or (ii) share Alice's genomic data publicly or with other parties without the consent of Alice and still get away with this behavior.

Initialization
We have two message formats in the proposed scheme representing the SNPs and the consent.

• The message format of SNP $i$ of Alice is denoted as an $n$-tuple $M^s_i = (ID_A, g_i, 0, \dots, 0)$, where $ID_A$ is Alice's identity and $g_i$ is the value of SNP $i$ ($i \in G$ and $g_i \in \{0, 1, 2\}$). The $(n - 2)$ zeros in $M^s_i$ are to meet the message format of the Boneh-Freeman homomorphic signature scheme.
• The message format of the consent is denoted as $M^c$, in which $ID_{SP}$ is the identity of the SP for the corresponding transaction, and $C_{A,SP}(t)$ represents the actual consent. In its simplest form, $C_{A,SP}(t)$ can be {"do not share", "share anonymously", "share non-anonymously"}, and can be defined freely. We assume $C_{A,SP}(t) = (c_1, c_2, c_3)$, where $c_i \in \{0, 1\}$, and at any instant, the $C_{A,SP}(t)$ vector includes a single "1" (i.e., only one of the $c_i$ values is equal to "1" and the others are "0").
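The two message formats above can be sketched as simple tuples. This is only an illustration under stated assumptions: the identity is assumed to be already encoded as an integer in $\mathbb{F}_p$, the message length `N` is an arbitrary choice, and the helper names are hypothetical.

```python
N = 6  # illustrative message length n
CONSENT_OPTIONS = ("do not share", "share anonymously", "share non-anonymously")


def snp_message(id_a, g_i, n=N):
    """Build (ID_A, g_i, 0, ..., 0): two payload entries and n - 2 zeros."""
    if g_i not in (0, 1, 2):
        raise ValueError("a SNP value must be 0, 1, or 2")
    return (id_a, g_i) + (0,) * (n - 2)


def consent_vector(choice):
    """Build (c_1, c_2, c_3) with exactly one entry set to 1."""
    if choice not in CONSENT_OPTIONS:
        raise ValueError("unknown consent option")
    return tuple(int(option == choice) for option in CONSENT_OPTIONS)
```

Rejecting any other SNP value or consent label keeps every constructed message inside the formats the signatures are defined over.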
After the setup, Alice and the CI interact as follows for Alice to register at the CI.
1) Alice sends her identity $ID_A$, her phenotype $P_A$, her public key $pk^a_A$ and associated certificate $Cert_A$, and $Cert_{pk-id-A}$ to the CI.
2) The CI validates the following facts: Alice owns the phenotype $P_A$, the certificate $Cert_A$ for $pk^a_A$ is correct, and $Cert_{pk-id-A}$ is valid and links $ID_A$ and $pk^a_A$. If the validation passes, the CI selects $\tau_A \in \{0, 1\}^n$ and sends it to Alice. Note that $n$ is the security parameter of the Boneh-Freeman homomorphic signature scheme. At the end, the CI establishes a record $(ID_A, P_A, pk^a_A, Cert_A, Cert_{pk-id-A}, \tau_A)$ for Alice. The CI publishes $pk^a_A$ and $\tau_A$ so that any entity can see the link between them.
At any time, Alice can provide her biological sample to the CI, which will then sequence her genome and sign the results. As discussed before (due to the current sequencing policies), the sequencing operation requires Alice to be physically present at the CI and provide her biological sample. During this process, the CI also verifies the phenotype of Alice ($P_A$) and adds this information to Alice's record as well. In more detail, Alice and the CI perform the following protocol shown in Fig. 4.

1) Alice sends her biological sample along with $ID_A$ and $P_A$ to the CI.
2) The CI does the sequencing and determines the SNPs in $G$.
3) The CI constructs the SNP messages $M^s_i = (ID_A, g_i, 0, \dots, 0)$ for every $i \in G$.
4) The CI selects the anonymization factor $R_A = (r_A, 0, \dots, 0)$, where $r_A \xleftarrow{\$} \mathbb{Z}_p$, which means $r_A$ is chosen from $\mathbb{Z}_p$ uniformly at random. The anonymization factor is used when Alice wants to share her data anonymously.
5) The CI constructs the anonymized SNPs.
6) The CI signs each anonymized SNP message using the homomorphic signature scheme and $sk^h_{CI}$ to obtain $S_i = \mathrm{Sign}(sk^h_{CI}, \tau_A, M^s_i, i)$ for every $i \in G$.
7) The CI signs the anonymization factor $R_A$ to obtain the signature $T_A$.
8) The CI verifies Alice's phenotype (that Alice indeed has the phenotype $P_A$). This process is also done while Alice is physically present at the CI. We assume that the $P_A$ vector is of size $\alpha$ and it is represented as $P_A = (p^1_A, \dots, p^\alpha_A)$, where $p^i_A \in \{0, 1\}$. That is, each vector entry represents the existence of a particular phenotype and if Alice has the corresponding phenotype, that entry is marked as "1". Once the phenotype is verified, the CI also adds $P_A$ to Alice's record.
9) The CI signs the ID of Alice along with her phenotype information to obtain the signature $D_A$.
10) The CI sends the anonymized SNPs, the corresponding signatures (i.e., the $S_i$ values), the anonymization factor (i.e., $R_A$), $T_A$, and $D_A$ to Alice.
11) Alice verifies all received signatures.
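To facilitate the following discussions, we define a message vector $\vec{M}$ and a signature vector $\vec{\sigma}$, each with $|G| + 2$ elements: the $|G|$ SNP messages signed by the CI and the two additional messages signed in steps 7 and 9, together with their signatures (the $S_i$ values for $i \in G$, $T_A$, and $D_A$).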
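The selection of the anonymization factor in step 4 above can be sketched as follows. The first coordinate is called `r_a` here, and Python's `secrets.randbelow` is used for the uniform choice from $\mathbb{Z}_p$. The `mask` helper is purely an assumption for illustration: it adds $R_A$ to a SNP message coordinate-wise modulo $p$, which hides the identity coordinate and leaves the SNP value untouched, but the exact construction of the anonymized SNPs is not preserved in the text above.

```python
import secrets


def anonymization_factor(p, n):
    """Return R_A = (r_a, 0, ..., 0) with r_a uniform in {0, ..., p - 1}."""
    r_a = secrets.randbelow(p)
    return (r_a,) + (0,) * (n - 1)


def mask(message, factor, p):
    """Assumed masking step: coordinate-wise addition modulo p."""
    return tuple((m + r) % p for m, r in zip(message, factor))
```

Because only the first coordinate of the factor is nonzero, such a mask would affect only the position that holds the identity in the SNP message format.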

PROTOCOL FOR SHARING SNPS
If Alice wants to share her SNPs with the SP non-anonymously, they engage in the protocol shown in Fig. 5. In more detail, the protocol takes the following steps.
1) The SP sends the indices of the SNPs it requests, denoted by $I = \{i_1, \dots, i_t\}$.
2) Alice retrieves the corresponding anonymized SNPs $M^s_j$ ($j \in I$) along with the corresponding anonymization factor $R_A$.
3) Alice generates $|G| + 2$ random coefficients to construct a function $f$ which has the encoding $\langle f \rangle = (f_1, \dots, f_{|G|+2})$. The coefficients are derived from a hash function PF: Alice sets the coefficients of the requested entries from the output of PF, and sets $f_x = 0$ for other $x$ (i.e., for the SNPs that are not in $I$). Thus, any entity, including the SP, can validate that $f$ is generated in this manner.
4) Alice generates a combined signature $\sigma^*$ using the homomorphic properties of the digital signature scheme.
5) Alice sends $ID_A$, the requested data, and $\sigma^*$ to the SP. In addition, Alice should also send $pk^a_A$ and $Cert_{pk-id-A}$.
6) The SP validates $f$ (as the coefficients in $f$ are publicly verifiable) and verifies $\sigma^*$.
7) The SP requests the consent from Alice.
8) Alice generates the consent $M^c$ and signs it using her private key to obtain the signature $\sigma$.
9) Alice sends $M^c$ and $\sigma$ to the SP.
10) The SP verifies the signature.
The use of aggregate signatures for further re-sharing of the same data (assuming Alice has consented to re-sharing) is discussed below.
Suppose that $SP^{(0)}$ has been authorized by Alice to further share her SNP data. If $SP^{(0)}$ wants to share the SNPs with $SP^{(1)}$, then it will generate a signature $\sigma_{0 \to 1}$ for a consent of the form $M^c \| \tau_A \| info \| ID_{SP^{(1)}}$. Similarly, $SP^{(1)}$ can generate a signature $\sigma_{1 \to 2}$ for a consent of the form $M^c \| \tau_A \| info \| ID_{SP^{(2)}}$ to share the SNPs with $SP^{(2)}$. This process can continue and form a chain of delegated consents: $\sigma_{0 \to 1}, \sigma_{1 \to 2}, \dots, \sigma_{N-1 \to N}$. $SP^{(N)}$ can aggregate the signatures into a single one, $\sigma_{0 \to \dots \to N}$. When $SP^{(N)}$ wants to share Alice's data with Bob, it provides the chain of consents and the aggregated signature together with the data and $\sigma^*$.
Bob can then validate all the signatures in the chain to see whether $SP^{(N)}$ has obtained the permission or not.
Moreover, Bob can validate the SNP data by validating $\sigma^*$. Note that the SNP data can be obtained from the $info$ parameter.
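The chain of delegated consents can be sketched as follows. This is a minimal illustration of the bookkeeping Bob performs, not an implementation of the Boneh-Lynn-Shacham scheme: the length-prefixed byte encoding of $M^c \| \tau_A \| info \| ID_{SP}$ is an assumption, each link is checked individually rather than through a single aggregate, and `toy_sign` / `toy_verify` are HMAC-based stand-ins (symmetric, with a hypothetical key table) used only so that the example runs with the standard library.

```python
import hashlib
import hmac


def consent_message(m_c, tau_a, info, id_next):
    """Encode M^c || tau_A || info || ID_SP with length prefixes (assumed format)."""
    parts = (m_c, tau_a, info, id_next)
    return b"".join(len(part).to_bytes(4, "big") + part for part in parts)


def validate_chain(links, m_c, tau_a, info, first_id, final_id, verify):
    """Check that consents lead from first_id to final_id without a gap.

    `links` is a list of (signer_id, recipient_id, signature) tuples.
    """
    expected_signer = first_id
    for signer_id, recipient_id, signature in links:
        if signer_id != expected_signer:
            return False  # the chain is broken or out of order
        message = consent_message(m_c, tau_a, info, recipient_id)
        if not verify(signer_id, message, signature):
            return False
        expected_signer = recipient_id
    return expected_signer == final_id


# Stand-in signature scheme for the example only (NOT an aggregate signature).
TOY_KEYS = {b"SP0": b"key-0", b"SP1": b"key-1"}


def toy_sign(signer_id, message):
    return hmac.new(TOY_KEYS[signer_id], message, hashlib.sha256).digest()


def toy_verify(signer_id, message, signature):
    if signer_id not in TOY_KEYS:
        return False
    return hmac.compare_digest(toy_sign(signer_id, message), signature)
```

With a real aggregate signature, the per-link checks in the loop would be replaced by one verification of the aggregated signature over all consent messages and public keys, while the continuity check on signer and recipient identities would stay the same.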

Security Analysis
As to security, the homomorphic signature scheme guarantees that the signature σ * is computed based on the signed SNPs by the CI, while the aggregate signature scheme guarantees that the consent is actually given by the owner.The tag τ A links the two signatures together.In the proposed protocol, the generation of challenge f plays a key role in preventing credibility attacks, because it randomly links the homomorphic signature to the original signed SNPs and forbids malleability.We discuss two cases.
• Alice tries to cheat the SP. In this case, some of the SNP information from Alice, namely $M^s_{i_j}$ ($1 \le j \le t$) and $R_A$, is different from what has been signed by the CI. The unforgeability property of the homomorphic signature scheme guarantees that $f \cdot \vec{M}^T$ is computed correctly by Alice and that the corresponding signature $\sigma^*$ is valid. Otherwise, we would have a forgery for the signature scheme. As such, Alice can only successfully mount an attack when a specific equality involving the modified message vector holds. Based on the generation of $f$, it is straightforward to show that this equality holds with negligible probability for reasonable parameters if we assume PF to be a random oracle. Therefore, it is infeasible for Alice to mount the attack.
• In the second case, collusion does not give Alice any additional advantage.
The unforgeability property of the Boneh-Lynn-Shacham aggregate signature scheme guarantees that the SP has been authorized by Alice to use the SNPs and has the privileges specified in the consent $M_c$. The $info$ parameter links the signature $\sigma$ to the shared SNP data.
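To build intuition for the first case above, the following sketch counts how often a modified message vector would pass a random challenge. It rests on assumptions that are not taken from the protocol: the challenge $f$ is modeled as uniform over $\mathbb{Z}_p^n$ for a small prime $p$, and the attack is assumed to succeed exactly when $f \cdot d \equiv 0 \pmod p$ for the nonzero difference $d$ between the modified and the signed vectors. The actual argument relies on PF being a random oracle and on the real scheme parameters.

```python
from itertools import product

def passing_fraction(d: tuple[int, ...], p: int) -> float:
    """Fraction of challenges f in Z_p^n with f . d == 0 (mod p).

    d is the (assumed) difference between the modified and the signed
    message vectors; p is a toy prime modulus.
    """
    n = len(d)
    hits = sum(
        1
        for f in product(range(p), repeat=n)
        if sum(fi * di for fi, di in zip(f, d)) % p == 0
    )
    return hits / p ** n

# A nonzero difference vector passes for a 1/p fraction of the challenges.
fraction_nonzero = passing_fraction((1, 0, 3), 5)
# An unmodified vector (zero difference) always passes.
fraction_zero = passing_fraction((0, 0, 0), 5)
```

In this toy model the success probability of a cheating attempt is $1/p$ regardless of which entries were changed, which is why the size of the challenge space matters when choosing parameters.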

Anonymous Sharing
In order to stay anonymous, Alice follows the same protocol, shown in Fig. 5, except for the following changes. After all the changes, the security analysis remains the same.

PROTOCOL FOR SHARING TEST RESULTS
If Alice wants to share the genetic test results with the SP, they engage in the protocol shown in Fig. 6. The protocol has the following steps.
1) The SP sends the weights of the test to Alice (to be general, we assume all SNPs to be used in the test).
2) Alice constructs the first $|G|$ values of $f$ based on the weights and sets $f_{|G|+1} = f_{|G|+2} = 0$.
3) Alice computes the result of the test $m^* = f \cdot \vec{M}^T$ using her SNPs and the received weights.
4) Alice generates a combined signature $\sigma^*$ using the homomorphic properties of the digital signature scheme.
If Alice wants to share her phenotype information $P_A$ with the SP, then she can send $R_A$, $T_A$, $ID_A$, $P_A$, and $D_A$ to the SP, which can verify the signatures $T_A$ and $D_A$ independently. In addition, she should also send $Cert_{pk-id-A}$, which links $ID_A$ to $pk^a_A$. If Alice wants to stay anonymous, she should not share this information. Moreover, Alice should replace $ID_A$ with $\tau_A$ in the consent $M_c$.
The unforgeability property of the homomorphic signature scheme guarantees that the test result $m^*$ is faithfully computed based on Alice's data, while the context hiding property guarantees that the signature $\sigma^*$ does not leak more information about Alice's SNPs than $m^*$ does. The unforgeability property of the aggregate signature scheme guarantees that the SP has been authorized by Alice to use the test results and has the privileges specified in the consent. If the test results are going to be shared further with other SPs, the workflow is the same as that of sharing SNPs.
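The way a test result and its combined signature fit together can be sketched with a toy linearly homomorphic authenticator. This is an illustration under explicit assumptions, not the Boneh-Freeman scheme: it is a symmetric-key construction (tags of the form $t_i = u \cdot m_i + b_i \bmod P$), so verification needs the secret key instead of the CI's public key, messages are modeled as integers, and the modulus, key values, and function names are all hypothetical.

```python
import hashlib
import hmac

P = 2**61 - 1  # toy prime modulus (an assumption, not a protocol parameter)

def prf(key: bytes, tau: bytes, index: int) -> int:
    """Per-slot mask b_i derived from the key, the tag tau_A and the slot index."""
    digest = hmac.new(key, tau + index.to_bytes(4, "big"), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % P

def sign(key: bytes, u: int, tau: bytes, message: int, index: int) -> int:
    """Tag for one message slot: t_i = u * m_i + b_i (mod P)."""
    return (u * message + prf(key, tau, index)) % P

def evaluate(f: list[int], tags: list[int]) -> int:
    """Combine tags with the coefficients f: sum_i f_i * t_i (mod P)."""
    return sum(fi * ti for fi, ti in zip(f, tags)) % P

def verify(key: bytes, u: int, tau: bytes, f: list[int],
           result: int, combined_tag: int) -> bool:
    """Check combined_tag == u * result + sum_i f_i * b_i (mod P)."""
    mask = sum(fi * prf(key, tau, i) for i, fi in enumerate(f, start=1)) % P
    return combined_tag == (u * result + mask) % P

# Toy data: |G| = 4 SNP slots, then the R_A slot and the ID/phenotype slot.
key, u, tau = b"ci-secret", 123456789, b"tag-A"
messages = [0, 1, 2, 1, 987654, 424242]
tags = [sign(key, u, tau, m, i) for i, m in enumerate(messages, start=1)]

weights = [3, 1, 4, 1]
f = weights + [0, 0]  # f_{|G|+1} = f_{|G|+2} = 0, as in step 2)
m_star = sum(fi * mi for fi, mi in zip(f, messages)) % P  # step 3)
sigma_star = evaluate(f, tags)                            # step 4)
```

Because the last two coefficients are zero, the slots for $R_A$ and for the identity and phenotype do not contribute to $m^*$ or to the combined tag, which matches the intent of setting $f_{|G|+1} = f_{|G|+2} = 0$ when only the test result is shared.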

DISCUSSION
In this section, we further discuss the security and performance of the proposed solutions.

Security
In general, all signatures (on data, ID, and phenotype) are generated by the CI. Using the homomorphic properties of the digital signature scheme (as discussed in Section 3.1), Alice linearly combines these signatures (depending on the type of the query) and generates a valid signature that can be verified using the public key of the CI. As discussed in Section 5.1, Alice cannot cheat an SP by providing incorrect SNP data. We assume that the SP, when sharing Alice's data with other entities, needs to show proof that the data is legitimate. This proof is the digital signature that the SP receives from Alice (signed using the aggregate signature scheme and Alice's private key). As discussed, the signature can only be verified by using the correct consent of Alice. Therefore, the SP will be detected if it tries to share Alice's data without her consent. A malicious SP may try to modify the consent of Alice in order to share her data with other entities (along with a valid signature). However, since the consent is signed with Alice's private key in the first place, such an attack is also not possible.
A malicious SP may also publicly share Alice's SNP data without her consent. We assume that such sharing also includes the signature to prove the credibility of the shared data. In such a scenario, the $f$ values in the corresponding signature would reveal the identity of the malicious SP that leaked Alice's data without her consent. This property of the proposed scheme provides a solution for the liability issues in case of unauthorized sharing of genomic data (since the values in $f$ are generated using the public key of the SP, as discussed in Section 5).
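The liability argument can be sketched as follows. The exact inputs and definition of PF are given in Section 5 and are not reproduced here, so this example makes assumptions: SHA-256 over the SP's public key, the tag $\tau_A$, and a counter stands in for PF, and the names `derive_challenge` and `identify_leaker` are hypothetical.

```python
import hashlib

def derive_challenge(pk_sp: bytes, tau_a: bytes, length: int, modulus: int) -> list[int]:
    """Derive f from (pk_SP, tau_A, counter); SHA-256 is an assumed stand-in for PF."""
    return [
        int.from_bytes(
            hashlib.sha256(pk_sp + b"|" + tau_a + b"|" + i.to_bytes(4, "big")).digest(),
            "big",
        ) % modulus
        for i in range(length)
    ]

def identify_leaker(leaked_f: list[int], tau_a: bytes, modulus: int,
                    known_sps: dict[str, bytes]) -> list[str]:
    """Return the names of SPs whose public key reproduces the leaked challenge."""
    return [
        name
        for name, pk in known_sps.items()
        if derive_challenge(pk, tau_a, len(leaked_f), modulus) == leaked_f
    ]

# Toy example: the leaked data carries the challenge that was bound to SP2.
known_sps = {"SP1": b"public-key-1", "SP2": b"public-key-2", "SP3": b"public-key-3"}
leaked_f = derive_challenge(known_sps["SP2"], b"tag-A", 6, 2**31 - 1)
suspects = identify_leaker(leaked_f, b"tag-A", 2**31 - 1, known_sps)
```

Since each SP receives data bound to a challenge derived from its own public key, recomputing the challenge for every candidate SP and comparing it with the leaked one points to the party that released the data.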
One drawback of the proposed scheme is that it does not prevent an SP from linking the anonymous identity of Alice to her real identity. Assume Alice shares a set of SNPs with a particular SP in a non-anonymous way. Then, if Alice shares another set of SNPs on a public database in an anonymous way, the SP can deanonymize Alice's identity, as it possesses the $R_A$ value of Alice from the previous transaction. We will further study this issue in future work.
Another drawback of the proposed scheme is that it does not provide a solution in the case of unconsented sharing of data between two malicious institutions. For example, assume Alice shares her genomic data with a malicious $SP_1$ with the consent $C_{A,SP_1}(t) = (1, 0, 0)$ (i.e., Alice does not want further sharing of her data, and hence the "do not share" bit is set in the consent). Then, if $SP_1$ publicly shares Alice's data or tries to share the data with a non-malicious SP, it will be detected. However, $SP_1$ can share Alice's data with another malicious $SP_2$ without being detected. To the best of our knowledge, there is no technical solution for this problem.

Performance
Note that genome sequencing is an operation that only needs to be done once, and the sharing of genomic data and genetic test results is a frequent operation that an individual or organization will perform in practice. Therefore, computational complexity will not be a major concern. Nevertheless, we believe the solutions are in fact quite efficient.
We also briefly remark on the performance of the proposed solutions. First, we recap the implementation results of the Boneh-Lynn-Shacham aggregate signature scheme due to Barreto et al. [18]. Suppose that the implementation is based on a supersingular curve. On a computer with a 1 GHz Pentium III CPU, signing takes 3.57 milliseconds, while verification takes 53 milliseconds. The aggregation algorithm Aggregate only incurs multiplications in the source group, and each multiplication takes less than 14 microseconds. Verifying an aggregate signature with $k$ individual signatures takes roughly $53 \cdot k$ milliseconds. Second, we remark on the homomorphic signature scheme. The most costly function of the homomorphic signature scheme is the Sign algorithm, whose main complexity comes from the SamplePre routine, which is essentially a sampling algorithm for a Gaussian distribution. According to the implementation of Lyubashevsky and Prest [19], on an Intel Core i5-3210M laptop with a 2.5 GHz CPU and 6 GB RAM, a Gaussian sampling takes about 115 milliseconds. We also note that the signing of the SNPs only needs to be done once by the CI. The Verify and Evaluate algorithms are much more efficient because they only incur linear operations and involve no exponentiations. On the same platform, the cost of these operations will be (at most) on the order of microseconds. This means that, from the perspective of the user (e.g., Alice), the solutions are extremely efficient. As future work, we will build a proof-of-concept prototype and obtain precise performance numbers. It also makes sense to integrate the proposed solutions into other privacy-preserving solutions, so that we achieve a wide range of security properties.
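The aggregate signature figures quoted above translate into a simple back-of-the-envelope estimate for a chain of $k$ delegated consents. The sketch below only restates the reported numbers; the assumption that aggregating $k$ signatures costs $k - 1$ multiplications, and the function name, are ours.

```python
SIGN_MS = 3.57       # reported time for one signature
VERIFY_MS = 53.0     # reported time for one verification
MULTIPLY_US = 14.0   # reported upper bound for one source-group multiplication

def chain_cost_ms(k: int) -> dict[str, float]:
    """Rough costs, in milliseconds, for a chain of k aggregated signatures."""
    return {
        # Assumed: combining k signatures takes k - 1 multiplications.
        "aggregate_ms": (k - 1) * MULTIPLY_US / 1000.0,
        # Reported: verification grows roughly linearly in k.
        "verify_ms": k * VERIFY_MS,
    }

costs_for_five = chain_cost_ms(5)
```

Under these figures, verification dominates the cost of a consent chain, while aggregation remains far below one millisecond even for chains with many links.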

CONCLUSIONS
In this work, we proposed two cryptographic schemes to share genomic data and genetic test results. The proposed schemes operate between a data owner and a service provider. Using the proposed schemes, on the one hand, a service provider can check the validity (or legitimacy) of the genomic data it receives from a data owner (individual). On the other hand, the individual, via a digital consent, can make sure that the service provider will not further share his data without his permission. The proposed schemes are based on homomorphic signatures and aggregate signatures, and these cryptographic primitives enable us to link the information about the legitimacy of the data to the consent and the identity of the individual. We also discussed the security and practicality of the proposed schemes. The proposed schemes can be easily adopted by existing works on privacy-preserving processing of genomic data.

Fig. 1. Proposed System Model

$S_i$: Signature on the anonymized SNP $i$, $S_i = Sign(sk^h_{CI}, \tau_A, M^s_i, i)$
$T_A$: Signature on the anonymization factor $R_A$, $T_A = Sign(sk^h_{CI}, \tau_A, R_A, |G| + 1)$
$D_A$: Signature on the identity ($ID_A$) and phenotype ($P_A$) of Alice, $D_A = Sign(sk^h_{CI}, \tau_A, (ID_A, P_A, 0, \cdots, 0), |G| + 2)$
$\sigma^*$: Combined signature on the $S_i$ values, $T_A$, and $D_A$ generated by Alice using the homomorphic properties of the Boneh-Freeman homomorphic signature scheme
$\sigma$: Signature generated by Alice on her consent ($M_c$) by using the Boneh-Lynn-Shacham aggregate signature scheme
$(w_1, \cdots, w_{|G|})$: Weights for the genetic test on Alice's SNPs
$m^*$: Result of the genetic test on Alice's SNPs

Fig. 2. Trust model between the certificate authority (CA), the user, the certified institution (CI), and the service provider (SP).

Fig. 3. Trust model between the certified institution (CI), the user, and the service provider (SP).

• Alice should not include $R_A \| ID_A \| P_A$ in Step 3), and should set $f_{|G|+1} := 0$, $f_{|G|+2} := 0$ in generating $f$.
• Alice should not transmit $R_A$, $ID_A$, $P_A$, and $Cert_{pk-id-A}$ to the SP in Step 5).
• Alice should not include $R_A \| ID_A \| P_A$ in Step 8), and should replace $ID_A$ with $\tau_A$ in the consent $M_c$.
• E. Ayday and A. Yilmaz are with the Computer Engineering Department, Bilkent University, Ankara, Turkey. Q. Tang is with Luxembourg Institute of Science and Technology, L-4362 Esch-sur-Alzette, Luxembourg. E-mail: erman@cs.bilkent.edu.tr, qiang.tang@list.lu
• Erman Ayday is supported by funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 707135 and by the Scientific and Technological Research Council of Turkey, TUBITAK, under Grant No. 115E766. Qiang Tang is supported by a junior CORE grant from the National Research Fund, Luxembourg.
$(sk^h_{CI}, pk^h_{CI})$: Public/private key pair of the CI for the Boneh-Freeman homomorphic signature scheme
$(sk^a_A, pk^a_A)$: Public/private key pair of Alice for the Boneh-Lynn-Shacham aggregate signature scheme
$(sk^a_{SP}, pk^a_{SP})$: Public/private key pair of the SP for the Boneh-Lynn-Shacham aggregate signature scheme
$G$: Set of SNPs for Alice
$ID_A$: Alice's real identity
$Cert_A$: Certificate associated to Alice's public key $pk^a_A$ (does not contain Alice's real identity)
$Cert_{pk-id-A}$: Certificate issued by the CA that links $ID_A$ to $pk^a_A$
$g_i$: The value of SNP $i$, $i \in G$ and $g_i \in \{0, 1, 2\}$
$ID_{SP}$

TABLE 1
Symbols and notations used in this work.