Assuring Certified Database Utility in Privacy-Preserving Database Fingerprinting
Description
The source code consists of three parts: dataset extraction and encoding, the UtiliClear implementation, and robustness and utility testing. The code for dataset extraction and encoding, as well as for robustness and utility testing, is written in Python, while the UtiliClear implementation is written in C++. The execution sequence and the input-output relationships among the code files are illustrated in the figure 'Code_Overview.jpg'.
The original dataset can be downloaded from the Amazon Review Data website. Executing all the processes, including dataset division, encoding, preprocessing, fingerprinting, and verification, requires significant runtime. Therefore, we directly provide one encoded subset (186 MB) along with its corresponding dictionary files for evaluating functionality and reproducibility. Moreover, running all the experiments in the paper takes several days, so testing can be done on a smaller dataset to estimate the resource consumption on a larger one, as the algorithm's resource consumption scales linearly with dataset size. In addition, a new testing environment may not match the one used in the paper's experiments exactly, so the measured consumption may vary, but the overall trend will remain consistent.
Dataset extraction and encoding
Dataset extraction
The code file “Remove_NULL.py” is used to remove invalid data from the dataset. Before dividing the dataset into random subsets, execute this code file, where 'input_folder' contains the files of the original dataset and 'output_folder' is the folder that stores the output files.
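If you want to prototype this step without the original script, the following minimal sketch removes records containing empty fields from every CSV file in a folder; the exact invalid-data criteria used in “Remove_NULL.py” may differ, and the paths are placeholders:

```python
import csv
import os

input_folder = "path/to/original_dataset"   # placeholder: folder with the original CSV files
output_folder = "path/to/cleaned_dataset"   # placeholder: folder for the cleaned CSV files
os.makedirs(output_folder, exist_ok=True)

for name in os.listdir(input_folder):
    if not name.endswith(".csv"):
        continue
    with open(os.path.join(input_folder, name), newline="", encoding="utf-8") as fin, \
         open(os.path.join(output_folder, name), "w", newline="", encoding="utf-8") as fout:
        reader, writer = csv.reader(fin), csv.writer(fout)
        for row in reader:
            # keep only rows in which every field is non-empty
            if row and all(field.strip() for field in row):
                writer.writerow(row)
```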
The code file “Divide_Dataset.py” is used to randomly extract two subsets (8 GB and 16 GB) from the original dataset. Before running the code, the dataset input folder ('datadir') and the output paths ('db1_file', 'db2_file', and 'db3_file') should be adjusted according to the specific dataset storage directory (make sure that the input folder contains only the CSV files of the dataset). A simplified extraction sketch is given after the table below.
Path | Content |
'datadir' | The folder storing the csv files from the original dataset. |
'db1_file' | One file to store the DB1 (8GB). |
'db2_file' | One file to store the DB2 (16GB). |
'db3_file' | One file to store the DB3 (34GB). |
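For reference, here is a simplified sketch of the subset extraction (DB1 and DB2 only), assuming CSV input and byte-budget-based random sampling; the exact sampling strategy of “Divide_Dataset.py” may differ, and all paths are placeholders:

```python
import csv
import os
import random

datadir = "path/to/dataset_csvs"   # folder containing only the dataset CSV files
db1_file = "path/to/DB1.csv"       # ~8 GB subset
db2_file = "path/to/DB2.csv"       # ~16 GB subset

DB1_BYTES = 8 * 1024**3
DB2_BYTES = 16 * 1024**3

written1 = written2 = 0
with open(db1_file, "w", newline="", encoding="utf-8") as f1, \
     open(db2_file, "w", newline="", encoding="utf-8") as f2:
    w1, w2 = csv.writer(f1), csv.writer(f2)
    for name in os.listdir(datadir):
        if not name.endswith(".csv"):
            continue
        with open(os.path.join(datadir, name), newline="", encoding="utf-8") as fin:
            for row in csv.reader(fin):
                size = len(",".join(row)) + 1
                # randomly route each record to a subset until its byte budget is reached
                if random.random() < 0.5 and written1 < DB1_BYTES:
                    w1.writerow(row)
                    written1 += size
                elif written2 < DB2_BYTES:
                    w2.writerow(row)
                    written2 += size
```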
Dataset encoding
The code file “Encode_Dataset.py” is used to encode the attribute values of these datasets into bit-strings. Categorical attributes are encoded into word vectors, text sentences are encoded into one-hot vectors, and numeric attributes are encoded into corresponding binary bit-strings. Before running the code, the dataset input and output paths should be adjusted according to the specific dataset storage directory. After obtaining the encoded datasets, you should manually build the corresponding databases on your device and import the datasets into the MySQL-5.1.51-Win32 database (either the Navicat tool or the command-line interface can be used to import the encoded datasets into MySQL). The code file “Encode_Dataset.py” additionally outputs dictionary files for the columns containing words. The dictionary files should be sent to the recipient so that the binary bit-strings can later be decoded back into words and sentences. A small encoding sketch is given after the table below.
Path | Content |
'input_file' | The file storing the original dataset (DB1, DB2, or DB3). |
'output_file' | The file storing the encoded dataset. |
'dict_2_file' | The dictionary file for the second column. |
'dict_4_file' | The dictionary file for the fourth column. |
'dict_5_file' | The dictionary file for the fifth column. |
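The sketch below illustrates the general encoding idea for numeric and word-valued attributes; the bit-widths and the dictionary construction are assumptions for illustration, not the exact logic of “Encode_Dataset.py”:

```python
def encode_numeric(value, width=32):
    """Encode a non-negative integer attribute as a fixed-width binary bit-string."""
    return format(int(value), f"0{width}b")

def encode_word(word, dictionary, width=24):
    """Assign the word a dictionary index and encode the index as a bit-string."""
    if word not in dictionary:
        dictionary[word] = len(dictionary)
    return format(dictionary[word], f"0{width}b")

# example usage
dictionary = {}
print(encode_numeric(5, width=8))            # '00000101'
print(encode_word("excellent", dictionary))  # index 0 -> '000...0'
# the dictionary is written out (e.g., to 'dict_2_file') so the recipient can decode later
```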
UtiliClear implementation
The UtiliClear implementation consists of five parts, i.e., setup, preprocessing, fingerprint embedding, verification, and fingerprint extraction. Before running the programs, the OpenSSL-3.0.4-Win32, PBC-0.5.14-Win32, and GMP-Win32 libraries should be installed and included when building the programs.
Setup
During the setup process, the recipient and the database owner interact to exchange public parameters, including the public parameters of the database owner, the database information (e.g., the number of columns), the number of insignificant bits, the number of groups, and the maximum number of modified bits. The code files “Setup_DO.cpp” and “Setup_Recipient.cpp” serve as the setup programs for the database owner and the recipient, respectively (the 'Parameter_path' in “Setup_Recipient.cpp” is used to store the parameters on the recipient side).
Note that before running the code, the database owner side should manually build three tables in the MySQL database to store the parameters of the database owner, the dataset information, and the parameters specified by the recipient.
The do_parameters table:
p | q | N | g | h |
The recipient_parameters table:
Name | Num_InsignBits | Num_Group | Fingerprint |
The db_info table: (Attr6 is additionally added as the primary key to uniquely index a record; it cannot be modified)
DB | Row_Num | Col_Num | Attr1 | Attr2 | Attr3 | Attr4 | Attr5 | Attr6 |
You can also modify the table headers and styles, but you must update the corresponding lines in the code to adjust the database operation-related logic accordingly. If the table settings above are followed, no changes to the database operations in the code are required.
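If you follow the table layout above, the three tables can be created with a short helper script such as the following sketch (it uses the mysql-connector-python package; the database name, credentials, and column types are placeholders to adapt to your environment):

```python
import mysql.connector   # mysql-connector-python

# placeholder connection settings; adjust to your MySQL instance
conn = mysql.connector.connect(host="localhost", user="root",
                               password="your_password", database="utiliclear")
cur = conn.cursor()

cur.execute("""CREATE TABLE IF NOT EXISTS do_parameters (
    p TEXT, q TEXT, N TEXT, g TEXT, h TEXT)""")

cur.execute("""CREATE TABLE IF NOT EXISTS recipient_parameters (
    Name VARCHAR(64), Num_InsignBits INT, Num_Group INT, Fingerprint TEXT)""")

cur.execute("""CREATE TABLE IF NOT EXISTS db_info (
    DB VARCHAR(64), Row_Num BIGINT, Col_Num INT,
    Attr1 VARCHAR(64), Attr2 VARCHAR(64), Attr3 VARCHAR(64),
    Attr4 VARCHAR(64), Attr5 VARCHAR(64), Attr6 VARCHAR(64))""")

conn.commit()
cur.close()
conn.close()
```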
Preprocessing
During preprocessing, the database owner first groups the database based on the number of groups and the number of insignificant bits specified by the recipient.
Database grouping
The code file “DB_Grouping_DO.cpp” is used to execute database grouping. It takes the encoded database file ('inputFile') as the input and outputs the grouping results of each column. Each column corresponds to two output files: one for the grouping of significant bits ('outputDir1') and the other for the grouping of insignificant bits ('outputDir2'). A conceptual grouping sketch is given after the table below.
Path | Content |
'inputFile' | The encoded dataset. |
'outputDir1' | The group results of significant bits (each column corresponds to one file). |
'outputDir2' | The group results of insignificant bits (each column corresponds to one file). |
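Conceptually, grouping splits each encoded bit-string into its significant and insignificant parts and assigns every record to a group. The sketch below assumes the last k bits are the insignificant bits and that group assignment uses a keyed hash of the primary key; the actual rule in “DB_Grouping_DO.cpp” may differ:

```python
import hashlib

def group_column(rows, key, num_groups, num_insign_bits):
    """Split each bit-string into significant/insignificant parts and assign it to a group.

    rows: list of (primary_key, bit_string) pairs for one column.
    Returns two dicts mapping group index -> list of bit-strings.
    """
    sign_groups = {g: [] for g in range(num_groups)}
    insign_groups = {g: [] for g in range(num_groups)}
    for pk, bits in rows:
        digest = hashlib.sha256((key + str(pk)).encode()).hexdigest()
        g = int(digest, 16) % num_groups                  # assumed keyed-hash assignment
        sign_groups[g].append(bits[:-num_insign_bits])    # significant (leading) bits
        insign_groups[g].append(bits[-num_insign_bits:])  # insignificant (trailing) bits
    return sign_groups, insign_groups

# example: 3 groups, 2 insignificant bits per value
sg, ig = group_column([(1, "10110101"), (2, "01100110")], "secret_key", 3, 2)
```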
Commitment for significant bit group
The code file “Com_SignBits_DO.cpp” is used to compute commitments for the significant bit groups. Before executing the code, make sure that the group files for significant bits are stored in a single folder and that no other files are stored in this folder. Then modify 'inputDir' in the code file to point to the folder you have selected, and modify 'outputDir' to point to the folder that will store the output commitment files. A commitment sketch is given after the table below.
Folder | Content |
'inputDir' | The group results of significant bits (each column corresponds to one file). |
'outputDir' | The commitment results of significant bits (each column corresponds to one file). |
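Given the do_parameters fields (p, q, g, h), the commitments are presumably of a Pedersen-like form; the sketch below is written under that assumption and is not the exact construction used in “Com_SignBits_DO.cpp”:

```python
import secrets

def pedersen_commit(message_int, p, q, g, h):
    """Pedersen-style commitment C = g^m * h^r mod p with random r in Z_q."""
    r = secrets.randbelow(q)
    c = (pow(g, message_int % q, p) * pow(h, r, p)) % p
    return c, r   # r must be kept so the commitment can be opened and verified later
```

Verification then recomputes g^m * h^r mod p from the opened values and compares the result with the stored commitment, which is the role played by “Com_Verify_Recipient.cpp” in the verification phase.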
Locking for insignificant bit group
Generate Cx: The code file “Lock1_Gen_CX_DO.cpp” enables the DO to compute the GM ciphertexts for the insignificant bits. Before executing the program, the grouped insignificant bit results (each column corresponds to one file) should be contained in a single folder 'inputDir', and no other files should be stored in this folder. You should also modify the folder 'outputDir' to store the output GM ciphertext files (each column corresponds to one file). An encryption sketch is given after the table below.
Folder | Content |
'inputDir' | The group results of insignificant bits (each column corresponds to one file). |
'outputDir' | The Cx results of insignificant bits (each column corresponds to one file). |
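Generating Cx presumably corresponds to Goldwasser-Micali (GM) encryption of each insignificant bit under the modulus N; the sketch below is written under that assumption, with y_nr denoting a public quadratic non-residue modulo N (a placeholder name):

```python
import secrets

def gm_encrypt_bit(bit, N, y_nr):
    """Goldwasser-Micali encryption of one bit: c = y_nr^bit * r^2 mod N."""
    r = secrets.randbelow(N - 2) + 1   # random r in [1, N-2]; gcd(r, N) != 1 is negligible
    return (pow(y_nr, bit, N) * pow(r, 2, N)) % N

def gm_encrypt_bits(bit_string, N, y_nr):
    """Encrypt the insignificant bits of one value bit by bit (this is Cx)."""
    return [gm_encrypt_bit(int(b), N, y_nr) for b in bit_string]
```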
Generate Cy: The code file “Lock2_Gen_CY_Recipient.cpp” enables the recipient to embed perturbations into the GM ciphertexts of the insignificant bits. Before executing the program, the GM ciphertext files (each column corresponds to one file) should be contained in a single folder 'inputDir', and no other files should be stored in this folder. You should also modify 'output1' and 'output2' to store the perturbed GM ciphertexts and the used perturbations (each column corresponds to one file), respectively. A perturbation sketch is given after the table below.
Path | Content |
'inputDir' | The folder storing Cx (each column corresponds to one file). |
'Para_path' | The file storing the parameters of the recipient. |
'output1' | The folder to store the perturbed ciphertexts Cy (each column corresponds to one file). |
'output2' | The folder to store the perturbations (each column corresponds to one file). |
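GM ciphertexts are XOR-homomorphic: multiplying two ciphertexts modulo N yields an encryption of the XOR of the underlying bits. Under that assumption, the recipient's perturbation step can be sketched as follows (y_nr is again the assumed public non-residue):

```python
import secrets

def perturb_ciphertexts(cx_list, perturbation_bits, N, y_nr):
    """Cy[i] = Cx[i] * Enc(b[i]) mod N, i.e., an encryption of x[i] XOR b[i]."""
    cy = []
    for cx, b in zip(cx_list, perturbation_bits):
        r = secrets.randbelow(N - 2) + 1
        enc_b = (pow(y_nr, int(b), N) * pow(r, 2, N)) % N   # GM encryption of the perturbation bit
        cy.append((cx * enc_b) % N)
    return cy
```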
Generate y: The code file “Lock3_Gen_Y_DO.cpp” enables the DO to recover y from the perturbed ciphertexts received from the recipient. Before executing the program, the perturbed ciphertext files (each column corresponds to one file) should be stored in a single folder, and no other files should be stored in this folder. Then modify 'inputDir' in the code file to point to the folder you have selected, and modify 'outputDir' to point to the folder that will store the decrypted results y (each column corresponds to one file). A decryption sketch is given after the table below.
Folder | Content |
'inputDir' | The perturbed ciphertexts Cy (each column corresponds to one file). |
'outputDir' | The decrypted results y (each column corresponds to one file). |
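Recovering y then corresponds to GM decryption: a ciphertext decrypts to 0 exactly when it is a quadratic residue modulo N, which the DO can test with the private factors of N (denoted p and q below; these are the GM private key, not necessarily the p and q listed in the do_parameters table):

```python
def gm_decrypt_bit(c, p, q):
    """Decrypt a GM ciphertext: the bit is 0 iff c is a quadratic residue mod N = p*q."""
    qr_mod_p = pow(c, (p - 1) // 2, p) == 1   # Euler's criterion mod p
    qr_mod_q = pow(c, (q - 1) // 2, q) == 1   # Euler's criterion mod q
    return 0 if (qr_mod_p and qr_mod_q) else 1

def gm_decrypt_bits(cy_list, p, q):
    """Recover the bit-string y from a list of (perturbed) ciphertexts."""
    return "".join(str(gm_decrypt_bit(c, p, q)) for c in cy_list)
```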
Fingerprint embedding and extraction
Fingerprint embedding:
The DO executes the “Fingerprinting_DO.cpp” code file, while the recipient executes the “Fingerprinting_Recipient.cpp” code file. The recipient sends its specific fingerprint to the DO, and the DO receives the fingerprint and embeds it into the database. A generic embedding sketch is given after the table below.
Path | Content |
'DB_path' | The original encoded dataset. |
'output_path' | The fingerprinted dataset. |
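Fingerprint embedding in database fingerprinting schemes typically flips pseudorandomly selected insignificant bits according to the fingerprint. The sketch below shows only this generic idea; it is not the exact logic of “Fingerprinting_DO.cpp”, which additionally has to respect the committed and locked values produced in the previous steps:

```python
import hashlib

def embed_fingerprint(records, fingerprint, secret_key, num_insign_bits):
    """Embed fingerprint bits into pseudorandomly selected insignificant bits.

    records: list of (primary_key, bit_string); fingerprint: string of '0'/'1'.
    """
    marked = []
    for pk, bits in records:
        digest = int(hashlib.sha256(f"{secret_key}|{pk}".encode()).hexdigest(), 16)
        if digest % 2 == 0:                                   # select roughly half the records
            fp_bit = fingerprint[digest % len(fingerprint)]   # which fingerprint bit to embed
            pos = len(bits) - 1 - (digest % num_insign_bits)  # which insignificant bit to overwrite
            bits = bits[:pos] + fp_bit + bits[pos + 1:]
        marked.append((pk, bits))
    return marked
```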
Fingerprint extraction:
The code file “Fingerprint_Extraction_DO.cpp” is executed by the DO to extract the fingerprint from the fingerprinted database.
Path | Content |
'DB_path' | The original encoded dataset. |
'FDB_path' | The fingerprinted dataset. |
Verification
Fingerprinted database group:
The code file “FDB_Grouping_Recipient.cpp” is used to group the fingerprinted database following the same rule as the “DB_Grouping_DO.cpp” code file. It takes the fingerprinted database file ('inputFile') as input and outputs the grouping results of each column. Each column corresponds to two output files: one for the grouping of significant bits ('outputDir1') and the other for the grouping of insignificant bits ('outputDir2').
Path | Content |
'inputFile' | The fingerprinted dataset. |
'outputDir1' | The group results of significant bits (each column corresponds to one file). |
'outputDir2' | The group results of insignificant bits (each column corresponds to one file). |
Commitment verification for significant bit group:
The code file “Com_Verify_Recipient.cpp” is used to verify the commitments for the significant bit groups. Before executing the code, make sure that the grouping results of significant bits for all columns are stored in a single folder 'inputDir1', and no other files are stored in this folder. Also make sure that the commitment files for the distinct columns are stored in another folder 'inputDir2', and no other files are stored in that folder.
Path | Content |
'inputDir1' | The group results of significant bits (each column corresponds to one file). |
'inputDir2' | The commitment results of significant bits (each column corresponds to one file). |
'Par_path' | The file storing the parameters of the recipient. |
Verification for insignificant bit group:
The code file “InsignificantBits_Verify1_Recipient.cpp” is used by the recipient to embed the insignificant bits and the random bit-strings used during the locking process into randomly selected codewords. It takes two folders, i.e., 'inputDir1' and 'inputDir2', as inputs, and outputs files to two folders, i.e., 'outputDir1' and 'outputDir2' (each column corresponds to one file in every folder).
Folder | Content |
'inputDir1' | Grouped insignificant bits of the fingerprinted database. |
'inputDir2' | The random bit-strings selected during the locking process. |
'outputDir1' | The results z in the paper. |
'outputDir2' | The original messages (m) of the selected codewords. |
The code file “InsignificantBits_Verify2_DO.cpp” is executed by the DO to extract the codewords cw selected by the recipient and decode these codewords to obtain the messages (m'). It takes two folders, i.e., 'inputDir1' and 'inputDir2', as inputs, and outputs files to one folder, i.e., 'outputDir' (each column corresponds to one file in every folder).
Folder | Content |
'inputDir1' | The results (z) received from the recipient. |
'inputDir2' | The locked bit-strings (y) during the locking process. |
'outputDir' | The decoded messages (m'). |
The code file “InsignificantBits_Verify3_Recipient.cpp” is executed by the recipient to verify whether its original messages (m) are equal to the decoded messages (m') received from the DO. It takes two folders, i.e., 'inputDir1' and 'inputDir2', as inputs (each column corresponds to one file in every folder), and outputs the verification result (success or failure). A comparison sketch is given after the table below.
Folder | Content |
'inputDir1' | The original messages (m) selected by the recipient. |
'inputDir2' | The decoded messages (m') received from the DO. |
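This final step is essentially a file-by-file comparison of m and m'; a minimal sketch, assuming one plain-text file per column in each folder and placeholder paths:

```python
import os

inputDir1 = "path/to/original_messages_m"        # m, selected by the recipient
inputDir2 = "path/to/decoded_messages_m_prime"   # m', received from the DO

success = True
for name in sorted(os.listdir(inputDir1)):
    with open(os.path.join(inputDir1, name), encoding="utf-8") as f1, \
         open(os.path.join(inputDir2, name), encoding="utf-8") as f2:
        if f1.read().split() != f2.read().split():
            success = False
            break
print("Verification", "succeeded" if success else "failed")
```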
Robustness and utility testing
Robustness
Fingerprint removal attack:
The code file “Fingerprint_Remove.py” is executed by the recipient to remove the fingerprint from the dataset. It takes the fingerprinted database as input and outputs three files corresponding to the results of the flipping attack, the subset attack, and the superset attack, respectively.
Path | Content |
'input_file' | The fingerprinted database. |
'output_file1' | The result database of flipping attack. |
'output_file2' | The result database of subset attack. |
'output_file3' | The result database of superset attack. |
The parameters 'modifiable_bits_percentage', 'deletion_percentage', and 'superset_percentage' control the ratios of flipped bits, deleted records, and added records, respectively.
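A compact sketch of the three attacks, assuming the fingerprinted database is a CSV of bit-strings; the parameter names mirror those of “Fingerprint_Remove.py”, while the flipping rule and the way new records are generated for the superset attack are simplifying assumptions:

```python
import csv
import random

modifiable_bits_percentage = 0.05   # fraction of bits to flip (flipping attack)
deletion_percentage = 0.10          # fraction of records to delete (subset attack)
superset_percentage = 0.10          # fraction of records to add (superset attack)

with open("fingerprinted_db.csv", newline="", encoding="utf-8") as f:   # placeholder path
    rows = list(csv.reader(f))

def flip_bits(row):
    out = []
    for field in row:
        bits = list(field)
        for i, b in enumerate(bits):
            if b in "01" and random.random() < modifiable_bits_percentage:
                bits[i] = "1" if b == "0" else "0"
        out.append("".join(bits))
    return out

flipping = [flip_bits(r) for r in rows]                                      # flipping attack
subset = random.sample(rows, int(len(rows) * (1 - deletion_percentage)))     # subset attack
superset = rows + random.sample(rows, int(len(rows) * superset_percentage))  # superset attack
# each result would then be written to 'output_file1', 'output_file2', and 'output_file3'
```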
Robustness checking:
By using the code file “Fingerprint_Extraction_DO.cpp” and modifying the parameter 'FDB_path' to point to the databases produced by the distinct attacks, we can obtain the ratio of matched fingerprint bits.
Utility
Database Query:
Before checking the query accuracy, convert the binary fingerprinted database back to the format of the original dataset (e.g., integers, words, and sentences) by executing the code file “Decode_Dataset.py”. The input and output of this code are presented as follows, and a small decoding sketch is given after the table.
Path | Content |
'input_file' | The fingerprinted binary dataset. |
'restored_output_file' | The decoded dataset. |
'dict_2_file' | The dictionary file for the second column. |
'dict_4_file' | The dictionary file for the fourth column. |
'dict_5_file' | The dictionary file for the fifth column. |
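A minimal decoding sketch mirroring the encoding sketch given earlier (the bit-widths and dictionary format are again assumptions, not the exact logic of “Decode_Dataset.py”):

```python
def decode_numeric(bits):
    """Decode a fixed-width binary bit-string back to an integer."""
    return int(bits, 2)

def decode_word(bits, inverse_dictionary):
    """Decode a bit-string back to a word via the dictionary shipped with the dataset."""
    return inverse_dictionary[int(bits, 2)]

# toy example; in practice the dictionary is loaded from 'dict_2_file', 'dict_4_file', or 'dict_5_file'
dictionary = {"excellent": 0, "poor": 1}
inverse_dictionary = {v: k for k, v in dictionary.items()}
print(decode_numeric("00000101"))                                    # 5
print(decode_word("000000000000000000000001", inverse_dictionary))   # 'poor'
```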
Manually import the decoded fingerprinted dataset and the original dataset into the MySQL database, and then test the SQL query results.
Classifier training:
The code file “Classifier_Training.py” is used to train and evaluate the classifiers (i.e., KNN, LR, and SVM), where 'file_path' points to the encoded dataset.
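A sketch of this evaluation using scikit-learn; it assumes the encoded features have already been converted into numeric columns and that the first column serves as the label, both of which are assumptions about “Classifier_Training.py”:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

file_path = "path/to/encoded_dataset.csv"   # placeholder
data = pd.read_csv(file_path)

X = data.iloc[:, 1:]   # assumed: features are every column except the first
y = data.iloc[:, 0]    # assumed: the first column is the label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for name, clf in [("KNN", KNeighborsClassifier()),
                  ("LR", LogisticRegression(max_iter=1000)),
                  ("SVM", LinearSVC())]:
    clf.fit(X_train, y_train)
    print(name, "accuracy:", clf.score(X_test, y_test))
```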
You can manually execute the UtiliClear code files according to the roadmap above, or use the following execution commands. Before running the commands, please modify the source code files according to the aforementioned roadmap, ensuring that parameters, paths, and folders are correctly configured. Then compile all the code files, generate executable (.exe) files, and store them together with the required dynamic link libraries (DLLs) in one folder. Execute the following commands in sequence, one after another.
- cd "Your folder storing all compiled exe files and corresponding dll files"
- cmd /c start Setup_DO.exe && start Setup_Recipient.exe
- cmd /c start DB_Grouping_DO.exe
- cmd /c start Com_SignBits_DO.exe
- xcopy "DO's folder storing commitment for significant bits\*" "Recipient's folder storing commitment for significant bits\" /s /e
- cmd /c start Lock1_Gen_CX_DO.exe
- xcopy "DO's folder storing CX for insignificant bits\*" "Recipient's folder storing CX for insignificant bits\" /s /e
- cmd /c start Lock2_Gen_CY_Recipient.exe
- xcopy "Recipient's folder storing CY for insignificant bits\*" "DO's folder storing CY for insignificant bits\" /s /e
- cmd /c start Lock3_Gen_Y_DO.exe
- cmd /c start Fingerprinting_DO.exe && start Fingerprinting_Recipient.exe
- xcopy "DO's path storing Fingerprinted database" "Recipient's path storing Fingerprinted database"
- cmd /c start Fingerprint_Extraction_DO.exe
- cmd /c start FDB_Grouping_Recipient.exe
- cmd /c start InsignificantBits_Verify1_Recipient.exe
- xcopy "Recipient's folder storing Z for insignificant bits\*" "DO's folder storing Z for insignificant bits\" /s /e
- cmd /c start InsignificantBits_Verify2_DO.exe
- xcopy "DO's folder storing m' for insignificant bits\*" "Recipient's folder storing m' for insignificant bits\" /s /e
- cmd /c start InsignificantBits_Verify3_Recipient.exe
- cmd /c start Com_Verify_Recipient.exe
Files
Code_Overview.jpg
Additional details
Dates
- Accepted
- 2025-01-24
Software
- Repository URL
- https://github.com/MYSong6/UtiliClear
- Programming language
- C++, Python
- Development Status
- Active