Replication Package and Evaluation Results of Judge
Description
Judge
This document describes the structure of the Judge open-source repository and how to use it. Some parts of the code still retain the original project name "Navigator"; to avoid introducing new errors during renaming, we have kept these names for now and plan to replace them gradually.
The "Code" Folder
The Code folder contains the source code of our Judge tool, which is divided into three parts: the data and our merging algorithm, Judge-len, and Judge-set.
Data and merging algorithm
Data folder: We provide the three datasets used in our Evaluation section in our repository. These datasets were originally provided by NDStudy (https://github.com/NDStudyICSE2019/NDStudy) and contain labeled page-pair data together with the precomputed results of the threshold-based baseline approaches. We do not provide the raw HTML files here because the raw dataset is very large (36.9 GB); please see https://zenodo.org/records/3385377 for the raw data.
Structure merging folder: We provide our structure merging algorithm, together with the processed result in this folder. To apply our structure merging algorithm to process HTMLs, please first run structure_merging.py, which merges sub-trees with the same structure, and then run extract_tags.py, which removes non-standard HTML tags, text content, and HTML attributes. The file all_DS_merged_htmls_remove_duplicate_clean.txt is used for training our contrastive learning model.
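To illustrate the idea behind merging sub-trees with the same structure, the sketch below collapses repeated sibling subtrees on toy trees. This is a simplified illustration, not the actual algorithm in structure_merging.py, which operates on real HTML; the tuple-based tree representation is an assumption made for brevity.

```python
# Toy sketch of structure merging: drop sibling subtrees whose tag
# structure was already seen, keeping one representative. Trees are
# plain (tag, children) tuples here, unlike the real HTML input.

def signature(node):
    """Structural fingerprint of a subtree: tag name plus child signatures."""
    tag, children = node
    return (tag, tuple(signature(c) for c in children))

def merge(node):
    """Recursively remove sibling subtrees with a duplicate structure."""
    tag, children = node
    seen, kept = set(), []
    for child in children:
        sig = signature(child)
        if sig not in seen:
            seen.add(sig)
            kept.append(merge(child))
    return (tag, kept)

# A <ul> with three structurally identical <li><a> items collapses to one.
tree = ("ul", [("li", [("a", [])]), ("li", [("a", [])]), ("li", [("a", [])])])
```

Merging such repeated list items is what shrinks large, repetitive pages before tokenization.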
Judge-len
Judge-len-model folder: We provide the source code for training our contrastive learning model here, together with our pre-trained model (in the model_result folder). To re-train the model, change the file_paths in the code, adjust config.py, and run train.py. The model uses LongFormer as its base encoder; we also provide the encoder files in our repository.
newLongFormerTokenizer and newLongFormerModel folders: These folders contain model files saved from Hugging Face. We add HTML tags to the tokenizer as dedicated tokens to shorten the tokenized input length.
Judge-len-classifier folder: This folder contains the source code for training an SVM classifier for the final classification. To save time, training follows two stages: first, load the trained encoder model to compute embeddings (run run_all_process.py, which calls load_and_embed.py); second, load the embeddings and train an SVM for classification (run SVM.py). We also provide our pre-trained SVM classifier in .pkl format.
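The two-stage flow can be sketched as follows. This is a minimal illustration: embed() is a toy stand-in for the trained encoder invoked by load_and_embed.py, and the learned distance threshold stands in for the SVM trained by SVM.py; all names and data shapes here are assumptions, not the repository's actual interfaces.

```python
# Sketch of the two-stage setup: stage 1 embeds every page once and
# caches the vectors; stage 2 trains a classifier on the cached vectors,
# so the (expensive) encoder never runs twice.
import math
import pickle

def embed(html):
    """Toy stand-in for the contrastive encoder."""
    return [html.count("<"), html.count("div"), len(html)]

def cache_embeddings(pages, path):
    """Stage 1: embed each page and pickle the vectors to disk."""
    with open(path, "wb") as f:
        pickle.dump({name: embed(html) for name, html in pages.items()}, f)

def train_pair_classifier(path, labeled_pairs):
    """Stage 2: load cached vectors and fit a simple distance threshold
    on labeled (page_a, page_b, is_duplicate) triples (SVM stand-in)."""
    with open(path, "rb") as f:
        vecs = pickle.load(f)
    dist = lambda a, b: math.dist(vecs[a], vecs[b])
    dup = [dist(a, b) for a, b, y in labeled_pairs if y == 1]
    dis = [dist(a, b) for a, b, y in labeled_pairs if y == 0]
    cut = (max(dup) + min(dis)) / 2  # midpoint between the two classes
    return lambda a, b: "Duplicate" if dist(a, b) <= cut else "Distinct"
```

Caching the embeddings is what makes re-running classifier experiments cheap: only stage 2 is repeated.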
Judge-set
This module mirrors Judge-len, changing only the input processing method and the base encoder (BERT instead of LongFormer).
The "RQs" Folder
The RQs folder contains the essential code to reproduce our evaluation results. We also provide our reproductions of the baseline approaches here.
RQ1-classification-result
Since the classification results of our approach can be obtained by running SVM.py in the provided classifier source-code folders, this folder contains our reproductions of GraphMAE, WebEmbed, FragGen, and the threshold-based baselines. All intermediate results (e.g., embeddings) and trained classifiers are also provided.
RQ2-merge-effectiveness
This folder provides the code to obtain the results presented in our RQ2, together with detailed classification results in the .pkl format.
RQ3-model-effectiveness
This folder contains the source code of the ablation baselines for our contrastive learning model. We also provide the code for computing effectiveness at different thresholds for the tag_num_count baseline.
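Sweeping thresholds for a count-based baseline such as tag_num_count can be illustrated with a small sketch. The pair distances, labels, and threshold grid below are made-up toy data, not results from the repository.

```python
# Toy threshold sweep: score each candidate threshold by F1 on labeled
# page pairs, where a pair is called "Duplicate" iff its distance
# (e.g., a tag-count difference) is at most the threshold.

def f1_at_threshold(pairs, threshold):
    """F1 score over (distance, gold) pairs; gold 1 = duplicate."""
    tp = fp = fn = 0
    for dist, gold in pairs:
        pred = 1 if dist <= threshold else 0
        if pred == 1 and gold == 1:
            tp += 1
        elif pred == 1 and gold == 0:
            fp += 1
        elif pred == 0 and gold == 1:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy (distance, label) pairs: small tag-count gaps tend to be duplicates.
pairs = [(0, 1), (2, 1), (3, 1), (15, 0), (25, 0), (40, 0)]
best = max(range(0, 50, 5), key=lambda t: f1_at_threshold(pairs, t))
```

The best threshold is simply the grid point maximizing F1 on the labeled data.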
RQ5-guide-AWGT
This folder contains five sub-folders, which are essential for reproducing the AWGT guidance result in our work.
The crawljax_master folder contains the source code of Crawljax, adapted from FragGen (https://zenodo.org/records/5981993). We modify Crawljax to add other page-pair classification approaches (e.g., WebEmbed and Judge) for state abstraction. We further modify Crawljax to add auto-login scripts and an auto-restart strategy, and to ban logout-related elements in each app.
The Judge-set-service and Judge-len-service folders contain our Flask services for Judge; Crawljax calls these services over the network. For ease of use, we provide our pretrained models and classifiers here again. Simply running service.py starts the Judge service.
The coverage_data folder contains our obtained coverage result, the scripts to process the Crawljax log, and the scripts to plot and print the final results.
The test_suite folder contains our test suite for testing our service and our structure merging algorithm. We further examine our contrastive learning model through single-step debugging, as deep-learning models are not well suited to unit testing.
Using, Re-training, and Replication
1. Environment Setup
Create a Conda virtual environment and install the Python (3.10) dependencies:
conda create -n Judge python=3.10
conda activate Judge
pip install -r requirements.txt
(Optional, required only if you need to reproduce our RQs) Follow the instructions of NDStudy (https://github.com/NDStudyICSE2019/NDStudy) to download raw data.
2. Deploying Judge
Take Judge-len as an example. If Judge-set is required, please refer to the Judge-set folder with the same structure as Judge-len.
- Navigate to "./RQs/RQ5-guide-AWGT/Judge-len-service"
- Run: python service.py

This deploys a Flask service of Judge: the route "/merge" applies our structure merging to a given HTML, and the route "/compare" classifies a pair of merged HTMLs as "Duplicate" or "Distinct".
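Once the service is running, a client call might look like the sketch below. The host, port, and JSON field names here are assumptions made for illustration; check service.py for the actual request contract.

```python
# Hedged sketch of calling the Judge Flask service with the standard
# library only. Host, port, and payload field names are assumed.
import json
from urllib import request

SERVICE = "http://localhost:5000"  # assumed default Flask host/port

def build_request(route, payload):
    """Build a JSON POST request for one of the Judge service routes."""
    data = json.dumps(payload).encode("utf-8")
    return request.Request(SERVICE + route, data=data,
                           headers={"Content-Type": "application/json"})

# Merge a single page, then compare two already-merged pages.
merge_req = build_request("/merge", {"html": "<html><body></body></html>"})
compare_req = build_request("/compare", {"html1": "...", "html2": "..."})
# request.urlopen(merge_req) would send the call once the service is up.
```

Crawljax talks to the service the same way: one HTTP round trip per merge or comparison.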
3. Re-training Judge
Note: change the directory paths in all files to your own directories.
- Prepare your datasets (HTMLs) and save them in a folder
- Navigate to "./Code/structure-merging"
- Run: python structure_merging.py (merges HTML subtrees)
- Run: python extract_tags.py (extracts tags from the merged HTMLs)
- Run: python process_data.py (processes the merged HTMLs into a single .txt file for model training)
- Navigate to "./Code/Judge-len-model"
- Run: python train.py (trains your model)
- Navigate to "./Code/Judge-len-classifier"
- Run: python run_all_process.py (generates the embeddings of all three NDStudy datasets, RS, TS, and SS, using your newly trained model)
- Run: python SVM.py (trains your classifier and reports the classification result)
4. Run Judge with Crawljax and Reproduce the RQ5 Results
- Navigate to "./RQs/RQ5-guide-AWGT/crawljax-master"
- Follow the instructions in its README.md for Crawljax setup
- Run: python run.py (runs the experiments)
Files

| Name | Size | md5 |
|---|---|---|
| Judge.zip | 6.5 GB | 3e1f2c12c933836e3855ddf3bdfe6e3b |