Replication Package and Evaluation Results of Judge
Description
Judge
This document describes the structure of the Judge open-source repository and how to use it. Some parts of the code still retain the original project name "Navigator"; to avoid introducing new errors during renaming, we have kept these names for now and plan to replace them gradually.
The "Code" Folder
The Code folder contains the source code of our Judge tool, which is divided into three parts: the data and our merging algorithm, Judge-len, and Judge-set.
Data and merging algorithm
Data folder: We provide the three datasets used in our Evaluation section in our repository. These datasets were originally provided by NDStudy (https://github.com/NDStudyICSE2019/NDStudy) and contain labeled page-pair data together with the precomputed results of the threshold-based baseline approaches. We do not provide the raw HTML files here because the raw dataset is very large (36.9 GB); please see https://zenodo.org/records/3385377 for the raw data.
Structure merging folder: We provide our structure merging algorithm, together with the processed result in this folder. To apply our structure merging algorithm to process HTMLs, please first run structure_merging.py, which merges sub-trees with the same structure, and then run extract_tags.py, which removes non-standard HTML tags, text content, and HTML attributes. The file all_DS_merged_htmls_remove_duplicate_clean.txt is used for training our contrastive learning model.
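To illustrate the idea behind merging sub-trees with the same structure, the sketch below collapses repeated sibling subtrees on toy trees. This is a simplified illustration, not the actual algorithm in structure_merging.py, which operates on real HTML; the tuple-based tree representation is an assumption made for brevity.

```python
# Toy sketch of structure merging: drop sibling subtrees whose tag
# structure was already seen, keeping one representative. Trees are
# plain (tag, children) tuples here, unlike the real HTML input.

def signature(node):
    """Structural fingerprint of a subtree: tag name plus child signatures."""
    tag, children = node
    return (tag, tuple(signature(c) for c in children))

def merge(node):
    """Recursively remove sibling subtrees with a duplicate structure."""
    tag, children = node
    seen, kept = set(), []
    for child in children:
        sig = signature(child)
        if sig not in seen:
            seen.add(sig)
            kept.append(merge(child))
    return (tag, kept)

# A <ul> with three structurally identical <li><a> items collapses to one.
tree = ("ul", [("li", [("a", [])]), ("li", [("a", [])]), ("li", [("a", [])])])
```

Merging such repeated list items is what shrinks large, repetitive pages before tokenization.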
Judge-len
Judge-len-model folder: We provide the source code for training our contrastive learning model here, together with our pre-trained model (in the model_result folder). To re-train the model, change the file_paths in the code, adjust config.py, and run train.py. The model uses LongFormer as its base encoder; we also provide the encoder files in our repository.
newLongFormerTokenizer and newLongFormerModel folders: These folders contain model files saved from Hugging Face. We add HTML tags to the tokenizer as dedicated tokens to shorten the tokenized input length.
Judge-len-classifier folder: This folder contains the source code for training an SVM classifier for the final classification. To save time, training follows two stages: first, load the trained encoder model to compute embeddings (run run_all_process.py, which calls load_and_embed.py); second, load the embeddings and train an SVM for classification (run SVM.py). We also provide our pre-trained SVM classifier in .pkl format.
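The two-stage flow can be sketched as follows. This is a minimal illustration: embed() is a toy stand-in for the trained encoder invoked by load_and_embed.py, and the learned distance threshold stands in for the SVM trained by SVM.py; all names and data shapes here are assumptions, not the repository's actual interfaces.

```python
# Sketch of the two-stage setup: stage 1 embeds every page once and
# caches the vectors; stage 2 trains a classifier on the cached vectors,
# so the (expensive) encoder never runs twice.
import math
import pickle

def embed(html):
    """Toy stand-in for the contrastive encoder."""
    return [html.count("<"), html.count("div"), len(html)]

def cache_embeddings(pages, path):
    """Stage 1: embed each page and pickle the vectors to disk."""
    with open(path, "wb") as f:
        pickle.dump({name: embed(html) for name, html in pages.items()}, f)

def train_pair_classifier(path, labeled_pairs):
    """Stage 2: load cached vectors and fit a simple distance threshold
    on labeled (page_a, page_b, is_duplicate) triples (SVM stand-in)."""
    with open(path, "rb") as f:
        vecs = pickle.load(f)
    dist = lambda a, b: math.dist(vecs[a], vecs[b])
    dup = [dist(a, b) for a, b, y in labeled_pairs if y == 1]
    dis = [dist(a, b) for a, b, y in labeled_pairs if y == 0]
    cut = (max(dup) + min(dis)) / 2  # midpoint between the two classes
    return lambda a, b: "Duplicate" if dist(a, b) <= cut else "Distinct"
```

Caching the embeddings is what makes re-running classifier experiments cheap: only stage 2 is repeated.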
Judge-set
This module mirrors Judge-len, changing only the input processing method and the base encoder (BERT instead of LongFormer).
The "RQs" Folder
The RQs folder contains the essential code to reproduce our evaluation results. We also provide our reproductions of the baseline approaches here.
RQ1-classification-result
Since the classification results of our approach can be obtained by running SVM.py in the provided classifier source-code folders, this folder contains our reproductions of GraphMAE, WebEmbed, FragGen, and the threshold-based baselines. All intermediate results (e.g., embeddings) and trained classifiers are also provided.
RQ2-merge-effectiveness
This folder provides the code to obtain the results presented in our RQ2, together with detailed classification results in the .pkl format.
RQ3-model-effectiveness
This folder contains the source code of the ablation baselines for our contrastive learning model. We also provide the code for computing effectiveness at different thresholds for the tag_num_count baseline.
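Sweeping thresholds for a count-based baseline such as tag_num_count can be illustrated with a small sketch. The pair distances, labels, and threshold grid below are made-up toy data, not results from the repository.

```python
# Toy threshold sweep: score each candidate threshold by F1 on labeled
# page pairs, where a pair is called "Duplicate" iff its distance
# (e.g., a tag-count difference) is at most the threshold.

def f1_at_threshold(pairs, threshold):
    """F1 score over (distance, gold) pairs; gold 1 = duplicate."""
    tp = fp = fn = 0
    for dist, gold in pairs:
        pred = 1 if dist <= threshold else 0
        if pred == 1 and gold == 1:
            tp += 1
        elif pred == 1 and gold == 0:
            fp += 1
        elif pred == 0 and gold == 1:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy (distance, label) pairs: small tag-count gaps tend to be duplicates.
pairs = [(0, 1), (2, 1), (3, 1), (15, 0), (25, 0), (40, 0)]
best = max(range(0, 50, 5), key=lambda t: f1_at_threshold(pairs, t))
```

The best threshold is simply the grid point maximizing F1 on the labeled data.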
RQ5-guide-AWGT
This folder contains five sub-folders, which are essential for reproducing the AWGT guidance result in our work.
The crawljax_master folder contains the source code of Crawljax, adapted from FragGen (https://zenodo.org/records/5981993). We modify Crawljax to add other page-pair classification approaches (e.g., WebEmbed and Judge) for state abstraction. We further modify Crawljax to add auto-login scripts and an auto-restart strategy, and to ban logout-related elements in each app.
The Judge-set-service and Judge-len-service folders contain our Flask services for Judge; Crawljax calls these services over the network. For ease of use, we provide our pretrained models and classifiers here again. Simply running service.py starts the Judge service.
The coverage_data folder contains our obtained coverage result, the scripts to process the Crawljax log, and the scripts to plot and print the final results.
The test_suite folder contains our test suite for testing our service and our structure merging algorithm. We further examine our contrastive learning model through single-step debugging, as deep-learning models are not well suited to unit testing.
Using, Re-training, and Replication
1. Environment Setup
Create a Conda virtual environment and install the Python (3.10) dependencies:
conda create -n Judge python=3.10
conda activate Judge
pip install -r requirements.txt
(Optional, required only if you need to reproduce our RQs) Follow the instructions of NDStudy (https://github.com/NDStudyICSE2019/NDStudy) to download raw data.
2. Deploying Judge
Take Judge-len as an example. If Judge-set is required, please refer to the Judge-set folder with the same structure as Judge-len.
- Navigate to "./RQs/RQ5-guide-AWGT/Judge-len-service"
- Run: python service.py

This deploys a Flask service of Judge: the route "/merge" applies our structure merging to a given HTML, and the route "/compare" classifies a pair of merged HTMLs as "Duplicate" or "Distinct".
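Once the service is running, a client call might look like the sketch below. The host, port, and JSON field names here are assumptions made for illustration; check service.py for the actual request contract.

```python
# Hedged sketch of calling the Judge Flask service with the standard
# library only. Host, port, and payload field names are assumed.
import json
from urllib import request

SERVICE = "http://localhost:5000"  # assumed default Flask host/port

def build_request(route, payload):
    """Build a JSON POST request for one of the Judge service routes."""
    data = json.dumps(payload).encode("utf-8")
    return request.Request(SERVICE + route, data=data,
                           headers={"Content-Type": "application/json"})

# Merge a single page, then compare two already-merged pages.
merge_req = build_request("/merge", {"html": "<html><body></body></html>"})
compare_req = build_request("/compare", {"html1": "...", "html2": "..."})
# request.urlopen(merge_req) would send the call once the service is up.
```

Crawljax talks to the service the same way: one HTTP round trip per merge or comparison.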
3. Re-training Judge
Note: change the directory paths in all files to your own directories.
- Prepare your datasets (HTMLs) and save them in a folder
- Navigate to "./Code/structure-merging"
- Run: python structure_merging.py (merges HTML subtrees)
- Run: python extract_tags.py (extracts tags from the merged HTMLs)
- Run: python process_data.py (processes the merged HTMLs into a single .txt file for model training)
- Navigate to "./Code/Judge-len-model"
- Run: python train.py (trains your model)
- Navigate to "./Code/Judge-len-classifier"
- Run: python run_all_process.py (generates the embeddings of all three NDStudy datasets, RS, TS, and SS, using your newly trained model)
- Run: python SVM.py (trains your classifier and reports the classification result)
4. Run Judge with Crawljax and Reproduce the RQ5 Results
- Navigate to "./RQs/RQ5-guide-AWGT/crawljax-master"
- Follow the instructions in its README.md for Crawljax setup
- Run: python run.py (runs the experiments)
Files

| Name | Size | md5 |
|---|---|---|
| Judge.zip | 6.5 GB | 3e1f2c12c933836e3855ddf3bdfe6e3b |