Published September 4, 2019 | Version v1
Dataset Open

Near Duplicate Study Crawls

Description

The dataset contains three parts:

GroundTruths contains the crawls used to create the subject set SS.db.

RQ3Crawls contains all 5-minute and 30-minute crawls for all subjects used for RQ3.

DS_Crawls contains the 1065 crawls that were used to create the dataset DS.db.

These are all the crawls used for the Near Duplicate Study. Each crawl was performed by Crawljax with the latest version of the Google Chrome browser. The configuration of a crawl consists of the State Abstraction Function (SAF) used, the threshold for the SAF, and the time allotted for the crawl. The name of each crawl folder contains all three configuration parameters; they can also be found in result.json and in index.html.
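As a minimal sketch of how the crawl folders could be enumerated after extracting the archive, the Python snippet below looks for every result.json under a root directory and records its top-level keys. The folder naming scheme and the schema of result.json are not documented here, so the code does not parse folder names or assume any particular field; the function names `find_crawls` and `summarize` are illustrative only and are not part of the dataset.

```python
import json
from pathlib import Path


def find_crawls(root):
    """Yield (crawl_folder, parsed result.json) for every crawl under root.

    Assumption: a crawl folder is any directory that directly contains
    a file named result.json.
    """
    for result_file in sorted(Path(root).rglob("result.json")):
        with result_file.open(encoding="utf-8") as fh:
            yield result_file.parent, json.load(fh)


def summarize(root):
    """Map each crawl folder (path relative to root) to the sorted
    top-level keys of its result.json. No specific keys are assumed."""
    summary = {}
    for folder, result in find_crawls(root):
        keys = sorted(result) if isinstance(result, dict) else []
        summary[folder.relative_to(root).as_posix()] = keys
    return summary
```

Calling `summarize("path/to/extracted/archive")` (a hypothetical path) would then show which fields each result.json actually exposes, which is a reasonable first step before looking up the SAF, threshold, and crawl time.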

Files (36.9 GB)

Size: 36.9 GB
Checksum: md5:1b8da0b6953620c64da799574e24ce38
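
The MD5 checksum listed for the archive can be used to check that a 36.9 GB download completed without corruption. The following is a minimal sketch using Python's standard `hashlib` module; it reads the file in chunks so the archive never has to fit in memory. The helper names `md5_of` and `matches_expected` are illustrative, and the archive file name is not given on this page, so any path passed in is an assumption.

```python
import hashlib

# Checksum as published on this record page.
EXPECTED_MD5 = "1b8da0b6953620c64da799574e24ce38"


def md5_of(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at path, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest equals the expected hex string."""
    return md5_of(path) == expected.lower()
```

For example, `matches_expected("downloaded_archive")` (a hypothetical file name) should return `True` for an intact download.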