WEBIS-WEBSEG-20 ALGORITHM SEGMENTATIONS
=======================================
Dataset of segmentations and evaluation results on the Webis-WebSeg-20 dataset.

- Check for up-to-date resources: https://webis.de/data.html#webis-webseg-20-algorithm-segmentations
- The employed Webis-WebSeg-20 dataset: https://webis.de/data.html#webis-webseg-20 / https://doi.org/10.5281/zenodo.3354902
- WARC files for the web pages in the Webis-WebSeg-20 dataset are contained in the webis-web-archive-17: https://webis.de/data.html#webis-web-archive-17 / https://doi.org/10.5281/zenodo.1002203


Contents
--------
All ZIP archives extract to a common directory hierarchy of `webis-webseg-20/` that blends with the structure of the Webis-WebSeg-20. For convenience, we split the resources into different parts.

- webis-webseg-20-algorithm-segmentations.zip
  Contains the segmentation of each algorithm. Uses the file format of the original Webis-WebSeg-20, see https://github.com/webis-de/cikm20-web-page-segmentation-revisited-evaluation-framework-and-dataset
- webis-webseg-20-algorithm-evaluations.zip
  Contains two files for each page:
  - evaluation.csv
    Lists the achieved BCubed Precision, Recall, and F1 for each combination of algorithm and atomic element type.
  - num-segments.csv
    Lists the number of segments for each algorithm.
- webis-webseg-20-meier-models.zip
  Contains the models trained for the approach of Meier et al., see https://github.com/webis-de/ecir21-an-empirical-comparison-of-web-page-segmentation-algorithms#meier-et-al
- webis-webseg-20-4096px.zip
  Contains the intermediate data created for the approach of Meier et al., see https://github.com/webis-de/ecir21-an-empirical-comparison-of-web-page-segmentation-algorithms#meier-et-al

In addition, the dataset contains the following files:

- evaluation.csv
  The averaged values from the evaluation.csv files in webis-webseg-20-algorithm-evaluations.zip, as used in the plots of the paper.
- num-segments.csv
  The averaged values from the num-segments.csv files in webis-webseg-20-algorithm-evaluations.zip, as used in the plots of the paper.
- webis-webseg-20-folds.txt
  Lists, for each cross-validation fold (0 through 9), the page/task IDs it contains (only used for the approach of Meier et al.).


Authors
-------
- Johannes Kiesel, johannes.kiesel@uni-weimar.de
- Lars Meyer, lars.meyer@uni-weimar.de
- Florian Kneist, fkneist@gmail.com
- Benno Stein, benno.stein@uni-weimar.de
- Martin Potthast, martin.potthast@uni-leipzig.de
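

Example
-------
The top-level evaluation.csv is described above as holding the averaged values of the per-page evaluation.csv files. As a minimal sketch of such an aggregation, assuming a plain mean over pages, the following Python groups rows by a set of key columns and averages the numeric columns. The column names (`algorithm`, `element`, `f1`), the algorithm names, and all values are made up for illustration and are not taken from the dataset; check the header of the actual CSV files before adapting it.

```python
import csv
import io
from collections import defaultdict


def average_csv_rows(csv_texts, key_columns, value_columns):
    """Average value_columns over all rows that share the same key_columns.

    csv_texts is an iterable of CSV file contents (strings with a header row).
    Returns {key_tuple: {value_column: mean}}.
    """
    sums = defaultdict(lambda: [0.0] * len(value_columns))
    counts = defaultdict(int)
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            key = tuple(row[column] for column in key_columns)
            for i, column in enumerate(value_columns):
                sums[key][i] += float(row[column])
            counts[key] += 1
    return {
        key: {
            column: total / counts[key]
            for column, total in zip(value_columns, totals)
        }
        for key, totals in sums.items()
    }


# Two tiny stand-ins for per-page files. Header names and numbers are
# hypothetical and only serve to make the sketch runnable.
page_a = "algorithm,element,f1\nalgo-a,pixels,0.5\nalgo-b,pixels,0.25\n"
page_b = "algorithm,element,f1\nalgo-a,pixels,0.75\nalgo-b,pixels,0.5\n"

averages = average_csv_rows(
    [page_a, page_b],
    key_columns=["algorithm", "element"],
    value_columns=["f1"],
)
for key, values in sorted(averages.items()):
    print(key, values)
```

To apply this to the real data, one would read the contents of each per-page evaluation.csv from the extracted `webis-webseg-20/` hierarchy and pass the actual key and value column names. Whether the published averages were computed exactly this way is an assumption.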