WEBIS-WEBSEG-20
===============
Dataset of crowdsourced annotations for web page segmentations.

- Check for up-to-date resources: https://webis.de/data.html#webis-webseg-20
- WARC files for the web pages in this dataset are contained in the webis-web-archive-17: https://webis.de/data.html#webis-web-archive-17


Contents
--------
All ZIP archives (with the exception of webis-webseg-20-000000) extract to a common directory hierarchy of `webis-webseg-20/`. For convenience, we split the resources into different parts.

- webis-webseg-20-000000.zip
  Contains all resources for the first page (also contained in the archives below) if you just want to have a look at one page without downloading all.
- webis-webseg-20-dom-and-nodes.zip
  Contains the HTML DOM and location of all nodes at the time the screenshots were taken:
  dom.html, nodes.csv, nodes-texts.csv
- webis-webseg-20-screenshots.zip
  Contains the screenshots of the web pages:
  screenshot.png
- webis-webseg-20-screenshots-edges.zip
  Contains the edges that were detected in the screenshots for fine and coarse parameter settings:
  screenshot-edges-fine.png, screenshot-edges-coarse.png
- webis-webseg-20-annotations.zip
  Contains for each web page the human annotations, before (annotations.json) and after they were fitted to DOM nodes (fitted-annotations.json) in the segmentation format detailed below:
  annotations.json, fitted-annotations.json
- webis-webseg-20-ground-truth.zip
  Contains the fused ground truth in the segmentation format detailed below:
  ground-truth.json


Segmentation Format
-------------------
Each segmentation file comes as a JSON file with the attributes

- id: The page ID
- height: Height of the web page screenshot in pixels
- width: Width of the web page screenshot in pixels
- segmentations: An object of segmentations

Each segmentation is described as a list of multipolygons, to allow for maximum flexibility.
Multipolygons are described according to the simple feature access standard [1] as a list of polygons, which are described as a list of rings (from the second ring onwards, rings define holes in the polygon), which are described as a list of coordinates. Each coordinate is a list of two entries: X and Y coordinate (in pixels). The coordinates give the corners of the ring, and the last coordinate has to be the same as the first.

[1] https://en.wikipedia.org/wiki/Simple_Features


Authors
-------
- Johannes Kiesel, johannes.kiesel@uni-weimar.de
- Florian Kneist, fkneist@gmail.com
- Lars Meyer, lars.meyer@uni-weimar.de
- Kristof Komlossy, kristof.komlossy@kritten.org
- Benno Stein, benno.stein@uni-weimar.de
- Martin Potthast, martin.potthast@uni-leipzig.de
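

Example: Reading a Segmentation
-------------------------------
As an illustration of the segmentation format described above, the following Python sketch walks the nesting (segmentation, multipolygon, polygon, ring, coordinate) and computes the pixel area of each multipolygon, subtracting holes. It is not part of the dataset tooling. The helper names (`ring_area`, `polygon_area`, `segment_areas`), the sample values, and the segmentation name `"example"` are assumptions made for illustration only.

```python
import json


def ring_area(ring):
    """Unsigned area of a closed ring, computed with the shoelace formula."""
    if ring[0] != ring[-1]:
        raise ValueError("ring is not closed: first and last coordinate differ")
    twice_area = 0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2


def polygon_area(polygon):
    """Area of the first (outer) ring minus the area of all later rings (holes)."""
    outer, *holes = polygon
    return ring_area(outer) - sum(ring_area(hole) for hole in holes)


def segment_areas(document):
    """Map each segmentation name to the list of areas of its multipolygons."""
    return {
        name: [
            sum(polygon_area(polygon) for polygon in multipolygon)
            for multipolygon in segmentation
        ]
        for name, segmentation in document["segmentations"].items()
    }


# Hypothetical document in the format described above (not taken from the dataset).
SAMPLE = json.loads("""
{
  "id": "000000",
  "height": 100,
  "width": 200,
  "segmentations": {
    "example": [
      [
        [
          [[0, 0], [100, 0], [100, 50], [0, 50], [0, 0]],
          [[10, 10], [20, 10], [20, 20], [10, 20], [10, 10]]
        ]
      ],
      [
        [
          [[0, 60], [200, 60], [200, 100], [0, 100], [0, 60]]
        ]
      ]
    ]
  }
}
""")

if __name__ == "__main__":
    print(segment_areas(SAMPLE))
```

To try this on the dataset, `SAMPLE` could be replaced by the result of `json.load` on a file such as ground-truth.json. The names of the entries inside `segmentations` depend on the file and are not assumed here.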