Visual-Aware Representation of Web Pages for Machine Learning Applications

Burget, Radek; Hranický, Radek

doi:10.5281/zenodo.18674233

Published February 17, 2026 | Version 1.0.0

Dataset Open

1. Brno University of Technology
2. CESNET
3. ICT Pro

This repository contains a sample dataset that demonstrates the use of web pages as a data source for visual-aware machine learning applications using the FitLayout framework.

The dataset captures the rendered pages of the fictional bookstore available at https://books.toscrape.com/. For each book page in the bookstore, the dataset contains two FitLayout artifacts:

A Page that directly describes the rendered page at the box level.
An AreaTree that provides abstraction over the rendered page in the form of a tree of visual areas, where significant areas (e.g., book title and price) are annotated with the corresponding tags.

The artifacts have been exported from the FitLayout RDF repository in the N-Quads format, which makes them easy to import into another repository.
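Since N-Quads stores one statement per line with an optional fourth term naming the graph, a quick way to get an overview of an export is to count statements per graph. The following is a minimal sketch using only the Python standard library. The term pattern is a simplified approximation of the N-Quads grammar, the `example.org` IRIs in `SAMPLE` are hypothetical, and the idea that each artifact sits in its own named graph is an assumption that the description above does not confirm. For real processing, a full RDF parser is the safer choice.

```python
import io
import re
import zipfile
from collections import Counter

# Simplified N-Quads term pattern: IRI, blank node, or literal with an
# optional datatype or language tag. An approximation for illustration,
# not a complete implementation of the N-Quads grammar. It assumes that
# terms are separated by whitespace.
TERM = re.compile(
    r'<[^>]*>'
    r'|_:\S+'
    r'|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?'
)


def graph_of(line):
    """Return the graph label of one statement, or None for the default graph."""
    terms = TERM.findall(line)
    return terms[3] if len(terms) == 4 else None


def count_statements_per_graph(lines):
    """Count statements per graph label, skipping blank lines and comments."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        counts[graph_of(line)] += 1
    return counts


def iter_zip_lines(zip_path):
    """Yield text lines from every file inside a ZIP archive."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            with archive.open(name) as member:
                yield from io.TextIOWrapper(member, encoding="utf-8")


# Hypothetical statements, made up for illustration only.
SAMPLE = [
    '<http://example.org/page1> <http://example.org/title> "A \\"quoted\\" title"@en <http://example.org/graph/page1> .',
    '<http://example.org/page1> <http://example.org/width> "1200"^^<http://www.w3.org/2001/XMLSchema#int> <http://example.org/graph/page1> .',
    '<http://example.org/area1> <http://example.org/partOf> <http://example.org/page1> <http://example.org/graph/tree1> .',
    '# a comment line',
    '<http://example.org/s> <http://example.org/p> <http://example.org/o> .',
]

if __name__ == "__main__":
    for graph, count in count_statements_per_graph(SAMPLE).items():
        print(graph, count)
```

To run the same count over the published archive, the lines could be streamed with `count_statements_per_graph(iter_zip_lines("books_artifacts.zip"))`, which avoids unpacking the archive to disk. The names of the files inside the archive are not listed here, so the helper simply reads every member.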

Contained files

book_urls.txt -- the source URLs of the rendered pages.
books_artifacts.zip -- the RDF graph describing all the artifacts, serialized in the N-Quads format.

Files (623.9 MB)

Name	MD5	Size
book_urls.txt	8fe47fd41bfca21c5bb19804f5a2d243	90.1 kB
books_artifacts.zip	ab8f1bb6e0ab4f8f580eb0dac797222c	623.9 MB
README.md	acf4aaf4765e85b32bb19fb17fb0dc51	1.2 kB
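The MD5 checksums listed for the files can be used to confirm that a download is complete, which is worth doing for the 623.9 MB archive. The sketch below uses Python's standard `hashlib` module and reads the file in chunks so that the archive does not have to fit in memory. The function names and the local file paths are illustrative assumptions, and the checksum values are copied from the file listing.

```python
import hashlib

# Checksums as published in the file listing of this record.
EXPECTED_MD5 = {
    "book_urls.txt": "8fe47fd41bfca21c5bb19804f5a2d243",
    "books_artifacts.zip": "ab8f1bb6e0ab4f8f580eb0dac797222c",
    "README.md": "acf4aaf4765e85b32bb19fb17fb0dc51",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, name):
    """Check a local file against the published checksum for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, `verify("books_artifacts.zip", "books_artifacts.zip")` returns `True` only if the local copy matches the published checksum, assuming the file was saved under that name in the current directory.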
