Published February 17, 2026 | Version 1.0.0
Dataset
Open
Visual-Aware Representation of Web Pages for Machine Learning Applications
Authors/Creators
- 1. Brno University of Technology
- 2. CESNET
- 3. ICT Pro
Description
This repository contains a sample dataset that demonstrates the use of web pages as a data source for visual-aware machine learning applications using the FitLayout framework.
The dataset captures the rendered pages of the fictional bookstore available at https://books.toscrape.com/. For each book page in the bookstore, the dataset contains two FitLayout artifacts:
- A Page that directly describes the rendered page at the box level.
- An AreaTree that provides abstraction over the rendered page in the form of a tree of visual areas, where significant areas (e.g., book title and price) are annotated with the corresponding tags.
The artifacts have been exported from the FitLayout RDF repository in the N-QUADS format, which makes it easy to import them into another repository.
Contained files
- book_urls.txt -- the source URLs of the rendered pages.
- books_artifacts.zip -- the RDF graph describing all the artifacts, serialized in N-QUADS format.
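Before importing the archive into an RDF repository, it can be useful to check how many statements each named graph holds. The following is a minimal sketch using only the Python standard library, not part of FitLayout. It assumes that the archive contains plain UTF-8 N-QUADS files with one statement per line; the internal file names are not documented here, so the sketch reads every member of the archive. The helper names (`parse_nquad`, `count_quads_per_graph`, `iter_zip_lines`) and the `example.org` IRIs in `SAMPLE` are hypothetical and do not reflect the actual FitLayout vocabulary. The regular expression covers the common term forms only; for real processing, a full RDF library is the safer choice.

```python
import io
import re
import zipfile
from collections import Counter

# One N-Quads term: an IRI, a blank node, or a literal with an optional
# datatype or language tag. Simplified; not a full N-Quads grammar.
TERM = re.compile(
    r'<[^>]*>'
    r'|_:[^\s]+'
    r'|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?'
)


def parse_nquad(line):
    """Return (subject, predicate, object, graph), or None for blank/comment lines.

    The graph is None when the statement belongs to the default graph.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    terms = TERM.findall(line)
    if len(terms) == 3:
        return terms[0], terms[1], terms[2], None
    if len(terms) == 4:
        return tuple(terms)
    raise ValueError(f"unexpected number of terms ({len(terms)}): {line!r}")


def count_quads_per_graph(lines):
    """Count statements per graph label over an iterable of N-Quads lines."""
    counts = Counter()
    for line in lines:
        quad = parse_nquad(line)
        if quad is not None:
            counts[quad[3]] += 1
    return counts


def iter_zip_lines(zip_path):
    """Yield text lines from every file in a zip archive (path or file object)."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            with archive.open(name) as raw:
                yield from io.TextIOWrapper(raw, encoding="utf-8")


# Hypothetical sample statements (made-up IRIs and values, for illustration only).
SAMPLE = [
    '<http://example.org/page1> <http://example.org/title> "Example title"@en '
    '<http://example.org/graph/page1> .',
    '<http://example.org/page1> <http://example.org/price> '
    '"9.99"^^<http://www.w3.org/2001/XMLSchema#decimal> '
    '<http://example.org/graph/page1> .',
    '_:b0 <http://example.org/note> "contains <angle> and \\"quotes\\"" '
    '<http://example.org/graph/tree1> .',
]

sample_counts = count_quads_per_graph(SAMPLE)
```

To apply the same count to the downloaded archive, call `count_quads_per_graph(iter_zip_lines("books_artifacts.zip"))` and inspect the most frequent graph labels with `.most_common()`.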