Published February 17, 2026 | Version 1.0.0
Dataset Open

Visual-Aware Representation of Web Pages for Machine Learning Applications

  • 1. Brno University of Technology
  • 2. CESNET
  • 3. ICT Pro

Description

This repository contains a sample dataset that demonstrates the use of web pages as a data source for visual-aware machine learning applications built with the FitLayout framework.

The dataset captures the rendered pages of the fictional bookstore available at https://books.toscrape.com/. For each book page in the bookstore, the dataset contains two FitLayout artifacts:

  • A Page that directly describes the rendered page at the box level.
  • An AreaTree that provides abstraction over the rendered page in the form of a tree of visual areas, where significant areas (e.g., book title and price) are annotated with the corresponding tags.

The artifacts were exported from the FitLayout RDF repository in the N-Quads format, which makes them easy to import into another repository.
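Because N-Quads is a line-based format (one statement per line, with an optional fourth term naming the graph), an export can be inspected without a full RDF store. The following is a minimal sketch using only the Python standard library; the regular expression is a simplified approximation of the N-Quads grammar, and the function names and the example.org sample statements are hypothetical, not taken from the dataset. For real imports, a conforming RDF parser is the safer choice.

```python
import re
from collections import Counter

# Simplified N-Quads term pattern (an approximation, not the full grammar):
# an IRI, a blank node, or a literal with an optional datatype or language tag.
TERM = re.compile(
    r'<[^>]*>'
    r'|_:[^\s.]+(?:\.[^\s.]+)*'
    r'|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?'
)

def parse_quad(line):
    """Return (subject, predicate, object, graph), or None for blank/comment lines.

    The graph is None when the statement has no graph label.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    terms = TERM.findall(line)
    if len(terms) == 3:
        return (terms[0], terms[1], terms[2], None)
    if len(terms) == 4:
        return tuple(terms)
    raise ValueError(f"unexpected number of terms: {line!r}")

def count_by_graph(lines):
    """Count statements per graph label."""
    counts = Counter()
    for line in lines:
        quad = parse_quad(line)
        if quad is not None:
            counts[quad[3]] += 1
    return counts

# Hypothetical sample statements for illustration only.
sample = [
    '<http://example.org/page1> <http://example.org/title> "A <b>Book</b>" <http://example.org/g1> .',
    '<http://example.org/page1> <http://example.org/price> "51.77"^^<http://www.w3.org/2001/XMLSchema#decimal> <http://example.org/g1> .',
    '_:b0 <http://example.org/p> <http://example.org/o> .',
]
print(count_by_graph(sample))
```

Grouping by graph label is only a convenient first look at the export; how the artifacts map onto named graphs is not stated here and should be checked against the data.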

Contained files

  • book_urls.txt -- the source URLs of the rendered pages.
  • books_artifacts.zip -- the RDF graph describing all the artifacts, serialized in the N-Quads format.

Files (623.9 MB)

  • md5:8fe47fd41bfca21c5bb19804f5a2d243 (90.1 kB)
  • md5:ab8f1bb6e0ab4f8f580eb0dac797222c (623.9 MB)
  • md5:acf4aaf4765e85b32bb19fb17fb0dc51 (1.2 kB)
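
The MD5 digests listed for the files can be used to check a download for corruption. The following is a minimal sketch using Python's hashlib; since this listing does not pair each digest with a file name, the hypothetical helper below only checks whether a local file's digest matches any of the three listed values.

```python
import hashlib

# Digests copied from the file listing; which file each one belongs to
# is not stated there, so they are kept as an unordered set.
LISTED_MD5 = {
    "8fe47fd41bfca21c5bb19804f5a2d243",
    "ab8f1bb6e0ab4f8f580eb0dac797222c",
    "acf4aaf4765e85b32bb19fb17fb0dc51",
}

def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_listing(path):
    """Return True if the file's digest equals one of the listed digests."""
    return md5_of(path) in LISTED_MD5
```

Reading in chunks keeps memory use flat even for the largest file in the listing.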