Published 2024 | Version 1.0.1
Dataset Open

PPORTAL: Public domain Portuguese-language literature Dataset

  • 1. Universidade Federal de Minas Gerais

Description

Combining human expertise with digital data on book consumers may provide what it takes to face the coming changes in such a critical market. Along with the publishing industry, researchers rely on book-related data to develop tools and applications, drawing constructive conclusions to make better-informed and faster decisions. Such solutions range from best-seller prediction models to natural language processing for classifying raw text. Besides requiring complex Artificial Intelligence (AI) methods, all of them are essentially data-dependent, relying mainly on book-related data.

Data, and more specifically data growth, is essential for developing and running such AI-powered applications. None of these efforts can succeed without a preliminary collection of data on literary works, readers, and their reading habits. Therefore, it is critically important to build and make available datasets that cover the essential elements of the book industry ecosystem. Although some efforts have been made for English-language books, little has been done for other lesser-spoken languages, such as Portuguese. Evaluating language-specific data is of fundamental importance for literature analysis, as Portuguese has its own literary peculiarities. Hence, we present PPORTAL, a Public domain PORTuguese-lAnguage Literature dataset. PPORTAL's contributions are summarized as follows:

  • Data integration of numerous public domain works from three digital libraries;
  • Enriched metadata for works, authors and online reviews extracted from Goodreads;
  • Feature engineering on the metadata to create meaningful additional features; and
  • Unrestricted access in two formats (SQL database and compressed .csv files).
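
As a rough illustration of working with the compressed .csv format mentioned above, the sketch below reads one CSV entry from a zip archive using only the Python standard library. The archive path and member name in the usage note are hypothetical, as are the helper names `list_csv_members` and `read_csv_member`; the actual file layout and encoding of the PPORTAL archives are not stated here, so UTF-8 is an assumption.

```python
import csv
import io
import zipfile


def list_csv_members(zip_path):
    """Return the names of all .csv entries inside a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return [name for name in archive.namelist() if name.lower().endswith(".csv")]


def read_csv_member(zip_path, member, encoding="utf-8"):
    """Read one CSV entry from a zip archive into a list of dict rows.

    The encoding defaults to UTF-8, which is an assumption about the data.
    """
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            # Wrap the binary stream so csv can consume decoded text lines.
            text = io.TextIOWrapper(raw, encoding=encoding, newline="")
            return list(csv.DictReader(text))
```

A call might look like `read_csv_member("downloaded_archive.zip", list_csv_members("downloaded_archive.zip")[0])`, where `downloaded_archive.zip` stands in for whichever archive was downloaded.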

Files

gender_representation.zip

Files (50.5 MB)

Size      Checksum
6.8 MB    md5:e8804e0d729418097a5a70ec82bdf124
35.5 MB   md5:12f7cc8940d33c9f5edc84580d866a22
8.0 MB    md5:a945c58fc7b13124b28a9eca19092c27
248.1 kB  md5:9c5b2851c1d986f465afcc4dea2e8c3f
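
The MD5 checksums listed above can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_checksum` and the file path in the usage note are hypothetical, and which checksum belongs to which file has to be taken from the record itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 with an expected value such as 'md5:e88...'."""
    expected_hex = expected.strip().lower()
    # Accept the 'md5:' prefix used in the listing as well as a bare digest.
    if expected_hex.startswith("md5:"):
        expected_hex = expected_hex[len("md5:"):]
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum("downloaded_file.zip", "md5:12f7cc8940d33c9f5edc84580d866a22")` would return `True` only if the hypothetical `downloaded_file.zip` is the 35.5 MB file and was transferred without corruption.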