Published September 2, 2019 | Version 1.0
Dataset Open

Automatic TEI encoding of manuscripts catalogues with GROBID-Dictionaries

  • 1. ENC
  • 2. UniNE
  • 3. Paris VII/INRIA
  • 4. INRIA

Description

Manuscript Sales Catalogues (MSC) are highly important for authenticating documents and studying the reception of authors. Because they have been published regularly throughout Europe since the beginning of the 19th century, there is growing interest in scaling up the means of automatically structuring their contents.

Following successful first encoding tests with GROBID-Dictionaries on a single MSC collection, this paper presents the results of more advanced tests of the system’s capacity to handle a larger corpus containing MSC from different dealers, and therefore multiple layouts. Four different types of catalogues, published between the middle of the 19th century and the beginning of the 20th century, were tested.

Files

Scaling_up_trainingData_TEI2019.zip (198.1 MB)
md5:52d28fec5905266e42fdc027c7fb74e7

Additional details

Related works

References

  • Mohamed Khemakhem, Laurent Romary, Simon Gabay, Hervé Bohbot, Francesca Frontini, et al. Automatically Encoding Encyclopedic-like Resources in TEI. The annual TEI Conference and Members Meeting, Sep 2018, Tokyo, Japan.
  • Mohamed Khemakhem, Luca Foppiano, Laurent Romary. Automatic Extraction of TEI Structures in Digitized Lexical Resources using Conditional Random Fields. Electronic lexicography, eLex 2017, Sep 2017, Leiden, Netherlands.
  • Mohamed Khemakhem, Axel Herold, Laurent Romary. Enhancing Usability for Automatically Structuring Digitised Dictionaries. GLOBALEX workshop at LREC 2018, May 2018, Miyazaki, Japan.