EB-KG: Knowledge Graph of the first 8 editions of the Encyclopaedia Britannica (1768-1860)
Description
This Knowledge Graph represents the information contained in the first eight editions of the Encyclopaedia Britannica (1768 to 1860) in RDF (Turtle, .ttl format).
The raw dataset is provided by the National Library of Scotland (NLS). It comprises eight editions in 195 volumes, with a total size of 44 GB. It uses two XML schemas: METS for descriptive, structural, technical and administrative metadata (title, author, publisher, etc.), and ALTO for encoding the OCR text of each page.
In this work, we extracted the information from the METS and ALTO XML files using the defoe tool and developed novel information extraction heuristics. With the extracted information, we created the EB-KG Knowledge Graph, which uses the EB Ontology to represent it. During the information extraction phase, we also employed several techniques to mitigate two common OCR errors: the long-S and line-break hyphenation.
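To make the two OCR issues concrete, the following is a minimal sketch of one possible way to handle them. It is not the heuristic used to build EB-KG: the function names, the tiny `KNOWN_WORDS` vocabulary, and the "swap `f` for `s` when the result is a known word" rule are all assumptions made for illustration.

```python
import re

# Hypothetical mini-vocabulary; a real pipeline would need a full word list.
KNOWN_WORDS = {"same", "first", "such", "father", "knowledge", "of", "the"}


def join_hyphenated_lines(text: str) -> str:
    """Rejoin words that were split with a hyphen at a line break."""
    # "know-\nledge" becomes "knowledge". Note that this also removes the
    # hyphen of genuine compounds that happen to break across lines.
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)


def fix_long_s(word: str, vocabulary: set[str]) -> str:
    """Undo long-S OCR errors in a single word (illustrative heuristic)."""
    # Case 1: the OCR kept the long-S glyph itself (U+017F).
    word = word.replace("\u017f", "s")
    if word.lower() in vocabulary or "f" not in word:
        return word
    # Case 2: the long-S was misread as "f". Try swapping one "f" at a time
    # and accept the first candidate found in the vocabulary.
    for i, ch in enumerate(word):
        if ch == "f":
            candidate = word[:i] + "s" + word[i + 1:]
            if candidate.lower() in vocabulary:
                return candidate
    # Last attempt: swap every "f"; otherwise leave the word unchanged.
    candidate = word.replace("f", "s")
    return candidate if candidate.lower() in vocabulary else word


if __name__ == "__main__":
    print(join_hyphenated_lines("know-\nledge of the"))
    print([fix_long_s(w, KNOWN_WORDS) for w in ["fame", "fuch", "father"]])
```

The vocabulary check is what keeps legitimate words such as "father" untouched; without it, a blanket `f` to `s` substitution would introduce new errors.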
The EB-KG contains 1,638,239 RDF triples covering 8 editions. Each Edition can have several Volumes, references to Books, and Supplements; it also has an Editor and a Publisher, each of which can be a Person or an Organization. A Volume has several Pages, and each Page can contain several Terms. A Term can be either a Topic (a term described across several pages, often combining text, pictures, and tables) or an Article (a description of the term in one or two paragraphs, similar to a dictionary entry). The data model of the EB-KG can be found here.
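The containment hierarchy described above can be pictured with a few plain Python classes. This is only an illustrative sketch: the class and field names are assumptions chosen to mirror the prose, not the identifiers defined in the EB Ontology, and Books, Supplements, and the Person/Organization distinction are left out for brevity. The sample values are placeholders, not data from the graph.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Term:
    name: str
    kind: str          # "Topic" or "Article"
    text: str = ""


@dataclass
class Page:
    number: int
    terms: List[Term] = field(default_factory=list)


@dataclass
class Volume:
    number: int
    pages: List[Page] = field(default_factory=list)


@dataclass
class Edition:
    number: int
    editor: str = ""       # simplified: a Person or an Organization in the prose
    publisher: str = ""    # simplified in the same way
    volumes: List[Volume] = field(default_factory=list)


def count_terms(edition: Edition) -> int:
    """Count all terms reachable from an edition via its volumes and pages."""
    return sum(len(page.terms) for vol in edition.volumes for page in vol.pages)


# Placeholder instance showing how the levels nest.
example = Edition(
    number=1,
    volumes=[
        Volume(number=1, pages=[
            Page(number=1, terms=[Term("PLACEHOLDER A", "Article")]),
            Page(number=2, terms=[Term("PLACEHOLDER B", "Topic"),
                                  Term("PLACEHOLDER C", "Article")]),
        ])
    ],
)

if __name__ == "__main__":
    print(count_terms(example))
```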
Because the original ALTO files do not indicate the start and end of each EB term, the first part of our work involved the automated extraction of all terms (along with their metadata) across editions, so that they can be analysed independently of the surrounding text.
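As a rough illustration of what term segmentation over page text involves, the sketch below splits a list of OCR lines into entries. It assumes that an entry begins on a line whose first token is a headword in capital letters followed by a comma; that rule, the `HEADWORD` pattern, and the `split_terms` function are assumptions for this example and should not be read as the extraction heuristics actually used for EB-KG.

```python
import re

# Assumed convention: a headword in capitals, then a comma and whitespace.
HEADWORD = re.compile(r"^([A-Z][A-Z' -]*),\s")


def split_terms(lines):
    """Group OCR lines into entries of the form {"term": ..., "text": ...}."""
    terms = []
    current = None
    for line in lines:
        match = HEADWORD.match(line)
        if match:
            # A new headword starts a new entry.
            current = {"term": match.group(1), "text": line[match.end():].strip()}
            terms.append(current)
        elif current is not None:
            # Continuation line: append it to the entry in progress.
            current["text"] += " " + line.strip()
        # Lines before the first headword are ignored in this sketch.
    return terms


if __name__ == "__main__":
    sample = [
        "ALPHA, first placeholder definition",
        "continued on a second line.",
        "BETA, second placeholder definition.",
    ]
    for entry in split_terms(sample):
        print(entry["term"], "->", entry["text"])
```

A rule this simple would misfire on running headers, capitalised abbreviations, and OCR noise, which is why real extraction needs additional heuristics and metadata.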
Files
Size | Checksum |
---|---|
638.1 MB | md5:a013c43a1e1eab1fc9044df5453e5e01 |