Dataset Open Access

EconBiz Images for Text Extraction from Scholarly Figures

Böschen, Falk; Scherp, Ansgar

DataCite XML Export

<?xml version='1.0' encoding='utf-8'?>
<resource xmlns:xsi="" xmlns="" xsi:schemaLocation="">
  <identifier identifierType="DOI">10.5281/zenodo.2843254</identifier>
      <creatorName>Böschen, Falk</creatorName>
      <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="">0000-0003-4223-5353</nameIdentifier>
      <affiliation>Kiel University</affiliation>
      <creatorName>Scherp, Ansgar</creatorName>
      <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="">0000-0002-2653-9245</nameIdentifier>
      <affiliation>Kiel University</affiliation>
    <title>EconBiz Images for Text Extraction from Scholarly Figures</title>
    <subject>Text extraction</subject>
    <subject>Scholarly figures</subject>
    <date dateType="Issued">2019-05-15</date>
  <resourceType resourceTypeGeneral="Dataset"/>
    <alternateIdentifier alternateIdentifierType="url"></alternateIdentifier>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsVersionOf">10.5281/zenodo.2843253</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="URL" relationType="IsPartOf"></relatedIdentifier>
    <rights rightsURI="">Creative Commons Attribution 4.0 International</rights>
    <rights rightsURI="info:eu-repo/semantics/openAccess">Open Access</rights>
    <description descriptionType="Abstract">&lt;p&gt;Scholarly figures are data visualizations such as bar charts, pie charts, line graphs, maps, scatter plots, or similar figures. Text extraction from scholarly figures is useful in many application scenarios, since text in scholarly figures often contains information that is not present in the surrounding text. This dataset is a corpus of 121 scholarly figures from the economics domain for evaluating text extraction tools. We randomly extracted these figures from a corpus of 288,000 open access publications from &lt;a href=""&gt;EconBiz&lt;/a&gt;. The dataset covers a wide variety of scholarly figures, from bar charts to maps. We manually labeled the figures to create the gold standard.&lt;/p&gt;

&lt;p&gt;We adjusted the provided gold standard to have a uniform format for all datasets. Each figure is accompanied by a TSV (tab-separated values) file in which each row corresponds to one text line and has the following structure:&lt;/p&gt;

	&lt;li&gt;X-coordinate of the center of the bounding box in pixels&lt;/li&gt;
	&lt;li&gt;Y-coordinate of the center of the bounding box in pixels&lt;/li&gt;
	&lt;li&gt;Width of the bounding box in pixels&lt;/li&gt;
	&lt;li&gt;Height of the bounding box in pixels&lt;/li&gt;
	&lt;li&gt;Rotation angle around its center in degrees&lt;/li&gt;
	&lt;li&gt;Text inside the bounding box&lt;/li&gt;

&lt;p&gt;In addition, we provide the ground truth in JSON format. A schema file is included in each dataset as well. The dataset is accompanied by a ReadMe file with further information about the figures and their origin.&lt;/p&gt;

&lt;p&gt;If you use this dataset in your own work, please cite one of the papers in the references.&lt;/p&gt;</description>
    <description descriptionType="Other">{"references": ["B\u00f6schen, F. &amp; Scherp, A. Amsaleg L., Gu\u00f0mundsson G., Gurrin C., J\u00f3nsson B., Satoh S. (Ed.) A Comparison of Approaches for Automated Text Extraction from Scholarly Figures Proceedings Part I, Multimedia Modeling - 23rd International Conference MMM 2017, Reykjavik, Iceland, January 4-6, 2017, Springer, 2017, 15-27", "B\u00f6schen, F. &amp; Scherp, A. Bergmann, R.; G\u00f6rg, S. &amp; M\u00fcller, G. (Ed.) Formalization and Preliminary Evaluation of a Pipeline for Text Extraction From Infographics Proceedings of the LWA 2015 Workshops: KDML, FGWM, IR, and FGDB, Trier, Germany, October 7-9, 2015, 2015, 1458, 20-31", "B\u00f6schen, F. &amp; Scherp, A. Vanoirbeek, C. &amp; Genev\u00e8s, P. (Eds.) Multi-oriented Text Extraction from Information Graphics Proceedings of the 2015 ACM Symposium on Document Engineering, DocEng 2015, Lausanne, Switzerland, September 8-11, 2015, ACM, 2015, 35-38"]}</description>
      <funderName>European Commission</funderName>
      <funderIdentifier funderIdentifierType="Crossref Funder ID">10.13039/501100000780</funderIdentifier>
      <awardNumber awardURI="info:eu-repo/grantAgreement/EC/H2020/693092/">693092</awardNumber>
      <awardTitle>Training towards a society of data-savvy information professionals to enable open leadership innovation</awardTitle>
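
The abstract above describes each TSV row as six fields: center x, center y, width, height, rotation angle in degrees, and the text. A minimal Python sketch of reading such a file might look like the following. It is not part of the dataset or its tooling. The names `TextLine`, `parse_gold_standard`, and `corners` are hypothetical, and the sketch assumes the files have no header row, that the fields appear in the order listed in the abstract, and that positive angles rotate in the mathematical sense of the image coordinate system. The ReadMe shipped with the dataset should be checked for the actual conventions.

```python
import csv
import io
import math
from dataclasses import dataclass


@dataclass
class TextLine:
    """One gold-standard row (hypothetical container, not from the dataset)."""
    cx: float      # x-coordinate of the bounding-box center, in pixels
    cy: float      # y-coordinate of the bounding-box center, in pixels
    width: float   # box width, in pixels
    height: float  # box height, in pixels
    angle: float   # rotation around the center, in degrees
    text: str      # text inside the bounding box

    def corners(self):
        """Return the four corners of the rotated box as (x, y) tuples.

        Assumption: the rotation sign convention is not confirmed by the source.
        """
        a = math.radians(self.angle)
        cos_a, sin_a = math.cos(a), math.sin(a)
        hw, hh = self.width / 2, self.height / 2
        offsets = ((-hw, -hh), (hw, -hh), (hw, hh), (-hw, hh))
        return [
            (self.cx + dx * cos_a - dy * sin_a,
             self.cy + dx * sin_a + dy * cos_a)
            for dx, dy in offsets
        ]


def parse_gold_standard(stream):
    """Parse tab-separated rows into TextLine objects.

    Assumption: no header row; rows with fewer than six fields are skipped.
    """
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    lines = []
    for row in reader:
        if len(row) < 6:
            continue  # blank or malformed row
        cx, cy, w, h, angle = (float(v) for v in row[:5])
        # Re-join any extra fields in case the text itself contained a tab.
        lines.append(TextLine(cx, cy, w, h, angle, "\t".join(row[5:])))
    return lines


# Usage with made-up sample rows (not taken from the dataset):
sample = "100\t50\t40\t20\t0\tGDP growth\n10\t10\t4\t2\t90\tYear\n"
parsed = parse_gold_standard(io.StringIO(sample))
for line in parsed:
    print(line.text, line.corners())
```

To read a real file, the same function could be given an open file handle, for example `open(path, newline="", encoding="utf-8")`, where the encoding is also an assumption.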