Presentation Open Access

Automated Metadata Extraction: Challenges and Opportunities

Tyler Skluzacek; Kyle Chard; Ian Foster


MARC21 XML Export

<?xml version='1.0' encoding='UTF-8'?>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nam##2200000uu#4500</leader>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">metadata</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">data mining</subfield>
  </datafield>
  <controlfield tag="005">20221011022628.0</controlfield>
  <controlfield tag="001">7182583</controlfield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">University of Chicago and Argonne National Lab</subfield>
    <subfield code="a">Kyle Chard</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">University of Chicago and Argonne National Lab</subfield>
    <subfield code="a">Ian Foster</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">4240342</subfield>
    <subfield code="z">md5:5987566a9ea75f31e1bb3ec6d497889f</subfield>
    <subfield code="u">https://zenodo.org/record/7182583/files/error_22_presentation.pdf</subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">open</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2022-10-10</subfield>
  </datafield>
  <datafield tag="909" ind1="C" ind2="O">
    <subfield code="p">openaire</subfield>
    <subfield code="p">user-escience-2022</subfield>
    <subfield code="o">oai:zenodo.org:7182583</subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="u">Oak Ridge National Lab</subfield>
    <subfield code="a">Tyler Skluzacek</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Automated Metadata Extraction: Challenges and Opportunities</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">user-escience-2022</subfield>
  </datafield>
  <datafield tag="540" ind1=" " ind2=" ">
    <subfield code="u">https://creativecommons.org/licenses/by/4.0/legalcode</subfield>
    <subfield code="a">Creative Commons Attribution 4.0 International</subfield>
  </datafield>
  <datafield tag="650" ind1="1" ind2="7">
    <subfield code="a">cc-by</subfield>
    <subfield code="2">opendefinition.org</subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p&gt;Proper application of the FAIR data principles is what separates a vibrant data ecosystem, in which research data are frequently shared and reused, from a lifeless data graveyard. Automated metadata extraction systems have been proposed as a means of bolstering the findability, interoperability, and reusability of data repositories with little or no human intervention. These extraction systems mine metadata by crawling a repository and applying lightweight extractors that, for various types of file (e.g., image, CSV file), extract or synthesize relevant attributes. In practice, however, the automated creation of generally useful metadata is fraught with challenges. Data consumers may have different perspectives as to what metadata representations are useful, the standards for recording metadata tend to change over time, and the software model for processing updates can introduce unnecessary human and computational effort. Thus, generalizing extraction for a broad audience of data consumers is a difficult and relatively unsolved problem.&lt;/p&gt;

&lt;p&gt;In this work, we explore these challenges faced by extraction systems in the context of constructing our own extraction system for science data. We first define the metadata extraction problem and provide context to the issues faced in generalizing metadata. Additionally, we identify potential research directions to help alleviate many of these challenges for all automated extraction systems. Ultimately, this work represents a first step in designing ubiquitous metadata extraction systems that can maximize the value of research data while minimizing the human efforts required in doing so.&lt;/p&gt;</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isVersionOf</subfield>
    <subfield code="a">10.5281/zenodo.7182582</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.5281/zenodo.7182583</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">presentation</subfield>
  </datafield>
</record>
