Dataset Open Access
Patrice Lopez;
Caifan Du;
Hannah Cohoon;
James Howison
<?xml version='1.0' encoding='utf-8'?> <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:adms="http://www.w3.org/ns/adms#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dct="http://purl.org/dc/terms/" xmlns:dctype="http://purl.org/dc/dcmitype/" xmlns:dcat="http://www.w3.org/ns/dcat#" xmlns:duv="http://www.w3.org/ns/duv#" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:frapo="http://purl.org/cerif/frapo/" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:gsp="http://www.opengis.net/ont/geosparql#" xmlns:locn="http://www.w3.org/ns/locn#" xmlns:org="http://www.w3.org/ns/org#" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:prov="http://www.w3.org/ns/prov#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:schema="http://schema.org/" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:vcard="http://www.w3.org/2006/vcard/ns#" xmlns:wdrs="http://www.w3.org/2007/05/powder-s#"> <rdf:Description rdf:about="https://doi.org/10.5281/zenodo.4961241"> <rdf:type rdf:resource="http://www.w3.org/ns/dcat#Dataset"/> <dct:type rdf:resource="http://purl.org/dc/dcmitype/Dataset"/> <dct:identifier rdf:datatype="http://www.w3.org/2001/XMLSchema#anyURI">https://doi.org/10.5281/zenodo.4961241</dct:identifier> <foaf:page rdf:resource="https://doi.org/10.5281/zenodo.4961241"/> <dct:creator> <rdf:Description rdf:about="http://orcid.org/0000-0002-9959-9441"> <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Agent"/> <dct:identifier rdf:datatype="http://www.w3.org/2001/XMLSchema#string">0000-0002-9959-9441</dct:identifier> <foaf:name>Patrice Lopez</foaf:name> <org:memberOf> <foaf:Organization> <foaf:name>science-miner</foaf:name> </foaf:Organization> </org:memberOf> </rdf:Description> </dct:creator> <dct:creator> <rdf:Description rdf:about="http://orcid.org/0000-0003-2538-607X"> <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Agent"/> <dct:identifier rdf:datatype="http://www.w3.org/2001/XMLSchema#string">0000-0003-2538-607X</dct:identifier> 
<foaf:name>Caifan Du</foaf:name> <org:memberOf> <foaf:Organization> <foaf:name>University of Texas at Austin</foaf:name> </foaf:Organization> </org:memberOf> </rdf:Description> </dct:creator> <dct:creator> <rdf:Description rdf:about="http://orcid.org/0000-0002-3352-9766"> <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Agent"/> <dct:identifier rdf:datatype="http://www.w3.org/2001/XMLSchema#string">0000-0002-3352-9766</dct:identifier> <foaf:name>Hannah Cohoon</foaf:name> <org:memberOf> <foaf:Organization> <foaf:name>University of Texas at Austin</foaf:name> </foaf:Organization> </org:memberOf> </rdf:Description> </dct:creator> <dct:creator> <rdf:Description rdf:about="http://orcid.org/0000-0002-5702-149X"> <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Agent"/> <dct:identifier rdf:datatype="http://www.w3.org/2001/XMLSchema#string">0000-0002-5702-149X</dct:identifier> <foaf:name>James Howison</foaf:name> <org:memberOf> <foaf:Organization> <foaf:name>University of Texas at Austin</foaf:name> </foaf:Organization> </org:memberOf> </rdf:Description> </dct:creator> <dct:title>Softcite software mention extraction from the CORD-19 publications</dct:title> <dct:publisher> <foaf:Agent> <foaf:name>Zenodo</foaf:name> </foaf:Agent> </dct:publisher> <dct:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#gYear">2021</dct:issued> <dcat:keyword>text mining, software, scholar literature, CORD-19</dcat:keyword> <dct:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2021-06-16</dct:issued> <owl:sameAs rdf:resource="https://zenodo.org/record/4961241"/> <adms:identifier> <adms:Identifier> <skos:notation rdf:datatype="http://www.w3.org/2001/XMLSchema#anyURI">https://zenodo.org/record/4961241</skos:notation> <adms:schemeAgency>url</adms:schemeAgency> </adms:Identifier> </adms:identifier> <dct:isVersionOf rdf:resource="https://doi.org/10.5281/zenodo.4784733"/> <owl:versionInfo>0.2</owl:versionInfo> <dct:description><p><strong>Softcite software mention extraction from the 
CORD-19 publications </strong></p> <p>This dataset is the first result of the extraction of software mentions from the publications of the CORD-19 corpus (<a href="https://allenai.org/data/cord-19">https://allenai.org/data/cord-19</a>) by the Softcite software recognizer, see <a href="https://github.com/ourresearch/software-mentions">https://github.com/ourresearch/software-mentions</a>.</p> <p>The CORD-19 version used for this dataset is the one dated <strong>2021-03-22,</strong> using the <em>metadata.csv</em> file only. We re-harvested the PDFs with <a href="https://github.com/kermitt2/article-dataset-builder">https://github.com/kermitt2/article-dataset-builder</a> in order to also extract the coordinates of software mentions in the PDFs and to take advantage of the latest version of GROBID, which produces better full-text extraction from PDF.</p> <p><strong>Data format </strong></p> <p>The extraction consists of 3 JSON files:</p> <p><strong>annotations.json</strong> contains the individual software annotations, including the <em>software name</em> and any attached attributes (<em>publisher</em>, <em>URL</em> and <em>version</em>). Each annotation is associated with coordinates expressed as bounding boxes in the original PDF. See <a href="https://grobid.readthedocs.io/en/latest/Coordinates-in-PDF/">Coordinates of structures in the original PDF</a> for more details on the coordinate format.</p> <p>The citation context is the sentence from which the software name and its attributes are extracted. It is added to the JSON structure (field <em>context</em>), together with the identifier of the document to which the annotation belongs (field <em>document</em>, pointing to entries available in <em>documents.json</em>) and a list of bibliographical references attached to the software name (field <em>references</em>, pointing to entries available in <em>references.json</em>, with the reference marker string used). 
See <a href="https://github.com/ourresearch/software-mentions">https://github.com/ourresearch/software-mentions</a> for more details on the extracted attributes.</p> <p>If the software name was successfully disambiguated against Wikidata (&quot;entity linking&quot;), the result appears in the field <em>wikidataId</em> as a Wikidata entity identifier and in the field <em>wikipediaExternalRef</em> as a Wikipedia PageID from the English Wikipedia. Entity linking is performed with <a href="https://github.com/kermitt2/entity-fishing">entity-fishing</a>.</p> <p><strong>documents.json</strong> contains the metadata of all the CORD-19 documents containing at least one software annotation. The metadata are given as a CrossRef JSON structure. The abstract is included in the metadata most of the time, along with some additional fields extracted by GROBID directly from the PDF. In addition, the page sizes and the unique file path to the PDF are provided so that annotations can be displayed directly on the PDF (see <a href="https://grobid.readthedocs.io/en/latest/Coordinates-in-PDF/">Coordinates of structures in the original PDF</a> for more details on the PDF annotation display mechanism).</p> <p><strong>references.json </strong>contains the parsed reference entries associated with software mentions. These references are given in the field <em>tei</em>, encoded in the TEI XML format produced by GROBID. 
The extracted raw references have been matched against CrossRef to get a DOI and more complete metadata with <a href="https://github.com/kermitt2/biblio-glutton">biblio-glutton</a>.</p> <p><strong>Statistics</strong></p> <p>CORD-19 version: 2021-03-22</p> <p>- total Open Access full texts: 211,213<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - with at least one software mention: 76,448</p> <p>- total software name annotations: 318,138<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - with linked Wikidata ID: 117,193</p> <p>- associated field&nbsp;&nbsp;&nbsp;<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - publisher: 62,240&nbsp;<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - version: 105,661<br> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp; - URL: 29,753</p> <p>- associated bibliographical references: 61,170<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - distinct references: 15,931<br> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp; - distinct with matched DOI: 10,611<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - distinct with matched PMC ID: 6,435</p> <p><strong>License and acknowledgements</strong></p> <p>This dataset is licensed under a Creative Commons Attribution 4.0 International License.</p> <p>We thank Alfred P. 
Sloan Foundation for supporting this work.</p></dct:description> <dct:accessRights rdf:resource="http://publications.europa.eu/resource/authority/access-right/PUBLIC"/> <dct:accessRights> <dct:RightsStatement rdf:about="info:eu-repo/semantics/openAccess"> <rdfs:label>Open Access</rdfs:label> </dct:RightsStatement> </dct:accessRights> <dcat:distribution> <dcat:Distribution> <dct:license rdf:resource="https://creativecommons.org/licenses/by/4.0/legalcode"/> <dcat:accessURL rdf:resource="https://doi.org/10.5281/zenodo.4961241"/> </dcat:Distribution> </dcat:distribution> <dcat:distribution> <dcat:Distribution> <dcat:accessURL rdf:resource="https://doi.org/10.5281/zenodo.4961241"/> <dcat:byteSize>236749509</dcat:byteSize> <dcat:downloadURL rdf:resource="https://zenodo.org/record/4961241/files/annotations.json"/> <dcat:mediaType>application/json</dcat:mediaType> </dcat:Distribution> </dcat:distribution> <dcat:distribution> <dcat:Distribution> <dcat:accessURL rdf:resource="https://doi.org/10.5281/zenodo.4961241"/> <dcat:byteSize>477461239</dcat:byteSize> <dcat:downloadURL rdf:resource="https://zenodo.org/record/4961241/files/documents.json"/> <dcat:mediaType>application/json</dcat:mediaType> </dcat:Distribution> </dcat:distribution> <dcat:distribution> <dcat:Distribution> <dcat:accessURL rdf:resource="https://doi.org/10.5281/zenodo.4961241"/> <dcat:byteSize>4039</dcat:byteSize> <dcat:downloadURL rdf:resource="https://zenodo.org/record/4961241/files/readme.md"/> </dcat:Distribution> </dcat:distribution> <dcat:distribution> <dcat:Distribution> <dcat:accessURL rdf:resource="https://doi.org/10.5281/zenodo.4961241"/> <dcat:byteSize>47163372</dcat:byteSize> <dcat:downloadURL rdf:resource="https://zenodo.org/record/4961241/files/references.json"/> <dcat:mediaType>application/json</dcat:mediaType> </dcat:Distribution> </dcat:distribution> </rdf:Description> </rdf:RDF>
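The DCAT export above lists each downloadable file as a `dcat:Distribution` carrying a `dcat:downloadURL`, a `dcat:byteSize` and, for the JSON files, a `dcat:mediaType`. As a minimal sketch, those entries can be read out with Python's standard `xml.etree.ElementTree`. The `SAMPLE` string is a trimmed, well-formed fragment written for this illustration (it is not the full record), and the function name `list_distributions` and its output keys are assumptions made here, not part of any Zenodo tooling.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as declared in the record's rdf:RDF root element.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcat": "http://www.w3.org/ns/dcat#",
}

# Trimmed fragment of the export, reduced to three distributions
# (illustration only; the real record has more elements).
SAMPLE = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcat="http://www.w3.org/ns/dcat#">
  <rdf:Description rdf:about="https://doi.org/10.5281/zenodo.4961241">
    <dcat:distribution>
      <dcat:Distribution>
        <dcat:accessURL rdf:resource="https://doi.org/10.5281/zenodo.4961241"/>
      </dcat:Distribution>
    </dcat:distribution>
    <dcat:distribution>
      <dcat:Distribution>
        <dcat:byteSize>236749509</dcat:byteSize>
        <dcat:downloadURL rdf:resource="https://zenodo.org/record/4961241/files/annotations.json"/>
        <dcat:mediaType>application/json</dcat:mediaType>
      </dcat:Distribution>
    </dcat:distribution>
    <dcat:distribution>
      <dcat:Distribution>
        <dcat:byteSize>4039</dcat:byteSize>
        <dcat:downloadURL rdf:resource="https://zenodo.org/record/4961241/files/readme.md"/>
      </dcat:Distribution>
    </dcat:distribution>
  </rdf:Description>
</rdf:RDF>
"""


def list_distributions(rdf_xml):
    """Return one dict per distribution that has a download URL."""
    root = ET.fromstring(rdf_xml)
    resource_attr = "{%s}resource" % NS["rdf"]
    files = []
    for dist in root.iterfind(".//dcat:Distribution", NS):
        url = dist.find("dcat:downloadURL", NS)
        if url is None:
            # Entries without a file (e.g. the license-only one) are skipped.
            continue
        size = dist.find("dcat:byteSize", NS)
        media = dist.find("dcat:mediaType", NS)
        files.append({
            "url": url.get(resource_attr),
            "bytes": int(size.text) if size is not None else None,
            "media_type": media.text if media is not None else None,
        })
    return files


if __name__ == "__main__":
    for entry in list_distributions(SAMPLE):
        print(entry["url"], entry["bytes"], entry["media_type"])
```

The embedded HTML in `dct:description` is shown unescaped on this page, so the text as displayed here would likely need the description re-escaped (or the RDF/XML fetched directly from the record's export) before a strict XML parser accepts it.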
| | All versions | This version |
|---|---|---|
| Views | 370 | 152 |
| Downloads | 114 | 41 |
| Data volume | 25.7 GB | 9.6 GB |
| Unique views | 304 | 143 |
| Unique downloads | 59 | 25 |