There is a newer version of this record available.

Software Open Access

HeLI-OTS 1.2 with Python examples

Jauhiainen, Tommi; Jauhiainen, Heidi


DataCite XML Export

<?xml version='1.0' encoding='utf-8'?>
<resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://datacite.org/schema/kernel-4" xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4.1/metadata.xsd">
  <identifier identifierType="DOI">10.5281/zenodo.5853116</identifier>
  <creators>
    <creator>
      <creatorName>Jauhiainen, Tommi</creatorName>
      <givenName>Tommi</givenName>
      <familyName>Jauhiainen</familyName>
      <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0002-6474-3570</nameIdentifier>
      <affiliation>University of Helsinki</affiliation>
    </creator>
    <creator>
      <creatorName>Jauhiainen, Heidi</creatorName>
      <givenName>Heidi</givenName>
      <familyName>Jauhiainen</familyName>
      <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0002-8227-5627</nameIdentifier>
      <affiliation>University of Helsinki</affiliation>
    </creator>
  </creators>
  <titles>
    <title>HeLI-OTS 1.2 with Python examples</title>
  </titles>
  <publisher>Zenodo</publisher>
  <publicationYear>2022</publicationYear>
  <subjects>
    <subject>language identification</subject>
  </subjects>
  <dates>
    <date dateType="Issued">2022-01-15</date>
  </dates>
  <language>en</language>
  <resourceType resourceTypeGeneral="Software"/>
  <alternateIdentifiers>
    <alternateIdentifier alternateIdentifierType="url">https://zenodo.org/record/5853116</alternateIdentifier>
  </alternateIdentifiers>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsVersionOf">10.5281/zenodo.4780897</relatedIdentifier>
  </relatedIdentifiers>
  <version>1.2</version>
  <rightsList>
    <rights rightsURI="https://creativecommons.org/licenses/by/4.0/legalcode">Creative Commons Attribution 4.0 International</rights>
    <rights rightsURI="info:eu-repo/semantics/openAccess">Open Access</rights>
  </rightsList>
  <descriptions>
    <description descriptionType="Abstract">&lt;p&gt;HeLI off-the-shelf language identifier with language models for 200 languages.&lt;/p&gt;

&lt;p&gt;Usage:&lt;br&gt;
java -jar HeLI.jar -r &amp;lt;infile&amp;gt; -w &amp;lt;outfile&amp;gt;&lt;/p&gt;

&lt;p&gt;The program will read the &amp;lt;infile&amp;gt; and classify the language of each line as one of the 200 languages it knows&lt;br&gt;
and write the results, one ISO 639-3 code per line, into the file &amp;lt;outfile&amp;gt;.&lt;/p&gt;

&lt;p&gt;You can use the -c option to make the program print a confidence score for the identification after each language code.&lt;/p&gt;

&lt;p&gt;Usage:&lt;br&gt;
java -jar HeLI.jar -c -r &amp;lt;infile&amp;gt; -w &amp;lt;outfile&amp;gt;&lt;/p&gt;

&lt;p&gt;You can give a comma-separated list of ISO 639-3 identifiers for the relevant languages after the -l option.&lt;/p&gt;

&lt;p&gt;Usage:&lt;br&gt;
java -jar HeLI.jar -r &amp;lt;infile&amp;gt; -w &amp;lt;outfile&amp;gt; -l fin,swe,eng&lt;/p&gt;

&lt;p&gt;You can give the number of top-scored languages to print after the -t option. This overrides the -c confidence output.&lt;/p&gt;

&lt;p&gt;Usage:&lt;br&gt;
java -jar HeLI.jar -r &amp;lt;infile&amp;gt; -w &amp;lt;outfile&amp;gt; -l fin,swe,eng -t 2&lt;/p&gt;

&lt;p&gt;If you omit both of the filenames, the program will read the standard input one line at a time and write the result to standard output.&lt;/p&gt;

&lt;p&gt;On a single core of a 2021 laptop, it can identify c. 3000 sentences per second and needs around 3 gigabytes of memory.&lt;/p&gt;

&lt;p&gt;If you use this program in producing scientific publications, please refer to:&amp;nbsp;&lt;br&gt;
&amp;nbsp;@inproceedings{jauhiainen-etal-2017-evaluation,&lt;br&gt;
&amp;nbsp; &amp;nbsp; &amp;nbsp;title = &amp;quot;Evaluation of language identification methods using 285 languages&amp;quot;,&lt;br&gt;
&amp;nbsp; &amp;nbsp; &amp;nbsp;author = &amp;quot;Jauhiainen, Tommi &amp;nbsp;and&lt;br&gt;
&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Lind{\&amp;#39;e}n, Krister &amp;nbsp;and&lt;br&gt;
&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Jauhiainen, Heidi&amp;quot;,&lt;br&gt;
&amp;nbsp; &amp;nbsp; &amp;nbsp;booktitle = &amp;quot;Proceedings of the 21st Nordic Conference on Computational Linguistics&amp;quot;,&lt;br&gt;
&amp;nbsp; &amp;nbsp; &amp;nbsp;month = may,&lt;br&gt;
&amp;nbsp; &amp;nbsp; &amp;nbsp;year = &amp;quot;2017&amp;quot;,&lt;br&gt;
&amp;nbsp; &amp;nbsp; &amp;nbsp;address = &amp;quot;Gothenburg, Sweden&amp;quot;,&lt;br&gt;
&amp;nbsp; &amp;nbsp; &amp;nbsp;publisher = &amp;quot;Association for Computational Linguistics&amp;quot;,&lt;br&gt;
&amp;nbsp; &amp;nbsp; &amp;nbsp;url = &amp;quot;https://www.aclweb.org/anthology/W17-0221&amp;quot;,&lt;br&gt;
&amp;nbsp; &amp;nbsp; &amp;nbsp;pages = &amp;quot;183--191&amp;quot;,&lt;br&gt;
&amp;nbsp;}&lt;/p&gt;

&lt;p&gt;Producing and publishing this software has been partly supported by The Finnish Research Impact Foundation Tandem Industry Academia funding in cooperation with Lingsoft.&lt;/p&gt;</description>
    <description descriptionType="Other">{"references": ["Jauhiainen, Tommi et al. (2017). Evaluation of language identification methods using 285 languages. https://www.aclweb.org/anthology/W17-0221"]}</description>
  </descriptions>
</resource>
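
The command lines given in the abstract can also be assembled and run from Python. The following is a minimal sketch and not part of the distributed Python examples. The function names `build_heli_command` and `run_heli` and the default jar location are assumptions made for illustration. Only the flags -r, -w, -c, -l and -t are taken from the description above.

```python
import subprocess
from typing import List, Optional, Sequence


def build_heli_command(
    infile: str,
    outfile: str,
    jar_path: str = "HeLI.jar",  # assumed location of the jar file
    languages: Optional[Sequence[str]] = None,
    top_n: Optional[int] = None,
    confidence: bool = False,
) -> List[str]:
    """Build the argument list for one HeLI run (hypothetical helper)."""
    cmd = ["java", "-jar", jar_path]
    if confidence:
        # -c: print a confidence score after each language code
        cmd.append("-c")
    # -r <infile> and -w <outfile> as in the usage lines of the abstract
    cmd += ["-r", infile, "-w", outfile]
    if languages:
        # -l: comma-separated ISO 639-3 codes of the relevant languages
        cmd += ["-l", ",".join(languages)]
    if top_n is not None:
        # -t: number of top-scored languages to print
        cmd += ["-t", str(top_n)]
    return cmd


def run_heli(infile: str, outfile: str, **options) -> None:
    """Run HeLI once; raises CalledProcessError if java exits with an error."""
    subprocess.run(build_heli_command(infile, outfile, **options), check=True)
```

For example, `run_heli("in.txt", "out.txt", languages=["fin", "swe", "eng"], top_n=2)` would issue the same command as the last usage line in the abstract, assuming Java is installed and HeLI.jar is in the working directory.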
                  All versions   This version
Views             972            122
Downloads         236            58
Data volume       3.0 GB         573.2 MB
Unique views      724            98
Unique downloads  114            24
