HeLI-OTS 2.0

Jauhiainen, Tommi; Jauhiainen, Heidi; Valosaari, Santtu

doi:10.5281/zenodo.10907468

Published April 2, 2024 | Version 2.0



1. University of Helsinki
2. University of Jyväskylä

# HeLI-OTS 2.0

HeLI off-the-shelf language identifier with language models for 220 languages.

# Performance

It can identify approximately 600 to 1,700 sentences (averaging about 150 characters each) per second from a file, using one core and around 4.3 gigabytes of memory on a modern laptop.
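As a rough planning aid, the throughput range above can be turned into a run-time estimate for a given number of lines. The following is a minimal sketch, not part of HeLI-OTS: the class and method names are hypothetical, and the estimate excludes the time needed to load the language models.

```java
import java.time.Duration;

// Hypothetical helper, not part of HeLI-OTS: converts the throughput range
// quoted in this README into a best-case and worst-case identification time.
final class HeliRuntimeEstimate {
    // Sentences per second on one core, as quoted above.
    static final double MIN_SENTENCES_PER_SECOND = 600.0;
    static final double MAX_SENTENCES_PER_SECOND = 1700.0;

    /** Slowest expected identification time for the given number of lines. */
    static Duration worstCase(long lines) {
        return Duration.ofMillis(Math.round(lines / MIN_SENTENCES_PER_SECOND * 1000.0));
    }

    /** Fastest expected identification time for the given number of lines. */
    static Duration bestCase(long lines) {
        return Duration.ofMillis(Math.round(lines / MAX_SENTENCES_PER_SECOND * 1000.0));
    }

    public static void main(String[] args) {
        long lines = 1_000_000L;
        System.out.println("best case:  " + bestCase(lines).getSeconds() + " s");
        System.out.println("worst case: " + worstCase(lines).getSeconds() + " s");
    }
}
```

Actual speed depends on the hardware and on line length, so the result should be read as an order-of-magnitude figure only.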

# Requirements

Java

The software has been created and tested on macOS and Windows 11.

# Setting up HeLI-OTS

The GitHub repository does not include a pre-compiled version of HeLI.jar. The .jar file can be downloaded from:

https://zenodo.org/doi/10.5281/zenodo.4780897

Note that you need the Java Development Kit (JDK) in order to create .jar files. The Java Runtime Environment (JRE) does NOT include the jar program.

HeLI.jar can be created from the command line within the src folder using:

```
jar cmf HeLI.mf HeLI.jar HeLI.class HeLI.java languagelist LanguageModels confidenceThresholds
```

# Command line use

To use the language identifier, you only need to download the HeLI.jar file, which is used as follows.

These examples are for the jar file. The GitHub version of the program can be run directly by leaving out ```-jar``` and ```.jar``` from the commands.

Please note that loading the language models takes the same amount of time (up to one minute) regardless of the size of the text file.

Usage:

```
java -jar HeLI.jar -r <infile> -w <outfile>
```

The program reads the <infile>, classifies the language of each line as one of the 220 languages it knows, and writes the results, one ISO 639-3 code per line, into the file <outfile>.
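Because the output holds one ISO 639-3 code per input line, it is easy to post-process. The following minimal sketch tallies how many lines were assigned to each language. It is not part of HeLI-OTS: the class name `HeliOutputTally` is hypothetical, and skipping empty lines is an assumption made for robustness.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of HeLI-OTS: counts lines per language code.
final class HeliOutputTally {
    /** Counts how many lines were assigned to each language code. */
    static Map<String, Integer> countByLanguage(List<String> codes) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String code : codes) {
            String trimmed = code.trim();
            if (trimmed.isEmpty()) {
                continue; // assumption: ignore empty lines rather than count them
            }
            counts.merge(trimmed, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the <outfile> written by HeLI.jar
        List<String> codes = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        countByLanguage(codes).forEach((code, n) -> System.out.println(code + "\t" + n));
    }
}
```

Since line i of the output corresponds to line i of the input, the same list can also be zipped with the input lines to label each sentence.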

You can use the -c option to make the program print a confidence score for the identification after each language code. The lower the confidence score, the more certain the identification is.

Usage:

```
java -jar HeLI.jar -c -r <infile> -w <outfile>
```
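The confidence score can be used to keep only the identifications you trust. The exact layout of the -c output is not specified here, so the following sketch assumes that each output line holds the language code followed by whitespace and a numeric score; check this against real output before relying on it. The class name `HeliConfidenceLine` and the threshold are hypothetical.

```java
// Hypothetical helper, not part of HeLI-OTS. ASSUMPTION: each line written
// with -c looks like "<code><whitespace><score>", e.g. a tab-separated pair.
final class HeliConfidenceLine {
    final String language;
    final double score;

    HeliConfidenceLine(String language, double score) {
        this.language = language;
        this.score = score;
    }

    /** Parses one output line under the assumed two-column layout. */
    static HeliConfidenceLine parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Unexpected line layout: " + line);
        }
        return new HeliConfidenceLine(parts[0], Double.parseDouble(parts[1]));
    }

    /**
     * Follows the README's statement that a lower score means a more certain
     * identification: accepts the line when its score does not exceed the threshold.
     */
    boolean isAtLeastAsSureAs(double threshold) {
        return score <= threshold;
    }
}
```

A suitable threshold has to be chosen empirically for your data; none is suggested by this README.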

You can give a comma-separated list of ISO 639-3 identifiers for the relevant languages after the -l option.

Usage:

```
java -jar HeLI.jar -r <infile> -w <outfile> -l fin,swe,eng
```

You can give the number of top-scored languages to print after the -t option.

Usage:

```
java -jar HeLI.jar -r <infile> -w <outfile> -l fin,swe,eng -t 2
```
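If you call the identifier from another Java program, the -r, -w, -l and -t options shown above can be assembled programmatically. The following is a minimal sketch that launches HeLI.jar as an external process; the class `HeliCommand` and its method names are hypothetical and not part of HeLI-OTS.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper, not part of HeLI-OTS: builds and runs the command line.
final class HeliCommand {
    /**
     * Builds the argument list for file-based identification.
     * An empty language list omits -l; topN <= 0 omits -t.
     */
    static List<String> build(String jarPath, String inFile, String outFile,
                              List<String> languages, int topN) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("-r");
        cmd.add(inFile);
        cmd.add("-w");
        cmd.add(outFile);
        if (!languages.isEmpty()) {
            cmd.add("-l");
            cmd.add(String.join(",", languages)); // comma-separated ISO 639-3 codes
        }
        if (topN > 0) {
            cmd.add("-t");
            cmd.add(Integer.toString(topN));
        }
        return cmd;
    }

    /** Runs the command, forwarding its console output, and returns the exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        return new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```

Because the language models are loaded on every start, it is usually better to pass one large file per invocation than to launch the process once per sentence.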

You can activate language set identification with -s. If a row contains longer passages in multiple languages, all the languages detected in the row will be returned. You must give the maximum number of resulting languages after the -s option. This option overrides the confidence score (-c) and top-scored languages (-t) options.

Usage:

```
java -jar HeLI.jar -r <infile> -w <outfile> -l fin,swe,eng -s 2
```

If you omit both of the filenames, the program will read the standard input one line at a time and write the result to standard output.

# Citations

If you use this program in producing scientific publications, please cite:

```
@inproceedings{heliots2022,
    title = "{H}e{LI-OTS}, Off-the-shelf Language Identifier for Text",
    author = "Jauhiainen, Tommi and
      Jauhiainen, Heidi and
      Lind{\'e}n, Krister",
    booktitle = "Proceedings of the 13th Conference on Language Resources and Evaluation",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.416.pdf",
    pages = "3912--3922",
    language = "English",
}
```

HeLI-OTS-2.0.zip includes the complete source code for the software.

# Acknowledgements

Producing and publishing this software has been partly supported by the Finnish Research Impact Foundation's Tandem Industry Academia funding in cooperation with Lingsoft, by the Kone Foundation, by the European Union – NextGenerationEU instrument, and by the Research Council of Finland under grant number 358720 (FIN-CLARIAH – Developing a Common RI for CLARIAH Finland).

# Files

Total size: 106.2 MB

| Name | MD5 | Size |
|---|---|---|
| HeLI-OTS-2.0.zip | 87fe0b1f17bc7ab97a765def903a0a82 | 52.8 MB |
| HeLI.jar | 639fee51f7e069e68fe0500210c82a19 | 53.3 MB |
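
The MD5 checksums listed for the files can be used to confirm that a download is intact. The following is a minimal sketch using only the Java standard library; the class name `Md5Check` is hypothetical and not part of HeLI-OTS. It reads the whole file into memory, which is acceptable for files of the sizes listed here.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of HeLI-OTS: compares a file's MD5 to an expected value.
final class Md5Check {
    /** Returns the lowercase hexadecimal MD5 digest of the given bytes. */
    static String md5Hex(byte[] data) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest(data)) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the downloaded file, args[1]: expected MD5 from the file list
        String actual = md5Hex(Files.readAllBytes(Paths.get(args[0])));
        System.out.println(actual.equalsIgnoreCase(args[1]) ? "OK" : "MISMATCH: " + actual);
    }
}
```

For example, it could be run as `java Md5Check HeLI.jar 639fee51f7e069e68fe0500210c82a19` after compiling the class.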

Additional details

Programming language: Java

Jauhiainen, Tommi et al. (2022). HeLI-OTS, Off-the-shelf Language Identifier for Text. http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.416.pdf
Jauhiainen, Tommi et al. (2017). Evaluation of language identification methods using 285 languages. https://www.aclweb.org/anthology/W17-0221
Jauhiainen, Tommi et al. (2015). Language set identification in noisy synthetic multilingual documents. https://doi.org/10.1007/978-3-319-18111-0_48

