Published April 2, 2024 | Version 2.0
# HeLI-OTS 2.0
HeLI off-the-shelf language identifier with language models for 220 languages.
# Performance
It can identify c. 600-1700 sentences (averaging c. 150 characters) per second from a file, using one core and around 4.3 gigabytes of memory on a modern laptop.
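As a rough illustration of what these figures mean for a large file, the sketch below adds the model loading time (up to one minute, as noted under "Command line use") to the time implied by the throughput range. The function name `estimate_seconds` and the one-million-line input are hypothetical; actual speed depends on the hardware and on line length.

```shell
# Rough wall-clock estimate: model loading plus per-sentence throughput.
# The 60-second default is the upper bound for model loading quoted in this README.
estimate_seconds() {
    lines=$1        # number of lines (sentences) in the input file
    rate=$2         # sentences per second, e.g. 600 (slow end) or 1700 (fast end)
    load=${3:-60}   # model loading time in seconds
    # Integer ceiling division, so a partial second counts as a full one.
    echo $(( load + (lines + rate - 1) / rate ))
}

estimate_seconds 1000000 600     # slow end: prints 1727 (about 29 minutes)
estimate_seconds 1000000 1700    # fast end: prints 649 (about 11 minutes)
```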
# Requirements
Java
The software has been created and tested on macOS and Windows 11.
# Setting up HeLI-OTS
The GitHub repository does not include a pre-compiled version of HeLI.jar. The .jar file can be downloaded from:
https://zenodo.org/doi/10.5281/zenodo.4780897
Note that you need the Java Development Kit (JDK) in order to create .jar files. The Java Runtime Environment (JRE) does NOT include the jar program.
HeLI.jar can be created from the command line within the src folder using:
```
jar cmf HeLI.mf HeLI.jar HeLI.class HeLI.java languagelist LanguageModels confidenceThresholds
```
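Before running the command above, it can help to confirm that the `jar` tool is actually on the PATH, since a JRE-only installation will not provide it. The following is a minimal sketch; the function name `have_jar_tool` is hypothetical and not part of HeLI-OTS.

```shell
# Report whether the "jar" tool (shipped with a JDK, not with a plain JRE) is on PATH.
have_jar_tool() {
    if command -v jar >/dev/null 2>&1; then
        echo "jar found at: $(command -v jar)"
    else
        echo "jar not found on PATH; install a JDK rather than only a JRE" >&2
        return 1
    fi
}

# Usage (run inside the src folder before building):
#   have_jar_tool && jar cmf HeLI.mf HeLI.jar HeLI.class HeLI.java languagelist LanguageModels confidenceThresholds
```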
# Command line use
To use the language identifier, you only need to download the HeLI.jar file, which is used as follows.
These examples are for the jar file. To run the program directly from the GitHub version, leave out `-jar` and `.jar`.
Please note that loading the language models takes the same amount of time (up to one minute) regardless of the size of the text file.
Usage:
```
java -jar HeLI.jar -r <infile> -w <outfile>
```
The program will read the <infile>, classify the language of each line as one of the 220 languages it knows, and write the results, one ISO 639-3 code per line, into the file <outfile>.
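Because the output holds exactly one code per input line, the two files can be joined line by line afterwards. The sketch below does this with `paste`; the function name `label_lines` and the file names are hypothetical, and it assumes the output file was produced from that same input file.

```shell
# Print "<code><TAB><original line>" for every line of the input.
# Assumes outfile was written by: java -jar HeLI.jar -r infile -w outfile
label_lines() {
    infile=$1
    outfile=$2
    n_in=$(wc -l < "$infile")
    n_out=$(wc -l < "$outfile")
    # Arithmetic expansion tolerates the padding some wc implementations add.
    if [ $((n_in)) -ne $((n_out)) ]; then
        echo "line counts differ: $infile vs $outfile" >&2
        return 1
    fi
    paste "$outfile" "$infile"
}

# Usage:
#   label_lines input.txt results.txt > labelled.tsv
```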
You can use the -c option to make the program print a confidence score for the identification after each language code.
The lower the 'confidence score' the more sure the identification is.
Usage:
```
java -jar HeLI.jar -c -r <infile> -w <outfile>
```
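The scored output can be summarised per language with standard tools. The sketch below assumes each output line has the form `<code> <score>`; the exact separator is not stated here, so it relies on awk's default whitespace splitting. The function name `scores_by_language` is hypothetical.

```shell
# Print "<code><TAB><line count><TAB><mean confidence score>" for each language
# found in a file written with the -c option.
# Assumption: field 1 is the language code and field 2 is a numeric score.
scores_by_language() {
    LC_ALL=C awk '
        NF >= 2 { n[$1]++; sum[$1] += $2 }
        END     { for (l in n) printf "%s\t%d\t%.3f\n", l, n[l], sum[l] / n[l] }
    ' "$1" | LC_ALL=C sort
}

# Usage:
#   scores_by_language results.txt
```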
You can give a comma-separated list of ISO 639-3 identifiers for the relevant languages after the -l option.
Usage:
```
java -jar HeLI.jar -r <infile> -w <outfile> -l fin,swe,eng
```
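ISO 639-3 identifiers are three lowercase letters, so a common slip is passing two-letter ISO 639-1 codes such as `fi` or `en`. The sketch below builds the comma-separated argument and rejects anything that is not three lowercase letters. The function name `make_language_list` is hypothetical, and the check is only about the shape of the code: it does not verify that a code is among the 220 languages the program knows.

```shell
# Join the given codes with commas, refusing anything that is not
# three lowercase letters (the shape of an ISO 639-3 identifier).
make_language_list() {
    list=""
    for code in "$@"; do
        case $code in
            [a-z][a-z][a-z]) list="${list:+$list,}$code" ;;
            *) echo "not a three-letter ISO 639-3 code: $code" >&2
               return 1 ;;
        esac
    done
    echo "$list"
}

make_language_list fin swe eng    # prints fin,swe,eng

# Usage:
#   java -jar HeLI.jar -r input.txt -w results.txt -l "$(make_language_list fin swe eng)"
```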
You can give the number of top-scored languages to print after the -t option.
Usage:
```
java -jar HeLI.jar -r <infile> -w <outfile> -l fin,swe,eng -t 2
```
You can activate language set identification with -s. If a row contains longer passages in multiple languages, all the detected languages in the row will be returned. You must give the maximum number of resulting languages after the -s option. This option overrides the confidence score (-c) and the printing of several top-scored languages (-t).
Usage:
```
java -jar HeLI.jar -r <infile> -w <outfile> -l fin,swe,eng -s 2
```
If you omit both of the filenames, the program will read the standard input one line at a time and write the result to standard output.
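Reading from standard input makes it possible to use the identifier in a pipeline. The sketch below sends each argument as one line to the program and prints whatever comes back. The variable `HELI` and the function `identify_text` are hypothetical names, and the snippet assumes an sh or bash shell where an unquoted variable is split into words. Since the language models are loaded on every start, it is better to pass many lines in one call than to call it once per line.

```shell
# Command used to start the identifier; override HELI if the jar lives elsewhere.
HELI=${HELI:-java -jar HeLI.jar}

# Send each argument as one input line; one result line per input line is expected.
identify_text() {
    # $HELI is deliberately unquoted so that "java -jar HeLI.jar" splits into words.
    printf '%s\n' "$@" | $HELI
}

# Usage:
#   identify_text "Hyvää huomenta" "Good morning"
#   cat input.txt | java -jar HeLI.jar > results.txt
```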
# Citations
If you use this program in producing scientific publications, please refer to:
```
@inproceedings{heliots2022,
title = "{H}e{LI-OTS}, Off-the-shelf Language Identifier for Text",
author = "Jauhiainen, Tommi and
Jauhiainen, Heidi and
Lind{\'e}n, Krister",
booktitle = "Proceedings of the 13th Conference on Language Resources and Evaluation",
month = jun,
year = "2022",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.416.pdf",
pages = "3912--3922",
language = "English",
}
```
HeLI-OTS-2.0.zip includes the complete source code for the software.
# Acknowledgements
Producing and publishing this software has been partly supported by The Finnish Research Impact Foundation Tandem Industry Academia funding in cooperation with Lingsoft, by the Kone Foundation, by the European Union – NextGenerationEU instrument, and by the Research Council of Finland under grant number 358720 (FIN-CLARIAH – Developing a Common RI for CLARIAH Finland).
Files (106.2 MB)
Checksum | Size |
---|---|
md5:87fe0b1f17bc7ab97a765def903a0a82 | 52.8 MB |
md5:639fee51f7e069e68fe0500210c82a19 | 53.3 MB |
References
- Jauhiainen, Tommi et al. (2022). HeLI-OTS, Off-the-shelf Language Identifier for Text. http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.416.pdf
- Jauhiainen, Tommi et al. (2017). Evaluation of language identification methods using 285 languages. https://www.aclweb.org/anthology/W17-0221
- Jauhiainen, Tommi et al. (2015). Language set identification in noisy synthetic multilingual documents. https://doi.org/10.1007/978-3-319-18111-0_48