Evaluating Automated Machine Translation of LGBTQ+ Terms: Towards Multilingual Homosaurus
## Introduction
This repository accompanies the abstract and the presentation given at the Global Digital Humanities Symposium 2023 at Michigan State University (MSU).
This repository is rooted in the bachelor's thesis of Anna-Maja in 2003. The script translates English Homosaurus terms into four languages: Dutch, Spanish, Polish and German.
A selection of the translations was then evaluated by volunteers, each of whom evaluated the translation of 200 terms. The findings and results are summarized below.
- Not all translators are free. Problems can arise when a translator is paid, or is free but limited in use per day.
- Since Dutch is the only language for which a translation is available so far, we used it for evaluation. The (naive) translation results for Dutch are poor: every translator has an accuracy of less than 40%.
- The different combination strategies yield little improvement in the results.
- The upper bounds of accuracy for the naive translations are very low (less than 50% accuracy).
- Many repeated mistakes were found. Manual scripts that refine the results improve the accuracy more significantly than the strategies do.
## Naive translation
| Tool/Strategy | match with prefLabel | match with altLabel | no match |
|-----------|-----------|-----|----------------------------------|
| Google | 31.39% | 1.07% | 67.53% |
| Azure | 39.51% | 1.71% | 58.75% |
| DeepL | 38.79% | 1.48% | 59.72% |
| Amazon | 35.32% | 1.12% | 63.55% |
| Strategy: Uniform Majority | 40.31% | 1.66% | 58.01% |
| Strategy: Weighted majority | 41.39% | 1.73% | 56.86% |
| Strategy: 1 | 39.50% | 1.53% | 56.86% |
| Strategy: 2 | 40.22% | 1.73% | 58.03% |
| Strategy: 3 | 41.09% | 1.83% | 57.06% |
| Strategy: 4 | 40.27% | 1.73% | 57.98% |
| At least one translator | 49.41% | 1.78% | 48.80% |
In the table above, the 'Uniform Majority' strategy gives each translator equal weight. In the case of a tie, a correct translation among the tied candidates counts as a half (or a third, depending on how many candidates are tied).
In addition, four manually constructed strategies, inspired by [1], were proposed:
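
The tie rule can be illustrated with a minimal sketch. This is not the repository's `compute_statistics.py`; the function name `uniform_majority_score` and the lowercase/whitespace normalization are assumptions made for illustration.

```python
from collections import Counter


def _norm(text):
    # Assumption: translations are compared case-insensitively after trimming.
    return text.strip().lower()


def uniform_majority_score(candidates, accepted_labels):
    """Score one term under equal-weight majority voting.

    candidates: one translation per translator.
    accepted_labels: reference labels counted as correct.
    Returns 1.0 for a correct unique winner, a fraction when a correct
    translation is tied with incorrect ones, and 0.0 otherwise.
    """
    counts = Counter(_norm(c) for c in candidates)
    top = max(counts.values())
    winners = [text for text, n in counts.items() if n == top]
    accepted = {_norm(label) for label in accepted_labels}
    correct_winners = sum(1 for text in winners if text in accepted)
    return correct_winners / len(winners)
```

With two translators on each side of a tie and only one side correct, the term contributes 0.5 to the accuracy; with three distinct candidates of which one is correct, it contributes one third.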
- Strategy 1: if Azure agrees with DeepL, then take the translation. Otherwise, if DeepL agrees with Amazon, take the agreed translation. Otherwise, if Amazon agrees with Google, take the agreed translation. Otherwise, take the translation by Google.
- Strategy 2: if Amazon agrees with Google, then take the translation. Otherwise, if DeepL agrees with Amazon, take the agreed translation. Otherwise, if Amazon agrees with DeepL, take the agreed translation. Otherwise, take the translation by Azure.
- Strategy 3: if all the translators agree, then take the agreed translation. Otherwise, if Google, Amazon, and DeepL agree on the same translation, take the agreed translation. Otherwise, if DeepL, Google, and Azure agree, take the agreed translation. Otherwise, if DeepL, Amazon, and Azure agree, take the agreed translation. Otherwise, if Amazon, Google, and Azure agree, take the agreed translation. Otherwise, if DeepL agrees with Amazon, take the agreed translation. Otherwise, take the translation by Azure.
- Strategy 4: if Google and Amazon agree with Azure, then take the translation. Otherwise, if Google and Amazon agree with DeepL, take the agreed translation. Otherwise, if Google and Amazon agree with Azure, take the agreed translation. Otherwise, take the translation by Azure.
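
Each of these strategies is an ordered cascade of agreement checks with a fallback translator. The following is a minimal sketch of that idea, using Strategy 1 as the example; the function `cascade`, the lowercase translator keys, and exact string equality as the agreement test are assumptions, not the repository's implementation.

```python
def cascade(translations, agreement_rules, fallback):
    """Return the first translation on which a group of translators agrees.

    translations: dict mapping translator name to its translation.
    agreement_rules: ordered list of translator-name tuples to check.
    fallback: translator whose output is used when no rule fires.
    """
    for group in agreement_rules:
        values = {translations[name] for name in group}
        if len(values) == 1:  # every translator in the group agrees
            return values.pop()
    return translations[fallback]


# Strategy 1 as described in the list above.
STRATEGY_1_RULES = [("azure", "deepl"), ("deepl", "amazon"), ("amazon", "google")]
STRATEGY_1_FALLBACK = "google"


def strategy_1(translations):
    return cascade(translations, STRATEGY_1_RULES, STRATEGY_1_FALLBACK)
```

Expressing a strategy as data (an ordered rule list plus a fallback) makes it straightforward to write down the other three strategies without new control flow.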
As a proof of concept, we also provide a weighted majority strategy in which Google has the weight 0.3139, Azure 0.3950, DeepL 0.3879, and Amazon 0.3532. These weights were assigned in proportion to each translator's individual performance. It is worth noting that this is not methodologically sound, since the same results are used both to design the weights and to evaluate them.
Another baseline is 'at least one translator': if at least one of the translators produces the correct translation, the term is counted as correct. This gives the upper bound of this approach. Unfortunately, the upper bound is less than 50% (49.41%).
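
A minimal sketch of weighted majority voting with the weights quoted above. The function name `weighted_majority` and the lowercase translator keys are assumptions for illustration; ties (unlikely with these weights) are broken by dictionary order.

```python
from collections import defaultdict

# Weights as stated in the text.
WEIGHTS = {"google": 0.3139, "azure": 0.3950, "deepl": 0.3879, "amazon": 0.3532}


def weighted_majority(translations, weights=WEIGHTS):
    """Pick the translation with the largest summed translator weight.

    translations: dict mapping translator name to its translation.
    """
    totals = defaultdict(float)
    for name, text in translations.items():
        totals[text] += weights[name]
    return max(totals, key=totals.get)
```

When all four translators disagree, this rule reduces to picking the translator with the highest weight, which is Azure here.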
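
The 'at least one translator' bound can be computed as sketched below. This is an illustrative snippet, not the repository's script; the function name `at_least_one_accuracy` and the input layout are assumptions.

```python
def at_least_one_accuracy(rows):
    """Fraction of terms for which any translator matches a reference label.

    rows: list of (candidates, accepted_labels) pairs, where candidates
    holds one translation per translator for a single term.
    """
    hits = sum(
        1
        for candidates, accepted in rows
        if any(candidate in accepted for candidate in candidates)
    )
    return hits / len(rows)
```

No voting strategy that only selects among the translators' outputs can exceed this value, which is why it serves as an upper bound.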

You can find the corresponding Python script in the folder /Script. The file is ```compute_statistics.py```.
## Refine translation for improvement of accuracy
After manually checking the translations, we found some repetitive mistakes. These can be fixed with a manual script that makes minor changes to the translated results. More specifically, the script applies the following rules, which are in the file ```statistics_survey.py```.
- If `LGBTQ` is in the original English term, then replace `LGBTQ` by `LHBTQ` in the translated term.
- If `transgender people` is in the original English term, then replace `transgenders` by `transgender personen` in the translated term.
- If the term `bisexual people` is in the original English term, then replace `biseksuele mensen` by `biseksuelen` in the translated term.
- In all cases other than the two mentioned above, if the term `people` is in the original English term, then replace `mensen` by `personen` in the translated term.
- If the term `lesbians` is in the original English term, then replace `lesbiennes` by `lesbische vrouwen` in the translated term. The same applies to the capitalized form.
- If the term `Gay victims` is in the original English term, then replace `Homoseksuele slachtoffers` by `Homoslachtoffers` in the translated term. The same applies to the lowercase form.
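
The rules above can be sketched as a single post-processing function. This is a hedged illustration and not the contents of `statistics_survey.py`: the function name `refine_dutch` is hypothetical, the English term is matched case-insensitively, and the third rule is assumed to be triggered by the English phrase `bisexual people`.

```python
def refine_dutch(english, dutch):
    """Apply the refinement rules to one Dutch translation.

    english: the original English term.
    dutch: the machine-translated Dutch term.
    """
    en = english.lower()  # assumption: case-insensitive match on the English term

    if "lgbtq" in en:
        dutch = dutch.replace("LGBTQ", "LHBTQ")

    # The generic 'people' rule only applies when neither special case does.
    if "transgender people" in en:
        dutch = dutch.replace("transgenders", "transgender personen")
    elif "bisexual people" in en:
        dutch = dutch.replace("biseksuele mensen", "biseksuelen")
        dutch = dutch.replace("Biseksuele mensen", "Biseksuelen")
    elif "people" in en:
        dutch = dutch.replace("mensen", "personen")

    if "lesbians" in en:
        dutch = dutch.replace("lesbiennes", "lesbische vrouwen")
        dutch = dutch.replace("Lesbiennes", "Lesbische vrouwen")

    if "gay victims" in en:
        dutch = dutch.replace("Homoseksuele slachtoffers", "Homoslachtoffers")
        dutch = dutch.replace("homoseksuele slachtoffers", "homoslachtoffers")

    return dutch
```

For example, `refine_dutch("LGBTQ+ people", "LGBTQ+ mensen")` applies both the `LGBTQ` rule and the generic `people` rule.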
The table below reports the results after applying these refinement rules.

| Tool/Strategy | match with prefLabel | match with altLabel | no match |
|-----------|-----------|-----|----------------------------------|
| Google | 49.15% | 1.07% | 49.20% |
| Azure | 58.49% | 1.99% | 39.50% |
| DeepL | 56.76% | 1.83% | 41.39% |
| Amazon | 52.11% | 1.58% | 46.29% |
| Strategy: Uniform Majority | 59.87% | 2.04% | 38.08% |
| Strategy: Weighted majority | 60.49% | 2.04% | 37.46% |
| Strategy: 1 | 58.34% | 1.84% | 39.81% |
| Strategy: 2 | 59.31% | 2.09% | 38.59% |
| Strategy: 3 | 60.23% | 2.09% | 37.67% |
| Strategy: 4 | 59.31% | 2.04% | 38.64% |
| At least one translator | 70.23% | 2.39% | 27.36% |

This work offers insights into the automated translation of Homosaurus terms. The lessons learned are as follows:
- By introducing these manually designed refinement rules, some of the repeated mistakes can be fixed. The upper bound improves to 70.23%.
- However, the performance remains limited (around 60%), which is still not as reliable as needed in real practice. The strategies introduced above show some improvement in the results, but not a significant one.
- Adding more rules, such as a suggested glossary, is to be considered for further research.
## A crowdsourcing approach for refinement
Some volunteers performed a round of study over 200 terms. The analytical results are included in the thesis. The results show that a crowdsourcing approach is not preferred due to its high cost in time, the unreliability of the volunteers' knowledge, and their unknown biases. We also observed that in many cases, especially in Polish and German, some terms cannot be directly translated because no term corresponds to the original English term. There are many such cases in Dutch, but based on the study this is not very well captured for Polish, German, and Spanish.
-----
We used six (1.16.0) and google translate (3.0.0). Please install the packages:
- pip install six
- pip install google-cloud-translate
- pip install Reverso-API deepl deep_translator
- pip install boto3
- pip install rdflib
To reproduce or recreate the updated translations, you will need to create a file called .tokens in the directory where you want to run the scripts. The file consists of lines such as the following:
```
deepl=DEEPL_KEY_HERE
google=GOOGLE_AUTH_FILE_HERE
azure=AZURE_KEY_HERE
```
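
A file in this form could be read as sketched below. This assumes one `key=value` pair per line; the function names `parse_tokens` and `load_tokens` are hypothetical, and the repository's scripts may read the file differently.

```python
from pathlib import Path


def parse_tokens(text):
    """Parse key=value lines into a dict, skipping blank or malformed lines."""
    tokens = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        # Split on the first '=' only, since keys may themselves contain '='.
        key, value = line.split("=", 1)
        tokens[key.strip()] = value.strip()
    return tokens


def load_tokens(path=".tokens"):
    """Read the tokens file from the given path (default: current directory)."""
    return parse_tokens(Path(path).read_text(encoding="utf-8"))
```

Calling `load_tokens()` from the directory containing `.tokens` would return a dictionary with the keys `deepl`, `google`, and `azure`.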
## Other notes
- You can find more information about generating these API keys here:
https://translatepress.com/docs/automatic-translation/generate-google-api-key/
- The presentation at the MSU Digital Humanities Conference is here:
https://www.youtube.com/watch?v=56r8MlAFBVg&ab_channel=MSUDigitalHumanities
- Licence: CC-BY 4.0
Contact:
- Anna-Maja Kazarian: kazarianannamaja@gmail.com
- Shuai Wang: shuai.wang@vu.nl
The source code and data remain closed and are available upon request.
References:
[1] Combining strategies for tagging and parsing Arabic, Maytham Alabbas and Allan Ramsay.
Related works
- Is derived from
- Software: https://github.com/Multilingual-Homosaurus/MSU-GlobalDH (URL)