Machine translation quality estimation literature review
Description
Machine translation (MT) systems enable the automated translation of text from a source to a target language. Even though the performance of MT systems has improved significantly in recent years, they are not immune to errors. Errors that alter the meaning of a translation may be rare, but they remain critical. Additionally, the increased fluency of MT output can make errors harder to identify. This motivates the need to assess the quality of translations, particularly in use cases where the impact of an incorrect translation can be significant.
Quality estimation (QE) predicts the quality of a translation given only the source language text and the target machine-translated text. QE is usually carried out at either word-level, predicting a binary ‘OK’ or ‘BAD’ tag for each word in the translated text, or at sentence-level, predicting a single score that represents the overall quality of the translated sentence. The attraction of QE is that it can be conducted at run-time without the need for a gold-standard reference translation.
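To make the sentence-level setting concrete, the sketch below scores a (source, translation) pair with an off-the-shelf reference-free QE model. It assumes the open-source `unbabel-comet` package and the `Unbabel/wmt22-cometkiwi-da` checkpoint, which are illustrative choices rather than systems endorsed by this report (the checkpoint is gated on the Hugging Face Hub, so access may need to be requested first).

```python
# Minimal sentence-level QE sketch using the unbabel-comet package
# (pip install unbabel-comet). Model choice is an assumption, not a
# recommendation from this report.
from comet import download_model, load_from_checkpoint

# CometKiwi is reference-free: it scores (source, MT) pairs without
# needing a gold-standard reference translation.
model_path = download_model("Unbabel/wmt22-cometkiwi-da")
model = load_from_checkpoint(model_path)

data = [
    {"src": "Der Vertrag wurde gestern unterzeichnet.",
     "mt": "The contract was signed yesterday."},
]

# predict() returns per-sentence quality scores and a corpus-level average.
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)        # one score per input sentence
print(output.system_score)  # average over the batch
```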
The task is immediately challenging as defining translation quality can be subjective and requires weighing several dimensions against each other, such as fluency, style and accuracy. Additionally, error severity is likely use-case dependent. If an application of MT is to communicate on a professional level with customers or clients, then stylistic or minor grammatical errors might be considered major, even when they do not affect the meaning of the text. Quality scores can also differ between individual annotators given the same annotation guidelines, even when using professional translators and linguists. It is therefore common practice to use protocols for reaching annotator consensus and to derive a single ground-truth score.
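One common way to derive such a ground-truth score is to z-normalise each annotator's raw scores before averaging, so that systematically harsh or lenient scorers do not skew the result. The snippet below is a minimal sketch of that idea with invented numbers.

```python
# Sketch: deriving a single ground-truth score from multiple annotators by
# z-normalising each annotator's raw scores, then averaging per segment.
import statistics

# Hypothetical raw 0-100 adequacy scores: annotator -> per-segment scores.
raw = {
    "annotator_a": [70, 85, 90],
    "annotator_b": [40, 60, 75],  # systematically harsher scorer
}

def z_normalise(scores):
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)
    return [(s - mean) / stdev for s in scores]

normalised = {a: z_normalise(s) for a, s in raw.items()}

# Ground truth per segment: mean of the annotators' z-scores.
ground_truth = [statistics.mean(seg) for seg in zip(*normalised.values())]
print(ground_truth)
```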
The majority of state-of-the-art (SOTA) QE systems are ensembles of models that fine-tune an encoder-only pre-trained language model (PLM) on QE data. These models treat the MT system as a black box, as they do not require any information from the MT model other than the output translation. In the last year, there have also been some QE systems that use a decoder-only large language model (LLM) to make predictions from prompts. This approach is not yet dominant, but it may attract more attention in future years as it has the advantage of not requiring large amounts of training data.
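A minimal sketch of the fine-tuning recipe follows, assuming Hugging Face `transformers` and XLM-RoBERTa as the base PLM; these are illustrative choices, and the exact encoders and heads vary across the systems surveyed.

```python
# Sketch: fine-tune an encoder-only PLM as a regressor over
# (source, translation) pairs for sentence-level QE.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=1,  # single regression head -> sentence-level quality score
    problem_type="regression",
)

# Source text and MT output are encoded together as a sentence pair.
batch = tokenizer(
    ["Der Vertrag wurde gestern unterzeichnet."],
    ["The contract was signed yesterday."],
    return_tensors="pt",
    truncation=True,
)
labels = torch.tensor([[0.85]])  # hypothetical gold quality score

# MSE loss against the gold score; fine-tuning minimises this over QE data.
loss = model(**batch, labels=labels).loss
loss.backward()
```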
The performance of QE systems is typically measured using the correlation between their predictions and the ground-truth scores on a test set. SOTA QE models can achieve relatively high correlations for some language pairs, even with little or no training data. In recent years, fine-tuned QE models have gained performance by scaling up the size of the PLM used as a base model, although the trade-off between size and performance is not clear. As the amount of data available for training such models is limited, many approaches also gain performance by augmenting the existing datasets with synthetic data, in particular to better represent records that contain critical errors, which alter the meaning of the original text.
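The evaluation itself is straightforward: compute the Pearson and/or Spearman correlation between predicted and gold scores on a held-out test set. A sketch with invented numbers, using `scipy`:

```python
# Sketch: correlate system predictions with ground-truth scores.
from scipy.stats import pearsonr, spearmanr

ground_truth = [0.90, 0.35, 0.70, 0.10, 0.55]
predictions  = [0.85, 0.40, 0.60, 0.20, 0.50]

# Pearson measures linear correlation; Spearman measures rank agreement
# and is less sensitive to the scale of the predicted scores.
print("Pearson:  %.3f" % pearsonr(ground_truth, predictions)[0])
print("Spearman: %.3f" % spearmanr(ground_truth, predictions)[0])
```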
Alternative formulations of the QE task exist, ranging from predicting a binary sentence-level label (indicating the presence of a critical error) to producing explainable output that identifies error spans in the text with severity labels. Such formulations may be capable of producing more user-friendly outputs than a single score. Uncertainty quantification methods can also be used to output an uncertainty score or confidence interval alongside the prediction. All of these tasks remain fairly underdeveloped compared to the core QE task.
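As a sketch of what such explainable output might look like, the hypothetical schema below represents error spans with severity and category labels (in the spirit of MQM-style annotation) instead of a single score:

```python
# Sketch of an explainable QE output: error spans with severity labels
# over the MT output. The schema is illustrative, not a standard format.
from dataclasses import dataclass

@dataclass
class ErrorSpan:
    start: int     # character offset into the MT output
    end: int
    severity: str  # e.g. "minor", "major", "critical"
    category: str  # e.g. "accuracy/mistranslation"

mt_output = "The contract was signed tomorrow."
spans = [
    ErrorSpan(start=24, end=32, severity="critical",
              category="accuracy/mistranslation"),  # "tomorrow" flips meaning
]

for s in spans:
    print(f'{s.severity}: "{mt_output[s.start:s.end]}" ({s.category})')
```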
In conclusion, QE can produce relatively high correlations with ground-truth scores in many situations. However, high performance is not guaranteed across all tasks and language pairs, and there are many considerations relevant to deploying a QE system in a real-world scenario. Test data needs to accurately reflect the task at hand and sufficiently represent the language pairs and text domains of interest. Additionally, it needs to be annotated with the specific use case in mind, following the appropriate definition of quality and providing the required granularity of outputs.
Files
knight-et-al-2024.pdf (1.4 MB)
Additional details
- Basis of record: Technical Reports
- Catalog number: Turing Technical Report No. 1