This PhD dissertation focuses on improving terminology extraction and alignment for applications
in the translation industry. It explores three key use cases where these techniques
benefit language professionals: creating client-specific terminology lists from large parallel
corpora (i.e. translation memories), building domain-specific terminology resources from
comparable corpora, and identifying important domain-specific terms in source documents
prior to translation.
The research starts with the task of bilingual terminology alignment from parallel corpora
found in translation memories. The main contribution is a novel approach called
Phrase-Table-Based Alignment, which uses phrase tables from statistical machine translation
to align terms with greater accuracy at the sub-sentence level. Additionally, the
dissertation introduces TermEnsembler, a terminology extraction and alignment system
developed for an industry partner. It uses ensemble learning to combine seven
alignment methods, with an evolutionary algorithm searching for the best-performing
combination. TermEnsembler was tested on three industry-specific domains
(Financial, Information Technology, and Automotive) using a precision-focused evaluation
of the top-ranked outputs.
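The idea of combining several alignment methods and tuning the combination with an evolutionary algorithm can be sketched as follows. This is a minimal illustration under stated assumptions, not TermEnsembler itself: the weighted-sum combination, the mutate-and-keep search loop, the function names, and the toy data with three methods (instead of seven) are all hypothetical.

```python
import random

def ensemble_score(scores, weights):
    """Weighted sum of the per-method alignment scores for one candidate pair."""
    return sum(w * s for w, s in zip(weights, scores))

def precision_at_k(candidates, gold, weights, k):
    """Share of the k top-ranked candidate pairs that appear in the gold set."""
    ranked = sorted(candidates,
                    key=lambda c: ensemble_score(c[1], weights),
                    reverse=True)
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for pair, _ in top if pair in gold) / len(top)

def evolve_weights(candidates, gold, n_methods, k=2, generations=50, seed=0):
    """Very simple evolutionary search (assumption): mutate the current best
    weight vector and keep the child if it is at least as precise."""
    rng = random.Random(seed)
    best = [1.0] * n_methods
    best_fit = precision_at_k(candidates, gold, best, k)
    for _ in range(generations):
        child = [max(0.0, w + rng.gauss(0, 0.3)) for w in best]
        fit = precision_at_k(candidates, gold, child, k)
        if fit >= best_fit:
            best, best_fit = child, fit
    return best, best_fit

# Toy, made-up data: (source term, target term) -> scores from three methods.
candidates = [
    (("bank", "banka"), [0.9, 0.2, 0.8]),
    (("loan", "posojilo"), [0.1, 0.9, 0.1]),
    (("bank", "klop"), [0.8, 0.3, 0.1]),
    (("rate", "miza"), [0.7, 0.1, 0.2]),
]
gold = {("bank", "banka"), ("loan", "posojilo")}

weights, fitness = evolve_weights(candidates, gold, n_methods=3)
print(weights, fitness)
```

Because a child replaces the current best only when it is at least as precise, the precision of the top-ranked pairs never decreases during the search.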
Next, the dissertation addresses bilingual terminology alignment from comparable corpora.
We start by replicating an existing machine-learning approach and then add two
groups of features: dictionary-based features, which leverage bilingual dictionaries
or word alignments, and cognate-based features, which leverage words of shared
etymological origin that look similar across languages. Later work expanded on this
by implementing word alignments based on cross-lingual word embeddings and sentence
embeddings. The methods were evaluated on the Eurovoc thesaurus, a multilingual
thesaurus of EU-related terminology, using standard evaluation metrics, and were
also adapted for keyword alignment in the media industry.
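As an illustration of what a cognate-based feature might look like, the sketch below computes two surface-similarity scores for a candidate term pair. The specific features (normalized edit-distance similarity and common-prefix ratio) and all function names are assumptions chosen for illustration; the abstract does not list the exact features used.

```python
def levenshtein(a, b):
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def edit_similarity(a, b):
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest

def common_prefix_ratio(a, b):
    """Length of the shared prefix divided by the longer string's length."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    longest = max(len(a), len(b))
    return n / longest if longest else 1.0

def cognate_features(source_term, target_term):
    """Hypothetical feature dictionary for one candidate term pair."""
    s, t = source_term.lower(), target_term.lower()
    return {
        "edit_similarity": edit_similarity(s, t),
        "common_prefix_ratio": common_prefix_ratio(s, t),
    }

print(cognate_features("democracy", "demokracija"))
```

Features of this kind would then be fed, together with dictionary-based ones, into a classifier that decides whether two terms are translations of each other.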
Finally, the dissertation explores two approaches to monolingual terminology extraction
from specialized corpora with a special focus on the Slovenian language. The first
is a machine-learning method combining statistical, linguistic, and contextual features derived
from contextual embeddings. It employs feature engineering to capture termhood
and unithood characteristics, resulting in improved precision and recall metrics over traditional
approaches. The second approach, which outperforms the machine-learning
approach, uses transformer-based models for sequence labeling, assigning a
label to each token in a text sequence. Both approaches were evaluated on the RSDO5
dataset, a specialized dataset with four domains created specifically for terminology extraction
evaluation.
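To make the sequence-labeling formulation concrete, the sketch below turns per-token labels into extracted terms. It assumes a B/I/O labeling scheme (B for the first token of a term, I for a continuation, O for a token outside any term), which is a common convention but is not specified in the abstract; the function name and the example sentence are hypothetical, and the labels are written by hand rather than predicted by a model.

```python
def decode_terms(tokens, labels):
    """Collect multi-word terms from B/I/O token labels (assumed scheme)."""
    terms, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B":
            # A new term starts; close any term that is still open.
            if current:
                terms.append(" ".join(current))
            current = [token]
        elif label == "I" and current:
            # Continue the term that is currently open.
            current.append(token)
        else:
            # "O", or a stray "I" with no open term: close and reset.
            if current:
                terms.append(" ".join(current))
            current = []
    if current:
        terms.append(" ".join(current))
    return terms

tokens = ["The", "phrase", "table", "improves", "term", "alignment", "."]
labels = ["O", "B", "I", "O", "B", "I", "O"]
print(decode_terms(tokens, labels))
```

In a full system the labels would come from a fine-tuned transformer model; only the decoding step is shown here.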
These findings provide practical improvements for the translation industry. By adopting
these new approaches, the industry can effectively leverage the existing language resources
at its disposal to achieve more accurate and consistent specialized terminology.
Overall, they offer a step forward in making terminology extraction and alignment more reliable
and efficient, supporting better outcomes for language professionals and their clients.