Automatic terminology extraction, also known as automatic term extraction (ATE), is
a natural language processing (NLP) task that identifies specialized terminology from
domain-specific corpora. ATE is often used for terminographic tasks (e.g., the creation
of specialized dictionaries) and contributes to several complex downstream tasks (e.g.,
machine translation and information retrieval). Over the last forty years, considerable
progress has been made in automatic terminology extraction; however, several major challenges
persist.
At the beginning of our studies, most ATE systems relied on either shallow machine
learning (ML) or deep neural networks (e.g., transformers). While the former techniques
suffered from time-consuming feature engineering steps and difficulties in generalizing to
new unseen domains, the latter approaches solved these problems by framing the task as
a binary sequence classification problem with transformer variants. However, generating
all possible n-grams from each sentence across all documents for training purposes still
poses a computational challenge. Moreover, current systems focus only on extracting
non-nested terms in fully supervised settings, leaving a gap in capturing nested terms
and in handling scenarios where annotated data is scarce. Therefore, in this thesis,
we address these challenges in ATE.
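To make the computational concern concrete, the following minimal Python sketch enumerates
the n-gram candidates of a single sentence. It is an illustration only, not the thesis
implementation: the function name `ngram_candidates`, the example sentence, and the maximum
length of 3 are assumptions chosen for the example.

```python
from typing import List, Tuple


def ngram_candidates(tokens: List[str], max_len: int) -> List[Tuple[str, ...]]:
    """Enumerate every contiguous n-gram of length 1..max_len in one sentence."""
    return [
        tuple(tokens[start:start + n])
        for n in range(1, max_len + 1)
        for start in range(len(tokens) - n + 1)
    ]


# Hypothetical example sentence (not taken from any corpus used in the thesis).
sentence = "the wind turbine converts kinetic energy".split()
candidates = ngram_candidates(sentence, max_len=3)

# An n-gram classifier scores each candidate separately,
# whereas a token classifier predicts one label per token.
print(len(candidates), "n-gram candidates vs.", len(sentence), "token labels")
```

For a sentence of n tokens and a maximum length k (with k at most n), the number of candidates
is the sum of (n - i + 1) for i from 1 to k, so it grows with both sentence length and k and
must be repeated for every sentence in every document.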
We focus on the following aspects. First, in scenarios where sufficient annotated data is
available for fully supervised settings, we investigate the improvement of neural approaches
by framing the task as a token classification problem (so-called sequence labeling),
using transformers as a base model with additional representations (e.g., label semantics)
and modified layers (e.g., mixture of experts or MoE, RNN). Furthermore, we build on
current systems by introducing NOBI, a novel annotation scheme that better captures
nested terms. Second, in scenarios where well-annotated data in the target language
is limited but data from other languages is sufficient for fully supervised settings, we
propose cross-lingual and multilingual learning to exploit the potential of transfer learning
between languages, especially for languages with fewer resources. Third, in scenarios where
both well-annotated data and computational resources are limited, we propose a novel
pipeline called LlamATE that uses large language models (LLMs) as predictors to query
candidate terms without additional fine-tuning steps, relying only on few-shot
demonstrations and self-verification steps.
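As a minimal sketch of the token classification framing mentioned above, the Python example
below converts annotated term spans into per-token labels and decodes labels back into terms.
It uses the standard flat BIO tag set as an assumption for illustration; it is not the NOBI
scheme, whose label set is not described in this abstract, and the function names and example
sentence are hypothetical.

```python
from typing import List, Tuple


def bio_labels(tokens: List[str], term_spans: List[Tuple[int, int]]) -> List[str]:
    """Label each token B (term start), I (inside a term) or O (outside).

    term_spans holds (start, end) token indices, end exclusive, and is
    assumed to contain no overlapping spans.
    """
    labels = ["O"] * len(tokens)
    for start, end in term_spans:
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels


def decode_terms(tokens: List[str], labels: List[str]) -> List[str]:
    """Recover term strings from a BIO label sequence."""
    terms, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B":
            if current:
                terms.append(" ".join(current))
            current = [token]
        elif label == "I" and current:
            current.append(token)
        else:
            # "O", or a stray "I" with no open term: close any open term.
            if current:
                terms.append(" ".join(current))
            current = []
    if current:
        terms.append(" ".join(current))
    return terms


tokens = "the wind turbine converts kinetic energy".split()
labels = bio_labels(tokens, [(1, 3), (4, 6)])
print(labels)
print(decode_terms(tokens, labels))
```

Because each token receives exactly one label, a flat BIO sequence cannot encode two
overlapping terms at once (for example, "turbine" inside "wind turbine"), which is the kind
of limitation an annotation scheme for nested terms has to address.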
Our study comes to the following conclusions. First, token classification approaches
(e.g., using XLMR) are valid and promising methods for fully supervised learning in terminology
extraction, as they outperform non-sequential and binary sequence classifiers
and reduce the computational challenges mentioned in the benchmarks. Integrating a
MoE layer on top of the deep neural model (e.g., (m)DeBERTa) improves performance
compared to the baseline with a dense token classification head. Using the NOBI annotation
scheme for token classifiers trained on datasets with a significant number
of nested terms yields a visible improvement on single-word terms. Second, our results
for token classification with limited annotated data in a given language demonstrate the
promising effect of multilingual and cross-lingual cross-domain learning, which is particularly
important when transferring from rich- to lesser-resourced languages. Finally,
our pipeline, LlamATE, suggests the potential of LLMs with few-shot demonstrations and
self-verification in learning from a few examples in the same domain even without explicitly
naming the domain, as well as the potential in transferring knowledge from well-covered
languages (i.e., English) to less-represented languages in pre-trained LLMs (i.e., French,
Dutch). Even though LLM-based approaches with few-shot demonstrations are not
a substitute for fully supervised models, they can provide solutions with relatively good
performance without the need for manually annotated data.
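The general idea of few-shot prompting followed by self-verification can be sketched in
Python as below. This is a hypothetical illustration and not the LlamATE prompt format: the
prompt wording, the function names `build_extraction_prompt` and `build_verification_prompt`,
and the demonstration sentence are all assumptions, and the call to an actual LLM is
deliberately left out.

```python
from typing import List, Tuple


def build_extraction_prompt(
    demonstrations: List[Tuple[str, List[str]]], sentence: str
) -> str:
    """Assemble a few-shot prompt: instruction, demonstrations, then the query."""
    lines = ["Extract the domain-specific terms from each sentence."]
    for demo_sentence, demo_terms in demonstrations:
        lines.append(f"Sentence: {demo_sentence}")
        lines.append("Terms: " + "; ".join(demo_terms))
    lines.append(f"Sentence: {sentence}")
    lines.append("Terms:")  # the model is expected to continue from here
    return "\n".join(lines)


def build_verification_prompt(sentence: str, candidate: str) -> str:
    """Assemble a follow-up prompt asking the model to confirm one candidate."""
    return (
        f"Sentence: {sentence}\n"
        f'Is "{candidate}" a domain-specific term in this sentence? Answer yes or no.'
    )


# Hypothetical demonstration and query (not taken from the thesis data).
demos = [("The wind turbine converts kinetic energy.", ["wind turbine", "kinetic energy"])]
query = "The rotor blade drives the generator."

extraction_prompt = build_extraction_prompt(demos, query)
verification_prompt = build_verification_prompt(query, "rotor blade")
print(extraction_prompt)
print(verification_prompt)
```

In such a setup, candidates returned for the first prompt would each be passed through the
second prompt, and only those the model confirms would be kept; no model weights are updated
at any point.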
Keywords: Automatic terminology extraction, Transformers, Token Classification, Seq2seq,
Large Language Model, Prompt Engineering, In-context Learning, Llama2, ChatGPT.