Doctoral dissertation

Combining neural and symbolic representations in natural language processing

Author(s): Matej Martinc (Author), Senja Pollak (Supervisor)

Thesis defense date: 07.06.2022

Organization: MPŠ - Mednarodna podiplomska šola Jožefa Stefana

PID: 20.500.12556/ReVIS-13882

Abstract

The thesis presents a novel representation learning framework, combining neural and symbolic
text representations, and demonstrates its utility for tackling diverse natural language
processing problems. The proposed approach, avoiding the deficiencies of purely symbolic
and purely neural methods, can be applied for the generation of efficient text representations.
Its usefulness is demonstrated on three use cases: author profiling, readability
detection and keyword extraction.
First, we focus on the problem of author profiling and argue that the semantic modelling
in existing state-of-the-art approaches, most of which still rely on extensive feature
engineering, could be improved by employing two strategies. The first one involves adding
symbolic semantic features based on word taxonomies to bag-of-n-grams features. This
approach shows good results when tested on a number of author profiling tasks, that is,
predicting the gender, age and personality of the author of the text. The second strategy
consists of combining the bag-of-n-grams features with neural features derived from a
convolutional neural network and is tested on the task of language variety detection. While
both approaches manage to outperform state-of-the-art methods, we argue that the second,
hybrid neuro-symbolic strategy is preferable since it does not require external resources and
is therefore easier to apply to less-resourced languages other than English.
We next shift our focus to the problem of readability prediction, where we propose
a novel Ranked Sentence Readability Score, in which statistics derived from a neural
language model are combined with shallow symbolic readability indicators that consider
simple text statistics. The main novelty of the approach is the use of a neural language
model as a standalone unsupervised readability predictor. We
argue this is possible since neural language models tend to capture much more information
than traditional n-gram models and also model long-term dependencies. Through language
model statistics, the proposed readability formula also considers background knowledge and
discourse cohesion and therefore avoids the reductionism of traditional readability formulas.
Since neural language models can be trained, the formula can also be adapted to a
specific language and domain. We show that this results in robust performance of the
formula, which correlates well with gold standard readability scores across different
genres and languages.
The final task we tackle is keyword extraction. We propose a transfer learning technique,
in which a transformer-based keyword tagger is first pretrained as a language model
on a large corpus and then fine-tuned on a small corpus with manually labelled keywords,
reducing the amount of labelled data required to train the model successfully.
We propose several modifications to the transformer architecture in order to
adapt it to the task at hand and improve performance. We show that the proposed model
offers performance comparable to state-of-the-art neural models while requiring only a
fraction of the manually labelled data. Finally, we combine the neural model with a symbolic
unsupervised TF-IDF-based keyword detector in order to improve recall and make the
system suitable for use as a recommendation system in a media house environment.

Metadata

Work Type	Doctoral dissertation
Language	English
Organization	MPŠ - Mednarodna podiplomska šola Jožefa Stefana
PID	20.500.12556/ReVIS-13882
COBISS ID	123834115
UDK	81'322:004.8(043.3)
Thesis defense date	07.06.2022

Attachments

Attachment - academic_work_attachments/Matej_Martinc… (6.7 MB) MD5: e644197cb01c4c327409d05384718995
