The thesis presents a novel representation learning framework that combines neural and
symbolic text representations, and demonstrates its utility for tackling diverse natural language
processing problems. The proposed approach, which avoids the deficiencies of purely symbolic
and purely neural methods, can be used to generate efficient text representations.
Its usefulness is demonstrated on three use cases: author profiling, readability
detection and keyword extraction.
First, we focus on the problem of author profiling and argue that the semantic modelling
in existing state-of-the-art approaches, which in most cases still rely on extensive feature
engineering, could be improved by employing two strategies. The first one involves adding
symbolic semantic features based on word taxonomies to bag-of-n-grams features. This
approach shows good results when tested on a number of author profiling tasks, that is,
predicting the gender, age and personality of the author of the text. The second strategy
consists of combining the bag-of-n-grams features with neural features derived from a
convolutional neural network and is tested on the task of language variety detection. While
both approaches manage to outperform state-of-the-art methods, we argue that the second,
hybrid neuro-symbolic strategy is preferable since it does not require external resources and
is therefore easier to employ on less-resourced languages other than English.
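As a rough illustration of the first strategy above, the following minimal sketch appends taxonomy-derived semantic features to bag-of-n-grams counts. It is not the implementation used in the thesis: the `TOY_TAXONOMY` dictionary, the `hybrid_features` function and the feature-name prefixes are hypothetical, and a real system would draw hypernyms from an external word taxonomy.

```python
from collections import Counter

# Hypothetical toy taxonomy mapping a word to one hypernym (assumption:
# a real system would query an external word taxonomy instead).
TOY_TAXONOMY = {
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
    "bus": "vehicle",
}


def word_ngrams(tokens, n):
    """Return all contiguous word n-grams of length n as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def hybrid_features(text, max_n=2):
    """Count word n-grams (n = 1..max_n) and add hypernym counts."""
    tokens = text.lower().split()
    features = Counter()
    # Symbolic surface features: bag of word n-grams.
    for n in range(1, max_n + 1):
        for gram in word_ngrams(tokens, n):
            features["ngram=" + gram] += 1
    # Symbolic semantic features: one count per hypernym found in the taxonomy.
    for token in tokens:
        hypernym = TOY_TAXONOMY.get(token)
        if hypernym is not None:
            features["hypernym=" + hypernym] += 1
    return features


features = hybrid_features("The dog chased the cat")
```

In this sketch, "dog" and "cat" both contribute to the same `hypernym=animal` feature, which is the kind of generalisation that plain n-gram counts cannot express.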
We next shift our focus to the problem of readability prediction, where we propose
a novel Ranked Sentence Readability Score, in which statistics derived from a neural
language model are combined with shallow symbolic readability indicators that consider
simple text statistics. The main novelty of the approach is the use of a neural language
model as a standalone unsupervised readability predictor. We
argue this is possible since neural language models tend to capture much more information
than traditional n-gram models and also model long-term dependencies. Through language
model statistics, the proposed readability formula also considers background knowledge and
discourse cohesion and therefore avoids the reductionism of traditional readability formulas.
Moreover, since neural language models are trainable, the formula can also be adapted to a
specific language and domain. We show that this results in robust performance of the
formula, which offers good correlation with gold standard readability scores across different
genres and languages.
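The idea of pairing a language model statistic with a shallow text statistic can be sketched as follows. This is not the Ranked Sentence Readability Score itself: to stay self-contained, an add-one-smoothed unigram model stands in for the neural language model, and the function names `train_unigram`, `mean_nll` and `mean_word_length` are hypothetical.

```python
import math
from collections import Counter


def train_unigram(corpus_tokens):
    """Fit an add-one-smoothed unigram model (a stand-in for a neural LM)."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen words

    def prob(token):
        return (counts[token] + 1) / (total + vocab)

    return prob


def mean_nll(tokens, prob):
    """Language model statistic: mean negative log-likelihood per token."""
    return -sum(math.log(prob(t)) for t in tokens) / len(tokens)


def mean_word_length(tokens):
    """Shallow symbolic indicator: average word length in characters."""
    return sum(len(t) for t in tokens) / len(tokens)


prob = train_unigram("the cat sat on the mat the cat ran".split())
easy = "the cat sat".split()
hard = "quantum entanglement persists".split()
print(mean_nll(easy, prob), mean_word_length(easy))
print(mean_nll(hard, prob), mean_word_length(hard))
```

Both indicators come out higher for the second sentence in this toy setting. No labelled readability data is involved, which is what makes a language model usable as an unsupervised signal.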
The final task we tackle is keyword extraction. We propose a transfer learning technique,
in which a transformer-based keyword tagger is first pretrained as a language model
on a large corpus and then fine-tuned on a small corpus with manually labelled keywords
in order to reduce the amount of labelled data required for successful training of
the model. We propose several modifications to the transformer architecture in order to
adapt it to the task at hand and improve performance. We show that the proposed model
offers performance comparable to the state-of-the-art neural models while requiring only a
fraction of manually labelled data. Finally, we combine the neural model with a symbolic
unsupervised TF-IDF-based keyword detector in order to improve recall and make the
system suitable for use as a recommendation system in a media house environment.
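One simple way to combine the two keyword sources, shown here only as a minimal sketch under stated assumptions, is to keep the neural tagger's keywords and fill the remaining slots with top-ranked TF-IDF terms. The functions `tfidf_keywords` and `merge_keywords`, the smoothed IDF variant and the fill-up rule are all hypothetical, and the neural tagger's output is represented by a hand-written list.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus, k):
    """Rank the terms of one document by TF-IDF and return the top k."""
    n_docs = len(corpus)
    tf = Counter(doc_tokens)

    def idf(term):
        # Smoothed inverse document frequency (one common variant).
        df = sum(1 for doc in corpus if term in doc)
        return math.log((1 + n_docs) / (1 + df)) + 1.0

    # Sort by descending score; ties are broken alphabetically.
    ranked = sorted(tf, key=lambda t: (-tf[t] * idf(t), t))
    return ranked[:k]


def merge_keywords(neural, symbolic, limit):
    """Keep neural keywords first, then add unseen TF-IDF terms up to limit."""
    merged = list(neural)
    for keyword in symbolic:
        if keyword not in merged and len(merged) < limit:
            merged.append(keyword)
    return merged


corpus = [
    ["election", "results", "announced", "today"],
    ["weather", "today", "sunny"],
    ["election", "campaign", "election", "budget"],
]
neural_output = ["campaign"]  # placeholder for the tagger's predictions
symbolic_output = tfidf_keywords(corpus[2], corpus, k=2)
suggestions = merge_keywords(neural_output, symbolic_output, limit=3)
```

Filling up to a fixed number of suggestions favours recall over precision, which suits a setting where an editor reviews the recommended keywords before they are used.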