Doctoral dissertation

Using machine learning methods for analyzing sequential textual data

Author(s): Erik Novak (Author), Dunja Mladenić (Supervisor)

Thesis defense date: 29.10.2024

Organization: MPŠ - Mednarodna podiplomska šola Jožefa Stefana

PID: 20.500.12556/ReVIS-13685

Abstract

This thesis investigates machine learning methods for analyzing sequential textual data.
Sequential textual data refers to text collected in a specific order where the sequence is
significant. Examples include (1) sentences, where word usage and order must adhere to the
rules of grammar, (2) news reporting on events happening at different times, and (3) texts
related to financial stock prices, reporting on the financial market dynamics. Analyzing
this data helps us understand patterns, make predictions, and enhance decision-making for
various tasks.

We first focus on sentence similarity based on sentence structure and propose two
methods, Language Model Earth Mover’s Distance (LM-EMD) and Order-Preserving Wasserstein
Score (OPWScore), that consider each word’s contextual meaning and position in
the sentence. The LM-EMD model was developed for cross-lingual information retrieval
and measures the relevance of a document to a given query. The OPWScore metric evaluates
text generation approaches, such as machine translation models. Both methods use
language models to create contextual word embeddings, which are then used to
measure the similarity of the texts with optimal transport, specifically the Wasserstein distance.
OPWScore also restricts its distance calculation to words in similar positions, emphasizing
the role of word placement. The methods are evaluated and compared with other models
and metrics that consider sentence structure. The results show that sentence structure contributes
to fluency-based performance while preserving the adequacy-related performance
of the models.
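
A minimal sketch of the idea behind position-aware optimal transport between two sentences, not the thesis implementation. The function name `toy_wasserstein`, the `max_shift` parameter, the cosine cost, and the toy two-dimensional vectors are all assumptions made for illustration. The sketch assumes two sequences of equal length with uniform token weights, in which case an optimal transport plan is a one-to-one matching that can be found by brute force for short inputs. Setting `max_shift` forbids matching tokens whose positions differ by more than that many places, which is one simple way to make word order matter.

```python
from itertools import permutations
from math import inf, sqrt


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors."""
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("embeddings must be non-zero")
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (norm_u * norm_v)


def toy_wasserstein(src, tgt, max_shift=None):
    """Exact transport cost between two equal-length embedding sequences.

    Assumption: every token carries the same mass 1/n, so the optimal plan
    is a permutation and can be found by enumerating all matchings.
    If max_shift is given, token i may only be matched to token j when
    |i - j| <= max_shift (a hypothetical position constraint).
    Returns inf when no matching satisfies the constraint.
    """
    n = len(src)
    if n == 0 or n != len(tgt):
        raise ValueError("sequences must be non-empty and of equal length")
    cost = [[cosine_distance(u, v) for v in tgt] for u in src]
    best = inf
    for perm in permutations(range(n)):
        if max_shift is not None and any(
            abs(i - j) > max_shift for i, j in enumerate(perm)
        ):
            continue  # this matching moves a token too far
        total = sum(cost[i][j] for i, j in enumerate(perm)) / n
        best = min(best, total)
    return best


# Toy "contextual embeddings": the second sentence swaps the two tokens.
sentence_a = [[1.0, 0.0], [0.0, 1.0]]
sentence_b = [[0.0, 1.0], [1.0, 0.0]]

unconstrained = toy_wasserstein(sentence_a, sentence_b)
position_locked = toy_wasserstein(sentence_a, sentence_b, max_shift=0)
print(unconstrained, position_locked)
```

With `max_shift=None` the swapped tokens are matched at zero cost, while `max_shift=0` forces position-by-position matching and the cost rises to 1.0, so the reordering is penalized. Brute force is only practical for very short sequences; a real system would use a dedicated optimal transport solver.
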

Next, we explore the use of neural network models for online news clustering, considering
the news’ publication time. We first introduce a methodology for creating novel
news clustering data sets. It automates the news collection and clustering process, thus
reducing the time and resources required for evaluators to annotate the data sets manually.
The methodology is used to create a cross-lingual news collection covering the 2021
Tokyo Olympics. The collection consists of articles written in different languages and labeled
by the events they report on. We then develop a new online clustering algorithm
called Wasserstein-based news Article Clustering (WAC). This two-stage, distance-based
algorithm uses Wasserstein distance to analyze the contextual and temporal similarities
between clusters and decide when to merge them. The algorithm is tested on two data sets
and compared with other online news clustering algorithms, showing that WAC performs
comparably to the best-performing supervised algorithms without requiring any prior fine-tuning.

Finally, we define multi-modal data fusion of heterogeneous data sources. We focus on
predicting stock market dynamics by combining text and data streams. We develop four
methods for including text information in time series forecasting, which are then used to
predict the stock closing price and daily volatility dynamics. We also test different textual
representations. The experiments show that including multi-dimensional text representations
can improve predictions when the input data is appropriately processed and the right
text inclusion strategy is used.

Metadata

Work Type	Doctoral dissertation
Language	English
Organization	MPŠ - Mednarodna podiplomska šola Jožefa Stefana
PID	20.500.12556/ReVIS-13685
COBISS ID	213612803
UDK	004.85(043.3)
Thesis defense date	29.10.2024

Keywords

machine learning analyzing sequential textual data algorithms

Attachments

Attachment - academic_work_attachments/Erik_Novak_Ph… (8.4 MB) MD5: 8b118a54d95ecc8bcce7902ec3fd93a1
