Feature construction is a fundamental aspect of machine learning. It encompasses both feature engineering, the manual design of features by domain experts, and representation learning, the automated discovery of useful data representations during model construction. Its goal is to transform raw data into a more suitable …
This PhD dissertation focuses on improving terminology extraction and alignment for applications in the translation industry. It explores three key use cases where these techniques benefit language professionals: creating client-specific terminology lists from large parallel corpora (i.e. translation memories), building domain-specific terminology resources from comparable corpora, and identifying important domain-specific …
As ecological, agricultural, and biological disciplines face mounting challenges like biodiversity loss, food chain disruption, and climate change, leveraging machine learning (ML) to process complex and heterogeneous data becomes increasingly vital. This dissertation explores the potential of ML in combination with explainability approaches for enhancing research in life sciences, specifically …
The rapid evolution of sensor technologies, particularly within the Internet of Things (IoT) domain, has led to an era dominated by massive real-time data flows. This transition has created a need for novel data processing methodologies that facilitate the shift from traditional offline batch analytics to agile real-time predictions and …
In this thesis, we introduce novel methods for equation discovery (ED), based on the use of probabilistic grammars. ED and symbolic regression address the task of finding a symbolic mathematical model that best describes observed data. Models can be as simple as an algebraic equation or as complex as a …
Characterization of the indoor radio environment (RE) is a prerequisite for advances in the design and optimization of next-generation indoor wireless networks and for the construction of a digital twin of the building. The need for comprehensive and accurate indoor characterization will be evident in the future hyper-connected mixed real-virtual …
Contaminants of emerging concern (CECs), a subgroup of organic compounds of natural or synthetic origin, and their degradation and transformation products (TPs) are the eco-exposome (EE) constituents of utmost importance, given their potentially harmful effects on humans, biota, and the environment. Their identification, quantification, and continued investigation into their environmental …
With the resurgence of neural network-based learning in the last decade, machine learning methods are becoming critical components of many real-life intelligent systems. However, while being able to learn effectively and at scale, such systems are often non-interpretable and unable to exploit existing symbolic background knowledge. The paradigm that offers …
Most machine learning, data mining, and statistical methods rely on the assumption that the analyzed data are independent and identically distributed (i.i.d.). More specifically, the individual examples in the training data are assumed to be drawn independently of one another from the same probability distribution. However, cases where this …
In language technologies, syntactic parsing is one of the possible intermediate steps of text analysis in applications such as machine translation, information extraction, and question answering. Syntactic trees are often used to represent the structure of text. In recent decades, the dependency framework has become a popular syntactic representation, …