Doctoral dissertation

Annotation of semi-polar organic contaminants by using gas chromatography coupled to mass spectrometry and machine learning

Author(s): Milka Ljoncheva (Author), Tina Kosjek (Supervisor), Sašo Džeroski (Co-Supervisor)

Thesis defense date: 03.10.2022

Organization: MPŠ - Mednarodna podiplomska šola Jožefa Stefana

PID: 20.500.12556/ReVIS-13863

Abstract

Contaminants of emerging concern (CECs), representing a subgroup of organic compounds of
natural or synthetic origin, and their degradation and transformation products (TPs), with
potentially harmful effects on humans, biota, and the environment, are the eco-exposome (EE)
constituents of utmost importance. Their identification, quantification, and continued
investigation into their environmental behavior significantly increase our knowledge of their
impact on the environment. These challenging tasks require the use of state-of-the-art analytical
techniques involving gas chromatography (GC) and liquid chromatography (LC) coupled with
mass spectrometry (MS).
While LC-MS is nowadays most commonly employed, GC-MS remains a powerful tool that
offers reproducible, sensitive, and relatively low-cost identification and quantification of a broad
array of structurally diverse compounds. The range of compounds amenable to GC can also be
significantly extended through the derivatization of semi-volatile compounds prior to analysis.
The most common derivatization method is silylation, which generates trimethylsilyl (TMS) or
tert-butyldimethylsilyl (TBDMS) derivatives. These analytical techniques, together with
compound databases (DB), mass spectral libraries (MSL), computational workflows, and
cheminformatics approaches, provide accurate and reliable compound annotation (CA). In
contrast to LC-MS, however, three aspects of GC-MS are not widely researched: the use of
GC-MS analytical platforms in the de novo annotation of CECs, the use of the resulting spectral
data in cheminformatics-assisted annotation of CECs, and the related challenges regarding
method optimization and stability.
This thesis investigates the annotation of semi-polar organic contaminants using both GC-MS
and machine learning (ML) approaches. The thesis is divided into three parts. The first part
addresses the current state-of-the-art cheminformatics-assisted CA approaches. Here, we define
three crucial cheminformatics tasks in eco-exposome annotation (EEA): molecular formula (MF)
assignment, compound prioritization, and CA. A novel methodological classification of CA
approaches is provided, along with an assessment of their ability to annotate EE constituents.
The second part of the thesis addresses the generation of GC-electron impact ionization (EI)-MS
spectral datasets for developing, validating, and evaluating cheminformatics and ML-based
CA approaches. A comprehensive dataset of GC-EI-MS spectra of TMS and TBDMS derivatives
was derived from the National Institute of Standards and Technology (NIST) 17 Mass Spectral
Library [1] and filtered by relevance, molecular weight (Mw), and the quality of the GC-EI-MS
spectra. This filtering resulted in two training datasets: (1) 4,648 GC-EI-MS spectra of TMS
derivatives and (2) 1,883 GC-EI-MS spectra of TBDMS derivatives. Further, two test datasets of
GC-EI-MS spectra of about 100 TMS and 85 TBDMS derivatives of CECs were generated by using
in-house GC-MS analytical methods. This work was followed by applying a supervised ML
approach based on Input Output Kernel Regression (IOKR) to annotate CEC silyl
derivatives from their GC-EI-MS spectra. The IOKR approach correctly ranked 37% and 50% of the
tested CEC-TMS derivatives among the top 10 and top 20 candidates, respectively. The satisfactory
identification rates show that the IOKR approach can be successfully employed for reliable CA
that is faster than manual MSL search approaches.
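
The top-10 and top-20 figures above are rank-based identification rates. As a minimal sketch of how such a rate can be computed (assuming each test spectrum yields the 1-based rank of the correct structure in the candidate list; the function name `top_k_rate` and the example ranks are hypothetical and are not taken from the thesis):

```python
from typing import Sequence


def top_k_rate(ranks: Sequence[int], k: int) -> float:
    """Fraction of test compounds whose correct structure is ranked within the top k.

    `ranks` holds the 1-based rank of the correct candidate for each test spectrum.
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    hits = sum(1 for r in ranks if r <= k)
    return hits / len(ranks)


# Hypothetical ranks for eight test spectra (illustrative only, not thesis data).
example_ranks = [1, 4, 12, 25, 9, 18, 60, 3]
top10 = top_k_rate(example_ranks, 10)
top20 = top_k_rate(example_ranks, 20)
print(f"top-10: {top10:.2f}, top-20: {top20:.2f}")
```

Because every compound in the top 10 is also in the top 20, the rate can only stay the same or grow as k increases, which is consistent with the 37% and 50% reported above.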
The third part of the thesis investigates silylation procedures, particularly the stability of silyl
derivatives of CEC under different storage conditions and their associated measurement
uncertainty (MU). We optimized the derivatization conditions of 70 CECs using
N-methyl-N-(trimethylsilyl)trifluoroacetamide (MSTFA), N,O-bis(trimethylsilyl)trifluoroacetamide
(BSTFA), and BSTFA + 1% trimethylchlorosilane (TMCS) in 36 different
temperature and duration experiments. Further, we tested their stability in a solvent and
artificial wastewater (AWW) extract under relevant storage conditions (25°C, 4°C, and -18°C) for
up to 20 weeks, along with five freeze-thaw cycles. Significant stability issues were
revealed for TMS derivatives of polyhydroxy compounds and estrogen hormones, as well as for
derivatives that degraded to ≤ 85% of their initial concentration after only two freeze-thaw
cycles.
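
The ≤ 85% criterion above compares a later measurement with the initial concentration. A minimal sketch of that check follows, assuming each derivative has one initial and one later measured concentration in the same units; the helper names and the example values are hypothetical and do not reproduce the thesis data or its measurement-uncertainty treatment:

```python
def percent_remaining(initial: float, measured: float) -> float:
    """Measured concentration expressed as a percentage of the initial concentration."""
    if initial <= 0:
        raise ValueError("initial concentration must be positive")
    return 100.0 * measured / initial


def flag_unstable(
    measurements: dict[str, tuple[float, float]], threshold: float = 85.0
) -> list[str]:
    """Names of derivatives at or below `threshold` percent of their initial concentration."""
    return sorted(
        name
        for name, (initial, measured) in measurements.items()
        if percent_remaining(initial, measured) <= threshold
    )


# Hypothetical (initial, after two freeze-thaw cycles) concentrations, arbitrary units.
example = {
    "derivative_A": (100.0, 92.0),
    "derivative_B": (100.0, 80.0),
    "derivative_C": (50.0, 42.5),
}
unstable = flag_unstable(example)
print(unstable)
```

A real stability assessment would also account for the measurement uncertainty mentioned above, which this sketch ignores.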
The results of this thesis are gathered in two published papers and two manuscripts
submitted for peer review. They highlight the importance of silylation conditions in reliable CEC
annotation and quantification and provide insight into the stability profiles of TMS derivatives.
In addition, this thesis demonstrates, for the first time, the successful use of ML with
GC-EI-MS to identify CECs as silyl derivatives. The work also produced comprehensive,
publicly available datasets that are of interest to the ML community for further development
of ML-based CA approaches.
