The thesis addresses the development of novel knowledge discovery scenarios in a
modern data mining platform by utilising principles of service-oriented architecture
with web services, interactive scientific workflows, knowledge discovery ontologies
and automated construction of data mining workflows.
We present the developed Orange4WS platform which upgrades Orange, a mature
open-source data mining toolkit. Orange4WS enables seamless integration of web
services by implementing a widget code generator and provides tools for web service
development, which adhere to the contract-first design principle. These tools are
used to develop web services from different domains, including systems biology, data
mining, text mining, and natural language processing. Furthermore, Orange4WS
integrates the knowledge discovery ontology, an ontology which defines relationships
among the components of knowledge discovery scenarios, and a workflow planner
which enables automated construction of data mining workflows. The applicability
of the Orange4WS platform is demonstrated and evaluated on several use cases.
Two advanced data analysis methodologies from the domain of systems biology were
developed using Orange4WS: the SegMine methodology and a methodology for contrasting
subgroup discovery.
The SegMine methodology enables semantic analysis of gene expression data by
integrating semantic subgroup discovery, interactive hierarchical clustering, and
probabilistic link discovery using the Biomine system. Components of the SegMine
methodology integrate the publicly available Gene Ontology (GO), the Kyoto Encyclopedia
of Genes and Genomes (KEGG), gene-gene interaction data, and several other
public databases. SegMine enables advanced data interpretation and formulation
of research hypotheses by integrating the analysis of experimental data with publicly
available knowledge. The methodology is implemented in Orange4WS as a set
of interactive workflow components and evaluated on two datasets: a well-known
dataset from a clinical trial in acute lymphoblastic leukemia (ALL) and a dataset
on senescence in human mesenchymal stem cells (MSC). The experiments on the
MSC dataset resulted in the formulation of three new scientific
hypotheses.
The developed contrasting subgroup discovery methodology enables mining of subgroups that
cannot be discovered using classical subgroup discovery. The methodology proposes a
three-step approach in which subgroup discovery in the first and last steps is complemented
by an intermediate contrast definition step. The steps of the methodology,
its differences from standard subgroup discovery, and examples of set-theoretic
functions for defining contrast classes are presented and illustrated on a simple use
case. The methodology was applied to a systems biology domain. More specifically,
the results of applying the methodology to a time series gene expression dataset for
virus-infected Solanum tuberosum (potato) plants are presented and evaluated by a
domain expert. The proposed methodology is implemented as a set of interactive
workflow components in the Orange4WS platform.
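The intermediate contrast definition step can be illustrated with a minimal sketch. It assumes that the first subgroup discovery step has already produced, for each class (here, hypothetical time points), the set of examples covered by the discovered subgroups. The function names, time-point labels, and gene identifiers below are illustrative assumptions and are not taken from the thesis or from the Orange4WS API; set difference and set intersection merely stand in as two possible set-theoretic functions for defining contrast classes.

```python
from typing import Dict, Set


def unique_to_each(covered: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Contrast by set difference: examples covered in one class and in no other."""
    contrast = {}
    for name, items in covered.items():
        # Union of the examples covered in every other class.
        others = set().union(*(s for other, s in covered.items() if other != name))
        contrast[name] = items - others
    return contrast


def shared_by_all(covered: Dict[str, Set[str]]) -> Set[str]:
    """Contrast by set intersection: examples covered in every class."""
    if not covered:
        return set()
    return set.intersection(*covered.values())


# Hypothetical output of the first subgroup discovery step:
# genes covered by the subgroups found at each time point.
covered_by_timepoint = {
    "day1": {"g1", "g2", "g3"},
    "day3": {"g2", "g3", "g4"},
    "day6": {"g3", "g5"},
}

# Contrast classes that would serve as targets for the final subgroup discovery step.
unique = unique_to_each(covered_by_timepoint)
common = shared_by_all(covered_by_timepoint)
```

In this sketch, the resulting sets would become the class labels for the last subgroup discovery step; which set-theoretic function is appropriate depends on the question posed by the domain expert.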
The thesis also contributes to open-source scientific software. The Orange4WS platform,
the implementations of the SegMine methodology and contrasting subgroup
discovery, as well as other Orange4WS use cases, are available to the general public. This
enables experiment reproducibility, as well as workflow adaptation and enhancement.