The domain of data mining (DM) deals with analyzing different types of data. The data
typically used in data mining is in the format of a single table, with primitive datatypes
as attributes. However, structured (complex) data, such as graphs, sequences, networks,
text, image, multimedia and relational data, are receiving an increasing amount of interest
in data mining. A major challenge is to treat and represent the mining of different types
of structured data in a uniform fashion. A theoretical framework that unifies different
data mining tasks on different types of data can help to formalize the knowledge about
the domain and provide a basis for future research, unification, and standardization. In
addition, automating and supporting the Knowledge Discovery in Databases (KDD) process as a
whole is an important challenge in the domain of data mining. A formalization of the domain
of data mining addresses these challenges. It can directly support the
development of a general framework for data mining, support the representation of the
process of mining structured data, and allow the representation of the complete process of
knowledge discovery.
In this thesis, we propose OntoDM, a modular reference ontology for the domain of data
mining, directly motivated by the need for formalization of the data mining domain. The
OntoDM ontology is designed and implemented by following ontology best practices and
design principles. Its distinguishing feature is that it uses the Basic Formal Ontology (BFO)
as an upper-level ontology and a template, uses a set of formally defined relations from the
Relation Ontology (RO) and other state-of-the-art ontologies, and reuses classes and relations
from the Ontology of Biomedical Investigations (OBI), the Information Artifact Ontology
(IAO), and the Software Ontology (SWO). This ensures compatibility and connections
with other ontologies and allows cross-domain reasoning. The OntoDM ontology
is composed of three modules covering different aspects of data mining: OntoDT, which
supports the representation of knowledge about datatypes and is based on an accepted ISO
standard for datatypes in computer systems; OntoDM-core, which formalizes the key data
mining entities for representing the mining of structured data in the context of a general
framework for data mining; and OntoDM-KDD, which formalizes the knowledge discovery
process based on the Cross Industry Standard Process for Data Mining (CRISP-DM) process
model.
The OntoDT module provides a representation of the datatype entity, defines a taxonomy
of datatype characterizing operations, and a taxonomy of datatype qualities. Furthermore, it
defines a datatype taxonomy comprising classes and instances of primitive datatypes, generated
datatypes (non-aggregate and aggregate datatypes), subtypes, and defined datatypes.
With this structure, the module provides a generic mechanism for representing arbitrarily
complex datatypes.
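As an illustration of how such a mechanism can compose arbitrarily complex datatypes, consider the following minimal Python sketch. It is not the OntoDT implementation: the class names `Datatype`, `PrimitiveDatatype`, and `AggregateDatatype`, the helper functions, and the example datatypes are hypothetical and chosen only to show aggregate datatypes being nested over primitive ones.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Datatype:
    """Base class for every datatype in this sketch (hypothetical name)."""
    name: str


@dataclass(frozen=True)
class PrimitiveDatatype(Datatype):
    """A datatype with no component datatypes, e.g. boolean or real."""


@dataclass(frozen=True)
class AggregateDatatype(Datatype):
    """A generated datatype built from zero or more component datatypes."""
    components: Tuple[Datatype, ...] = ()


def describe(dt: Datatype) -> str:
    """Render a datatype as a nested expression such as record(real, boolean)."""
    if isinstance(dt, AggregateDatatype):
        inner = ", ".join(describe(c) for c in dt.components)
        return f"{dt.name}({inner})"
    return dt.name


def depth(dt: Datatype) -> int:
    """Nesting depth: 0 for primitives, 1 + deepest component for aggregates."""
    if isinstance(dt, AggregateDatatype) and dt.components:
        return 1 + max(depth(c) for c in dt.components)
    return 0


# Illustrative datatypes (assumed for the example, not taken from OntoDT).
real = PrimitiveDatatype("real")
boolean = PrimitiveDatatype("boolean")
record = AggregateDatatype("record", (real, boolean))
sequence_of_records = AggregateDatatype("sequence", (record,))

print(describe(sequence_of_records))
print(depth(sequence_of_records))
```

Because an aggregate datatype accepts any datatype as a component, the same two constructs suffice to describe sequences of records, records of sequences, and deeper nestings.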
The OntoDM-core module formalizes the key data mining entities needed for the representation
of mining structured data in the context of a general framework for data mining.
These include the entities dataset, data mining task, generalization, data mining algorithm,
and others. More specifically, it provides a representation of datasets, and a taxonomy of
datasets based on the type of data. Next, it provides a representation of data mining tasks,
and proposes a taxonomy of data mining tasks, predictive modeling tasks and hierarchical
classification tasks. Furthermore, it provides a representation for generalizations, and
proposes a taxonomy of generalizations and predictive models based on the types of data
and generalization language. Moreover, it provides a representation of data mining algorithms,
proposes a taxonomy of data mining algorithms, predictive modeling algorithms,
and hierarchical classification algorithms, and generalizes the mechanism for representing
data mining algorithms to represent general algorithms in computer science. In addition,
the OntoDM-core module provides a representation of constraints and constraint-based data
mining tasks and proposes a taxonomy thereof. Finally, the module provides a representation
of data mining scenarios that includes data mining scenarios as a specification, data
mining workflows, and the process of executing a data mining workflow.
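The relationships among these key entities can be sketched in Python as follows. This is an illustrative assumption rather than the OntoDM-core formalization: the class and field names, the majority-class learner, and the toy dataset are all hypothetical. The sketch shows an algorithm that addresses a task and, when executed on a dataset, produces a generalization.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Tuple

# One example: (descriptive part, target label). Hypothetical layout.
Example = Tuple[Tuple[float, ...], str]


@dataclass
class Dataset:
    datatype: str              # informal description of the type of data
    examples: List[Example]


@dataclass
class DataMiningTask:
    name: str


@dataclass
class Generalization:
    language: str              # the language the generalization is expressed in
    predict: Callable[[Tuple[float, ...]], str]


@dataclass
class DataMiningAlgorithm:
    name: str
    solves: DataMiningTask
    run: Callable[[Dataset], Generalization]


def majority_class(dataset: Dataset) -> Generalization:
    """Return a constant model that always predicts the most frequent label."""
    label, _count = Counter(lbl for _, lbl in dataset.examples).most_common(1)[0]
    return Generalization("constant model", lambda _x: label)


task = DataMiningTask("predictive modeling")
algorithm = DataMiningAlgorithm("majority class", task, majority_class)
data = Dataset(
    "record(real, real) with a discrete target",
    [((1.0, 2.0), "a"), ((0.5, 1.5), "a"), ((3.0, 0.1), "b")],
)

# Executing the algorithm on the dataset yields a generalization.
model = algorithm.run(data)
print(model.predict((9.0, 9.0)))
```

Keeping the algorithm (a specification) separate from the generalization it outputs mirrors the distinction the module draws between a data mining scenario as a specification and the process of executing it.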
The OntoDM-KDD module supports the representation of data mining investigations.
It provides a representation of data mining investigation by directly extending classes from
the OBI and IAO ontologies. Furthermore, it models each of the phases in a data mining
investigation (such as application understanding, data understanding, data preparation,
modeling, DM process evaluation, and deployment), and their inputs and outputs.
The OntoDM ontology and its three modules (OntoDT, OntoDM-core, and OntoDM-KDD)
were evaluated in order to assess their quality. The evaluation was performed by
assessing the ontology against a set of design principles and best practices, and assessing
whether the competency questions posed in the design phase were implemented in the language
of the ontology. In addition, we provided a domain coverage assessment by comparing
the OntoDM data mining tasks taxonomy with the data mining topic ontology constructed
in a semi-automatic fashion from abstracts of articles from data mining conferences and
journals.
The developed ontology supports a large variety of applications. We demonstrate the
use and the application of the ontology by describing six use cases. The OntoDM ontology
is used for the annotation of data mining algorithms; for the representation of data mining
scenarios; for the annotation of data mining investigations; in cross-domain applications to
support ontology-based representation of QSAR modeling for drug discovery; as a mid-level
ontology by the Expose ontology; and for the annotation of articles containing data mining
terms in combination with text mining tools.
The OntoDM ontology differs from related ontologies in two respects. First, it allows
the representation of the mining of structured data and of the general process of data
mining in a principled way. Second, it is based on a theoretical ontological framework,
which allows it to be connected to other domain ontologies to support cross-domain
applications. The OntoDM ontology is also the first ontology that
supports the representation of the complete process of knowledge discovery.
In the future development of the OntoDM ontology, we plan to focus on several aspects.
First, we would like to align and map our ontology to other upper-level ontologies.
Second, we plan to extend the established ontological framework to represent entities about
components of data mining algorithms, such as distance functions and kernel functions.
Next, we plan to populate the ontology downward with instances. Furthermore, we plan to
extend the representational framework for representing experiments for mining structured
data in the context of experiment databases. Finally, we plan to include more contributors
from the domain of data mining into the development of OntoDM and apply the OntoDM
design principles to the development of ontologies for other areas of computer science.