Data analysis with machine learning methods, when applied to large collections of text data, enables us to
discover new knowledge. This knowledge, once put together, might describe the still unknown connections
among phenomena and thus contribute to the formation of new hypotheses in different fields, including
medicine. Connecting numerous large data sets, text data among them, and analysing them with computer
support may also contribute, in a methodological sense, to the development of e-science. Information
that is related across different contexts is difficult to identify with conventional associative approaches. The
context-crossing associations, however, are the ones often needed for innovative discoveries. Such
associations are called bisociations.
Automated knowledge discovery based on text data sets in the field of medicine is an intriguing
challenge as it requires intensive collaboration with domain experts during the processes of both domain-specific
text analysis and evaluation. Hence an interactive approach is recommended when text mining and
decision support are combined. Also, it is beneficial to apply improved methods of literature mining,
searching indirect connections and bisociative knowledge discovery from extensive text databases such as
MEDLINE. The major aim here is to unravel the still hidden relations between the researched phenomena
and their potential causes. In this process, it is desirable that the experts use appropriate visualization,
as it supports knowledge discovery and the interpretation of results.
The fundamental goal of this thesis is to develop a new methodology for knowledge discovery in text
databases that can improve the existing methods of exploring implicit relationships across different
domains of expertise by providing a more intuitive computer aided search of unexplored links in literature.
To advance the current state of literature-based discovery, we designed and implemented an
innovative literature mining method for semi-automated discovery of hidden relations that is based on rare
pieces of information in a given domain. When these relations are interesting from a medical point of view
and can be verified by medical experts, they represent new pieces of knowledge and can contribute to better
understanding of diseases.
The developed literature mining method called RaJoLink is intended to support biomedical experts in
both the open and the closed discovery processes. In the open knowledge discovery process, hypotheses have to be
generated, while in the closed knowledge discovery process, given hypotheses are tested. By identifying
relations between biomedical concepts in disjoint sets of articles, the method implements Swanson's
ABC model approach. However, the RaJoLink method analyses such relations in a new way and extends
Swanson's ABC model by suggesting how the terms a can be determined in advance, as a result of the open
knowledge discovery process. The main novelty is a semi-automated suggestion of candidates for agents a
that might be logically connected with a given phenomenon c under investigation. The choice of candidates
for a is based on rare terms identified in the literature on the topic c. As rare terms are not part of the
typical range of information that describes the phenomenon under investigation, they might
be considered unusual observations about the phenomenon c. If the literatures on these rare terms have an
interesting term in common, this joint term is declared a candidate for a. Linking terms b between
the literature on a and the literature on c are then searched for in the closed discovery process to provide
supportive evidence for uncovered connections.
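The three steps described above (rare terms, joint terms, linking terms) can be illustrated with a minimal sketch. It is not the RaJoLink implementation: it assumes each article is already reduced to a set of terms, uses a simple document-frequency threshold (`max_df`, a hypothetical parameter) to decide what counts as rare, and uses made-up placeholder terms instead of MEDLINE data. The judgement of which joint term is "interesting" is left to the expert and is not modelled here.

```python
from collections import Counter


def document_frequency(docs):
    """Count in how many documents (sets of terms) each term occurs."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc))
    return counts


def rare_terms(docs, max_df=1):
    """Step 1: terms occurring in at most max_df documents of the literature on c."""
    df = document_frequency(docs)
    return {term for term, n in df.items() if n <= max_df}


def joint_terms(literatures):
    """Step 2: terms shared by the literatures on all selected rare terms."""
    vocabularies = [set(document_frequency(docs)) for docs in literatures]
    return set.intersection(*vocabularies) if vocabularies else set()


def linking_terms(docs_a, docs_c, exclude=()):
    """Step 3 (closed discovery): terms b found in both literatures."""
    vocab_a = set(document_frequency(docs_a))
    vocab_c = set(document_frequency(docs_c))
    return (vocab_a & vocab_c) - set(exclude)


# Toy data with placeholder terms; none of this comes from real literature.
literature_c = [{"c", "x", "b1"}, {"c", "x", "r1"}, {"c", "x", "r2", "b1"}]
literature_on_rare = {
    "r1": [{"r1", "a", "y"}, {"r1", "z"}],
    "r2": [{"r2", "a"}, {"r2", "w"}],
}
literature_a = [{"a", "b1"}, {"a", "b2"}]

rare = rare_terms(literature_c, max_df=1)
candidates_a = joint_terms([literature_on_rare[r] for r in sorted(rare)])
links_b = linking_terms(literature_a, literature_c, exclude={"a", "c"})
```

In this toy setting the rare terms are `r1` and `r2`, their literatures share only the term `a`, and the only term linking the literature on `a` with the literature on `c` is `b1`. In practice, each step would be followed by expert selection rather than accepted automatically.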
We have applied the RaJoLink method to the scientific literature on autism and have used MEDLINE as
a source of data. Autism was selected as the problem domain because of its complexity, the insufficient and
partial knowledge about its various causes, and the strong focus of current medical research on
early diagnosis of this disorder. With the proposed approach we wanted to make a concrete contribution in
this direction. In the autism domain we discovered relations between autism and calcineurin and between
autism and the transcription factor NF-kappaB, both of which were evaluated by a medical expert as relevant for
a better understanding of autism. To assess the usefulness of RaJoLink in general, we also evaluated
our method on the migraine-magnesium experiment, which represents a gold standard in
literature-based discovery. For all these purposes we also developed a software tool that implements the
RaJoLink method and provides decision support to experts in the process of generating and testing
scientific hypotheses in biomedical domains.