One of the prevailing tendencies in science is research over-specialization, resulting in deep but relatively isolated islands of knowledge. Due to the huge amounts of scientific information produced at an increasingly fast pace, it has become difficult to follow even the specific literature limited to a single domain of specialization. On the other hand, it is well known that many complex problems require knowledge from different domains to be combined in generating their solutions. However, scientific literature all too often remains limited and cited only within a single research sub-community, which makes searching for cross-domain scientific connections even harder. The problem of isolation of scientific domains on the one hand and the vast amount of available knowledge on the other hand are clear motivations for the work presented in this thesis.
The thesis proposes a solution to cross-domain knowledge discovery, which alleviates the stated problem of over-specialization by helping scientists to find promising pathways when combining different domains of interest. In particular, we deal with the problem of cross-domain knowledge discovery from text, when a scientist already assumes two topics or domains that need to be addressed jointly. In such cases, our methodology suggests the most adequate terms for bridging the two domains of interest and gives hints concerning the most promising pathways, potentially leading to new discoveries.
This work presents four main contributions to the problem of cross-domain knowledge discovery through text mining. First, an important text preprocessing step, word lemmatization, was significantly improved in terms of accuracy and efficiency. Second, a new CrossBee methodology was developed for discovering and ranking bridging terms according to their potential to lead to new cross-domain scientific discoveries. Third, the CrossBee methodology was implemented as an executable workflow in ClowdFlows, a novel browser-based platform for workflow construction and execution, which enables methodology adaptation and reuse. Last but not least, an online web application was developed; its advanced user interface supports the end-user in applying the developed CrossBee methods to the domains of choice.
The first main contribution of this thesis is a novel multi-lingual lemmatization engine. We developed LemmaGen, a general-purpose, accurate, very efficient and publicly available lemmatizer trained on large lexicons of multiple languages. LemmaGen also contains a learning engine, which can be retrained to effectively generate lemmatizers for new languages.
The second contribution of this work is the CrossBee (Cross-Context Bisociation Explorer) methodology for cross-domain knowledge discovery by means of discovering and ranking domain bridging terms. We present specially designed heuristics that assign a quality estimate, a bisociation score, to each bridging term candidate. Furthermore, the methodology is ensemble-based, employing a set of heuristics to ensure robustness and stable performance across a variety of different datasets. Next, this ensemble is used for bridging term identification and ranking for cross-domain knowledge discovery from texts.
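The idea of an ensemble of heuristics voting on bridging term candidates can be illustrated with a minimal sketch. It is not the CrossBee implementation: the three frequency-based heuristics, the top-third voting rule, the whitespace tokenization and the names `term_counts` and `rank_bridging_terms` are all assumptions made for illustration only.

```python
from collections import Counter


def term_counts(docs):
    """Count in how many documents of one domain each term occurs."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.lower().split()))
    return counts


def rank_bridging_terms(docs_a, docs_b):
    """Rank terms shared by both domains using a simple voting ensemble.

    Hypothetical heuristics; the thesis defines its own bisociation scores.
    """
    df_a, df_b = term_counts(docs_a), term_counts(docs_b)
    # Only terms that occur in both domains can act as bridging terms.
    candidates = sorted(set(df_a) & set(df_b))
    if not candidates:
        return []

    heuristics = [
        lambda t: min(df_a[t], df_b[t]),                         # present in both
        lambda t: min(df_a[t], df_b[t]) / max(df_a[t], df_b[t]),  # balanced usage
        lambda t: df_a[t] + df_b[t],                              # overall frequency
    ]

    # Each heuristic votes for the top third of its own ranking.
    top_k = max(1, len(candidates) // 3)
    votes = {t: 0 for t in candidates}
    for score in heuristics:
        ranked = sorted(candidates, key=lambda t: (-score(t), t))
        for t in ranked[:top_k]:
            votes[t] += 1

    # The ensemble score of a term is the number of votes it received.
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    domain_a = ["magnesium calcium channel", "calcium stress"]
    domain_b = ["migraine stress calcium", "migraine serotonin"]
    for term, n_votes in rank_bridging_terms(domain_a, domain_b):
        print(term, n_votes)
```

Combining several weak heuristics by voting makes the final ranking less sensitive to the failure of any single heuristic on a particular dataset, which is the robustness argument made above.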
The third contribution of this thesis is the implementation of the LemmaGen and CrossBee modules as reusable software components in a complex ClowdFlows workflow. This enables experiment repeatability, software reuse, workflow adaptation and augmentation with new modules, ensuring the system’s sustainability and reuse by the interested research community.
The CrossBee methodology is implemented in an online system, which represents the fourth important contribution of this thesis. In addition to ranking bridging terms in accordance with the developed methodology, the CrossBee web application provides functionality that helps the scientist not only to find bridging hypotheses but also to check supporting evidence for them. The CrossBee user interface supports the user’s creativity by offering many different views and support tools for navigating the data.
The novel methodology for cross-domain knowledge discovery using text mining, presented in this thesis and implemented in CrossBee, was trained by simulating discoveries previously published in the literature-based discovery area, namely by rediscovering the connections between the migraine and magnesium domains. The approach was then tested on the problem of rediscovering connections between the autism and calcineurin domains. Finally, this work concludes with a critical evaluation of the developed methodology and interfaces, and discusses possible improvements and directions for further work.