Microarrays are at the center of a revolution in biotechnology, allowing researchers to simultaneously monitor the expression of tens of thousands of genes. The final aim of a typical microarray experiment is to find a molecular explanation for a given macroscopic observation (e.g., which pathways are affected by the loss of glucose in a cell, what biological processes differentiate a healthy control from a diseased case); this is called functional interpretation of gene expression data.
This thesis presents two new methods for the functional interpretation of gene expression data that combine and use knowledge stored in different kinds of biological databases. The interpretation is done by identifying and describing gene sets that have a significantly altered expression profile (e.g., over- or underexpressed). The search for interesting gene sets is performed in the space of already defined gene sets (genes that share an annotation by predefined ontological terms) and in the space of newly generated gene sets that have predefined characteristics (e.g., a minimum number of member genes that are found to be differentially expressed). Three well-established methods, Fisher's exact test, Gene Set Enrichment Analysis (GSEA), and Parametric Analysis of Gene Set Enrichment (PAGE), were employed to identify gene sets with significantly altered expression profiles.
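As an illustration of the parametric approach, the following is a minimal sketch of a PAGE-style z-score, assuming the commonly stated form Z = (S_m - mu) * sqrt(m) / sigma, where mu and sigma are the mean and standard deviation of the scores of all genes, S_m is the mean score of the gene set, and m is its size. The function name `page_z_score`, the input layout (a dictionary from gene name to log fold change), and the use of the population standard deviation are assumptions made for this example, not details taken from the thesis.

```python
from statistics import NormalDist, mean, pstdev


def page_z_score(scores, gene_set):
    """Return (z, two-sided p) for one gene set.

    scores   -- dict mapping gene name to a score such as a log fold change
    gene_set -- iterable of gene names (hypothetical input format)
    """
    members = [scores[g] for g in gene_set if g in scores]
    if not members:
        raise ValueError("gene set has no members among the scored genes")

    mu = mean(scores.values())        # mean score over all genes
    sigma = pstdev(scores.values())   # population std. dev. (an assumption)
    if sigma == 0:
        raise ValueError("all scores are identical; z-score is undefined")

    # Deviation of the set mean from the global mean, scaled by set size.
    z = (mean(members) - mu) * len(members) ** 0.5 / sigma
    # Two-sided p-value under the standard normal approximation.
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p
```

The normal approximation relies on the central limit theorem, so it is more trustworthy for larger gene sets than for very small ones.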
Both developed methods share the same mechanism of first-order (relational) feature construction, using the Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) Orthology, gene annotations, and gene-gene interaction data. These features, constructed by the propositionalization mechanism of the Relational Subgroup Discovery (RSD) algorithm, are used as generalized gene annotations.
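To make the idea of features acting as generalized annotations concrete, here is a simplified sketch that treats each feature as the set of genes it covers and combines features by conjunction, which corresponds to set intersection. This is a propositional simplification and not the RSD feature construction itself; the function name `conjunction_gene_sets` and its parameters `max_terms` and `min_size` are hypothetical.

```python
from itertools import combinations


def conjunction_gene_sets(feature_genes, max_terms=2, min_size=2):
    """Build gene sets covered by conjunctions of features.

    feature_genes -- dict mapping a feature name to the genes it covers
    max_terms     -- largest number of features joined in one conjunction
    min_size      -- smallest gene set worth keeping
    Returns a dict mapping a tuple of feature names to a set of genes.
    """
    result = {}
    names = sorted(feature_genes)
    for k in range(1, max_terms + 1):
        for combo in combinations(names, k):
            # A gene satisfies the conjunction if every feature covers it.
            covered = [set(feature_genes[name]) for name in combo]
            genes = set.intersection(*covered)
            if len(genes) >= min_size:
                result[combo] = genes
    return result
```

Enumerating every combination grows quickly with the number of features, so a practical implementation would prune conjunctions whose coverage already falls below `min_size`, since adding a further feature can only shrink the set.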
The first method belongs to the class of threshold-based functional analysis methods and is performed in two steps. In the first step, 'top' genes of interest are selected using gene differential expression as the selection criterion. This selection does not take into account that gene products act cooperatively in the cell. Consequently, to allow better interpretation of the selected gene list, the second step couples the behavior of the selected genes by looking for their common description. The language used for describing the functionality of the genes is constructed from GO, gene annotations, and gene-gene interaction data. By using this background knowledge together with the paradigm of relational subgroup discovery, we found common descriptions of gene sets differentially expressed in specific cancers. The descriptions of these gene sets can be used directly by medical experts.
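The two-step, threshold-based scheme can be sketched as follows: select the top-k genes by absolute differential expression, then test whether a candidate description covers more of them than expected by chance. The sketch uses a one-sided Fisher's exact test, computed as the upper tail of the hypergeometric distribution. It does not implement subgroup discovery; the names `top_genes` and `fisher_enrichment_p` and the choice of a fixed k as the threshold are assumptions made for illustration.

```python
from math import comb


def top_genes(scores, k):
    """Step 1: keep the k genes with the largest absolute score."""
    ranked = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
    return set(ranked[:k])


def fisher_enrichment_p(selected, annotated, universe):
    """Step 2: one-sided Fisher's exact test for over-representation.

    Returns the probability of seeing at least the observed overlap
    between `selected` and `annotated` when `selected` is drawn at
    random from `universe` (hypergeometric upper tail).
    """
    universe = set(universe)
    selected = set(selected) & universe
    annotated = set(annotated) & universe

    n_all = len(universe)
    n_annotated = len(annotated)
    n_selected = len(selected)
    overlap = len(selected & annotated)

    upper = min(n_annotated, n_selected)
    tail = sum(
        comb(n_annotated, i) * comb(n_all - n_annotated, n_selected - i)
        for i in range(overlap, upper + 1)
    )
    return tail / comb(n_all, n_selected)
```

When many candidate descriptions are tested against the same gene list, the resulting p-values would normally need a multiple-testing correction before being interpreted.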
The second method is based on threshold-free functional analysis and is also performed in two steps. In the first step, genes are ranked by their differential expression values when comparing predefined classes (e.g., tumor vs. healthy controls) by means of an appropriate statistical test (e.g., the t-test). In the second step, the positions of the members of the predefined gene sets (e.g., defined by GO and KEGG Orthology terms) in the ranked list are analyzed using appropriate statistical tests (e.g., the Kolmogorov-Smirnov test). Gene sets whose members are predominantly found at the top of the list are considered enriched and responsible for the phenotype difference (e.g., tumor vs. normal). Our contribution to this methodology is the development of an efficient algorithm, inspired by the RSD first-order feature construction, for the construction of new, potentially enriched gene sets. New gene sets are defined by conjunctions of relational features constructed from the background knowledge.
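The two steps of the threshold-free analysis can be sketched as below: genes are ranked by a Welch t-statistic, and a gene set is scored by an unweighted Kolmogorov-Smirnov-style running sum over the ranked list. This is a simplified stand-in for GSEA, which additionally weights hits and estimates significance by permutation; the function names and the input layout (dictionaries from gene name to a list of expression values per class) are assumptions made for this example.

```python
from statistics import mean, variance


def welch_t(a, b):
    """Welch t-statistic for two samples (each needs at least 2 values
    and the two samples must not both have zero variance)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se


def rank_genes(expr_case, expr_control):
    """Step 1: order genes from most over- to most under-expressed."""
    t = {g: welch_t(expr_case[g], expr_control[g]) for g in expr_case}
    return sorted(t, key=t.get, reverse=True)


def enrichment_score(ranked, gene_set):
    """Step 2: unweighted KS-style running-sum statistic.

    Walk down the ranked list, stepping up by 1/n_hit at each member of
    the gene set and down by 1/n_miss otherwise; return the running sum
    with the largest absolute value.
    """
    hits = [g in gene_set for g in ranked]
    n_hit = sum(hits)
    n_miss = len(ranked) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some but not all ranked genes")

    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best
```

A score close to 1 indicates that the members of the set are concentrated at the top of the ranked list, and a score close to -1 that they are concentrated at the bottom.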
The two developed methods have proved to be of interest to medical experts. The extracted knowledge turns out to be consistent with the relevant literature and has the potential to guide biomedical research and to generate new hypotheses that explain microarray measurements.
A by-product of the thesis is an easy-to-use relational database that integrates several sources of biological knowledge (GO, KEGG Orthology, gene annotations, and gene-gene interaction data) in a unified format. This database is now publicly available to the wider scientific community.