Doctoral dissertation

Considering autocorrelation in predictive models

Author(s): Daniela Stojanova (Author), Sašo Džeroski (Supervisor)

Thesis defense date: 20.12.2012

Organization: MPŠ - Mednarodna podiplomska šola Jožefa Stefana

PID: 20.500.12556/ReVIS-13613

Abstract

Most machine learning, data mining and statistical methods rely on the assumption that the analyzed data
are independent and identically distributed (i.i.d.). More specifically, the individual examples included
in the training data are assumed to be drawn independently of each other from the same probability
distribution. However, cases where this assumption is violated are easy to find: for example, species
are distributed non-randomly across a wide range of spatial scales. The i.i.d. assumption is often violated
because of the phenomenon of autocorrelation.
The cross-correlation of an attribute with itself is typically referred to as autocorrelation; this is
the most general definition found in the literature. Specifically, in statistics, temporal autocorrelation is
defined as the cross-correlation between the values of an attribute of a process at different points in time. In time series
analysis, temporal autocorrelation is defined as the correlation among time-stamped values due
to their relative proximity in time. In spatial analysis, spatial autocorrelation has been defined as the
correlation among data values, which is strictly due to the relative location proximity of the objects
that the data refer to. It is justified by Tobler’s first law of geography according to which “everything
is related to everything else, but near things are more related than distant things”. In network studies,
autocorrelation is defined by the homophily principle as the tendency of nodes with similar values to be
linked with each other.
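
As an illustration of the spatial definition above, the following minimal sketch computes global Moran's I, one widely used measure of spatial autocorrelation. The abstract does not state which measures the dissertation uses, so this choice is an assumption made for illustration only. The function names `morans_i` and `chain_weights`, the binary adjacency weights and the toy data are all hypothetical.

```python
def morans_i(values, weights):
    """Global Moran's I for values x_i and a weight matrix w_ij.

    I = (n / W) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2,
    where m is the mean of the values and W is the sum of all weights.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * dev[i] * dev[j]
        for i in range(n)
        for j in range(n)
    )
    squared_dev_sum = sum(d * d for d in dev)
    return (n / total_weight) * (cross / squared_dev_sum)


def chain_weights(n):
    """Hypothetical weights: points on a line, neighbours are adjacent positions."""
    return [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]


# Toy data (not from the dissertation): a smooth gradient and an alternating pattern.
smooth = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
alternating = [1.0, 6.0, 1.0, 6.0, 1.0, 6.0]
w = chain_weights(6)

i_smooth = morans_i(smooth, w)
i_alternating = morans_i(alternating, w)
print(i_smooth, i_alternating)
```

On this toy data the smooth gradient gives a positive value (about 0.6), meaning near values are similar, while the alternating pattern gives -1.0, meaning neighbours are systematically dissimilar. The same formula applies to a network if the weight matrix is taken to be the adjacency matrix of the graph, which is one way to quantify the homophily described above.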
In this dissertation, we first give a clear and general definition of the autocorrelation phenomenon,
which includes spatial and network autocorrelation for continuous and discrete responses. We then
present a broad overview of the existing autocorrelation measures for the different types of autocorrelation
and data analysis methods that consider them. Focusing on spatial and network autocorrelation, we
propose three algorithms that handle non-stationary autocorrelation within the framework of predictive
clustering, which deals with the tasks of classification, regression and structured output prediction. These
algorithms and their empirical evaluation are the major contributions of this thesis.
We first propose a data mining method called SCLUS that explicitly considers spatial autocorrelation
when learning predictive clustering models. The method is based on the concept of predictive clustering
trees (PCTs), according to which hierarchies of clusters of similar data are identified and a predictive
model is associated with each cluster. In particular, our approach is able to learn predictive models for both
a continuous response (regression task) and a discrete response (classification task). It properly deals with
autocorrelation in data and provides a multi-level insight into the spatial autocorrelation phenomenon.
The predictive models adapt to the local properties of the data, providing at the same time spatially
smoothed predictions. We evaluate our approach on several real-world problems of spatial regression
and spatial classification.
“Network inference” is known to be a challenging task. In this dissertation, we
propose a data mining method called NCLUS that explicitly considers autocorrelation when building
predictive models from network data. The algorithm is based on the concept of PCTs that can be used
for clustering, prediction and multi-target prediction, including multi-target regression and multi-target
classification. We evaluate our approach on several real-world problems of network regression, coming
from the areas of social and spatial networks. Empirical results show that our algorithm performs better
than PCTs learned while completely disregarding network information, than CLUS*, which is tailored for spatial
data but does not take autocorrelation into account, and than a variety of other existing approaches.
We also propose a data mining method called NHMC for (Network) Hierarchical Multi-label Classification.
This has been motivated by the recent development of several machine learning algorithms for
gene function prediction that work under the assumption that instances may belong to multiple classes
and that classes are organized into a hierarchy. Besides relationships among classes, it is also possible
to identify relationships among examples. Although such relationships have been identified and extensively
studied in the literature, in particular as defined by protein-to-protein interaction (PPI) networks,
they have not received much attention in hierarchical and multi-class gene function prediction. Their
use introduces the autocorrelation phenomenon and violates the i.i.d. assumption adopted by most machine
learning algorithms. Besides improving the predictive capabilities of learned models, NHMC is
helpful in obtaining predictions consistent with the network structure and consistently combining two
information sources (hierarchical collections of functional class definitions and PPI networks). We compare
different PPI networks (DIP, VM and MIPS for yeast data) and their influence on the predictive
capability of the models. Empirical evidence shows that explicitly taking network autocorrelation into
account can increase the predictive capability of the models, especially when the PPI networks are dense.
NHMC outperforms CLUS-HMC (which disregards the network) for GO annotations, since these are more
coherent with the PPI networks.
