Views: 6 | Downloads: 10
In this thesis we address the problem of learning various types of decision trees from timechanging
data streams. In particular, we study online machine learning algorithms for
learning regression trees, linear model trees, option trees for regression, multi-target model
trees, and ensembles of model trees from data streams. These are the most representative
and widely used models in the category of interpretable predictive models.
A data stream is an inherently unbounded sequence of data elements (numbers, coordinates,
multi-dimensional points, tuples, or objects of an arbitrary type). It is characterized
with high inbound rates and non-stationary data distributions. Real-world scenarios where
processing data streams is a necessity come from various management systems deployed
on top of sensor networks, that monitor the performance of smart power grids, city traffic
congestion, or scientific studies of environmental changes.
Due to the fact that this type of data cannot be easily stored or transported to a central
database without overwhelming the communication infrastructure, data processing and
analysis has to be done in-situ and in real-time, using constant amount of memory. To
enable in-situ real-time learning it is crucial to perform an incremental computation of unbiased
estimates for various types of statistical measures. This requires methods that would
enable us to collect an appropriate sample from the incoming data stream and compute the
necessary statistics and estimates for the evaluation functions on-the-fly.
We approached the problem of obtaining unbiased estimates on-the-fly by treating the
evaluation functions as random variables. This enabled the application of existing probability
bounds, among which the best results were achieved when using the Hoeffding bound.
The algorithms proposed in this thesis therefore use the Hoeffding probability bound for
bounding the probability of error when approximating the sample mean of a sequence of
random variables. This approach gives us the statistical machinery for scaling up various
machine learning tasks.
With our research we address three main sub-problems as part of the problem of learning
tree-based model from time-changing data streams. The first one is concerned with the nonstationarity
of concepts and the need for an informed adaptation of the decision tree. We
propose online change detection mechanisms integrated within the incrementally learned
model. The second subproblem is related to the myopia of decision tree learning algorithms
while searching the space of possible models. We address this problem trough a study and a
comparative assessment of online option trees for regression and ensembles of model trees.
We advise the introduction of options for improving the performance, stability and quality
of standard tree-based models. The third subproblem is related to the applicability of the
proposed approach to the multi-target prediction task. This thesis proposes an extension of
the predictive clustering framework in the online domain by incorporating Hoeffding bound
probabilistic estimates. The conducted study opened many interesting directions for further
work.
The algorithms proposed in this thesis are empirically evaluated on several stationary and
non-stationary datasets for single and multi-target regression problems. The incremental
algorithms were shown to perform favorably to existing batch learning algorithms, while
having lower variability in their predictions due to variations in the training data. Our
change detection and adaptation methods were shown to successfully track changes in realtime
and enable appropriate adaptations of the model. We have further shown that option
trees improve the accuracy of standard regression trees more than ensemble learning methods
without harming their robustness. At last, the comparative assessment of single target
and multi-target model trees has shown that multi-target regression trees offer comparable
performance to a collection of single-target model trees, while having lower complexity and
better interpretability.