REPOSITORY > RESULTS

Doctoral dissertation

Complex nodes in trees for structured output prediction

Author(s): Tomaž Stepišnik (Author), Dragi Kocev (Supervisor), Sašo Džeroski (Co-Supervisor)

Thesis defense date: 18.01.2021

Organization: MPŠ - Mednarodna podiplomska šola Jožefa Stefana

PID: 20.500.12556/ReVIS-14145

Views: 4 | Downloads: 1

Abstract

In this thesis, we integrate complex nodes into predictive clustering trees (PCTs). PCTs
are well-established machine learning models that are very flexible in terms of the machine
learning tasks that they can address, including structured output prediction and semisupervised
learning. Like standard decision trees, they are learned with a greedy top-down
induction algorithm that comes with two weaknesses. First, because of the greediness, the
algorithm makes myopic decisions, which can lead to sub-optimal trees. Second, the split
selection procedure scales poorly with the dimensionality of the output. This can lead to
prohibitively long learning times in structured output prediction problems.
To reduce the myopia of the learning algorithm, we extend PCTs with option nodes.
When there are multiple splits with heuristic scores close to the maximum one, option PCTs
do not select only the best split but combine several alternatives in an option node. For
each alternative split, the tree then continues to grow normally. When making predictions,
each alternative split in an option node makes its own prediction. These predictions are
then aggregated into the final prediction. We evaluate option PCTs on several structured
output prediction tasks: multi-target regression, multi-label classification, and hierarchical
multi-label classification. In the evaluation, we focus on how the number of option nodes
introduced in the tree affects the trade-off between the predictive performance and tree
size (which is the main factor of its interpretability). We show that if the number of option
nodes is small, the option PCT remains interpretable while improving the performance of
a standard PCT. On the other hand, option PCTs with many option nodes exhibit similar
behavior to bagging ensembles of PCTs in terms of predictive performance, model size,
and learning time.
To improve the computational complexity of learning PCTs, we propose PCTs with
oblique split nodes. In contrast to standard PCTs, oblique PCTs use linear combinations
of features in tests to split the examples. We propose two methods for learning the oblique
splits in PCTs. The SVM variant first clusters the examples based on the outputs to get the
ideal split, then learns a linear support vector machine on the features that approximates
this split. The gradient variant uses a differentiable approximation of the heuristic used
by PCTs to efficiently optimize oblique splits with gradient-based methods. The proposed
methods improve the computational scaling of standard PCTs and provide additional computational
benefits on sparse data. These advantages are first confirmed with theoretical
analysis and later with empirical evaluation. The results of the experiments also show
that the predictive performance of oblique PCTs is on par with that of PCTs, often even
exceeding it. We also evaluate the approach in a semi-supervised setting and get similar
results: oblique PCTs are learned faster and often achieve better performance than PCTs
as well. Additionally, we present a method for estimating feature importances from oblique
PCTs and perform experiments that confirm they are meaningful.

Attachments

Cite this work