REPOSITORY > RESULTS

Doctoral dissertation

Probabilistic grammar-based equation discovery

Author(s): Jure Brence (Author), Sašo Džeroski (Supervisor), Ljupčo Todorovski (Co-Supervisor)

Thesis defense date: 21.05.2024

Organization: MPŠ - Mednarodna podiplomska šola Jožefa Stefana

PID: 20.500.12556/ReVIS-13716

Views: 5 | Downloads: 8

Abstract

In this thesis, we introduce novel methods for equation discovery (ED), based on the use of
probabilistic grammars. ED and symbolic regression address the task of finding a symbolic
mathematical model that best describes observed data. Models can be as simple as an algebraic
equation or as complex as a system of differential equations. Traditionally, domain
experts derive equations based on theory and use regression methods to estimate their parameters.
ED methods seek to automate the identification of equation structure as well as
its parameters. The advantage of discovering closed-form equations over black-box models,
popular in machine learning, lies in their inherent interpretability and correspondence with
domain theory.
Our methods focus on the use of probabilistic context-free grammars (PCFGs) as a tool
for generating mathematical expressions, constraining the space of expressions and encoding
background knowledge. We demonstrate that PCFGs parametrize the parsimony principle
inherently and intuitively. Furthermore, in addition to the hard constraints imposed
by CFG, PCFGs allow us to impose soft constraints on the search space of mathematical
expressions. To aid analysis, we introduce a novel method for visualizing the search space
of expressions, useful for any ED approach. We introduce a Monte-Carlo algorithm that
enables the use of PCFGs in ED and perform extensive computational experiments using
an established benchmark database. The results demonstrate that our approach can be
used to discover equations, but performs worse than existing methods.
To improve the performance of ED, we introduce dimensional attribute grammars, an
extension of PCFGs, that generate only dimensionally consistent mathematical expressions.
Our computational experiments demonstrate the impact of dimensional consistency in ED,
resulting in performance, comparable to state-of-the-art approaches in the field.
We further extend the ideas of attribute grammars into a general-purpose framework for
encoding background knowledge. The framework relies on probabilistic attribute grammars
to overcome the limitations of PCFGs in expressing complex types of background knowledge.
We demonstrate the utility of the framework by designing and analyzing grammars
that encode three different types of background knowledge: dimensional consistency, systems
of differential equations for chemical kinetics, and systems of differential equations
describing electronic circuits.
Finally, we pave the way toward more intricate PCFG-based ED algorithms by developing
a novel Bayesian algorithm for sampling mathematical expressions from a PCFG.
The algorithm iteratively updates grammar probabilities to improve the performance of
ED and enable the estimation of the posterior distribution. An illustrative computational
experiment shows that the algorithm works according to our expectations and improves
the performance of ED by guiding the search towards more promising areas of the space
of mathematical expressions.

Attachments

Cite this work