Human knowledge about food and nutrition has evolved drastically over time. With food
and nutrition-related data being mass-produced and easily accessible, the next step is to
use Artificial Intelligence (AI) to translate data into knowledge. The majority of AI research
is model-driven, and classical Machine Learning (ML) pipelines take a model-centric
approach: they prioritize training the best model for a specific task, focusing mainly
on improving model parameters while overlooking the importance of data.
We propose a novel ML pipeline that fuses data and domain-driven knowledge for a predictive
task from the Food and Nutrition domain: fast prediction of nutrient values from
unstructured recipe text. Our proposed pipeline consists of three parts: representation
learning (RL), unsupervised ML, and supervised ML. In the RL part, word and paragraph
embeddings are learned for short text descriptions of foods (recipe titles). In the unsupervised
ML part, the recipes are separated into clusters based on a domain-specific coding
(FoodEx2 classification) from an external domain resource. In the supervised ML part,
the two parts are combined: separate predictive models are trained for each cluster and each
nutrient, using the learned embeddings as input features. The pipeline is evaluated
with a criterion defined using domain knowledge (nutrient tolerance levels) and compared
to baselines calculated using the same criterion.
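
The per-cluster modeling and the tolerance-based evaluation can be illustrated with a minimal sketch. It is not the pipeline itself: the cluster labels, the nutrient values, the relative tolerance of 20%, and the use of a cluster mean as a stand-in predictor are all assumptions made for illustration, and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def tolerance_accuracy(y_true, y_pred, tolerance=0.2):
    """Share of predictions that fall within +/- tolerance (relative) of the true value.

    The 20% default is an assumed value, not a tolerance level taken from the text.
    """
    hits = 0
    for true, pred in zip(y_true, y_pred):
        allowed = abs(true) * tolerance
        if abs(pred - true) <= allowed:
            hits += 1
    return hits / len(y_true)


def fit_cluster_means(cluster_ids, targets):
    """Stand-in 'model' per cluster: the mean nutrient value of the recipes in it."""
    grouped = defaultdict(list)
    for cluster_id, target in zip(cluster_ids, targets):
        grouped[cluster_id].append(target)
    return {cluster_id: mean(values) for cluster_id, values in grouped.items()}


# Toy data (hypothetical): four recipes in two clusters, one nutrient value each.
clusters = ["A", "A", "B", "B"]
protein = [10.0, 12.0, 30.0, 60.0]

models = fit_cluster_means(clusters, protein)
predictions = [models[c] for c in clusters]
accuracy = tolerance_accuracy(protein, predictions, tolerance=0.2)
print(models, accuracy)
```

In this toy case the homogeneous cluster A is predicted within tolerance while the heterogeneous cluster B is not, which is the kind of difference a tolerance-based criterion makes visible.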
As the evaluation results showed that including the domain knowledge in the unsupervised
ML part improved the results compared to the baseline, we propose an alteration of the
ML pipeline. We include two different external sources of domain knowledge for clustering
in the unsupervised ML part, to explore the domain bias for the same prediction task.
To further improve the ML pipeline, we include domain knowledge in the RL part of the
pipeline. Instead of obtaining recipe title embeddings, we introduce a domain heuristic
for merging embeddings of the ingredients of the recipe. This proved to be a successful
way to train well-performing models for predicting nutrient values, as the
accuracies obtained were significantly higher than the baseline.
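
As a rough illustration of merging ingredient embeddings into a single recipe embedding, the sketch below uses a quantity-weighted average. The actual domain heuristic is not specified here, so the weighting scheme, the function name `merge_ingredient_embeddings`, and the toy vectors are assumptions.

```python
def merge_ingredient_embeddings(embeddings, weights):
    """Combine ingredient vectors into one recipe vector by weighted averaging.

    Weighting by ingredient quantity is an assumed heuristic for illustration only.
    """
    total = sum(weights)
    dim = len(embeddings[0])
    merged = [0.0] * dim
    for vector, weight in zip(embeddings, weights):
        for i, value in enumerate(vector):
            merged[i] += value * weight / total
    return merged


# Toy two-dimensional embeddings (hypothetical values) and quantities in grams.
flour = [1.0, 0.0]
butter = [0.0, 1.0]
recipe_vec = merge_ingredient_embeddings([flour, butter], [300.0, 100.0])
print(recipe_vec)
```

The resulting vector lies closer to the ingredient that dominates the recipe, which is one plausible way for domain knowledge to shape the representation.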
As the domain-specific embeddings proved to perform well, we composed two predefined
corpora of embeddings, ingredient embeddings and recipe embeddings, from six heterogeneous
multilingual recipe datasets. The data were normalized using dictionary- and rule-based
Named Entity Recognition and mapped to a Food Composition Database.
Training embeddings tailored for a specific task is a very time-consuming process;
therefore, these corpora of predefined embeddings can be used for research purposes as well
as transferred to other tasks for application purposes.
To explore the major impact data has on model performance, we focused on the generalization
of predictive models by defining a generalizability index that indicates how much trust can
be placed in transferring a predictive model learned on one dataset to another. Going a step
further to show the importance of data in predictive modeling, we present different ways of
selecting a representative training dataset, and the results show that different selections
of the training dataset produce different outcomes. The training data should be representative
of the data expected in deployment, covering all variations that deployment data will present.