Abstract
The advent of the Fourth Industrial Revolution has highlighted the need for greater efficiency and productivity
across all systems and aspects of our lives, a demand that machine learning techniques are poised to meet.
This dissertation explores the application of machine learning techniques to small datasets, specifically
within the domain of azo dye synthesis in chemistry. As traditional optimisation methods face challenges
when applied to large datasets, machine learning can be a viable alternative for extracting meaningful
insights from limited data. The research study set out to develop predictive models that can accurately
forecast outcomes in azo dye synthesis, a critical process in the chemical industry that is labour-intensive
and environmentally detrimental. The difficulty in addressing this problem stems from the limited datasets
available for training machine learning models to make such predictions. The study employs a design
science research methodology to guide the exploration of small dataset machine learning approaches,
selecting suitable techniques for azo dye synthesis. A small dataset of 119 entries is used to predict the
colours of the synthesised dyes. The study investigates various machine learning models, including Support
Vector Machines, Random Forests, and Gradient Boosting Machines, assessing their performance through
metrics such as Mean Absolute Error and Root Mean Square Error. The findings demonstrate that while
small dataset machine learning holds promise, the quality and breadth of data are crucial for achieving
accurate predictions. This dissertation offers a valuable contribution to the field by providing a benchmark
for machine learning applications in chemistry and proposing a baseline approach for handling small
datasets. Future research could enhance model performance through alternative optimisation
techniques and expanded datasets. Overall, this research underscores the potential for machine learning to
transform chemical processes by improving efficiency and sustainability, while also highlighting the need
for comprehensive datasets to fully realise these benefits.
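
For readers unfamiliar with the two error metrics named above, the following is a minimal sketch of how Mean Absolute Error and Root Mean Square Error are conventionally computed. It is an illustration only and not the implementation used in this study; the function names and the example values are hypothetical and are not drawn from the azo dye dataset.

```python
import math
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average magnitude of the prediction errors."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_square_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Square root of the average squared prediction error."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(sum(squared_errors) / len(y_true))


if __name__ == "__main__":
    # Hypothetical observed and predicted values, for illustration only.
    observed = [0.2, 0.5, 0.9]
    predicted = [0.1, 0.5, 1.2]
    print("MAE: ", mean_absolute_error(observed, predicted))
    print("RMSE:", root_mean_square_error(observed, predicted))
```

Because RMSE squares each error before averaging, it penalises large individual errors more heavily than MAE and is never smaller than MAE on the same predictions. On a dataset of only 119 entries, a handful of poorly predicted samples can therefore move RMSE noticeably, which is one reason to report both metrics together.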