Abstract
Facial expression recognition (FER) has garnered significant attention with
advances in artificial intelligence, particularly in applications such as driver monitoring,
healthcare, and human-computer interaction, which benefit from deep learning
techniques. This research addresses the challenge of accurately
recognizing emotions despite variations in how each emotion is expressed and
similarities between different expressions. In this work, we propose an early fusion
approach that combines features from the visible and infrared modalities, using the publicly
accessible VIRI and NVIE databases. We first developed single-modality models
for the visible and infrared datasets by incorporating an attention mechanism into the
ResNet-18 architecture. We then extended these models to a multi-modal early fusion
approach using the same modified ResNet-18 with attention, achieving superior
accuracy by combining a convolutional neural network (CNN) with
transfer learning (TL). Our multi-modal approach attained 84.44% accuracy on the
VIRI database and 85.20% on the Natural Visible and Infrared facial Expression
(NVIE) database, outperforming previous methods. These results demonstrate that
our single-modal and multi-modal approaches achieve state-of-the-art performance
in FER.