Abstract
Language plays a crucial role in Sotho-Tswana musical videos, as it helps determine the sentiment and genre. The
Sotho-Tswana languages, spoken in parts of Southern Africa, are used to compose many indigenous songs and music.
However, speakers of one of the Sotho-Tswana languages may not understand other Sotho-Tswana languages. Given the
widespread availability of these musical videos on social media platforms, there is a need for appropriate recommendations
for users based on the language used in the videos. While traditional language identification in music has focused on
audio, cues to the singing language can also be found in other modalities, such as visual content and
text. This study employs a multimodal approach to identify the singing language in Sotho-Tswana musical videos. The
multimodal approach focuses on three modalities: visual, audio, and textual (lyrics). A multimodal dataset of Sotho-Tswana
musical videos is used to train deep learning and language models for each modality. After the models for each modality are
trained independently, a decision-level (late) fusion method combines the predictions from the
three modalities. The results demonstrate that a multimodal approach outperforms single-modality methods, such as those
relying solely on lyrics or textual information.
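
As an illustration of decision-level (late) fusion, the sketch below combines per-modality class probabilities by a weighted average and picks the highest-scoring language. It is a minimal example under stated assumptions, not the fusion rule used in this study: the label set, the function names late_fusion and predict_language, the averaging rule, and the example probabilities are all hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical label set; the abstract does not list the languages covered.
LANGUAGES = ["Sesotho", "Setswana", "Sepedi"]


def late_fusion(
    probs_by_modality: Dict[str, List[float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Weighted average of class probabilities from independently trained models."""
    if not probs_by_modality:
        raise ValueError("at least one modality is required")
    if weights is None:
        # Assumption: equal trust in every modality.
        weights = {name: 1.0 for name in probs_by_modality}
    total_weight = sum(weights[name] for name in probs_by_modality)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    n_classes = len(next(iter(probs_by_modality.values())))
    fused = [0.0] * n_classes
    for name, probs in probs_by_modality.items():
        if len(probs) != n_classes:
            raise ValueError(f"modality {name!r} has a different number of classes")
        for i, p in enumerate(probs):
            fused[i] += weights[name] * p / total_weight
    return fused


def predict_language(fused: List[float]) -> str:
    """Return the label with the highest fused probability."""
    best = max(range(len(fused)), key=lambda i: fused[i])
    return LANGUAGES[best]


# Made-up outputs of three single-modality classifiers for one video.
example = {
    "visual": [0.5, 0.3, 0.2],
    "audio": [0.2, 0.6, 0.2],
    "text": [0.1, 0.7, 0.2],
}
fused_example = late_fusion(example)
predicted = predict_language(fused_example)
```

In this made-up example the visual model alone would pick a different language than the audio and text models; averaging lets the two agreeing modalities outweigh it. Other decision-level rules, such as majority voting or learned weights, fit the same interface.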