Optimized byte pair encoding tokenizer for South African languages

Sicelo Sipho Simphiwe Simelane

Recent advances in Artificial Intelligence (AI), such as Large Language Models (LLMs) and Generative Pre-Trained Transformers (GPTs), conversational AI and chatbots, and Language Translation (LT), have emerged from the field of Natural Language Processing (NLP). Typical NLP pipelines depend on tokenization as the first stage, in which the text is divided into manageable pieces. Although several practical tokenization algorithms exist, they are best suited to English. Their performance deteriorates on languages with high variability and morphological complexity, particularly in multilingual datasets. This dissertation introduces a new algorithm that improves the popular Byte Pair Encoding (BPE) tokenizer. The proposed algorithm, Optimized BPE (OBPE), performs better on South African languages, including Sesotho, Setswana, Xhosa, Xitsonga, and Zulu. It is tailored to the characteristics of multilingual datasets and to language complexity, especially the morphological richness of South African languages. The traditional BPE algorithm begins by initializing its base vocabulary with the unique characters found in the corpus. It then scans the corpus repeatedly, merging the most frequent pair of adjacent symbols at each iteration and adding it to the vocabulary until the target vocabulary size is reached. This approach has proven effective and is used in Transformer models such as GPT, GPT-2, RoBERTa, BART, and DeBERTa. However, because the vocabulary is built only from unique characters and frequent pairs, the algorithm is limited in how well it captures the characteristics and patterns of different languages. The algorithm proposed in this dissertation extends BPE in three ways. Firstly, language-specific tokens are introduced so that the algorithm can learn language-specific context and distinguish between languages when trained on multilingual corpora.
Secondly, common words derived from the multilingual corpus are added to the initial (base) vocabulary, which helps the algorithm converge faster and may yield more meaningful or accurate tokenization. Thirdly, a new common-words parameter is introduced, allowing the user to specify how many common words derived from the corpus are used in the initial vocabulary. Therefore, unlike traditional BPE, the proposed OBPE initializes its base vocabulary with unique characters, language-specific tokens, and common words. The limitations of traditional BPE also apply to WordPiece and Unigram, whose vocabulary initialization is likewise based only on unique characters, so they too lack context awareness across languages and represent rare tokens poorly. The optimized version of BPE addresses these issues: it learns the different patterns in each language and mitigates the out-of-vocabulary (OOV) issue. The experimental setup of this work included assembling an extensive multilingual data collection and comparing the OBPE algorithm against the traditional algorithms mentioned above (BPE, WordPiece, and Unigram). Firstly, under the same experimental setup, OBPE achieved a much higher tokenization accuracy (96%) than BPE and WordPiece (87-88%) and, even more notably, than Unigram (25%). Secondly, OBPE outperformed the baseline algorithms (BPE, Unigram, and WordPiece) on Exact Match Ratio (91%), average token length difference (0.131), average edit distance (0.219), and average suffix precision and recall (97-98%). Thirdly, the experiments showed that performance improves as the vocabulary size increases, most notably for OBPE.
Finally, the per-language results show that OBPE performs best, especially on the Nguni languages (Xhosa and Zulu), followed by BPE and WordPiece.
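
The vocabulary construction described in the abstract can be illustrated with a short Python sketch. It is not the dissertation's implementation: the function name `train_bpe`, the parameters `language_tokens` and `num_common_words`, the token format (`<zul>`, `<xho>`), and the choice to keep each seeded common word as a single unsplit symbol are all assumptions made for illustration. With both optional arguments left at their defaults, the function reduces to plain BPE training.

```python
from collections import Counter


def train_bpe(corpus, vocab_size, language_tokens=(), num_common_words=0):
    """Toy BPE trainer with optional OBPE-style seeding of the base vocabulary.

    corpus: list of strings; words are split on whitespace.
    language_tokens / num_common_words: hypothetical names for the two
    additions described in the abstract (not taken from the dissertation).
    """
    word_freq = Counter(word for text in corpus for word in text.split())

    # Assumed seeding rule: the N most frequent whole words join the base vocabulary.
    common_words = [w for w, _ in word_freq.most_common(num_common_words)]

    # Base vocabulary: unique characters, then language tokens, then common words.
    vocab = sorted({ch for word in word_freq for ch in word})
    for token in list(language_tokens) + common_words:
        if token not in vocab:
            vocab.append(token)

    # Each word is a tuple of symbols; seeded common words start as one symbol.
    seeded = set(common_words)
    splits = {w: ((w,) if w in seeded else tuple(w)) for w in word_freq}

    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_freq = Counter()
        for word, symbols in splits.items():
            for pair in zip(symbols, symbols[1:]):
                pair_freq[pair] += word_freq[word]
        if not pair_freq:
            break  # nothing left to merge

        # Most frequent pair; ties are broken lexicographically for determinism.
        best = max(pair_freq, key=lambda p: (pair_freq[p], p))
        merged = best[0] + best[1]
        merges.append(best)
        if merged not in vocab:
            vocab.append(merged)

        # Apply the merge to every word.
        for word, symbols in list(splits.items()):
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            splits[word] = tuple(out)

    return vocab, merges


# Example call on a tiny made-up corpus (illustrative only).
corpus = ["ngiyabonga kakhulu", "ngiyabonga", "enkosi kakhulu ngiyabonga"]
vocab, merges = train_bpe(
    corpus, vocab_size=20, language_tokens=["<zul>", "<xho>"], num_common_words=1
)
```

In this sketch the seeded word enters the vocabulary before any merges are learned, so the merge budget is spent on the remaining words. That is one plausible reading of the faster convergence the abstract attributes to common-word initialization.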
