Abstract
Recent advances in Artificial Intelligence (AI), such as Large Language Models (LLMs)
and Generative Pre-Trained Transformers (GPTs), conversational AI and chatbots, and
Language Translation (LT), have emerged from the field of Natural Language Processing
(NLP). Typical NLP pipelines depend on tokenization as the first stage, in which
the text is divided into manageable pieces. Although several practical tokenization algorithms
exist, they are best suited to English. Their performance deteriorates on languages with
high variability and morphological complexity, especially when the training data is
multilingual.
This dissertation introduces a new algorithm that improves the popular Byte Pair
Encoding (BPE) tokenizer. The proposed algorithm, Optimized BPE (OBPE), performs
better on South African languages, including Sesotho, Setswana, Xhosa,
Xitsonga, and Zulu. It is tailored to handle the characteristics of multilingual datasets
and language complexity, especially the morphological richness of South African
languages. The traditional BPE algorithm begins by initializing its base vocabulary with
the unique characters found in the corpus. It then repeatedly scans the corpus for
the most frequent pair of adjacent symbols and merges that pair into a new vocabulary
entry, stopping once the target vocabulary size is reached. This approach has proven
effective and is used in Transformer
models such as GPT, GPT-2, RoBERTa, BART, and DeBERTa. However, because the algorithm
builds its vocabulary only from unique characters and
frequent pairs, it is limited in capturing the characteristics and
patterns of different languages.
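
The traditional BPE training loop described above can be illustrated with a minimal Python sketch. It is not the implementation used in this dissertation: the function names train_bpe and merge_pair, the treatment of the corpus as a plain list of words, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def merge_pair(word, pair):
    """Replace every occurrence of `pair` in `word` (a tuple of symbols) with the merged symbol."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)


def train_bpe(words, vocab_size):
    """Learn BPE merges from a list of words until `vocab_size` is reached."""
    # Each distinct word becomes a tuple of characters weighted by its frequency.
    corpus = Counter(tuple(w) for w in words)
    # Base vocabulary: the unique characters found in the corpus.
    vocab = sorted({ch for word in corpus for ch in word})
    merges = []
    while len(vocab) < vocab_size:
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is a single symbol; nothing left to merge
        # Most frequent pair; ties are broken lexicographically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = best[0] + best[1]
        if merged not in vocab:
            vocab.append(merged)
        # Apply the merge everywhere in the corpus.
        new_corpus = Counter()
        for word, freq in corpus.items():
            new_corpus[merge_pair(word, best)] += freq
        corpus = new_corpus
    return vocab, merges


vocab, merges = train_bpe(["aaab", "aab"], vocab_size=3)
```

In this toy run the base vocabulary is the two characters a and b, and a single merge of the most frequent pair (a, a) brings the vocabulary to the requested size of three.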
The algorithm proposed in this dissertation extends BPE. Firstly, language-specific
tokens are introduced to allow the algorithm to learn language-specific context and to
distinguish between languages when it is trained on multilingual corpora. Secondly, common words
derived from the multilingual corpus are added to the initial/base vocabulary, which helps
training converge faster and may yield more meaningful or accurate tokenization.
Thirdly, a new common-words parameter is introduced, allowing the user to specify
the number of common words derived from the corpus to be used for the initial vocabulary
of the algorithm. Therefore, unlike traditional BPE, the proposed OBPE initializes
its base vocabulary with unique characters, language-specific tokens, and common words.
The limitations of traditional BPE also apply to WordPiece and Unigram, as
their vocabulary initialization is likewise based only on unique characters, which leaves
them with limited context awareness across languages and poor representation of rare tokens. The
optimized version of BPE addresses these issues, yielding an algorithm that learns
different patterns in each language and addresses the out-of-vocabulary (OOV) problem.
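
The three-part base vocabulary of OBPE (unique characters, language-specific tokens, and common words) can be sketched as follows. This is an illustration under stated assumptions, not the dissertation's implementation: the function name build_initial_vocab, the "<code>" format of the language tokens, the corpus layout, and the choice of "most frequent whitespace-separated words" as the definition of common words are all hypothetical. The toy corpus uses placeholder language codes and strings rather than real text.

```python
from collections import Counter


def build_initial_vocab(corpora, num_common_words):
    """Build an OBPE-style base vocabulary.

    `corpora` maps a language code to a list of sentences (hypothetical layout).
    `num_common_words` plays the role of the common-words parameter.
    """
    chars = set()
    word_counts = Counter()
    for sentences in corpora.values():
        for sentence in sentences:
            words = sentence.split()
            word_counts.update(words)
            for w in words:
                chars.update(w)
    # Assumption: one marker token per language, written as "<code>".
    lang_tokens = [f"<{code}>" for code in sorted(corpora)]
    # Assumption: common words are the most frequent words in the pooled corpus.
    common = [w for w, _ in word_counts.most_common(num_common_words)]
    # Concatenate the three parts, dropping duplicates while keeping order.
    vocab = []
    for token in lang_tokens + sorted(chars) + common:
        if token not in vocab:
            vocab.append(token)
    return vocab


toy_corpora = {"xx": ["aba aba ab"], "yy": ["ba aba"]}
initial_vocab = build_initial_vocab(toy_corpora, num_common_words=1)
```

A BPE-style merge loop would then start from this list instead of from the unique characters alone, so that frequent whole words are available as single tokens from the first iteration.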
The experimental setup of this work included assembling an extensive multilingual
data collection and comparing the OBPE algorithm against the traditional
algorithms mentioned above (BPE, WordPiece, and Unigram). Firstly, under
the same experimental setup, OBPE achieved a much higher tokenization accuracy of
96%, compared to 87-88% for BPE and WordPiece and, most notably, 25% for Unigram.
Secondly, OBPE outperformed the baseline algorithms (BPE, Unigram, and WordPiece) in
Exact Match Ratio (91%), average token length difference
(0.131), average edit distance (0.219), and average suffix precision and recall (97-98%).
Thirdly, the experiments
further demonstrated that the algorithms' performance improves as the vocabulary size
increases, especially for OBPE. Finally, a per-language breakdown of OBPE's performance
shows that OBPE performs best, especially on the
Nguni languages (Xhosa and Zulu), followed by BPE and WordPiece.
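
Two of the evaluation metrics named above can be illustrated with a short sketch. The abstract does not define them, so the definitions here are assumptions: Exact Match Ratio is taken as the fraction of words whose predicted segmentation equals the reference segmentation, and edit distance as the Levenshtein distance between the predicted and reference token sequences, averaged over words. The function names and the toy data are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def exact_match_ratio(predicted, reference):
    """Fraction of items whose predicted tokenization equals the reference."""
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def average_edit_distance(predicted, reference):
    """Mean token-level edit distance between predictions and references."""
    total = sum(edit_distance(p, r) for p, r in zip(predicted, reference))
    return total / len(reference)


# Toy data: one correct segmentation and one over-segmented word.
predicted = [["un", "happy"], ["cat", "s"]]
reference = [["un", "happy"], ["cats"]]
emr = exact_match_ratio(predicted, reference)      # 0.5
aed = average_edit_distance(predicted, reference)  # 1.0
```

Under these assumed definitions, a higher Exact Match Ratio and a lower average edit distance both indicate segmentations closer to the reference.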