Preprint / Version 1

Robust Yorùbá Named Entity Recognition through Simple Mixed Training

DOI:

https://doi.org/10.31224/5640

Keywords:

Yorùbá, named entity recognition, diacritics, code-switching, robustness, low-resource languages

Abstract

Yorùbá named entity recognition (NER) is sensitive to missing tone marks and in-line English, both common in Nigerian social media and news text. Using the Yorùbá split of MasakhaNER 2.0, we quantify these effects and present a minimal fix. A standard fine-tuned xlm-roberta-base model scores F1 = 0.832 on the clean test set, falls to 0.584 when diacritics are stripped, and remains stable under light code-switching (0.834); scores are computed with seqeval. We then train on a 50–50 mix of the original training set and a de-diacritised copy, keeping BIO tags intact. This lifts the no-diacritics test F1 to 0.842, while clean and code-switched F1 hold at 0.854 and 0.857 respectively. Per-entity analysis shows the largest gains for the DATE and PER entity types. Results align with prior work on Yorùbá diacritic restoration and with evidence that Yorùbá–English code-switching is frequent in Nigeria. We release an end-to-end notebook hosted on GitHub to support reuse. The method is simple, cheap and effective, and can serve as a baseline for responsible NLP on low-resource African languages.
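
The mixed-training idea in the abstract can be sketched in a few lines of Python. This is an illustrative sketch and not the released notebook: the helper names `strip_diacritics` and `mixed_training_set` are hypothetical, the choice to remove every combining mark (tone marks and underdots alike) is an assumption about what "de-diacritised" means, and concatenating the original set with a stripped copy is one reading of the "50–50 mix". Because stripping acts inside each token, the BIO tags carry over unchanged.

```python
import random
import unicodedata


def strip_diacritics(token: str) -> str:
    """Remove all combining marks from a token (assumed definition of de-diacritisation)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", token)
    # Category "Mn" (nonspacing mark) covers tone marks and underdots.
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


def mixed_training_set(sentences, seed=0):
    """Return the original sentences plus a de-diacritised copy, shuffled.

    `sentences` is a list of (tokens, bio_tags) pairs. Tokens are rewritten
    one-for-one, so the tag sequence of each copy is left untouched.
    """
    originals = [(list(tokens), list(tags)) for tokens, tags in sentences]
    stripped = [
        ([strip_diacritics(t) for t in tokens], list(tags))
        for tokens, tags in sentences
    ]
    mixed = originals + stripped  # half original, half de-diacritised
    random.Random(seed).shuffle(mixed)
    return mixed


# Toy sentence for illustration only; it is not taken from MasakhaNER 2.0.
example = [
    (["Ọbáfẹ́mi", "Awólọ́wọ̀", "lọ", "sí", "Èkó"],
     ["B-PER", "I-PER", "O", "O", "B-LOC"]),
]
mixed = mixed_training_set(example)
```

The resulting list can be fed to any token-classification fine-tuning loop in place of the original training split; the same `strip_diacritics` function can also produce the no-diacritics test condition.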

Posted

2025-10-22