Robust Yorùbá Named Entity Recognition through Simple Mixed Training
DOI:
https://doi.org/10.31224/5640
Keywords:
Yorùbá, named entity recognition, diacritics, code-switching, robustness, low-resource languages
Abstract
Yorùbá named entity recognition (NER) is sensitive to missing tone marks and in-line English, both common in Nigerian social media and news text. Using the Yorùbá split of MasakhaNER 2.0, we quantify these effects and present a minimal fix. A standard xlm-roberta-base fine-tune, evaluated with seqeval, scores F1 = 0.832 on the clean test set, falls to 0.584 when diacritics are stripped, and remains stable under light code-switching (0.834). We then train on a 50–50 mix of the original training set and a de-diacritised copy, keeping the BIO tags intact. This lifts F1 on the no-diacritics test set to 0.842, with clean at 0.854 and code-switch at 0.857. Per-entity analysis shows the largest gains for the DATE and PER entity types. The results align with prior work on Yorùbá diacritic restoration and with evidence that Yorùbá–English code-switching is frequent in Nigeria. We release an end-to-end notebook, hosted on GitHub, to support reuse. The method is simple, cheap, and effective, and can serve as a baseline for responsible NLP on low-resource African languages.
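The mixed-training idea described in the abstract can be illustrated with a minimal sketch. This is not the released notebook: the function names `strip_diacritics` and `mixed_training_set` are hypothetical, and two assumptions are made here that the abstract does not confirm. First, "stripping diacritics" is taken to mean removing every combining mark (tone marks and underdots alike). Second, the 50–50 mix is taken to mean concatenating the original sentences with a de-diacritised copy of each. Because stripping happens token by token, the BIO tag sequence can be reused unchanged.

```python
import unicodedata


def strip_diacritics(token: str) -> str:
    """Remove all combining marks from a token (assumed definition of 'de-diacritised')."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", token)
    # Category "Mn" (nonspacing mark) covers tone marks and the underdot.
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


def mixed_training_set(sentences):
    """Return the original (tokens, tags) pairs followed by a de-diacritised copy.

    Tags are copied as-is: stripping is per token, so token count and
    alignment with the BIO labels do not change.
    """
    stripped = [
        ([strip_diacritics(tok) for tok in tokens], list(tags))
        for tokens, tags in sentences
    ]
    return list(sentences) + stripped


# Illustrative sentence only; not taken from MasakhaNER 2.0.
example = [(["Ọbásanjọ́", "lọ", "sí", "Èkó"], ["B-PER", "O", "O", "B-LOC"])]
mixed = mixed_training_set(example)
for tokens, tags in mixed:
    print(list(zip(tokens, tags)))
```

The resulting list has twice as many sentences as the input, half with diacritics and half without, and can be passed to any token-classification training loop in place of the original training split.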
License
Copyright (c) 2025 Christian Adeoye Adebambo

This work is licensed under a Creative Commons Attribution 4.0 International License.