Robust Yorùbá Named Entity Recognition through Simple Mixed Training

Christian Adeoye Adebambo

doi:10.31224/5640

Authors

Christian Adeoye Adebambo African Leadership University, School of Wildlife Conservation https://orcid.org/0009-0006-1418-9096

DOI:

https://doi.org/10.31224/5640

Keywords:

Yorùbá, named entity recognition, diacritics, code-switching, robustness, low-resource languages

Abstract

Yorùbá named entity recognition (NER) is sensitive to missing tone marks and in-line English, both common in Nigerian social and news text. Using the Yorùbá split of MasakhaNER 2.0, we quantify these effects and present a minimal fix. A standard fine-tuned xlm-roberta-base model scores F1 = 0.832 on the clean test set, falls to 0.584 when diacritics are stripped, and remains stable under light code-switching (0.834); all scores are measured with seqeval. We then train on a 50–50 mix of the original training set and a de-diacritised copy, keeping the BIO tags intact. This lifts the no-diacritics test F1 to 0.842 while keeping the clean score at 0.854 and the code-switched score at 0.857. Per-entity analysis shows the largest gains for the DATE and PER entity types. Results align with prior work on Yorùbá diacritic restoration and with evidence that Yorùbá–English code-switching is frequent in Nigeria. We release an end-to-end notebook hosted on GitHub to support reuse. The method is simple, cheap, and effective, and can serve as a baseline for responsible NLP on low-resource African languages.
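
The mixed-training step described in the abstract can be illustrated with a short Python sketch. This is not the released notebook: the function names `strip_diacritics` and `mixed_training_set` are hypothetical, the example sentence is invented, and the sketch assumes that "de-diacritised" means removing all Unicode combining marks (tone marks and under-dots alike) and that the 50–50 mix is formed by pairing every original sentence with one stripped copy.

```python
import unicodedata


def strip_diacritics(token: str) -> str:
    """Remove combining marks (tone marks and under-dots) from one token."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", token)
    # Assumption: every combining mark is dropped, so "ọ́" becomes "o".
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def mixed_training_set(sentences):
    """Return each original sentence followed by a de-diacritised copy.

    `sentences` is a list of (tokens, tags) pairs. Stripping is done token
    by token, so the token count never changes and the BIO tags can be
    reused unchanged for the stripped copy.
    """
    mixed = []
    for tokens, tags in sentences:
        mixed.append((list(tokens), list(tags)))
        mixed.append(([strip_diacritics(t) for t in tokens], list(tags)))
    return mixed


# Hypothetical one-sentence training set, for illustration only.
example = [(["Ọbásanjọ́", "lọ", "sí", "Èkó"], ["B-PER", "O", "O", "B-LOC"])]
mixed = mixed_training_set(example)
for tokens, tags in mixed:
    print(tokens, tags)
```

Because the transformation works on individual tokens rather than on the raw sentence string, token-to-tag alignment is preserved by construction, which is what allows the labels to be copied over without re-annotation.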

