Preprint has been published in a journal as an article
DOI of the published article https://doi.org/10.1017/nlp.2024.11
Preprint / Version 2

Ben-Sarc: A Self-Annotated Corpus for Sarcasm Detection from Bengali Social Media Comments and Its Baseline Evaluation

DOI:

https://doi.org/10.31224/osf.io/7yb4c

Keywords:

Bengali sarcasm, Bengali sarcasm detection, sarcasm, sarcasm detection

Abstract

Sarcasm detection research for the Bengali language has so far been limited by the unavailability of resources. In this paper, we introduce Ben-Sarc, a large-scale self-annotated Bengali corpus for sarcasm detection containing 25,636 comments, manually collected from different public Facebook pages and evaluated by external evaluators. We then present a complete strategy for applying traditional machine learning, deep learning, and transfer learning models to detect sarcasm from text using the Ben-Sarc corpus. Finally, we compare the performance of these three families of models on the corpus. Transfer learning using Indic-Transformers Bengali BERT as a pre-trained source model achieved the highest accuracy of 75.05%. The LSTM model (deep learning) obtained the second-highest accuracy at 72.48%, and Multinomial Naive Bayes (traditional machine learning) the third-highest at 72.36%. The Ben-Sarc corpus is made publicly available in the hope of advancing the Bengali Natural Language Processing community.

Posted

2022-01-17 — Updated on 2024-11-16

Version justification

After acceptance