Ben-Sarc: A Corpus for Sarcasm Detection from Bengali Social Media Comments and Its Baseline Evaluation

Sarcasm detection research in the Bengali language has so far been narrow due to the unavailability of resources. In this paper, we introduce 'Ben-Sarc', a large-scale self-annotated Bengali corpus for sarcasm detection containing 25,636 comments, manually collected from different public Facebook pages and evaluated by external evaluators. We then present a complete strategy for utilizing different traditional machine learning, deep learning, and transfer learning models to detect sarcasm from text using the Ben-Sarc corpus. Finally, we demonstrate a comparison between the performance of traditional machine learning, deep learning, and transfer learning models on our Ben-Sarc corpus. Transfer learning using Indic-Transformers Bengali BERT as a pre-trained source model achieved the highest accuracy of 75.05%. The second highest accuracy, 72.48%, was obtained by the LSTM model, and Multinomial Naive Bayes acquired the third highest with 72.36% accuracy, for deep learning and machine learning respectively. The Ben-Sarc corpus is made publicly available in the hope of advancing the Bengali Natural Language Processing community.

∗Corresponding author
Email addresses: sanzanalora@yahoo.com (Sanzana Karim Lora), shahariar_shibli.cse@aust.edu (G. M. Shahariar), tamanna.naz98@gmail.com (Tamanna Nazmin), nafirahman27@gmail.com (Noor Nafeur Rahman), rafsanrahman35549@gmail.com (Rafsan Rahman), miyadbhuiyan@gmail.com (Miyad Bhuiyan), faisal.cse@aust.edu (Faisal Muhammad Shah)

Preprint submitted to Computer Speech and Language, January 4, 2022


INTRODUCTION
Sarcasm is an ironic, stinging, sour, cutting statement or comment that indicates the reverse of what someone truly intends to express [1]. Sarcastic language conveys resentment concealed as humor and is intended to provoke, annoy, or convey contempt. As the intention of sarcasm is often vague and misleading, people cannot always discriminate between a true story and satire or irony [1].
Facebook, YouTube, and Twitter are influential social media platforms for sharing people's judgments, thoughts, opinions, and sentiments nowadays [2]. This large amount of available data offers scope for research in Natural Language Processing (NLP). Sarcasm detection in low-resource languages is a very narrow research area in NLP. Sarcasm detection is a subset of sentiment analysis, where the focus is on recognizing sarcasm rather than identifying a sentiment across the board [3]. Sarcasm detection research is available for high-resource languages such as English. But, despite Bengali being the world's seventh most spoken language with 240 million native speakers [4], research on sarcasm detection in Bengali is unexplored and overlooked. Due to limited resources and the scarcity of large-scale sarcasm data, identifying sarcasm from Bengali text is currently a difficult challenge for NLP researchers [5].
Social media comments express users' perspectives, judgments, and opinions on the content of a post. Any automatic detection system that uses machine learning is dependent on a large-scale dataset, as it requires rigorous training and testing. As far as we have noticed, there is no available Bengali text corpus for sarcasm detection. We have constructed a corpus named 'Ben-Sarc' that contains Facebook comments written in Bengali. Furthermore, we have classified the Bengali texts as sarcastic and non-sarcastic and proposed a sarcasm detection model using machine learning.
Our main contributions in this paper are summarized as follows:
• At first, we have constructed a large-scale self-annotated Bengali corpus for sarcasm detection. The corpus can be found at https://shorturl.at/oFJRZ.
• Next, we have evaluated our constructed corpus by external human evaluators who are experts in this field.
• Then, we have conducted a comprehensive experiment on this corpus to detect sarcasm from Bengali texts with the help of traditional Machine Learning, Deep Learning, and Transfer Learning approaches to set a baseline for future researchers.
In the next section, we briefly discuss related works on sarcasm detection in high- and low-resource languages. Section 3 describes the dataset creation along with the annotation process. Section 4 explains the proposed methodology. Section 5 contains the experimental results and their analysis, while Section 6 contains the conclusion and future work.

RELATED WORKS
The increasing engagement of social media users influences the quantitative and qualitative analysis of available data. Though most of the research is on the English language, sarcasm detection studies for low-resource languages such as Indonesian [6], Hindi [7] [8] [9], Czech [10], and Japanese [11] are available. We discuss some of the related approaches in the following literature review.

[9] experimented with traditional machine learning algorithms such as SVM, KNN, and Random Forest on 9,104 tweets from Twitter. [12] worked on sarcasm and irony detection separately. SVM, Naive Bayes (NB), Decision Tree, and Random Forest (RF) were applied to the irony detection dataset, whereas SVM and Random Forest (RF) were applied to the sarcasm detection dataset. The segregated experiments gained 64% accuracy on irony and 76% accuracy on sarcasm detection.
There exist a few models that use contextual information about tweets on Twitter to detect sarcasm. [13] focused on the context of authors and audiences of Twitter posts to identify sarcastic content with 85.1% accuracy. Binary logistic regression was applied to train the model on 19,534 tweets. [14] also focused on context for identifying sarcasm accurately. They collected 1,500 tweets and derived 6,774 history-based, 453 conversation-based, and 2,618 topic-based contextual tweets. A sequential SVM classifier exhibited a decent accuracy of 69.13%. [15] extracted 5,000 tweets that include texts, labels, and contexts and analyzed the dataset through Linear SVC, Logistic Regression (LR), Gaussian Naive Bayes (GNB), and Random Forest (RF) classifiers. They applied BERT and GloVe embeddings to the algorithms. Logistic Regression with GloVe embeddings gained 69% accuracy on the dataset that involves context.
Hashtags play a meaningful role in content on Twitter. [9] extracted 9,104 tweets containing hashtags such as "#sarcasm" and "#not" in Hindi and English. They implemented three classifiers: SVM, KNN, and Random Forest (RF). Random Forest (RF) showed an 81% accuracy on sarcasm detection.
[1] considered the impact of positive and negative situations on different sentiments to analyze sarcasm. They used a supervised SVM classifier and an N-gram classifier. To increase the accuracy, they optimized the RBF kernel, cost, and gamma parameters over 35,000 tweets.
[16] applied four models (bidirectional LSTM, LSTM and CNN, SVM, and Multi-layer Perceptron) to 9,400 samples collected from Reddit and Twitter. Each model used 10-fold cross-validation. The ensemble method achieved the best F1 score. Very few research works have combined deep learning models with transformer models to improve the accuracy of sarcasm detection. There is a limited number of contributions in the area of satire, irony, or sarcasm detection. [25] detected satire in Bengali documents. They created their own Word2Vec model and achieved an accuracy of 96.4% using a CNN model, but the dataset had insufficient data. [26] identified sarcasm from 41,350 Facebook posts considering public reactions, interactive comments, and images. They utilized machine learning algorithms and a CNN-based model to detect sarcasm from images. Though the dataset is adequately large, the annotation process should have received special attention.

Bengali
As far as we have seen, there is no comprehensive study that utilizes machine learning, deep learning, and transfer learning approaches together for sarcasm detection in Bengali.

DATASET CREATION

As far as we have seen, there is no available labeled dataset for sarcasm detection in Bengali, so we felt the need to create our own sarcasm detection dataset for the Bengali language. We named our dataset the Bengali Sarcasm dataset (Ben-Sarc). The dataset construction took approximately three months. In the following subsections, we discuss the features of the Ben-Sarc dataset in detail.

Content Source
As Facebook is one of the major sources of textual data [27], we have targeted public Facebook pages to construct the Ben-Sarc dataset. We have collected comments from these pages and have only retained those written in Bengali. All the comments have been scraped manually by the authors of this paper.

Text Preprocessing
Text preprocessing is generally a vital phase of natural language processing (NLP) problems [28].

Annotation Process
Each text in the Ben-Sarc dataset has been manually annotated using '0' and '1', as we intend to work on a binary classification problem: sarcasm detection. '0' represents non-sarcastic comments and '1' represents sarcastic comments. Each text in the Ben-Sarc dataset has been annotated by five annotators.
The final decision on the polarity of a single text has been made using majority voting over the five annotations. Facebook comments are frequently filled with harsh and filthy phrases, slang, and personal attacks [2] [24] [22]. As a result, we made sure that all annotators are adults and have domain knowledge.
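The majority-voting step described above can be sketched in a few lines of Python (an illustrative helper, not the authors' actual tooling):

```python
from collections import Counter

def majority_label(annotations):
    """Return the majority label from a list of binary annotations (0/1).
    With an odd number of annotators (five here), a tie is impossible."""
    return Counter(annotations).most_common(1)[0][0]

# Example: five annotators label one comment 1 (sarcastic) or 0 (non-sarcastic).
final_label = majority_label([1, 0, 1, 1, 0])  # three of five say sarcastic
```

With five annotators, a simple majority always exists, so no tie-breaking rule is needed.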

Human Evaluation of Ben-Sarc Dataset
To maintain the quality of a labeled dataset, evaluation is a necessary step. We have tried to make sure the data in the Ben-Sarc dataset is not labeled vaguely, keeping in mind that researchers can use it for further applications without hesitation. The assessment of the Ben-Sarc dataset has been carefully carried out by two external human evaluators who are experts in this field. Each evaluator is an adult, native Bengali speaker, proficient in Bengali. Each evaluator has been given the task of assessing the quality of the dataset by replying 'Yes' or 'No' to the questions stated below:

Q4. If Q1 is 'Yes', is there any information in the text that causes confusion in deciding whether the text is sarcastic or not?
The design of the questions for human evaluation of the Ben-Sarc dataset is motivated by [29]. Recent advancements in quality estimation of neural language generation (NLG) models have inspired these characteristics. [30] observed that what one person takes seriously, others may take as a joke, which leads to sarcasm.
The inter-annotator agreement is measured using Cohen's kappa coefficient [31] in Table 2. Cohen's kappa measures annotator agreement and determines how well one annotator agrees with another. To evaluate the conventional inter-annotator agreement, a pairwise kappa coefficient is computed as κ = (Po − Pe) / (1 − Pe), where Po represents the relative observed agreement and Pe denotes the hypothetical probability of chance agreement. The quality assessment of Ben-Sarc is done on 5,000 random samples of Ben-Sarc data. In most cases, the evaluators agree that the text seems ironic without any emoticons. Besides, a high percentage for Q2 indicates that dialect, manipulation of traditional poems and songs, and spelling mistakes also express sarcasm in the text, whereas a low percentage for Q3 indicates the opposite. However, Q4 raises an ambiguity in deciding whether a text is sarcastic or not. In our situation, Q3 and Q4 should both have a very low percentage, but the percentage of Q4 is comparatively higher than that of Q3 according to the inter-annotator agreement.
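The pairwise kappa computation can be illustrated with a small self-contained sketch (the function name and data are hypothetical; libraries such as scikit-learn provide an equivalent `cohen_kappa_score`):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators: (Po - Pe) / (1 - Pe)."""
    n = len(labels_a)
    # Po: relative observed agreement (fraction of identically labeled items).
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Pe: chance agreement, from each annotator's marginal label frequencies.
    categories = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Two evaluators' binary labels for eight sample comments (toy data).
kappa = cohens_kappa([1, 1, 0, 0, 1, 0, 1, 0], [1, 0, 0, 0, 1, 0, 1, 1])  # 0.5
```

Here the two evaluators agree on 6 of 8 items (Po = 0.75) with Pe = 0.5, giving κ = 0.5.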

Dataset Description
A detailed description of our Ben-Sarc dataset is presented in this section. Table 3 represents a short overview of our labeled dataset construction. The maximum length of a text in the Ben-Sarc dataset is 395 words.

PROPOSED METHODOLOGY
In this section, we present our proposed methodology for sarcasm detection. Figure 4 represents our proposed approach, which we have divided into five phases. The first phase comprises dataset construction. The second phase involves dataset preprocessing by utilizing a few natural language preprocessing techniques like punctuation removal and tokenization.
The third phase incorporates the feature selection process. This process includes TF-IDF (Term Frequency, Inverse Document Frequency) and n-grams for traditional machine learning models, word embeddings for deep learning models, and pre-trained transformer-based models for transfer learning. The fourth phase of our proposed method is the training phase. In this phase, we have employed traditional machine learning models, deep learning models, and transfer learning to classify a text as sarcastic or non-sarcastic. We have examined the performance of each classifier and present the best-performing classifier in the last phase.
The details of all the phases are discussed in the following subsections.

Phase I - Dataset Construction
We have collected 25,636 Facebook comments written in Bengali. The overall dataset construction process is described in Section 3.

Phase II - Preprocessing
A few pre-processing steps have been executed before model training, which include punctuation removal and tokenization.

Phase III - Feature Selection
Feature selection is the third phase of our proposed model. We have used three feature extraction approaches: n-grams, TF-IDF, and word embeddings.
For traditional machine learning classifiers, we have used TF-IDF and n-gram methods. TF-IDF is the most extensively utilized traditional feature extraction approach in classification applications [32]. It is a numerical statistic that reveals how essential a term is to a document in a collection. A word's TF-IDF value increases proportionally to the number of times the term appears in the document, but is offset by the frequency of the term in the corpus, which helps to balance out terms that occur more commonly in general.
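As an illustration, the plain (unsmoothed) TF-IDF weighting described above can be computed as follows; note that library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed variant, so exact values differ:

```python
import math

def tf_idf(docs):
    """TF-IDF for tokenized documents: tf(t, d) * log(N / df(t)), where
    tf is the term's relative frequency in the document and df is the
    number of documents containing the term."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return [
        {t: (doc.count(t) / len(doc)) * math.log(n / df[t]) for t in set(doc)}
        for doc in docs
    ]

# Toy corpus: terms shared across documents receive lower weights.
scores = tf_idf([["good", "movie"], ["bad", "movie"], ["good", "acting"]])
```

A term appearing in every document gets idf = log(1) = 0, so corpus-wide words are suppressed while document-specific words are emphasized.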

Phase IV - Training
To classify whether a text is sarcastic or not, we have investigated traditional classifiers, deep learning classifiers, and transfer learning techniques. A comprehensive description of all the models is presented in the following subsections.
• Convolutional Neural Network (CNN): A CNN is composed of convolutional and pooling layers. The convolutional layers include weights that must be learned, whereas the pooling layers transform the activation using a fixed function.
-Convolutional Layer: A convolutional layer is made up of a number of kernels whose parameters must be learned. It is a local feature extractor layer with well-trained kernels for weight modification utilizing the back-propagation approach [37]. The kernels' height and width are less than those of the input volume. Every filter is convolved with the input volume to generate a neuron activation map. The convolutional layer's output volume is calculated by stacking the activation maps of all filters along the depth dimension. The convolution output is calculated by convolving an input (I) with a number of filters as follows:

x_k = I * W_k + b_k,  k = 1, 2, 3, ..., F

where F is the number of filters, x_k is the output corresponding to the k-th convolution filter, W_k is the weights of the k-th filter, and b_k is the k-th bias.

• Pre-trained Word Embeddings: Pre-trained word embeddings are embeddings learned in one task and then applied to solve another related problem. In this paper, we have used the following pre-trained word embeddings available in Bengali.
-GloVe [39] creates feature vectors based on global and local word counts, word-word co-occurrence, and local context with the center word. GloVe captures semantic and syntactic features effectively; however, owing to matrix factorization, it takes a long time. In our task, we have used Bengali-GloVe.
-Word2Vec [40] is a prediction-based embedding approach that generates an embedding vector from the center word to the context word or vice versa. In this paper, we have used Bengali-Word2Vec.
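The convolution x_k = I * W_k + b_k from the convolutional-layer description above can be illustrated for a single filter on a one-dimensional input (a minimal sketch; real frameworks operate on multi-channel volumes and, strictly speaking, compute cross-correlation):

```python
def conv1d(inp, kernel, bias):
    """'Valid' 1D sliding-window convolution of one filter over an input,
    producing one activation map; the bias is added at every position."""
    k = len(kernel)
    return [
        sum(inp[i + j] * kernel[j] for j in range(k)) + bias
        for i in range(len(inp) - k + 1)
    ]

# One filter (F = 1) applied to a toy input signal I.
activation_map = conv1d([1, 2, 3, 4], kernel=[1, 0, -1], bias=0.5)  # [-1.5, -1.5]
```

Stacking the activation maps of F such filters along a depth dimension yields the layer's output volume, as described above.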

Transfer Learning
Transfer learning is a machine learning procedure in which the starting point for a new task is a model already built for a similar task [44]. Transfer learning approaches have been effectively used for speech recognition, document categorization, and sentiment analysis in natural language processing [45]. Figure 5 represents an illustration of the transfer learning approach.
• Indic-Transformers Bengali BERT [49], a BERT language model that has been pre-trained on about 3 GB of monolingual training corpus, majorly taken from OSCAR. It has achieved state-of-the-art performance on the Bengali language for text classification tasks.
The weights of the appended layers are updated during model training.
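The idea that only the appended layers' weights are updated, while the pre-trained encoder stays frozen, can be sketched without any framework: fixed feature vectors stand in for the frozen encoder's outputs, and a single appended logistic layer is trained by gradient descent on the negative log-likelihood loss (all names and data here are illustrative, not the paper's actual setup):

```python
import math

def train_appended_layer(features, labels, lr=0.5, epochs=200):
    """Train only an appended logistic output layer on frozen features,
    mimicking transfer learning where the encoder is not updated."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid output
            g = p - y                       # gradient of the NLL loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

# Toy frozen "encoder outputs" for four comments and their sarcasm labels.
feats = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
w, b = train_appended_layer(feats, [1, 1, 0, 0])
```

In a real pipeline, the frozen features would come from the pre-trained transformer and the appended layer would be part of the same computation graph; the update rule for the appended weights is the same.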

Different optimizers and learning rates are experimented with to obtain the optimized hyperparameters, as explained in section 5.4.3.

Phase V - Evaluation
In the last phase, we have measured the performance of all the models from Phase IV. The achieved results are then compared and the best-performing model is reported. The details of measuring the performance of the models are discussed briefly in the experiments section.

Experimental Setup
The Python Keras framework with a TensorFlow backend is used to implement all deep learning models, and the PyTorch library is used for training, tuning, and testing the transfer learning models. Experimental evaluation was conducted on a machine with an Intel Core i5 processor with a 2.71 GHz clock speed and 4 GB RAM. TensorFlow-based experiments can utilize GPU instructions. Google Colaboratory has been used for developing all the models described in this paper, as we have used the Python language.

Experiments
Our experiments are categorized into three parts. The performance of this LSTM+CNN model decreased further afterwards. For all cases, Negative Log-Likelihood (NLL) [50] loss has been used as the loss function.

Result Analysis
The highest accuracy from each experiment has been shown briefly in

Hyperparameter Tuning
Hyperparameter tuning is a necessary stage in each experiment to boost performance. Hyperparameter tuning has been carried out on all of our experiments I, II, and III. A thorough explanation of all the models is provided in the following subsections.

For Experiment I
For Experiment I, we have applied 5-fold and 10-fold cross-validation on the seven traditional classifiers. The performance of the other classifiers cannot surpass these results.
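The k-fold scheme can be sketched as follows (an illustrative index splitter; the paper does not state its exact splitting code, and libraries such as scikit-learn's `KFold` are typically used in practice):

```python
def k_fold_indices(n_samples, k):
    """Split sample indices into k roughly equal, non-overlapping folds.
    Each fold serves once as the test set while the rest form the training set."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds

# 5-fold split of a 10-sample dataset: five (train, test) index pairs.
splits = k_fold_indices(10, 5)
```

Averaging a classifier's score over the k held-out folds gives the cross-validated accuracy reported for each setting.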

CONCLUSION AND FUTURE WORK
In this paper, we have presented a benchmark dataset of Bengali sarcastic comments on Facebook to make an impact on one of the low-resource languages.