Preprint / Version 1

Online Twitter Bot Detection: A Comparison Study of Vectorization and Classification Methods on Balanced and Imbalanced Data

Authors

  • Jiahe Ling, University of Wisconsin-Madison
  • Yicong Chen, University of Wisconsin-Madison

DOI:

https://doi.org/10.31224/3139

Abstract

In this study, we aim to classify whether a Tweet comes from a human or a bot. We are particularly interested in comparing the performance of different text vectorization methods and classification models on both imbalanced and balanced data, using the F1-score and the confusion matrix. The text is preprocessed with tokenization, removal of stop words and punctuation marks, and stemming. Bag-of-words (BoW), TF-IDF, Doc2Vec, BERT, and fastText are used for feature extraction, and Support Vector Machine, Logistic Regression, and Naive Bayes are used as classifiers. The results suggest that the learned embedding methods (Doc2Vec, BERT, and fastText, of which BERT is Transformer-based) are particularly effective when handling imbalanced data.
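
To make the evaluation setup in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It uses only the Python standard library and shows a simplified preprocessing step (tokenization plus stop-word and punctuation removal, with stemming omitted), a Bag-of-words count, and the confusion matrix and F1-score on made-up imbalanced labels. The stop-word list, the example Tweet, the labels, and all function names are assumptions introduced for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of", "in", "this"}

def preprocess(text):
    """Lowercase, tokenize on letter runs (which drops punctuation), remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def bag_of_words(tokens):
    """Map each remaining token to its count in the document."""
    return Counter(tokens)

def confusion_matrix(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the bot class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical Tweet text, used only to exercise the preprocessing step.
bow = bag_of_words(preprocess("This is the BEST deal, click the link to win!"))

# Made-up imbalanced labels: 8 human (0) and 2 bot (1) Tweets,
# scored against a trivial classifier that always predicts "human".
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
f1 = f1_score(tp, fp, fn)
print(dict(bow), (tp, fp, fn, tn), accuracy, f1)
```

In this toy case the always-human classifier reaches high accuracy while its F1-score on the bot class is zero, which illustrates why F1 and the confusion matrix are informative metrics when the classes are imbalanced.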

Posted

2023-07-28