Online Twitter Bot Detection: A Comparison Study of Vectorization and Classification Methods on Balanced and Imbalanced Data
DOI:
https://doi.org/10.31224/3139
Abstract
In this study, we classify whether a Tweet was written by a human or a bot. We are particularly interested in comparing the performance of different text vectorization methods and classification models on both imbalanced and balanced data, using the F1-score and the confusion matrix as evaluation measures. The text is preprocessed by tokenization, removal of stop words and punctuation marks, and stemming. Bag-of-words (BoW), TF-IDF, Doc2Vec, BERT, and fastText are used for feature extraction, and Support Vector Machine, Logistic Regression, and Naive Bayes are used as classifiers. The results suggest that the neural embedding methods (Doc2Vec, BERT, and fastText, of which BERT is Transformer-based) are particularly effective when handling imbalanced data.
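The preprocessing, bag-of-words, and evaluation steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the stop-word list, the function names, and the toy labels are assumptions made for illustration, stemming is omitted, and only the Python standard library is used. The toy labels show why the F1-score and confusion matrix are more informative than accuracy on imbalanced data.

```python
import string
from collections import Counter

# Toy stop-word list (an assumption; a real pipeline would use a fuller list).
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of", "in"}


def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


def bag_of_words(tokens, vocabulary):
    """Return the count of each vocabulary word in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def confusion_matrix(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the bot class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical imbalanced labels: 9 human Tweets (0) and 1 bot Tweet (1).
y_true = [0] * 9 + [1]
# A degenerate classifier that always predicts "human".
y_pred = [0] * 10

tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
f1 = f1_score(tp, fp, fn)
print("accuracy:", accuracy, "F1 (bot class):", f1)
```

In this toy case the always-human classifier is right on nine of ten Tweets yet never detects the bot, so its F1-score for the bot class is zero while its accuracy looks high.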
License
Copyright (c) 2023 Jiahe Ling, Yicong Chen

This work is licensed under a Creative Commons Attribution 4.0 International License.