Preprint / Version 2

Taxonomy Classification using Machine Learning Based Model

DOI:

https://doi.org/10.31224/3967

Keywords:

Data Exploration, Data Visualization, Classifiers, LLM Survey, TF-IDF, Confusion Matrix

Abstract

The trends and taxonomy of large language models (LLMs) have changed rapidly in the last few years, driven primarily by advances in natural language processing (NLP) and deep learning and by the ever-growing availability of computational resources. These models aim to enhance logical and mathematical reasoning beyond pattern recognition. This work explores trends in survey papers over time and analyzes their associated taxonomies through data exploration, visualization, and machine learning modeling. First, the dataset of survey papers is preprocessed by counting surveys per year and month, revealing publication trends over time. A detailed analysis of the taxonomy distribution then identifies the prevalence of the various survey categories. The titles and summaries of the papers are vectorized with the TF-IDF method, transforming textual information into numerical features, and the survey categories are one-hot encoded to enable better feature representation for machine learning models. The results show that the Random Forest Classifier and the Support Vector Machine achieved the highest accuracies in classifying survey papers by taxonomy. This research not only highlights trends in the publication of surveys but also offers an automated approach for classifying them, potentially aiding future research in organizing and categorizing survey literature efficiently.
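The preprocessing steps named in the abstract (counting surveys per year and month, TF-IDF vectorization of text, and one-hot encoding of categories) can be illustrated with a minimal standard-library sketch. This is not the authors' code: the toy records, the function names, and the tokenizer are hypothetical, and the smoothed IDF formula with L2 normalization is an assumed variant, since the abstract does not specify one. The classifiers themselves (Random Forest, Support Vector Machine) are omitted.

```python
import math
import re
from collections import Counter
from datetime import date

# Hypothetical toy records: (publication date, title, taxonomy category).
papers = [
    (date(2023, 5, 2), "A survey of large language models", "General"),
    (date(2023, 5, 20), "Survey of prompting methods for language models", "Prompting"),
    (date(2024, 1, 11), "Evaluation of large language models: a survey", "Evaluation"),
]


def surveys_per_month(records):
    """Count records per (year, month) to expose publication trends."""
    return Counter((d.year, d.month) for d, _, _ in records)


def tokenize(text):
    """Lowercase and keep alphanumeric runs (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf(docs):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights.

    Assumed variant: idf(t) = ln((1 + n) / (1 + df(t))) + 1.
    """
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return vocab, rows


def one_hot(labels):
    """Return (sorted class names, one indicator row per label)."""
    classes = sorted(set(labels))
    return classes, [[1 if label == c else 0 for c in classes] for label in labels]


monthly_counts = surveys_per_month(papers)
vocab, features = tfidf([title for _, title, _ in papers])
classes, encoded = one_hot([category for _, _, category in papers])
```

In this sketch, terms that occur in every title (such as "survey") receive a lower weight than terms that occur in only some titles (such as "large"), which is the property that makes TF-IDF features useful as classifier input.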

Posted

2024-09-30 — Updated on 2024-10-22

Version justification

I have revised the article with new content and am submitting this latest version for publication.