Preprint / Version 1

Automated Classification and Trend Analysis of Large Language Model Survey Papers Using Machine Learning and Natural Language Processing Techniques

Authors

Meherunnesa Tania, Boise State University

Abstract

This study investigates the application of machine learning (ML) and natural language processing (NLP) techniques to classify academic survey papers into predefined taxonomy categories. The dataset, consisting of paper titles, summaries, release dates, taxonomy labels, and categories, was analyzed to uncover trends and patterns in the publication of research papers. Exploratory data analysis (EDA) revealed important insights through visualizations, such as publication trends over time, the distribution of taxonomy categories, and the most common terms used in paper summaries. Key NLP techniques, including Term Frequency-Inverse Document Frequency (TF-IDF), were employed to transform the textual data into numerical features, while one-hot encoding was applied to the categorical data. A Random Forest Classifier was trained on the extracted feature matrix to predict the taxonomy category of each paper. The model achieved promising accuracy, effectively capturing patterns in the dataset. The study also identified areas for future improvement, including addressing class imbalance and exploring more sophisticated models. These findings demonstrate the potential of ML and NLP for automating the classification of academic papers, providing a scalable solution for managing large collections of research literature while offering insights into publication dynamics and trends.
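The feature-construction step described in the abstract can be illustrated with a minimal, standard-library-only sketch. This is not the author's code. The toy summaries, the category labels, and the helper names `tfidf_matrix` and `one_hot` are hypothetical. Whitespace tokenization and the smoothed IDF formula ln((1 + N) / (1 + df)) + 1 are assumptions chosen for simplicity, and the classifier itself is left out.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of vocab[j] in docs[i]."""
    # Assumption: lowercase + whitespace tokenization, no stop-word removal.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Assumption: smoothed IDF, ln((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # guard against empty documents
        rows.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, rows


def one_hot(values):
    """Return (categories, rows); each row has a single 1.0 at its category index."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = [[1.0 if index[v] == j else 0.0 for j in range(len(categories))]
            for v in values]
    return categories, rows


# Hypothetical toy records standing in for paper summaries and categories.
summaries = [
    "survey of large language models",
    "survey of prompt engineering methods",
    "evaluation of large language models",
]
categories = ["cs.CL", "cs.AI", "cs.CL"]

vocab, text_features = tfidf_matrix(summaries)
cats, cat_features = one_hot(categories)

# Concatenate text and categorical features into one row per paper.
features = [t + c for t, c in zip(text_features, cat_features)]
```

Each row of `features` holds one paper's TF-IDF weights followed by its one-hot category indicator. A matrix of this shape, paired with the taxonomy labels, is the kind of input a random forest classifier (for example scikit-learn's `RandomForestClassifier`) would be trained on.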

Posted

2024-10-02

License

This work is licensed under a Creative Commons Attribution 4.0 International License.