Preprint / Version 1

LLM Survey Analysis Using Random Forest

Authors

  • Priyanka Singla, Boise State University

DOI:

https://doi.org/10.31224/3956

Keywords:

Random Forest

Abstract

This project applies a Random Forest classifier to metadata from survey papers on large language models (LLMs), a rapidly growing area within AI, with the goal of helping new researchers see trends and patterns in LLM survey publications. A structured workflow of data loading, exploration, manipulation, and visualization was used to analyze key attributes such as release dates, categories, and taxonomies. TF-IDF vectorization, one-hot encoding, and feature scaling were used to build the feature matrix, and the classifier’s hyperparameters were tuned with grid search. The model reached perfect training accuracy but a test accuracy of only 0.39, which indicates overfitting, likely caused by dataset imbalance; the best cross-validation score was 0.26. Future work will focus on addressing data imbalance, improving feature engineering, and exploring alternative models. The project highlights trends in LLM research and suggests paths for improving model accuracy.
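As a rough illustration of the workflow the abstract describes, the following sketch combines TF-IDF vectorization, one-hot encoding, and feature scaling into one feature matrix, then tunes a Random Forest with grid search using scikit-learn. The toy data, the column names (`title`, `category`, `release_year`, `taxonomy`), and the parameter grid are all assumptions made for illustration; the abstract does not specify the dataset schema or the grid that was searched, and `class_weight="balanced"` is shown only as one common option for imbalanced classes, not as the author's method.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy metadata standing in for the real survey dataset.
topics = {
    "Evaluation": ["benchmark", "metrics", "evaluation"],
    "Training": ["pretraining", "finetuning", "optimization"],
    "Applications": ["medicine", "education", "robotics"],
}
rows = []
for label, words in topics.items():
    for i in range(8):
        rows.append({
            "title": f"A survey of {words[i % 3]} for large language models",
            "category": "cs.CL" if i % 2 == 0 else "cs.AI",
            "release_year": 2021 + i % 4,
            "taxonomy": label,  # assumed target label
        })
papers = pd.DataFrame(rows)

# Build the feature matrix: TF-IDF for text, one-hot for categories,
# standard scaling for the numeric column.
features = ColumnTransformer([
    ("title_tfidf", TfidfVectorizer(), "title"),
    ("category_onehot", OneHotEncoder(handle_unknown="ignore"), ["category"]),
    ("year_scaled", StandardScaler(), ["release_year"]),
])

model = Pipeline([
    ("features", features),
    ("forest", RandomForestClassifier(random_state=0)),
])

# Assumed grid; class_weight="balanced" reweights classes by inverse frequency.
param_grid = {
    "forest__n_estimators": [50, 100],
    "forest__max_depth": [None, 5],
    "forest__class_weight": [None, "balanced"],
}

X = papers.drop(columns="taxonomy")
y = papers["taxonomy"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

search = GridSearchCV(model, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

# A large gap between training and test accuracy is the overfitting signal
# the abstract refers to.
train_acc = search.score(X_train, y_train)
test_acc = search.score(X_test, y_test)
print("best params:", search.best_params_)
print("best CV accuracy:", search.best_score_)
print("train accuracy:", train_acc, "test accuracy:", test_acc)
```

Fitting the preprocessing steps inside the pipeline means they are refit on each cross-validation fold, so the cross-validation score is not inflated by vocabulary or scaling statistics learned from held-out rows.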

Posted

2024-10-03