Evaluation and Benchmarking of Generative and Agentic AI Systems: A Comprehensive Survey
DOI: https://doi.org/10.31224/6009

Keywords: Agentic AI, Generative AI, AI Evaluation and Benchmarking, Autonomous Intelligent Systems, Tool-Augmented Language Models, Multi-Agent Systems, Long-Horizon Planning, Memory and Reasoning Evaluation, Adaptive Monitoring, AI Safety and Robustness, Multimodal AI Evaluation

Abstract
The rapid emergence of generative and agentic artificial intelligence (AI) has outpaced traditional evaluation practices. While large language models excel on static language benchmarks, real-world deployment demands more than accuracy on curated tasks. Agentic systems rely on planning, tool invocation, memory and multi-agent collaboration to carry out complex workflows. Enterprise adoption therefore hinges on holistic assessments that cover cost, latency, reliability, safety and multi-agent coordination. This survey provides a comprehensive taxonomy of evaluation dimensions, reviews existing benchmarks for generative and agentic systems, identifies gaps between laboratory tests and production requirements, and proposes future directions for more realistic, multi-dimensional benchmarking.
License
Copyright (c) 2025 Manish Shukla

This work is licensed under a Creative Commons Attribution 4.0 International License.