Interpretability and Trust in Large Language and Agentic Models: A Survey of Methods, Metrics, and Applications
DOI: https://doi.org/10.31224/6090

Abstract
Large language models (LLMs) and the agentic systems that
embed them are being deployed across finance, healthcare, law and
other high-stakes domains. Their emergence has intensified concerns
about interpretability—the ability to understand how the models produce
their outputs—and trust—the confidence that the models behave
reliably and ethically. The opaque internal representations of
deep neural models mean that decisions may be unpredictable or unfair,
undermining public confidence and limiting adoption. This paper
surveys the state of interpretability and trust for both standalone
LLMs and agentic AI systems, synthesising methodological advances,
evaluation metrics and real-world applications. We organise methods
into feature-attribution techniques such as LIME and SHAP, example-based
and counterfactual explanations, process-level and mechanistic
interpretability, and system-level approaches tailored to agentic multi-agent
systems. We then review evaluation frameworks that measure
explanation quality, fairness, robustness and other trust dimensions,
including recent benchmarks like TrustLLM and psychometric scales
for human–LLM trust. We discuss how interpretability interacts with
safety, robustness, privacy and ethics, and how adaptive monitoring
and balanced evaluation frameworks can promote trustworthy deployment.
Finally, we highlight open research challenges in ensuring
that increasingly autonomous agentic systems remain transparent, accountable
and aligned with human values.
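As an illustration of the feature-attribution family surveyed above, the sketch below shows the local-surrogate idea behind methods such as LIME: perturb a single input, weight the perturbations by their proximity to that input, and read feature attributions off the coefficients of a weighted linear fit. This is a minimal, self-contained sketch; the black_box function and the instance values are hypothetical stand-ins, not the LIME or SHAP library APIs.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def black_box(X):
    # Stand-in for an opaque model; returns a score for each row of X.
    return 2.0 * X[:, 0] + np.sin(X[:, 1]) - 0.5 * X[:, 2]

instance = np.array([1.0, 0.5, -1.0])            # the prediction to explain
perturbations = instance + rng.normal(0.0, 0.5, size=(500, 3))
distances = np.linalg.norm(perturbations - instance, axis=1)
weights = np.exp(-(distances ** 2) / 0.25)       # proximity kernel around the instance

# Weighted linear surrogate; its coefficients act as local feature attributions.
surrogate = Ridge(alpha=1.0)
surrogate.fit(perturbations, black_box(perturbations), sample_weight=weights)
print("local feature attributions:", surrogate.coef_)

In this toy setting the first coefficient dominates, mirroring the black-box model's strong linear dependence on the first feature near the chosen instance; production tools like LIME and SHAP refine the same idea with sampling schemes, kernels and sparsity constraints.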
License
Copyright (c) 2025 Jithesh Yemi Reddy

This work is licensed under a Creative Commons Attribution 4.0 International License.