Interpretability and Trust in Large Language and Agentic Models: A Survey of Methods, Metrics, and Applications
DOI: https://doi.org/10.31224/6090

Abstract
Large language models (LLMs) and the agentic systems that
embed them are being deployed across finance, healthcare, law and
other high-stakes domains. Their emergence has intensified concerns
about interpretability—the ability to understand how the models produce
their outputs—and trust—the confidence that the models behave
reliably and ethically. The opaque internal representations of
deep neural models mean that decisions may be unpredictable or unfair,
undermining public confidence and limiting adoption. This paper
surveys the state of interpretability and trust for both standalone
LLMs and agentic AI systems, synthesising methodological advances,
evaluation metrics and real-world applications. We organise methods
into feature-attribution techniques such as LIME and SHAP, example-based
and counterfactual explanations, process-level and mechanistic
interpretability, and system-level approaches tailored to agentic multi-agent
systems. We then review evaluation frameworks that measure
explanation quality, fairness, robustness and other trust dimensions,
including recent benchmarks like TrustLLM and psychometric scales
for human–LLM trust. We discuss how interpretability interacts with
safety, robustness, privacy and ethics, and how adaptive monitoring
and balanced evaluation frameworks can promote trustworthy deployment.
Finally, we highlight open research challenges in ensuring
that increasingly autonomous agentic systems remain transparent, accountable
and aligned with human values.
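As an illustration of the feature-attribution family surveyed above, the sketch below shows the local-surrogate idea behind methods such as LIME: perturb a single input, weight the perturbations by their proximity to that input, and read feature attributions off the coefficients of a weighted linear fit. This is a minimal, self-contained sketch; the black_box function and the instance values are hypothetical stand-ins, not the LIME or SHAP library APIs.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def black_box(X):
    # Stand-in for an opaque model; returns a score for each row of X.
    return 2.0 * X[:, 0] + np.sin(X[:, 1]) - 0.5 * X[:, 2]

instance = np.array([1.0, 0.5, -1.0])            # the prediction to explain
perturbations = instance + rng.normal(0.0, 0.5, size=(500, 3))
distances = np.linalg.norm(perturbations - instance, axis=1)
weights = np.exp(-(distances ** 2) / 0.25)       # proximity kernel around the instance

# Weighted linear surrogate; its coefficients act as local feature attributions.
surrogate = Ridge(alpha=1.0)
surrogate.fit(perturbations, black_box(perturbations), sample_weight=weights)
print("local feature attributions:", surrogate.coef_)

In this toy setting the first coefficient dominates, mirroring the black-box model's strong linear dependence on the first feature near the chosen instance; production tools like LIME and SHAP refine the same idea with sampling schemes, kernels and sparsity constraints.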
License
Copyright (c) 2025 Jithesh Yemi Reddy

This work is licensed under a Creative Commons Attribution 4.0 International License.