Contextual Financial Insight Generation (CFIG) using Large Vision-Language Models: A Case Study on Corporate Financial Reports
DOI:
https://doi.org/10.31224/5058
Abstract
Automatically extracting and generating actionable financial insights from complex, multimodal corporate reports remains a significant challenge, because tabular data, textual descriptions, and visual charts must be interpreted together. Existing methods often struggle with true multimodal integration and require extensive, resource-intensive fine-tuning. To address these limitations, we propose the Contextual Financial Insight Generation (CFIG) framework, which draws on the inherent multimodal understanding of Large Vision-Language Models (LVLMs) through a structured prompt engineering strategy and so minimizes the need for large-scale fine-tuning. CFIG integrates the disparate data types in a financial report by establishing contextual relationships between raw PDF page images, structured tabular data, and key textual passages, enabling LVLMs to interpret financial narratives holistically. Evaluated on the synthetically constructed Corporate Financial Report Analysis Dataset (CFRAD), CFIG consistently outperforms traditional text-only baselines (e.g., fine-tuned FinBERT) and LVLMs that use basic prompting strategies. Specifically, CFIG with LLaVA-1.5 achieved an average score of 0.47, compared with 0.36 for a LoRA-tuned LLaVA-1.5 and 0.33 for fine-tuned FinBERT. Even the lightweight Fuyu-8B model, when combined with CFIG, reached an average score of 0.44, which indicates that the approach applies across model sizes. CFIG combined with GPT-4V achieved the highest average score of 0.51, showing that the framework also benefits state-of-the-art models. The substantial improvement on our novel Insight Coherence (ICo) metric underscores CFIG's ability to generate logically sound, factually accurate, and contextually relevant financial insights. Our work demonstrates that carefully designed prompt engineering can unlock advanced multimodal reasoning in LVLMs, providing an efficient and scalable route to high-quality financial intelligence.
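The abstract describes CFIG as linking a PDF page image, its extracted table, and key text passages inside a single prompt. The following is a minimal sketch of what such prompt assembly could look like, not the authors' implementation: the names `ReportPage`, `table_to_markdown`, and `build_contextual_prompt`, the prompt wording, and the sample figures are all hypothetical and chosen only for illustration. The image itself would be passed to the LVLM through whatever interface the chosen model provides, which is not shown here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ReportPage:
    """One report page with its three modalities (hypothetical structure)."""
    image_path: str                  # rendered PDF page, sent to the LVLM separately
    table_rows: list[dict[str, str]] # extracted table, one dict per row
    passages: list[str]              # key text passages from the same page


def table_to_markdown(rows: list[dict[str, str]]) -> str:
    """Serialize extracted table rows as a Markdown table the model can read."""
    if not rows:
        return "(no table extracted)"
    headers = list(rows[0].keys())
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row.get(h, "")) for h in headers) + " |")
    return "\n".join(lines)


def build_contextual_prompt(page: ReportPage, question: str) -> str:
    """Combine table, passages, and an image reference into one text prompt."""
    # Label each passage so the model can cite which one supports a claim.
    passages = "\n".join(f"[P{i}] {p}" for i, p in enumerate(page.passages, 1))
    return (
        "You are analyzing one page of a corporate financial report.\n"
        f"The attached image ({page.image_path}) shows the original page layout.\n\n"
        "Extracted table:\n"
        f"{table_to_markdown(page.table_rows)}\n\n"
        "Key passages:\n"
        f"{passages}\n\n"
        "Relate the table values to the passages and any chart in the image, "
        "then answer:\n"
        f"{question}"
    )


if __name__ == "__main__":
    # Made-up illustrative values, not taken from CFRAD or any real report.
    page = ReportPage(
        image_path="report_p12.png",
        table_rows=[{"Metric": "Revenue", "FY2023": "120", "FY2024": "135"}],
        passages=["Revenue growth was driven by the services segment."],
    )
    print(build_contextual_prompt(page, "What drove the change in revenue?"))
```

Labeling passages as `[P1]`, `[P2]`, and so on is one possible way to let the model ground each statement in a specific source; the paper's actual prompt templates may differ.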
License
Copyright (c) 2025 Min-ho Kang

This work is licensed under a Creative Commons Attribution 4.0 International License.