Preprint / Version 1

Visual Question Answering for Bioavailable Iron

Food Image Analysis and Component Estimation using Small Vision-Language Models

Authors

  • Chelsea Ramos, The University of Texas at Austin

DOI:

https://doi.org/10.31224/6975

Keywords:

Visual Question Answering (VQA), Vision-Language Models (VLM), Parameter-Efficient Fine-Tuning (PEFT), Low-Rank Adaptation (LoRA), Bioavailable Iron Estimation, Predictive Modeling, Multimodal Learning, Foundation Model Adaptation, Nutritional Informatics, Quantitative Dietary Assessment

Abstract

Food image analysis, including dish feature identification and nutrient estimation, has become increasingly prominent in research and in daily life. Accurate predictions are critical for monitoring diet and nutrition, especially for alleviating iron deficiency. Current systems estimate only total iron rather than bioavailable iron or the individual factors that determine it, such as ingredient portions, cooking method, and iron-inhibiting micronutrients like calcium. We address this gap by creating a new Visual Question Answering (VQA) dataset from a subset of MM-Food-100K, supplemented with iron, calcium, and vitamin C measurements from the USDA FoodData Central API. We also fine-tune small Vision-Language Models (VLMs) from the SmolVLM family using parameter-efficient Low-Rank Adaptation (LoRA). Our final fine-tuned 2.2B SmolVLM-Instruct model achieved a total classification accuracy of 0.587 and a Mean Absolute Error (MAE) of 1.45 mg for iron, demonstrating the potential of VLMs and establishing a baseline for predicting, from real-world images, the key components needed to estimate bioavailable iron.
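
The two reported metrics can be illustrated with a minimal sketch. It assumes classification answers are scored by exact string match and that iron predictions are compared in milligrams; the abstract does not specify the scoring procedure, and the function names and toy values below are hypothetical, not taken from the paper's dataset or code.

```python
from typing import Sequence


def classification_accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of predicted labels that exactly match the reference labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def mean_absolute_error(predicted_mg: Sequence[float], actual_mg: Sequence[float]) -> float:
    """Average absolute difference between predicted and reference values, in mg."""
    if len(predicted_mg) != len(actual_mg) or not actual_mg:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted_mg, actual_mg)) / len(actual_mg)


# Hypothetical toy values for illustration only (assumption, not paper data).
cooking_pred = ["fried", "boiled", "baked", "grilled"]
cooking_true = ["fried", "steamed", "baked", "grilled"]
iron_pred_mg = [2.0, 1.0, 4.5, 3.0]
iron_true_mg = [2.5, 1.0, 3.0, 5.0]

accuracy = classification_accuracy(cooking_pred, cooking_true)
iron_mae = mean_absolute_error(iron_pred_mg, iron_true_mg)
print(f"accuracy={accuracy:.3f}, iron MAE={iron_mae:.2f} mg")
```

Under this reading, an MAE of 1.45 mg means the model's iron estimates differ from the reference values by 1.45 mg on average.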

Posted

2026-05-04