
Food-500 Cap: A Fine-Grained Food Caption Benchmark for Evaluating Vision-Language Models

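Keywords: Interpretability; Computer science; Benchmark (surveying); Domain (mathematical analysis); Artificial intelligence; Machine learning; Generalization; Natural language processing; Geography; Geodesy; Mathematics; Mathematical analysis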
Authors
Zheng Ma, Mianzhi Pan, Wenhan Wu, Kanzhi Cheng, Jianbing Zhang, Shujian Huang, Jiajun Chen
Identifier
DOI:10.1145/3581783.3611994
Abstract

Vision-language models (VLMs) have shown impressive performance on a wide range of downstream multi-modal tasks. However, comparing only fine-tuned performance on downstream tasks yields poor interpretability of VLMs, which hinders their future improvement. Several prior works have identified this issue and used various probing methods under a zero-shot setting to detect VLMs' limitations, but they all examine VLMs using general datasets instead of specialized ones. In practical applications, VLMs are usually applied to specific scenarios, such as e-commerce and news, so the generalization of VLMs to specific domains deserves more attention. In this paper, we comprehensively investigate the capabilities of popular VLMs in a specific field, the food domain. To this end, we build a food caption dataset, Food-500 Cap, which contains 24,700 food images in 494 categories. Each image is accompanied by a detailed caption, including fine-grained attributes of the food, such as its ingredients, shape, and color. We also provide a culinary culture taxonomy that classifies each food category by its geographic origin, in order to better analyze the performance differences of VLMs across regions. Experiments on our proposed dataset demonstrate that popular VLMs underperform in the food domain compared with their performance in the general domain. Furthermore, our research reveals severe bias in VLMs' ability to handle food items from different geographic regions. We adopt diverse probing methods and evaluate nine VLMs belonging to different architectures to verify these observations. We hope that our study will bring researchers' attention to VLMs' limitations when applied to the domain of food or culinary cultures, and spur further investigations to address this issue.
