Can LLMs Speak Art?
Description
This PhD project investigates what multimodal large language models (LLMs) know about art history and how they can support art-historical reasoning. While Digital Art History has traditionally relied on computational methods for large-scale image analysis, recent multimodal LLMs introduce new possibilities for interpreting visual and semantic content (Cetinic & She, 2022; Impett & Offert, 2023; Cornelis et al., 2009). However, how much domain-specific knowledge they actually encode remains unclear.
The first part of the project addresses the question: what do LLMs know about art history? To answer it, the project benchmarks three model families, supervised convolutional neural networks, contrastive vision-language models (CLIP, SigLIP), and generative multimodal LLMs (GPT and Gemini), on the classification of Christian saints across three datasets: ArtDL, ICONCLASS, and Wikidata. The evaluation is structured around zero-shot and few-shot settings, testing both label-based classification and contextual enrichment through Iconclass descriptions. Results show that multimodal LLMs consistently outperform the other architectures, achieving accuracy above 90% on curated datasets, while performance decreases on noisier collections. Prompt enrichment with textual descriptions generally improves results, whereas few-shot prompting yields inconsistent gains. These findings suggest that recent LLMs encode substantial visual-semantic knowledge but remain sensitive to dataset quality and task design. A second step toward answering the question is Art-Historical Visual Question Answering (VQA): a benchmark of 870 multiple-choice questions created by domain experts. The benchmark evaluates not only factual knowledge but also interpretative and reasoning abilities, including iconographic identification, stylistic analysis, and contextual understanding. This part is still in development.
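To make the classification setup concrete, here is a minimal sketch of zero-shot saint classification with label-based versus description-enriched prompts, assuming the Hugging Face transformers implementation of CLIP; the saint labels, prompt templates, attribute descriptions, and file name are illustrative placeholders, not the project's actual prompts or data.

```python
# Minimal sketch: zero-shot classification with CLIP, comparing plain label
# prompts against prompts enriched with Iconclass-style attribute descriptions.
# All labels, descriptions, and the image path are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

saints = ["Saint Jerome", "Saint Sebastian", "Saint Catherine of Alexandria"]
label_prompts = [f"a painting of {s}" for s in saints]
enriched_prompts = [
    "a painting of Saint Jerome, an elderly scholar with a lion and a cardinal's hat",
    "a painting of Saint Sebastian, a young man bound to a tree and pierced by arrows",
    "a painting of Saint Catherine of Alexandria, a crowned woman with a broken wheel",
]

image = Image.open("artwork.jpg")  # hypothetical input image
for prompts in (label_prompts, enriched_prompts):
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # image-to-text similarity scores
    print(saints[logits.softmax(dim=-1).argmax().item()])
```

A SigLIP checkpoint can be evaluated with the same loop using the corresponding transformers classes.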
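For the VQA benchmark, each item is multiple-choice; the following is a hedged sketch of how a single item might be posed to a generative multimodal LLM, assuming the OpenAI chat completions API (the model name, question, options, and image URL are all illustrative, not items from the benchmark).

```python
# Minimal sketch: posing one multiple-choice VQA item to a generative
# multimodal LLM via the OpenAI chat completions API. The question, options,
# and image URL are illustrative placeholders, not actual benchmark items.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "Which iconographic attribute identifies the saint in this painting?"
options = ["A) A lion", "B) Arrows", "C) A broken wheel", "D) Keys"]
prompt = f"{question}\n" + "\n".join(options) + "\nAnswer with a single letter."

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": "https://example.org/artwork.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)  # e.g. "C"
```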
Overall, the project provides a dual evaluation framework that moves beyond classification toward reasoning-based assessment. The results demonstrate that multimodal LLMs can effectively support some tasks in Digital Art History, while highlighting the need for grounded knowledge integration and more robust evaluation methods to handle domain-specific complexity. Future experiments aim to clarify how these models can best support art-historical research.
Cetinic, E., & She, J. (2022). Understanding and creating art with AI: Review and outlook. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 18(2), 1–22.
Cornelis, B., Dooms, A., Daubechies, I., & Schelkens, P. (2009). Report on Digital Image Processing for Art Historians. In L. Fesquet & B. Torrésani (Eds.), SAMPTA'09, International Conference on Sampling Theory and Applications (special session on sampling and (in)painting). https://hal.science/hal-00452288
Impett, L., & Offert, F. (2023). There Is a Digital Art History (arXiv:2308.07464). arXiv. https://doi.org/10.48550/arXiv.2308.07464
Milani, F., & Fraternali, P. (2021). A Dataset and a Convolutional Model for Iconography Classification in Paintings. Journal on Computing and Cultural Heritage, 14(4), 1–18. https://doi.org/10.1145/3458885
Panofsky, E. (2018). Studies in iconology: Humanistic themes in the art of the Renaissance. Routledge.
OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., … Zoph, B. (2024). GPT-4 Technical Report (arXiv:2303.08774). arXiv. https://doi.org/10.48550/arXiv.2303.08774
Posthumus, E. (2020). Iconclass AI test set. Retrieved September 16, 2024.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., & others. (2021). Learning transferable visual models from natural language supervision. International Conference on Machine Learning, 8748–8763.
Team, G., Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., & others. (2023). Gemini: A family of highly capable multimodal models. arXiv Preprint arXiv:2312.11805.
Zhai, X., Mustafa, B., Kolesnikov, A., & Beyer, L. (2023). Sigmoid loss for language image pre-training. Proceedings of the IEEE/CVF International Conference on Computer Vision, 11975–11986.
Files
spinaci_gianmarco.pdf (178.6 kB)