Reference Set and MLLM Visual Information Extraction Prototype

Dilán-Pantojas, Israel; Duong, Phu T.; Boyce, Richard

doi:10.5281/zenodo.17795454

Published December 2, 2025 | Version 1.0.0

Dataset Open

1. University of Pittsburgh School of Medicine
2. University of Pittsburgh

Contributors

Other:

Lopes, Kevin¹

1. Rochester Institute of Technology

This dataset contains data manually extracted from figures and tables in pharmacology studies that were previously collected as part of an internal reference set. It includes images of the visual elements and their corresponding values from 45 published pharmacology studies, each with a distinct PubMed ID. Multiple images could be sampled from each paper, and multiple values were often sampled from each image, so the reference set contains multiple rows of values taken from images of tables or of figures (graphs, plots, or charts). In total, the dataset contains annotated information from 43 images of figures and 40 images of tables. The visual elements report data from eight types of experiments: in vitro enzyme inhibition, induction, and kinetics; in vitro transporter inhibition, induction, and kinetics; in vivo enzyme kinetics; and in vivo interaction studies. The selected sample covers a wide range of styles, layouts, and structures for both figures and tables.

We also provide the code for our multimodal large language model (MLLM) Visual Information Extraction (VIE) prototype, which uses the Pydantic AI v1.25 Python module to connect to multiple models, perform VIE, and produce structured JSON output. Our pilot VIE system processed images from the reference set and used the rest of the annotated information to generate prompts.
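As a rough illustration of what consuming structured JSON output can look like, the sketch below parses a model reply into typed records and fails on malformed input. It uses only the Python standard library rather than Pydantic AI, and every field name (`pmid`, `image_id`, `parameter`, `value`, `unit`), the `values` wrapper key, and the sample reply are hypothetical; the prototype's actual schema is defined in the deposited code.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedValue:
    """One extracted value. All field names here are hypothetical."""
    pmid: str
    image_id: str
    parameter: str
    value: float
    unit: str


def parse_extraction(raw_json: str) -> list[ExtractedValue]:
    """Parse a JSON reply into typed records.

    Raises json.JSONDecodeError for invalid JSON, KeyError for a missing
    field, and ValueError for a value that is not numeric.
    """
    payload = json.loads(raw_json)
    records = []
    for item in payload["values"]:  # "values" is an assumed wrapper key
        records.append(
            ExtractedValue(
                pmid=str(item["pmid"]),
                image_id=str(item["image_id"]),
                parameter=str(item["parameter"]),
                value=float(item["value"]),
                unit=str(item["unit"]),
            )
        )
    return records


# Made-up reply used only to exercise the parser.
sample_reply = json.dumps(
    {
        "values": [
            {
                "pmid": "00000000",
                "image_id": "table_01",
                "parameter": "IC50",
                "value": "1.5",
                "unit": "uM",
            }
        ]
    }
)
records = parse_extraction(sample_reply)
```

Converting each field explicitly means a reply with a missing key or a non-numeric value raises an error at parse time instead of reaching the evaluation step.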

We have evaluated the following models.

Inference Provider	Model Company	Model Name	Context Window	Number of Parameters
AWS Bedrock	Anthropic	Claude Sonnet 3.7	128K	*
AWS Bedrock	Anthropic	Claude Sonnet 4.0	1M	*
AWS Bedrock	AWS	Nova Pro	300K	*
AWS Bedrock	AWS	Nova Premier	1M	*
AWS Bedrock	Meta	Llama 3.2	128K	90B
AWS Bedrock	Meta	Llama 4 Scout	10M	109B
AWS Bedrock	Meta	Llama 4 Maverick	1M	400B
OpenAI API	OpenAI	GPT-4o	128K	*
OpenAI API	OpenAI	GPT-5	400K	*
Google Vertex	Google	Gemini 2.5 Pro	1M	*
* The number of parameters for this model has not been made publicly available.

Error corrections:

In "Manuscript Results folder" > "Tolerance Based ACC.ods", the tolerance-based accuracy in cells F12-J21 was calculated incorrectly: each corresponding cell in F1-J10 was divided by 172 instead of 162. For example, the content of cell F12 should be changed from "=ROUND(F1/172,3)*100" to "=ROUND(F1/162,3)*100".
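To make the size of this correction concrete, the following sketch reproduces the spreadsheet formula in Python. The function name and the example count of 81 are hypothetical and chosen only for illustration; the denominators 162 and 172 come from the note above. Python's `round` breaks exact ties differently from the spreadsheet `ROUND`, so results could differ in rare tie cases.

```python
def tolerance_accuracy_percent(count: int, denominator: int = 162) -> float:
    """Percentage equivalent of the spreadsheet formula ROUND(count/denominator, 3)*100.

    Rounding 100 * count / denominator to one decimal place matches rounding
    the ratio to three decimals and multiplying by 100, without the small
    floating-point error that the final multiplication can introduce.
    """
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    if count < 0:
        raise ValueError("count must be non-negative")
    return round(100 * count / denominator, 1)


# Hypothetical count of 81, evaluated with both denominators.
corrected = tolerance_accuracy_percent(81, 162)   # corrected formula
erroneous = tolerance_accuracy_percent(81, 172)   # formula before the correction
print(f"corrected: {corrected}%  erroneous: {erroneous}%")
```

Because the erroneous denominator was larger, every affected cell understated the accuracy, and the corrected values are higher by a factor of 172/162.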

Files

Files (1.4 GB)

Name	MD5	Size
Complexity.ipynb	c8e8891a15590faa9cd0fbb334d8e228	756.9 kB
DataObjects.py	5f96a3619bfeb588ab38ca783113148f	3.8 kB
DataProcessing.py	a6b64aef5b613d87b8bfa9ea3abb1db3	9.3 kB
ExperimentGenerator.py	7a4ac1898ec56b38990c15b0171b5ff8	37.0 kB
ExperimentObjects.py	6c0bdb263bfda5d3e7bb84bf81aa7e59	7.3 kB
ExtractionExperiment.ipynb	c40da088a59f49af7bd7fa85f16890e9	265.5 kB
Manuscript_Results_Objects.zip	a6731cd414ee993be7e7d084a4338b3d	607.8 MB
RawResultObjects.py	9f56152a733b5467b3ac516530d91d80	1.9 kB
Reference_set.zip	5de30d230258b59c01b84bf8eaef69cf	832.5 MB
ResultsAnalysis.py	00b14652d428218ccb572d2dcef16c5d	4.1 kB
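The MD5 checksums listed for each file can be used to confirm that a download is intact. The sketch below is one way to do that with Python's standard `hashlib` module; the helper names `md5_of_stream` and `verify_file` are hypothetical, the dictionary copies only three of the listed checksums, and the sketch assumes the files were saved under their original names in the working directory.

```python
import hashlib

# Checksums copied from the file listing for this record (subset shown).
EXPECTED_MD5 = {
    "DataObjects.py": "5f96a3619bfeb588ab38ca783113148f",
    "RawResultObjects.py": "9f56152a733b5467b3ac516530d91d80",
    "ResultsAnalysis.py": "00b14652d428218ccb572d2dcef16c5d",
}


def md5_of_stream(stream, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a binary stream, read in chunks.

    Chunked reading keeps memory use flat for the large .zip archives.
    """
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: str, expected_md5: str) -> bool:
    """Return True if the file at `path` has the expected MD5 digest."""
    with open(path, "rb") as handle:
        return md5_of_stream(handle) == expected_md5.lower()


if __name__ == "__main__":
    for name, expected in EXPECTED_MD5.items():
        try:
            status = "OK" if verify_file(name, expected) else "MISMATCH"
        except FileNotFoundError:
            status = "not downloaded"
        print(f"{name}: {status}")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the file again is the simplest remedy.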

Additional details

Repository URL: https://github.com/dbmi-pitt/visual-info-extraction
Programming language: Python
Development Status: Concept

