Published September 22, 2025 | Version v1
Preprint Open

Autonomous extraction and building of machine-readable molecular models from publications using large language models

  • 1. Rheinland-Pfälzische Technische Universität Kaiserslautern-Landau
  • 2. University of Sarajevo
  • 3. Norwegian University of Life Sciences
  • 4. Science and Technology Facilities Council (STFC)

Description

Force field models are one of the central elements of molecular simulations. A
large number of molecular force field models have been developed in the past decades,
mostly published in scientific papers. Manually building machine-readable force field input
files for simulation engines is a tedious and error-prone task. We developed a method
for autonomously extracting and building force field files from publications using large
language models (LLMs). We have tested the new method by extracting 114 force
field models from 21 scientific publications. The studied force fields comprise 6 to 74
parameters. We have compared the performance of different LLMs, namely Gemini
2.5 Pro, Claude 4 Sonnet, GPT-4o, Claude 3.7 Sonnet, and Gemini 2.5 Flash. Overall,
they perform similarly, yet there are important differences in individual cases.
Gemini 2.5 Pro performed best overall, extracting and identifying the force field
parameters with an accuracy of 89.1%. The new autonomous extraction method
drastically reduces the time required for building force field files and does not
depend on the experience of the person setting up the simulation.

Files

ExtractingForceFieldsFromPapersUsingLLMs.pdf (2.4 MB)
md5:046487281d96e2b7b3ed87a5923f1135