Published July 24, 2023 | Version v1
Conference paper | Open access

Integrating Chemical Language and Physicochemical Features for Enhanced Molecular Property Prediction with Multimodal Language Models

Description

We present a multimodal language model, MultiModal-MoLFormer, for predicting molecular properties. It combines chemical language embeddings from the recently introduced MoLFormer chemical language model with physicochemical features. A causal multi-stage feature selection method chooses the physicochemical features according to their direct causal effect on the target property. Specifically, we use Mordred descriptors as the physicochemical features and Markov blanket causal graphs as the inference algorithm that identifies the most relevant ones. The proposed approach outperforms existing state-of-the-art algorithms, including the chemical language-based MoLFormer and graph neural networks, on complex tasks such as predicting the biodegradability of general compounds and estimating PFAS toxicity. Compared with the base MoLFormer approach, MultiModal-MoLFormer raises the classification accuracy for EPA categories of PFAS toxicity from 0.75 to 0.84. It also achieves an accuracy of 0.94 on the biodegradability estimation task.
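The selection-then-fusion idea described above can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the Markov blanket algorithm or the fusion step, so the sketch assumes a Grow-Shrink search with a Fisher-z partial-correlation test and plain concatenation. Random arrays stand in for MoLFormer embeddings and Mordred descriptors, the target is continuous for simplicity, and all function and variable names are hypothetical.

```python
import math
import numpy as np

def is_dependent(data, i, j, cond, alpha=0.01):
    """Fisher-z test: is column i dependent on column j given columns cond?"""
    idx = [i, j] + list(cond)
    corr = np.corrcoef(data[:, idx], rowvar=False)
    prec = np.linalg.pinv(corr)
    # Partial correlation of i and j given cond, read off the precision matrix.
    r = -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])
    r = min(max(r, -0.999999), 0.999999)
    n = data.shape[0]
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - len(cond) - 3)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_value < alpha

def grow_shrink_blanket(descriptors, target, alpha=0.01):
    """Approximate the Markov blanket of target among descriptor columns."""
    data = np.column_stack([descriptors, target])
    t = descriptors.shape[1]  # column index of the target
    blanket = []
    changed = True
    while changed:  # grow: add features dependent on the target given the blanket
        changed = False
        for j in range(t):
            if j not in blanket and is_dependent(data, t, j, blanket, alpha):
                blanket.append(j)
                changed = True
    for j in list(blanket):  # shrink: drop features made redundant by the rest
        rest = [k for k in blanket if k != j]
        if not is_dependent(data, t, j, rest, alpha):
            blanket.remove(j)
    return sorted(blanket)

# Synthetic stand-ins (assumptions, not real MoLFormer or Mordred outputs).
rng = np.random.default_rng(0)
n = 2000
embeddings = rng.normal(size=(n, 8))   # placeholder for language-model embeddings
x0 = rng.normal(size=n)                # direct cause of the target
x1 = rng.normal(size=n)                # direct cause of the target
x2 = x0 + 0.5 * rng.normal(size=n)     # correlated with x0, not a direct cause
x3 = rng.normal(size=n)                # unrelated noise
descriptors = np.column_stack([x0, x1, x2, x3])
target = 2.0 * x0 - 1.5 * x1 + rng.normal(size=n)

selected = grow_shrink_blanket(descriptors, target)
# Fusion by concatenation (an assumed choice): embeddings plus selected descriptors.
fused = np.hstack([embeddings, descriptors[:, selected]])
print("selected descriptor columns:", selected)
print("fused feature matrix shape:", fused.shape)
```

In this toy setup the two direct causes are expected to be selected, while the correlated proxy and the noise column are usually rejected by the conditional independence test. The fused matrix could then be passed to any downstream classifier or regressor.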

Files

BioKDD___Multimodal_MoLFormer.pdf (261.3 kB)
md5:2120b77a402cb4488a65c8ca2a176c77
