Presentation for International Conference on Chemical Structures: ICCS, 2022
Despite the success of COVID-19 vaccines, there remains an urgent need for small-molecule antivirals. The recent orally effective M-protease inhibitor PF-07321332 thus represents a breakthrough. Pfizer first declared the structure at an ACS meeting in April 2021; their patent WO2021250648, "Nitrile Containing Antiviral Compounds", published on the 16th of December 2021, followed closely by their paper on the 24th (PMID: 34726479). The problem is that different commercial and public sources extract structures and activity data at different speeds. For SARS-CoV-2 targets, published lead structures are curated by the Guide to Pharmacology and complemented by full SAR data sets curated from patents by BindingDB. Both resources promptly submit to PubChem. ChEMBL also extracts from papers, but the release cycle is long. This work will look at the timings affecting the flow of post-publication structures and data into PubChem as well as SciFinder. While some chemistry has been automatically extracted from WO2021250648 by Google Patents and WIPO Patentscope, SureChEMBL has not yet subsumed the chemistry into its database. None of these patent extractions has yet fed through to PubChem, where the structures would be usefully merged at the compound level. SciFinder has indexed the substances but not the activity values. It also turns out that a new set of Pfizer 2021 patents includes a new M-protease inhibitor series, possibly as PF-07321332 back-ups. Tracking these through the system illustrates the technical challenges that automated pipelines face in extracting the correct structures from PDFs. Methods using open tools such as OPSIN and OSRA for manual patent curation will be exemplified. These can improve extraction fidelity for key compounds but are more difficult to scale. There are additional recently declared clinical candidates, including SH-879 from Sosei Heptares and S-217622 from Shionogi.
The former is still blinded (i.e., no structure in the public domain), but the latter is specified in a preprint and has been curated by BindingDB (CID162533924). However, neither of the two has surfaced from patents so far (February). An academic team from Stanford also claims to have filed on their ML1000 lead compound, CID155925840. We can expect that more M-protease patents from companies, as well as academic groups, will publish in 2022. The open science COVID Moonshot efforts have just nominated CID156906151 as their clinical candidate. It is thus important that all quality data, both open and commercial, can be extracted and tracked quickly into resources that are FAIR. Large sets of analogue activity data from patents are particularly suitable for classical pharmacophore modelling or AI/ML approaches. Extracting SAR from an expanding range of patents will thus enhance the development of further improved clinical M-protease inhibitors for battling the pandemic.
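
As an illustration of how the PubChem compound identifiers quoted above could be checked programmatically, the following is a minimal sketch that assumes the public PubChem PUG REST property endpoint and the layout of its JSON property table. It is not the workflow used in this work. The helper names (`property_url`, `parse_properties`, `fetch_properties`) and the labels in `CIDS` are hypothetical and chosen only for this example; the CIDs themselves are those given in the abstract.

```python
import json
from urllib.request import urlopen

# Base address of PubChem's PUG REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

# CIDs quoted in the abstract; the dictionary keys are labels chosen here.
CIDS = {
    "S-217622": 162533924,
    "ML1000": 155925840,
    "COVID Moonshot candidate": 156906151,
}


def property_url(cids, properties=("InChIKey", "MolecularFormula")):
    """Build a PUG REST URL requesting the given properties for the CIDs."""
    cid_part = ",".join(str(cid) for cid in cids)
    prop_part = ",".join(properties)
    return f"{PUG_REST}/compound/cid/{cid_part}/property/{prop_part}/JSON"


def parse_properties(payload):
    """Index the rows of a PUG REST property table by CID."""
    rows = payload["PropertyTable"]["Properties"]
    return {row["CID"]: row for row in rows}


def fetch_properties(cids, timeout=30):
    """Fetch and parse the property table (requires network access)."""
    with urlopen(property_url(cids), timeout=timeout) as response:
        return parse_properties(json.load(response))
```

Calling `fetch_properties(CIDS.values())` would return one record per CID that PubChem resolves. Comparing the returned InChIKeys with those obtained from a patent extraction is one possible way to test whether two sources have merged at the compound level.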