Published June 17, 2022 | Version v1
Poster Open

40 million open chemical structures from patents: treasure trove? junk yard? or both?

  • 1. Medicines Discovery Catapult


Poster for International Conference on Chemical Structures: ICCS,  June 2022 


Compared to the literature, the patent corpus has both pros and cons for chemistry data mining. The cons include a) being a “Cinderella” source that is difficult to get to grips with, b) a document corpus made massively redundant by patent families and kind codes and c) various degrees of deliberate obfuscation to impede data mining. The pros include a) paradoxically, compared to the restricted access to the literature, patents are completely open for text mining and entity extraction, b) they contain ~3x to ~5x more medicinal chemistry SAR than published papers, c) they include disclosures of new drug targets and chemotypes years ahead of papers, d) they constitute a rich source of executed synthesis protocols and experimental chemical property data and e) within the last few years, open automated chemical named entity recognition (CNER) has broken the monopoly of commercial chemistry curation. Because Medicines Discovery Catapult needs to keep up with developments in both commercial and open sources, this work was undertaken to update our overview of patent extractions in general and their expanding integration within PubChem in particular. The four largest PubChem sources, SureChEMBL, Google Patents, WIPO, and IBM, use similar CNER pipelines that include name look-ups, IUPAC conversions and image-to-structure extractions. Their compound (CID) counts are 21.5, 17.9, 17.7 and 10.7 million, respectively. Together with small sources such as NextMove Software synthetic pathway extractions at 1.8 million, the CNER sources add up to just under 40 million of the PubChem March 2022 total of 111 million.
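The arithmetic behind these counts can be checked with a rough sketch. It uses only the figures quoted above, rounds "just under 40 million" up to 40.0 as an assumption, and ignores any small sources other than NextMove Software, so the results are approximate. The variable names are illustrative only.

```python
# Per-source CID counts in millions, as quoted in the abstract.
source_cids_m = {
    "SureChEMBL": 21.5,
    "Google Patents": 17.9,
    "WIPO": 17.7,
    "IBM": 10.7,
    "NextMove Software": 1.8,
}

# Assumption: "just under 40 million" is rounded up to 40.0 here.
union_cids_m = 40.0
# PubChem total for March 2022, as quoted in the abstract.
pubchem_total_m = 111.0

# Sum of the individual source counts (structures shared between
# sources are counted once per source).
summed_m = sum(source_cids_m.values())

# Average number of the listed sources in which a structure appears.
overlap_factor = summed_m / union_cids_m

# Fraction of all PubChem CIDs that come from the CNER patent sources.
patent_share = union_cids_m / pubchem_total_m

print(f"Summed source counts: {summed_m:.1f} million")
print(f"Average sources per structure: {overlap_factor:.2f}")
print(f"Share of PubChem: {patent_share:.0%}")
```

Under these assumptions the per-source counts sum to roughly 70 million against a combined total of about 40 million, so the sources overlap only partially, and the patent-extracted structures make up roughly a third of PubChem.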
The “treasure trove” aspects that will be presented include a) expert curation of SAR from patents by BindingDB, with 400K compounds from 5.4K US patents and data points covering 2,197 target proteins, b) extensive coverage of the ~5 million exemplified compounds from all C07 and A61 patent-classified filings relevant to medicinal chemistry and c) the ability to track back to exact example numbers in documents via SureChEMBL and WIPO. However, this presentation will also outline the “junk yard” aspects. These include a) the question of how much of a junk yard the other 35 million structures, beyond the ~5 million linkable to data, represent, b) CNER producing artifactual structures from broken IUPAC strings and mixture extractions of various sorts, c) all the large extraction sources diverging significantly in exactly what chemistry their own pipelines pull out and d) the 28 million patent document-to-chemistry links representing massive over-mapping (but reasons for this will be discussed). All things considered, however, the PubChem team are to be congratulated on their efforts not only in wrangling and integrating these sources but also in linking and search-indexing the chemistry against the patent documents from which it was extracted.




