USDA Phytochemical Database — Enriched v2.4.0 (400-Row Sample)
Authors/Creators
Description
A 400-record sample of the USDA Dr. Duke's Phytochemical and Ethnobotanical Database, denormalized into a flat 16-column schema and enriched with quantitative signals and structural identifiers.
Enrichment Sources:
- pubmed_mentions_2026: PubMed publication count per compound
- clinical_trials_count: ClinicalTrials.gov v2 study count per compound
- chembl_bioactivity_count: ChEMBL v35 bioassay data points (CC BY-SA 3.0)
- patent_count: USPTO patents since 2020-01-01 (Patent-Literature Gap Analysis)
- pubchem_cid + canonical_smiles + inchi_key: PubChem compound identifiers and structural notations
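As a rough illustration of how the structural identifier column might be sanity-checked, the sketch below tests whether a value has the shape of a standard InChIKey (14 uppercase letters, 10 uppercase letters, and 1 uppercase letter, separated by hyphens). The function name is hypothetical, and it is an assumption that missing identifiers are stored as null values; the check covers only the format, not whether the key matches the compound.

```python
import re

# Shape of a standard InChIKey: 14 letters, 10 letters, 1 letter.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_wellformed_inchikey(value):
    """Return True if value looks like a standard InChIKey.

    Hypothetical helper: it validates the format only. Non-string
    values (for example None for a missing key) return False.
    """
    return isinstance(value, str) and INCHIKEY_RE.fullmatch(value) is not None


# The InChIKey of aspirin has the expected shape.
print(is_wellformed_inchikey("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))
# A missing value (assumed to be stored as null) does not.
print(is_wellformed_inchikey(None))
```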
Schema (16 Columns): plant_id, plant_species, common_name, chemical, activity, reference, pubchem_cid, canonical_smiles, patent_count, patent_count_method, compound_type, clinical_trials_count, partner_cid, inchi_key, iupac_verified, partner_match_method
Records: 400 (top compounds by PubMed mentions). Total dataset: 76,907 records across 24,746 compounds and 2,313 species. Full dataset available at: https://ethno-api.com
Formats: JSON (~41 MB full dataset) + Parquet (~1.3 MB full dataset, Snappy compression). Methodology: https://github.com/wirthal1990-tech/USDA-Phytochemical-Database-JSON/blob/main/METHODOLOGY.md
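A minimal sketch of loading the JSON sample and comparing each record against the 16 listed columns might look like the following. It assumes the file is a top-level JSON array of flat objects, which this record does not confirm; `load_records` and `schema_diff` are hypothetical helper names.

```python
import json

# The 16 columns listed in the schema above.
EXPECTED_COLUMNS = [
    "plant_id", "plant_species", "common_name", "chemical",
    "activity", "reference", "pubchem_cid", "canonical_smiles",
    "patent_count", "patent_count_method", "compound_type",
    "clinical_trials_count", "partner_cid", "inchi_key",
    "iupac_verified", "partner_match_method",
]


def load_records(path):
    """Load the sample file.

    Assumption: the file holds a top-level JSON array of objects.
    """
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array")
    return data


def schema_diff(record):
    """Return (missing, unexpected) column names for one record dict."""
    keys = set(record)
    expected = set(EXPECTED_COLUMNS)
    return sorted(expected - keys), sorted(keys - expected)
```

With the sample downloaded, `schema_diff(load_records("ethno_sample_400.json")[0])` would return two empty lists if the first record matches the listed schema exactly.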
Changes in v2.4.0 vs v2.3.1:
- Integrated an expert-reviewed crossmatch pipeline leveraging structural mappings against COCONUT and FooDB.
- Expanded the schema from 12 to 16 columns to include partner_cid, inchi_key, iupac_verified, and partner_match_method.
- Recovered an additional 1,534 PubChem CIDs, reducing the Null-CID rate by a further 8% (now 17,616 missing).
- Introduced InChIKey structural identifiers (157 entries) and IUPAC nomenclature verification (459 entries).
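One possible reading of the 8% figure is a relative reduction in the number of missing CIDs, not a change in percentage points. The short calculation below checks that reading using only the two numbers quoted in the changelog; the interpretation itself is an assumption, not something the record states.

```python
# Figures quoted in the v2.4.0 changelog.
recovered = 1_534        # PubChem CIDs recovered in this release
missing_after = 17_616   # entries still missing a CID

# Assumption: every recovered CID was previously counted as missing.
missing_before = recovered + missing_after
relative_reduction = recovered / missing_before

print(f"missing before: {missing_before:,}")              # missing before: 19,150
print(f"relative reduction: {relative_reduction:.1%}")    # relative reduction: 8.0%
```

Under that assumption, the numbers are consistent with the stated 8%.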
Files
| Name | Size |
|---|---|
| ethno_sample_400.json (md5:11f5254afc23cd616a2d720ad8ca60e6) | 226.2 kB |
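The published MD5 checksum can be used to confirm that a download is intact. The sketch below uses Python's standard `hashlib` module; `md5_of_file` and `matches_expected` are hypothetical helper names, and the local file path is whatever the download was saved as.

```python
import hashlib

# Checksum published for ethno_sample_400.json on this record.
EXPECTED_MD5 = "11f5254afc23cd616a2d720ad8ca60e6"


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_expected("ethno_sample_400.json")` should return `True` for an unmodified copy of the sample file.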
Additional details
Related works
- Is supplement to Dataset: https://github.com/wirthal1990-tech/USDA-Phytochemical-Database-JSON (URL)
References
- Wirth, A. (2026). USDA Phytochemical Database — Enriched v2.4.0. Enrichment sources: PubMed (NCBI E-utilities), ClinicalTrials.gov v2, ChEMBL v35 (CC BY-SA 3.0), USPTO PatentsView, PubChem. https://ethno-api.com