There is a newer version of the record available.

Published March 23, 2026 | Version 2.2
Dataset Open

USDA Phytochemical Database — Enriched v2.2 (400-Row Sample)

Description

A 400-record sample of the USDA Dr. Duke's Phytochemical and Ethnobotanical Database, denormalized into a flat 10-column schema and enriched with quantitative signals from five sources:

  • pubmed_mentions_2026: PubMed publication count per compound (NCBI E-utilities)
  • clinical_trials_count_2026: ClinicalTrials.gov v2 study count per compound
  • chembl_bioactivity_count: ChEMBL v35 bioassay data points (CC BY-SA 3.0)
  • patent_count_since_2020: USPTO patents since 2020-01-01 (PatentsView REST API)
  • pubchem_cid + canonical_smiles: PubChem compound identifier and canonical SMILES notation (PubChem REST API)
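As an illustration of how a per-compound count such as pubmed_mentions_2026 could be obtained, the following is a minimal sketch against the NCBI E-utilities ESearch endpoint. The helper names are hypothetical, and using the bare compound name as the search term is an assumption; the query construction actually used for this dataset is described in the linked methodology, not here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public ESearch endpoint of NCBI E-utilities.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_count_url(compound: str) -> str:
    """Build an ESearch URL that asks PubMed for a hit count only.

    Assumption: the compound name is passed verbatim as the search term.
    """
    params = {
        "db": "pubmed",
        "term": compound,
        "rettype": "count",
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_count(payload: dict) -> int:
    """Extract the hit count, which ESearch JSON returns as a string."""
    return int(payload["esearchresult"]["count"])


def pubmed_count(compound: str) -> int:
    """Fetch the PubMed hit count for one compound (performs a network call)."""
    with urlopen(build_count_url(compound), timeout=30) as resp:
        return parse_count(json.load(resp))
```

Calling pubmed_count("quercetin") would return the current hit count for that term; counts drift over time, so a result is unlikely to match the stored 2026 snapshot exactly.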

Schema: chemical, plant_species, application, dosage, pubmed_mentions_2026, clinical_trials_count_2026, chembl_bioactivity_count, patent_count_since_2020, pubchem_cid, canonical_smiles
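The schema above can be checked mechanically after download. The sketch below assumes the JSON file holds a top-level array of objects, one per record, keyed by the column names; that layout is an assumption, and the function names are illustrative.

```python
import json

# The ten columns listed in the schema.
EXPECTED_COLUMNS = [
    "chemical",
    "plant_species",
    "application",
    "dosage",
    "pubmed_mentions_2026",
    "clinical_trials_count_2026",
    "chembl_bioactivity_count",
    "patent_count_since_2020",
    "pubchem_cid",
    "canonical_smiles",
]


def check_record(record: dict) -> list[str]:
    """Return human-readable problems for one record; empty list means OK."""
    problems = []
    for column in EXPECTED_COLUMNS:
        if column not in record:
            problems.append(f"missing column: {column}")
    for key in record:
        if key not in EXPECTED_COLUMNS:
            problems.append(f"unexpected column: {key}")
    return problems


def load_records(path: str) -> list[dict]:
    """Load the sample file. Assumes a top-level JSON array of objects."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array of records")
    return data
```

A typical use would be to call load_records("ethno_sample_400.json") and report any record for which check_record returns a non-empty list.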

Records: 400 (top compounds by PubMed mentions).

Total dataset: 76,907 records across 24,746 compounds and 2,313 species.

Full dataset: https://ethno-api.com

Formats: JSON (~25 MB) + Parquet (~800 KB, Snappy compression).

Methodology: https://github.com/wirthal1990-tech/USDA-Phytochemical-Database-JSON/blob/main/METHODOLOGY.md

Changes in v2.2 vs v2.1:

  • Added pubchem_cid and canonical_smiles fields (71.8% coverage, 55,217 of 76,907 records)
  • Updated clinical_trials_count_2026 values (3,243 records refreshed via ClinicalTrials.gov API)
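The coverage figure quoted for the PubChem fields is the share of records in which those fields are populated. A minimal sketch of that calculation follows; it assumes missing values are stored as null or an empty string, which this description does not confirm.

```python
def coverage(records: list[dict], fields: list[str]) -> float:
    """Percentage of records in which every listed field is populated.

    Assumption: a missing value is represented by None or an empty string.
    """
    if not records:
        return 0.0
    filled = sum(
        1
        for record in records
        if all(record.get(field) not in (None, "") for field in fields)
    )
    return round(100 * filled / len(records), 1)


# Arithmetic check of the stated full-dataset figure: 55,217 of 76,907 records.
stated_coverage = round(100 * 55217 / 76907, 1)
```

On the 400-row sample, coverage(records, ["pubchem_cid", "canonical_smiles"]) may differ from the full-dataset value, since the sample is restricted to the most-cited compounds.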

Files

ethno_sample_400.json (133.9 kB)

md5:08b51dfa8ebc81821832513f48b02d0f
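The MD5 checksum listed for ethno_sample_400.json can be used to confirm that a download is intact. The sketch below uses only the Python standard library; the function names and the local file path are illustrative.

```python
import hashlib
from typing import Iterable

# Checksum published for ethno_sample_400.json in this record.
EXPECTED_MD5 = "08b51dfa8ebc81821832513f48b02d0f"


def md5_of_chunks(chunks: Iterable[bytes]) -> str:
    """Return the hex MD5 digest of a sequence of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so it never has to fit in memory."""
    with open(path, "rb") as fh:
        return md5_of_chunks(iter(lambda: fh.read(chunk_size), b""))


def is_intact(path: str, expected: str = EXPECTED_MD5) -> bool:
    """True when the file's MD5 matches the published checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, is_intact("ethno_sample_400.json") should return True for an unmodified copy of the file in the current directory.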

Additional details

References

  • Wirth, A. (2026). USDA Phytochemical Database — Enriched v2.2. Enrichment sources: PubMed (NCBI E-utilities), ClinicalTrials.gov v2, ChEMBL v35 (CC BY-SA 3.0), PatentsView (USPTO), PubChem (NIH/NLM). License: CC BY 4.0.