SDS Toolbox: End-to-End SDS Retrieval and Structured Data Extraction Using LLMs
Description
SDS Toolbox is an automated system for retrieving Safety Data Sheets (SDS) and extracting structured data from them, starting from a chemical identifier such as a CAS number or IUPAC name. It combines web search with LLM-based data extraction in a modular architecture.
Modules
The toolbox consists of three core modules:
1. SDS-FIND
- Function: Searches for SDS files online using a CAS number or IUPAC name.
- Sources: PDF files are retrieved from trusted sources indexed by search engines.
- Search Engines Supported: Currently uses SerpAPI (Google) or Brave Search.
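As an illustration of the input side of SDS-FIND, the sketch below validates a CAS number with the standard CAS check-digit rule and builds a search query string. The function names and the query wording are assumptions made for this example, not the toolbox's actual interface. The `filetype:pdf` operator is understood by Google; whether another engine honors it should be checked before relying on it.

```python
import re

# CAS Registry Numbers have the form NNNNNNN-NN-N (2 to 7 digits, 2 digits, 1 check digit).
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(identifier: str) -> bool:
    """Return True if the string is a well-formed CAS number with a correct check digit."""
    match = CAS_PATTERN.match(identifier.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the body digits 1, 2, 3, ... starting from the rightmost digit.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit


def build_sds_query(identifier: str) -> str:
    """Build a search query for an SDS PDF (hypothetical query wording)."""
    term = identifier.strip()
    return f'"{term}" safety data sheet filetype:pdf'
```

Identifiers that fail the CAS check could be treated as IUPAC names and passed to the query builder unchanged, since the same query shape works for both.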
2. SDS-STRUCT
- Function: Parses and extracts structured data from SDS PDF files.
- Technology: Utilizes LLMs to convert unstructured PDF content into structured formats (e.g., CSV, JSON).
- Data Output: Extracted data includes chemical identifiers, hazard statements, manufacturer info, and safety classifications.
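To make the structured output concrete, here is a minimal sketch of how an LLM reply might be turned into a typed record and written out as CSV. The `SDSRecord` schema, its field names, and both helper functions are assumptions chosen to mirror the data categories listed above. They are not the toolbox's actual schema or code, and the LLM call itself is left out.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SDSRecord:
    # Field names are illustrative assumptions, not the toolbox's actual schema.
    cas_number: str
    product_name: str = ""
    manufacturer: str = ""
    hazard_statements: list[str] = field(default_factory=list)
    ghs_classifications: list[str] = field(default_factory=list)


def parse_llm_response(text: str) -> SDSRecord:
    """Extract the outermost JSON object from a model reply and map it onto SDSRecord."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(text[start : end + 1])
    if not data.get("cas_number"):
        raise ValueError("missing required field: cas_number")
    # Keep only keys the schema knows about; unexpected keys are dropped.
    known = {k: data[k] for k in SDSRecord.__dataclass_fields__ if k in data}
    return SDSRecord(**known)


def to_csv(records: list[SDSRecord]) -> str:
    """Serialize records to CSV text; list fields are joined with '; '."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(SDSRecord.__dataclass_fields__))
    writer.writeheader()
    for record in records:
        row = {
            key: "; ".join(value) if isinstance(value, list) else value
            for key, value in asdict(record).items()
        }
        writer.writerow(row)
    return buffer.getvalue()
```

Validating the reply against a fixed schema before writing it out is one way to keep malformed model output from reaching the CSV or JSON files.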
3. SDS-FLOW
- Function: An end-to-end pipeline that combines search (SDS-FIND) and extraction (SDS-STRUCT) into a single seamless process.
- Input: CAS number or IUPAC name.
- Output: Archived SDS files + extracted structured data.
- Features:
  - Automated retries
  - Logging
  - Zip output with downloadable results
  - Email delivery (optional)
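The pipeline shape described for SDS-FLOW (search, extract, retry on failure, log, bundle into a zip) can be sketched with the standard library alone. Everything named here is hypothetical: `with_retries`, `run_flow`, and the `find_sds` and `extract` callables stand in for the real modules, whose interfaces are not documented in this description. Email delivery is omitted.

```python
import io
import json
import logging
import time
import zipfile
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("sds_flow_sketch")


def with_retries(task: Callable[[], T], attempts: int = 3, delay: float = 1.0) -> T:
    """Call task() until it succeeds or the attempt budget is exhausted."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception as exc:
            logger.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise  # Out of attempts: surface the last error to the caller.
            time.sleep(delay)
    raise ValueError("attempts must be at least 1")


def run_flow(
    identifier: str,
    find_sds: Callable[[str], bytes],     # hypothetical stand-in for SDS-FIND
    extract: Callable[[bytes], dict],     # hypothetical stand-in for SDS-STRUCT
    attempts: int = 3,
    delay: float = 1.0,
) -> bytes:
    """Search, extract, and return a zip archive holding the PDF and the extracted data."""
    pdf_bytes = with_retries(lambda: find_sds(identifier), attempts, delay)
    logger.info("retrieved %d bytes for %s", len(pdf_bytes), identifier)
    record = with_retries(lambda: extract(pdf_bytes), attempts, delay)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(f"{identifier}.pdf", pdf_bytes)
        archive.writestr(f"{identifier}.json", json.dumps(record, indent=2))
    return buffer.getvalue()
```

Building the archive in memory makes the result easy to hand to a download button or an email attachment. A real pipeline would also need to sanitize the identifier before using it as a file name, since IUPAC names can contain characters that are awkward in paths.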
Key Features
- Supports both single requests and batch processing
- Modular and extensible architecture
- Works with both local PDFs and online searches
- Fully integrated with a Streamlit UI
- Optional email delivery of results