SDS Extraction
Tellus uses an LLM to automatically extract structured data from Safety Data Sheets (SDS).
Overview
SDS documents follow a standard section structure but vary widely in layout. The extraction service runs these steps in order:
- Downloads the PDF from S3 storage
- Extracts text with a PDF parser, falling back to OCR when needed
- Parses the content using an LLM (Claude or GPT-4)
- Validates the extracted data against the GHS schema
- Stores the structured data in database tables
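The OCR fallback in the second step can be sketched as follows. This is a minimal illustration, not the service's actual code: `extract_text`, the injected `parse_pdf` and `run_ocr` callables, and the 200-character threshold are all assumptions made for the example.

```python
from typing import Callable, Tuple

# Assumed threshold: less embedded text than this suggests a scanned PDF.
MIN_TEXT_CHARS = 200


def extract_text(
    pdf_bytes: bytes,
    parse_pdf: Callable[[bytes], str],
    run_ocr: Callable[[bytes], str],
    min_chars: int = MIN_TEXT_CHARS,
) -> Tuple[str, str]:
    """Return (text, method), where method is "pdf" or "ocr"."""
    text = parse_pdf(pdf_bytes).strip()
    if len(text) >= min_chars:
        return text, "pdf"
    # Too little embedded text: treat the file as scanned and run OCR.
    return run_ocr(pdf_bytes).strip(), "ocr"
```

Passing the two extractors in as callables keeps the decision logic testable without a PDF library or an OCR engine installed.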
Extracted Sections
The parser extracts all 16 GHS-mandated sections:
| Section | Content | Database Table |
|---|---|---|
| 1 | Identification | chemiq_sds_sections |
| 2 | Hazard Identification | chemiq_sds_hazard_info |
| 3 | Composition | chemiq_sds_composition |
| 4 | First-Aid Measures | chemiq_sds_sections |
| 5 | Fire-Fighting | chemiq_sds_sections |
| 6 | Accidental Release | chemiq_sds_sections |
| 7 | Handling & Storage | chemiq_sds_sections |
| 8 | Exposure Controls/PPE | chemiq_sds_sections |
| 9 | Physical/Chemical Properties | chemiq_sds_sections |
| 10 | Stability & Reactivity | chemiq_sds_sections |
| 11 | Toxicological Info | chemiq_sds_sections |
| 12 | Ecological Info | chemiq_sds_sections |
| 13 | Disposal | chemiq_sds_sections |
| 14 | Transport | chemiq_sds_sections |
| 15 | Regulatory Info | chemiq_sds_sections |
| 16 | Other Info | chemiq_sds_sections |
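The mapping in the table reduces to two dedicated tables plus a shared default. A small sketch of that routing, where the function name `table_for_section` is hypothetical and only the table names come from the table above:

```python
# Sections 2 and 3 have dedicated tables; every other section
# is stored in the shared sections table.
DEDICATED_TABLES = {
    2: "chemiq_sds_hazard_info",
    3: "chemiq_sds_composition",
}
DEFAULT_TABLE = "chemiq_sds_sections"


def table_for_section(section_number: int) -> str:
    """Return the database table that stores the given SDS section."""
    if not 1 <= section_number <= 16:
        raise ValueError(f"SDS section must be 1-16, got {section_number}")
    return DEDICATED_TABLES.get(section_number, DEFAULT_TABLE)
```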
Key Extracted Data
Section 1: Identification
{
  "product_identifier": "Acetone",
  "other_identifiers": ["2-Propanone", "Dimethyl ketone"],
  "recommended_use": "Solvent, cleaning agent",
  "manufacturer": {
    "company_name": "Chemical Corp",
    "address": {...},
    "phone": "1-800-XXX-XXXX",
    "emergency_contact": {...}
  }
}
Section 2: Hazard Identification
{
  "signal_word": "Danger",
  "pictograms": ["GHS02", "GHS07"],
  "classification": [
    {
      "hazard_class": "Flammable liquids",
      "hazard_category": "Category 2",
      "hazard_code": "H225"
    }
  ],
  "hazard_statements": ["H225: Highly flammable liquid and vapor"],
  "precautionary_statements": {...}
}
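A few fields in this payload have a fixed GHS vocabulary, which makes them easy to check after parsing. The sketch below is illustrative only: `validate_hazard_info` is a hypothetical helper, and the patterns cover the two GHS signal words, the nine pictogram codes GHS01 to GHS09, and H-codes in the H2xx/H3xx/H4xx ranges. It does not represent the full schema the service validates against.

```python
import re
from typing import List

SIGNAL_WORDS = {"Danger", "Warning"}
PICTOGRAM_RE = re.compile(r"GHS0[1-9]")
# H2xx physical, H3xx health, H4xx environmental; optional letter suffix.
HAZARD_CODE_RE = re.compile(r"H[234]\d{2}[A-Za-z]*")


def validate_hazard_info(data: dict) -> List[str]:
    """Return a list of human-readable problems; empty means no issues found."""
    errors = []
    signal = data.get("signal_word")
    # Some products carry no signal word, so None is accepted here.
    if signal is not None and signal not in SIGNAL_WORDS:
        errors.append(f"unknown signal word: {signal!r}")
    for pictogram in data.get("pictograms", []):
        if not PICTOGRAM_RE.fullmatch(pictogram):
            errors.append(f"unknown pictogram: {pictogram!r}")
    for entry in data.get("classification", []):
        code = entry.get("hazard_code", "")
        if not HAZARD_CODE_RE.fullmatch(code):
            errors.append(f"malformed hazard code: {code!r}")
    return errors
```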
Section 3: Composition
{
  "ingredients": [
    {
      "chemical_name": "Acetone",
      "cas_number": "67-64-1",
      "concentration": {"range_min": 99, "range_max": 100}
    }
  ]
}
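CAS numbers end in a check digit, so an extracted `cas_number` can be sanity-checked without any lookup. The sketch below applies the standard CAS checksum; the function name `is_valid_cas` is hypothetical, and whether the service performs this check is an assumption.

```python
import re

# Two to seven digits, two digits, one check digit.
CAS_RE = re.compile(r"(\d{2,7})-(\d{2})-(\d)")


def is_valid_cas(cas_number: str) -> bool:
    """Check the format and check digit of a CAS Registry Number."""
    match = CAS_RE.fullmatch(cas_number.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the body digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit
```

For acetone, `67-64-1`, the weighted sum is 4×1 + 6×2 + 7×3 + 6×4 = 61, and 61 mod 10 matches the check digit 1.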
Processing Pipeline
┌─────────────────┐
│  SDS Upload     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Parse Queue    │ ← Background job picks up
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  PDF Download   │ ← From S3
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Text Extract   │ ← PyMuPDF + OCR fallback
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  LLM Parsing    │ ← Claude/GPT-4
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Validation     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Database Save  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Post-Processing│ ← PPE, enrichment, etc.
└─────────────────┘
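A linear pipeline like this can be modeled as an ordered list of named stages, each consuming the previous stage's output. The sketch below is one possible shape, not the service's implementation: `run_pipeline`, `StageError`, and the stage names are hypothetical.

```python
from typing import Any, Callable, List, Tuple

# A stage is a (name, function) pair; the function maps input to output.
Stage = Tuple[str, Callable[[Any], Any]]


class StageError(RuntimeError):
    """Raised when a stage fails; records which stage it was."""

    def __init__(self, stage: str, cause: Exception):
        super().__init__(f"stage '{stage}' failed: {cause}")
        self.stage = stage


def run_pipeline(stages: List[Stage], payload: Any) -> Any:
    """Run stages in order, feeding each stage's output into the next."""
    for name, step in stages:
        try:
            payload = step(payload)
        except Exception as exc:
            # Wrap the error so the failing stage is visible to the caller.
            raise StageError(name, exc) from exc
    return payload
```

Recording the failing stage name lets a background job distinguish, for example, a download failure that is worth retrying from a validation failure that needs review.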
Confidence Scoring
Each parse result includes a confidence score between 0.0 and 1.0:
| Score Range | Interpretation |
|---|---|
| ≥ 0.9 | High confidence, all sections extracted |
| 0.7 to < 0.9 | Good confidence, minor gaps |
| 0.5 to < 0.7 | Moderate confidence, some sections missing |
| < 0.5 | Low confidence, manual review needed |
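The bands translate directly into a threshold check. In this sketch the function names and the short labels are hypothetical, and each band is assumed to include its lower bound, so a score of exactly 0.9 counts as high.

```python
def interpret_confidence(score: float) -> str:
    """Map a confidence score in [0.0, 1.0] to its band label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0.0 and 1.0, got {score}")
    if score >= 0.9:
        return "high"
    if score >= 0.7:
        return "good"
    if score >= 0.5:
        return "moderate"
    return "low"


def needs_manual_review(score: float) -> bool:
    """Only the lowest band is flagged for manual review."""
    return interpret_confidence(score) == "low"
```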
Related Features
After parsing, additional processing includes:
- Chemical Enrichment: PubChem data lookup
- PPE Recommendations: AI-generated PPE requirements
- Hazard Classification: GHS category derivation
- Regulatory Checking: OSHA/EPA list matching