SDS Extraction

Tellus uses a large language model (LLM) to automatically extract structured data from Safety Data Sheets (SDS).

Overview

SDS documents are standardized but vary widely in format. The extraction service:

  1. Downloads the PDF from S3 storage
  2. Extracts text using PDF parsing (with OCR fallback)
  3. Parses the content using an LLM (Claude or GPT-4)
  4. Validates the extracted data against the GHS schema
  5. Stores the structured data in database tables
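
The OCR fallback in step 2 can be sketched as below. This is a minimal illustration under stated assumptions, not the Tellus implementation: the function name `extract_text`, the `min_chars` threshold, and the two injected extractor callables are all hypothetical. In practice the first callable would wrap a PDF text-layer parser and the second an OCR engine.

```python
from typing import Callable, Tuple


def extract_text(
    pdf_bytes: bytes,
    parse_text_layer: Callable[[bytes], str],  # hypothetical: wraps a PDF parser
    run_ocr: Callable[[bytes], str],           # hypothetical: wraps an OCR engine
    min_chars: int = 200,                      # assumed threshold, not a documented value
) -> Tuple[str, str]:
    """Return (text, method), falling back to OCR when the text layer is sparse."""
    text = parse_text_layer(pdf_bytes).strip()
    if len(text) >= min_chars:
        return text, "text_layer"
    # Scanned PDFs often carry little or no embedded text, so try OCR instead.
    return run_ocr(pdf_bytes).strip(), "ocr"
```

Returning the method alongside the text lets later stages record how the text was obtained, which may be useful when judging how much to trust the parse.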

Extracted Sections

The parser extracts all 16 GHS-mandated sections:

Section  Content                       Database Table
1        Identification                chemiq_sds_sections
2        Hazard Identification         chemiq_sds_hazard_info
3        Composition                   chemiq_sds_composition
4        First-Aid Measures            chemiq_sds_sections
5        Fire-Fighting                 chemiq_sds_sections
6        Accidental Release            chemiq_sds_sections
7        Handling & Storage            chemiq_sds_sections
8        Exposure Controls/PPE         chemiq_sds_sections
9        Physical/Chemical Properties  chemiq_sds_sections
10       Stability & Reactivity        chemiq_sds_sections
11       Toxicological Info            chemiq_sds_sections
12       Ecological Info               chemiq_sds_sections
13       Disposal                      chemiq_sds_sections
14       Transport                     chemiq_sds_sections
15       Regulatory Info               chemiq_sds_sections
16       Other Info                    chemiq_sds_sections
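
The mapping above amounts to two dedicated tables plus a shared default. A minimal sketch of that routing follows; the table names come from the table, while the function name `table_for_section` and the constants are hypothetical.

```python
# Sections with a dedicated table; every other section uses the shared table.
DEDICATED_TABLES = {
    2: "chemiq_sds_hazard_info",
    3: "chemiq_sds_composition",
}
DEFAULT_TABLE = "chemiq_sds_sections"


def table_for_section(section_number: int) -> str:
    """Return the database table that stores the given SDS section (1-16)."""
    if not 1 <= section_number <= 16:
        raise ValueError(f"SDS section must be 1-16, got {section_number}")
    return DEDICATED_TABLES.get(section_number, DEFAULT_TABLE)
```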

Key Extracted Data

Section 1: Identification

{
  "product_identifier": "Acetone",
  "other_identifiers": ["2-Propanone", "Dimethyl ketone"],
  "recommended_use": "Solvent, cleaning agent",
  "manufacturer": {
    "company_name": "Chemical Corp",
    "address": {...},
    "phone": "1-800-XXX-XXXX",
    "emergency_contact": {...}
  }
}

Section 2: Hazard Identification

{
  "signal_word": "Danger",
  "pictograms": ["GHS02", "GHS07"],
  "classification": [
    {
      "hazard_class": "Flammable liquids",
      "hazard_category": "Category 2",
      "hazard_code": "H225"
    }
  ],
  "hazard_statements": ["H225: Highly flammable liquid and vapor"],
  "precautionary_statements": {...}
}
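
Fields like these lend themselves to simple format checks before storage. The sketch below is an illustration only, not the Tellus validation schema: the function name and the regular expressions are assumptions. It relies on GHS defining two signal words ("Danger" and "Warning"), nine pictograms (GHS01 to GHS09), and hazard codes of the form H2xx, H3xx, or H4xx, optionally followed by a short letter suffix.

```python
import re
from typing import Any, Dict, List

SIGNAL_WORDS = {"Danger", "Warning"}
PICTOGRAM_RE = re.compile(r"GHS0[1-9]")
# H2xx = physical, H3xx = health, H4xx = environmental hazards.
HAZARD_CODE_RE = re.compile(r"H[2-4]\d{2}[A-Za-z]{0,2}")


def validate_hazard_info(data: Dict[str, Any]) -> List[str]:
    """Return a list of format problems; an empty list means no problems found."""
    problems: List[str] = []
    signal_word = data.get("signal_word")
    # Some classifications carry no signal word, so None is accepted here.
    if signal_word is not None and signal_word not in SIGNAL_WORDS:
        problems.append(f"unknown signal word: {signal_word!r}")
    for pictogram in data.get("pictograms", []):
        if not PICTOGRAM_RE.fullmatch(pictogram):
            problems.append(f"unknown pictogram: {pictogram!r}")
    for entry in data.get("classification", []):
        code = entry.get("hazard_code", "")
        if not HAZARD_CODE_RE.fullmatch(code):
            problems.append(f"malformed hazard code: {code!r}")
    return problems
```

Collecting every problem, rather than raising on the first, gives a reviewer the full picture of what the parse got wrong in one pass.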

Section 3: Composition

{
  "ingredients": [
    {
      "chemical_name": "Acetone",
      "cas_number": "67-64-1",
      "concentration": {"range_min": 99, "range_max": 100}
    }
  ]
}
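
CAS numbers carry a check digit, which makes them a good target for catching extraction errors. The sketch below shows the standard CAS checksum; whether Tellus runs this check is an assumption, and the function name `is_valid_cas` is hypothetical. The check digit equals the sum of the other digits, each weighted by its position counted from the right, modulo 10. For `67-64-1`: 4×1 + 6×2 + 7×3 + 6×4 = 61, and 61 mod 10 = 1.

```python
import re

# CAS format: 2-7 digits, hyphen, 2 digits, hyphen, 1 check digit.
CAS_RE = re.compile(r"(\d{2,7})-(\d{2})-(\d)")


def is_valid_cas(cas_number: str) -> bool:
    """Check the format and check digit of a CAS Registry Number."""
    match = CAS_RE.fullmatch(cas_number)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == check_digit
```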

Processing Pipeline

┌─────────────────┐
│   SDS Upload    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Parse Queue   │ ← Background job picks up
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  PDF Download   │ ← From S3
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Text Extract   │ ← PyMuPDF + OCR fallback
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   LLM Parsing   │ ← Claude/GPT-4
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Validation    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Database Save  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Post-Processing │ ← PPE, enrichment, etc.
└─────────────────┘
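
A sequential pipeline like this one can be modeled as an ordered list of named stages that each transform a shared context. The sketch below is an illustration of that pattern only; the `run_pipeline` function, the context dictionary, and the status fields are hypothetical and not taken from the Tellus codebase.

```python
from typing import Any, Callable, Dict, List, Tuple

Context = Dict[str, Any]
Stage = Tuple[str, Callable[[Context], Context]]


def run_pipeline(stages: List[Stage], context: Context) -> Context:
    """Run stages in order; stop at the first failure and record where it happened."""
    for name, stage in stages:
        try:
            context = stage(context)
        except Exception as exc:
            # Keep whatever earlier stages produced so the job can be inspected.
            return {**context, "status": "failed", "failed_stage": name, "error": str(exc)}
    return {**context, "status": "completed"}
```

Recording the failed stage name means a background job can report whether a document failed at download, text extraction, or parsing, instead of a generic error.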

Confidence Scoring

Each parse includes a confidence score (0.0 - 1.0):

Score Range  Interpretation
0.9 - 1.0    High confidence, all sections extracted
0.7 - 0.9    Good confidence, minor gaps
0.5 - 0.7    Moderate, some sections missing
< 0.5        Low confidence, manual review needed
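
The ranges above share their boundary values (0.9 and 0.7 each appear in two rows). The sketch below assumes each band includes its lower bound, so a score of exactly 0.9 counts as high confidence; that tie-breaking rule, the function name, and the short labels are assumptions, not documented behavior.

```python
def interpret_confidence(score: float) -> str:
    """Map a confidence score in [0.0, 1.0] to its interpretation band."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence score must be in [0.0, 1.0], got {score}")
    if score >= 0.9:
        return "high"
    if score >= 0.7:
        return "good"
    if score >= 0.5:
        return "moderate"
    return "low"  # manual review needed
```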

Post-Processing

After parsing and saving, the post-processing stage runs these additional steps:

  • Chemical Enrichment - PubChem data lookup
  • PPE Recommendations - AI-generated PPE requirements
  • Hazard Classification - GHS category derivation
  • Regulatory Checking - OSHA/EPA list matching