SDS Web Search via Perplexity API - Implementation Plan

Overview

Add a feature to search for Safety Data Sheets (SDS) across the web using Perplexity Search API when no SDS is found in our internal library. This enables users to find and attach SDS documents from manufacturer websites and SDS databases.

User Flow

Product Details Screen → SDS Missing

"Search for SDS" button clicked

Step 1: Search INTERNAL library first (company + global SDS)

If found → Display results → User attaches SDS → Done

If NOT found → Show "Search Web for SDS" option

User clicks "Search Web for SDS"

Step 2: Check cache table (chemiq_sds_web_search_cache)

If cached results exist (< 7 days old) → Return cached results

If NO cache → Call Perplexity API → Save results to cache

Display results with PDF links, source URLs, titles

User previews/selects an SDS

System downloads PDF and uploads to our S3 → creates SDS record

SDS attached to product

Clean up cache for this product (optional - results served their purpose)

Parse job created in chemiq_sds_parse_queue

Background service picks up job → LLM parses SDS

Hazard info, composition, sections populated

Cost Optimization Strategy

1. Internal Search First

Always search internal library before calling Perplexity:

  • Company-mapped SDS documents
  • Global SDS repository
  • Only show "Search Web" if internal search returns 0 results

2. Web Search Result Caching

Cache Perplexity results to avoid repeated API calls:

  • Cache Key: company_id + product_name + manufacturer (normalized)
  • Cache TTL: 7 days (SDS sources don't change frequently)
  • Cleanup: Mark the cache entry as used when the user imports an SDS; the daily cleanup job deletes it afterwards
  • Table: chemiq_sds_web_search_cache

3. Cache Benefits

  • User returns to product screen → Shows cached web results instantly
  • User searches same product multiple times → No additional API cost
  • Multiple users in same company search same product → Share cache
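
As a minimal sketch of the cache key described above, the snippet below normalizes the product name and manufacturer before combining them with the company ID. The names `CacheKey`, `normalize`, and `build_cache_key` are hypothetical, and collapsing internal whitespace is an assumption that goes slightly beyond the lowercase-and-trim normalization used later in this plan.

```python
import re
from typing import NamedTuple


class CacheKey(NamedTuple):
    """Hypothetical value object mirroring the cache table's unique constraint."""
    company_id: str
    product_name: str
    manufacturer: str


def normalize(text: str) -> str:
    """Trim, collapse whitespace runs to one space, and case-fold."""
    return re.sub(r"\s+", " ", text.strip()).casefold()


def build_cache_key(company_id: str, product_name: str, manufacturer: str) -> CacheKey:
    """Combine the company ID with the normalized search criteria."""
    return CacheKey(company_id, normalize(product_name), normalize(manufacturer))
```

With this shape, two users in the same company who type "Clorox Wipes" and " clorox  wipes" would hit the same cache entry, while a different company always gets its own entry.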

Perplexity Search API Overview

Endpoint: POST https://api.perplexity.ai/search

Authentication: Authorization: Bearer <PERPLEXITY_API_KEY>

Key Parameters:

Parameter            | Type       | Description
query                | string     | Search query (e.g., "Clorox Disinfecting Wipes SDS PDF safety data sheet")
max_results          | int (1-20) | Number of results
search_domain_filter | string[]   | Limit to known SDS databases
country              | string     | Geographic filter (US)

Response:

{
  "results": [
    {
      "title": "Clorox Disinfecting Wipes Safety Data Sheet",
      "url": "https://www.thecloroxcompany.com/wp-content/uploads/2024/sds-wipes.pdf",
      "snippet": "SAFETY DATA SHEET - Product: Clorox Disinfecting Wipes...",
      "date": "2024-01-15"
    }
  ]
}

Pricing: Per-request (no token-based pricing)
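
To make the endpoint, header, and parameters above concrete, here is a minimal sketch that only builds the HTTP request using the standard library, without sending it. The function name `build_search_request` is hypothetical, and the real client in this plan is expected to be async; this is only meant to show the request shape.

```python
import json
import urllib.request

SEARCH_URL = "https://api.perplexity.ai/search"


def build_search_request(
    api_key: str,
    query: str,
    max_results: int = 10,
    country: str = "US",
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the search endpoint."""
    # The parameter table above documents max_results as 1-20.
    if not 1 <= max_results <= 20:
        raise ValueError("max_results must be between 1 and 20")
    payload = {"query": query, "max_results": max_results, "country": country}
    return urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen(req, timeout=30)` would perform the call; each such call counts as one billable request.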

Implementation Steps

Phase 1: Backend - Perplexity Integration Service

1.1 Create Perplexity Client Service

File: tellus-ehs-hazcom-service/app/services/external/perplexity_client.py

from typing import List, Optional


class PerplexityClient:
    """Client for Perplexity Search API"""

    BASE_URL = "https://api.perplexity.ai/search"

    # Known SDS databases to prioritize
    SDS_DOMAINS = [
        "msdsonline.com",
        "chemicalsafety.com",
        "sds.chemtel.net",
        "ehs.stanford.edu",
        "msdsdigital.com",
        "sdsmanager.com",
        "hazard.com",
        # Manufacturer sites often have SDSs
    ]

    async def search_sds(
        self,
        product_name: str,
        manufacturer: str,
        barcode: Optional[str] = None,
        max_results: int = 10
    ) -> List[WebSDSSearchResult]:
        """Search for SDS documents on the web"""

        # Build optimized search query
        query = self._build_sds_query(product_name, manufacturer, barcode)

        payload = {
            "query": query,
            "max_results": max_results,
            "country": "US",
            # Optional: restricts results to SDS sites (manufacturer sites are then excluded)
            "search_domain_filter": self.SDS_DOMAINS
        }

        response = await self._make_request(payload)
        return self._parse_results(response)

    def _build_sds_query(self, product_name, manufacturer, barcode):
        """Build optimized search query for SDS"""
        parts = [product_name, manufacturer, "SDS", "safety data sheet", "PDF"]
        if barcode:
            # Put the barcode first so it leads the query
            parts.insert(0, barcode)
        return " ".join(parts)
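
The client above calls `_parse_results` without showing it. The following is one possible sketch of that step, assuming the response shape shown in the API overview. `ParsedResult` and `parse_results` are hypothetical stand-ins that use a dataclass instead of the Pydantic schema so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional
from urllib.parse import urlparse


@dataclass
class ParsedResult:
    """Hypothetical stand-in for the WebSDSSearchResult schema."""
    title: str
    url: str
    snippet: str
    source_domain: str
    date: Optional[str]
    is_pdf: bool


def parse_results(response: Dict[str, Any]) -> List[ParsedResult]:
    """Convert the raw search response into typed results."""
    parsed: List[ParsedResult] = []
    for item in response.get("results", []):
        url = item.get("url", "")
        parts = urlparse(url)
        if not parts.netloc:
            continue  # skip entries without a usable absolute URL
        domain = parts.netloc.lower()
        if domain.startswith("www."):
            domain = domain[4:]
        parsed.append(ParsedResult(
            title=item.get("title", ""),
            url=url,
            snippet=item.get("snippet", ""),
            source_domain=domain,
            date=item.get("date"),
            # Check the path only, so query strings do not hide a .pdf suffix
            is_pdf=parts.path.lower().endswith(".pdf"),
        ))
    return parsed
```

A URL ending in `.pdf` is only a hint; the download step still has to verify the file content.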

1.2 Create Web SDS Search Service

File: tellus-ehs-hazcom-service/app/services/chemiq/web_sds_search_service.py

class WebSDSSearchService:
    """Service for searching SDS documents on the web"""

    async def search_web_sds(
        self,
        product_name: str,
        manufacturer: str,
        barcode: Optional[str] = None
    ) -> List[WebSDSSearchResult]:
        """
        Search for SDS on the web via Perplexity
        Returns list of potential SDS documents with URLs
        """

    async def import_sds_from_url(
        self,
        url: str,
        product_name: str,
        manufacturer: str,
        company_id: UUID,
        user_id: UUID
    ) -> SDSDocument:
        """
        Download SDS PDF from URL and import into our system
        1. Download PDF from URL
        2. Validate it's a PDF
        3. Extract basic metadata
        4. Upload to S3
        5. Create SDS record
        6. Queue for parsing
        """

1.3 Add API Endpoint

File: tellus-ehs-hazcom-service/app/api/v1/chemiq/sds.py

@router.post("/search-web", response_model=WebSDSSearchResponse)
async def search_web_for_sds(
    request: WebSDSSearchRequest,
    ctx: UserContext = Depends(get_user_context),
    db: Session = Depends(get_db)
):
    """
    Search the web for SDS documents using Perplexity API

    - Searches public SDS databases and manufacturer sites
    - Returns URLs to potential SDS PDFs
    - User can preview and select to import
    """

@router.post("/import-from-url", response_model=SDSDocumentResponse)
async def import_sds_from_url(
    request: ImportSDSFromURLRequest,
    ctx: UserContext = Depends(get_user_context),
    db: Session = Depends(get_db)
):
    """
    Import an SDS document from a URL

    - Downloads PDF from provided URL
    - Validates and deduplicates
    - Uploads to S3 and creates SDS record
    - Attaches to product/inventory
    """

1.4 Create Schemas

File: tellus-ehs-hazcom-service/app/schemas/chemiq/web_sds_search.py

class WebSDSSearchRequest(BaseModel):
    product_name: str
    manufacturer: str
    barcode_upc: Optional[str] = None

class WebSDSSearchResult(BaseModel):
    title: str
    url: str
    snippet: str
    source_domain: str
    date: Optional[str] = None
    is_pdf: bool  # True if URL ends with .pdf
    confidence_score: float  # Our calculated relevance

class WebSDSSearchResponse(BaseModel):
    results: List[WebSDSSearchResult]
    total: int
    search_query_used: str

class ImportSDSFromURLRequest(BaseModel):
    url: str
    product_name: str
    manufacturer: str
    revision_date: Optional[date] = None
    attach_to_chemical_id: Optional[UUID] = None
    attach_to_company_product_id: Optional[UUID] = None
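
The plan does not define how `confidence_score` is calculated. Purely as an illustration, the sketch below shows one simple heuristic that could fill that field. The function name, the three signals, and the weights (0.4, 0.3, 0.3) are all assumptions, not part of the design, and would need tuning against real search results.

```python
import re


def confidence_score(title: str, snippet: str, url: str, product_name: str) -> float:
    """Hypothetical relevance heuristic in the range 0.0-1.0."""
    text = f"{title} {snippet}".lower()
    score = 0.0

    # Signal 1: the link looks like a direct PDF download.
    if url.lower().split("?")[0].endswith(".pdf"):
        score += 0.4

    # Signal 2: the page describes itself as an SDS (or legacy MSDS).
    if "safety data sheet" in text or re.search(r"\bm?sds\b", text):
        score += 0.3

    # Signal 3: share of product-name words found in the title or snippet.
    tokens = re.findall(r"[a-z0-9]+", product_name.lower())
    if tokens:
        matched = sum(1 for token in tokens if token in text)
        score += 0.3 * matched / len(tokens)

    return round(score, 2)
```

Sorting results by this score before display would put direct PDF links that mention the product at the top of the modal.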

Phase 2: Configuration & Environment

2.1 Add Environment Variables

File: .env

# Perplexity API
PERPLEXITY_API_KEY=pplx-xxxxxxxxxxxx
PERPLEXITY_ENABLED=true

2.2 Update Config

File: tellus-ehs-hazcom-service/app/core/config.py

# Perplexity API (for web SDS search)
PERPLEXITY_API_KEY: Optional[str] = None
PERPLEXITY_ENABLED: bool = False

Phase 3: Frontend - Web SDS Search UI

3.1 Add API Function

File: tellus-ehs-hazcom-ui/src/services/api/chemiq.api.ts

export interface WebSDSSearchResult {
  title: string;
  url: string;
  snippet: string;
  source_domain: string;
  date?: string;
  is_pdf: boolean;
  confidence_score: number;
}

export interface WebSDSSearchResponse {
  results: WebSDSSearchResult[];
  total: number;
  search_query_used: string;
}

export async function searchWebForSDS(
  token: string,
  userId: string,
  companyId: string,
  productName: string,
  manufacturer: string,
  barcodeUpc?: string
): Promise<WebSDSSearchResponse> {
  return apiClient.post<WebSDSSearchResponse>(
    '/api/v1/chemiq/sds/search-web',
    { product_name: productName, manufacturer, barcode_upc: barcodeUpc },
    { headers: { ...authHeaders(token, userId, companyId) } }
  );
}

export async function importSDSFromURL(
  token: string,
  userId: string,
  companyId: string,
  request: ImportSDSFromURLRequest
): Promise<SDSDocumentResponse> {
  return apiClient.post<SDSDocumentResponse>(
    '/api/v1/chemiq/sds/import-from-url',
    request,
    { headers: { ...authHeaders(token, userId, companyId) } }
  );
}

3.2 Create WebSDSSearchModal Component

File: tellus-ehs-hazcom-ui/src/pages/chemiq/inventory/components/WebSDSSearchModal.tsx

interface WebSDSSearchModalProps {
  isOpen: boolean;
  onClose: () => void;
  productName: string;
  manufacturer: string;
  barcodeUpc?: string;
  chemicalId?: string;
  companyProductId?: string;
  onSDSImported: (sdsId: string) => void;
}

export const WebSDSSearchModal: React.FC<WebSDSSearchModalProps> = ({...}) => {
  // State for search results, loading, selected result

  // Display:
  // 1. Search status and query used
  // 2. List of results with:
  //    - Title (linked to URL)
  //    - Source domain badge
  //    - Snippet preview
  //    - PDF indicator
  //    - Confidence score
  //    - "Preview" button (opens URL in new tab)
  //    - "Import & Attach" button
  // 3. Import progress when user selects one
}

3.3 Update SDSInfoCard

File: tellus-ehs-hazcom-ui/src/pages/chemiq/inventory/components/SDSInfoCard.tsx

Add "Search Web for SDS" button in the SDS Missing state:

// In SDS Missing state section
<div className="flex items-center justify-center gap-3">
  <button className="btn-secondary px-4 py-2 flex items-center gap-2">
    <Search className="w-4 h-4" />
    Search Library
  </button>
  <button
    onClick={() => setShowWebSearchModal(true)}
    className="btn-secondary px-4 py-2 flex items-center gap-2"
  >
    <Globe className="w-4 h-4" />
    Search Web for SDS
  </button>
  <button className="btn-primary px-4 py-2 flex items-center gap-2">
    <Upload className="w-4 h-4" />
    Upload SDS
  </button>
</div>

{/* Web Search Modal */}
<WebSDSSearchModal
  isOpen={showWebSearchModal}
  onClose={() => setShowWebSearchModal(false)}
  productName={chemical.product_name}
  manufacturer={chemical.manufacturer}
  barcodeUpc={chemical.barcode_upc}
  chemicalId={chemical.chemical_id}
  onSDSImported={handleSDSImported}
/>

3.4 Update SDSSearchSection (Add Web Search Tab)

File: tellus-ehs-hazcom-ui/src/pages/chemiq/inventory/components/SDSSearchSection.tsx

Add a second tab or section for "Web Search" when internal search returns no results:

{searchResults.length === 0 && hasSearched && (
  <div className="mt-4 p-4 bg-blue-50 rounded-lg">
    <p className="text-sm text-blue-800 mb-3">
      No SDS found in library. Try searching the web:
    </p>
    <button
      onClick={() => setShowWebSearch(true)}
      className="btn-secondary flex items-center gap-2"
    >
      <Globe className="w-4 h-4" />
      Search Web for SDS
    </button>
  </div>
)}

Phase 4: PDF Download & Import Logic

4.1 PDF Download Utility

File: tellus-ehs-hazcom-service/app/utils/pdf_downloader.py

async def download_pdf_from_url(
    url: str,
    max_size_mb: int = 20,
    timeout_seconds: int = 30
) -> Tuple[bytes, str, int]:
    """
    Download PDF from URL

    Returns: (pdf_bytes, content_type, file_size)
    Raises: HTTPException on validation failure
    """
    # 1. Validate URL format
    # 2. Make HEAD request to check content type and size
    # 3. Download with size limit
    # 4. Validate it's actually a PDF (check magic bytes)
    # 5. Return bytes
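
Steps 3 and 4 of the outline above (size limit and magic-byte check) can be sketched on already-downloaded bytes as follows. `PDFValidationError` and `validate_pdf_bytes` are hypothetical names; the real utility is described as raising `HTTPException`, which is swapped for a plain exception here so the example has no framework dependency.

```python
PDF_MAGIC = b"%PDF-"  # every PDF file header starts with these bytes


class PDFValidationError(ValueError):
    """Hypothetical error type for rejected downloads."""


def validate_pdf_bytes(data: bytes, max_size_mb: int = 20) -> int:
    """Return the file size in bytes, or raise if the payload is not acceptable."""
    size = len(data)
    if size == 0:
        raise PDFValidationError("empty download")
    if size > max_size_mb * 1024 * 1024:
        raise PDFValidationError(f"file exceeds {max_size_mb} MB limit")
    # Strict check: some PDF readers tolerate a few leading junk bytes,
    # but this sketch requires the header at offset 0.
    if not data.startswith(PDF_MAGIC):
        raise PDFValidationError("content is not a PDF (bad magic bytes)")
    return size
```

Checking the bytes rather than the `Content-Type` header matters because servers frequently return an HTML error page with a 200 status for a dead PDF link.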

4.2 SDS Import Flow

URL → Download PDF → Validate PDF → Calculate SHA256 hash
→ Check for duplicates (by hash)
→ If duplicate: return existing SDS
→ If new: Upload to S3 → Create SDS record
→ Create company mapping
→ Attach to chemical/product (if specified)
→ Create parse job in chemiq_sds_parse_queue (priority=8, high)
→ Background service parses → populates hazard_info, composition, sections
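
The hash-based duplicate check in this flow can be sketched as follows, with a plain dictionary standing in for the lookup against existing SDS records. The names `sha256_hex` and `import_or_reuse` and the string IDs are hypothetical.

```python
import hashlib
from typing import Dict, Tuple


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the PDF content."""
    return hashlib.sha256(data).hexdigest()


def import_or_reuse(
    pdf_bytes: bytes,
    existing_by_hash: Dict[str, str],
    new_sds_id: str,
) -> Tuple[str, bool]:
    """Return (sds_id, created). Reuses the existing SDS when the hash matches."""
    digest = sha256_hex(pdf_bytes)
    if digest in existing_by_hash:
        return existing_by_hash[digest], False
    # New content: in the real flow this is where the S3 upload and
    # SDS record creation would happen.
    existing_by_hash[digest] = new_sds_id
    return new_sds_id, True
```

Hashing the content rather than the URL means the same SDS fetched from two different mirrors is stored only once.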

4.3 Create Parse Job After Import

File: tellus-ehs-hazcom-service/app/services/chemiq/web_sds_search_service.py

async def _queue_for_parsing(self, sds_id: UUID, db: Session) -> None:
    """
    Queue the imported SDS for background parsing.

    Creates a job in chemiq_sds_parse_queue with high priority
    since this is a user-initiated import.
    """
    from datetime import datetime, timezone
    from uuid import uuid4

    parse_job = SDSParseJob(
        job_id=uuid4(),
        sds_id=sds_id,
        job_status='pending',
        priority=8,  # High priority for user-initiated imports
        parse_sections=list(range(1, 17)),  # All 16 sections
        retry_count=0,
        created_at=datetime.now(timezone.utc)
    )
    db.add(parse_job)
    db.commit()

Phase 5: Web Search Cache Table

5.1 Create Cache Table

Migration: add_chemiq_sds_web_search_cache.py

CREATE TABLE chemiq_sds_web_search_cache (
    cache_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    company_id UUID NOT NULL REFERENCES core_data_companies(company_id) ON DELETE CASCADE,

    -- Search criteria (normalized for matching)
    product_name_normalized VARCHAR(255) NOT NULL,
    manufacturer_normalized VARCHAR(255) NOT NULL,
    barcode_upc VARCHAR(100),

    -- Search metadata
    search_query_used TEXT NOT NULL,
    searched_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),

    -- Cached results (JSONB array)
    results JSONB NOT NULL,
    result_count INTEGER NOT NULL DEFAULT 0,

    -- Tracking
    created_by_user_id UUID REFERENCES core_data_users(user_id),
    times_accessed INTEGER NOT NULL DEFAULT 1,
    last_accessed_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),

    -- Cleanup tracking
    sds_imported BOOLEAN NOT NULL DEFAULT FALSE,
    imported_sds_id UUID REFERENCES chemiq_sds_documents(sds_id),

    UNIQUE(company_id, product_name_normalized, manufacturer_normalized)
);

CREATE INDEX idx_sds_web_cache_company ON chemiq_sds_web_search_cache(company_id);
CREATE INDEX idx_sds_web_cache_searched_at ON chemiq_sds_web_search_cache(searched_at);

5.2 Cache Schema for Results

class CachedWebSDSResult(BaseModel):
    title: str
    url: str
    snippet: str
    source_domain: str
    date: Optional[str] = None  # Same default as WebSDSSearchResult.date
    is_pdf: bool
    confidence_score: float

# Stored in results JSONB column as array

5.3 Cache Service Logic

File: tellus-ehs-hazcom-service/app/services/chemiq/web_sds_search_service.py

from datetime import datetime, timezone
from typing import Optional
from uuid import UUID


class WebSDSSearchService:

    CACHE_TTL_DAYS = 7

    async def search_web_sds(
        self,
        company_id: UUID,
        product_name: str,
        manufacturer: str,
        barcode: Optional[str] = None,
        user_id: Optional[UUID] = None
    ) -> WebSDSSearchResponse:
        """
        Search for SDS on the web with caching.

        1. Check cache first
        2. If cache hit and fresh → return cached results
        3. If cache miss → call Perplexity → save to cache
        """

        # Normalize for cache lookup
        product_normalized = self._normalize(product_name)
        manufacturer_normalized = self._normalize(manufacturer)

        # Check cache
        cached = await self._get_cached_results(
            company_id, product_normalized, manufacturer_normalized
        )

        if cached and self._is_cache_fresh(cached):
            # Update access tracking
            await self._update_cache_access(cached.cache_id)
            return self._cached_to_response(cached)

        # Cache miss or stale - call Perplexity
        results = await self.perplexity_client.search_sds(
            product_name, manufacturer, barcode
        )

        # Save to cache
        await self._save_to_cache(
            company_id=company_id,
            product_normalized=product_normalized,
            manufacturer_normalized=manufacturer_normalized,
            barcode=barcode,
            results=results,
            user_id=user_id
        )

        # Wrap the raw list so both code paths return WebSDSSearchResponse
        return WebSDSSearchResponse(
            results=results,
            total=len(results),
            search_query_used=self.perplexity_client._build_sds_query(
                product_name, manufacturer, barcode
            ),
        )

    def _normalize(self, text: str) -> str:
        """Normalize text for cache matching."""
        return text.lower().strip()

    def _is_cache_fresh(self, cached) -> bool:
        """Check if cache entry is still valid."""
        age = datetime.now(timezone.utc) - cached.searched_at
        return age.days < self.CACHE_TTL_DAYS

    async def mark_cache_used(self, company_id: UUID, sds_id: UUID) -> None:
        """Mark cache as used when SDS is imported."""
        # Update cache entry to mark sds_imported = True
        # Optional: delete old cache entries
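
The check-cache-then-call flow can be exercised without a database or the Perplexity API. The sketch below is a simplified in-memory model under those assumptions: `InMemorySearchCache` and its `fetch` callback are hypothetical, results are plain strings, and the current time is passed in explicitly so expiry is easy to test.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, List, Tuple

CACHE_TTL = timedelta(days=7)


class InMemorySearchCache:
    """Hypothetical cache-aside wrapper around an expensive search call."""

    def __init__(self, fetch: Callable[[str], List[str]]) -> None:
        self._fetch = fetch
        self._entries: Dict[str, Tuple[datetime, List[str]]] = {}

    def search(self, key: str, now: datetime) -> Tuple[List[str], bool]:
        """Return (results, from_cache)."""
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < CACHE_TTL:
            return entry[1], True  # fresh hit: no API call
        # Miss or stale entry: call the API and overwrite the entry.
        results = self._fetch(key)
        self._entries[key] = (now, results)
        return results, False
```

A stale entry is overwritten in place rather than duplicated, which matches the unique constraint on the cache table.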

5.4 Cache Cleanup (Background Job)

Add to background service scheduler:

@scheduler.scheduled_job('cron', day='*', hour=3, minute=0)
def cleanup_stale_sds_web_cache():
    """
    Daily cleanup of stale web search cache.

    - Delete entries older than 30 days
    - Delete entries where sds_imported = True and older than 7 days
    """

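The two deletion rules in the cleanup job's docstring can be expressed as a single predicate. This is only a sketch: `should_delete` is a hypothetical helper, and the real job would more likely apply the same conditions in a SQL DELETE against chemiq_sds_web_search_cache.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=30)          # any entry older than this is removed
IMPORTED_MAX_AGE = timedelta(days=7)  # shorter retention once an SDS was imported


def should_delete(searched_at: datetime, sds_imported: bool, now: datetime) -> bool:
    """Return True when a cache row matches either cleanup rule."""
    age = now - searched_at
    if age > MAX_AGE:
        return True
    return sds_imported and age > IMPORTED_MAX_AGE
```

Note that entries between 7 and 30 days old that were never imported stay in the table but are already treated as stale by the freshness check, so they are refreshed on the next search.
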
Phase 6: Rate Limiting

  • Limit web searches per company: 50/day
  • Limit per user: 20/day
  • Track in Redis or database
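
A minimal in-memory sketch of the two daily limits follows. `DailySearchLimiter` is a hypothetical name, and the dictionary would be replaced by Redis counters or a database table in practice; the sketch only shows the counting logic.

```python
from collections import defaultdict
from datetime import date
from typing import DefaultDict, Tuple


class DailySearchLimiter:
    """Hypothetical per-day counter enforcing company and user limits."""

    def __init__(self, company_limit: int = 50, user_limit: int = 20) -> None:
        self.company_limit = company_limit
        self.user_limit = user_limit
        self._counts: DefaultDict[Tuple[str, str, date], int] = defaultdict(int)

    def try_acquire(self, company_id: str, user_id: str, today: date) -> bool:
        """Count one web search if both limits allow it; return False otherwise."""
        company_key = ("company", company_id, today)
        user_key = ("user", f"{company_id}:{user_id}", today)
        if (self._counts[company_key] >= self.company_limit
                or self._counts[user_key] >= self.user_limit):
            return False  # rejected requests are not counted
        self._counts[company_key] += 1
        self._counts[user_key] += 1
        return True
```

Including the date in the key makes the counters reset naturally at the day boundary. It seems reasonable to check the limiter only on a cache miss, since cached results do not trigger a Perplexity call.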

File Changes Summary

New Files to Create:

Backend (tellus-ehs-hazcom-service):

  1. app/services/external/__init__.py
  2. app/services/external/perplexity_client.py - Perplexity API client
  3. app/services/chemiq/web_sds_search_service.py - Web search + cache orchestration
  4. app/schemas/chemiq/web_sds_search.py - Request/response schemas
  5. app/utils/pdf_downloader.py - PDF download utility
  6. app/db/models/chemiq_sds_web_cache.py - Cache table model
  7. alembic/versions/xxx_add_chemiq_sds_web_search_cache.py - Migration

Background Service (tellus-ehs-background-service):

  1. app/jobs/cleanup_sds_web_cache.py - Daily cache cleanup job

Frontend (tellus-ehs-hazcom-ui):

  1. src/pages/chemiq/inventory/components/WebSDSSearchModal.tsx - Modal for web search
  2. src/types/webSdsSearch.ts - TypeScript types

Files to Modify:

Backend:

  1. app/api/v1/chemiq/sds.py - Add new endpoints
  2. app/core/config.py - Add Perplexity config
  3. .env - Add API key

Frontend:

  1. src/services/api/chemiq.api.ts - Add API functions
  2. src/services/api/index.ts - Export new functions
  3. src/pages/chemiq/inventory/components/SDSInfoCard.tsx - Add web search button
  4. src/pages/chemiq/inventory/components/SDSSearchSection.tsx - Add web search fallback

Security Considerations

  1. URL Validation: Only allow HTTPS URLs, validate against known patterns
  2. PDF Validation: Verify magic bytes, scan for malicious content
  3. Size Limits: Max 20MB per PDF download
  4. Rate Limiting: Prevent API abuse
  5. Domain Allowlist: Consider limiting to trusted SDS databases
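
The URL validation item can be sketched as a pre-download check that accepts only HTTPS and rejects hosts that are obviously internal. `is_safe_sds_url` is a hypothetical helper, and it covers only literal hosts: a DNS name that resolves to a private address would still need a check at connection time.

```python
import ipaddress
from urllib.parse import urlparse


def is_safe_sds_url(url: str) -> bool:
    """Return True for HTTPS URLs whose host is not an obvious internal target."""
    try:
        parts = urlparse(url)
        host = parts.hostname
    except ValueError:
        return False  # malformed URL, e.g. an unterminated IPv6 literal
    if parts.scheme != "https" or not host:
        return False
    host = host.lower()
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return True  # a DNS name; resolution-time checks are out of scope here
    # IP literal: allow only publicly routable addresses.
    return ip.is_global
```

This kind of check matters because the import endpoint fetches a user-supplied URL from the server side, which could otherwise be pointed at internal services.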

Testing Checklist

Internal Search First:

  • Internal library search runs before web search option appears
  • "Search Web" button only shows when internal search returns 0 results

Caching:

  • First web search calls Perplexity and saves to cache
  • Second search for same product returns cached results (no API call)
  • Cache expires after 7 days and triggers fresh API call
  • Cache is marked as used when SDS is imported
  • Background cleanup job removes stale cache entries

Perplexity Integration:

  • Perplexity API integration works with valid API key
  • Search returns relevant SDS results
  • Domain filtering focuses on SDS databases

PDF Import:

  • PDF download handles various URL formats
  • Duplicate detection by hash works
  • S3 upload and SDS record creation work

Parse Job:

  • Parse job created in chemiq_sds_parse_queue after import
  • Background service picks up and parses imported SDS
  • Hazard info, composition, sections populated after parsing

Frontend:

  • Frontend modal displays results correctly
  • Import flow attaches SDS to product
  • Error handling for failed downloads
  • Loading states during search and import

Rate Limiting:

  • Rate limiting works (50/day per company, 20/day per user)
