
CAS & Regulatory Linking — Determination Logic

This document explains how every field on the /chemiq/cas-linking screen is determined, including how the OSHA, NIOSH, TRI, CERCLA, Prop 65, and Carcinogen columns are populated, along with the exposure limit values shown in the expanded detail view.


1. High-Level Architecture

SDS Upload / Parse
        ↓
chemiq_sds_composition (CAS numbers extracted from SDS Section 3)
        ↓
Background Service (RegulatoryEnrichmentService)
  ├─ Primary: PubChem PUG View API (regulatory sections)
  └─ Fallback: SDS Section 15 parser (parsed regulatory text)
        ↓
chemiq_regulatory_lists (boolean flags per CAS)
chemiq_pubchem_cache (numeric exposure limits per CAS)
        ↓
CAS Linking Page reads via CASLinkingService SQL query

Trigger: Regulatory data is enriched automatically when a new SDS is processed and new CAS numbers are detected. The background service (RegulatoryEnrichmentService) checks whether each CAS number already exists in chemiq_regulatory_lists; if it does not, the service fetches data from the sources described in Section 2 and inserts a row.


2. Data Sources

2.1 Primary: PubChem PUG View API

The PubChemRegulatoryClient fetches data from the NIH PubChem compound database by querying three PUG View headings:

| PUG View Heading | What It Provides |
| --- | --- |
| Regulatory Information | CERCLA RQ, TRI/SARA 313, Prop 65, OSHA PEL mentions, NIOSH REL mentions, ACGIH TLV mentions |
| Safety and Hazards | OSHA PEL confirmations, GHS data |
| Toxicity | IARC carcinogen classification, NTP carcinogen classification, ACGIH A1/A2 carcinogen classification, NIOSH/OSHA indicators |

Additionally, numeric exposure limit values are fetched from specific headings:

| PUG View Heading | Values Extracted |
| --- | --- |
| OSHA Standards | PEL in ppm and mg/m³ (from TWA text) |
| NIOSH Recommendations | REL in ppm and mg/m³ (from TWA text) |
| Immediately Dangerous to Life or Health | IDLH in ppm |
| Threshold Limit Values | ACGIH TLV in ppm |

Process:

  1. CAS number → PubChem CID resolution via PUG REST /compound/name/{cas}/cids/JSON
  2. CID → PUG View sections queried for regulatory text blocks
  3. Text blocks are parsed with keyword and regex matching to extract boolean flags and numeric values

Rate limit: 3 requests/second with retry (3 attempts, exponential backoff).
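
As a rough illustration of step 1 and the retry policy, the sketch below builds the PUG REST lookup URL, pulls the first CID out of a `cids/JSON` payload, and wraps a call in three attempts with exponential backoff. The function names, the 0.5-second base delay, and the injectable `sleep` parameter are assumptions made for this example; they are not taken from PubChemRegulatoryClient, and the 3 requests/second throttle is not shown.

```python
import time
from typing import Callable, Optional, TypeVar
from urllib.parse import quote

T = TypeVar("T")

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_lookup_url(cas_number: str) -> str:
    """Build the PUG REST URL that resolves a CAS number to PubChem CIDs."""
    return f"{PUG_REST_BASE}/compound/name/{quote(cas_number)}/cids/JSON"


def first_cid(payload: dict) -> Optional[int]:
    """Return the first CID in a cids/JSON payload, or None if there is none."""
    cids = payload.get("IdentifierList", {}).get("CID", [])
    return cids[0] if cids else None


def with_retry(
    call: Callable[[], T],
    attempts: int = 3,            # matches the documented 3 attempts
    base_delay: float = 0.5,      # hypothetical starting delay
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, doubling the wait after each failed attempt."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the backoff testable without real waiting; a production client would also need the request throttle and more selective exception handling.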

2.2 Fallback: SDS Section 15 Parser

When PubChem returns no data (common for UVCB substances and petroleum mixtures), the SDSSection15Client reads from the parsed SDS Section 15 data already stored in chemiq_sds_sections.

It parses the safety_health_environmental_regulations JSON key from the section data in two steps:

  • It scans key names for "sara", "tri", "cercla", "prop 65", "osha", "niosh", or "acgih".
  • It checks the text value under each matching key against negative phrases ("does not contain", "not subject", "not regulated", "not listed", etc.) to avoid false positives.
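
A minimal sketch of this keyword-plus-negation check follows, assuming the regulations data is a flat dict of string keys to text values. The function names are hypothetical, and the negative-phrase tuple contains only the four phrases quoted above; the real list is longer.

```python
REGULATORY_KEYWORDS = ("sara", "tri", "cercla", "prop 65", "osha", "niosh", "acgih")

# Only the phrases quoted in this document; the actual list is longer.
NEGATIVE_PHRASES = ("does not contain", "not subject", "not regulated", "not listed")


def is_positive_mention(text: str) -> bool:
    """True when the text is non-empty and contains no negative phrase."""
    lowered = text.lower().strip()
    return bool(lowered) and not any(p in lowered for p in NEGATIVE_PHRASES)


def matched_keywords(regulations: dict) -> set:
    """Return the keywords whose key matched and whose value is a positive mention."""
    found = set()
    for key, value in regulations.items():
        key_lower = key.lower()
        for keyword in REGULATORY_KEYWORDS:
            if keyword in key_lower and is_positive_mention(str(value)):
                found.add(keyword)
    return found
```

Because matching is by substring, a short keyword such as "tri" can also match unrelated key names, so results from this kind of scan are best treated as hints rather than authoritative listings.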

2.3 "Not Found" Handling

If neither PubChem nor SDS Section 15 returns data, the CAS number is still inserted into chemiq_regulatory_lists with all flags set to false and a note: "Not found in PubChem or SDS Section 15". This prevents re-fetching on every page load.


3. Database Tables

3.1 chemiq_regulatory_lists — Boolean Regulatory Flags

One row per CAS number (globally, not per company). Keyed by cas_number (unique).

| Column | Type | Description |
| --- | --- | --- |
| cas_number | String(20) | CAS registry number (unique key) |
| chemical_name | String(255) | Common name |
| is_osha_pel_listed | Boolean | Has an OSHA Permissible Exposure Limit |
| has_osha_specific_standard | Boolean | Has a substance-specific OSHA standard |
| osha_standard_citation | String(50) | e.g., "1910.1028" (benzene) |
| is_niosh_rel_listed | Boolean | Has a NIOSH Recommended Exposure Limit |
| is_acgih_tlv_listed | Boolean | Has an ACGIH Threshold Limit Value |
| is_epa_sara_313 | Boolean | On EPA Toxics Release Inventory list |
| sara_313_threshold_lbs | Integer | TRI reporting threshold (default 10,000 lbs) |
| sara_313_category | String(50) | manufactured, processed, or otherwise_used |
| sara_313_pbt | Boolean | Persistent Bioaccumulative Toxic substance |
| is_epa_cercla | Boolean | Has a CERCLA Reportable Quantity |
| cercla_rq_lbs | Integer | Reportable quantity in pounds |
| is_california_prop65 | Boolean | Listed under California Proposition 65 |
| prop65_type | String(20) | "cancer", "reproductive", or "both" |
| prop65_listing_date | Date | When listed |
| prop65_nsrl_ug | Numeric | No Significant Risk Level (cancer, µg/day) |
| prop65_madl_ug | Numeric | Maximum Allowable Dose Level (reproductive, µg/day) |
| is_carcinogen | Boolean | Classified as carcinogenic |
| carcinogen_source | String(20) | "IARC", "NTP", "OSHA", "ACGIH", or "CA_Prop65" |
| carcinogen_classification | String(20) | "Group 1", "Group 2A", "Group 2B", "K", "R", "A1", "A2" |
| last_verified_at | DateTime | When data was last fetched/verified |
| source_urls | JSONB | Array of source URLs (e.g., PubChem compound page) |

3.2 chemiq_pubchem_cache — Numeric Exposure Limits

One row per CAS number. Stores chemical identity, physical properties, and numeric exposure limit values.

| Column | Type | Description |
| --- | --- | --- |
| cas_number | String(20) | CAS registry number (unique key) |
| osha_pel_ppm | Numeric | OSHA PEL in parts per million |
| osha_pel_mg_m3 | Numeric | OSHA PEL in mg/m³ |
| niosh_rel_ppm | Numeric | NIOSH REL in ppm |
| niosh_rel_mg_m3 | Numeric | NIOSH REL in mg/m³ |
| niosh_idlh_ppm | Numeric | NIOSH IDLH in ppm (Immediately Dangerous to Life or Health) |
| acgih_tlv_ppm | Numeric | ACGIH TLV-TWA in ppm |

4. CAS Linking Page — How Data Is Queried

The CASLinkingService.get_cas_linking() method executes a single SQL query that:

  1. Starts from chemiq_sds_composition: all CAS numbers extracted from SDS documents
  2. JOINs chemiq_sds_documents → chemiq_company_product_catalog → chemiq_inventory: restricts results to CAS numbers in the company's active inventory
  3. LEFT JOINs chemiq_regulatory_lists: supplies the regulatory boolean flags
  4. LEFT JOINs chemiq_pubchem_cache: supplies the numeric exposure limits
  5. LEFT JOINs core_data_sites: supplies site names for product details
  6. GROUPs BY cas_number: aggregates product counts and product details per CAS

Key distinction:

  • A CAS number is "linked" if it has a row in chemiq_regulatory_lists (even if all flags are false)
  • A CAS number is "flagged" if it is linked AND at least one regulatory flag is true
  • A CAS number is "unlinked" if it has no row in chemiq_regulatory_lists (enrichment hasn't run yet)

5. Column-by-Column Determination Logic

5.1 Table Columns

| Column | Source | Determination |
| --- | --- | --- |
| Chemical Name | chemiq_sds_composition.chemical_name | MAX(chemical_name) across all SDS documents |
| CAS Number | chemiq_sds_composition.cas_number | Grouped by CAS |
| Products | chemiq_inventory | COUNT(DISTINCT chemical_id): number of inventory items containing this CAS |
| OSHA | chemiq_regulatory_lists.is_osha_pel_listed | See Section 6.1 |
| NIOSH | chemiq_regulatory_lists.is_niosh_rel_listed | See Section 6.2 |
| TRI | chemiq_regulatory_lists.is_epa_sara_313 | See Section 6.3 |
| CERCLA | chemiq_regulatory_lists.is_epa_cercla | See Section 6.4 |
| Prop 65 | chemiq_regulatory_lists.is_california_prop65 | See Section 6.5 |
| Carcinogen | chemiq_regulatory_lists.is_carcinogen | See Section 6.6 |

5.2 UI Icons

| State | Icon | Meaning |
| --- | --- | --- |
| Unlinked (no regulatory data) | Gray dash (—) | CAS not yet enriched |
| Linked, flag is false | Green checkmark | Chemical is NOT on this list |
| Linked, flag is true | Red warning triangle | Chemical IS on this list; action may be required |
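
The three icon states reduce to a small decision on whether a regulatory row exists and whether the given flag is set. The sketch below expresses that rule in Python for consistency with the backend examples in this document; the actual rendering lives in the React page, and the function name and state labels are hypothetical.

```python
from typing import Optional


def icon_state(regulatory_row: Optional[dict], flag_column: str) -> str:
    """Map a CAS number's regulatory row and one flag column to an icon state."""
    if regulatory_row is None:
        return "gray_dash"        # unlinked: enrichment has not run yet
    if regulatory_row.get(flag_column):
        return "red_warning"      # linked and on this list
    return "green_check"          # linked and not on this list
```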

6. Regulatory Field Determination — Detailed Logic

6.1 OSHA (Permissible Exposure Limit)

What it means: The chemical has a legally enforceable workplace air concentration limit set by the Occupational Safety and Health Administration.

How it's determined:

Primary (PubChem PUG View):
1. Fetch "Regulatory Information" section
→ Scan text blocks for "osha" AND ("pel" OR "permissible exposure")
→ If found: is_osha_pel_listed = true
→ Also scan for standard citation pattern "1910.\d+"
→ If found: has_osha_specific_standard = true, osha_standard_citation = match

2. Fetch "Safety and Hazards" section
→ Scan for "osha" AND ("pel" OR "permissible")
→ If found: is_osha_pel_listed = true

3. Fetch "Toxicity" section
→ Scan for "osha" AND ("pel" OR "permissible")
→ If found: is_osha_pel_listed = true

4. Fetch "OSHA Standards" heading
→ Parse text for TWA values: regex "(\d+\.?\d*)\s*ppm" and "(\d+\.?\d*)\s*mg/m"
→ Store as osha_pel_ppm and osha_pel_mg_m3 in chemiq_pubchem_cache
→ Also sets is_osha_pel_listed = true

Fallback (SDS Section 15):
→ Scan regulatory keys for "osha"
→ Check value text is not a negation (e.g., "not regulated")
→ If positive mention: is_osha_pel_listed = true
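
The keyword test and the two TWA regexes described above can be sketched as follows. The function names are hypothetical, and the case-insensitive flag is an assumption; the document does not say how the real client handles case.

```python
import re
from typing import Optional, Tuple

# Patterns as documented above; IGNORECASE is an assumption of this sketch.
PPM_RE = re.compile(r"(\d+\.?\d*)\s*ppm", re.IGNORECASE)
MG_M3_RE = re.compile(r"(\d+\.?\d*)\s*mg/m", re.IGNORECASE)


def mentions_osha_pel(text: str) -> bool:
    """Keyword test: "osha" AND ("pel" OR "permissible exposure")."""
    lowered = text.lower()
    return "osha" in lowered and ("pel" in lowered or "permissible exposure" in lowered)


def parse_twa(text: str) -> Tuple[Optional[float], Optional[float]]:
    """Return (ppm, mg/m3) taken from the first match of each pattern."""
    ppm = PPM_RE.search(text)
    mg = MG_M3_RE.search(text)
    return (
        float(ppm.group(1)) if ppm else None,
        float(mg.group(1)) if mg else None,
    )
```

Two properties of this approach are worth keeping in mind: only the first number before each unit is captured, so text that lists a TWA and a STEL together yields whichever appears first, and the bare substring "pel" also occurs inside unrelated words.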

6.2 NIOSH (Recommended Exposure Limit)

What it means: NIOSH has set a more protective (recommended, not legally binding) workplace exposure limit for this chemical.

How it's determined:

Primary (PubChem PUG View):
1. "Regulatory Information" section
→ Scan combined text for "niosh" AND ("rel" OR "recommended exposure")

2. "Toxicity" section
→ Scan for "niosh" AND ("rel" OR "recommended")

3. "NIOSH Recommendations" heading
→ Parse TWA values: regex for ppm and mg/m³
→ Store as niosh_rel_ppm and niosh_rel_mg_m3 in chemiq_pubchem_cache

4. "Immediately Dangerous to Life or Health" heading
→ Parse IDLH values: regex "(\d+\.?\d*)\s*ppm"
→ Store as niosh_idlh_ppm in chemiq_pubchem_cache

Fallback (SDS Section 15):
→ Scan regulatory keys/values for "niosh" AND "rel"
→ Verify positive mention

6.3 TRI (EPA Toxics Release Inventory / SARA 313)

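What it means: A facility in a covered industry sector with 10 or more full-time-equivalent employees must file an annual Toxics Release Inventory report with the EPA if it manufactures or processes more than 25,000 lbs/year of this chemical, or otherwise uses more than 10,000 lbs/year. The system stores a single default threshold of 10,000 lbs.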

How it's determined:

Primary (PubChem PUG View):
"Regulatory Information" section
→ Scan text blocks for "sara 313" OR "tri " OR "toxic release inventory"
OR "toxics release inventory"
→ If found: is_epa_sara_313 = true
→ sara_313_threshold_lbs defaults to 10,000

Fallback (SDS Section 15):
→ Scan regulatory keys for "sara" or "tri"
→ Verify positive mention

6.4 CERCLA (Comprehensive Environmental Response, Compensation, and Liability Act)

What it means: If you release this chemical into the environment at or above its Reportable Quantity (RQ), you must notify the National Response Center within 24 hours.

How it's determined:

Primary (PubChem PUG View):
"Regulatory Information" section
→ Scan text blocks for "cercla" OR "reportable quantity"
→ If found: is_epa_cercla = true
→ Extract RQ value: regex "(\d+)\s*(?:lb|pound)" → cercla_rq_lbs

Fallback (SDS Section 15):
→ Scan regulatory keys for "cercla"
→ Verify positive mention
→ Extract RQ: regex "(\d+)\s*(?:lb|pound)" → cercla_rq_lbs
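
The RQ extraction can be sketched directly from the documented regex. One caveat that follows from the pattern as written: it has no handling for thousands separators, so "1,000 lb" captures only the digits after the comma. Whether the real client normalizes the text first is not stated here; the comma-tolerant variant below is a hypothetical alternative, not the implemented behavior.

```python
import re
from typing import Optional

# Pattern as documented above (IGNORECASE is an assumption of this sketch).
RQ_RE = re.compile(r"(\d+)\s*(?:lb|pound)", re.IGNORECASE)

# Hypothetical variant that also accepts comma-grouped numbers such as "1,000".
RQ_WITH_COMMAS_RE = re.compile(
    r"(\d{1,3}(?:,\d{3})+|\d+)\s*(?:lb|pound)", re.IGNORECASE
)


def extract_rq_lbs(text: str, pattern: re.Pattern = RQ_RE) -> Optional[int]:
    """Return the first reportable quantity in pounds, or None if absent."""
    match = pattern.search(text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))
```

With the documented pattern, `extract_rq_lbs("RQ 1,000 lb")` returns 0 because only "000" is captured; with `RQ_WITH_COMMAS_RE` the same text returns 1000.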

6.5 Prop 65 (California Proposition 65)

What it means: California's Safe Drinking Water and Toxic Enforcement Act requires warnings for chemicals known to the state to cause cancer or reproductive harm. Selling products in California containing listed chemicals requires consumer-facing warnings.

How it's determined:

Primary (PubChem PUG View):
"Regulatory Information" section
→ Scan text blocks for "proposition 65" OR "prop 65" OR "prop65"
→ If found: is_california_prop65 = true
→ Determine type:
     "cancer" AND "reproductive" in text → prop65_type = "both"
     "cancer" in text → prop65_type = "cancer"
     "reproductive" OR "developmental" in text → prop65_type = "reproductive"
     else → prop65_type = "cancer" (default)

Fallback (SDS Section 15):
→ Scan regulatory keys for "prop" AND "65"
→ Verify positive mention
→ Same type determination logic
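
The listing check and type determination above translate into an ordered chain of substring tests. This is a sketch with hypothetical function names that follows the rule order exactly as written.

```python
def mentions_prop65(text: str) -> bool:
    """True when any of the documented Prop 65 keywords appears in the text."""
    lowered = text.lower()
    return any(k in lowered for k in ("proposition 65", "prop 65", "prop65"))


def prop65_type(text: str) -> str:
    """Apply the documented rules in order; the first matching rule wins."""
    lowered = text.lower()
    if "cancer" in lowered and "reproductive" in lowered:
        return "both"
    if "cancer" in lowered:
        return "cancer"
    if "reproductive" in lowered or "developmental" in lowered:
        return "reproductive"
    return "cancer"  # documented default when no keyword is present
```

Because the "both" rule tests only for "reproductive", text that mentions "cancer" together with "developmental" resolves to "cancer" under these rules.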

6.6 Carcinogen

What it means: One or more authoritative bodies have classified this chemical as causing or potentially causing cancer in humans.

How it's determined:

Primary (PubChem PUG View):
"Toxicity" section — scans text blocks for:

IARC (International Agency for Research on Cancer):
"iarc" AND "group 1" (AND NOT "not") → Group 1 (confirmed carcinogen)
"iarc" AND "group 2a" → Group 2A (probably carcinogenic)
"iarc" AND "group 2b" → Group 2B (possibly carcinogenic)

NTP (National Toxicology Program):
"ntp" AND "known" AND "carcinogen" → Classification "K" (Known)
"ntp" AND "reasonably anticipated" → Classification "R" (Reasonably Anticipated)

ACGIH:
"acgih" AND "a1" AND "carcinogen" → Classification "A1" (Confirmed human carcinogen)
"acgih" AND "a2" AND "carcinogen" → Classification "A2" (Suspected human carcinogen)

Priority: If multiple sources classify a chemical, the FIRST match
sets carcinogen_source and carcinogen_classification. Later matches
only fill in if carcinogen_source was not already set.

Fallback (SDS Section 15):
→ Carcinogen classification is NOT extracted from SDS Section 15
(only OSHA/NIOSH/ACGIH/TRI/CERCLA/Prop 65 are parsed from it)

Carcinogen classification codes:

| Source | Classification | Meaning |
| --- | --- | --- |
| IARC | Group 1 | Carcinogenic to humans (sufficient evidence) |
| IARC | Group 2A | Probably carcinogenic (limited human evidence, sufficient animal evidence) |
| IARC | Group 2B | Possibly carcinogenic (limited evidence in humans and animals) |
| NTP | K (Known) | Known to be a human carcinogen |
| NTP | R (Reasonably Anticipated) | Reasonably anticipated to be a human carcinogen |
| ACGIH | A1 | Confirmed human carcinogen |
| ACGIH | A2 | Suspected human carcinogen |
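
The keyword rules and the first-match priority described in this section can be sketched as below. Two points are assumptions of this example rather than documented behavior: that checks run in the order IARC, NTP, ACGIH within each text block, and that the Group 1 exclusion means the substring "not" is absent from the block. The function names are hypothetical.

```python
from typing import Iterable, Optional, Tuple

Classification = Tuple[str, str]  # (carcinogen_source, carcinogen_classification)


def classify_block(text: str) -> Optional[Classification]:
    """Apply the documented keyword rules to one Toxicity text block."""
    t = text.lower()
    if "iarc" in t:
        if "group 1" in t and "not" not in t:
            return ("IARC", "Group 1")
        if "group 2a" in t:
            return ("IARC", "Group 2A")
        if "group 2b" in t:
            return ("IARC", "Group 2B")
    if "ntp" in t:
        if "known" in t and "carcinogen" in t:
            return ("NTP", "K")
        if "reasonably anticipated" in t:
            return ("NTP", "R")
    if "acgih" in t and "carcinogen" in t:
        if "a1" in t:
            return ("ACGIH", "A1")
        if "a2" in t:
            return ("ACGIH", "A2")
    return None


def first_classification(blocks: Iterable[str]) -> Optional[Classification]:
    """Return the first classification found; later blocks cannot override it."""
    for block in blocks:
        result = classify_block(block)
        if result is not None:
            return result
    return None
```

Under this reading, is_carcinogen would be true whenever `first_classification` returns a value, and a chemical classified by several bodies keeps only the earliest match.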

7. Expanded Detail View — Additional Fields

When a user clicks a row, an expanded section shows:

7.1 Regulatory Details (What This Means for Your Business)

Each true flag gets a plain-language explanation:

| Flag | Label | Business Impact |
| --- | --- | --- |
| is_osha_pel | OSHA Permissible Exposure Limit | Must keep air levels below legal limit; ventilation, monitoring, respirators required |
| is_niosh_rel (when OSHA not listed) | NIOSH Recommended Exposure Limit | More protective guideline; not legally required but reduces liability |
| is_epa_tri | EPA Toxics Release Inventory (SARA 313) | Annual TRI report required if 10+ employees and use above threshold |
| is_epa_cercla | CERCLA Reportable Quantity | Must call National Response Center (1-800-424-8802) within 24 hours of release >= RQ |
| is_california_prop65 | California Proposition 65 | Warning labels required on products sold in California; up to $2,500/violation/day |
| is_carcinogen | Carcinogen Classification | Must inform workers, minimize exposure, special handling, medical surveillance |

7.2 Exposure Limits (How Much Is Safe?)

Numeric values from chemiq_pubchem_cache, displayed as 8-hour TWA concentrations:

| Limit | Source Column | Description |
| --- | --- | --- |
| OSHA PEL | osha_pel_ppm / osha_pel_mg_m3 | Legally enforceable maximum; OSHA violations and fines if exceeded |
| NIOSH REL | niosh_rel_ppm / niosh_rel_mg_m3 | More protective recommended limit |
| NIOSH IDLH | niosh_idlh_ppm | Emergency threshold: workers must evacuate or use supplied-air respirators |
| ACGIH TLV | acgih_tlv_ppm | Industry best-practice guideline, not legally binding |

7.3 Products Containing This Chemical

Aggregated from the SQL query's json_agg() — shows product name, manufacturer, and site name for each inventory item containing this CAS.


8. Summary Cards

| Card | Value | Calculation |
| --- | --- | --- |
| Unique Chemicals | Total unique CAS numbers | COUNT(*) from query results (all CAS in inventory) |
| Linked to Registry | CAS numbers with regulatory data | Count where chemiq_regulatory_lists row exists |
| Unlinked | CAS numbers without regulatory data | total - linked_count |
| Regulatory Flags | CAS numbers on at least one list | Count where linked AND any of: is_osha_pel, is_niosh_rel, is_acgih_tlv, is_epa_tri, is_epa_cercla, is_california_prop65, is_carcinogen is true |
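
The four card values can be derived from the per-CAS query results in one pass. The sketch below assumes each result is a dict carrying a hypothetical `is_linked` key plus the flag names listed in the table; the function name and output keys are also assumptions.

```python
from typing import Dict, Iterable

FLAG_FIELDS = (
    "is_osha_pel", "is_niosh_rel", "is_acgih_tlv", "is_epa_tri",
    "is_epa_cercla", "is_california_prop65", "is_carcinogen",
)


def summary_cards(items: Iterable[dict]) -> Dict[str, int]:
    """Compute the four summary card values from per-CAS result rows."""
    rows = list(items)
    total = len(rows)
    linked = sum(1 for r in rows if r.get("is_linked"))
    flagged = sum(
        1 for r in rows
        if r.get("is_linked") and any(r.get(f) for f in FLAG_FIELDS)
    )
    return {
        "unique_chemicals": total,
        "linked": linked,
        "unlinked": total - linked,
        "regulatory_flags": flagged,
    }
```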

9. Filtering

Regulatory Filter

The dropdown filter applies post-query in Python:

| Filter Value | Matches |
| --- | --- |
| osha | is_osha_pel = true |
| niosh | is_niosh_rel = true |
| acgih | is_acgih_tlv = true |
| tri | is_epa_tri = true |
| cercla | is_epa_cercla = true |
| prop65 | is_california_prop65 = true |
| carcinogen | is_carcinogen = true |

Unlinked CAS numbers are excluded when a regulatory filter is active.
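
A post-query filter of this kind amounts to a lookup from filter value to flag name followed by a list comprehension. The sketch below assumes result rows are dicts with a hypothetical `is_linked` key; it is an illustration of the approach, not the code in CASLinkingService.

```python
from typing import List, Optional

FILTER_TO_FLAG = {
    "osha": "is_osha_pel",
    "niosh": "is_niosh_rel",
    "acgih": "is_acgih_tlv",
    "tri": "is_epa_tri",
    "cercla": "is_epa_cercla",
    "prop65": "is_california_prop65",
    "carcinogen": "is_carcinogen",
}


def apply_regulatory_filter(items: List[dict], filter_value: Optional[str]) -> List[dict]:
    """Keep linked rows whose flag for the selected filter is true."""
    if not filter_value:
        return list(items)  # no filter active: unlinked rows stay visible
    flag = FILTER_TO_FLAG[filter_value]
    return [r for r in items if r.get("is_linked") and r.get(flag)]
```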

Text search matches on CAS number or chemical name (case-insensitive ILIKE).


10. Key Files

| Layer | File | Purpose |
| --- | --- | --- |
| Frontend Page | tellus-ehs-hazcom-ui/src/pages/chemiq/cas-linking/index.tsx | React page with table, filters, expanded detail view |
| Frontend Types | tellus-ehs-hazcom-ui/src/types/index.ts (lines 781–838) | CASRegulatoryStatus, CASExposureLimits, CASLinkingItem types |
| Frontend API Client | tellus-ehs-hazcom-ui/src/services/api/chemiq.api.ts | getCASLinking() function |
| API Endpoint | tellus-ehs-hazcom-service/app/api/v1/chemiq/inventory.py (line 449) | GET /api/v1/chemiq/cas-linking |
| Query Service | tellus-ehs-hazcom-service/app/services/chemiq/cas_linking_service.py | CASLinkingService: SQL query, filtering, pagination |
| Pydantic Schemas | tellus-ehs-hazcom-service/app/schemas/chemiq/cas_linking.py | Request/response validation |
| DB Model: Regulatory | tellus-ehs-hazcom-service/app/db/models/compliance.py | RegulatoryList (chemiq_regulatory_lists) |
| DB Model: Cache | tellus-ehs-hazcom-service/app/db/models/chemical_enrichment.py | PubChemCache (chemiq_pubchem_cache) |
| Enrichment Service | tellus-ehs-background-service/app/services/regulatory_enrichment/service.py | RegulatoryEnrichmentService: orchestrates fetch and upsert |
| PubChem Client | tellus-ehs-background-service/app/services/regulatory_enrichment/pubchem_regulatory_client.py | PubChemRegulatoryClient: API calls, text parsing |
| SDS Section 15 Client | tellus-ehs-background-service/app/services/regulatory_enrichment/sds_section15_client.py | SDSSection15Client: fallback parser |

11. Database Tables

| Table | Purpose |
| --- | --- |
| chemiq_regulatory_lists | Boolean regulatory flags per CAS number (global, not per-company) |
| chemiq_pubchem_cache | Numeric exposure limits and chemical identity per CAS |
| chemiq_sds_composition | CAS numbers extracted from parsed SDS documents |
| chemiq_sds_sections | Parsed SDS sections (Section 15 used as fallback for regulatory data) |
| chemiq_company_compliance_settings | Company-specific settings (ships_to_california, employee_count, etc.) |
| chemiq_compliance_results | Per-item compliance check results (not used by CAS linking page) |

12. Data Enrichment Pipeline

1. SDS uploaded and parsed
→ CAS numbers extracted to chemiq_sds_composition

2. Background job detects new CAS numbers
→ RegulatoryEnrichmentService.enrich_cas(cas_number)

3. Check chemiq_regulatory_lists
→ If exists: skip (already enriched)
→ If not: proceed to fetch

4. PubChem PUG View API
→ Resolve CAS → CID
→ Fetch Regulatory Information, Safety and Hazards, Toxicity sections
→ Fetch OSHA Standards, NIOSH Recommendations, IDLH, TLV headings
→ Parse text blocks with keyword/regex matching
→ Extract boolean flags + numeric exposure values

5. If PubChem returns nothing:
→ SDSSection15Client reads from chemiq_sds_sections
→ Parses safety_health_environmental_regulations JSON
→ Extracts boolean flags (no numeric values)

6. If neither source has data:
→ Insert row with all flags = false
→ Note: "Not found in PubChem or SDS Section 15"

7. Results stored:
→ Boolean flags → chemiq_regulatory_lists (INSERT)
→ Numeric exposure limits → chemiq_pubchem_cache (UPDATE existing row)
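
The control flow of steps 3 through 7 can be summarized as a skip check, a primary fetch, a fallback fetch, and a "not found" row. The sketch below uses an in-memory dict in place of chemiq_regulatory_lists and takes the two fetchers as injected callables; these stand-ins, the function signature, and the row layout are all assumptions, and numeric exposure limits are omitted.

```python
from typing import Callable, Dict, Optional

Flags = Dict[str, bool]
Fetcher = Callable[[str], Optional[Flags]]

NOT_FOUND_NOTE = "Not found in PubChem or SDS Section 15"

ALL_FLAGS = (
    "is_osha_pel_listed", "is_niosh_rel_listed", "is_acgih_tlv_listed",
    "is_epa_sara_313", "is_epa_cercla", "is_california_prop65", "is_carcinogen",
)


def enrich_cas(cas_number: str, store: dict,
               fetch_pubchem: Fetcher, fetch_section15: Fetcher) -> dict:
    """Enrich one CAS number unless a row for it already exists in `store`."""
    if cas_number in store:
        return store[cas_number]  # already enriched: skip all fetching

    flags, source = fetch_pubchem(cas_number), "pubchem"
    if not flags:
        flags, source = fetch_section15(cas_number), "sds_section_15"

    if flags:
        row = {"flags": {**dict.fromkeys(ALL_FLAGS, False), **flags},
               "source": source, "note": None}
    else:
        # Neither source had data: store an all-false row so it is not re-fetched.
        row = {"flags": dict.fromkeys(ALL_FLAGS, False),
               "source": None, "note": NOT_FOUND_NOTE}

    store[cas_number] = row
    return row
```

Storing the all-false row is what keeps a CAS number with no data from being fetched again on later runs.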

Backfill Job

The get_unenriched_cas_numbers() method finds CAS numbers in chemiq_sds_composition that have no corresponding row in chemiq_regulatory_lists. It validates CAS format with the regex ^\d{1,7}-\d{2}-\d$ before returning candidates. This enables batch enrichment of chemicals that were missed during initial SDS processing.
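
The format check can be illustrated with the regex quoted above. This sketch filters a list of candidate strings; the function name is hypothetical, and the pattern validates shape only, not the CAS check digit.

```python
import re
from typing import Iterable, List

# Regex as documented above: 1-7 digits, 2 digits, 1 check digit.
CAS_FORMAT = re.compile(r"^\d{1,7}-\d{2}-\d$")


def valid_cas_candidates(values: Iterable[str]) -> List[str]:
    """Keep only strings shaped like a CAS registry number."""
    return [v for v in values if CAS_FORMAT.match(v)]
```

For example, placeholder entries such as "N/A" or "Proprietary" that sometimes appear in SDS composition tables are dropped, while "71-43-2" and "64742-47-8" pass.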


13. Current Limitations

  1. First-match carcinogen classification: If multiple sources classify a chemical (e.g., IARC Group 1 AND NTP Known), only the first match sets carcinogen_source and carcinogen_classification. Later sources only fill in if the field was still empty.
  2. No carcinogen classification from SDS Section 15: The SDS Section 15 fallback extracts only boolean listing flags. It does not produce carcinogen classifications (including ACGIH A1/A2) or numeric exposure values.
  3. Exposure limit parsing is heuristic: The regex-based extraction from PubChem text blocks may miss non-standard formats or extract incorrect values from complex multi-value text.
  4. No automatic refresh: Once a CAS number is enriched, the data is not automatically re-fetched. The last_verified_at timestamp tracks when data was fetched, but no scheduled refresh exists.
  5. Company compliance settings unused: chemiq_company_compliance_settings exists in the model but is not yet used by the CAS linking page to filter applicability (e.g., hiding Prop 65 for companies that don't sell in California).
  6. Fixed TRI threshold: The threshold defaults to 10,000 lbs for all chemicals. PBT (Persistent Bioaccumulative Toxic) chemicals have lower thresholds (100 lbs), but this is not yet dynamically determined.
  7. Regulatory filter is post-query: The filter is applied in Python after fetching all results, not in SQL. This works for moderate data sizes but may be inefficient for companies with thousands of CAS numbers.