CAS & Regulatory Linking — Determination Logic
This document explains how every field on the /chemiq/cas-linking screen is determined, including how the OSHA, NIOSH, TRI, CERCLA, Prop 65, and Carcinogen columns are populated, along with the exposure limit values shown in the expanded detail view.
1. High-Level Architecture
SDS Upload / Parse
│
▼
chemiq_sds_composition (CAS numbers extracted from SDS Section 3)
│
▼
Background Service (RegulatoryEnrichmentService)
│
├─ Primary: PubChem PUG View API (regulatory sections)
└─ Fallback: SDS Section 15 parser (parsed regulatory text)
│
▼
chemiq_regulatory_lists (boolean flags per CAS)
chemiq_pubchem_cache (numeric exposure limits per CAS)
│
▼
CAS Linking Page reads via CASLinkingService SQL query
Trigger: Regulatory data is enriched automatically when a new SDS is processed and new CAS numbers are detected. The background service (RegulatoryEnrichmentService) checks if each CAS number already exists in chemiq_regulatory_lists — if not, it fetches data from external sources and inserts it.
2. Data Sources
2.1 Primary: PubChem PUG View API
The PubChemRegulatoryClient fetches data from the NIH PubChem compound database by querying three PUG View headings:
| PUG View Heading | What It Provides |
|---|---|
| Regulatory Information | CERCLA RQ, TRI/SARA 313, Prop 65, OSHA PEL mentions, NIOSH REL mentions, ACGIH TLV mentions |
| Safety and Hazards | OSHA PEL confirmations, GHS data |
| Toxicity | IARC carcinogen classification, NTP carcinogen classification, ACGIH A1/A2 carcinogen classification, NIOSH/OSHA indicators |
Additionally, numeric exposure limit values are fetched from specific headings:
| PUG View Heading | Values Extracted |
|---|---|
| OSHA Standards | PEL in ppm and mg/m³ (from TWA text) |
| NIOSH Recommendations | REL in ppm and mg/m³ (from TWA text) |
| Immediately Dangerous to Life or Health | IDLH in ppm |
| Threshold Limit Values | ACGIH TLV in ppm |
Process:
- CAS number → PubChem CID resolution via PUG REST /compound/name/{cas}/cids/JSON
- CID → PUG View sections queried for regulatory text blocks
- Text blocks are parsed with keyword and regex matching to extract boolean flags and numeric values
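The CAS → CID step can be sketched as below. This is a minimal illustration, not the PubChemRegulatoryClient code: the function names are hypothetical, and the response shape (`IdentifierList.CID`) is the JSON that PUG REST returns for CID lookups. No network call is made here; the payload is passed in.

```python
from typing import Optional
from urllib.parse import quote

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_lookup_url(cas_number: str) -> str:
    """Build the PUG REST name-lookup URL for a CAS number (hypothetical helper)."""
    return f"{PUG_REST_BASE}/compound/name/{quote(cas_number)}/cids/JSON"


def first_cid(payload: dict) -> Optional[int]:
    """Return the first CID in a PUG REST CID-list response, or None if absent."""
    cids = payload.get("IdentifierList", {}).get("CID", [])
    return cids[0] if cids else None


url = cid_lookup_url("71-43-2")
cid = first_cid({"IdentifierList": {"CID": [241]}})
```

A lookup that yields no CID (`first_cid` returning `None`) is the case where the Section 15 fallback in 2.2 would apply.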
Rate limit: 3 requests/second with retry (3 attempts, exponential backoff).
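The retry behavior might look like the following sketch. The helper name, the 1-second base delay, and the doubling factor are assumptions; the source states only "3 attempts, exponential backoff". The `sleep` parameter is injectable so the logic can be exercised without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,  # assumed; the real delay is not documented
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise RuntimeError("attempts must be at least 1")
```

Staying under 3 requests/second would additionally need a throttle between calls, which this sketch does not cover.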
2.2 Fallback: SDS Section 15 Parser
When PubChem returns no data (common for UVCB substances and petroleum mixtures), the SDSSection15Client reads from the parsed SDS Section 15 data already stored in chemiq_sds_sections.
It parses the safety_health_environmental_regulations JSON key from the section data, scanning for:
- Key names containing "sara", "tri", "cercla", "prop 65", "osha", "niosh", "acgih"
- Text values are checked against negative phrases ("does not contain", "not subject", "not regulated", "not listed", etc.) to avoid false positives
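A rough sketch of this scan, assuming the section data is a flat dict of key → text. The function names are hypothetical, and the negative-phrase list contains only the four phrases quoted above (the real list is longer, per the "etc.").

```python
KEYWORDS = ("sara", "tri", "cercla", "prop 65", "osha", "niosh", "acgih")
NEGATIVE_PHRASES = ("does not contain", "not subject", "not regulated", "not listed")


def is_positive_mention(text: str) -> bool:
    """True if the text is non-empty and contains none of the negative phrases."""
    lowered = text.lower().strip()
    return bool(lowered) and not any(p in lowered for p in NEGATIVE_PHRASES)


def matched_keywords(section: dict) -> set:
    """Collect keywords found in key names whose values are positive mentions."""
    found = set()
    for key, value in section.items():
        if not is_positive_mention(str(value)):
            continue
        key_lower = key.lower()
        found.update(k for k in KEYWORDS if k in key_lower)
    return found


hits = matched_keywords({
    "SARA 313": "Contains a chemical subject to reporting",
    "CERCLA": "Not listed",
    "OSHA": "Listed",
})
```

Because matching is by substring, a short keyword such as "tri" would also match unrelated key names (for example "distribution"), so a production parser likely needs stricter matching than this sketch.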
2.3 "Not Found" Handling
If neither PubChem nor SDS Section 15 returns data, the CAS number is still inserted into chemiq_regulatory_lists with all flags set to false and a note: "Not found in PubChem or SDS Section 15". This prevents re-fetching on every page load.
3. Database Tables
3.1 chemiq_regulatory_lists — Boolean Regulatory Flags
One row per CAS number (globally, not per company). Keyed by cas_number (unique).
| Column | Type | Description |
|---|---|---|
| cas_number | String(20) | CAS registry number (unique key) |
| chemical_name | String(255) | Common name |
| is_osha_pel_listed | Boolean | Has an OSHA Permissible Exposure Limit |
| has_osha_specific_standard | Boolean | Has a substance-specific OSHA standard |
| osha_standard_citation | String(50) | e.g., "1910.1028" (benzene) |
| is_niosh_rel_listed | Boolean | Has a NIOSH Recommended Exposure Limit |
| is_acgih_tlv_listed | Boolean | Has an ACGIH Threshold Limit Value |
| is_epa_sara_313 | Boolean | On EPA Toxics Release Inventory list |
| sara_313_threshold_lbs | Integer | TRI reporting threshold (default 10,000 lbs) |
| sara_313_category | String(50) | manufactured, processed, or otherwise_used |
| sara_313_pbt | Boolean | Persistent Bioaccumulative Toxic substance |
| is_epa_cercla | Boolean | Has a CERCLA Reportable Quantity |
| cercla_rq_lbs | Integer | Reportable quantity in pounds |
| is_california_prop65 | Boolean | Listed under California Proposition 65 |
| prop65_type | String(20) | "cancer", "reproductive", or "both" |
| prop65_listing_date | Date | When listed |
| prop65_nsrl_ug | Numeric | No Significant Risk Level (cancer, µg/day) |
| prop65_madl_ug | Numeric | Maximum Allowable Dose Level (reproductive, µg/day) |
| is_carcinogen | Boolean | Classified as carcinogenic |
| carcinogen_source | String(20) | "IARC", "NTP", "OSHA", "ACGIH", or "CA_Prop65" |
| carcinogen_classification | String(20) | "Group 1", "Group 2A", "Group 2B", "K", "R", "A1", "A2" |
| last_verified_at | DateTime | When data was last fetched/verified |
| source_urls | JSONB | Array of source URLs (e.g., PubChem compound page) |
3.2 chemiq_pubchem_cache — Numeric Exposure Limits
One row per CAS number. Stores chemical identity, physical properties, and numeric exposure limit values.
| Column | Type | Description |
|---|---|---|
| cas_number | String(20) | CAS registry number (unique key) |
| osha_pel_ppm | Numeric | OSHA PEL in parts per million |
| osha_pel_mg_m3 | Numeric | OSHA PEL in mg/m³ |
| niosh_rel_ppm | Numeric | NIOSH REL in ppm |
| niosh_rel_mg_m3 | Numeric | NIOSH REL in mg/m³ |
| niosh_idlh_ppm | Numeric | NIOSH IDLH in ppm (Immediately Dangerous to Life or Health) |
| acgih_tlv_ppm | Numeric | ACGIH TLV-TWA in ppm |
4. CAS Linking Page — How Data Is Queried
The CASLinkingService.get_cas_linking() method executes a single SQL query that:
- Starts from chemiq_sds_composition: all CAS numbers extracted from SDS documents
- JOINs chemiq_sds_documents → chemiq_company_product_catalog → chemiq_inventory to filter to CAS numbers that are in the company's active inventory
- LEFT JOINs chemiq_regulatory_lists to get regulatory boolean flags
- LEFT JOINs chemiq_pubchem_cache to get numeric exposure limits
- LEFT JOINs core_data_sites to get site names for product details
- GROUPs BY cas_number, aggregating product counts and product details per CAS
Key distinction:
- A CAS number is "linked" if it has a row in chemiq_regulatory_lists (even if all flags are false)
- A CAS number is "flagged" if it is linked AND at least one regulatory flag is true
- A CAS number is "unlinked" if it has no row in chemiq_regulatory_lists (enrichment hasn't run yet)
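The three states can be expressed as a small function. This is a sketch under assumptions: the function name is hypothetical, the LEFT JOIN result is modeled as `None` when no chemiq_regulatory_lists row exists, and the set of flags that counts toward "flagged" is assumed to be the seven listing booleans from Section 3.1.

```python
from typing import Optional

# Assumed set of flags that make a linked CAS "flagged".
FLAG_COLUMNS = (
    "is_osha_pel_listed",
    "is_niosh_rel_listed",
    "is_acgih_tlv_listed",
    "is_epa_sara_313",
    "is_epa_cercla",
    "is_california_prop65",
    "is_carcinogen",
)


def link_status(regulatory_row: Optional[dict]) -> str:
    """Classify a CAS number from its (possibly missing) regulatory row."""
    if regulatory_row is None:
        return "unlinked"  # enrichment has not run yet
    if any(regulatory_row.get(col) for col in FLAG_COLUMNS):
        return "flagged"  # linked, and on at least one list
    return "linked"  # row exists, all flags false
```

Note that "flagged" is a subset of "linked"; the function returns the most specific label.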
5. Column-by-Column Determination Logic
5.1 Table Columns
| Column | Source | Determination |
|---|---|---|
| Chemical Name | chemiq_sds_composition.chemical_name | MAX(chemical_name) across all SDS documents |
| CAS Number | chemiq_sds_composition.cas_number | Grouped by CAS |
| Products | chemiq_inventory | COUNT(DISTINCT chemical_id) — number of inventory items containing this CAS |
| OSHA | chemiq_regulatory_lists.is_osha_pel_listed | See Section 6.1 |
| NIOSH | chemiq_regulatory_lists.is_niosh_rel_listed | See Section 6.2 |
| TRI | chemiq_regulatory_lists.is_epa_sara_313 | See Section 6.3 |
| CERCLA | chemiq_regulatory_lists.is_epa_cercla | See Section 6.4 |
| Prop 65 | chemiq_regulatory_lists.is_california_prop65 | See Section 6.5 |
| Carcinogen | chemiq_regulatory_lists.is_carcinogen | See Section 6.6 |
5.2 UI Icons
| State | Icon | Meaning |
|---|---|---|
| Unlinked (no regulatory data) | Gray dash (—) | CAS not yet enriched |
| Linked, flag is false | Green checkmark | Chemical is NOT on this list |
| Linked, flag is true | Red warning triangle | Chemical IS on this list; action may be required |
6. Regulatory Field Determination — Detailed Logic
6.1 OSHA (Permissible Exposure Limit)
What it means: The chemical has a legally enforceable workplace air concentration limit set by the Occupational Safety and Health Administration.
How it's determined:
Primary (PubChem PUG View):
1. Fetch "Regulatory Information" section
→ Scan text blocks for "osha" AND ("pel" OR "permissible exposure")
→ If found: is_osha_pel_listed = true
→ Also scan for standard citation pattern "1910.\d+"
→ If found: has_osha_specific_standard = true, osha_standard_citation = match
2. Fetch "Safety and Hazards" section
→ Scan for "osha" AND ("pel" OR "permissible")
→ If found: is_osha_pel_listed = true
3. Fetch "Toxicity" section
→ Scan for "osha" AND ("pel" OR "permissible")
→ If found: is_osha_pel_listed = true
4. Fetch "OSHA Standards" heading
→ Parse text for TWA values: regex "(\d+\.?\d*)\s*ppm" and "(\d+\.?\d*)\s*mg/m"
→ Store as osha_pel_ppm and osha_pel_mg_m3 in chemiq_pubchem_cache
→ Also sets is_osha_pel_listed = true
Fallback (SDS Section 15):
→ Scan regulatory keys for "osha"
→ Check value text is not a negation (e.g., "not regulated")
→ If positive mention: is_osha_pel_listed = true
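The keyword test and the TWA regexes above can be tried out with a short sketch. The function names and the sample sentence are made up for illustration; the patterns are the ones quoted in the steps above, applied case-sensitively (whether the real parser ignores case is not stated).

```python
import re
from typing import Dict, Optional

PPM_RE = re.compile(r"(\d+\.?\d*)\s*ppm")
MG_M3_RE = re.compile(r"(\d+\.?\d*)\s*mg/m")
CITATION_RE = re.compile(r"1910\.\d+")


def mentions_osha_pel(text: str) -> bool:
    """Keyword test: "osha" AND ("pel" OR "permissible exposure")."""
    t = text.lower()
    return "osha" in t and ("pel" in t or "permissible exposure" in t)


def parse_twa(text: str) -> Dict[str, Optional[float]]:
    """Extract the first ppm and mg/m3 values found in the text, if any."""
    ppm = PPM_RE.search(text)
    mg = MG_M3_RE.search(text)
    return {
        "ppm": float(ppm.group(1)) if ppm else None,
        "mg_m3": float(mg.group(1)) if mg else None,
    }


def osha_citation(text: str) -> Optional[str]:
    """Return the first 1910.x standard citation, if present."""
    match = CITATION_RE.search(text)
    return match.group(0) if match else None


sample = "OSHA PEL: TWA 1 ppm (3.19 mg/m3). See 29 CFR 1910.1028."
limits = parse_twa(sample)
```

Since only the first match is taken, text that lists a STEL or ceiling value before the TWA would yield the wrong number, which is one way the heuristic can misfire.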
6.2 NIOSH (Recommended Exposure Limit)
What it means: NIOSH has set a more protective (recommended, not legally binding) workplace exposure limit for this chemical.
How it's determined:
Primary (PubChem PUG View):
1. "Regulatory Information" section
→ Scan combined text for "niosh" AND ("rel" OR "recommended exposure")
2. "Toxicity" section
→ Scan for "niosh" AND ("rel" OR "recommended")
3. "NIOSH Recommendations" heading
→ Parse TWA values: regex for ppm and mg/m³
→ Store as niosh_rel_ppm and niosh_rel_mg_m3 in chemiq_pubchem_cache
4. "Immediately Dangerous to Life or Health" heading
→ Parse IDLH values: regex "(\d+\.?\d*)\s*ppm"
→ Store as niosh_idlh_ppm in chemiq_pubchem_cache
Fallback (SDS Section 15):
→ Scan regulatory keys/values for "niosh" AND "rel"
→ Verify positive mention
6.3 TRI (EPA Toxics Release Inventory / SARA 313)
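What it means: A facility in a covered industry sector with 10 or more full-time-equivalent employees that manufactures, processes, or otherwise uses more than the threshold quantity of this chemical in a year must file an annual Toxics Release Inventory report with the EPA. The system assumes a 10,000 lbs/year threshold; EPA's thresholds are 25,000 lbs/year for manufacturing or processing and 10,000 lbs/year for otherwise use, with lower thresholds for PBT chemicals.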
How it's determined:
Primary (PubChem PUG View):
"Regulatory Information" section
→ Scan text blocks for "sara 313" OR "tri " OR "toxic release inventory"
OR "toxics release inventory"
→ If found: is_epa_sara_313 = true
→ sara_313_threshold_lbs defaults to 10,000
Fallback (SDS Section 15):
→ Scan regulatory keys for "sara" or "tri"
→ Verify positive mention
6.4 CERCLA (Comprehensive Environmental Response, Compensation, and Liability Act)
What it means: If you release this chemical into the environment at or above its Reportable Quantity (RQ), you must notify the National Response Center within 24 hours.
How it's determined:
Primary (PubChem PUG View):
"Regulatory Information" section
→ Scan text blocks for "cercla" OR "reportable quantity"
→ If found: is_epa_cercla = true
→ Extract RQ value: regex "(\d+)\s*(?:lb|pound)" → cercla_rq_lbs
Fallback (SDS Section 15):
→ Scan regulatory keys for "cercla"
→ Verify positive mention
→ Extract RQ: regex "(\d+)\s*(?:lb|pound)" → cercla_rq_lbs
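A minimal sketch of the RQ extraction, using the pattern quoted above. The function name is hypothetical, and lowercasing the text before matching is an assumption.

```python
import re
from typing import Optional

RQ_RE = re.compile(r"(\d+)\s*(?:lb|pound)")


def extract_cercla_rq_lbs(text: str) -> Optional[int]:
    """Return the first integer followed by "lb" or "pound", or None."""
    match = RQ_RE.search(text.lower())
    return int(match.group(1)) if match else None


rq_plain = extract_cercla_rq_lbs("CERCLA Reportable Quantity: 10 lbs")
rq_comma = extract_cercla_rq_lbs("RQ 1,000 pounds")
```

One thing this makes visible: with a thousands separator ("1,000 pounds"), the pattern captures only the digits after the comma, so `rq_comma` comes out as 0 rather than 1000. If the production parser uses the pattern unmodified, stripping commas before matching would be worth checking.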
6.5 Prop 65 (California Proposition 65)
What it means: California's Safe Drinking Water and Toxic Enforcement Act requires warnings for chemicals known to the state to cause cancer or reproductive harm. Selling products in California containing listed chemicals requires consumer-facing warnings.
How it's determined:
Primary (PubChem PUG View):
"Regulatory Information" section
→ Scan text blocks for "proposition 65" OR "prop 65" OR "prop65"
→ If found: is_california_prop65 = true
→ Determine type:
"cancer" AND "reproductive" in text → prop65_type = "both"
"cancer" in text → prop65_type = "cancer"
"reproductive" OR "developmental" in text → prop65_type = "reproductive"
else → prop65_type = "cancer" (default)
Fallback (SDS Section 15):
→ Scan regulatory keys for "prop" AND "65"
→ Verify positive mention
→ Same type determination logic
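The type rules above translate directly into an ordered chain of checks. The function name is hypothetical; the branch order follows the listing above.

```python
def determine_prop65_type(text: str) -> str:
    """Map regulatory text to "both", "cancer", or "reproductive"."""
    t = text.lower()
    has_cancer = "cancer" in t
    has_reproductive = "reproductive" in t
    if has_cancer and has_reproductive:
        return "both"
    if has_cancer:
        return "cancer"
    if has_reproductive or "developmental" in t:
        return "reproductive"
    return "cancer"  # default when no type keyword is present
```

Following the rules as written, text that mentions "cancer" and "developmental" (but not "reproductive") resolves to "cancer", not "both".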
6.6 Carcinogen
What it means: One or more authoritative bodies have classified this chemical as causing or potentially causing cancer in humans.
How it's determined:
Primary (PubChem PUG View):
"Toxicity" section — scans text blocks for:
IARC (International Agency for Research on Cancer):
"iarc" AND "group 1" (AND NOT "not") → Group 1 (confirmed carcinogen)
"iarc" AND "group 2a" → Group 2A (probably carcinogenic)
"iarc" AND "group 2b" → Group 2B (possibly carcinogenic)
NTP (National Toxicology Program):
"ntp" AND "known" AND "carcinogen" → Classification "K" (Known)
"ntp" AND "reasonably anticipated" → Classification "R" (Reasonably Anticipated)
ACGIH:
"acgih" AND "a1" AND "carcinogen" → Classification "A1" (Confirmed human carcinogen)
"acgih" AND "a2" AND "carcinogen" → Classification "A2" (Suspected human carcinogen)
Priority: If multiple sources classify a chemical, the FIRST match
sets carcinogen_source and carcinogen_classification. Later matches
only fill in if carcinogen_source was not already set.
Fallback (SDS Section 15):
→ Carcinogen classification is NOT extracted from SDS Section 15
(only OSHA/NIOSH/ACGIH/TRI/CERCLA/Prop 65 are parsed from it)
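The first-match behavior can be sketched with an ordered rule table. This is an illustration under assumptions: the function name is hypothetical, rules are evaluated in the order listed above, and blocks are scanned in the order received (the real iteration order is not documented).

```python
from typing import Dict, List

# (source, classification, required substrings); list order sets priority.
RULES = [
    ("IARC", "Group 1", ("iarc", "group 1")),
    ("IARC", "Group 2A", ("iarc", "group 2a")),
    ("IARC", "Group 2B", ("iarc", "group 2b")),
    ("NTP", "K", ("ntp", "known", "carcinogen")),
    ("NTP", "R", ("ntp", "reasonably anticipated")),
    ("ACGIH", "A1", ("acgih", "a1", "carcinogen")),
    ("ACGIH", "A2", ("acgih", "a2", "carcinogen")),
]


def classify_carcinogen(text_blocks: List[str]) -> Dict[str, object]:
    """Scan text blocks; the first matching rule sets source and classification."""
    result: Dict[str, object] = {
        "is_carcinogen": False,
        "carcinogen_source": None,
        "carcinogen_classification": None,
    }
    for block in text_blocks:
        text = block.lower()
        for source, classification, required in RULES:
            if not all(term in text for term in required):
                continue
            # Group 1 is additionally guarded by the AND NOT "not" check.
            if classification == "Group 1" and "not" in text:
                continue
            result["is_carcinogen"] = True
            if result["carcinogen_source"] is None:
                result["carcinogen_source"] = source
                result["carcinogen_classification"] = classification
    return result
```

Because the "not" guard is a plain substring check, it would also suppress a Group 1 match in text containing words like "noted"; that is a property of the rule as described, not something this sketch adds.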
Carcinogen classification codes:
| Source | Classification | Meaning |
|---|---|---|
| IARC | Group 1 | Carcinogenic to humans (sufficient evidence) |
| IARC | Group 2A | Probably carcinogenic (limited human evidence, sufficient animal evidence) |
| IARC | Group 2B | Possibly carcinogenic (limited evidence in humans and animals) |
| NTP | K (Known) | Known to be a human carcinogen |
| NTP | R (Reasonably Anticipated) | Reasonably anticipated to be a human carcinogen |
| ACGIH | A1 | Confirmed human carcinogen |
| ACGIH | A2 | Suspected human carcinogen |
7. Expanded Detail View — Additional Fields
When a user clicks a row, an expanded section shows:
7.1 Regulatory Details (What This Means for Your Business)
Each true flag gets a plain-language explanation:
| Flag | Label | Business Impact |
|---|---|---|
| is_osha_pel | OSHA Permissible Exposure Limit | Must keep air levels below legal limit; ventilation, monitoring, respirators required |
| is_niosh_rel (when OSHA not listed) | NIOSH Recommended Exposure Limit | More protective guideline; not legally required but reduces liability |
| is_epa_tri | EPA Toxics Release Inventory (SARA 313) | Annual TRI report required if 10+ employees and use above threshold |
| is_epa_cercla | CERCLA Reportable Quantity | Must call National Response Center (1-800-424-8802) within 24 hours of release >= RQ |
| is_california_prop65 | California Proposition 65 | Warning labels required on products sold in California; up to $2,500/violation/day |
| is_carcinogen | Carcinogen Classification | Must inform workers, minimize exposure, special handling, medical surveillance |
7.2 Exposure Limits (How Much Is Safe?)
Numeric values from chemiq_pubchem_cache, displayed as 8-hour TWA concentrations:
| Limit | Source Column | Description |
|---|---|---|
| OSHA PEL | osha_pel_ppm / osha_pel_mg_m3 | Legally enforceable maximum; OSHA violations and fines if exceeded |
| NIOSH REL | niosh_rel_ppm / niosh_rel_mg_m3 | More protective recommended limit |
| NIOSH IDLH | niosh_idlh_ppm | Emergency threshold — workers must evacuate or use supplied-air respirators |
| ACGIH TLV | acgih_tlv_ppm | Industry best-practice guideline, not legally binding |
7.3 Products Containing This Chemical
Aggregated from the SQL query's json_agg() — shows product name, manufacturer, and site name for each inventory item containing this CAS.
8. Summary Cards
| Card | Value | Calculation |
|---|---|---|
| Unique Chemicals | Total unique CAS numbers | COUNT(*) from query results (all CAS in inventory) |
| Linked to Registry | CAS numbers with regulatory data | Count where chemiq_regulatory_lists row exists |
| Unlinked | CAS numbers without regulatory data | total - linked_count |
| Regulatory Flags | CAS numbers on at least one list | Count where linked AND any of: is_osha_pel, is_niosh_rel, is_acgih_tlv, is_epa_tri, is_epa_cercla, is_california_prop65, is_carcinogen is true |
9. Filtering
Regulatory Filter
The dropdown filter applies post-query in Python:
| Filter Value | Matches |
|---|---|
| osha | is_osha_pel = true |
| niosh | is_niosh_rel = true |
| acgih | is_acgih_tlv = true |
| tri | is_epa_tri = true |
| cercla | is_epa_cercla = true |
| prop65 | is_california_prop65 = true |
| carcinogen | is_carcinogen = true |
Unlinked CAS numbers are excluded when a regulatory filter is active.
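A sketch of the post-query filter, assuming each result item is a dict whose `regulatory` entry is `None` for unlinked CAS numbers. The item shape and function name are hypothetical; the filter-to-flag mapping is the table above.

```python
from typing import List, Optional

FILTER_TO_FLAG = {
    "osha": "is_osha_pel",
    "niosh": "is_niosh_rel",
    "acgih": "is_acgih_tlv",
    "tri": "is_epa_tri",
    "cercla": "is_epa_cercla",
    "prop65": "is_california_prop65",
    "carcinogen": "is_carcinogen",
}


def apply_regulatory_filter(items: List[dict], filter_value: Optional[str]) -> List[dict]:
    """Keep items whose flag for the chosen filter is true; drop unlinked ones."""
    if not filter_value:
        return list(items)  # no filter active: everything passes
    flag = FILTER_TO_FLAG[filter_value]
    return [
        item for item in items
        if item.get("regulatory") is not None and item["regulatory"].get(flag)
    ]


rows = [
    {"cas_number": "71-43-2", "regulatory": {"is_osha_pel": True, "is_epa_tri": True}},
    {"cas_number": "7732-18-5", "regulatory": {"is_osha_pel": False}},
    {"cas_number": "64742-47-8", "regulatory": None},  # unlinked
]
osha_rows = apply_regulatory_filter(rows, "osha")
```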
Search
Searches by CAS number or chemical name (case-insensitive ILIKE).
10. Key Files
| Layer | File | Purpose |
|---|---|---|
| Frontend Page | tellus-ehs-hazcom-ui/src/pages/chemiq/cas-linking/index.tsx | React page with table, filters, expanded detail view |
| Frontend Types | tellus-ehs-hazcom-ui/src/types/index.ts (lines 781–838) | CASRegulatoryStatus, CASExposureLimits, CASLinkingItem types |
| Frontend API Client | tellus-ehs-hazcom-ui/src/services/api/chemiq.api.ts | getCASLinking() function |
| API Endpoint | tellus-ehs-hazcom-service/app/api/v1/chemiq/inventory.py (line 449) | GET /api/v1/chemiq/cas-linking |
| Query Service | tellus-ehs-hazcom-service/app/services/chemiq/cas_linking_service.py | CASLinkingService — SQL query, filtering, pagination |
| Pydantic Schemas | tellus-ehs-hazcom-service/app/schemas/chemiq/cas_linking.py | Request/response validation |
| DB Model — Regulatory | tellus-ehs-hazcom-service/app/db/models/compliance.py | RegulatoryList (chemiq_regulatory_lists) |
| DB Model — Cache | tellus-ehs-hazcom-service/app/db/models/chemical_enrichment.py | PubChemCache (chemiq_pubchem_cache) |
| Enrichment Service | tellus-ehs-background-service/app/services/regulatory_enrichment/service.py | RegulatoryEnrichmentService — orchestrates fetch and upsert |
| PubChem Client | tellus-ehs-background-service/app/services/regulatory_enrichment/pubchem_regulatory_client.py | PubChemRegulatoryClient — API calls, text parsing |
| SDS Section 15 Client | tellus-ehs-background-service/app/services/regulatory_enrichment/sds_section15_client.py | SDSSection15Client — fallback parser |
11. Database Tables
| Table | Purpose |
|---|---|
| chemiq_regulatory_lists | Boolean regulatory flags per CAS number (global, not per-company) |
| chemiq_pubchem_cache | Numeric exposure limits and chemical identity per CAS |
| chemiq_sds_composition | CAS numbers extracted from parsed SDS documents |
| chemiq_sds_sections | Parsed SDS sections (Section 15 used as fallback for regulatory data) |
| chemiq_company_compliance_settings | Company-specific settings (ships_to_california, employee_count, etc.) |
| chemiq_compliance_results | Per-item compliance check results (not used by CAS linking page) |
12. Data Enrichment Pipeline
1. SDS uploaded and parsed
→ CAS numbers extracted to chemiq_sds_composition
2. Background job detects new CAS numbers
→ RegulatoryEnrichmentService.enrich_cas(cas_number)
3. Check chemiq_regulatory_lists
→ If exists: skip (already enriched)
→ If not: proceed to fetch
4. PubChem PUG View API
→ Resolve CAS → CID
→ Fetch Regulatory Information, Safety and Hazards, Toxicity sections
→ Fetch OSHA Standards, NIOSH Recommendations, IDLH, TLV headings
→ Parse text blocks with keyword/regex matching
→ Extract boolean flags + numeric exposure values
5. If PubChem returns nothing:
→ SDSSection15Client reads from chemiq_sds_sections
→ Parses safety_health_environmental_regulations JSON
→ Extracts boolean flags (no numeric values)
6. If neither source has data:
→ Insert row with all flags = false
→ Note: "Not found in PubChem or SDS Section 15"
7. Results stored:
→ Boolean flags → chemiq_regulatory_lists (INSERT)
→ Numeric exposure limits → chemiq_pubchem_cache (UPDATE existing row)
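The control flow of steps 3 through 7 can be sketched with an in-memory dict standing in for chemiq_regulatory_lists. Everything here is illustrative: the fetchers are passed in as callables, the `notes` key is an assumed column name, and numeric exposure limits are left out.

```python
from typing import Callable, Dict, Optional

NOT_FOUND_NOTE = "Not found in PubChem or SDS Section 15"
ALL_FALSE = {
    "is_osha_pel_listed": False,
    "is_niosh_rel_listed": False,
    "is_acgih_tlv_listed": False,
    "is_epa_sara_313": False,
    "is_epa_cercla": False,
    "is_california_prop65": False,
    "is_carcinogen": False,
}
Fetcher = Callable[[str], Optional[dict]]


def enrich_cas(
    cas_number: str,
    regulatory_lists: Dict[str, dict],
    fetch_pubchem: Fetcher,
    fetch_section15: Fetcher,
) -> str:
    """Enrich one CAS number; return which path was taken."""
    if cas_number in regulatory_lists:
        return "skipped"  # step 3: already enriched
    flags = fetch_pubchem(cas_number)  # step 4: primary source
    source = "pubchem"
    if not flags:
        flags = fetch_section15(cas_number)  # step 5: fallback
        source = "sds_section_15"
    if not flags:
        # step 6: placeholder row so the CAS is not re-fetched
        regulatory_lists[cas_number] = {**ALL_FALSE, "notes": NOT_FOUND_NOTE}
        return "not_found"
    regulatory_lists[cas_number] = {**ALL_FALSE, **flags}  # step 7
    return source
```

The placeholder row in step 6 is what makes a not-found CAS count as "linked" on the page, with every column showing as not listed.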
Backfill Job
The get_unenriched_cas_numbers() method finds CAS numbers in chemiq_sds_composition that have no corresponding row in chemiq_regulatory_lists. It validates CAS format with regex ^\d{1,7}-\d{2}-\d$ before returning candidates. This enables batch enrichment of chemicals that were missed during initial SDS processing.
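The candidate selection might be sketched as follows, with plain Python collections standing in for the two tables. The function name and the de-duplication step are assumptions; the format pattern is the one stated above.

```python
import re
from typing import Iterable, List, Set

CAS_FORMAT_RE = re.compile(r"^\d{1,7}-\d{2}-\d$")


def unenriched_candidates(composition_cas: Iterable[str], enriched_cas: Set[str]) -> List[str]:
    """Return well-formed CAS numbers that have no regulatory row yet."""
    seen: Set[str] = set()
    candidates: List[str] = []
    for cas in composition_cas:
        if cas in enriched_cas or cas in seen:
            continue
        if CAS_FORMAT_RE.match(cas):  # drops values like "N/A" or "proprietary"
            candidates.append(cas)
            seen.add(cas)
    return candidates


pending = unenriched_candidates(
    ["71-43-2", "N/A", "108-88-3", "71-43-2", "proprietary"],
    enriched_cas={"108-88-3"},
)
```

The pattern checks shape only; it does not verify the CAS check digit, so a mistyped but well-formed number would still be returned.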
13. Current Limitations
- First-match carcinogen classification: If multiple sources classify a chemical (e.g., IARC Group 1 AND NTP Known), only the first match sets carcinogen_source and carcinogen_classification. Later sources only fill in if the field was still empty.
- No ACGIH TLV from SDS Section 15: The SDS Section 15 fallback does not extract ACGIH-specific data for carcinogen classification, only boolean listing flags.
- Exposure limit parsing is heuristic: The regex-based extraction from PubChem text blocks may miss non-standard formats or extract incorrect values from complex multi-value text.
- No automatic refresh: Once a CAS number is enriched, the data is not automatically re-fetched. The last_verified_at timestamp tracks when data was fetched, but no scheduled refresh exists.
- Company compliance settings (chemiq_company_compliance_settings) exist in the model but are not yet used by the CAS linking page to filter applicability (e.g., hiding Prop 65 for companies that don't sell in California).
- TRI threshold defaults to 10,000 lbs for all chemicals. PBT (Persistent Bioaccumulative Toxic) chemicals have lower thresholds (100 lbs) but this is not yet dynamically determined.
- Regulatory filter is post-query: The filter is applied in Python after fetching all results, not in SQL. This works for moderate data sizes but may be inefficient for companies with thousands of CAS numbers.