Methodology
LeadPipeLookup aggregates data from public sources only. Everything you see on a page links back to a primary source. Here's exactly how we do it.
Source layer
We combine the following authoritative datasets:
- EPA Envirofacts SDWIS — the base directory of every US public water system keyed by PWSID, including population served, source type, and owner type. Monthly pull, per state.
- City utility portals — address-level lead service line inventories. Currently active: Chicago Department of Water Management (monthly). Additional cities (Cleveland, NYC DEP, DC Water, Detroit, Milwaukee, St. Louis, Boston, Newark) are tracked on our data sources page but most offer ArcGIS viewers only, without bulk CSV downloads.
- State LSL aggregators — per-water-system lead service line counts from state agencies. Currently active: New Jersey DEP (via Rutgers University ArcGIS layer, monthly) and New York State DOH (via Socrata aggregates, monthly). Additional states are added as public rollup endpoints become available.
- Census TIGER/Line — ZIP ↔ place ↔ state geometry. Annual.
- HUD USPS Crosswalk — ZIP-to-CBSA mapping to infer utility service area. Quarterly.
- CDC Environmental Public Health Tracking — county-level childhood blood lead surveillance. Quarterly.
Change detection
Every raw file we pull is hashed (SHA-256) and stored. If the hash matches last month's pull, we skip re-parsing — only the last_verified_at timestamp is bumped on the affected pages. This is why a ZIP page that hasn't substantively changed can still correctly claim a 2026-04 verification date.
Quality gates
We halt an ingest and send an alert if any of these fire:
- Row count swings by more than ±30% compared to the prior pull (±15% for EPA SDWIS).
- Required columns declared in the source's schema.yaml are missing.
- More than 5% of rows fail primary-key or format validation (ZIP, PWSID).
When a gate trips, the snapshot is flagged halted, no writes reach production, and a changelog entry is generated so readers can see what was suppressed and why.
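The three gates can be expressed as a single check that returns the list of tripped rules, empty meaning the ingest may proceed. A minimal sketch; the function signature, the `"epa_sdwis"` source key, and the gate names are illustrative assumptions, but the thresholds are the ones stated above.

```python
def run_quality_gates(rows: list[dict], prior_count: int,
                      required_cols: set[str], invalid_rows: int,
                      source: str = "generic") -> list[str]:
    """Return the names of tripped gates; an empty list means the ingest proceeds.

    Thresholds per the rules above: +/-30% row-count swing (+/-15% for EPA
    SDWIS) and a 5% cap on rows failing primary-key/format validation.
    """
    tripped = []
    swing_limit = 0.15 if source == "epa_sdwis" else 0.30
    if prior_count and abs(len(rows) - prior_count) / prior_count > swing_limit:
        tripped.append("row_count_swing")
    if rows and not required_cols.issubset(rows[0].keys()):
        tripped.append("missing_columns")
    if rows and invalid_rows / len(rows) > 0.05:
        tripped.append("validation_failures")
    return tripped
```

Any non-empty result halts the snapshot before it can write to production.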
Diff-and-upsert, never truncate
Even when an ingest succeeds, we never drop and recreate tables. Each incoming row is compared field-by-field with the production record; differences are written to a change_log table with (entity, field, old_value, new_value, changed_at, snapshot_id). Soft-deletes only — we never hard-delete a water system row, so your bookmark and Google's index both keep working when a utility dissolves or merges.
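The field-by-field comparison behind the change_log can be sketched like this. Assumptions of the sketch: the helper name, dict-shaped rows, and ISO-8601 timestamps; the change_log columns are the ones listed above.

```python
from datetime import datetime, timezone

def diff_record(entity: str, old: dict, new: dict, snapshot_id: str) -> list[dict]:
    """Compare an incoming row field-by-field with the production record.

    Returns one change_log entry per differing field; the caller upserts the
    new values and appends these entries. Nothing is ever deleted here.
    """
    changes = []
    for field, new_value in new.items():
        old_value = old.get(field)
        if old_value != new_value:
            changes.append({
                "entity": entity,
                "field": field,
                "old_value": old_value,
                "new_value": new_value,
                "changed_at": datetime.now(timezone.utc).isoformat(),
                "snapshot_id": snapshot_id,
            })
    return changes
```

An unchanged row yields an empty diff and touches nothing, which keeps the change_log an honest record of what actually moved between snapshots.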
What "unknown" means
The EPA's LCRR defines four material categories: lead, galvanized requiring replacement, non-lead, and unknown. "Unknown" means the utility has not yet physically verified or documented the pipe material for that service line. Under the LCRR, unknowns must be treated as lead for planning and replacement purposes until inspected — so a high "unknown" count is a planning flag, not a reassurance.
Worst-of-two rule for service lines
A service line has a public side (from the water main to the property line) and a private side (from the property line to the building). We report the worse of the two classifications for the whole line: if either side is lead, the line is lead, even when the other side is galvanized or non-lead; if neither side is lead but one is galvanized requiring replacement, we report galvanized requiring replacement. This matches EPA guidance.
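As a sketch, the rule reduces to taking the maximum over a severity ordering. The category strings and the placement of "unknown" above "non_lead" (consistent with treating unknowns as potential lead for planning) are assumptions of this illustration, not a statement of our production schema.

```python
# Severity ordering for the worst-of-two rule. Ranking "unknown" above
# "non_lead" is this sketch's assumption, mirroring the LCRR stance that
# unverified lines are treated as potentially lead until inspected.
SEVERITY = {
    "lead": 3,
    "galvanized_requiring_replacement": 2,
    "unknown": 1,
    "non_lead": 0,
}

def classify_line(public_side: str, private_side: str) -> str:
    """Report the worse of the two sides as the whole line's material."""
    return max(public_side, private_side, key=SEVERITY.__getitem__)
```

So a line with a non-lead public side but a lead private side is still reported as lead.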
Freshness signals
Every page emits a dateModified in JSON-LD (visible to Google) and shows a human-readable "Last verified from [source]" badge. Sitemaps include lastmod per URL and are refreshed on every successful ingest.
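A minimal sketch of the JSON-LD block a page might emit. Only `dateModified` is the point here; the `@type`, the example URL, and the description wording are illustrative assumptions, not our exact markup.

```python
import json
from datetime import date

def page_jsonld(url: str, last_verified: date, source: str) -> str:
    """Build a schema.org JSON-LD snippet carrying the page's dateModified.
    Fields other than dateModified are illustrative, not the production schema."""
    return json.dumps({
        "@context": "https://schema.org",
        "@type": "Dataset",
        "url": url,
        "dateModified": last_verified.isoformat(),
        "description": f"Last verified from {source}",
    }, indent=2)
```

Because `dateModified` and the sitemap `lastmod` are both driven by the same ingest run, crawlers and readers see a consistent freshness signal.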
Where the methodology will change
- Once we have PostGIS wired up, ZIP→utility mapping will use actual service-area polygons instead of HUD's ZIP↔CBSA heuristic. Today's mapping is conservative (a ZIP may show several utilities).
- We plan to add a per-utility confidence score based on inventory completeness (percentage of known vs unknown) and source verification age.
See the changelog for methodology updates.