This guide is part of Mastering Geospatial Data Ingestion in Python.

Web Scraping Spatial Metadata in Python

Extracting spatial metadata from unstructured or semi-structured web sources remains a persistent bottleneck in geospatial data engineering. While standardized APIs and machine-readable catalogs continue to proliferate, a significant volume of authoritative environmental, cadastral, and municipal datasets still resides behind legacy portals, HTML tables, or dynamically rendered dashboards. Web scraping bridges this gap by enabling automated discovery of coordinate reference systems (CRS), bounding boxes, and temporal extents before downstream ingestion begins.

Naive approaches collapse quickly in production: hardcoded CSS selectors break on DOM updates, coordinate order assumptions silently corrupt spatial joins, and undetected encoding mismatches turn CRS strings into mojibake before pyproj ever sees them. Treating metadata extraction as a structured engineering task, not an ad-hoc scripting exercise, ensures data lineage, spatial integrity, and pipeline resilience. The workflow below applies directly to the broader ingestion strategy in Mastering Geospatial Data Ingestion in Python, where metadata validation precedes heavy geometry processing and prevents costly projection mismatches in production environments.

Prerequisites & Environment Setup

Before implementing a scraping pipeline, establish a controlled Python environment with version-pinned dependencies. Spatial metadata extraction requires HTTP clients, DOM parsers, and geospatial validation libraries.

python -m venv .venv
source .venv/bin/activate   # Linux/macOS
# .venv\Scripts\activate    # Windows

pip install "requests==2.32.*" "beautifulsoup4==4.12.*" lxml pyarrow \
            "pyproj==3.6.*" "pandas==2.2.*" "pandera==0.20.*"

Core library roles:

requests + lxml — Fast, reliable HTTP fetching and HTML parsing. requests.Session() with connection pooling significantly improves throughput when crawling paginated directories.
beautifulsoup4 — DOM traversal and attribute extraction. Paired with lxml, it handles malformed markup common in legacy government HTML.
pyproj — CRS validation, EPSG lookup, and projection string normalization. The authoritative library for coordinate transformation and datum handling.
pandas + pandera — Structured metadata tabulation and schema enforcement before CSV/Parquet export.

Ensure Python 3.10+ is active. Many legacy portals serve ISO-8859-1 or Windows-1252 encoded HTML; always inspect response.apparent_encoding before passing content to BeautifulSoup.

Version & Compatibility Matrix

Library	Minimum Version	Recommended Method	Known Caveats
`requests`	2.28	`Session` + `HTTPAdapter` retry	`Retry` comes from urllib3, not requests: `allowed_methods` requires urllib3 1.26+, and the older `method_whitelist` name was removed in urllib3 2.0
`beautifulsoup4`	4.11	`lxml` parser	`html.parser` silently drops malformed `<meta>` tags on some government pages
`pyproj`	3.4	`CRS.from_string()`	`CRS.from_epsg()` raises `CRSError` rather than returning `None` on unknown codes in 3.4+
`pandas`	2.0	`DataFrame.to_parquet()` with `engine='pyarrow'`	Copy-on-write is opt-in in 2.x and slated to become the default in 3.0; avoid chained in-place column mutation
`pandera`	0.18	`DataFrameSchema` with `coerce=True`	0.17 and below do not support nullable `dtype` aliases

Step-by-Step Implementation

Step 1 — Ingest: Resilient Request Handling

Identify the HTML pages or directory listings containing dataset descriptions. Construct requests with realistic headers, connection timeouts, and retry logic. Respect robots.txt and implement exponential backoff to avoid IP bans.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup
import logging

log = logging.getLogger(__name__)

def get_session(max_retries: int = 3) -> requests.Session:
    session = requests.Session()
    retry_strategy = Retry(
        total=max_retries,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["HEAD", "GET", "OPTIONS"],
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    session.headers.update({
        "User-Agent": "GeospatialETL/1.0 (metadata-harvest; contact@example.com)",
        "Accept": "text/html,application/xhtml+xml",
    })
    return session


def fetch_page(session: requests.Session, url: str, timeout: int = 30) -> BeautifulSoup:
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    # Correct encoding before parsing: requests falls back to ISO-8859-1 for
    # text/* responses whose Content-Type header omits a charset, even when the
    # body is actually UTF-8 or Windows-1252
    if not response.encoding or response.encoding.upper() in ("ISO-8859-1", "LATIN-1"):
        response.encoding = response.apparent_encoding
    log.debug("Fetched %s (%d bytes, encoding=%s)", url, len(response.content), response.encoding)
    return BeautifulSoup(response.text, "lxml")
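The robots.txt check mentioned above is not handled by `fetch_page` itself. One way to sketch it is with the standard library's `urllib.robotparser`. The helper names `build_robot_parser` and `is_allowed` and the sample rules are illustrative assumptions; a real crawl would load the portal's live file with `set_url()` and `read()` instead of a hardcoded string.

```python
from urllib.robotparser import RobotFileParser

# Assumed to match the product token of the session's User-Agent header
USER_AGENT = "GeospatialETL/1.0"


def build_robot_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True when the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


# Illustrative rules, not taken from any real portal
sample_rules = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""
robots = build_robot_parser(sample_rules)
```

Calling `is_allowed(robots, url)` before each `fetch_page` call, and honouring `robots.crawl_delay(USER_AGENT)` when it is set, keeps the crawler within the portal's published limits.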

Step 2 — Diagnose: DOM Traversal and Metadata Field Extraction

Locate metadata blocks using CSS selectors or attribute-based matching. Government and municipal portals frequently structure spatial metadata in HTML definition lists (<dl>), nested tables, or <meta> tags within <head>. Avoid single brittle selectors; layer fallback strategies so the function degrades gracefully when the DOM changes.

def extract_spatial_fields(soup: BeautifulSoup, source_url: str) -> dict:
    """Extract raw CRS, bbox, title, and temporal fields from a parsed portal page."""
    metadata: dict = {"source_url": source_url}

    # CRS: try <meta name="projection">, then fall back to a <span> data attribute
    crs_tag = soup.find("meta", attrs={"name": "projection"})
    if crs_tag:
        metadata["raw_crs"] = crs_tag.get("content", "").strip()
    else:
        # Many ArcGIS-based portals embed CRS inside a <span> with a data attribute
        crs_span = soup.find("span", attrs={"data-field": "spatialReference"})
        if crs_span:
            metadata["raw_crs"] = crs_span.get_text(strip=True)

    # Bounding box: div.extent-coordinates is common on ESRI Open Data portals
    bbox_div = soup.find("div", class_="extent-coordinates")
    if bbox_div:
        metadata["raw_bbox"] = bbox_div.get_text(separator=" ", strip=True)

    # Dataset title
    h1 = soup.find("h1")
    metadata["title"] = h1.get_text(strip=True) if h1 else ""

    # Temporal extent: read the first <time> element, preferring its datetime attribute
    time_tag = soup.find("time")
    if time_tag:
        metadata["temporal_start"] = time_tag.get("datetime", time_tag.get_text(strip=True))

    return metadata
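The function above covers `<meta>` and `<span>` sources but not the definition lists mentioned earlier. A minimal sketch of a `<dl>` fallback follows. The helper names, the candidate label list, and the sample HTML are all hypothetical, and the built-in `html.parser` is used only to keep the example dependency-light.

```python
from typing import Optional

from bs4 import BeautifulSoup


def extract_dl_fields(soup: BeautifulSoup) -> dict[str, str]:
    """Map each <dt> label (lowercased, trailing colon removed) to its <dd> text."""
    fields: dict[str, str] = {}
    for dt in soup.find_all("dt"):
        dd = dt.find_next_sibling("dd")
        if dd is not None:
            label = dt.get_text(strip=True).rstrip(":").lower()
            fields[label] = dd.get_text(" ", strip=True)
    return fields


def first_match(fields: dict[str, str], candidates: list[str]) -> Optional[str]:
    """Return the value for the first candidate label that is present and non-empty."""
    for label in candidates:
        if fields.get(label):
            return fields[label]
    return None


# Hypothetical portal markup for illustration
html = """
<dl>
  <dt>Coordinate System:</dt><dd>EPSG:26915</dd>
  <dt>Extent</dt><dd>-97.2 43.5 -89.5 49.4</dd>
</dl>
"""
fields = extract_dl_fields(BeautifulSoup(html, "html.parser"))
raw_crs = first_match(fields, ["projection", "spatial reference", "coordinate system"])
```

Keeping the label candidates in an ordered list means a portal redesign that renames a field usually needs a one-line change rather than a new selector.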

Step 3 — Transform: Spatial Reference Normalization with pyproj

Raw CRS strings are notoriously inconsistent across legacy portals. You will encounter EPSG codes (EPSG:4326), PROJ strings (+proj=longlat +datum=WGS84), WKT fragments, and ambiguous names such as NAD83 / UTM Zone 15N. Normalizing them with pyproj at this stage prevents silent projection mismatches from propagating into CRS normalization across mixed datasets further down the pipeline.

from pyproj import CRS
from pyproj.exceptions import CRSError
import re


def normalize_crs(raw_crs: str) -> str | None:
    """Return "EPSG:<code>" when resolvable, WKT when no EPSG code matches, else None."""
    if not raw_crs or not raw_crs.strip():
        return None
    try:
        crs_obj = CRS.from_string(raw_crs.strip())
        epsg = crs_obj.to_epsg()
        return f"EPSG:{epsg}" if epsg else crs_obj.to_wkt()
    except CRSError:
        pass

    # Fallback: extract a 4-5 digit EPSG code from free-form text
    match = re.search(r"(?:EPSG[:\s]*)(\d{4,5})", raw_crs, re.IGNORECASE)
    if match:
        try:
            return f"EPSG:{CRS.from_epsg(int(match.group(1))).to_epsg()}"
        except CRSError:
            pass

    return None


def parse_bbox(raw_bbox: str, crs_auth: str | None) -> list[float] | None:
    """
    Extract [minx, miny, maxx, maxy] from a free-form bounding box string.

    Many portals list values as lat/lon pairs even when the CRS is projected;
    always verify axis order against the resolved CRS authority string.
    """
    if not raw_bbox:
        return None
    # Match up to four numeric tokens (integers or decimals, including negatives)
    nums = re.findall(r"-?\d+\.?\d*", raw_bbox)
    if len(nums) < 4:
        return None
    coords = [float(n) for n in nums[:4]]

    # Heuristic: for common geographic CRSs, if the values look latitude-first
    # (|val[0]| ≤ 90 and |val[1]| > 90), swap each pair into lon/lat order.
    # This only catches boxes whose longitudes exceed ±90; others need manual review.
    if crs_auth in ("EPSG:4326", "EPSG:4269") and abs(coords[0]) <= 90 and abs(coords[1]) > 90:
        coords = [coords[1], coords[0], coords[3], coords[2]]

    return coords  # [minx, miny, maxx, maxy]

Bounding boxes require the same rigor. Portals often swap latitude/longitude order or mix decimal degrees with projected metres. Always validate against the normalized CRS and store coordinates in a consistent [minx, miny, maxx, maxy] format before passing them to geopandas or PostGIS.
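A small sanity check can make that validation concrete. The sketch below is an assumption about what "validate against the normalized CRS" might look like in its simplest form. The function name `check_bbox` and the `geographic` flag are hypothetical, and the caller is expected to derive the flag from the resolved CRS.

```python
def check_bbox(bbox: list[float], geographic: bool) -> list[str]:
    """Return the problems found in [minx, miny, maxx, maxy]; an empty list means OK."""
    if len(bbox) != 4:
        return ["expected exactly four values"]
    problems: list[str] = []
    minx, miny, maxx, maxy = bbox
    if minx > maxx:
        # Boxes that cross the antimeridian legitimately trip this check
        problems.append("minx is greater than maxx")
    if miny > maxy:
        problems.append("miny is greater than maxy")
    if geographic:
        # Degree ranges only make sense for geographic CRSs
        if not (-180 <= minx <= 180 and -180 <= maxx <= 180):
            problems.append("longitude outside [-180, 180]")
        if not (-90 <= miny <= 90 and -90 <= maxy <= 90):
            problems.append("latitude outside [-90, 90]")
    return problems
```

A latitude-first box such as `[43.5, -97.2, 49.4, -89.5]` fails the latitude range check, which is a cheap signal that the axis order needs attention before the record moves on.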

Step 4 — Validate: Schema Enforcement with pandera

Once extracted and normalized, metadata should be validated against a strict schema before moving downstream. Using pandera catches type violations, null fields in mandatory columns, and out-of-range bounding box values before they reach storage.

import pandera as pa
import pandas as pd

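# Note: the in_range checks below assume geographic degrees; bounding boxes in a
# projected CRS (metres) will fail them and should be reprojected or validated separately
spatial_metadata_schema = pa.DataFrameSchema(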
    columns={
        "source_url":     pa.Column(str, nullable=False),
        "title":          pa.Column(str, nullable=True),
        "normalized_crs": pa.Column(str, nullable=True),
        "bbox_minx":      pa.Column(float, pa.Check.in_range(-180, 180), nullable=True),
        "bbox_miny":      pa.Column(float, pa.Check.in_range(-90, 90),   nullable=True),
        "bbox_maxx":      pa.Column(float, pa.Check.in_range(-180, 180), nullable=True),
        "bbox_maxy":      pa.Column(float, pa.Check.in_range(-90, 90),   nullable=True),
        "scraped_at":     pa.Column(str, nullable=False),
    },
    coerce=True,
)


def compile_metadata(records: list[dict], parser_version: str = "1.0") -> pd.DataFrame:
    from datetime import datetime, timezone

    rows = []
    for rec in records:
        crs_auth = normalize_crs(rec.get("raw_crs", ""))
        bbox = parse_bbox(rec.get("raw_bbox", ""), crs_auth)
        rows.append({
            "source_url":     rec.get("source_url", ""),
            "title":          rec.get("title", ""),
            "normalized_crs": crs_auth,
            "bbox_minx":      bbox[0] if bbox else None,
            "bbox_miny":      bbox[1] if bbox else None,
            "bbox_maxx":      bbox[2] if bbox else None,
            "bbox_maxy":      bbox[3] if bbox else None,
            "scraped_at":     datetime.now(timezone.utc).isoformat(),
            "parser_version": parser_version,
        })

    df = pd.DataFrame(rows)
    return spatial_metadata_schema.validate(df, lazy=True)

Step 5 — Log: Parquet Output and Lineage Tracking

Store validated output as Apache Parquet to preserve column types and compress efficiently. Always include scraped_at, source_url, and parser_version to maintain auditability and support reproducible research.

def persist_metadata(df: pd.DataFrame, output_path: str) -> None:
    import logging
    log = logging.getLogger(__name__)

    df.to_parquet(output_path, index=False, engine="pyarrow", compression="snappy")
    log.info(
        "Persisted %d metadata records to %s (%.1f KB in memory)",
        len(df),
        output_path,
        df.memory_usage(deep=True).sum() / 1024,
    )

Advanced Patterns and Edge Cases

JavaScript-Rendered Portals and API-First Alternatives

Many modern municipal dashboards render metadata client-side via JavaScript. In these cases, requests and BeautifulSoup return empty containers. Switching to a headless browser like Playwright becomes necessary, but adds significant infrastructure overhead. The preferred alternative is to identify the underlying REST API — most ArcGIS Online and Socrata portals expose a JSON endpoint alongside the HTML page.

For OpenStreetMap derivatives and community-mapped datasets, Fetching OSM Data via the Overpass API bypasses HTML parsing entirely, returning structured JSON that feeds directly into the normalization layer.

import requests

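def try_arcgis_json_endpoint(portal_url: str, session: requests.Session) -> dict | None:
    """
    Many ESRI Open Data portal pages have a sibling /data.json or
    ?f=json endpoint that returns machine-readable metadata.
    Try it before falling back to HTML scraping.
    """
    # Append f=json without clobbering an existing query string
    separator = "&" if "?" in portal_url else "?"
    json_url = portal_url.rstrip("/") + f"{separator}f=json"
    try:
        resp = session.get(json_url, timeout=15)
        resp.raise_for_status()
        payload = resp.json()
    except (requests.RequestException, ValueError):
        # Network failure, HTTP error, or non-JSON body: fall back to HTML scraping
        return None

    # ArcGIS REST responses carry the CRS at spatialReference.wkid (service level)
    # or extent.spatialReference.wkid (layer level)
    extent = payload.get("extent") or {}
    spatial_ref = payload.get("spatialReference") or extent.get("spatialReference") or {}
    wkid = spatial_ref.get("latestWkid") or spatial_ref.get("wkid")
    if not wkid:
        return None
    corners = [extent.get(key) for key in ("xmin", "ymin", "xmax", "ymax")]
    return {
        "raw_crs": f"EPSG:{wkid}",
        # Build the bbox from the extent corners, not from its spatialReference
        "raw_bbox": " ".join(str(c) for c in corners) if None not in corners else "",
    }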

Encoding Pitfalls in Legacy Government HTML

Government portals built before 2010 frequently declare charset=UTF-8 in their Content-Type header while actually serving ISO-8859-1 or Windows-1252 bytes. This produces mojibake in CRS strings and titles, with replacement characters (U+FFFD) or sequences such as Â° appearing where degree symbols or accented characters should be, which pyproj cannot parse. The apparent_encoding property, backed by charset_normalizer in requests 2.26 and later (or chardet when it is installed), resolves this in most cases, but detection is unreliable on short responses. For portals with consistently mangled encoding, pin the codec explicitly after an initial manual audit.

KNOWN_PORTAL_ENCODINGS: dict[str, str] = {
    "data.cityofchicago.org": "utf-8",
    "gis.state.mn.us":        "windows-1252",
}

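def safe_decode(response: requests.Response) -> str:
    from urllib.parse import urlparse

    # str.lstrip("www.") strips any leading 'w' or '.' characters, which mangles
    # hosts such as "wisconsin.gov"; removeprefix drops the literal prefix only
    hostname = (urlparse(response.url).hostname or "").removeprefix("www.")
    forced_enc = KNOWN_PORTAL_ENCODINGS.get(hostname)
    if forced_enc:
        return response.content.decode(forced_enc, errors="replace")
    response.encoding = response.apparent_encoding or "utf-8"
    return response.text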
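To see why a wrong codec matters, it can help to reproduce the damage on a known string. The snippet below is a small illustration using only built-in codecs; the sample label `45°N` is made up.

```python
degree_label = "45°N"

# UTF-8 bytes misread as Latin-1: the two-byte degree sign becomes two characters
utf8_as_latin1 = degree_label.encode("utf-8").decode("latin-1")

# Windows-1252 bytes misread as UTF-8: the lone 0xB0 byte is invalid UTF-8,
# so errors="replace" substitutes U+FFFD
cp1252_as_utf8 = degree_label.encode("cp1252").decode("utf-8", errors="replace")

# Reversing the first mistake with the correct codec pair restores the text
repaired = utf8_as_latin1.encode("latin-1").decode("utf-8")

print(repr(utf8_as_latin1))   # '45Â°N'
print(repr(cp1252_as_utf8))   # '45\ufffdN'
print(repr(repaired))         # '45°N'
```

The first kind of damage is reversible, while the second loses the original byte once `errors="replace"` has been applied, which is one reason to decode from `response.content` rather than repairing `response.text` after the fact.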

Chunked Crawling of Paginated Dataset Directories

Large government open-data portals expose hundreds of dataset entries across paginated listing pages. Crawling these sequentially with a fixed delay avoids rate-limiting; crawling them without checkpointing means restarting from zero after a network fault. Implement a lightweight file-based checkpoint that records processed URLs and skips them on retry.

import json
from pathlib import Path

def load_checkpoint(checkpoint_path: str) -> set[str]:
    p = Path(checkpoint_path)
    if p.exists():
        return set(json.loads(p.read_text()))
    return set()

def save_checkpoint(checkpoint_path: str, visited: set[str]) -> None:
    Path(checkpoint_path).write_text(json.dumps(sorted(visited)))

# Usage within a crawl loop:
# visited = load_checkpoint("crawl_checkpoint.json")
# for url in all_urls:
#     if url in visited:
#         continue
#     soup = fetch_page(session, url)
#     records.append(extract_spatial_fields(soup, url))
#     visited.add(url)
#     save_checkpoint("crawl_checkpoint.json", visited)
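Because the checkpoint is rewritten after every URL, a crash in the middle of a write could leave a truncated JSON file. A possible hardening, sketched here under the assumption that the checkpoint lives on a local filesystem, is to write to a temporary file and swap it into place. The name `save_checkpoint_atomic` is hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint_atomic(checkpoint_path: str, visited: set[str]) -> None:
    """Write the checkpoint to a temp file in the same directory, then swap it in."""
    target = Path(checkpoint_path)
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(sorted(visited), handle)
        # os.replace swaps the file in one step when both paths share a filesystem
        os.replace(tmp_name, target)
    except BaseException:
        # Clean up the partial temp file and let the caller see the error
        Path(tmp_name).unlink(missing_ok=True)
        raise
```

Readers of the checkpoint then see either the previous complete file or the new one, never a partial write.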

Performance Optimization

pyproj carries overhead on every invocation, so the cheapest normalization is the one you skip. For batches of hundreds of records, use vectorized pandas string operations to pre-filter blank and obviously invalid CRS strings, then call normalize_crs only on the remaining candidates. Those remaining calls still run per value through Series.map, so the gain comes from the filtering rather than from vectorizing pyproj itself.

import pandas as pd

def normalize_crs_batch(raw_crs_series: pd.Series) -> pd.Series:
    """
    Vectorized CRS normalization: pre-filter blanks and known-bad patterns,
    then call normalize_crs only on plausible candidates.
    """
    # Mask for non-empty, non-null, non-trivially-invalid strings
    valid_mask = (
        raw_crs_series.notna()
        & (raw_crs_series.str.strip() != "")
        & (~raw_crs_series.str.upper().isin(["UNKNOWN", "N/A", "NONE"]))
    )

    result = pd.Series(index=raw_crs_series.index, dtype="object")
    result[~valid_mask] = None

    # Only call pyproj for plausible CRS strings
    result[valid_mask] = raw_crs_series[valid_mask].map(normalize_crs)
    return result
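Scraped CRS strings tend to repeat heavily within a portal, so another option is to resolve each distinct string only once. The sketch below assumes the normalizer is passed in as a callable (in this guide that would be `normalize_crs`); the function name `normalize_unique` is hypothetical.

```python
from typing import Callable, Optional


def normalize_unique(
    values: list[Optional[str]],
    normalizer: Callable[[str], Optional[str]],
) -> list[Optional[str]]:
    """Call normalizer once per distinct non-blank string and reuse the result."""
    cache: dict[str, Optional[str]] = {}
    out: list[Optional[str]] = []
    for raw in values:
        key = raw.strip() if isinstance(raw, str) else ""
        if not key:
            out.append(None)  # blanks and nulls never reach the normalizer
            continue
        if key not in cache:
            cache[key] = normalizer(key)
        out.append(cache[key])
    return out
```

With a few thousand rows and a handful of distinct CRS strings, this reduces the number of expensive calls to the number of distinct values.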

For attribute harmonization of the scraped schema against downstream targets, see Attribute Mapping and Schema Harmonization — the same column-name normalization patterns apply when scraped metadata columns need to align with an existing data catalog.

Benchmark on a representative sample of 500 URLs before committing to a crawl schedule. On a standard laptop with a 50 ms network latency, the five-stage pipeline completes in roughly 4–6 seconds per URL with a single thread; parallelizing with concurrent.futures.ThreadPoolExecutor(max_workers=4) achieves near-linear throughput gains without triggering rate limits on most government infrastructure.
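A minimal sketch of the thread-pool approach follows. The `worker` callable stands in for whatever fetches and parses one URL (for example, a wrapper around `fetch_page` and `extract_spatial_fields`); the function name `crawl_concurrently` and its return shape are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def crawl_concurrently(
    urls: list[str],
    worker: Callable[[str], dict],
    max_workers: int = 4,
) -> tuple[list[dict], dict[str, Exception]]:
    """Run worker(url) across a small thread pool; collect results and failures."""
    results: list[dict] = []
    failures: dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results.append(future.result())
            except Exception as exc:
                # Keep crawling; the caller decides whether to retry or quarantine
                failures[url] = exc
    return results, failures
```

Results arrive in completion order rather than input order, so downstream code should key on `source_url`. Since `requests.Session` is not documented as thread-safe, giving each worker thread its own session is the more cautious choice.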

Integration into ETL Pipelines

Scraped spatial metadata rarely exists in isolation. It serves as the routing layer for automated ingestion pipelines: normalized CRS values trigger conditional reprojection logic, bounding boxes drive tile-based download scheduling, and temporal extents feed incremental-load logic in orchestrated DAGs.
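As a rough illustration of the conditional reprojection routing described above, the decision can be reduced to a small pure function. The action labels and the `TARGET_CRS` constant are hypothetical; a real pipeline would map them onto its own task names.

```python
from typing import Optional

TARGET_CRS = "EPSG:4326"  # assumed pipeline-wide target; adjust to your catalog


def plan_crs_action(normalized_crs: Optional[str], target_crs: str = TARGET_CRS) -> str:
    """Decide what the ingestion step should do with a dataset's CRS."""
    if not normalized_crs:
        return "quarantine"   # CRS could not be resolved; needs manual review
    if normalized_crs == target_crs:
        return "passthrough"  # already in the target CRS
    return "reproject"        # resolvable but different; schedule a reprojection
```

Keeping the decision separate from the reprojection itself makes the routing easy to unit-test without any geometry in hand.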

Schema enforcement hooks — wrap spatial_metadata_schema.validate() in a try/except that routes invalid records to a dead-letter Parquet file rather than raising and halting the pipeline:

import pandera.errors as pa_errors

def validate_or_quarantine(df: pd.DataFrame, dlq_path: str) -> tuple[pd.DataFrame, int]:
    """
    Validate df against the spatial metadata schema.
    Valid rows are returned; invalid rows are written to dlq_path (dead-letter).
    Returns (valid_df, quarantine_count).
    """
    try:
        valid_df = spatial_metadata_schema.validate(df, lazy=True)
        return valid_df, 0
    except pa_errors.SchemaErrors as exc:
        # Schema-level failures (e.g. a missing column) carry no row index; skip them here
        failed_index = exc.failure_cases["index"].dropna().unique()
        invalid_df = df.loc[df.index.isin(failed_index)]
        valid_df   = df.loc[~df.index.isin(failed_index)]
        invalid_df.to_parquet(dlq_path, index=False, engine="pyarrow")
        log.warning("Quarantined %d invalid metadata records to %s", len(invalid_df), dlq_path)
        return valid_df, len(invalid_df)

Downstream routing — once normalized, metadata can drive asset downloads. Extracted bounding boxes and temporal ranges map directly into STAC item manifests, as demonstrated in Syncing STAC Catalogs with pystac-client. For teams managing bulk satellite imagery downloads, the scraped bounding box becomes the AOI filter that scopes each download request.
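When a downstream search expects a geometry rather than four numbers, the stored bounding box can be expanded into a GeoJSON polygon. This is a minimal sketch; the function name `bbox_to_geojson_polygon` is hypothetical, and it assumes the box is already in lon/lat order and does not cross the antimeridian.

```python
def bbox_to_geojson_polygon(bbox: list[float]) -> dict:
    """Convert [minx, miny, maxx, maxy] into a GeoJSON Polygon with one closed ring."""
    minx, miny, maxx, maxy = bbox
    ring = [
        [minx, miny],
        [maxx, miny],
        [maxx, maxy],
        [minx, maxy],
        [minx, miny],  # GeoJSON rings repeat the first position to close
    ]
    # Counter-clockwise winding, as recommended for exterior rings
    return {"type": "Polygon", "coordinates": [ring]}
```

The resulting dictionary is the kind of geometry that an intersects-style spatial filter typically accepts, while the original four-value list remains suitable for bbox-style filters.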

CI/CD embedding — run the scraper as a scheduled step in your orchestration DAG (Airflow, Prefect, or Dagster). Store the output Parquet in object storage with a date-partitioned prefix (s3://bucket/metadata/year=2026/month=06/) to enable incremental loads and time-travel queries.
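The date-partitioned prefix can be derived from the run timestamp rather than hardcoded. A small sketch, assuming Hive-style `year=`/`month=` keys as in the example path; the function name `partition_prefix` and the bucket path are placeholders.

```python
from datetime import datetime, timezone


def partition_prefix(base: str, when: datetime) -> str:
    """Build a year=/month= prefix under base for a timezone-aware timestamp."""
    when_utc = when.astimezone(timezone.utc)  # partition on UTC to avoid local-time drift
    return f"{base.rstrip('/')}/year={when_utc:%Y}/month={when_utc:%m}/"


prefix = partition_prefix(
    "s3://bucket/metadata",
    datetime(2026, 6, 15, tzinfo=timezone.utc),
)
print(prefix)  # s3://bucket/metadata/year=2026/month=06/
```

Passing the scheduler's logical run date instead of the wall-clock time keeps backfills writing to the partition they belong to.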

Cross-module consistency — the parser_version column written at persist time ensures that changes to selector logic or normalization rules are traceable in the lineage of every downstream asset that consumed a particular metadata batch.

Failure Mode Reference

Failure Mode	Root Cause	Mitigation Strategy
`CRSError: Invalid CRS` from pyproj	Portal emits a WKT fragment or misspelled EPSG string	Run regex fallback to extract numeric EPSG code; fall back to `None` and log the raw value
Empty bounding box after parse	Portal renders extent client-side via JavaScript	Probe the `?f=json` or REST endpoint before HTML scraping; escalate to Playwright if both fail
Mojibake in CRS or title fields	`Content-Type` header encoding mismatch	Use `response.apparent_encoding`; maintain `KNOWN_PORTAL_ENCODINGS` per hostname
Coordinate order swapped silently	Portal lists lat/lon rather than lon/lat	Apply axis-order heuristic against resolved CRS; log swap events for audit
Pipeline stall on rate-limited portal	Too many requests without backoff	Use `Retry` with `backoff_factor=1`; add `time.sleep(random.uniform(1, 3))` between pages
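
The randomized pause suggested in the last row can be wrapped in a small generator so the crawl loop stays uncluttered. This is a sketch only; the name `paced` and the injectable `sleep` parameter (useful for tests) are assumptions.

```python
import random
import time
from typing import Callable, Iterable, Iterator


def paced(
    urls: Iterable[str],
    min_delay: float = 1.0,
    max_delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield urls one at a time, sleeping a random interval between them."""
    for position, url in enumerate(urls):
        if position:  # no pause before the first request
            sleep(random.uniform(min_delay, max_delay))
        yield url
```

A loop written as `for url in paced(all_urls):` then spaces requests out without any explicit `time.sleep` calls in the body.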

Mastering Geospatial Data Ingestion in Python — parent overview covering all ingestion source types and pipeline stages
Parsing ISO 19115 Metadata with OWSLib — extract bounding box, CRS, and temporal extent from ISO 19115/19139 XML and CSW responses
Automating Government Portal Downloads — scheduled batch downloads from the same portal infrastructure covered here
Fetching OSM Data via the Overpass API — structured alternative to HTML scraping for OpenStreetMap-derived datasets
Syncing STAC Catalogs with pystac-client — consume the normalized bounding boxes and temporal extents produced by this workflow
CRS Normalization Across Mixed Datasets — deeper treatment of pyproj-based reprojection once metadata is ingested