Why do I get 403 Forbidden errors even with a valid API key?

Government portals often enforce IP allowlists, require a Referer or User-Agent header, or rotate OAuth2 tokens on short cycles. Verify your IP is registered, inject the expected headers, and refresh tokens before expiry.

How do I resume interrupted large shapefile downloads?

Check for an Accept-Ranges: bytes header in the server response. If present, read the size of any partial file on disk and send a Range: bytes={offset}- header on the retry request, opening the file in append-binary mode.

How should I handle schema drift when a government dataset changes column names?

Compare incoming column sets against a stored baseline contract. Log the delta, quarantine the file for review, and optionally apply a rename mapping before passing data downstream. Store the baseline in version control alongside your pipeline code.

This guide is part of Mastering Geospatial Data Ingestion in Python.

Automating Government Portal Downloads

Government agencies publish critical spatial datasets through highly fragmented infrastructure: legacy FTP endpoints, custom HTML portals, CKAN instances, and ArcGIS REST services that change without notice. Naive download scripts fail silently on token expiry, return partial GeoPackages after a proxy timeout, and produce pipelines that no one dares rerun. This guide builds a production-ready, six-stage Python architecture that handles dynamic authentication, paginated APIs, chunked resumable transfers, and post-download schema validation — turning brittle one-off downloads into reliable, auditable ingestion workflows.

Problem Framing: Why One-Off Download Scripts Break in Production

The most common failure pattern is a script that works once, manually, and then silently diverges from reality. Government portals present several compounding challenges that trip up simple requests.get() approaches:

Token lifecycles: OAuth2 access tokens issued by data.gov portals or state GIS hubs typically expire in 30–60 minutes. A script that fetches a token at startup and runs for two hours will succeed until the token expires and then fail on every later request, producing empty files or swallowed 401 responses.
Opaque pagination: ArcGIS REST Feature Services cap result sets at 1,000 or 2,000 records per request and communicate pagination through exceededTransferLimit: true rather than a standard Link: rel="next" header. Scripts that miss this flag ingest only the first page and treat the dataset as complete.
Proxy-induced truncation: Government infrastructure commonly routes downloads through HTTP proxies that time out long-running connections. A 2 GB statewide parcel shapefile may arrive as a 340 MB truncated ZIP with no error code — just a quietly corrupt archive.
Schema drift: Dataset maintainers rename columns, change CRS assignments, or split geometry types between annual releases with no changelog. Downstream joins and rasterizations fail months later, long after the broken data entered the lake.

Unlike fetching OSM data via the Overpass API, which targets a single, well-documented endpoint with a known query language, government portal ingestion must handle five or six distinct API dialects and an essentially unlimited variety of legacy systems.

Prerequisites & Environment

Python and library requirements

pip install requests==2.32.3 beautifulsoup4==4.12.3 tenacity==8.5.0 \
    geopandas==1.0.1 lxml==5.3.0 pyproj==3.7.0

Verify GEOS and PROJ are linked correctly before running spatial validation steps:

import geopandas as gpd
import pyproj
import shapely
print(gpd.__version__)              # expect 1.0+
print(pyproj.proj_version_str)      # PROJ runtime version, expect 9.x
print(shapely.geos_version_string)  # GEOS runtime version

Environment variables

Store all credentials and endpoint roots outside version control. A secrets manager (HashiCorp Vault, AWS Secrets Manager) or a .env file excluded from git injects these at runtime:

export GOV_PORTAL_BASE_URL="https://data.example.gov/api/3"
export GOV_API_KEY="your_api_key_here"
export DOWNLOAD_DIR="./data/raw/government"
export MAX_RETRIES=4
export CHUNK_SIZE=65536
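
The code samples later in this guide hard-code their retry count and chunk size, so the variables above only take effect if something reads them. One possible way to do that is sketched below. The `PortalSettings` class and `load_settings` helper are illustrative names invented for this example, and the fallback defaults simply mirror the export lines above.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class PortalSettings:
    """Hypothetical container for the environment variables listed above."""
    base_url: str
    api_key: str
    download_dir: Path
    max_retries: int
    chunk_size: int


def load_settings(env: Mapping[str, str] = os.environ) -> PortalSettings:
    """Read settings from a mapping (os.environ by default) and fail fast on gaps."""
    missing = [k for k in ("GOV_PORTAL_BASE_URL", "GOV_API_KEY") if not env.get(k)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {missing}")
    return PortalSettings(
        base_url=env["GOV_PORTAL_BASE_URL"].rstrip("/"),
        api_key=env["GOV_API_KEY"],
        download_dir=Path(env.get("DOWNLOAD_DIR", "./data/raw/government")),
        max_retries=int(env.get("MAX_RETRIES", "4")),
        chunk_size=int(env.get("CHUNK_SIZE", "65536")),
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to unit-test, and the resulting `chunk_size` can be handed to any function that accepts a chunk-size argument.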

Prerequisite knowledge

HTTP session management, cookie persistence, and header injection
RESTful pagination patterns (offset/limit, cursor tokens, ArcGIS resultOffset)
Spatial format identification: GeoJSON, Shapefile (ZIP), GeoPackage, KML, WFS
Idempotent file writes and SHA-256 content addressing

Version and Compatibility Matrix

Library	Minimum version	Recommended	Known caveats
`requests`	2.28	2.32	Earlier versions lack `PreparedRequest.body` streaming controls
`tenacity`	8.2	8.5	`retry_if_exception_type` behaviour changed in 8.0; upgrade if on 7.x
`geopandas`	0.14	1.0+	1.0 requires Shapely 2.0+ and reads files through `pyogrio` by default; `GeoSeries.make_valid()` and `shapely.make_valid()` both remain available
`lxml`	4.9	5.3	Required for `BeautifulSoup` XML mode; 4.x has CVE fixes pending
`pyproj`	3.5	3.7	PROJ data bundle must match version; mismatched bundles silently misproject

Step-by-Step Implementation

Step 1 — Discover Endpoint Capabilities

Before issuing any download request, determine whether the portal exposes a machine-readable API or requires HTML scraping. Query standard capability endpoints in priority order:

import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

PORTAL_BASE = os.environ["GOV_PORTAL_BASE_URL"]

def detect_portal_type(session: requests.Session, base_url: str) -> str:
    """
    Return 'ckan', 'arcgis', 'wfs', or 'html' based on capability probes.
    """
    probes = {
        "ckan":   f"{base_url}/action/status_show",
        "arcgis": f"{base_url}/rest/info?f=json",
        "wfs":    f"{base_url}?service=WFS&request=GetCapabilities",
    }
    for portal_type, url in probes.items():
        try:
            resp = session.get(url, timeout=10)
            if resp.status_code == 200:
                return portal_type
        except requests.exceptions.RequestException:
            continue
    return "html"

def extract_download_links_from_html(session: requests.Session, page_url: str) -> list[str]:
    """
    Fall back to BeautifulSoup when no structured API is found.
    Targets <a href> pointing at known spatial extensions.
    """
    spatial_exts = {".geojson", ".gpkg", ".shp", ".zip", ".kml", ".gml"}
    resp = session.get(page_url, timeout=20)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")
    links = []
    for tag in soup.find_all("a", href=True):
        href: str = tag["href"]
        if any(href.lower().endswith(ext) for ext in spatial_exts):
            # urljoin resolves relative, root-relative, and absolute hrefs correctly
            links.append(urljoin(page_url, href))
    return links

When the portal returns HTML, inspect <meta> tags and inline <script> blocks — many government sites embed JSON-LD or hidden API endpoints that bypass the need for CSS-selector scraping.
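
As an illustration of the JSON-LD route, the sketch below pulls download URLs out of `<script type="application/ld+json">` blocks using only the standard library. It assumes the portal follows the schema.org Dataset convention of listing downloads under `distribution` with a `contentUrl` field; not every portal does, and the class and function names here are made up for the example.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: list = []
        self.documents: list = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                parsed = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # Skip malformed blocks instead of failing the whole page
            self.documents.extend(parsed if isinstance(parsed, list) else [parsed])


def find_distribution_urls(html: str) -> list:
    """Return contentUrl values from schema.org-style 'distribution' entries."""
    parser = JsonLdExtractor()
    parser.feed(html)
    urls = []
    for doc in parser.documents:
        if not isinstance(doc, dict):
            continue
        dists = doc.get("distribution", [])
        if isinstance(dists, dict):
            dists = [dists]  # A single distribution may be given as an object
        for dist in dists:
            if isinstance(dist, dict) and dist.get("contentUrl"):
                urls.append(dist["contentUrl"])
    return urls
```

Feeding it `resp.text` from the page request returns an empty list when no JSON-LD is present, so the anchor-tag scraper can remain the fallback.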

Step 2 — Authenticate and Build a Persistent Session

Establish a requests.Session to reuse TCP connections and maintain cookie state. Implement token refresh so OAuth2 credentials are renewed before they expire:

import time
from dataclasses import dataclass, field

@dataclass
class TokenCache:
    access_token: str = ""
    expires_at: float = 0.0

    def is_valid(self, buffer_seconds: int = 120) -> bool:
        return bool(self.access_token) and time.time() < (self.expires_at - buffer_seconds)

def build_authenticated_session(
    base_url: str,
    api_key: str | None = None,
    oauth_token_url: str | None = None,
    client_id: str | None = None,
    client_secret: str | None = None,
) -> tuple[requests.Session, TokenCache]:
    session = requests.Session()
    session.headers.update({
        "User-Agent": "GeospatialETL/1.0 (contact: pipeline@example.org)",
        "Accept": "application/json, application/geo+json, */*",
    })
    token_cache = TokenCache()

    if api_key:
        # CKAN-style or generic key-in-header auth
        session.headers["X-CKAN-API-Key"] = api_key

    if oauth_token_url and client_id and client_secret:
        resp = session.post(oauth_token_url, data={
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
        }, timeout=15)
        resp.raise_for_status()
        payload = resp.json()
        token_cache.access_token = payload["access_token"]
        token_cache.expires_at = time.time() + payload.get("expires_in", 3600)
        session.headers["Authorization"] = f"Bearer {token_cache.access_token}"

    return session, token_cache

Step 3 — Construct Queries and Handle Pagination

Government APIs rarely return complete datasets in a single response. Build filter expressions using portal-specific syntax and iterate until the dataset is exhausted:

import logging
from typing import Iterator

logger = logging.getLogger(__name__)

def iter_ckan_resources(
    session: requests.Session,
    base_url: str,
    dataset_id: str,
) -> Iterator[dict]:
    """
    Yield individual CKAN resource metadata dicts for one dataset.
    package_show returns every resource in a single response, so no paging
    loop is needed here; rows/start belong to package_search.
    """
    resp = session.get(f"{base_url}/action/package_show", params={"id": dataset_id}, timeout=30)
    resp.raise_for_status()
    body = resp.json()
    if not body.get("success", False):
        raise RuntimeError(f"CKAN package_show failed for {dataset_id}: {body.get('error')}")
    resources: list[dict] = body.get("result", {}).get("resources", [])
    logger.info("CKAN dataset %s lists %d resources", dataset_id, len(resources))
    yield from resources

def iter_arcgis_features(
    session: requests.Session,
    layer_url: str,
    where_clause: str = "1=1",
    out_fields: str = "*",
    result_offset: int = 0,
    page_size: int = 1000,
) -> Iterator[dict]:
    """
    Yield ArcGIS REST feature dicts, handling exceededTransferLimit.
    """
    while True:
        params = {
            "where": where_clause,
            "outFields": out_fields,
            "f": "geojson",
            "resultOffset": result_offset,
            "resultRecordCount": page_size,
        }
        resp = session.get(f"{layer_url}/query", params=params, timeout=60)
        resp.raise_for_status()
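        body = resp.json()
        if "error" in body:
            # ArcGIS REST often reports failures inside a 200 response body
            raise RuntimeError(f"ArcGIS query failed: {body['error']}")
        features: list[dict] = body.get("features", [])
        if not features:
            break
        for feat in features:
            yield feat
        # The flag is top-level in Esri JSON; some GeoJSON responses nest it under "properties"
        exceeded = body.get("exceededTransferLimit") or (
            body.get("properties") or {}
        ).get("exceededTransferLimit")
        if not exceeded:
            break
        result_offset += len(features)
        logger.debug("ArcGIS page complete, next offset=%d", result_offset)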

Handling authentication tokens for ArcGIS REST services covers the full OAuth2 and token-generation flow for secured ArcGIS Feature Services in more detail.
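
The feature iterator above yields individual GeoJSON Feature dicts, which still need to land on disk before the later validation step can open them. A minimal sketch of one way to do that follows; `write_feature_collection` is a hypothetical helper, not part of any library, and it streams features to the file so the full layer never has to sit in memory.

```python
import json
from pathlib import Path
from typing import Iterable


def write_feature_collection(features: Iterable[dict], dest: Path) -> int:
    """Stream GeoJSON Feature dicts into one FeatureCollection file; return the count."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with dest.open("w", encoding="utf-8") as fh:
        fh.write('{"type": "FeatureCollection", "features": [')
        for feature in features:
            if count:
                fh.write(",")  # Separator between consecutive features
            json.dump(feature, fh)
            count += 1
        fh.write("]}")
    return count
```

Comparing the returned count against the service's reported record count, where the service exposes one, is a cheap check that pagination did not stop early.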

Step 4 — Chunked and Resumable Download

Large spatial files — statewide parcel shapefiles, LiDAR point cloud archives — frequently exceed memory limits and proxy timeout windows. Enable streaming mode and write fixed-size chunks, resuming from any partial file when the server advertises Accept-Ranges: bytes:

import hashlib
from pathlib import Path
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
import requests.exceptions as req_exc

@retry(
    stop=stop_after_attempt(4),
    wait=wait_exponential(multiplier=1, min=2, max=60),
    retry=retry_if_exception_type((req_exc.ConnectionError, req_exc.Timeout, req_exc.ChunkedEncodingError)),
    reraise=True,
)
def download_with_resume(
    url: str,
    dest: Path,
    session: requests.Session,
    chunk_size: int = 65536,
) -> str:
    """
    Stream-download url to dest, resuming from existing partial file.
    Returns the hex SHA-256 digest of the completed file.
    """
    headers: dict[str, str] = {}
    dest.parent.mkdir(parents=True, exist_ok=True)

    existing_size = dest.stat().st_size if dest.exists() else 0
    if existing_size > 0:
        # Probe server support for range requests (HEAD does not follow redirects by default)
        head = session.head(url, timeout=10, allow_redirects=True)
        if head.headers.get("Accept-Ranges", "").lower() == "bytes":
            headers["Range"] = f"bytes={existing_size}-"
            logger.info("Resuming %s from byte %d", url, existing_size)

    resp = session.get(url, headers=headers, stream=True, timeout=60)
    if resp.status_code == 416 and "Range" in headers:
        # Requested offset is at or past the end of the file: nothing left to fetch
        logger.info("File already complete: %s", dest)
    else:
        resp.raise_for_status()
        # Append only when the server honored the Range header (206 Partial Content);
        # a plain 200 carries the whole file and must overwrite the partial copy
        write_mode = "ab" if resp.status_code == 206 else "wb"
        with dest.open(write_mode) as fh:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                if chunk:
                    fh.write(chunk)

    # Compute SHA-256 over the full file for integrity tracking
    sha256 = hashlib.sha256()
    with dest.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            sha256.update(chunk)
    digest = sha256.hexdigest()
    logger.info("Download complete: %s  sha256=%s", dest, digest)
    return digest

With reraise=True, tenacity re-raises the last underlying exception once all retries are exhausted instead of wrapping it in a tenacity.RetryError, so callers can catch the original requests exception directly.
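
Resuming is only safe when the server actually honors the Range header. A 206 response normally carries a Content-Range header of the form `bytes start-end/total`, and checking that `start` equals the size of the partial file guards against appending misaligned bytes. The helpers below are a sketch of that check; the names `ContentRange`, `parse_content_range`, and `resume_is_safe` are invented for this example.

```python
import re
from typing import NamedTuple, Optional


class ContentRange(NamedTuple):
    start: int
    end: int
    total: Optional[int]  # None when the server reports "*" for an unknown length


_CONTENT_RANGE_RE = re.compile(r"^bytes (\d+)-(\d+)/(\d+|\*)$")


def parse_content_range(value: str) -> ContentRange:
    """Parse a 'bytes start-end/total' Content-Range header value."""
    match = _CONTENT_RANGE_RE.match(value.strip())
    if match is None:
        raise ValueError(f"Unrecognized Content-Range header: {value!r}")
    start, end, total = match.groups()
    return ContentRange(int(start), int(end), None if total == "*" else int(total))


def resume_is_safe(existing_size: int, status_code: int, content_range: Optional[str]) -> bool:
    """True only if the server answered 206 and the range starts where the partial file ends."""
    if status_code != 206 or not content_range:
        return False
    return parse_content_range(content_range).start == existing_size
```

When `total` is known, comparing it with the final file size after the download gives a second, independent signal that the transfer was not truncated.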

Step 5 — Validate Format and Extract Metadata

Verify file integrity immediately after each download. Spatial format validation catches unreadable files, CRS mismatches, and unexpected geometry types before bad data propagates downstream. Apply geometry repair with Shapely and GeoPandas if the validity check reports invalid geometries:

import json
import shapely
import geopandas as gpd
from pathlib import Path

def validate_spatial_file(
    path: Path,
    expected_crs: str | None = "EPSG:4326",
    expected_geom_types: set[str] | None = None,
) -> dict:
    """
    Open the file with geopandas, check CRS and geometry types.
    Returns a validation summary dict; raises ValueError on critical failures.
    """
    gdf = gpd.read_file(path)
    summary: dict = {
        "path": str(path),
        "rows": len(gdf),
        "columns": list(gdf.columns),
        "crs": str(gdf.crs),
        "geom_types": list(gdf.geometry.geom_type.unique()),
        "invalid_geom_count": int((~shapely.is_valid(gdf.geometry.values)).sum()),
    }

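    # Compare CRS objects rather than strings so a differently formatted
    # definition of the same CRS is not rejected on spelling alone
    if expected_crs and (gdf.crs is None or not gdf.crs.equals(expected_crs)):
        raise ValueError(f"CRS mismatch: expected {expected_crs}, got {gdf.crs}")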

    if expected_geom_types:
        actual = set(summary["geom_types"])
        if not actual.issubset(expected_geom_types):
            raise ValueError(f"Unexpected geometry types: {actual - expected_geom_types}")

    if summary["invalid_geom_count"] > 0:
        import warnings
        warnings.warn(f"{summary['invalid_geom_count']} invalid geometries in {path}")

    return summary

def write_metadata_sidecar(path: Path, metadata: dict) -> None:
    """
    Write a JSON sidecar file adjacent to the spatial asset for lineage tracking.
    """
    sidecar = path.with_suffix(".meta.json")
    with sidecar.open("w") as fh:
        json.dump(metadata, fh, indent=2, default=str)
    logger.info("Metadata sidecar written: %s", sidecar)
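
Because zipped shapefiles are a common delivery format, a cheap archive check before handing the file to GeoPandas can surface truncation early. The following is a rough sketch using only the standard library; `check_zip_integrity` is a hypothetical helper, and the sidecar check assumes the usual rule that a shapefile needs its .shx and .dbf companions.

```python
import zipfile
from pathlib import Path


def check_zip_integrity(path: Path) -> list:
    """Return a list of problems found in a ZIP archive; an empty list means it looks intact."""
    # A truncated ZIP usually loses its central directory, which sits at the end of the file
    if not zipfile.is_zipfile(path):
        return ["not a readable ZIP archive (possibly truncated)"]
    problems = []
    with zipfile.ZipFile(path) as archive:
        bad_member = archive.testzip()  # Reads every member and verifies its CRC
        if bad_member is not None:
            problems.append(f"CRC check failed for member: {bad_member}")
        suffixes = {Path(name).suffix.lower() for name in archive.namelist()}
        if ".shp" in suffixes:
            for required in (".shx", ".dbf"):
                if required not in suffixes:
                    problems.append(f"shapefile component missing: {required}")
    return problems
```

A non-empty result is a reasonable trigger for deleting the partial file and re-running the download rather than attempting a repair.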

Step 6 — Idempotent Storage and Registry

Design the final stage so re-running the pipeline with identical parameters produces no duplication. A lightweight SQLite registry tracks download state and enables incremental updates:

import sqlite3
from datetime import datetime, timezone
from pathlib import Path

def init_registry(db_path: Path) -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS downloads (
            url TEXT PRIMARY KEY,
            local_path TEXT,
            sha256 TEXT,
            last_modified TEXT,
            downloaded_at TEXT,
            validation_status TEXT
        )
    """)
    conn.commit()
    return conn

def is_already_current(conn: sqlite3.Connection, url: str, server_last_modified: str | None) -> bool:
    row = conn.execute(
        "SELECT last_modified, validation_status FROM downloads WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        return False
    last_modified, status = row
    if status != "valid":
        return False  # Earlier attempt failed; try again instead of skipping
    if server_last_modified is None:
        return True  # No Last-Modified to compare; skip re-download
    return last_modified == server_last_modified

def record_download(
    conn: sqlite3.Connection,
    url: str,
    local_path: Path,
    sha256: str,
    last_modified: str | None,
    validation_status: str,
) -> None:
    conn.execute("""
        INSERT OR REPLACE INTO downloads
            (url, local_path, sha256, last_modified, downloaded_at, validation_status)
        VALUES (?, ?, ?, ?, ?, ?)
    """, (
        url, str(local_path), sha256, last_modified,
        datetime.now(timezone.utc).isoformat(), validation_status,
    ))
    conn.commit()

This registry pattern mirrors the asset-tracking approach used when syncing STAC catalogs with pystac-client, though government portals typically lack STAC’s standardized item schemas and require custom metadata mapping.
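
One complementary way to keep file writes idempotent is content addressing: name the stored file after its SHA-256 digest so that identical bytes always map to the same path. The sketch below assumes a download is first written to a temporary location and then promoted into the store; `promote_to_store` and the two-character shard directory are illustrative choices rather than an established convention of this pipeline.

```python
import hashlib
import os
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in fixed-size chunks so large archives do not load into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def promote_to_store(part_path: Path, store_dir: Path) -> Path:
    """Move a finished download into a content-addressed location; safe to repeat."""
    digest = sha256_of(part_path)
    final_path = store_dir / digest[:2] / f"{digest}{part_path.suffix}"
    final_path.parent.mkdir(parents=True, exist_ok=True)
    if final_path.exists():
        part_path.unlink()  # Identical content already stored; drop the duplicate
    else:
        os.replace(part_path, final_path)  # Atomic when both paths share a filesystem
    return final_path
```

Recording the returned path in the registry's local_path column would let two URLs that serve identical bytes share one stored file.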

Advanced Patterns and Edge Cases

OAuth2 Token Refresh Mid-Pipeline

Long-running ingestion jobs (downloading an entire state’s parcel database, for example) outlast a single OAuth2 token. Wrap every request in a token-aware retry that refreshes credentials on 401 before retrying the original call:

def make_request_with_token_refresh(
    session: requests.Session,
    token_cache: TokenCache,
    url: str,
    token_url: str,
    client_id: str,
    client_secret: str,
    _retried: bool = False,
    **kwargs,
) -> requests.Response:
    if not token_cache.is_valid():
        resp = session.post(token_url, data={
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
        }, timeout=15)
        resp.raise_for_status()
        payload = resp.json()
        token_cache.access_token = payload["access_token"]
        token_cache.expires_at = time.time() + payload.get("expires_in", 3600)
        session.headers["Authorization"] = f"Bearer {token_cache.access_token}"

    response = session.get(url, **kwargs)
    if response.status_code == 401 and not _retried:
        # Force one refresh and retry; a second 401 falls through to raise_for_status()
        token_cache.expires_at = 0.0
        return make_request_with_token_refresh(
            session, token_cache, url, token_url, client_id, client_secret,
            _retried=True, **kwargs
        )
    response.raise_for_status()
    return response

Schema Drift Detection and Quarantine

Government datasets frequently change column names, add or remove attributes, or switch CRS assignments between annual releases. A schema contract stored in version control catches these surprises at ingestion time:

import json
from pathlib import Path
import geopandas as gpd

def check_schema_drift(
    gdf: gpd.GeoDataFrame,
    contract_path: Path,
) -> dict:
    """
    Compare gdf columns against a JSON schema contract.
    Returns a dict with 'added' and 'removed' keys; both lists are
    empty when the columns match the contract.
    """
    with contract_path.open() as fh:
        contract: dict = json.load(fh)

    baseline_cols: set[str] = set(contract.get("columns", []))
    actual_cols: set[str] = set(gdf.columns) - {"geometry"}

    drift = {
        "added":   sorted(actual_cols - baseline_cols),
        "removed": sorted(baseline_cols - actual_cols),
    }

    if drift["added"] or drift["removed"]:
        # The caller decides whether to quarantine the file for manual review
        logger.warning("Schema drift detected: %s", drift)

    return drift

When drift is detected, log the delta and move the file to a quarantine directory rather than failing the entire pipeline run. Downstream consumers can process clean assets while an operator reviews the changed dataset.
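
The quarantine step itself can be quite small. The sketch below moves the file into a timestamped folder and stores the drift report beside it; `quarantine_file`, the folder layout, and the `.drift.json` suffix are assumptions made for illustration, not conventions defined elsewhere in this guide.

```python
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def quarantine_file(path: Path, quarantine_dir: Path, drift: dict) -> Path:
    """Move a drifted file aside and write the drift report next to it."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target_dir = quarantine_dir / stamp
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / path.name
    shutil.move(str(path), str(target))
    # Keep the reason for quarantine with the file so a reviewer needs no log access
    report = target_dir / f"{path.name}.drift.json"
    report.write_text(json.dumps(drift, indent=2), encoding="utf-8")
    return target
```

A caller would typically invoke this only when the drift dict has a non-empty 'added' or 'removed' list, then continue with the next resource.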

WFS GetFeature Pagination with OGC Cursors

OGC Web Feature Services use startIndex and count parameters rather than offset/limit. Some implementations also support cursor-based pagination via next links in the response XML:

from typing import Iterator

def iter_wfs_features(
    session: requests.Session,
    wfs_url: str,
    type_name: str,
    page_size: int = 500,
) -> Iterator[dict]:
    """
    Yield GeoJSON feature dicts from a WFS 2.0 endpoint, handling startIndex pagination.
    Assumes the server supports outputFormat=application/json.
    """
    start_index = 0
    while True:
        params = {
            "service": "WFS",
            "version": "2.0.0",
            "request": "GetFeature",
            "typeNames": type_name,  # WFS 2.0 spelling; 1.x used typeName
            "outputFormat": "application/json",
            "count": page_size,
            "startIndex": start_index,
        }
        resp = session.get(wfs_url, params=params, timeout=60)
        resp.raise_for_status()
        body = resp.json()
        features = body.get("features", [])
        if not features:
            break
        yield from features
        start_index += len(features)
        # WFS 2.0 signals end via numberReturned < count
        if body.get("numberReturned", page_size) < page_size:
            break

Production Resilience: Rate Limiting and Compliance

Rate limiting with tenacity and Retry-After headers

Aggressive polling triggers IP bans. The wait_exponential strategy passed to the @retry decorator handles most transient failures; additionally, inspect Retry-After headers on 429 responses:

from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
import requests.exceptions as req_exc

@retry(
    stop=stop_after_attempt(4),
    wait=wait_exponential(multiplier=1, min=2, max=60),
    retry=retry_if_exception_type((req_exc.ConnectionError, req_exc.Timeout, req_exc.HTTPError)),
    reraise=True,
)
def fetch_with_backoff(session: requests.Session, url: str, params: dict | None = None) -> dict:
    response = session.get(url, params=params, timeout=30)
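    if response.status_code == 429:
        # Retry-After may be a number of seconds or an HTTP-date; fall back to 30 s for the latter
        raw_retry_after = response.headers.get("Retry-After", "").strip()
        retry_after = int(raw_retry_after) if raw_retry_after.isdigit() else 30
        logger.info("Rate-limited. Waiting %d seconds.", retry_after)
        time.sleep(retry_after)
        response = session.get(url, params=params, timeout=30)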
    response.raise_for_status()
    return response.json()
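
Retry-After can arrive either as a number of seconds or as an HTTP-date, so a fuller parser may be worth having. The helper below is a sketch that handles both forms with the standard library; `parse_retry_after` and its `now` parameter (included so the function can be tested deterministically) are hypothetical names.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(
    value: Optional[str],
    default: float = 30.0,
    now: Optional[datetime] = None,
) -> float:
    """Convert a Retry-After header (delta-seconds or HTTP-date) to seconds to wait."""
    if not value:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default  # Unparseable header: fall back rather than crash
    if retry_at.tzinfo is None:
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means the wait is already over
    return max(0.0, (retry_at - now).total_seconds())
```

The result can be passed straight to `time.sleep`, ideally capped at some upper bound so a misconfigured server cannot stall the pipeline for hours.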

Data licensing and attribution

Always parse and archive each dataset’s license metadata before storing the asset. Extract publisher, contact_email, and license_url fields during ingestion and write them to the metadata sidecar. This ensures attribution is available when data is republished or used commercially. Many CKAN portals surface these under result.license_id and result.organization.title.
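
For CKAN portals, the attribution fields can be lifted from the `result` object of a package_show response. The sketch below assumes the commonly seen CKAN package keys (`license_id`, `license_title`, `license_url`, `organization.title`, `maintainer_email`, `author_email`); individual portals may omit or rename some of them, so every lookup tolerates a missing key. `extract_attribution` is a name invented for this example.

```python
def extract_attribution(package: dict) -> dict:
    """Pull licensing and attribution fields out of a CKAN package_show 'result' dict."""
    organization = package.get("organization") or {}  # May be null for unowned datasets
    return {
        "license_id": package.get("license_id"),
        "license_title": package.get("license_title"),
        "license_url": package.get("license_url"),
        "publisher": organization.get("title"),
        # Prefer the maintainer contact, falling back to the author
        "contact_email": package.get("maintainer_email") or package.get("author_email"),
    }
```

Merging the returned dict into the metadata written by the sidecar step keeps the license next to the asset it governs.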

Performance Optimization

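Chunk size is one of the easier levers for throughput on large government datasets. The chunk_size argument of download_with_resume controls the read-write loop, and the CHUNK_SIZE environment variable only takes effect if you pass it through; 64 KB (65536) balances memory pressure against system-call overhead for most network conditions. For multi-gigabyte files over a fast internal network, a larger value such as 1 MB (1048576) can reduce wall-clock download time, so measure against your own portal before changing the default.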

Where the same portal serves many datasets, reuse a single requests.Session across all downloads rather than constructing one per URL — TCP connection reuse and HTTP/1.1 keep-alive reduce handshake latency significantly.

Parsing GeoJSON and Shapefile APIs covers memory-efficient GeoJSON streaming and lazy Shapefile reading patterns that apply once the files are on disk.

Integration into ETL Pipelines

Wrap each stage as a callable function that accepts and returns a typed data contract. This makes the pipeline testable in isolation and slots cleanly into Airflow, Prefect, or Dagster task graphs:

from pathlib import Path
import sqlite3

def run_portal_ingestion(
    portal_url: str,
    dataset_id: str,
    output_dir: Path,
    db_path: Path,
    api_key: str | None = None,
) -> list[dict]:
    """
    Full pipeline: discover → authenticate → paginate → download → validate → store.
    Returns a list of validation summary dicts for each downloaded resource.
    """
    conn = init_registry(db_path)
    session, token_cache = build_authenticated_session(portal_url, api_key=api_key)
    results: list[dict] = []

    for resource in iter_ckan_resources(session, portal_url, dataset_id):
        url: str = resource.get("url", "")
        if not url:
            continue

        # Check Last-Modified to skip unchanged files (idempotency).
        # requests does not follow redirects on HEAD unless asked to.
        head = session.head(url, timeout=10, allow_redirects=True)
        last_mod = head.headers.get("Last-Modified")
        if is_already_current(conn, url, last_mod):
            logger.info("Skipping unchanged resource: %s", url)
            continue

        dest = output_dir / Path(url).name
        try:
            sha256 = download_with_resume(url, dest, session)
            summary = validate_spatial_file(dest)
            write_metadata_sidecar(dest, {**summary, "source_url": url, "sha256": sha256})
            record_download(conn, url, dest, sha256, last_mod, "valid")
            results.append(summary)
        except Exception as exc:
            logger.error("Failed to ingest %s: %s", url, exc)
            record_download(conn, url, dest, "", last_mod, f"error: {exc}")

    return results

For dead-letter queue semantics in Airflow, catch exceptions at the task level and push failed URLs to an XCom or secondary database table. A separate retry_failed_downloads task reads that table and re-attempts only the failed resources, rather than reprocessing the entire dataset. Apply CRS normalization across mixed datasets as a post-download transform step before writing to any shared data store.
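
The retry task described above could be as simple as a query against the registry plus a loop. The sketch below assumes the `downloads` table from Step 6 and the convention, used in the orchestration function, that failed rows carry a validation_status beginning with "error". Both function names and the `attempt` callback are hypothetical.

```python
import sqlite3
from typing import Callable


def list_failed_downloads(conn: sqlite3.Connection) -> list:
    """Return (url, validation_status) tuples for rows whose status starts with 'error'."""
    rows = conn.execute(
        "SELECT url, validation_status FROM downloads "
        "WHERE validation_status LIKE 'error%' ORDER BY downloaded_at"
    ).fetchall()
    return [(url, status) for url, status in rows]


def retry_failed_downloads(
    conn: sqlite3.Connection,
    attempt: Callable[[str], bool],
) -> dict:
    """Re-attempt only the failed URLs; `attempt` is caller-supplied and returns True on success."""
    return {url: attempt(url) for url, _ in list_failed_downloads(conn)}
```

In an orchestrator, `attempt` would wrap the download, validate, and record sequence for a single URL, so successful retries update the registry row in place.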

Troubleshooting Common Failure Modes

Symptom	Root Cause	Resolution
`403 Forbidden` on valid API key	IP restriction, expired token, or missing `Referer` / `User-Agent` header	Verify IP allowlists, refresh OAuth2 tokens before expiry, inject expected headers
Incomplete downloads or corrupt ZIP	Server timeout, proxy truncation, or missing Range support	Enable streaming, check `Accept-Ranges: bytes`, implement checksum validation post-download
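`geopandas.read_file` raises `DriverError` (Fiona) or `DataSourceError` (pyogrio)	Truncated or corrupt archive, unrecognized format, or missing Shapefile components	Re-download and compare checksums, inspect the file with `fiona` or `pyogrio`, then apply `shapely.make_valid()` or reproject once the file opens cleanly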
Silent pagination loops	Missing `next` token, malformed `total` count, or `exceededTransferLimit` flag ignored	Add iteration caps, validate response structure, log raw payloads for debugging
Schema validation raises `KeyError` downstream	Government maintainer renamed columns between annual releases	Implement schema drift detection, quarantine mismatched files, update contract on confirmed schema change
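
The iteration cap suggested for silent pagination loops can be factored into a small generic wrapper. The sketch below is one possible shape: `fetch_page` is a caller-supplied function that takes an offset and returns a list of records, and the wrapper aborts when the page count runs past a limit or the server hands back the same page twice in a row. The function name and the default limit are arbitrary choices for illustration.

```python
from typing import Callable, Iterator, Optional


def paginate_with_cap(
    fetch_page: Callable[[int], list],
    max_pages: int = 10_000,
) -> Iterator[dict]:
    """Yield records page by page, raising instead of looping forever."""
    offset = 0
    previous_page: Optional[list] = None
    for _ in range(max_pages):
        page = fetch_page(offset)
        if not page:
            return  # Normal termination: the server has no more records
        if page == previous_page:
            # The server ignored the offset and repeated itself
            raise RuntimeError(f"Server repeated the same page at offset {offset}")
        yield from page
        previous_page = page
        offset += len(page)
    raise RuntimeError(f"Pagination exceeded {max_pages} pages; aborting")
```

Because the fetch function is injected, the same wrapper can sit in front of offset-style endpoints from different portal types, and it can be unit-tested without any network access.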

Mastering Geospatial Data Ingestion in Python — parent overview of all ingestion approaches covered in this series
Detecting Dataset Changes with ETag and Last-Modified — conditional GET so scheduled pipelines skip unchanged portal files
Handling Authentication Tokens for ArcGIS REST Services — full OAuth2 and token-generation flow for secured ArcGIS endpoints
Syncing STAC Catalogs with pystac-client — asset-level download tracking and idempotent catalog sync patterns
CRS Normalization Across Mixed Datasets — reprojection workflow to unify coordinates after ingestion from multiple government sources
Bulk Downloading Satellite Imagery — chunked transfer and resumable download patterns applied to large raster assets