Parsing GeoJSON & Shapefile APIs: A Production-Ready Python Workflow
Parsing GeoJSON & Shapefile APIs remains a foundational requirement for modern geospatial ETL pipelines. While cloud-native formats like Parquet and cloud-optimized GeoTIFFs are gaining traction, municipal portals, environmental agencies, and legacy GIS platforms continue to expose vector data through REST endpoints that return either RFC 7946-compliant GeoJSON or zipped ESRI Shapefiles. For GIS analysts, data engineers, and urban tech teams, building resilient ingestion logic that handles both formats without manual intervention is critical to maintaining automated data pipelines.
This guide outlines a tested, step-by-step workflow for fetching, validating, normalizing, and persisting vector data from public and authenticated APIs. The patterns presented here align with broader strategies for Mastering Geospatial Data Ingestion in Python, emphasizing reproducibility, memory efficiency, and schema enforcement.
Prerequisites
Before implementing the ingestion workflow, ensure your environment meets the following baseline:
- Python 3.9+ with virtual environment isolation
- Core Libraries: requests>=2.31.0, geopandas>=1.0.0, shapely>=2.0.0, pyproj>=3.6.0
- System Dependencies: GDAL/OGR compiled headers (typically resolved via conda install -c conda-forge geopandas or apt install libgdal-dev)
- API Access: Endpoint documentation, rate limit awareness, and optional authentication credentials
Modern geospatial APIs frequently implement pagination, spatial bounding filters, and content negotiation. Understanding how to programmatically negotiate these parameters prevents incomplete downloads, schema drift, and out-of-memory crashes during bulk operations.
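As one illustration of handling pagination, the sketch below pages through a feature endpoint until the provider returns a short page. The `offset` and `limit` parameter names and the `fetch_page` callable are hypothetical; real providers may use cursor tokens or `next` links instead, so check the endpoint documentation.

```python
from typing import Callable, Dict, Iterator, List


def iter_features(
    fetch_page: Callable[[Dict[str, int]], List[dict]],
    page_size: int = 1000,
    max_pages: int = 10_000,
) -> Iterator[dict]:
    """Yield features page by page until the provider returns a short page.

    `fetch_page` is a hypothetical callable that takes query parameters and
    returns the list of GeoJSON features for that page.
    """
    for page in range(max_pages):
        # Assumed parameter names; many APIs use different ones.
        params = {"offset": page * page_size, "limit": page_size}
        features = fetch_page(params)
        yield from features
        if len(features) < page_size:
            return  # Last page reached
    raise RuntimeError("Pagination did not terminate within max_pages")
```

Because the generator yields features one page at a time, the caller can write each batch out before requesting the next, which keeps memory use bounded. The `max_pages` guard is there so a misbehaving endpoint cannot loop forever.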
Step-by-Step Workflow
1. Endpoint Configuration & Request Strategy
Identify whether the API serves GeoJSON natively (application/json or application/geo+json) or packages Shapefiles in .zip archives. Configure base URLs, query parameters, and HTTP headers. For spatially constrained requests, many providers accept bbox parameters formatted as west,south,east,north. When designing spatial filters, refer to Extracting bounding boxes from GeoJSON APIs to ensure coordinate ordering and projection alignment match the provider’s expectations. Misaligned bounding boxes are a common cause of empty result sets or truncated geometries.
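A small helper can catch swapped coordinates before a request is sent. This is a minimal sketch that assumes the provider expects longitude/latitude in degrees (WGS 84); the `bbox_param` name and the `bbox` query key are illustrative and may differ per provider.

```python
def bbox_param(west: float, south: float, east: float, north: float) -> str:
    """Format a bounding box as 'west,south,east,north' in lon/lat degrees."""
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError("Longitudes must be within [-180, 180]")
    if not (-90 <= south <= north <= 90):
        raise ValueError("Latitudes must satisfy -90 <= south <= north <= 90")
    # west > east is allowed: RFC 7946 uses it for boxes crossing the antimeridian
    return f"{west},{south},{east},{north}"


# Hypothetical query parameters for an area around San Francisco
params = {"bbox": bbox_param(-122.52, 37.70, -122.35, 37.83)}
```

Passing latitude where longitude is expected usually trips the range check, which is cheaper to debug than an empty result set from the server.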
2. Fetching & Streaming Responses
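Use requests with stream=True so the body is not downloaded until you read it; this lets you check the status code and Content-Type before committing memory to a large payload. For GeoJSON, parse the JSON incrementally or validate the structure before conversion. For Shapefiles, wrap the binary response in an in-memory buffer and pass the archive to geopandas. Keep in mind that an in-memory archive still holds the entire payload, so very large downloads are better written to disk in chunks.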
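For payloads too large to hold comfortably in memory, one option is to copy the streamed body into a buffer that spills to disk past a size threshold. The sketch below uses the standard library's `tempfile.SpooledTemporaryFile`; the `spool_response` helper and its default sizes are assumptions, and `response` is expected to behave like a `requests.Response` fetched with `stream=True`.

```python
import tempfile
from typing import IO


def spool_response(response, max_memory: int = 50 * 1024 * 1024,
                   chunk_size: int = 1024 * 1024) -> IO[bytes]:
    """Copy a streamed HTTP body into a buffer that spills to disk when large.

    `response` must expose `iter_content(chunk_size=...)`, as
    `requests.Response` does.
    """
    buffer = tempfile.SpooledTemporaryFile(max_size=max_memory)
    for chunk in response.iter_content(chunk_size=chunk_size):
        if chunk:  # Skip empty keep-alive chunks
            buffer.write(chunk)
    buffer.seek(0)  # Rewind so the caller can read from the start
    return buffer
```

With this approach, peak memory stays near `max_memory` plus one chunk regardless of the download size. The caller is responsible for closing the returned buffer.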
When endpoints require credentials, never hardcode tokens. Instead, implement a token-refresh routine or leverage environment variables. For enterprise deployments, consult Handling authentication tokens for ArcGIS REST services to understand token lifecycle management and automatic renewal patterns.
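Reading the token from the environment might look like the sketch below. The `GEO_API_TOKEN` variable name and the Bearer scheme are assumptions; some services expect the token as a query parameter or a custom header instead.

```python
import os
from typing import Dict


def build_auth_headers(env_var: str = "GEO_API_TOKEN") -> Dict[str, str]:
    """Build request headers from a token stored in an environment variable."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    # Bearer scheme is an assumption; confirm it against the provider's docs
    return {"Authorization": f"Bearer {token}"}
```

Failing fast when the variable is missing avoids sending unauthenticated requests that would otherwise surface later as confusing 401 or 403 responses.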
import io
from typing import Optional

import geopandas as gpd
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def fetch_vector_data(url: str, params: Optional[dict] = None, headers: Optional[dict] = None) -> gpd.GeoDataFrame:
    # Retry transient failures with exponential backoff
    retry_strategy = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
    headers = headers or {"Accept": "application/json, application/geo+json, application/zip"}
    # Context managers release the session and the streamed connection when done
    with requests.Session() as session:
        session.mount("https://", HTTPAdapter(max_retries=retry_strategy))
        with session.get(url, params=params, headers=headers, stream=True, timeout=60) as response:
            response.raise_for_status()
            content_type = response.headers.get("Content-Type", "").lower()
            is_zip = "application/zip" in content_type or url.lower().endswith(".zip")
            # Zipped Shapefiles stay binary; GeoJSON text is re-encoded as UTF-8.
            # Either way the full payload is held in memory at this point.
            payload = response.content if is_zip else response.text.encode("utf-8")
    # GDAL detects the format (zip archive or GeoJSON) from the buffer contents
    return gpd.read_file(io.BytesIO(payload))

3. Geometry Validation & Schema Normalization
Raw API responses frequently contain self-intersecting polygons, collapsed lines, or mixed attribute naming conventions. Apply strict validation using shapely.is_valid_reason to diagnose issues, then repair geometries using shapely.make_valid(). Standardizing column names and dropping null geometries early prevents downstream pipeline failures.
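To see what the diagnosis step reports, the sketch below runs `shapely.is_valid_reason` and `shapely.make_valid` on a hand-built "bowtie" polygon. The geometry is an illustrative assumption, not data from any particular API.

```python
import shapely
from shapely.geometry import Polygon

# A "bowtie" polygon whose edges cross at (1, 1)
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2), (0, 0)])

print(shapely.is_valid(bowtie))         # False
print(shapely.is_valid_reason(bowtie))  # Text describing the problem and its location

# make_valid returns a new, valid geometry covering the same area
repaired = shapely.make_valid(bowtie)
print(repaired.geom_type, shapely.is_valid(repaired))
```

Logging the reason string for each invalid feature before repairing it gives a record of how often, and in what way, a provider's data is malformed.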
import geopandas as gpd
import shapely


def normalize_gdf(gdf: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    # Drop rows with missing geometries; copy so later assignments do not touch the input
    gdf = gdf.dropna(subset=["geometry"]).copy()
    # Validate and repair geometries
    invalid_mask = ~gdf.geometry.is_valid
    if invalid_mask.any():
        gdf.loc[invalid_mask, "geometry"] = gdf.loc[invalid_mask, "geometry"].apply(shapely.make_valid)
    # Discard anything still invalid or left empty after repair
    gdf = gdf[gdf.geometry.is_valid & ~gdf.geometry.is_empty]
    # Standardize column names: lowercase, replace spaces with underscores
    gdf.columns = [str(col).lower().replace(" ", "_") for col in gdf.columns]
    return gdf

4. CRS Harmonization & Spatial Indexing
Coordinate Reference System (CRS) mismatches cause silent spatial errors. Always inspect the source CRS, explicitly reproject to a consistent target (e.g., EPSG:4326 for web mapping or a local projected CRS for analysis), and build a spatial index. While raster workflows often rely on Syncing STAC Catalogs with pystac-client for metadata discovery, vector pipelines require explicit CRS enforcement at ingestion time.
TARGET_CRS = "EPSG:4326"


def harmonize_crs(gdf: gpd.GeoDataFrame, target_crs: str = TARGET_CRS) -> gpd.GeoDataFrame:
    if gdf.crs is None:
        raise ValueError("Input GeoDataFrame lacks CRS definition. Assign manually before harmonization.")
    # Compare CRS objects directly; to_epsg() can return None when no EPSG code matches
    if gdf.crs != target_crs:
        gdf = gdf.to_crs(target_crs)
    gdf.sindex  # Build spatial index for fast spatial joins/queries
    return gdf

5. Persistence & Output Formatting
Persist validated data using columnar or spatial database formats rather than raw CSV/JSON. GeoPackage (.gpkg) and Parquet with geopandas’s to_parquet() method offer excellent compression and schema preservation. Partition large datasets by administrative boundaries or temporal windows to optimize query performance.
def persist_data(gdf: gpd.GeoDataFrame, output_path: str, format: str = "parquet"):
    if format == "parquet":
        gdf.to_parquet(output_path, compression="zstd", index=False)
    elif format == "gpkg":
        gdf.to_file(output_path, driver="GPKG", layer="ingested_data", index=False)
    else:
        raise ValueError("Unsupported output format. Use 'parquet' or 'gpkg'.")

6. Error Handling & Pipeline Resilience
Production ingestion must survive transient network failures, malformed payloads, and provider-side schema changes. Wrap the core workflow in a structured exception handler, log diagnostic information, and implement graceful degradation. When dealing with community-driven or highly dynamic endpoints, patterns similar to those used for Fetching OSM Data via Overpass API—such as query validation before execution and strict timeout enforcement—prove invaluable.
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")


def run_ingestion_pipeline(api_url: str, params: dict, output_path: str):
    try:
        logging.info("Fetching vector data from %s", api_url)
        gdf = fetch_vector_data(api_url, params)
        logging.info("Normalizing geometries and schema")
        gdf = normalize_gdf(gdf)
        logging.info("Harmonizing CRS to %s", TARGET_CRS)
        gdf = harmonize_crs(gdf)
        logging.info("Persisting to %s", output_path)
        persist_data(gdf, output_path)
        logging.info("Pipeline completed successfully. %d features ingested.", len(gdf))
    except requests.exceptions.RequestException as e:
        logging.error("Network/API error during fetch: %s", e)
    except ValueError as e:
        logging.error("Data validation error: %s", e)
    except Exception as e:
        logging.exception("Unexpected pipeline failure: %s", e)

Production Checklist
Before deploying this workflow to a scheduler (e.g., Airflow, Prefect, or GitHub Actions), verify the following:
- Memory Profiling: Confirm that stream=True and in-memory zipfile handling do not cause OOM errors on payloads >500MB; spool large downloads to disk if they do.
- Schema Drift Monitoring: Log attribute column counts and types on each run. Alert on unexpected schema changes.
- Idempotency: Ensure repeated runs overwrite or append safely without duplicating features.
- GDAL Version Alignment: Match the I/O engine (pyogrio, the default in geopandas 1.0+, or fiona) and geopandas versions to the system GDAL installation to avoid silent driver failures.
- Rate Limit Compliance: Implement exponential backoff and respect Retry-After headers.
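For the schema drift item, a comparison of column-to-dtype mappings between runs is often enough to raise an alert. The sketch below is one way to do it; the `diff_schema` helper and the example columns are hypothetical. The observed mapping could be built from a GeoDataFrame with `{col: str(dtype) for col, dtype in gdf.dtypes.items()}`.

```python
from typing import Dict, List


def diff_schema(expected: Dict[str, str], observed: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two {column: dtype} mappings and report the differences."""
    shared = set(expected) & set(observed)
    return {
        "missing": sorted(set(expected) - set(observed)),
        "unexpected": sorted(set(observed) - set(expected)),
        "retyped": sorted(col for col in shared if expected[col] != observed[col]),
    }


# Hypothetical schemas from a previous run and the current run
previous = {"parcel_id": "int64", "zone": "object", "geometry": "geometry"}
current = {"parcel_id": "object", "zoning": "object", "geometry": "geometry"}
drift = diff_schema(previous, current)
```

A scheduler task could fail or send an alert whenever any of the three lists is non-empty, which surfaces provider-side changes before they corrupt downstream tables.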
Conclusion
Parsing GeoJSON & Shapefile APIs requires more than a simple HTTP GET and a format converter. By enforcing streaming downloads, rigorous geometry validation, explicit CRS harmonization, and structured error handling, teams can build ingestion pipelines that remain stable across provider updates and scale efficiently. The patterns outlined here provide a reusable foundation for integrating legacy GIS exports into modern data stacks, ensuring that spatial data remains a reliable asset rather than an operational bottleneck.