This guide is part of Automated Vector & Raster Cleaning Workflows.

Spatial Deduplication and Topology Simplification

Raw vector data arriving in automated geospatial ETL pipelines is rarely clean. Feeds from government portals, OSM exports, and sensor networks routinely contain redundant coordinates, overlapping polygon rings, and topological inconsistencies that silently break downstream spatial joins, routing algorithms, and rasterization steps. When a sjoin yields inflated match counts because duplicate parcels exist at the same coordinates, or when a rasterization step crashes on a self-intersecting polygon that wasn’t caught at ingest, the root cause almost always traces back to skipped deduplication and topology repair.

This guide covers the full pipeline: geometry-aware deduplication (points, lines, and polygons), topology-preserving simplification, and the QA gates that confirm you haven’t degraded data beyond acceptable thresholds. Every code block uses the Shapely 2.0 vectorized array API — no .apply() loops.

Prerequisites and Environment

Install the minimum required stack before running any topology routines. The Shapely 2.0 vectorized API underpins every code block on this page; earlier versions use a fundamentally different object-level API and will not be compatible.

pip install "geopandas>=1.0.0" "shapely>=2.0.0" "pyproj>=3.4.0" numpy scipy

Runtime environment requirements:

Python 3.9+: the minimum version supported by GeoPandas 1.0. The examples use only f-strings and standard type hints.
GEOS 3.11+: bundled with shapely>=2.0.0 on all major platforms via the binary wheel; verify with shapely.geos_version.
Projected CRS: all topology operations assume planar (metric) coordinates. Running simplification on geographic WGS84 degrees produces distorted tolerances and invalid outputs.
RAM headroom: for datasets exceeding 500 k features, budget at least 4 GB per worker or switch to Dask-GeoPandas for out-of-core chunked processing.
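
A quick version check can confirm the stack before any topology routine runs. The sketch below assumes only that Shapely is importable; the helper name check_topology_stack and the two threshold constants are illustrative and simply mirror the requirements listed above.

```python
import shapely

# Assumption: thresholds mirror the requirements listed in this section.
MIN_SHAPELY = (2, 0)
MIN_GEOS = (3, 11)

def check_topology_stack() -> dict:
    """Return detected versions and whether they meet the minimums."""
    # Only major.minor is compared, so suffixes like "rc1" are ignored.
    shapely_version = tuple(int(p) for p in shapely.__version__.split(".")[:2])
    geos_version = tuple(shapely.geos_version)  # (major, minor, patch)
    return {
        "shapely": shapely_version,
        "geos": geos_version,
        "ok": shapely_version >= MIN_SHAPELY and geos_version[:2] >= MIN_GEOS,
    }

report = check_topology_stack()
print(report)
```

Failing fast on report["ok"] at worker start-up is usually cheaper than debugging a missing vectorized function halfway through a run.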

Version and Method Compatibility Matrix

GeoPandas	Shapely	Recommended dedup method	Known caveat
0.12–0.14	1.8.x	`.apply(wkb)` + `drop_duplicates`	No vectorized `is_valid`; slow on large frames
0.14–1.0	2.0.x	`shapely.to_wkb(arr)` array call	`sindex.query_bulk` deprecated in favour of `sindex.query`
1.0+	2.0+	`shapely.to_wkb(arr)` + `sindex.query(predicate=)`	Stable API; use this going forward

Problem Framing: Why Naive Approaches Fail

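A naive deduplication loop that calls geom1.equals(geom2) across all feature pairs is O(n²), which is infeasible beyond a few thousand rows. Simplification has its own trap: calling .simplify() with preserve_topology=False can collapse rings or introduce self-intersections, and even with preserve_topology=True each feature is simplified in isolation, so adjacent cadastral parcels can end up with sliver gaps or overlaps along shared boundaries. These are invisible at map scale but cause sjoin overcounting and PostGIS topology validation failures.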

The specific failure modes this page prevents:

Failure Mode	Root Cause	Mitigation
Inflated `sjoin` result counts	Duplicate geometries match the same target feature multiple times	WKB-hash deduplication before join
PostGIS topology exceptions	Self-intersecting rings survive ingest	`shapely.make_valid` pre-repair stage
Sliver gaps after simplification	Adjacent polygons simplified independently	`preserve_topology=True` keeps each polygon valid; snap to a common grid first and use boundary-aware simplification for shared edges
Precision-mismatch false duplicates	Coordinates stored at different decimal precision across sources	Snap to grid before hash comparison
Collapsed features after simplification	Tolerance too large relative to feature size	Post-simplification area drift QA gate

Step 1: Ingest and CRS Validation

Load the raw source and immediately enforce a consistent projected CRS. Mixing geographic (EPSG:4326) and projected data in the same frame causes all distance-based operations to produce meaningless results. For multi-source pipelines where each feed may carry a different EPSG code, the patterns in CRS Normalization Across Mixed Datasets automate zone detection and re-projection.

import geopandas as gpd
import shapely
import numpy as np

TARGET_CRS = "EPSG:32633"  # UTM zone 33N — replace with your region

gdf = gpd.read_file("raw_parcels.gpkg")

if gdf.crs is None:
    raise ValueError("Source has no CRS defined — assign before proceeding.")

if gdf.crs.to_epsg() != int(TARGET_CRS.split(":")[1]):
    gdf = gdf.to_crs(TARGET_CRS)

# Confirm metric units
assert gdf.crs.axis_info[0].unit_name in ("metre", "meter"), (
    f"CRS {gdf.crs.to_epsg()} is not metric — topology tolerances will be wrong."
)
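
When the target zone is not known in advance, it can be derived from a representative longitude and latitude. The helper below is a minimal sketch of that arithmetic, not the method used in the linked guide; the name utm_epsg_for is hypothetical, and the Norway/Svalbard zone exceptions and polar regions are deliberately ignored.

```python
import math

def utm_epsg_for(lon: float, lat: float) -> str:
    """Return the WGS84 / UTM EPSG code covering a lon/lat position."""
    if not (-180.0 <= lon <= 180.0 and -80.0 <= lat <= 84.0):
        raise ValueError("Position is outside the standard UTM coverage.")
    # Zones are 6 degrees wide, numbered 1..60 starting at 180 W.
    zone = min(int(math.floor((lon + 180.0) / 6.0)) + 1, 60)
    base = 32600 if lat >= 0 else 32700  # 326xx = northern, 327xx = southern
    return f"EPSG:{base + zone}"

print(utm_epsg_for(15.0, 52.0))    # EPSG:32633
print(utm_epsg_for(-58.4, -34.6))  # EPSG:32721
```

GeoPandas also offers GeoDataFrame.estimate_utm_crs(), which is usually the better choice once the data is already loaded into a frame with a known CRS.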

Step 2: Diagnose Invalid and Empty Geometries

Before any deduplication or simplification, isolate invalid and empty geometries. Invalid geometries propagate silently through many GeoPandas operations, surfacing as cryptic GEOS errors only in later pipeline stages.

import logging

logger = logging.getLogger(__name__)

geom_arr = gdf.geometry.values  # GeometryArray; Shapely 2.0 functions treat it as an array of geometries

invalid_mask = ~shapely.is_valid(geom_arr)
empty_mask = shapely.is_empty(geom_arr)
problem_mask = invalid_mask | empty_mask

logger.info(
    "Geometry diagnostics — invalid: %d, empty: %d, total: %d",
    invalid_mask.sum(),
    empty_mask.sum(),
    len(gdf),
)

if problem_mask.any():
    quarantine = gdf[problem_mask].copy()
    quarantine["issue"] = np.where(empty_mask[problem_mask], "empty", "invalid")
    quarantine.to_file("quarantine_pre_repair.gpkg", driver="GPKG")
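
A quarantine file is easier to triage when each row records why GEOS rejected it. The sketch below uses shapely.is_valid_reason on three hand-made geometries; the sample shapes and labels are illustrative only.

```python
import shapely
from shapely.geometry import Polygon

geoms = [
    Polygon([(0, 0), (2, 2), (2, 0), (0, 2)]),  # bowtie: self-intersecting
    Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),  # valid square
    Polygon(),                                  # empty (valid, but unusable)
]

reasons = shapely.is_valid_reason(geoms)  # array of human-readable strings
invalid = ~shapely.is_valid(geoms)
empty = shapely.is_empty(geoms)

for reason, bad, blank in zip(reasons, invalid, empty):
    label = "empty" if blank else ("invalid" if bad else "ok")
    print(f"{label:8s} {reason}")
```

Note that an empty geometry counts as valid, which is why the pipeline checks is_empty separately.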

Step 3: Repair Self-Intersections and Invalid Rings

Apply shapely.make_valid on the full geometry array in one vectorized call. For a deeper treatment of the specific failure patterns — bowties, unclosed rings, multipart splits — see Geometry Repair with Shapely & GeoPandas.

# Run whenever Step 2 flagged anything, so that empty geometries are dropped too
if problem_mask.any():
    repaired = shapely.make_valid(geom_arr)
    gdf = gdf.copy()
    gdf.geometry = repaired

    # Re-check: some geometries resist make_valid (e.g. degenerate curves)
    still_invalid = ~shapely.is_valid(gdf.geometry.values) | shapely.is_empty(gdf.geometry.values)
    if still_invalid.any():
        logger.warning(
            "%d geometries remain invalid after make_valid — dropping from output.",
            still_invalid.sum(),
        )
        gdf = gdf[~still_invalid].reset_index(drop=True)
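
To see what the repair actually does to a single feature, the following sketch runs make_valid on a classic bowtie. The geometry is a toy example; the point is that both the type and the measured area can change, which is why later steps re-check types and drift.

```python
import shapely
from shapely.geometry import Polygon

bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])
repaired = shapely.make_valid(bowtie)

print(shapely.is_valid(bowtie), shapely.is_valid(repaired))
print(repaired.geom_type, shapely.get_num_geometries(repaired))
# The two lobes of the bowtie wind in opposite directions, so the invalid
# ring reports a misleading area; the repaired geometry reports the true one.
print(bowtie.area, repaired.area)
```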

Step 4: Geometry-Type Routing

Mixed geometry frames (Points alongside Polygons) require type-specific deduplication logic. Route by geometry type before entering the deduplication stage.

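geom_types = shapely.get_type_id(gdf.geometry.values)
# Shapely type IDs: 0=Point, 1=LineString, 2=LinearRing, 3=Polygon,
# 4=MultiPoint, 5=MultiLineString, 6=MultiPolygon, 7=GeometryCollection.
# GeometryCollections (7) match none of the masks below; explode them first
# (see "Multipart Geometry Splitting and Reassembly") if make_valid produced any.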

point_mask = np.isin(geom_types, [0, 4])
line_mask  = np.isin(geom_types, [1, 5])
poly_mask  = np.isin(geom_types, [3, 6])

gdf_points = gdf[point_mask].copy()
gdf_lines  = gdf[line_mask].copy()
gdf_polys  = gdf[poly_mask].copy()
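
Before routing, it can help to log how many features of each type are present, so that anything falling outside the three masks is noticed. A minimal sketch using the shapely.GeometryType enum to turn type IDs into names (the sample WKT is illustrative):

```python
import numpy as np
import shapely

geoms = shapely.from_wkt([
    "POINT (0 0)",
    "POLYGON ((0 0, 1 0, 1 1, 0 0))",
    "GEOMETRYCOLLECTION (POINT (1 1))",
])

type_ids, counts = np.unique(shapely.get_type_id(geoms), return_counts=True)
summary = {
    shapely.GeometryType(int(t)).name: int(c)
    for t, c in zip(type_ids, counts)
}
print(summary)
```

A non-zero GEOMETRYCOLLECTION count at this point means some features would be silently left out of all three routed frames.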

Step 5: Point Deduplication with Tolerance Thresholds

Exact coordinate equality misses near-duplicate points whose coordinates differ by small amounts (rounding, re-projection, or digitising noise) across source systems. A cKDTree proximity check finds all pairs of points within a configurable radius and keeps the first point of each pair.

For the full implementation — including centroid snapping, attribute conflict resolution, and performance benchmarks — see Removing Duplicate Spatial Points with Tolerance Thresholds.

from scipy.spatial import cKDTree

POINT_TOLERANCE_M = 0.5  # 50 cm — adjust per source precision

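if len(gdf_points) > 0:
    # Explode MultiPoints so that each row maps to exactly one coordinate pair
    gdf_points = gdf_points.explode(index_parts=False).reset_index(drop=True)
    coords = shapely.get_coordinates(gdf_points.geometry.values)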
    tree = cKDTree(coords)
    pairs = tree.query_pairs(r=POINT_TOLERANCE_M, output_type="ndarray")

    # Mark second of each near-duplicate pair for removal
    keep = np.ones(len(gdf_points), dtype=bool)
    if len(pairs):
        keep[pairs[:, 1]] = False

    gdf_points = gdf_points[keep].reset_index(drop=True)
    logger.info("Point deduplication: removed %d near-duplicates.", (~keep).sum())
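
Pairwise removal treats each pair on its own, so a chain of points (A near B, B near C, A not near C) loses both B and C. One alternative, sketched here as an assumption rather than the approach of the linked guide, is to group the pairs into connected components and keep the first member of each cluster. The function name cluster_representatives is hypothetical.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial import cKDTree

def cluster_representatives(coords: np.ndarray, tolerance: float):
    """Return (indices to keep, cluster label per point)."""
    n = len(coords)
    pairs = cKDTree(coords).query_pairs(r=tolerance, output_type="ndarray")
    # Build an undirected graph with one edge per near-duplicate pair.
    weights = np.ones(len(pairs), dtype=np.int8)
    graph = coo_matrix((weights, (pairs[:, 0], pairs[:, 1])), shape=(n, n))
    _, labels = connected_components(graph, directed=False)
    # First occurrence of each label = lowest row index in that cluster.
    _, first_idx = np.unique(labels, return_index=True)
    return np.sort(first_idx), labels

coords = np.array([[0.0, 0.0], [0.4, 0.0], [0.8, 0.0], [5.0, 0.0]])
keep_idx, labels = cluster_representatives(coords, tolerance=0.5)
print(keep_idx)  # [0 3]
```

The trade-off is that long chains can merge points that are individually far apart, so the tolerance should stay well below the minimum spacing of genuine features.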

Step 6: Line and Polygon Deduplication via WKB Hash

For lines and polygons, exact duplicate detection via WKB hashing is fast and deterministic. Coordinate order matters: apply shapely.normalize first to canonicalise ring direction and vertex order so that geometrically identical features always produce the same hash regardless of which source wrote them.

import hashlib

def deduplicate_by_wkb_hash(frame: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """
    Drop exact geometric duplicates using a normalised WKB hash.
    Preserves the first occurrence of each unique geometry.
    """
    normalised = shapely.normalize(frame.geometry.values)
    wkb_bytes = shapely.to_wkb(normalised)  # vectorized; returns an array of bytes objects
    hashes = [hashlib.md5(b).hexdigest() for b in wkb_bytes]
    frame = frame.copy()
    frame["_geom_hash"] = hashes
    deduped = frame.drop_duplicates(subset=["_geom_hash"], keep="first")
    removed = len(frame) - len(deduped)
    logger.info("WKB-hash deduplication: removed %d exact duplicates.", removed)
    return deduped.drop(columns=["_geom_hash"]).reset_index(drop=True)

gdf_lines = deduplicate_by_wkb_hash(gdf_lines)
gdf_polys = deduplicate_by_wkb_hash(gdf_polys)
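
Exact hashing still misses features whose coordinates differ only in the last few decimals. A hedged sketch of one way to handle that is to round vertices to a grid with shapely.set_precision before normalising and hashing. The helper name geometry_keys and the 1 mm grid are assumptions for illustration.

```python
import hashlib

import shapely
from shapely.geometry import Polygon

GRID_M = 0.001  # assumption: 1 mm grid, in CRS units

def geometry_keys(geoms, grid_size=None):
    """Hash normalised WKB, optionally after rounding vertices to a grid."""
    if grid_size is not None:
        geoms = shapely.set_precision(geoms, grid_size=grid_size)
    wkb = shapely.to_wkb(shapely.normalize(geoms))
    return [hashlib.md5(b).hexdigest() for b in wkb]

# Same square, different start vertex, with sub-micrometre noise on two vertices.
a = Polygon([(100, 100), (110, 100), (110, 110), (100, 110)])
b = Polygon([(110.0000001, 110), (100, 110.0000002), (100, 100), (110, 100)])

exact = geometry_keys([a, b])
snapped = geometry_keys([a, b], grid_size=GRID_M)
print(exact[0] == exact[1], snapped[0] == snapped[1])
```

Vertices that sit close to a grid cell boundary can still round to different cells, so this reduces false negatives rather than eliminating them.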

Step 7: Topology-Preserving Simplification

After deduplication, reduce vertex count using the Ramer-Douglas-Peucker algorithm. The critical parameter is preserve_topology=True: without it, the algorithm can collapse polygon holes or introduce self-intersections, producing geometries that fail PostGIS’s ST_IsValid check.

Tolerance must be expressed in the CRS unit (metres for UTM). A tolerance of 2.5 m is appropriate for urban cadastral data at 1:5 000 scale. Environmental or regional boundaries can tolerate 5–25 m without perceptible shape loss.

SIMPLIFY_TOLERANCE_M = 2.5

def simplify_layer(
    frame: gpd.GeoDataFrame,
    tolerance: float,
) -> gpd.GeoDataFrame:
    """
    Topology-preserving simplification with pre/post vertex count logging.
    Returns the simplified frame or raises if validity is broken post-simplification.
    """
    original_vertex_counts = shapely.get_num_coordinates(frame.geometry.values)

    simplified_geoms = shapely.simplify(
        frame.geometry.values,
        tolerance=tolerance,
        preserve_topology=True,
    )

    still_invalid = ~shapely.is_valid(simplified_geoms)
    if still_invalid.any():
        raise RuntimeError(
            f"Simplification produced {still_invalid.sum()} invalid geometries "
            f"at tolerance={tolerance}. Reduce tolerance or pre-repair inputs."
        )

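    simplified_vertex_counts = shapely.get_num_coordinates(simplified_geoms)
    total_before = int(original_vertex_counts.sum())
    # Guard against empty layers (e.g. no line features) to avoid 0/0
    reduction_pct = (
        (1.0 - simplified_vertex_counts.sum() / total_before) * 100
        if total_before else 0.0
    )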

    logger.info(
        "Simplification at %.1f m — vertex reduction: %.1f%% (%d → %d vertices)",
        tolerance,
        reduction_pct,
        original_vertex_counts.sum(),
        simplified_vertex_counts.sum(),
    )

    result = frame.copy()
    result.geometry = simplified_geoms
    return result

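# Keep the pre-simplification polygons for the drift QA gate in Step 8
gdf_polys_before_simplify = gdf_polys.copy()

gdf_polys = simplify_layer(gdf_polys, SIMPLIFY_TOLERANCE_M)
gdf_lines = simplify_layer(gdf_lines, SIMPLIFY_TOLERANCE_M)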
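
Choosing a tolerance is easier with numbers in hand. The sketch below sweeps a few candidate tolerances over a sample of polygons and reports vertex count and worst-case area drift for each; the function name sweep_tolerances and the test polygon are illustrative, and the inputs are assumed to have non-zero area.

```python
import numpy as np
import shapely
from shapely.geometry import Polygon

def sweep_tolerances(geoms, tolerances):
    """Report vertex count and max area drift (%) for each tolerance."""
    before_area = shapely.area(geoms)
    rows = []
    for tol in tolerances:
        simplified = shapely.simplify(geoms, tolerance=tol, preserve_topology=True)
        drift = np.abs(shapely.area(simplified) - before_area) / before_area * 100
        rows.append({
            "tolerance": tol,
            "vertices": int(shapely.get_num_coordinates(simplified).sum()),
            "max_area_drift_pct": float(drift.max()),
        })
    return rows

# A 10 m square with a 0.2 m dent in its bottom edge.
sample = [Polygon([(0, 0), (5, 0.2), (10, 0), (10, 10), (0, 10)])]
rows = sweep_tolerances(sample, [0.01, 1.0])
for row in rows:
    print(row)
```

Running the sweep on a representative sample before a full run gives a defensible basis for the tolerance passed to simplify_layer.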

Step 8: Validate Area and Length Drift

Simplification is lossy by design. Automated QA gates catch cases where tolerance was set too aggressively relative to feature size.

def validate_area_drift(
    original: gpd.GeoDataFrame,
    simplified: gpd.GeoDataFrame,
    max_pct: float = 0.5,
) -> None:
    """
    Assert that no polygon loses more than max_pct% of its original area.
    Logs all violating features for downstream triage.
    """
    orig_areas = shapely.area(original.geometry.values)
    simp_areas = shapely.area(simplified.geometry.values)

    # Avoid division by zero on degenerate slivers
    with np.errstate(divide="ignore", invalid="ignore"):
        drift_pct = np.where(
            orig_areas > 0,
            np.abs(orig_areas - simp_areas) / orig_areas * 100,
            0.0,
        )

    violators = drift_pct > max_pct
    if violators.any():
        logger.error(
            "%d features exceed %.1f%% area drift threshold (max observed: %.2f%%).",
            violators.sum(),
            max_pct,
            drift_pct.max(),
        )
        # Surface violators for manual review — do not silently drop them
        problem_ids = simplified[violators].index.tolist()
        raise ValueError(
            f"Area drift QA gate failed for feature ids: {problem_ids[:10]} "
            f"{'...' if len(problem_ids) > 10 else ''}"
        )

    logger.info("Area drift QA gate passed — max drift: %.3f%%.", drift_pct.max())

validate_area_drift(gdf_polys_before_simplify, gdf_polys, max_pct=0.5)
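
The gate above covers area only. For line layers, an analogous check on length could look like the following sketch; length_drift_pct is a hypothetical helper, and the two arrays are assumed to be aligned row for row.

```python
import numpy as np
import shapely
from shapely.geometry import LineString

def length_drift_pct(original, simplified) -> np.ndarray:
    """Per-feature absolute length change, as a percentage of the original."""
    orig_len = shapely.length(original)
    simp_len = shapely.length(simplified)
    # Substitute 1.0 for zero lengths so the division is always defined.
    safe = np.where(orig_len > 0, orig_len, 1.0)
    return np.where(orig_len > 0, np.abs(orig_len - simp_len) / safe * 100, 0.0)

original = [LineString([(0, 0), (1, 0.1), (2, 0), (3, 0.1), (4, 0)])]
simplified = [LineString([(0, 0), (4, 0)])]
drift = length_drift_pct(original, simplified)
print(drift)
```

Length drift tends to be far more sensitive than area drift for sinuous features such as rivers, so a separate, looser threshold is often appropriate.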

Advanced Patterns and Edge Cases

Snapping Near-Coincident Polygon Boundaries

Adjacent polygons sharing a boundary (cadastral parcels, ecological zones) that were edited by different operators often drift by sub-millimetre amounts. After simplification, these micro-gaps become sliver polygons that fail topology checks and inflate sjoin row counts. Snap shared vertices to a common grid before simplifying:

SNAP_GRID_M = 0.001  # 1 mm grid for cadastral data

# set_precision rounds every vertex onto a grid of the given cell size
snapped = shapely.set_precision(gdf_polys.geometry.values, grid_size=SNAP_GRID_M)
gdf_polys = gdf_polys.copy()
gdf_polys.geometry = snapped

This combines naturally with the Handling Precision and Coordinate Rounding patterns that standardise decimal precision across mixed-source feeds before geometry comparison.
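
The effect of grid snapping on a shared boundary can be checked directly. In this toy sketch two parcels are separated by a 0.4 mm gap; after rounding to a 1 mm grid they share an edge exactly. The coordinates are invented for illustration.

```python
import shapely

left = shapely.box(0, 0, 10, 10)
right = shapely.box(10.0004, 0, 20, 10)  # 0.4 mm gap to the left parcel

touches_before = bool(shapely.touches(left, right))
gap_before = float(shapely.distance(left, right))

snapped = shapely.set_precision([left, right], grid_size=0.001)
touches_after = bool(shapely.touches(snapped[0], snapped[1]))
gap_after = float(shapely.distance(snapped[0], snapped[1]))

print(touches_before, gap_before)
print(touches_after, gap_after)
```

A gap wider than half the grid size would survive snapping, so the grid has to be chosen relative to the drift actually observed in the sources.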

Multipart Geometry Splitting and Reassembly

shapely.make_valid occasionally converts a self-intersecting polygon to a GeometryCollection or MultiPolygon where a single Polygon is expected. Schema-strict pipelines (PostGIS table with geometry(Polygon,32633)) will reject these. Explode multiparts and re-filter:

import pandas as pd

# Identify features make_valid promoted to multi-geometry
orig_type_ids = shapely.get_type_id(gdf_polys.geometry.values)
multi_mask = orig_type_ids >= 4  # 4=MultiPoint, 5=MultiLineString, 6=MultiPolygon, 7=Collection

if multi_mask.any():
    multi_gdf = gdf_polys[multi_mask].explode(index_parts=False)
    # Re-filter to only Polygon type_id=3
    single_mask = shapely.get_type_id(multi_gdf.geometry.values) == 3
    multi_gdf = multi_gdf[single_mask]
    gdf_polys = gpd.GeoDataFrame(
        pd.concat([gdf_polys[~multi_mask], multi_gdf], ignore_index=True),
        geometry="geometry",
        crs=gdf_polys.crs,
    )
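
The same split can be done at the array level with shapely.get_parts, which also returns the index of the source feature for each part. This is a sketch of an alternative to explode, not a replacement for the frame-level code above; the WKT inputs are illustrative.

```python
import shapely

geoms = shapely.make_valid(shapely.from_wkt([
    "POLYGON ((0 0, 2 2, 2 0, 0 2, 0 0))",  # bowtie, repaired to a MultiPolygon
    "POLYGON ((5 5, 6 5, 6 6, 5 6, 5 5))",  # already valid
]))

# One level of unpacking: multi-geometries yield their members,
# single geometries are passed through unchanged.
parts, source_idx = shapely.get_parts(geoms, return_index=True)

# Keep polygonal parts only (type_id 3), as a schema-strict table would.
is_polygon = shapely.get_type_id(parts) == 3
parts, source_idx = parts[is_polygon], source_idx[is_polygon]
print(len(parts), source_idx.tolist())
```

source_idx can be used to repeat the attribute rows of the original frame, which is what explode does internally for GeoDataFrames.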

Chunked Processing for Datasets Exceeding 1 M Features

Sequential deduplication and simplification on million-feature datasets can exhaust 32 GB of RAM on a standard compute node. Partition spatially (H3 hexagons, quadtree tiles, or a Hilbert-curve shuffle), process each partition independently, and merge at shared boundaries:

import dask_geopandas as dgpd

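# from_geopandas partitions by row order; spatial_shuffle() then regroups rows
# so that each partition holds spatially neighbouring features
dask_gdf = dgpd.from_geopandas(gdf_polys, npartitions=16).spatial_shuffle()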

def process_partition(part: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    part = deduplicate_by_wkb_hash(part)
    return simplify_layer(part, SIMPLIFY_TOLERANCE_M)

result_dask = dask_gdf.map_partitions(process_partition)
gdf_out = result_dask.compute()

Boundary features split across tile edges require a second-pass merge and re-simplification step at the seams; for attribute harmonisation across those boundaries, see Attribute Mapping and Schema Harmonisation.
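
A simple way to assign features to square tiles is to key each one by the grid cell containing its centroid. The sketch below assumes a regular grid rather than H3 or a quadtree, and assumes empty geometries were already removed; tile_keys and the 10-unit tile size are illustrative.

```python
import numpy as np
import shapely

def tile_keys(geoms, tile_size: float) -> np.ndarray:
    """Return an (n, 2) integer array of (column, row) tile indices."""
    # Normalising first makes the centroid independent of vertex order.
    centroids = shapely.centroid(shapely.normalize(geoms))
    xy = shapely.get_coordinates(centroids)
    return np.floor(xy / tile_size).astype(np.int64)

geoms = shapely.points([(5, 5), (15, 5), (5, 15), (5.2, 5.1)])
keys = tile_keys(geoms, tile_size=10.0)
print(keys.tolist())  # [[0, 0], [1, 0], [0, 1], [0, 0]]
```

Because identical geometries have the same centroid, exact duplicates land in the same tile under this scheme, so per-tile hash deduplication should not miss them; near-duplicates close to a tile edge still need the second pass.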

Performance Optimisation: Vectorized Spatial Indexing

Avoid the deprecated sindex.query_bulk method. GeoPandas 1.0+ exposes a predicate-based sindex.query that calls the GEOS tree directly without Python-level iteration:

import time

# Benchmark: spatial index query for proximity-based deduplication
candidate_geoms = shapely.buffer(gdf_polys.geometry.values, 0.1)  # 10 cm buffer

t0 = time.perf_counter()
idx_pairs = gdf_polys.sindex.query(candidate_geoms, predicate="intersects")
elapsed = time.perf_counter() - t0

# idx_pairs shape: (2, N) — row 0 = query index, row 1 = tree index
logger.info(
    "sindex.query found %d candidate pairs in %.3f s for %d features.",
    idx_pairs.shape[1],
    elapsed,
    len(gdf_polys),
)
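
A self-query like the one above returns every feature matched with itself, plus each genuine pair twice (once in each direction). The sketch below shows one way to reduce the result to unique candidate pairs, using shapely.STRtree directly on three toy boxes.

```python
import shapely

geoms = [
    shapely.box(0, 0, 1, 1),
    shapely.box(0.5, 0.5, 2, 2),  # overlaps the first box
    shapely.box(5, 5, 6, 6),      # isolated
]

tree = shapely.STRtree(geoms)
pairs = tree.query(geoms, predicate="intersects")  # shape (2, N)

# Keep each unordered pair once and drop self-matches.
unique_pairs = pairs[:, pairs[0] < pairs[1]]
print(pairs.shape[1], unique_pairs.tolist())
```

The same boolean filter applies unchanged to the idx_pairs array returned by gdf_polys.sindex.query.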

Additional throughput gains:

Write to GeoParquet (gdf.to_parquet("clean.parquet")) between stages rather than GPKG — columnar I/O is 3–8× faster for large frames.
Drop temporary columns immediately after use to avoid memory copies during sort/merge operations.
Use shapely.STRtree directly for custom proximity queries where GeoPandas index abstraction is insufficient.

Integration into ETL Pipelines

These steps compose cleanly into an Airflow DAG or Prefect flow as a single clean_geometry task between ingest_raw and load_to_postgis. The dead-letter queue pattern — routing quarantined features to a side table rather than raising exceptions — is the key to maintaining pipeline availability:

import pandas as pd

def clean_geometry_stage(
    raw_path: str,
    output_path: str,
    quarantine_path: str,
    target_crs: str = "EPSG:32633",
    simplify_tolerance: float = 2.5,
) -> dict:
    """
    End-to-end geometry cleaning stage for ETL orchestration.
    Returns a dict of counts for pipeline observability hooks.
    """
    gdf = gpd.read_file(raw_path)
    input_count = len(gdf)

    # 1. Reproject (a frame without a CRS cannot be reprojected, so fail fast)
    if gdf.crs is None:
        raise ValueError("Source has no CRS defined; assign one before cleaning.")
    if gdf.crs.to_epsg() != int(target_crs.split(":")[1]):
        gdf = gdf.to_crs(target_crs)

    # 2. Repair first, so that only unrepairable features are quarantined
    gdf.geometry = shapely.make_valid(gdf.geometry.values)

    # 3. Quarantine whatever is still invalid or empty
    problem_mask = ~shapely.is_valid(gdf.geometry.values) | shapely.is_empty(gdf.geometry.values)
    if problem_mask.any():
        gdf[problem_mask].to_file(quarantine_path, driver="GPKG")
        gdf = gdf[~problem_mask].copy()

    # 4. Deduplicate
    gdf = deduplicate_by_wkb_hash(gdf)

    # 5. Simplify polygons only (Polygon=3, MultiPolygon=6)
    poly_mask = np.isin(shapely.get_type_id(gdf.geometry.values), [3, 6])
    if poly_mask.any():
        polys = gdf[poly_mask].copy()
        polys = simplify_layer(polys, simplify_tolerance)
        gdf = gpd.GeoDataFrame(
            pd.concat([gdf[~poly_mask], polys], ignore_index=True),
            geometry="geometry", crs=gdf.crs,
        )

    gdf.to_parquet(output_path)

    return {
        "input": input_count,
        "quarantined": int(problem_mask.sum()),
        "output": len(gdf),
        "removed_duplicates": input_count - int(problem_mask.sum()) - len(gdf),
    }
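
The returned counts can drive a simple alerting gate in the orchestrator. The following is a sketch under assumed thresholds; check_stage_counts and the 2% limit are hypothetical and not part of any Airflow or Prefect API.

```python
def check_stage_counts(counts: dict, max_quarantine_ratio: float = 0.02) -> float:
    """Raise if too large a share of the input was quarantined.

    Expects the keys returned by clean_geometry_stage:
    "input", "quarantined", "output", "removed_duplicates".
    """
    total = counts["input"]
    ratio = counts["quarantined"] / total if total else 0.0
    if ratio > max_quarantine_ratio:
        raise RuntimeError(
            f"Quarantine ratio {ratio:.1%} exceeds limit {max_quarantine_ratio:.1%}; "
            "inspect the quarantine file before loading."
        )
    return ratio

healthy = {"input": 1000, "quarantined": 5, "output": 975, "removed_duplicates": 20}
print(check_stage_counts(healthy))  # 0.005
```

Keeping the gate outside clean_geometry_stage lets the cleaning task always finish and write its outputs, while the decision to halt the load stays with the orchestration layer.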

Schema enforcement — ensuring output column names match the target PostGIS table schema — integrates with the patterns in Attribute Mapping and Schema Harmonisation. For raster alignment that must follow vector cleaning in a mixed pipeline, see Raster Alignment and Resampling Techniques.

Removing Duplicate Spatial Points with Tolerance Thresholds — detailed cKDTree and DBSCAN implementation for point-layer deduplication
Deduplicating Overlapping Polygons with Spatial Joins — a GeoPandas self-join with an IoU threshold to pick a single survivor per overlap
Simplifying Polygons While Preserving Shared Boundaries — topology-aware simplification that avoids gaps and overlaps along shared edges
Geometry Repair with Shapely & GeoPandas — fixing self-intersecting polygons, unclosed rings, and bowtie geometries
CRS Normalization Across Mixed Datasets — automated EPSG detection and re-projection before topology operations
Attribute Mapping and Schema Harmonisation — aligning column names and types across deduplicated outputs
Handling Precision and Coordinate Rounding — grid-snapping and decimal truncation strategies that precede deduplication