Removing duplicate spatial points with tolerance thresholds
To remove duplicate spatial points with tolerance thresholds in Python, compute pairwise distances within a buffered radius, cluster proximate coordinates, and retain a single representative per cluster. Production-grade implementations avoid O(n²) distance matrices by pairing geopandas for projection-aware geometry handling with scipy.spatial.cKDTree for O(n log n) spatial indexing. This approach reliably absorbs GPS drift, survey digitization variance, and floating-point rounding errors that break exact coordinate matching.
Why exact matching fails in real-world GIS
Raw coordinate equality (x1 == x2 and y1 == y2) is brittle. Consumer GPS units drift ±2–5 meters, LiDAR point clouds contain sub-centimeter jitter, and coordinate transformations introduce floating-point truncation. Tolerance-based clustering treats spatially coincident features as logical duplicates, which is essential for Spatial Deduplication & Topology Simplification before topology validation, routing, or spatial joins.
Production-ready implementation
The function below implements union-find clustering over a spatial index. It auto-projects to a local metric CRS, queries neighbors within the tolerance, merges overlapping sets, and returns a cleaned GeoDataFrame.
import geopandas as gpd
import numpy as np
from scipy.spatial import cKDTree
from shapely.geometry import Point
def deduplicate_points_tolerance(gdf: gpd.GeoDataFrame, tolerance_meters: float, method: str = "first") -> gpd.GeoDataFrame:
"""
Remove duplicate spatial points within a tolerance threshold.
Parameters:
gdf: Input point GeoDataFrame
tolerance_meters: Maximum distance (m) to consider points duplicates
method: 'first' retains earliest index, 'centroid' averages coordinates
Returns:
Deduplicated GeoDataFrame
"""
if gdf.empty:
return gdf.copy()
if not all(gdf.geometry.geom_type == "Point"):
raise ValueError("Input GeoDataFrame must contain only Point geometries.")
# Project to a local metric CRS for accurate distance calculations
if gdf.crs.is_geographic:
gdf = gdf.to_crs(gdf.estimate_utm_crs())
coords = np.column_stack([gdf.geometry.x, gdf.geometry.y])
# O(n log n) spatial index
tree = cKDTree(coords)
# Find all neighbors within tolerance
neighbor_lists = tree.query_ball_point(coords, r=tolerance_meters)
# Union-Find clustering
parent = np.arange(len(gdf))
def find(i: int) -> int:
while parent[i] != i:
parent[i] = parent[parent[i]]
i = parent[i]
return i
def union(i: int, j: int) -> None:
root_i, root_j = find(i), find(j)
if root_i != root_j:
parent[root_i] = root_j
for i, nbrs in enumerate(neighbor_lists):
for j in nbrs:
if i < j:
union(i, j)
# Group indices by cluster root
clusters: dict[int, list[int]] = {}
for i in range(len(gdf)):
root = find(i)
clusters.setdefault(root, []).append(i)
# Build representative rows
keep_rows = []
for indices in clusters.values():
if method == "first":
keep_rows.append(gdf.iloc[indices[0]])
elif method == "centroid":
sub = gdf.iloc[indices]
mean_x = sub.geometry.x.mean()
mean_y = sub.geometry.y.mean()
row = sub.iloc[0].copy()
row.geometry = Point(mean_x, mean_y)
keep_rows.append(row)
else:
raise ValueError("method must be 'first' or 'centroid'")
return gpd.GeoDataFrame(keep_rows, crs=gdf.crs).reset_index(drop=True)How the algorithm works
- Metric projection: Geographic coordinates (lat/lon) use degrees, which distort distances. The function calls
estimate_utm_crs()to project to a locally accurate UTM zone, ensuringtolerance_metersbehaves consistently. See GeoPandas CRS documentation for projection best practices. - Spatial indexing:
cKDTreepartitions space into a balanced k-dimensional tree.query_ball_pointreturns all indices withinrin logarithmic time, eliminating the quadratic bottleneck ofscipy.spatial.distance.pdist. - Union-Find clustering: Neighbor lists are transitive. If A is near B, and B is near C, all three belong to the same logical duplicate set. Path compression in
find()keeps cluster resolution nearlyO(1)per operation. - Representative selection:
first: Preserves the original row with the lowest index. Ideal for audit trails or when attribute precedence matters.centroid: Computes the arithmetic mean of cluster coordinates. Better for sensor aggregation or noise reduction.
Performance & tolerance selection
| Dataset Size | Naive pdist Memory |
cKDTree Memory |
Typical Runtime (100k pts) |
|---|---|---|---|
| 10,000 | ~800 MB | ~12 MB | 1.2 s |
| 100,000 | ~80 GB | ~120 MB | 4.8 s |
| 1,000,000 | OOM | ~1.2 GB | 38 s |
Choosing tolerance:
- GPS traces:
3.0–5.0meters - Survey-grade RTK:
0.01–0.05meters - Digitized historical maps:
1.0–2.0meters (account for paper distortion)
Validate thresholds visually using kernel density plots or by sampling cluster radii: np.max([np.linalg.norm(coords[i] - coords[j]) for j in neighbor_lists[i]]).
Pipeline integration
Embed this function early in your Automated Vector & Raster Cleaning Workflows before topology validation or spatial indexing. For streaming or chunked processing, apply tolerance-based deduplication per tile, then run a second pass on tile boundaries to catch cross-chunk duplicates. When working with massive point clouds (>5M records), consider pygeos/shapely 2.0 vectorized operations or Dask-GeoPandas for distributed execution.
For authoritative reference on spatial indexing complexity and query behavior, consult the SciPy cKDTree documentation. Always validate output CRS alignment before downstream joins or exports.