Best Practices for STAC Catalog Pagination in Python
Implementing best practices for STAC catalog pagination in Python requires generator-based streaming, explicit limit/max_items controls, and resilient retry logic. Never rely on implicit full-collection loads. Instead, iterate through pystac-client search results using .items(), respect server-side next links, and implement exponential backoff for rate-limited endpoints. This pattern minimizes memory overhead, prevents pipeline stalls, and aligns with the official STAC API pagination specification.
How STAC Pagination Works Under the Hood
STAC APIs expose paginated results through a link whose rel is "next", normally carried in the links array of the response body (some OGC API - Features servers also advertise it in an HTTP Link header). When querying large catalogs, the server returns a subset of items alongside a pointer to the subsequent page. Python ETL pipelines must traverse these pointers sequentially without loading the entire dataset into memory.
The pystac-client library abstracts this traversal, but it is easy to defeat its laziness. Calling .item_collection() (or the older, deprecated .get_all_items()), or wrapping .items() in list(), forces the client to resolve every next pointer before returning, which can cause out-of-memory (OOM) crashes in containerized environments or CI/CD runners. Understanding how the pystac-client search interface follows next links is critical for building scalable ingestion workflows.
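To make the traversal concrete, here is a minimal sketch of next-link following over plain dictionaries. The `FAKE_API` pages, the example.com URLs, and the `fetch` callable are hypothetical stand-ins for real HTTP calls, so this illustrates roughly what a client has to do rather than how pystac-client is implemented.

```python
from typing import Any, Callable, Dict, Iterator, Optional

Page = Dict[str, Any]

# Hypothetical in-memory "API": URL -> ItemCollection-like page.
FAKE_API: Dict[str, Page] = {
    "https://example.com/search?page=1": {
        "features": [{"id": "a"}, {"id": "b"}],
        "links": [{"rel": "next", "href": "https://example.com/search?page=2"}],
    },
    "https://example.com/search?page=2": {
        "features": [{"id": "c"}],
        "links": [{"rel": "self", "href": "https://example.com/search?page=2"}],
    },
}


def next_href(page: Page) -> Optional[str]:
    """Return the href of the rel="next" link, or None on the last page."""
    for link in page.get("links", []):
        if link.get("rel") == "next":
            return link.get("href")
    return None


def iter_pages(fetch: Callable[[str], Page], first_url: str) -> Iterator[Page]:
    """Yield pages one at a time, following next links until none remain."""
    url: Optional[str] = first_url
    while url is not None:
        page = fetch(url)
        yield page
        url = next_href(page)


item_ids = [
    feature["id"]
    for page in iter_pages(FAKE_API.__getitem__, "https://example.com/search?page=1")
    for feature in page["features"]
]
print(item_ids)
```

Only one page is held at a time because `iter_pages` is a generator. Real servers may return next links that require a POST body, which this GET-only sketch does not cover.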
Production-Ready Implementation Patterns
1. Stream with Generators
Always use iterator methods like .items() instead of materializing results with .item_collection() or list(search.items()). The iterator yields one item at a time and requests the next page only when the current one is exhausted, so memory use stays roughly constant regardless of catalog size. With this lazy evaluation, your process holds only about one page of pystac.Item objects in RAM at a time.
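When downstream code prefers small batches over single items, the lazy stream can be grouped without materializing it. The sketch below uses a hypothetical `batched` helper and a fake item generator standing in for a real search iterator.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Group a lazy stream into lists of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in for a lazy item iterator; any iterable works here.
fake_items = (f"item-{i}" for i in range(7))
batches = list(batched(fake_items, 3))
print([len(batch) for batch in batches])
```

Calling `list()` on the batches is only for the demonstration. In a pipeline you would loop over `batched(...)` directly so that only one batch exists at a time.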
2. Cap Page and Total Limits
Set limit to match your provider’s optimal page size (typically 100–1000 items). Use max_items to enforce a hard ceiling on total results, preventing runaway queries when spatial or temporal filters are too broad. Explicit limits also protect against misconfigured endpoints that ignore standard pagination tokens.
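The effect of a total cap on a lazy stream can be shown without a network connection. In this sketch, `fake_item_stream` and `capped` are hypothetical helpers; the point is that once the cap is reached, later pages are never requested.

```python
from itertools import islice
from typing import Iterator, List

fetched_pages: List[int] = []  # records which pages were actually "requested"


def fake_item_stream(page_size: int, total_pages: int) -> Iterator[str]:
    """Simulate a paginated item stream; each page is fetched lazily."""
    for page_number in range(1, total_pages + 1):
        fetched_pages.append(page_number)
        for index in range(page_size):
            yield f"page{page_number}-item{index}"


def capped(items: Iterator[str], max_items: int) -> Iterator[str]:
    """Stop after max_items, leaving the rest of the stream unconsumed."""
    return islice(items, max_items)


results = list(capped(fake_item_stream(page_size=100, total_pages=50), max_items=250))
print(len(results), fetched_pages)
```

With a page size of 100 and a cap of 250, only the first three of the fifty simulated pages are touched. The `limit` and `max_items` arguments to a pystac-client search play the roles of `page_size` and `max_items` here.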
3. Handle Rate Limits Gracefully
STAC endpoints frequently throttle requests. Wrap pagination in a retry loop with exponential backoff, respecting Retry-After headers when present. Transient 503 Service Unavailable or 429 Too Many Requests responses should never crash a pipeline; they should trigger a calculated delay before resuming traversal.
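A retry loop of this kind can be sketched in plain Python. `TransientHTTPError`, `backoff_delay`, and `call_with_retries` are hypothetical names introduced for illustration, and only the numeric (seconds) form of Retry-After is handled.

```python
import time
from typing import Callable, List, Mapping, Optional, TypeVar

T = TypeVar("T")
RETRYABLE_STATUS = {429, 503}


class TransientHTTPError(Exception):
    """Hypothetical error carrying the status code and response headers."""

    def __init__(self, status: int, headers: Optional[Mapping[str, str]] = None):
        super().__init__(f"HTTP {status}")
        self.status = status
        self.headers = dict(headers or {})


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 1.0, cap: float = 30.0) -> float:
    """Prefer a numeric Retry-After; otherwise use capped exponential backoff."""
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # the HTTP-date form is not handled in this sketch
    return min(cap, base * (2 ** attempt))


def call_with_retries(func: Callable[[], T], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func, sleeping between attempts on retryable status codes."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except TransientHTTPError as err:
            last_attempt = attempt == max_attempts - 1
            if err.status not in RETRYABLE_STATUS or last_attempt:
                raise
            sleep(backoff_delay(attempt, err.headers.get("Retry-After")))
    raise RuntimeError("unreachable")


# Demonstration: fail twice, then succeed; record delays instead of sleeping.
calls = {"count": 0}
delays: List[float] = []


def flaky_fetch() -> str:
    calls["count"] += 1
    if calls["count"] == 1:
        raise TransientHTTPError(429, {"Retry-After": "3"})
    if calls["count"] == 2:
        raise TransientHTTPError(503)
    return "page"


result = call_with_retries(flaky_fetch, sleep=delays.append)
print(result, delays)
```

Injecting `sleep` keeps the function testable. Production code would usually add random jitter to the computed delay so that many workers do not retry in lockstep.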
4. Validate Conformance
Before assuming standard pagination, check the API’s conformsTo array for https://api.stacspec.org/v1.0.0/core and, because search pagination is defined by the Item Search conformance class, for https://api.stacspec.org/v1.0.0/item-search as well. Non-conforming endpoints may require custom cursor handling or offset-based workarounds. Always verify conformance during client initialization to avoid silent pagination failures.
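The check itself is a simple set comparison against the landing page document. In the sketch below, `missing_conformance` and the sample `landing_page` are hypothetical, and including the item-search URI alongside core is an assumption about which classes your pipeline depends on.

```python
from typing import Any, Iterable, List, Mapping

# Assumed requirement set; adjust to the classes your pipeline relies on.
REQUIRED_CLASSES = (
    "https://api.stacspec.org/v1.0.0/core",
    "https://api.stacspec.org/v1.0.0/item-search",
)


def missing_conformance(landing_page: Mapping[str, Any],
                        required: Iterable[str] = REQUIRED_CLASSES) -> List[str]:
    """Return the required conformance URIs the API does not declare."""
    declared = set(landing_page.get("conformsTo", []))
    return [uri for uri in required if uri not in declared]


# Hypothetical landing page that declares core only.
landing_page = {"conformsTo": ["https://api.stacspec.org/v1.0.0/core"]}
print(missing_conformance(landing_page))
```

An exact string match is strict: an API that declares a release-candidate version of a class will be reported as missing it, so treat the result as a warning signal rather than a hard failure.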
5. Log Pagination State
Emit structured logs for each page fetch (page number, item count, elapsed time). This enables observability in Airflow, Prefect, or Dagster workflows and simplifies debugging when pipelines stall. Structured logging also provides audit trails for compliance and cost-tracking in cloud environments.
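One way to produce such logs with the standard library alone is a small JSON formatter. The `JsonFormatter` class, the logger name, and the `fake_pages` data are hypothetical, and the field names (`page`, `item_count`, `elapsed_s`) are illustrative choices.

```python
import io
import json
import logging
import time
from typing import Any, Dict


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload: Dict[str, Any] = {
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Fields passed through `extra=` become attributes on the record.
        for key in ("page", "item_count", "elapsed_s"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
page_logger = logging.getLogger("stac.pagination.sketch")
page_logger.setLevel(logging.INFO)
page_logger.addHandler(handler)
page_logger.propagate = False

started = time.monotonic()
fake_pages = [["a", "b"], ["c"]]  # hypothetical pages of item IDs
for page_number, page in enumerate(fake_pages, start=1):
    page_logger.info(
        "page fetched",
        extra={
            "page": page_number,
            "item_count": len(page),
            "elapsed_s": round(time.monotonic() - started, 3),
        },
    )

records = [json.loads(line) for line in stream.getvalue().splitlines()]
print(records)
```

One JSON object per line is easy for most log aggregators to parse, which is what makes per-page metrics queryable later.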
Complete Production Code Example
The following snippet demonstrates production-ready pagination with streaming, retries, and explicit limits. It processes items in memory-efficient batches and logs progress for pipeline monitoring.
import logging
import time
from typing import Iterator

import pystac_client
from pystac import Item
from pystac_client.exceptions import APIError
from pystac_client.stac_api_io import StacApiIO
from requests.exceptions import RequestException
from urllib3 import Retry

# Configure structured logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s",
)
logger = logging.getLogger(__name__)


def open_client(catalog_url: str) -> pystac_client.Client:
    """Open a STAC API client whose HTTP requests retry with backoff."""
    # client.search() is lazy, so decorating it with a retry would not cover
    # the page requests issued during iteration. Retrying at the transport
    # level covers every request. urllib3's Retry applies exponential backoff
    # and honors Retry-After headers by default.
    retry_policy = Retry(
        total=5,
        backoff_factor=1,
        status_forcelist=[429, 502, 503, 504],
        allowed_methods=None,  # retry POST searches as well as GET
    )
    stac_io = StacApiIO(max_retries=retry_policy)
    return pystac_client.Client.open(catalog_url, stac_io=stac_io)


def stream_paginated_items(
    catalog_url: str,
    collections: list[str],
    bbox: tuple[float, float, float, float],
    datetime_range: str,
    page_limit: int = 500,
    max_total_items: int = 10000,
) -> Iterator[Item]:
    """
    Generator that streams paginated STAC items with explicit limits and observability.
    """
    try:
        client = open_client(catalog_url)
    except (APIError, RequestException) as e:
        logger.error("Failed to establish STAC connection after retries: %s", e)
        raise

    # Validate conformance before querying; pagination of search results is
    # defined by the Item Search conformance class.
    required = (
        "https://api.stacspec.org/v1.0.0/core",
        "https://api.stacspec.org/v1.0.0/item-search",
    )
    declared = client.get_conforms_to()
    missing = [uri for uri in required if uri not in declared]
    if missing:
        logger.warning(
            "Catalog does not declare %s. Pagination behavior may vary.", missing
        )

    search_kwargs = {
        "collections": collections,
        "bbox": bbox,
        "datetime": datetime_range,
        "limit": page_limit,
        "max_items": max_total_items,
    }
    logger.info("Initiating STAC search with params: %s", search_kwargs)
    search_result = client.search(**search_kwargs)

    page_count = 0
    items_yielded = 0
    started = time.monotonic()

    # .pages() follows next links lazily, like .items(), but exposes page
    # boundaries so that progress can be logged per page
    for page in search_result.pages():
        page_count += 1
        for item in page:
            yield item
            items_yielded += 1
            if items_yielded >= max_total_items:
                break

        # Log progress every 5 pages or at the ceiling
        if page_count % 5 == 0 or items_yielded >= max_total_items:
            logger.info(
                "Pagination checkpoint | Pages: %d | Items yielded: %d | Limit: %d | Elapsed: %.1fs",
                page_count, items_yielded, max_total_items, time.monotonic() - started,
            )

        # Hard safety break in case the page iterator does not enforce max_items
        if items_yielded >= max_total_items:
            logger.info("Reached max_items ceiling. Terminating pagination.")
            break

    logger.info("Pagination complete. Total items streamed: %d", items_yielded)


# Usage example
if __name__ == "__main__":
    for item in stream_paginated_items(
        catalog_url="https://planetarycomputer.microsoft.com/api/stac/v1",
        collections=["sentinel-2-l2a"],
        bbox=(-122.5, 37.7, -122.3, 37.9),
        datetime_range="2023-01-01/2023-01-31",
        page_limit=250,
        max_total_items=2000,
    ):
        # Process each item without loading the full catalog into memory
        asset_urls = {k: v.href for k, v in item.assets.items()}
        logger.debug("Processing item %s", item.id)
Integrating Pagination into Data Pipelines
When integrating these patterns into larger workflows, align your pagination strategy with broader catalog synchronization routines. Properly chunked requests dramatically improve the reliability of Syncing STAC Catalogs with pystac-client across distributed ETL environments. By decoupling network traversal from downstream processing, you prevent backpressure from cascading into your compute layer.
For teams scaling geospatial ingestion, consider pairing generator-based pagination with async I/O or multiprocessing pools. Always validate that your downstream consumer (e.g., Pandas, GeoPandas, or Apache Arrow) can consume iterators natively. If you’re building foundational data infrastructure, explore the broader architectural patterns outlined in Mastering Geospatial Data Ingestion in Python to standardize retry policies, schema validation, and metadata tracking across your entire catalog ecosystem.