How to structure pytest-geo for large shapefiles

When validating multi-gigabyte .shp datasets in automated pipelines, naive pytest configurations routinely trigger memory exhaustion, I/O bottlenecks, and non-deterministic timeout failures. The core challenge is not merely running spatial assertions, but architecting a test harness that respects filesystem constraints, enforces strict validation boundaries, and scales across parallel CI runners. Learning how to structure pytest-geo for large shapefiles requires a deliberate separation of data ingestion, spatial indexing, assertion logic, and pipeline orchestration. This guide provides a production-ready blueprint for GIS QA engineers, data engineers, and platform teams deploying geospatial QA automation at scale.

Architectural Blueprint & Directory Layout

A scalable spatial test repository must isolate heavy binaries from version control, enforce deterministic fixture lifecycles, and align with established Geospatial QA Fundamentals & Architecture. The following structure decouples test logic from raw data while maintaining strict reproducibility:

project-root/
├── conftest.py                 # Root-level fixtures, session-scoped setup
├── pyproject.toml              # pytest config, xdist, timeout, markers
├── tests/
│   ├── geo/
│   │   ├── conftest.py         # Spatial-specific fixtures, lazy loaders
│   │   ├── test_topology.py
│   │   ├── test_schema.py
│   │   └── test_crs_alignment.py
│   └── unit/
│       └── test_transforms.py
├── data/                       # .gitignored, mounted via CI artifact/cache
│   └── large_dataset/
│       ├── boundaries.shp
│       ├── boundaries.shx
│       ├── boundaries.dbf
│       └── boundaries.prj
└── ci/
    └── spatial_cache_policy.yaml

This layout enforces a clear boundary between test execution and data provisioning. Large shapefiles should never be committed to Git; instead, they are provisioned via CI artifact storage, cloud object mounts, or deterministic synthetic generators. The tests/geo/conftest.py layer becomes the single source of truth for spatial fixture injection, while root conftest.py handles session-level resource pooling and teardown.

Lazy-Loading Fixtures & Memory Management

Loading a 500MB shapefile into memory per test function is unsustainable. The fixture layer must implement lazy evaluation, spatial indexing, and chunked iteration. Production-grade setups leverage fiona for low-level reading and geopandas only when vectorized operations are strictly necessary. Consult the official Fiona documentation for driver configuration and streaming best practices.

# tests/geo/conftest.py
from pathlib import Path

import fiona
import pytest
from shapely.geometry import box, shape
from shapely.strtree import STRtree


@pytest.fixture(scope="session")
def large_shapefile_path():
    """Provision path to large shapefile. In CI, resolves from artifact cache."""
    return Path("/opt/ci/data/large_dataset/boundaries.shp")


@pytest.fixture(scope="session")
def spatial_index(large_shapefile_path):
    """Build an STRtree over bounding boxes; full geometries are not retained."""
    boxes = []
    with fiona.open(large_shapefile_path) as src:
        for feature in src:
            if feature["geometry"] is None:
                continue  # skip records with null geometry
            # Materialize one geometry at a time and keep only its envelope
            boxes.append(box(*shape(feature["geometry"]).bounds))
    # With Shapely 2.0, queries return positions in `boxes`, not feature IDs
    return STRtree(boxes)

By streaming features via fiona and deferring geometry instantiation until assertion time, you avoid the O(N) memory spike of loading every geometry up front. When full DataFrame operations are unavoidable, read the file in windows with gpd.read_file(path, rows=slice(start, stop)) and explicitly drop intermediate references so each chunk's memory can be reclaimed. For high-performance spatial indexing, build Shapely's STRtree over bounding boxes rather than full geometries.
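A minimal sketch of windowed reading, assuming geopandas and fiona are installed. The helper names row_windows and iter_chunks are hypothetical, not part of either library; the only library behavior relied on is that geopandas.read_file accepts a rows argument given as a slice.

```python
from typing import Iterator


def row_windows(total_rows: int, chunk_size: int) -> Iterator[slice]:
    """Yield consecutive slices that cover range(total_rows) in chunk_size steps."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total_rows, chunk_size):
        yield slice(start, min(start + chunk_size, total_rows))


def iter_chunks(path, chunk_size: int = 10_000):
    """Yield one GeoDataFrame per row window (hypothetical helper)."""
    # Imported lazily so the window arithmetic stays usable without the GIS stack.
    import fiona
    import geopandas as gpd

    with fiona.open(path) as src:
        total_rows = len(src)  # feature count reported by the driver
    for window in row_windows(total_rows, chunk_size):
        # Only the rows in this window are read; the previous chunk can be
        # collected once the caller drops its reference to it.
        yield gpd.read_file(path, rows=window)
```

A test would then loop with for chunk in iter_chunks(path): and assert on each chunk in turn, so that at most one window of rows is held in memory at a time.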

Spatial Assertion Strategy & Test Pyramid Alignment

Structuring spatial tests requires strict adherence to validation tiers. Aligning your suite with Understanding the GIS Test Pyramid ensures that heavy integration tests only execute after fast, deterministic unit checks pass.

At the base, unit tests validate coordinate transformations, projection math, and geometry constructors using lightweight mocking strategies. Mid-tier integration tests apply spatial assertion patterns to verify topology rules, CRS alignment, and attribute schema compliance. Heavy end-to-end validations run against cached production snapshots, enforcing strict scoping rules to prevent redundant full-dataset scans and assertion drift.

# tests/geo/test_topology.py
import fiona
from shapely.geometry import shape


def test_polygon_validity(large_shapefile_path):
    """Validate topology one feature at a time, never holding the full dataset."""
    with fiona.open(large_shapefile_path) as src:
        for feature in src:
            geom = shape(feature["geometry"])
            assert geom.is_valid, f"Invalid geometry at ID {feature['id']}"
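One way to approximate the tier ordering described in this section is a collection hook that moves marked tests to the end of the run. The sketch below assumes the slow and integration markers are registered in the pytest configuration; tier_rank is a hypothetical helper, while pytest_collection_modifyitems and get_closest_marker are standard pytest APIs. Note that this only reorders tests. It does not stop later tiers from running after a failure, so it would normally be combined with -x or with separate CI stages.

```python
# conftest.py (sketch)
def tier_rank(item) -> int:
    """Return 0 for unmarked (fast) tests, 1 for integration, 2 for slow."""
    if item.get_closest_marker("slow") is not None:
        return 2
    if item.get_closest_marker("integration") is not None:
        return 1
    return 0


def pytest_collection_modifyitems(config, items):
    """Reorder collected tests in place so cheaper tiers run first."""
    # list.sort is stable, so the original order within each tier is kept.
    items.sort(key=tier_rank)
```

Under pytest-xdist the ordering applies to the collected list before distribution, so it is a scheduling hint per worker rather than a global guarantee.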

CI/CD Orchestration & Parallel Execution

Parallelizing spatial tests introduces race conditions around shared file handles and temporary index files. Use pytest-xdist with explicit worker isolation: each xdist worker is a separate process, so a session-scoped fixture runs once per worker and can hand that worker its own read-only copy of the data. Configure pyproject.toml to enforce strict timeouts, and use markers to deselect flaky network-dependent tests. Refer to the official pytest-xdist documentation for worker distribution strategies and --dist modes.

# pyproject.toml
[tool.pytest.ini_options]
# -n requires pytest-xdist; --timeout requires pytest-timeout
addopts = "-n auto --timeout=300 --strict-markers"
markers = [
    "slow: marks tests as slow (deselect with '-m \"not slow\"')",
    "integration: requires full dataset cache"
]
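The per-worker isolation mentioned above could look roughly like the following. This is a sketch under stated assumptions: copy_bundle_for_worker and worker_name are hypothetical helpers, the sidecar list is illustrative rather than exhaustive, and the only pytest-xdist behavior relied on is that worker processes have the PYTEST_XDIST_WORKER environment variable set (for example "gw0").

```python
import os
import shutil
from pathlib import Path

# Shapefile components copied together when present (illustrative list).
SIDECARS = (".shp", ".shx", ".dbf", ".prj", ".cpg")


def worker_name() -> str:
    """Return the xdist worker id (e.g. "gw0"), or "master" when not under xdist."""
    return os.environ.get("PYTEST_XDIST_WORKER", "master")


def copy_bundle_for_worker(shp_path, scratch_root) -> Path:
    """Copy a shapefile and its sidecars into a directory owned by this worker."""
    shp_path = Path(shp_path)
    dest_dir = Path(scratch_root) / worker_name()
    dest_dir.mkdir(parents=True, exist_ok=True)
    for ext in SIDECARS:
        part = shp_path.with_suffix(ext)
        target = dest_dir / part.name
        if part.exists() and not target.exists():
            shutil.copy2(part, target)
            target.chmod(0o444)  # mark the copy read-only
    return dest_dir / shp_path.name
```

A session-scoped fixture could call copy_bundle_for_worker once and return the resulting path. For multi-gigabyte bundles a full copy per worker may be too expensive, in which case a shared read-only mount achieves the same isolation from writes.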

Implement a deterministic cache policy in your CI runner to mount read-only shapefile bundles. When cache misses occur, fall back to synthetic generators that produce statistically representative geometries rather than downloading multi-gigabyte archives. Isolate worker environments using ephemeral containers or isolated temp directories to prevent cross-test state pollution.
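As an illustration of a deterministic fallback, the sketch below generates GeoJSON-like square polygons from a seeded random number generator. The function name, parameters, and the uniform-squares distribution are all assumptions made for the example; a real generator would be tuned to match the statistics of the production dataset, and writing the features out to a .shp file (for instance with fiona) is left out.

```python
import random


def synthetic_squares(count, seed=0, extent=(-180.0, -90.0, 180.0, 90.0), max_size=1.0):
    """Return `count` GeoJSON-like polygon features, reproducible for a given seed."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    minx, miny, maxx, maxy = extent
    features = []
    for fid in range(count):
        size = rng.uniform(0.01, max_size)
        x = rng.uniform(minx, maxx - size)
        y = rng.uniform(miny, maxy - size)
        # Closed exterior ring in counter-clockwise order
        ring = [(x, y), (x + size, y), (x + size, y + size), (x, y + size), (x, y)]
        features.append(
            {
                "type": "Feature",
                "id": str(fid),
                "properties": {"fid": fid},
                "geometry": {"type": "Polygon", "coordinates": [ring]},
            }
        )
    return features
```

Because the generator is seeded, two CI runs with the same seed produce identical features, which keeps failures reproducible even when the cached bundle is unavailable.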

Security & Scoping Boundaries

Processing untrusted spatial data in automated pipelines introduces path traversal, CRS injection, and malformed geometry risks. Enforce strict security boundaries by validating file signatures before ingestion, restricting the GDAL drivers that Fiona may use (for example, through the enabled_drivers argument to fiona.open), and sandboxing test runners in ephemeral containers. Always normalize CRS inputs to a canonical EPSG code before running spatial joins, and strip external metadata that could leak pipeline secrets.
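Signature validation can be done with the standard library before any GIS driver touches the file. The sketch below checks the fixed 100-byte .shp main-file header as laid out in the ESRI Shapefile specification: file code 9994 (big-endian) at byte 0, file length in 16-bit words (big-endian) at byte 24, and version 1000 followed by the shape type (both little-endian) at byte 28. The function name check_shp_header is hypothetical, and the check covers only the .shp component, not the .shx or .dbf sidecars.

```python
import struct
from pathlib import Path

SHP_FILE_CODE = 9994   # big-endian int32 at byte 0
SHP_VERSION = 1000     # little-endian int32 at byte 28
SHP_HEADER_SIZE = 100  # fixed main-file header length in bytes


def check_shp_header(path) -> int:
    """Validate a .shp main-file header and return its shape type code.

    Raises ValueError when the header does not look like a shapefile.
    """
    path = Path(path)
    with path.open("rb") as fh:
        header = fh.read(SHP_HEADER_SIZE)
    if len(header) < SHP_HEADER_SIZE:
        raise ValueError("file is shorter than the 100-byte shapefile header")

    (file_code,) = struct.unpack(">i", header[0:4])
    (length_words,) = struct.unpack(">i", header[24:28])
    version, shape_type = struct.unpack("<ii", header[28:36])

    if file_code != SHP_FILE_CODE:
        raise ValueError(f"unexpected file code {file_code}")
    if version != SHP_VERSION:
        raise ValueError(f"unexpected version {version}")
    # The declared length is in 16-bit words; compare it with the size on disk.
    if length_words * 2 != path.stat().st_size:
        raise ValueError("declared length does not match file size")
    return shape_type
```

A fixture could call check_shp_header before handing the path to fiona.open and fail fast on a mismatch, so that a truncated or disguised file never reaches the driver.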

By combining lazy fixture loading, strict assertion scoping, and hardened CI orchestration, teams can validate enterprise-scale shapefiles without compromising pipeline velocity or infrastructure stability.