Scoping Rules for Map Data Validation


Scoping rules are the deterministic boundaries that decide which spatial extents, attribute schemas, coordinate reference systems, and topology tolerances a validation run actually evaluates — and, just as importantly, which ones it skips. Without them, geospatial test suites drift toward full-dataset scans that exhaust runner memory, blur the line between a warning and a hard failure, and make CI results impossible to reproduce. As a foundational pattern within Geospatial QA Fundamentals & Architecture, a scope is a versioned contract: it pins a bounding envelope, a list of required fields, an expected EPSG code, and a set of spatial tolerance thresholds so that every commit triggers bounded, memory-safe, configuration-driven verification rather than ad-hoc manual checks. This page covers the taxonomy of scope dimensions, runnable config-driven implementations in pytest and Great Expectations, their PostGIS counterparts, and the failure modes that quietly corrupt spatial pipelines when scopes are left implicit.

Scope Dimensions and Tolerance Strategy

A scope is never a single number; it is a small matrix of independent dimensions, each with its own tolerance strategy and CRS-unit sensitivity. Treating them as one undifferentiated “validation config” is the most common reason scopes become unmaintainable. The table below is the taxonomy this page builds on — every dimension maps to a distinct predicate and a distinct failure class.

Scope dimension	Tolerance strategy	Typical threshold range	CRS-unit sensitivity
Spatial extent (bbox / clip mask)	Containment with snap buffer	`0` – `1e-4` (envelope buffer)	High — degrees vs metres
Coordinate precision	Relative epsilon on ordinates	`1e-9` – `1e-6` (deg), `1e-3` – `1e-2` (m)	High
Attribute completeness	Null-rate ceiling per field	`0%` – `5%` null tolerance	None
CRS / SRID contract	Exact EPSG match, axis-order aware	Exact (no tolerance)	N/A — identity check
Topology gaps / overlaps	Snap distance + sliver area	`1e-6` deg, `< 1 m²` sliver	High
Geometry validity	OGC Simple Features validity, boolean	Pass/fail (no tolerance)	None

The decisive rule across every dimension is that floating-point ordinates are never compared with exact equality. Where a numeric comparison appears, bound it with a relative error tolerance keyed to the CRS unit:

$$\frac{\lvert v_{\text{computed}} - v_{\text{expected}}\rvert}{\lvert v_{\text{expected}}\rvert} < \varepsilon$$

A scope picks $\varepsilon$ to match the source data’s acquisition accuracy: 1e-6 decimal degrees is roughly 0.11 m at the equator, while a projected dataset in metres might tolerate 0.01 m. Mixing the two without converting is the classic unit-mismatch defect, so the CRS dimension is validated first and gates every geometric comparison downstream.
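
As a minimal sketch of the relative-error bound above, the helper below compares two ordinates without exact equality. The function name and the `abs_floor` fallback for an expected value of zero are illustrative assumptions, not part of any library.

```python
def within_relative_epsilon(
    computed: float,
    expected: float,
    eps: float,
    abs_floor: float = 0.0,  # hypothetical fallback used only when expected == 0
) -> bool:
    """Return True when |computed - expected| / |expected| < eps."""
    if expected == 0.0:
        # The relative error is undefined at zero, so fall back to an absolute bound.
        return abs(computed - expected) <= abs_floor
    return abs(computed - expected) / abs(expected) < eps


# Degree-based latitude: a 1e-7 degree drift is well inside eps = 1e-6.
print(within_relative_epsilon(50.7200001, 50.72, 1e-6))   # True
# Metre-based easting: a 0.5 m drift on 291000 m exceeds eps = 1e-6.
print(within_relative_epsilon(291000.5, 291000.0, 1e-6))  # False
```

Because the bound is relative, the same epsilon permits a larger absolute drift on large projected ordinates than on small ones, which is one more reason to key the value to the CRS unit in the scope file.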

Spatial Extent and Clip Scoping

The extent dimension answers “where does this run look?” It is expressed as a bounding box (or an arbitrary clip polygon) in a declared CRS, and it is applied before any expensive predicate runs so the validation set is materialised lazily. The predicate is an envelope test (intersects in the snippet below, containment where partial overlaps must be excluded), usually with a small snap buffer to absorb digitising noise at the envelope edge:

from shapely.geometry import box
from shapely import STRtree

scope_bbox = box(-3.45, 50.70, -3.30, 50.78)  # (minx, miny, maxx, maxy) in EPSG:4326, Exeter extent
tree = STRtree(features)                       # spatial index over candidate geoms
in_scope_idx = tree.query(scope_bbox, predicate="intersects")

Building the STRtree once and querying it with predicate="intersects" keeps the candidate set bounded: the query returns integer indices, and only geometries that intersect the scope are handed to the expensive checks. The same idea drives the database-side WHERE ST_Intersects(geom, ST_MakeEnvelope(...)) filter shown later, and it is what makes scoping the entry point for the memory-safe execution model described in Understanding the GIS Test Pyramid, where cheap envelope filters run before any overlay or join.
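
The snap buffer itself is easy to illustrate without any geometry library. The following is a sketch only, working on plain envelope tuples; the `BBox` type, the helper names, and the sample coordinates are assumptions made for the example.

```python
from typing import List, NamedTuple


class BBox(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def buffered(b: BBox, snap: float) -> BBox:
    """Grow an envelope by `snap` CRS units on every side."""
    return BBox(b.xmin - snap, b.ymin - snap, b.xmax + snap, b.ymax + snap)


def envelopes_intersect(a: BBox, b: BBox) -> bool:
    """True when two axis-aligned envelopes share at least one point."""
    return a.xmin <= b.xmax and b.xmin <= a.xmax and a.ymin <= b.ymax and b.ymin <= a.ymax


def in_scope(envelopes: List[BBox], scope: BBox, snap: float) -> List[int]:
    """Indices of feature envelopes that intersect the buffered scope."""
    grown = buffered(scope, snap)
    return [i for i, env in enumerate(envelopes) if envelopes_intersect(env, grown)]


scope = BBox(-3.45, 50.70, -3.30, 50.78)          # degrees, EPSG:4326
features = [
    BBox(-3.40, 50.72, -3.39, 50.73),             # well inside the scope
    BBox(-3.29995, 50.75, -3.29990, 50.76),       # 5e-5 degrees east of the edge
    BBox(-3.20, 50.75, -3.19, 50.76),             # clearly outside
]
print(in_scope(features, scope, snap=0.0))    # [0]
print(in_scope(features, scope, snap=1e-4))   # [0, 1]
```

With a zero buffer the edge feature is silently excluded; with a 1e-4 degree buffer it is kept, which is the behaviour the extent row of the taxonomy table describes.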

Attribute and Schema Scoping

Attribute scoping declares the required fields, their dtypes, and the maximum tolerated null rate per field. It is deliberately decoupled from geometry scoping so a topology failure can never mask a missing business attribute, and vice versa. A declarative fragment makes the contract auditable:

attributes:
  required: [feature_id, land_use_code, last_surveyed]
  null_tolerance:
    land_use_code: 0.00   # never null
    last_surveyed: 0.05   # up to 5% null permitted
  domains:
    land_use_code: [residential, commercial, agricultural, water]

The null-rate ceiling is a relative threshold, not a boolean, because real survey data carries known, bounded gaps. The full taxonomy of these checks — and how attribute predicates differ from geometric ones — is set out in Spatial Assertion Types Explained; scoping rules simply decide which of those assertions fire for a given dataset tier.
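
To make the contract concrete, here is a small sketch that evaluates the fragment above against plain Python records. It assumes the contract has already been parsed into a dict with the same keys as the YAML; `null_rate` and `check_attributes` are hypothetical helper names.

```python
from typing import Any, Dict, List


def null_rate(records: List[Dict[str, Any]], field: str) -> float:
    """Fraction of records whose value for `field` is missing or None."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.get(field) is None) / len(records)


def check_attributes(records: List[Dict[str, Any]], contract: Dict[str, Any]) -> List[str]:
    """Return human-readable breaches; an empty list means the scope passes."""
    breaches = []
    for field in contract["required"]:
        if any(field not in r for r in records):
            breaches.append(f"{field}: required field missing")
    for field, ceiling in contract["null_tolerance"].items():
        rate = null_rate(records, field)
        if rate > ceiling:
            breaches.append(f"{field}: null rate {rate:.3f} > {ceiling}")
    for field, allowed in contract["domains"].items():
        bad = {r[field] for r in records if r.get(field) is not None and r[field] not in allowed}
        if bad:
            breaches.append(f"{field}: out-of-domain values {sorted(bad)}")
    return breaches


contract = {
    "required": ["feature_id", "land_use_code", "last_surveyed"],
    "null_tolerance": {"land_use_code": 0.00, "last_surveyed": 0.05},
    "domains": {"land_use_code": ["residential", "commercial", "agricultural", "water"]},
}

# 40 synthetic records, one of which has no survey date (2.5% null, under the 5% ceiling).
good = [{"feature_id": i, "land_use_code": "water", "last_surveyed": "2024-01-01"} for i in range(40)]
good[0]["last_surveyed"] = None
print(check_attributes(good, contract))   # []

bad = good + [{"feature_id": 40, "land_use_code": "industrial", "last_surveyed": "2024-01-01"}]
print(check_attributes(bad, contract))
```

Returning a list of breaches rather than raising on the first one keeps attribute failures separate from geometry failures, in line with the decoupling described above.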

CRS and Topology Scoping

The CRS dimension is an identity check, not a tolerance check: the scope names an expected EPSG code, asserts axis order, and fails fast on any datum or authority mismatch before a single coordinate is compared. Because PROJ 6+ honours the authority-defined axis order by default (latitude, longitude for EPSG:4326) and pyproj switches to x, y order only when always_xy=True is passed, the scope must declare the convention explicitly rather than inferring it.

from pyproj import CRS

expected = CRS.from_epsg(27700)            # British National Grid, metres
declared = CRS.from_user_input(layer_crs)
assert declared == expected, f"CRS contract breach: {declared.to_epsg()} != 27700"

Topology scoping then layers snap distance and sliver-area limits on top of a valid CRS. A gap smaller than the snap distance is tolerated; a sliver polygon below the area floor is collapsed rather than flagged. Because both thresholds are in CRS units, an incorrect CRS makes every topology tolerance meaningless — which is why deterministic CRS gating is treated as its own discipline in Automating CRS validation in CI pipelines.
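
The two topology thresholds can be sketched as simple classifiers. This is an illustration under stated assumptions: coordinates are in metres (for example EPSG:27700), the 0.001 m snap distance is an invented value, and the 1 m² area floor mirrors the sliver limit in the taxonomy table. The function names are hypothetical.

```python
from typing import List, Tuple

Ring = List[Tuple[float, float]]


def ring_area(ring: Ring) -> float:
    """Unsigned shoelace area of a ring; works for open or closed rings."""
    n = len(ring)
    twice_area = sum(
        ring[i][0] * ring[(i + 1) % n][1] - ring[(i + 1) % n][0] * ring[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0


def classify_gap(width: float, snap_distance: float) -> str:
    """A gap narrower than the snap distance is tolerated, otherwise flagged."""
    return "tolerated" if width < snap_distance else "flagged"


def classify_sliver(ring: Ring, area_floor: float) -> str:
    """A polygon below the area floor is collapsed instead of being reported."""
    return "collapse" if ring_area(ring) < area_floor else "keep"


sliver = [(0.0, 0.0), (100.0, 0.0), (100.0, 0.005), (0.0, 0.005)]   # 100 m x 5 mm
parcel = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]       # 10 m x 10 m

print(ring_area(sliver))                         # 0.5
print(classify_sliver(sliver, area_floor=1.0))   # collapse
print(classify_sliver(parcel, area_floor=1.0))   # keep
print(classify_gap(0.0005, snap_distance=0.001)) # tolerated
print(classify_gap(0.25, snap_distance=0.001))   # flagged
```

Both thresholds are plain numbers in CRS units, so running the same values against degree-based data would change their meaning entirely, which is why the CRS check has to pass first.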

Production-Grade Implementation in pytest

The reference implementation loads every threshold from config, never from a literal in the test body. This keeps strictness version-controlled and lets staging and production run the same harness with different scopes. The example uses Shapely 2.x and GeoPandas (1.0 or later, for GeoSeries.set_precision); tolerances arrive as a plain dict loaded from a JSON scope file whose top-level keys (epsg, bbox, null_tolerance, coordinate_epsilon) flatten the YAML contract above.

import json
import geopandas as gpd
import pytest
from pyproj import CRS
from shapely.geometry import box


def load_scope(path: str) -> dict:
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


@pytest.fixture(scope="session")
def scope() -> dict:
    return load_scope("scopes/exeter_parcels.json")


@pytest.fixture(scope="session")
def scoped_gdf(scope) -> gpd.GeoDataFrame:
    gdf = gpd.read_parquet("data/parcels.parquet")
    # 1. CRS contract — identity check, gates everything else.
    assert CRS.from_user_input(gdf.crs) == CRS.from_epsg(scope["epsg"])
    # 2. Spatial extent: index-driven clip, so no per-row predicate scan.
    xmin, ymin, xmax, ymax = scope["bbox"]
    envelope = box(xmin, ymin, xmax, ymax)
    return gdf.iloc[list(gdf.sindex.query(envelope, predicate="intersects"))]


def test_attribute_completeness(scoped_gdf, scope):
    for field, max_null in scope["null_tolerance"].items():
        null_rate = scoped_gdf[field].isna().mean()
        assert null_rate <= max_null, f"{field}: {null_rate:.3f} > {max_null}"


def test_geometry_validity(scoped_gdf):
    invalid = scoped_gdf[~scoped_gdf.geometry.is_valid]
    assert invalid.empty, f"{len(invalid)} invalid geometries in scope"


def test_coordinate_precision(scoped_gdf, scope):
    eps = scope["coordinate_epsilon"]            # CRS-unit aware
    snapped = scoped_gdf.geometry.set_precision(eps)
    drift = scoped_gdf.geometry.hausdorff_distance(snapped)
    assert (drift <= eps).all(), "snap-to-grid drift exceeds scope epsilon"

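For teams already standardised on Great Expectations 0.18.x, the same scope drives a suite where each dimension becomes an expectation, so results land in the shared Data Docs alongside non-spatial checks. The fluent context.sources API used here belongs to the 0.18 line and was reorganised in the 1.0 release, so pin the version: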

import great_expectations as gx

context = gx.get_context()
validator = context.sources.pandas_default.read_parquet("data/parcels.parquet")

validator.expect_column_values_to_not_be_null("land_use_code")
validator.expect_column_proportion_of_unique_values_to_be_between(
    "feature_id", min_value=1.0, max_value=1.0       # primary-key contract
)
validator.expect_column_values_to_be_in_set(
    "land_use_code", ["residential", "commercial", "agricultural", "water"]
)
validator.save_expectation_suite("exeter_parcels.scope")

PostGIS and Database-Side Counterparts

When the data already lives in PostGIS, push the scope down to the server so the envelope filter runs against a GiST index instead of materialising rows in the runner. The SQL mirrors the Python predicates one-to-one — extent, validity, and CRS are all checked server-side:

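-- Extent + validity scope, evaluated against the GiST index.
-- The envelope is declared in EPSG:4326 and transformed once to the layer
-- SRID (27700 here) so that ST_Intersects never mixes SRIDs.
SELECT feature_id, ST_IsValidReason(geom) AS reason
FROM   parcels
WHERE  ST_Intersects(
         geom,
         ST_Transform(ST_MakeEnvelope(-3.45, 50.70, -3.30, 50.78, 4326), 27700)
       )
  AND  NOT ST_IsValid(geom);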

-- CRS / SRID contract — fails the gate if any row drifts from EPSG:27700.
SELECT count(*) AS srid_violations
FROM   parcels
WHERE  ST_SRID(geom) <> 27700;

Driving these from Python with psycopg keeps the threshold values bound as parameters rather than interpolated into the query string, which also closes the WKT/WKB injection surface that Security Boundaries in Spatial QA treats as a first-class risk:

import psycopg

with psycopg.connect(dsn) as conn:
    rows = conn.execute(
        """
        SELECT count(*) FROM parcels
        WHERE ST_SRID(geom) <> %(epsg)s
        """,
        {"epsg": 27700},
    ).fetchone()
assert rows[0] == 0, "SRID contract breach"

Pipeline Integration

Scoping rules only pay off when they are wired into deterministic CI gates. The integration contract has three parts. First, the scope file is version-controlled next to the data contract, so a threshold change shows up in code review rather than silently in a runner. Second, the runtime is pinned — libgeos, PROJ, and GDAL are fixed to exact versions in the test image, because a PROJ minor bump can change axis handling and quietly move every coordinate. Third, each scope dimension maps to a typed gate outcome: extent and attribute breaches outside the merge-blocking tier may warn, while CRS and validity breaches always hard-fail.

# .github/workflows/spatial-validation.yml (excerpt)
jobs:
  validate:
    runs-on: ubuntu-latest
    container: ghcr.io/org/geo-runner:gdal3.8-proj9.3-geos3.12
    steps:
      - uses: actions/checkout@v4
      - run: pytest tests/scopes/ --json-report --json-report-file=scope-report.json
      - run: python ci/emit_scope_metrics.py scope-report.json   # structured log → observability

The harness should serialise results as structured JSON — scope id, dimension, threshold, observed value, and outcome — so DevOps teams can track tolerance-violation rates over time. Synthetic fixtures that reproduce each scope without a live database come from Mocking Geospatial Data for Tests, which lets the gates run in seconds without downloading multi-gigabyte sources.
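
One possible shape for such a record, together with the typed gate outcomes described above, is sketched below. The field names follow the list in the previous paragraph; the `ScopeResult` class, the `gate` function, and the dimension labels are assumptions for illustration rather than an established schema.

```python
import json
from dataclasses import asdict, dataclass

# Dimensions whose breaches always block the merge, per the integration contract.
HARD_FAIL_DIMENSIONS = {"crs", "validity"}


@dataclass(frozen=True)
class ScopeResult:
    scope_id: str
    dimension: str
    threshold: float
    observed: float
    outcome: str  # "pass", "warn", or "fail"


def gate(scope_id: str, dimension: str, threshold: float, observed: float,
         merge_blocking: bool) -> ScopeResult:
    """Map an observed value to a typed outcome for one scope dimension."""
    if observed <= threshold:
        outcome = "pass"
    elif dimension in HARD_FAIL_DIMENSIONS or merge_blocking:
        outcome = "fail"
    else:
        outcome = "warn"
    return ScopeResult(scope_id, dimension, threshold, observed, outcome)


def to_log_line(result: ScopeResult) -> str:
    """Serialise one result as a single-line JSON record."""
    return json.dumps(asdict(result), sort_keys=True)


attr = gate("exeter_parcels", "attribute", 0.05, 0.08, merge_blocking=False)
crs = gate("exeter_parcels", "crs", 0, 3, merge_blocking=False)
print(to_log_line(attr))
print(to_log_line(crs))
```

Emitting one line per dimension keeps the records easy to aggregate, so a rising warn rate on a single dimension is visible before it becomes a hard failure.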

Common Failure Modes and Gotchas

CRS checked last, or not at all. If extent and topology tolerances are evaluated before the SRID contract, a dataset in the wrong CRS produces tolerances measured in the wrong unit: a 0.01 metre snap distance applied to decimal degrees becomes 0.01°, roughly 1.1 km at the equator, and silently passes almost everything. Always gate CRS first.
Exact-equality coordinate comparisons. Comparing ordinates with == instead of a relative epsilon fails on the small drift introduced by reprojection or by text serialisation such as WKT and GeoJSON, even though a pure WKB or GeoParquet round trip preserves the doubles. Use set_precision plus a Hausdorff bound keyed to the scope epsilon.
Envelope filter without a snap buffer. Features that touch the scope boundary are dropped or double-counted when the envelope has zero buffer. A small buffer, up to 1e-4 in a degree-based scope (about 11 m at the equator), absorbs digitising noise at the edge.
Sliver polygons below the area floor flagged as topology errors. Snap-to-grid operations create zero- or near-zero-area slivers; scoping must collapse them rather than hard-fail, or every run drowns in false positives.
Anti-meridian and polar extents. A bbox spanning ±180° longitude wraps incorrectly under naive containment, and polar CRSs distort area thresholds. Scopes covering these regions need an explicit wrap-aware envelope and projected-area validation.
Unpinned PROJ/GEOS in the runner. A floating dependency version changes axis order or topology snapping between runs, breaking the reproducibility the whole scoping discipline exists to guarantee. Pin the container.
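
For the anti-meridian case in particular, a wrap-aware envelope can be as simple as splitting the box in two. The sketch below assumes longitude/latitude degrees and treats a box whose west edge is greater than its east edge as crossing ±180°, the convention RFC 7946 uses for GeoJSON bounding boxes; the function names and the sample extent are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (west, south, east, north) in degrees


def split_antimeridian(west: float, south: float, east: float, north: float) -> List[Box]:
    """Return one box, or two when the extent wraps across +/-180 degrees."""
    if west <= east:
        return [(west, south, east, north)]
    # west > east signals a wrap: cover [west, 180] and [-180, east] separately.
    return [(west, south, 180.0, north), (-180.0, south, east, north)]


def lon_lat_in_scope(lon: float, lat: float, boxes: List[Box]) -> bool:
    """Containment test against any of the non-wrapping boxes."""
    return any(w <= lon <= e and s <= lat <= n for w, s, e, n in boxes)


# An extent from 177 degrees E to 178 degrees W, crossing the anti-meridian.
boxes = split_antimeridian(177.0, -20.0, -178.0, -15.0)
print(boxes)                                  # [(177.0, -20.0, 180.0, -15.0), (-180.0, -20.0, -178.0, -15.0)]
print(lon_lat_in_scope(179.5, -17.0, boxes))  # True
print(lon_lat_in_scope(-179.0, -17.0, boxes)) # True
print(lon_lat_in_scope(0.0, -17.0, boxes))    # False
```

A naive `west <= lon <= east` test on the original box would reject every point, because no longitude is both at least 177 and at most -178.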

Conclusion

Scoping rules convert geospatial validation from an unpredictable, memory-heavy scan into a deterministic, pipeline-native contract: a versioned matrix of extent, attribute, CRS, and topology dimensions, each with an explicit tolerance and a typed gate outcome. Get the CRS identity check, the relative-epsilon comparisons, and the lazy index-driven clip right, and the result is faster CI, bounded memory, and spatial data that fails before it ever reaches production. For the wider architecture these scopes plug into — fixtures, the test pyramid, and CI gating — return to Geospatial QA Fundamentals & Architecture.

Understanding the GIS Test Pyramid — where scoped checks sit in the validation hierarchy.
Spatial Assertion Types Explained — the predicate taxonomy a scope selects from.
Security Boundaries in Spatial QA — restricting scopes to sanitised, non-sensitive extents.
Mocking Geospatial Data for Tests — synthetic fixtures that reproduce each scope.
Automating CRS validation in CI pipelines — the deterministic CRS gate this page’s contract depends on.