Edge Case Spatial Data Creation

10 min read

Edge case spatial data creation is the practice of deliberately synthesising the geometries that break naive validators — zero-area slivers, self-intersecting rings, near-coincident vertices, anti-meridian crossings, and empty or NaN-bearing coordinates — so that a pipeline’s failure handling is proven before a real dataset triggers it in production. As one of the three generation modalities under Test Data Generation & Mocking Strategies, it complements synthetic vector data generation, which produces nominal features, by producing the deliberately malformed ones. Where nominal fixtures confirm the happy path, edge-case fixtures confirm that make_valid, snap-to-grid, and rejection routing behave deterministically. This page covers the taxonomy of spatial edge cases, a config-driven generator built on the Shapely 2.x API, its PostGIS counterparts, how the suite slots into memory-safe CI gates, and the failure modes that quietly let degenerate geometry slip through.

Why Edge Cases Need a Generator, Not a Snapshot

Harvesting malformed geometry from production extracts is non-reproducible: the broken feature that crashed yesterday’s run may be cleaned, re-projected, or deleted before you can pin it as a regression fixture. A generator instead encodes the defect class as code, seeds it with a fixed PRNG state, and emits byte-identical output on every runner. This makes edge cases first-class artifacts that version alongside the validator they exercise, exactly as nominal fixtures do. The discipline decomposes into five families, each with its own synthesis strategy and tolerance policy.

Taxonomy and Tolerance Strategy

The table below maps each edge-case family to how it is generated, the tolerance strategy a downstream validator should apply, a typical threshold range, and the CRS units that threshold is expressed in. Thresholds are unit-sensitive: a 0.001 snap distance means one millimetre in EPSG:32618 (metres) but roughly 111 metres in EPSG:4326 (decimal degrees), so every threshold must be paired with the CRS that gives it meaning.

Edge-case family	How it is generated	Tolerance strategy	Typical threshold	CRS units
Degenerate (zero-area, zero-length)	Collapse vertices to a line or point; emit a one- or two-point ring	Area / length floor below which the feature is collapsed, not failed	$A_{min}\in[10^{-9},10^{-6}]$	projected (m², m)
Topology-violating (self-intersection, bowtie)	Reorder ring vertices to cross; overlap two rings	DE-9IM predicate check, then `make_valid` repair	exact predicate	CRS-independent
Precision-truncated	Round ordinates to a coarse grid; inject ULP-level drift	Relative epsilon on coordinate equality	$\varepsilon\in[10^{-9},10^{-6}]$	CRS map units
Boundary-condition (anti-meridian, pole)	Place vertices at $\pm180°$ longitude or $\pm90°$ latitude	Wrap-aware envelope; projected-area check	$\pm180°,\pm90°$	degrees
Empty / null	Emit empty `GeometryCollection`; set ordinate to `NaN`/`Inf`	Explicit empty/finite guard before any predicate	n/a	n/a

The relative-epsilon strategy used for precision-truncated cases compares two ordinates $a$ and $b$ as equal when

|a - b| \le \varepsilon \cdot \max(|a|,\,|b|,\,1)

which keeps the comparison stable across the wide dynamic range of projected coordinates, where a fixed absolute tolerance would be too loose near the origin and too tight far from it. For geometry-level drift after a snap or round-trip, bound the Hausdorff distance $d_H(G, G')$ rather than comparing vertices pairwise, since snapping can add or remove vertices while keeping the shape within tolerance.

Degenerate Geometries

A degenerate geometry has collapsed below the dimensionality its type implies: a polygon whose ring encloses zero area, or a line whose two endpoints are coincident. The defining test is an area or length floor, not validity — a zero-area polygon can be topologically valid yet semantically meaningless. Generate the canonical case by feeding a Polygon constructor a ring whose vertices are collinear, then assert the floor with geom.area < A_min.

import shapely  # Shapely 2.x

collapsed = shapely.Polygon([(0, 0), (1, 1), (2, 2), (0, 0)])  # collinear → zero area
assert collapsed.area == 0.0  # degenerate, even though is_valid may be True

Topology Violations

Topology violations are the geometries that fail the OGC Simple Features validity model: self-intersecting linestrings, “bowtie” polygons whose exterior ring crosses itself, rings wound in the wrong orientation, and duplicate consecutive vertices. These are the cases make_valid exists to repair, so the fixture must reliably trip is_valid.

bowtie = shapely.Polygon([(0, 0), (1, 1), (1, 0), (0, 1), (0, 0)])  # exterior crosses itself
assert not shapely.is_valid(bowtie)
repaired = shapely.make_valid(bowtie)  # → MultiPolygon of the two triangles

Precision-Truncated Coordinates

Precision truncation models what happens when coordinates round-trip through a coarse serialisation grid (GeoJSON’s default six decimals, a shapefile’s float bound) or accumulate floating-point drift across transformations. Generate it with shapely.set_precision, which snaps ordinates to a fixed grid size and can itself collapse near-coincident vertices — exactly the artifact you want to test against.

fine = shapely.Polygon([(0, 0), (0, 1e-7), (1, 1), (1, 0), (0, 0)])
snapped = shapely.set_precision(fine, grid_size=1e-3)  # near-coincident edge collapses

Boundary Conditions

Boundary-condition cases sit on the discontinuities of the coordinate system itself: the $\pm180°$ anti-meridian, the poles, and the valid-range limits of a projected CRS. A polygon spanning the dateline wraps incorrectly under naive longitude containment, so generate it explicitly with vertices on both sides of $\pm180°$ and validate it with a wrap-aware envelope rather than a min/max bounding box.

antimeridian = shapely.Polygon([(179, 0), (-179, 0), (-179, 1), (179, 1), (179, 0)])
# minx/maxx span ~358° — a naive bbox filter will treat this as global

Empty and Non-Finite Geometries

The final family covers geometries that carry no usable coordinates: empty GeometryCollection objects, and ordinates set to NaN or Inf. These crash predicates that assume finite input, so a robust validator must guard for them before any spatial operation. Generate the empty case directly and the non-finite case with a NumPy coordinate array.

import numpy as np
empty = shapely.GeometryCollection()
nan_pt = shapely.points(np.array([[np.nan, np.nan]]))[0]
assert empty.is_empty and not np.isfinite(nan_pt.x)

Production-Grade Generator

The generator below is a deterministic factory that reads its tolerance matrix from config, emits one fixture per edge-case family, and gates each output to confirm it is expectedly invalid before writing it. It uses the Shapely 2.x top-level functional API throughout and a seeded numpy.random.Generator for reproducibility. Pair it with pytest-geo to drive the suite.

# edge_case_factory.py — Shapely 2.x, NumPy-backed, deterministic
from dataclasses import dataclass
import numpy as np
import shapely
import yaml


@dataclass(frozen=True)
class Tolerances:
    area_floor: float        # A_min — collapse below this
    grid_size: float         # set_precision snap grid
    rel_epsilon: float       # relative ordinate equality

    @classmethod
    def from_config(cls, path: str) -> "Tolerances":
        cfg = yaml.safe_load(open(path))["tolerance"]
        return cls(cfg["area_floor"], cfg["grid_size"], cfg["rel_epsilon"])


def make_edge_cases(tol: Tolerances, seed: int = 1729) -> dict[str, object]:
    """Return one deterministic fixture per edge-case family."""
    rng = np.random.default_rng(seed)  # seeded → byte-identical output
    jitter = rng.uniform(0, tol.grid_size / 4)  # sub-grid noise, snapped away below

    cases = {
        "degenerate": shapely.Polygon([(0, 0), (1, 1), (2, 2), (0, 0)]),
        "bowtie": shapely.Polygon([(0, 0), (1, 1), (1, 0), (0, 1), (0, 0)]),
        "precision": shapely.set_precision(
            shapely.Polygon([(0, 0), (jitter, 1e-7), (1, 1), (1, 0), (0, 0)]),
            grid_size=tol.grid_size,
        ),
        "antimeridian": shapely.Polygon(
            [(179, 0), (-179, 0), (-179, 1), (179, 1), (179, 0)]
        ),
        "empty": shapely.GeometryCollection(),
    }
    return cases


def classify(geom, tol: Tolerances) -> str:
    """Deterministic bucket used for downstream routing and structured logs."""
    if geom.is_empty:
        return "empty"
    if geom.area < tol.area_floor and geom.geom_type.endswith("Polygon"):
        return "degenerate"
    if not shapely.is_valid(geom):
        return "topology-violating"
    minx, _, maxx, _ = geom.bounds
    if maxx - minx > 180:
        return "boundary-condition"
    return "nominal"

The matching pytest module asserts each fixture lands in its expected bucket — a regression here means the generator silently produced a valid geometry and the edge case is no longer being exercised.

# test_edge_cases.py — pytest 7+
import pytest
from edge_case_factory import Tolerances, make_edge_cases, classify

TOL = Tolerances.from_config("tolerances.yaml")

@pytest.fixture(scope="session")
def cases():
    return make_edge_cases(TOL, seed=1729)

@pytest.mark.parametrize("name,bucket", [
    ("degenerate", "degenerate"),
    ("bowtie", "topology-violating"),
    ("antimeridian", "boundary-condition"),
    ("empty", "empty"),
])
def test_case_lands_in_expected_bucket(cases, name, bucket):
    assert classify(cases[name], TOL) == bucket  # fails if case became valid

def test_factory_is_deterministic():
    a = make_edge_cases(TOL, seed=1729)
    b = make_edge_cases(TOL, seed=1729)
    assert all(a[k].equals(b[k]) for k in a)  # byte-identical across runs

The companion tolerances.yaml keeps the thresholds out of code so they can be tuned per CRS without touching the generator:

# tolerances.yaml
crs_authority: "EPSG:32618"   # UTM 18N — metres, so floors are in m²/m
tolerance:
  area_floor: 1.0e-9          # collapse polygons below 1 nm²
  grid_size: 1.0e-3           # snap to the millimetre
  rel_epsilon: 1.0e-9         # relative ordinate equality

PostGIS and Database-Side Counterparts

When the pipeline validates inside a database rather than in process, the same edge cases must be generated and checked server-side so the SQL gate matches the Python one. PostGIS exposes the equivalent validity model through ST_IsValid, the human-readable ST_IsValidReason, and the repair routine ST_MakeValid, mirroring Shapely’s make_valid.

-- Generate and inspect a self-intersecting polygon server-side
WITH bowtie AS (
  SELECT ST_GeomFromText(
    'POLYGON((0 0, 1 1, 1 0, 0 1, 0 0))', 32618
  ) AS geom
)
SELECT
  ST_IsValid(geom)                AS valid,        -- false
  ST_IsValidReason(geom)          AS reason,       -- 'Self-intersection[...]'
  ST_Area(ST_MakeValid(geom))     AS repaired_area
FROM bowtie;

For the degenerate floor and anti-meridian families, the same classification logic lives in SQL — ST_Area(geom) < :area_floor for collapse, and ST_XMax(geom) - ST_XMin(geom) > 180 to flag a dateline-spanning extent (normalise first with ST_ShiftLongitude where the source uses a 0–360° convention). Run these through psycopg2 with parameterised tolerances so the threshold matrix is shared with the Python generator rather than duplicated:

import psycopg2

CHECK = """
SELECT ST_IsValid(g), ST_Area(g) < %(area_floor)s
FROM (SELECT ST_GeomFromWKB(%(wkb)s, 32618) AS g) s;
"""

with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
    cur.execute(CHECK, {"wkb": geom.wkb, "area_floor": 1e-9})
    is_valid, is_degenerate = cur.fetchone()

Pipeline Integration

Edge-case generation is a first-class pipeline stage, not an ad-hoc script. Each run loads the version-pinned tolerance matrix, hashes it into the artifact manifest, and emits a structured log so a failure is traceable to the exact config and seed that produced it. The synthesised fixtures feed both synthetic vector workflows — to build multi-layer GeoPackages that mix nominal and malformed features — and raster mocking techniques, where degenerate boundaries and precision limits propagate into nodata handling and extent clipping.

Memory safety is the constraint that shapes the whole stage. High-cardinality edge-case synthesis routinely triggers out-of-memory conditions in naive implementations, so the generator yields features iteratively rather than materialising a full collection, reads and writes through chunked I/O with pyogrio or fiona, and runs inside a container with a hard memory cgroup and swap disabled — so an allocation defect surfaces as a deterministic OOM kill instead of being masked by OS paging. Each CI run produces an immutable manifest carrying:

the configuration hash and tolerance-matrix snapshot;
the PRNG seed and generation timestamp;
per-bucket validation counts (how many degenerate, topology-violating, precision-truncated, boundary, and empty cases were emitted);
peak resident memory and batch throughput.

Pin libgeos and PROJ in the runner image: a floating dependency version can change axis order or topology snapping between runs, which silently alters whether a borderline fixture is classified as valid and breaks the reproducibility the whole stage exists to guarantee. Wire the suite into the same gate the rest of Test Data Generation & Mocking Strategies uses, running it on every PR that touches a geospatial transformation module.

Common Failure Modes and Gotchas

The edge case quietly becomes valid. A refactor of the generator — or a bumped GEOS — can turn a bowtie into a valid MultiPolygon, so the fixture stops exercising the failure path. Always assert the expected invalidity (assert not is_valid), not just that the geometry exists.
Tolerance threshold without a CRS. An area_floor of 1e-9 is one square nanometre in metres but a vast region in decimal degrees. Pin the crs_authority next to every threshold and validate that fixtures are generated in the CRS the floor assumes.
set_precision collapsing more than intended. Snapping to a coarse grid_size can merge vertices you meant to keep distinct, emitting an empty or degenerate result instead of the precision-truncated one. Verify the snapped geometry’s type after generation.
Anti-meridian handled with a naive bounding box. A polygon spanning $\pm180°$ produces a minx/maxx envelope ~358° wide, so a plain bbox filter treats it as global and either drops or double-counts it. Use a wrap-aware envelope and normalise with ST_ShiftLongitude server-side.
NaN/Inf ordinates reaching a predicate. Non-finite coordinates crash or silently corrupt DE-9IM evaluation. Guard with np.isfinite and an is_empty check before any spatial operation, and route non-finite input to explicit rejection.
Materialising the whole fixture set in memory. Building every edge case into one GeoDataFrame for convenience defeats the memory-safety goal. Yield features iteratively and stream them to a GeoPackage so the OOM behaviour you are testing for is real.

Conclusion

Edge case spatial data creation turns the geometries that crash production into deterministic, version-controlled fixtures: five defect families, each generated from a seeded factory, classified against a CRS-aware tolerance matrix, and gated in memory-safe CI. Get the expected-invalidity assertions, the relative-epsilon comparisons, and the pinned GEOS/PROJ runtime right, and your validators are proven against the dateline, the sliver, and the NaN before a real dataset finds them first. For the wider generation discipline these fixtures belong to — nominal vectors, mock rasters, and the controls that govern them — return to Test Data Generation & Mocking Strategies.

Synthetic Vector Data Generation — the nominal-feature counterpart these malformed fixtures are mixed with.
Raster Mocking Techniques — where precision limits and degenerate boundaries propagate into nodata and clipping tests.
Spatial Assertion Types Explained — the predicate taxonomy that classifies each edge case.
Setting Up Spatial Tolerance Thresholds in Assertions — how the tolerance matrix in this page is consumed downstream.
How to Structure pytest-geo for Large Shapefiles — wiring the generator into a memory-safe test suite.