Test Data Generation & Mocking Strategies

Test data generation and mocking is the engineering discipline that produces deterministic, schema-constrained spatial fixtures so that every geometric assertion, projection, and spatial index operation can be validated without touching production extracts. It is the supply chain for everything else in geospatial QA: this discipline sits at the root of geospatial data testing alongside the sibling disciplines of Geospatial QA Fundamentals & Architecture and Spatial Test Pattern Design & Implementation, feeding both with reproducible inputs. Traditional QA that relies on static production snapshots or hand-crafted fixtures introduces non-determinism, coordinate drift, and hidden topological defects that surface only at scale. A rigorous implementation enforces deterministic seeding, explicit tolerance boundaries, and schema-first mocking so that engineering teams can validate spatial logic across ephemeral environments with byte-level reproducibility.

What This Discipline Covers

In the context of a spatial data pipeline, test data generation is the stage that materialises inputs before validation runs: it converts a declarative contract (geometry types, CRS, attribute schema, edge-case requirements) into concrete fixtures that flow into unit, integration, and end-to-end stages. It is distinct from validation itself — the assertions live in Spatial Test Pattern Design & Implementation — and distinct from the mocking of connections and services, which is covered under mocking geospatial data for tests. Here the concern is the data itself: synthetic vectors, mock rasters, and deliberately malformed geometries, each generated programmatically and versioned as a first-class artifact.

The discipline decomposes into three generation modalities and a set of cross-cutting controls. The modalities are Synthetic Vector Data Generation for points, lines, and polygons; Raster Mocking Techniques for multi-band and multi-dimensional arrays; and Edge Case Spatial Data Creation for the degenerate geometries that break naive validators. The cross-cutting controls — precision policy, fixture versioning, CI/CD integration, and governance — apply uniformly across all three.

Architectural Foundations for Spatial Test Data

Production spatial testing requires a fundamental shift from ad-hoc sample files to programmatic fixture generation. The architecture must decouple test data creation from validation logic, treating spatial mocks as versioned artifacts with explicit metadata contracts. That separation of concerns rests on three non-negotiable principles:

Deterministic Seeding: All coordinate generation, attribute sampling, and topology construction must derive from fixed random seeds or closed-form mathematical generators. This guarantees identical outputs across local workstations, GitHub Actions runners, and staging clusters.
Schema-First Mocking: GeoJSON, (Geo)Parquet, PostGIS, and GDAL-compatible formats require strict schema enforcement before geometry construction. Attribute types, nullable constraints, and spatial reference identifiers (SRIDs) are validated prior to geometry instantiation to prevent silent type coercion during serialization.
Ephemeral Fixture Lifecycle: Test data should be generated in-memory or within isolated temporary directories, with explicit teardown hooks to prevent disk bloat and cross-test contamination. Memory-mapped arrays and GDAL Virtual Datasets (VRTs) are strongly preferred for large vector/raster mocks to minimise I/O overhead.

The diagram above shows the canonical flow: a schema contract and a seed enter the generator, which fans out to the three modalities, converges on an ephemeral fixture, and is validated against its own contract before teardown. The remaining sections walk each stage in production detail.
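
The schema-first principle can be illustrated with a minimal sketch in plain Python. The `FixtureContract` class, its field names, and `validate_record` are hypothetical names introduced here for illustration; the point is only that attribute types and nullability are checked before any geometry is built.

```python
from dataclasses import dataclass

# Hypothetical contract shape: field names are illustrative, not a standard.
@dataclass(frozen=True)
class FixtureContract:
    geometry_type: str   # e.g. "Point"
    srid: int            # EPSG code, always explicit
    attributes: dict     # column name -> (python type, nullable)
    seed: int

def validate_record(contract: FixtureContract, record: dict) -> list[str]:
    """Return the contract violations for one attribute record."""
    errors = []
    for name, (py_type, nullable) in contract.attributes.items():
        if name not in record:
            errors.append(f"missing attribute: {name}")
            continue
        value = record[name]
        if value is None:
            if not nullable:
                errors.append(f"null not allowed: {name}")
        elif type(value) is not py_type:  # strict check: no silent coercion
            errors.append(f"wrong type for {name}: {type(value).__name__}")
    extra = set(record) - set(contract.attributes)
    errors.extend(f"unexpected attribute: {name}" for name in sorted(extra))
    return errors

contract = FixtureContract(
    geometry_type="Point", srid=32618,
    attributes={"id": (int, False), "name": (str, True)}, seed=42,
)
ok = validate_record(contract, {"id": 1, "name": None})
bad = validate_record(contract, {"id": "1"})
```

A generator built this way would refuse to construct geometry for any record whose violation list is non-empty.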

Deterministic Vector Synthesis

The first generation stage is vector synthesis, and it is where Synthetic Vector Data Generation provides the foundational patterns for deterministic point, line, and polygon fixtures that align with production indexing strategies. Vector mocking requires precise control over geometry primitives, coordinate sequences, and spatial-index behaviour. Production implementations typically lean on Shapely 2.x and GeoPandas 0.14+ to construct geometries programmatically while bypassing file-I/O bottlenecks, enforcing CRS consistency, validating topology at generation time, and simulating realistic attribute distributions without sanitised production extracts.

When architecting synthetic vector pipelines, engineers must prioritise coordinate precision, bounding-box constraints, and spatial-index alignment. Key implementation considerations include:

WKT/GeoJSON serialization control: Avoid implicit float truncation by explicitly fixing decimal precision during serialization. Use shapely.set_precision() on a defined grid size before shapely.to_wkt() so the same precision model governs both construction and output, preventing the precision loss that silently breaks spatial joins.
CRS & SRID enforcement: Mock geometries must carry explicit EPSG codes. Relying on an implicit None CRS causes downstream projection failures in pyproj and PostGIS ST_Transform. A GeoDataFrame constructed without crs= should be treated as a generation error, not a warning.
Spatial-index pre-alignment: Generate bounding boxes that intentionally overlap or align with the production R-tree / STRtree partitioning scheme to exercise index-pruning logic under CI conditions rather than only in production.

The snippet below shows the version-correct Shapely 2.x precision idiom — note that set_precision returns a new geometry and is applied before serialization, not after:

import numpy as np
from shapely import Point, set_precision, to_wkt

rng = np.random.default_rng(42)               # deterministic seed
xs = rng.uniform(500_000, 600_000, 1_000)     # EPSG:32618 easting (metres)
ys = rng.uniform(4_500_000, 4_600_000, 1_000) # northing (metres)

# 1 mm grid: every coordinate snaps to the same precision model the
# production pipeline uses, so WKT round-trips are byte-stable.
geoms = [set_precision(Point(x, y), grid_size=0.001) for x, y in zip(xs, ys)]
wkt = [to_wkt(g, rounding_precision=3) for g in geoms]

Vector outputs frequently double as ground truth for raster validation, so they are generated first and their extents and seeds are shared downstream — covered next.
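
The byte-stability claim for seeded vector fixtures can be checked with a hash comparison. The sketch below is a standard-library stand-in, not the Shapely pipeline itself: it uses `random.Random` and fixed-width string formatting in place of `set_precision` and `to_wkt`, and the function names are hypothetical.

```python
import hashlib
import random

def synth_point_wkt(seed: int, n: int, decimals: int = 3) -> list[str]:
    """Generate n POINT WKT strings in a UTM-like extent from a fixed seed."""
    rng = random.Random(seed)  # isolated generator, no global state
    rows = []
    for _ in range(n):
        x = rng.uniform(500_000, 600_000)
        y = rng.uniform(4_500_000, 4_600_000)
        # Fixed-width formatting pins the serialised precision explicitly.
        rows.append(f"POINT ({x:.{decimals}f} {y:.{decimals}f})")
    return rows

def digest(rows: list[str]) -> str:
    """SHA-256 over the newline-joined WKT rows."""
    return hashlib.sha256("\n".join(rows).encode("utf-8")).hexdigest()

run_a = digest(synth_point_wkt(seed=42, n=1_000))
run_b = digest(synth_point_wkt(seed=42, n=1_000))
run_c = digest(synth_point_wkt(seed=43, n=1_000))
```

Two runs with the same seed should produce the same digest, and a different seed should not; the same comparison applies to WKT emitted by the Shapely snippet above.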

Raster & Multi-Dimensional Mocking

The second stage is raster synthesis, and Raster Mocking Techniques covers production-ready approaches for simulating multi-band imagery, DEMs, and time-series stacks without storage penalties. Raster mocking introduces complexity that vector mocking does not: band alignment, affine transform matrices, compression codecs, nodata handling, and CRS projection artifacts. Unlike vectors, raster fixtures require deterministic pixel arrays, an explicit geotransform, and metadata sidecars to simulate real sensor outputs.

Engineers should generate raster mocks with rasterio and xarray, constructing in-memory datasets with controlled dimensions, dtypes, and chunking. Critical patterns include:

Affine-transform determinism: Hardcode the Affine transform so pixel-to-world mapping is identical across runs. The transform — origin, pixel size, rotation — is the raster equivalent of a CRS contract and must be pinned, never inferred.
Compression & tiling simulation: Mock tiled outputs using the compression codecs seen in production (LZW, DEFLATE, ZSTD) to validate read/write throughput and memory footprint under constrained CI runner resources, where an uncompressed full-resolution array would exhaust the runner.
Chunked I/O validation: Use rasterio.windows to verify that spatial clipping and windowed reads behave identically across local and distributed execution.
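
The pixel-to-world mapping that a pinned transform fixes can be written out by hand. This is a sketch for the north-up, rotation-free case only, using the same origin and pixel size as a `from_origin(500_000, 4_600_000, 10.0, 10.0)` transform; the helper names are hypothetical.

```python
import math

def pixel_to_world(col: float, row: float, west: float, north: float,
                   xsize: float, ysize: float) -> tuple[float, float]:
    """Map a pixel's upper-left corner to world coordinates (no rotation)."""
    return west + col * xsize, north - row * ysize

def world_to_pixel(x: float, y: float, west: float, north: float,
                   xsize: float, ysize: float) -> tuple[int, int]:
    """Inverse mapping: the (col, row) of the pixel containing (x, y)."""
    return math.floor((x - west) / xsize), math.floor((north - y) / ysize)

WEST, NORTH, PX = 500_000.0, 4_600_000.0, 10.0   # 10 m pixels
origin = pixel_to_world(0, 0, WEST, NORTH, PX, PX)
far_corner = pixel_to_world(256, 256, WEST, NORTH, PX, PX)
cell = world_to_pixel(501_285.0, 4_598_715.0, WEST, NORTH, PX, PX)
```

Because every term is a constant, any change in the resulting coordinates between runs points at the transform, not at the pixel data.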

The example below builds a deterministic two-band raster entirely in memory using a MemoryFile, with a pinned transform and an explicit CRS — no disk artifact is created:

import numpy as np
import rasterio
from rasterio.io import MemoryFile
from rasterio.transform import from_origin

rng = np.random.default_rng(7)
band = rng.integers(0, 256, size=(256, 256), dtype="uint8")  # high is exclusive: values 0..255
transform = from_origin(500_000, 4_600_000, 10.0, 10.0)  # 10 m pixels, pinned

profile = dict(driver="GTiff", height=256, width=256, count=2,
               dtype="uint8", crs="EPSG:32618", transform=transform,
               compress="deflate", tiled=True, blockxsize=128, blockysize=128)

with MemoryFile() as mem:
    with mem.open(**profile) as dst:
        dst.write(band, 1)
        dst.write(band[::-1, :], 2)   # second band: deterministic mirror
    with mem.open() as src:
        assert src.transform == transform   # transform survives round-trip

Because the raster shares its seed and extent with the vector stage, cross-modal checks — rasterising a synthetic polygon layer and asserting zonal-statistic parity — become deterministic, which feeds directly into the parity strategy in cross-format parity testing.

Edge Cases & Topological Stress Testing

The third stage deliberately manufactures failure. Edge Case Spatial Data Creation details methodologies for synthesising sliver polygons, precision-collapse scenarios, and topology violations, and it pairs with the assertion side of geometry validation patterns and topology rule enforcement. Real-world spatial logic routinely fails on degenerate geometries, self-intersections, invalid rings, and CRS boundary crossings. Relying solely on happy-path fixtures masks critical vulnerabilities in topology validation and spatial-predicate evaluation, so stress fixtures must intentionally violate OGC Simple Features validity rules to prove that validators catch and route errors correctly.

Implementation priorities include:

Self-intersecting & bowtie polygons: Generate geometries that trigger ST_IsValid / GEOS validity failures to verify error-handling middleware actually rejects them rather than serialising garbage downstream.
CRS boundary & dateline crossings: Mock features spanning ±180° longitude or near-polar latitudes to validate coordinate wrapping and reprojection edge cases that silently corrupt area and length calculations.
Precision loss & floating-point drift: Introduce sub-millimetre coordinate perturbations to test tolerance thresholds in spatial joins and overlay operations.
Empty and mixed-dimension geometries: Emit empty geometries, GEOMETRYCOLLECTIONs, and mixed Z/M coordinate features to confirm that serializers and validators degrade gracefully rather than raising opaque exceptions.
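
Two of these stress fixtures can be sketched with Shapely 2.x. The `route` function is a hypothetical stand-in for error-handling middleware, and the coordinates are arbitrary.

```python
from shapely import Polygon, is_valid_reason

# Bowtie: the ring crosses itself at (1, 1), violating OGC validity.
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2), (0, 0)])
# Empty geometry: carries no coordinates at all.
empty = Polygon()

def route(geom) -> str:
    """Hypothetical router: classify a fixture before it reaches a serializer."""
    if geom.is_empty:
        return "empty"
    if not geom.is_valid:
        return "rejected"
    return "accepted"

results = {"bowtie": route(bowtie), "empty": route(empty)}
reason = is_valid_reason(bowtie)  # GEOS explanation of the validity failure
```

A validator test would then assert that the bowtie is routed to rejection and that the empty geometry is handled on its own path, not passed on as ordinary input.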

The anti-meridian case is the canonical trap and is worth generating explicitly. A LineString from 179.9° to -179.9° is a ~0.2° hop in reality but a ~359.8° span if longitude wrapping is mishandled — a fixture that asserts the difference is exactly the kind of artifact this stage must produce.
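
The two numbers in that example can be reproduced with a few lines of arithmetic. This is a minimal sketch with hypothetical function names; it handles longitude only and ignores latitude and geodesic distance.

```python
import math

def naive_lon_span(lon_a: float, lon_b: float) -> float:
    """Longitude difference with no wrapping: the mishandled case."""
    return abs(lon_b - lon_a)

def wrapped_lon_delta(lon_a: float, lon_b: float) -> float:
    """Signed shortest longitude difference in degrees, in [-180, 180)."""
    return (lon_b - lon_a + 180.0) % 360.0 - 180.0

naive = naive_lon_span(179.9, -179.9)       # about 359.8 degrees
wrapped = wrapped_lon_delta(179.9, -179.9)  # about 0.2 degrees
```

An adversarial fixture for this case would pair the LineString with both values so a test can assert that the code under test reports the wrapped one.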

Precision and Tolerance Policy

Every generation modality shares one cross-cutting concern: how much numerical deviation is acceptable, and in what units. Spatial fixtures must never be compared with exact floating-point equality; instead, every assertion that consumes a fixture works against an explicit tolerance envelope, configured centrally and pinned per CRS. The policy below is the contract that synthetic generators emit and that the validators in spatial assertion types explained consume.

| Concern | Geographic CRS (EPSG:4326, degrees) | Projected CRS (EPSG:32618, metres) | Notes |
| --- | --- | --- | --- |
| XY snap / grid size | `1e-7` (~1.1 cm at equator) | `1e-3` (1 mm) | Applied via `set_precision` at generation time |
| Round-trip reprojection epsilon | `1e-7` deg | `1e-3` m | Validated after `to_crs()` round-trips |
| Minimum polygon area | `1e-9` deg² | `0.5` m² | Below this, drop as a sliver artifact |
| Hausdorff tolerance (shape parity) | `1e-6` deg | `0.01` m | Shape similarity, not vertex equality |

Tolerance math is bounded, not aspirational. For a generated geometry $g$ and its round-tripped counterpart $g'$ after a forward/inverse CRS transform, the fixture is accepted when the directed Hausdorff distance stays inside the configured envelope $\varepsilon$:

$$d_H(g, g') = \max_{a \in g}\, \min_{b \in g'}\, \lVert a - b \rVert \;\le\; \varepsilon$$

For area-based parity between a synthetic source and its rasterised or reprojected derivative, the relative area delta is bounded by a separate threshold $\tau$:

$$\delta_A = \frac{\lvert A_{\text{out}} - A_{\text{in}} \rvert}{A_{\text{in}}} \;\le\; \tau$$
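
Both acceptance checks can be sketched in plain Python. This version is an approximation: it evaluates the distance over vertices only, not over every point of each geometry, and the drift values and thresholds are illustrative. In a Shapely-based pipeline, `shapely.hausdorff_distance` would be the natural library call.

```python
import math

def directed_hausdorff(a_pts, b_pts) -> float:
    """Max over a of the min over b of Euclidean distance (vertices only)."""
    return max(min(math.dist(a, b) for b in b_pts) for a in a_pts)

def shoelace_area(ring) -> float:
    """Planar polygon area from a ring of (x, y) vertices."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def relative_area_delta(area_in: float, area_out: float) -> float:
    """|A_out - A_in| / A_in, matching the delta_A definition."""
    return abs(area_out - area_in) / area_in

g = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
g_prime = [(x + 0.0004, y - 0.0003) for x, y in g]  # simulated round-trip drift

d_h = directed_hausdorff(g, g_prime)
delta_a = relative_area_delta(shoelace_area(g), shoelace_area(g_prime))
accepted = d_h <= 1e-3 and delta_a <= 1e-6  # illustrative epsilon and tau
```

Here the drift is a pure translation of 0.5 mm, so the distance check is the one under pressure while the area delta stays near zero.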

CRS units matter because the same numeric tolerance means radically different things in degrees versus metres: a 1e-3 tolerance is a millimetre in EPSG:32618 but roughly 111 metres of latitude in EPSG:4326. The policy table is therefore always indexed by CRS, never global.
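
One way to keep the policy indexed by CRS is a lookup with no fallback. The sketch below copies its values from the policy table; the `TolerancePolicy` class, the dictionary layout, and the metres-per-degree constant (an equatorial approximation) are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TolerancePolicy:
    grid_size: float       # XY snap, in CRS units
    roundtrip_eps: float   # reprojection epsilon, in CRS units
    min_area: float        # sliver threshold, in squared CRS units
    hausdorff: float       # shape-parity tolerance, in CRS units

# Values copied from the policy table; the dict layout is illustrative.
POLICIES = {
    "EPSG:4326": TolerancePolicy(1e-7, 1e-7, 1e-9, 1e-6),
    "EPSG:32618": TolerancePolicy(1e-3, 1e-3, 0.5, 0.01),
}

def policy_for(crs: str) -> TolerancePolicy:
    """Look up the policy for a CRS; there is deliberately no global default."""
    try:
        return POLICIES[crs]
    except KeyError:
        raise ValueError(f"no tolerance policy pinned for {crs}") from None

METRES_PER_DEGREE = 111_320.0  # approximate, along the equator
deg_tol_in_metres = 1e-3 * METRES_PER_DEGREE
```

Raising on an unknown CRS forces a new projection to be given explicit tolerances before any fixture can be generated in it.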

Version pinning is the other half of precision determinism. GEOS, PROJ, and GDAL evolve their geometry and projection algorithms between releases, so an unpinned runtime can change a fixture’s coordinates without any code change. Pin the native stack (libgeos, PROJ, GDAL) at the container level and the Python bindings (shapely==2.x, geopandas>=0.14, pyproj, rasterio) in pyproject.toml, and record the resolved versions in the fixture manifest so a divergence is attributable to a library bump rather than a logic regression.
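
Recording the resolved Python binding versions can be done with the standard library. This is a minimal sketch with a hypothetical function name; it covers installed Python distributions only, so the native GEOS, PROJ, and GDAL versions would need to be captured separately.

```python
from importlib import metadata

def resolved_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # recorded explicitly, never guessed
    return versions

manifest_versions = resolved_versions(["shapely", "geopandas", "pyproj", "rasterio"])
```

Writing this dictionary into the fixture manifest gives a later diff something concrete to compare when coordinates shift.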

Test Data and Fixture Strategy

With precision pinned, fixtures themselves must be treated as versioned artifacts rather than transient scratch files. A fixture is defined by its declarative manifest (spatial extent, feature density, attribute schema, CRS, seed, and the required edge-case set), and the manifest, not the binary output, is the source of truth. Hashing the manifest yields a stable artifact identity: if an artifact is already registered under that hash, generation is skipped, which makes the whole pipeline idempotent and cache-friendly.
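
The skip-if-present behaviour can be sketched with a canonical JSON hash. The in-memory dictionary stands in for an artifact registry, and the manifest keys and function names are hypothetical.

```python
import hashlib
import json

def manifest_hash(manifest: dict) -> str:
    """Hash a canonical JSON encoding so key order never changes the identity."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

_cache: dict[str, str] = {}  # stand-in for an artifact registry
calls = []                   # records each real generation, for inspection

def get_or_generate(manifest: dict) -> str:
    key = manifest_hash(manifest)
    if key not in _cache:    # cache miss: generate exactly once
        calls.append(key)
        _cache[key] = f"artifact-{key[:12]}"  # placeholder for real generation
    return _cache[key]

m1 = {"crs": "EPSG:32618", "seed": 42, "features": 1000, "edge_cases": ["bowtie"]}
m2 = {"seed": 42, "features": 1000, "edge_cases": ["bowtie"], "crs": "EPSG:32618"}
first = get_or_generate(m1)
second = get_or_generate(m2)  # same manifest, different key order
```

Sorting keys before hashing is the design choice that matters here: two manifests that differ only in key order resolve to the same artifact.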

Synthetic-data requirements break down by intent:

Schema-realistic fixtures mirror the production attribute contract (types, enums, nullability) at reduced cardinality, used by unit and integration stages.
Topology-realistic fixtures preserve the spatial relationships (adjacency, containment, index density) that production queries depend on, generated with controlled clustering so spatial indexes fragment the way they do in production.
Adversarial fixtures carry the deliberate edge cases — anti-meridian crossings, degenerate and multi-part geometries, mixed Z/M coordinates, empty geometries — that prove validators fail closed.

Fixture versioning follows the artifact, not the test. Each generation run emits a manifest containing SHA-256 checksums, CRS metadata, the resolved library versions, the seed, and a topology-validity report. Downstream consumers verify the checksum before ingestion and resolve fixtures by immutable tag, so a test pinned to fixture@v3 never silently consumes a regenerated v4. Edge-case completeness is itself asserted: the manifest enumerates which adversarial categories the fixture set covers, and a meta-test fails the build if a required category is absent.
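
The edge-case completeness check can be sketched as a set difference. The category names below are derived from the adversarial fixture list above, but the exact identifiers, the manifest key, and the function names are assumptions.

```python
# Hypothetical category identifiers, based on the adversarial fixture list.
REQUIRED_CATEGORIES = {
    "antimeridian", "degenerate", "multipart", "mixed_zm", "empty",
}

def missing_categories(manifest: dict) -> set[str]:
    """Return required adversarial categories the fixture set does not cover."""
    return REQUIRED_CATEGORIES - set(manifest.get("edge_case_categories", []))

def assert_edge_case_completeness(manifest: dict) -> None:
    """Fail loudly when a required category is absent from the manifest."""
    missing = missing_categories(manifest)
    assert not missing, f"fixture set lacks categories: {sorted(missing)}"

complete = {"fixture": "adversarial@v3",
            "edge_case_categories": sorted(REQUIRED_CATEGORIES)}
partial = {"fixture": "adversarial@v4",
           "edge_case_categories": ["empty", "multipart"]}
```

Calling `assert_edge_case_completeness` from a test that loads the real manifest would turn a missing category into a build failure.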

How aggressively each stage scales its fixtures is governed by the GIS test pyramid and bounded by the scoping rules for map data validation: unit tests consume tiny in-memory fixtures, integration tests consume scoped synthetic layers, and end-to-end tests consume scaled-down snapshots that preserve schema complexity without full-dataset cost.

CI/CD Integration & Observability

Embedding generation into automated pipelines demands strict lifecycle management and parallel-execution safety. Python-based runners combine pytest fixtures with tempfile/contextlib to guarantee deterministic setup and teardown, and pair synthetic outputs with schema-assertion libraries and spatial-predicate checks before any artifact is promoted.

import pytest
import tempfile
import geopandas as gpd
from shapely.geometry import Point

@pytest.fixture(scope="session")
def deterministic_vector_fixture():
    with tempfile.TemporaryDirectory() as tmpdir:
        gdf = gpd.GeoDataFrame(
            {"id": [1, 2, 3], "geometry": [Point(0, 0), Point(1, 1), Point(2, 2)]},
            crs="EPSG:4326",
        )
        path = f"{tmpdir}/mock_vector.parquet"
        gdf.to_parquet(path)
        yield path

Parallel CI execution demands fixture isolation to prevent race conditions during concurrent spatial indexing or database ingestion. Use pytest-xdist-compatible temporary directories, scope each worker to its own PostGIS schema or database, and enforce connection-pool limits so concurrent ingestion does not exhaust the server. The same parallelism concerns that govern large-fixture generation are explored under async execution for large datasets.
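
Per-worker schema isolation can be derived from the worker id that pytest-xdist exposes through the `PYTEST_XDIST_WORKER` environment variable. The sketch below assumes that variable, a hypothetical `qa` prefix, and a `master` fallback for non-distributed runs; creating and dropping the schema is left out.

```python
import os
import re

def schema_for_worker(environ=os.environ, prefix: str = "qa") -> str:
    """Derive a per-worker schema name from pytest-xdist's worker id."""
    # pytest-xdist sets PYTEST_XDIST_WORKER (e.g. "gw0") in worker processes;
    # the variable is absent when tests run without distribution.
    worker = environ.get("PYTEST_XDIST_WORKER", "master")
    if not re.fullmatch(r"[A-Za-z0-9_]+", worker):
        raise ValueError(f"unexpected worker id: {worker!r}")
    return f"{prefix}_{worker}"

names = [schema_for_worker({"PYTEST_XDIST_WORKER": f"gw{i}"}) for i in range(3)]
solo = schema_for_worker({})
```

Validating the worker id before using it in an identifier keeps the schema name safe to place into DDL, where parameter binding is not available.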

The pipeline runs generation in tiered gates:

Pre-merge gate (fast): Generate small in-memory fixtures, assert schema and ST_IsValid topology, and verify the manifest checksum. This must complete in seconds so it can block every pull request.
Nightly job (deep): Regenerate the full adversarial fixture set at production cardinality, run cross-format and zonal-statistic parity, and profile spatial-index build times against SLA thresholds.
Promotion: Artifacts passing both tiers are pushed to an immutable registry (S3, Azure Blob, or Git LFS) with version tags that downstream suites resolve deterministically.

Observability turns generation from a black box into a measured stage. Emit metrics in a Prometheus/OpenTelemetry-compatible shape — fixture_generation_seconds, fixture_feature_count, topology_invalid_total, precision_drift_max — and tag them with the fixture version and CRS so dashboards can trend drift over time. Generation events should be logged as structured JSON rather than free text, for example:

{
  "event": "fixture_generated",
  "fixture": "synthetic_vectors@v7",
  "crs": "EPSG:32618",
  "seed": 42,
  "feature_count": 50000,
  "topology_invalid": 0,
  "precision_drift_max_m": 0.0008,
  "geos_version": "3.12.1",
  "proj_version": "9.3.1",
  "sha256": "9f2c…",
  "duration_s": 4.81
}

A consistent log schema makes a regression attributable: a jump in topology_invalid or precision_drift_max is immediately tied to a seed, a CRS, and a library version, which is the same evidence the version-pinning policy depends on.
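
Emitting such an event takes only the standard library. This is a minimal sketch: the logger name and the `fixture_event` helper are hypothetical, and the field values are taken from the example event above.

```python
import json
import logging

logger = logging.getLogger("fixture_generation")  # logger name is illustrative

def fixture_event(**fields) -> str:
    """Serialise one generation event as a single JSON line and log it."""
    record = {"event": "fixture_generated", **fields}
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line

line = fixture_event(fixture="synthetic_vectors@v7", crs="EPSG:32618", seed=42,
                     feature_count=50000, topology_invalid=0)
parsed = json.loads(line)
```

Keeping each event on one line with sorted keys makes the output easy to diff and to ingest with line-oriented log shippers.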

Security and Governance

Synthetic data is the strongest privacy control a spatial QA program has, but only if the generation boundary is enforced. Mock environments must never inherit production credentials, network routes, or write permissions — the guarantees detailed under security boundaries in spatial QA apply directly to the generation stage. Ephemeral runners should provision read-only roles, block outbound production endpoints by network policy, and rotate any mock credentials at pipeline initialization.

Spatial injection is a generation-specific risk that ordinary input fuzzing misses. When fixtures are produced from templated WKT/WKB or rendered into SQL for PostGIS ingestion, a malformed or hostile coordinate string can break out of its statement. Generators must therefore parameterise every geometry through the database driver (psycopg’s parameter binding, never string interpolation), validate WKT/WKB through a parser before it reaches the server, and treat any geometry that fails to parse as a rejected fixture rather than a literal to forward. The adversarial fixtures from the edge-case stage double as the corpus for these injection tests.
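
The validate-then-bind pattern can be demonstrated without a database server. The sketch below uses `sqlite3` as a stand-in for PostGIS and psycopg (which uses `%s` placeholders where sqlite3 uses `?`), and a deliberately narrow regular expression in place of a real WKT parser; both substitutions are assumptions made to keep the example self-contained.

```python
import re
import sqlite3

# Deliberately narrow, hypothetical validator: accepts only 2D POINT WKT.
_POINT = re.compile(r"POINT \((-?\d+(?:\.\d+)?) (-?\d+(?:\.\d+)?)\)")

def is_safe_point_wkt(text: str) -> bool:
    return _POINT.fullmatch(text) is not None

def ingest(conn: sqlite3.Connection, candidates: list[str]) -> list[str]:
    """Insert validated WKT via bound parameters; return the rejected inputs."""
    rejected = []
    for wkt in candidates:
        if not is_safe_point_wkt(wkt):
            rejected.append(wkt)  # rejected fixture, never forwarded
            continue
        # Bound parameter: the driver passes the value, no string interpolation.
        conn.execute("INSERT INTO fixtures (wkt) VALUES (?)", (wkt,))
    return rejected

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fixtures (wkt TEXT NOT NULL)")
hostile = "POINT (0 0)'); DROP TABLE fixtures; --"
rejected = ingest(conn, ["POINT (500000.5 4600000)", hostile])
stored = [row[0] for row in conn.execute("SELECT wkt FROM fixtures")]
```

The two defences are independent: validation rejects the hostile string outright, and binding would have stored it as inert text even if validation had been skipped.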

Governance scopes generation by data classification. Even though outputs are synthetic, the schema and spatial extent of a fixture can leak structural information about sensitive sources, so generation rules are tiered: public-classification fixtures may use real attribute enums and extents; restricted-classification fixtures must perturb extents, generalise attribute distributions, and strip any identifying schema fields. Every generation run records, in its manifest, which classification it targeted and which scoping rules it applied, producing an audit trail that ties each fixture back to an approved policy.

Conclusion

By treating test data as a first-class engineering artifact — deterministically seeded, schema-first, precision-pinned, and governance-scoped — geospatial teams eliminate the non-deterministic failures that plague snapshot-based QA. The three generation modalities of synthetic vectors, mock rasters, and adversarial edge cases, bound together by an explicit tolerance policy and an observable CI/CD lifecycle, give every downstream assertion a reproducible input. That reproducibility is what lets a spatial pipeline ship changes with confidence rather than hope, and it is the foundation the rest of production-grade spatial engineering is built on.

For authoritative validity definitions, consult the OGC Simple Features specification, and structure session-level versus function-level generation around pytest fixture scopes.

Synthetic Vector Data Generation — deterministic point, line, and polygon fixtures.
Raster Mocking Techniques — in-memory multi-band, DEM, and time-series mocks.
Edge Case Spatial Data Creation — slivers, dateline crossings, and topology violations.
Generating synthetic GeoJSON for edge-case testing — parser-resilience fixtures.
Mocking geospatial data for tests — mocking the connections and services that consume these fixtures.
Cross-format parity testing — asserting fixtures survive format and CRS round-trips.
