Synthetic Vector Data Generation


Synthetic vector data generation is the practice of materialising deterministic, reproducible point, line, and polygon fixtures programmatically — without relying on production extracts or sensitive PII. It is one of the three generation modalities inside Test Data Generation & Mocking Strategies, the parent discipline that supplies reproducible inputs to every downstream assertion, sitting alongside Raster Mocking Techniques and Edge Case Spatial Data Creation. Treated rigorously, it is not an ad-hoc scripting exercise but a version-controlled, CI-integrated artifact that feeds directly into geometry validation patterns, performance benchmarking, and schema compliance gates. GIS QA engineers, data engineers, and platform teams use it to guarantee byte-identical fixtures across local, staging, and CI environments.

Because every generated geometry will eventually be measured against a predicate, the generator and the validator must agree on the same numeric contract. Configure spatial tolerance thresholds once, in the native CRS units of the dataset, and propagate them into both the construction step and the assertion step so that snap-to-grid rounding never produces geometry the test suite then rejects.

Generation Taxonomy and Tolerance Strategy

Each geometry family carries a distinct failure surface and therefore a distinct tolerance strategy. The table below is the reference the generator config draws from; the threshold ranges assume a projected CRS in metres unless the row states otherwise.

Geometry family	Primary failure mode	Recommended tolerance strategy	Typical threshold range	CRS units
Point field	Coincident / duplicate points breaking spatial index	Snap-to-grid then de-duplicate on grid cell	`1e-6` to `1e-3`	metres (projected)
Linestring / network	Near-zero-length segments, self-touching vertices	Densify cap + minimum segment length filter	`0.01` to `1.0`	metres (projected)
Polygon / boundary	Slivers, ring self-intersection, wrong winding order	`set_precision` grid + minimum-area drop	`1e-4` to `0.5` m² area	metres (projected)
Multi-part / collection	Mixed dimensionality, empty sub-parts	Per-part validity then collection-level assembly	inherits part rules	matches member CRS
Geographic (lat/lon)	Anti-meridian wrap, polar degeneracy	Degree-unit grid, anti-meridian split	`1e-7` to `1e-5`	degrees (geographic)

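The snap grid size $g$ and the XY precision budget $\epsilon$ are related by $g \le 2\epsilon$: round-to-nearest snapping moves each coordinate by at most $g/2$ per axis, so a grid no coarser than twice the precision tolerance keeps every snapped vertex within $\epsilon$ of its source. Applying the same grid in the construction step and the assertion step then removes the floating-point drift that otherwise produces phantom duplicate vertices. Two coordinates closer than $\epsilon$ can still straddle a cell boundary and land on adjacent nodes, so snapping bounds displacement but does not by itself merge every near-duplicate pair.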
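The displacement bound behind this relation can be checked numerically. The sketch below assumes a plain round-to-nearest snap; the `snap` helper and the value chosen for `eps` are illustrative and are not part of Shapely or of the config above.

```python
import random


def snap(value: float, grid: float) -> float:
    """Round a coordinate to the nearest multiple of `grid` (hypothetical helper)."""
    return round(value / grid) * grid


def max_snap_displacement(values: list[float], grid: float) -> float:
    """Largest distance any coordinate moves when snapped to `grid`."""
    return max(abs(v - snap(v, grid)) for v in values)


eps = 0.0005          # XY precision budget in metres (illustrative value)
g = 2 * eps           # coarsest grid allowed by g <= 2 * eps

rng = random.Random(42)
coords = [rng.uniform(500_000.0, 600_000.0) for _ in range(10_000)]

# Snapping never moves a coordinate further than half a grid cell.
worst = max_snap_displacement(coords, g)
```

With `g = 2 * eps` the worst-case move is half a cell, which is exactly the precision budget; any finer grid leaves headroom.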

Pipeline-First Architecture and Idempotent Execution

A production-grade synthetic vector pipeline must be stateless, idempotent, and fully parameterised. The generation process accepts a declarative configuration manifest (YAML/TOML) that defines spatial extents, feature densities, attribute distributions, CRS specifications, and deterministic random seeds. By decoupling configuration from execution logic, teams run identical generation jobs across local development, staging, and CI runners while guaranteeing byte-identical outputs for regression testing.

Pipeline-first design mandates that generation steps are orchestrated via workflow engines (GitHub Actions, GitLab CI, Airflow, or Prefect) with explicit artifact promotion. Each run produces a manifest file containing SHA-256 checksums, CRS metadata, topology validation reports, and generation timestamps. Downstream consumers verify dataset integrity before ingestion into test environments. Idempotency is enforced by hashing the configuration manifest and skipping generation if the target artifact already exists with a matching checksum, reducing redundant compute and storage.
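The idempotency check described above might look like the following sketch. The manifest keys `config_sha256` and `artifact_sha256` and all function names are assumptions made for illustration; only the standard library is used.

```python
import hashlib
import json
from pathlib import Path


def config_digest(config: dict) -> str:
    """SHA-256 of the config in a canonical (key-sorted) JSON form."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def file_digest(path: Path) -> str:
    """SHA-256 of the artifact bytes on disk."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def needs_generation(config: dict, artifact: Path, manifest: Path) -> bool:
    """True unless the manifest matches both the config and the artifact."""
    if not (artifact.exists() and manifest.exists()):
        return True
    recorded = json.loads(manifest.read_text())
    return (
        recorded.get("config_sha256") != config_digest(config)
        or recorded.get("artifact_sha256") != file_digest(artifact)
    )


def write_manifest(config: dict, artifact: Path, manifest: Path) -> None:
    """Record the digests that a later run compares against."""
    manifest.write_text(json.dumps({
        "config_sha256": config_digest(config),
        "artifact_sha256": file_digest(artifact),
    }, indent=2))
```

Hashing a key-sorted serialization means that reordering keys in the YAML file does not trigger a rebuild, while any change to a value does.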

# synthetic_vector_config.yaml
crs: "EPSG:32618"
tolerance:
  xy_precision: 1e-6
  snap_threshold: 0.001
  topology_rules:
    - "no_self_intersections"
    - "valid_polygon_rings"
    - "no_duplicate_vertices"
    - "min_area_m2: 0.5"
generation:
  seed: 42
  feature_count: 50000
  distribution: "clustered_poisson"
  attribute_schema:
    id: "int64"
    status: "categorical"
    confidence: "float32"

Point Field Generation

Point fields are the simplest family and the most common ground truth for spatial-join and nearest-neighbour tests. Determinism comes from a single seeded numpy.random.Generator; spatial realism comes from the distribution model (uniform, clustered Poisson, or grid-jittered). The construction predicate is a snap-then-deduplicate pass so that no two synthetic points share a grid cell:

pts = shapely.set_precision(points, grid_size=config["tolerance"]["snap_threshold"])

For clustered fields, draw cluster centres first, then offset child points by a bounded Gaussian so the density structure is reproducible rather than incidental.
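A minimal sketch of a clustered field with the snap-then-deduplicate pass, using NumPy only. The function name, the three-sigma clipping used as the "bounded" Gaussian, and the parameter names are assumptions for illustration.

```python
import numpy as np


def clustered_points(seed, n_clusters, per_cluster, bounds, sigma, grid):
    """Seeded cluster centres plus clipped Gaussian children, snapped and de-duplicated."""
    rng = np.random.default_rng(seed)
    xmin, ymin, xmax, ymax = bounds

    # 1. Cluster centres first, so the density structure is fixed by the seed.
    centres = rng.uniform([xmin, ymin], [xmax, ymax], size=(n_clusters, 2))

    # 2. Child offsets: Gaussian, clipped to +/- 3 sigma so the spread is bounded.
    offsets = rng.normal(0.0, sigma, size=(n_clusters, per_cluster, 2))
    offsets = np.clip(offsets, -3 * sigma, 3 * sigma)
    pts = (centres[:, None, :] + offsets).reshape(-1, 2)

    # 3. Snap to integer grid-cell indices, then keep one point per cell.
    cells = np.round(pts / grid).astype(np.int64)
    unique_cells = np.unique(cells, axis=0)   # also sorts, so order is stable
    return unique_cells * grid
```

Children of a centre near the edge can fall up to `3 * sigma` outside `bounds`; shrink the centre extent by that margin if the fixture must stay strictly inside the envelope.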

Linestring and Network Generation

Linestrings model roads, rivers, and utility networks, and their dominant defect is the near-zero-length segment that survives naive checks but breaks length-weighted aggregations. Enforce a minimum segment length during densification rather than after, and reject any vertex pair closer than the configured threshold:

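seg_min = min(math.dist(a, b) for a, b in zip(line.coords, line.coords[1:]))  # shortest segment; needs `import math`
clean = line if seg_min >= cfg["min_segment_m"] else None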
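Instead of rejecting a whole line, the filter can run while vertices are being produced. The sketch below works on plain coordinate tuples and keeps a vertex only when it is far enough from the last kept vertex; `drop_short_segments` is a hypothetical helper, not a Shapely function.

```python
import math


def drop_short_segments(coords, min_len):
    """Return the vertices of `coords` with every segment at least `min_len` long."""
    if not coords:
        return []
    kept = [coords[0]]
    for pt in coords[1:]:
        # Measure from the last vertex we kept, not from the previous input vertex.
        if math.dist(kept[-1], pt) >= min_len:
            kept.append(pt)
    return kept


raw = [(0, 0), (0.001, 0), (1, 0), (1, 0.004), (2, 0)]
clean = drop_short_segments(raw, 0.01)
```

The caller still needs at least two surviving vertices to build a linestring, and the final input vertex is dropped if it sits too close to the previous kept one, so network endpoints that must be preserved need separate handling.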

When generating connected networks, snap shared endpoints to the same grid node before assembly so that downstream topology rule enforcement sees genuine node sharing rather than coordinates that merely look equal.

Polygon and Boundary Generation

Polygons carry the richest failure surface: slivers, self-intersecting rings, and incorrect winding order. Apply shapely.set_precision() at the configured grid size, then drop geometries below the minimum-area threshold to eliminate zero-area artifacts that corrupt spatial indexes. Ring orientation must follow OGC Simple Features — outer rings counter-clockwise, inner rings clockwise:

poly = shapely.set_precision(raw, grid_size=g)
valid = poly if poly.is_valid and poly.area >= cfg["min_area_m2"] else None

When generating administrative or boundary polygons, cross-reference Edge Case Spatial Data Creation so that slivers, self-intersections, and ring-orientation failures are produced deliberately and captured by the validator rather than leaking through silently.
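Ring orientation and the minimum-area drop can be illustrated without any geometry library. This is a dependency-free sketch using the shoelace formula on lists of coordinate tuples; the function names are hypothetical, and in a Shapely pipeline `shapely.geometry.polygon.orient` covers the orientation step. It does not detect self-intersections.

```python
def signed_area(ring):
    """Shoelace formula: positive for counter-clockwise rings, negative for clockwise."""
    pairs = zip(ring, ring[1:] + ring[:1])
    return 0.5 * sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in pairs)


def normalise_ring(ring, ccw=True):
    """Return the ring in the requested orientation, reversing it if needed."""
    is_ccw = signed_area(ring) > 0
    return list(ring) if is_ccw == ccw else list(reversed(ring))


def prepare_polygon(exterior, holes, min_area):
    """Orient rings (outer CCW, inner CW) and drop polygons below `min_area`."""
    ext = normalise_ring(exterior, ccw=True)
    inner = [normalise_ring(h, ccw=False) for h in holes]
    area = abs(signed_area(ext)) - sum(abs(signed_area(h)) for h in inner)
    return (ext, inner) if area >= min_area else None


cw_square = [(0, 0), (0, 2), (2, 2), (2, 0)]                  # clockwise input
ccw_hole = [(0.5, 0.5), (1.5, 0.5), (1.5, 1.5), (0.5, 1.5)]   # counter-clockwise input
result = prepare_polygon(cw_square, [ccw_hole], min_area=0.5)
sliver = prepare_polygon([(0, 0), (1, 0), (1, 0.0001)], [], min_area=0.5)
```

Here `result` holds the reoriented rings and `sliver` is `None` because its area falls below the floor.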

Deterministic Attribute Mocking and Schema Compliance

Attribute generation requires strict type enforcement, bounded distributions, and deterministic seeding. Categorical fields map to predefined enums with controlled cardinality, numeric fields to truncated normal or uniform distributions, and temporal fields to ISO-8601 sequences. Python pipelines leverage numpy.random.Generator with explicit seed propagation to guarantee reproducibility across environments, and attribute drift is one of the failure classes covered by attribute and metadata checks.

import geopandas as gpd
import numpy as np
from shapely.geometry import Point

def generate_synthetic_points(config: dict) -> gpd.GeoDataFrame:
    rng = np.random.default_rng(config["generation"]["seed"])
    n = config["generation"]["feature_count"]
    x = rng.uniform(500000, 600000, n)
    y = rng.uniform(4500000, 4600000, n)

    geoms = [Point(xi, yi) for xi, yi in zip(x, y)]
    gdf = gpd.GeoDataFrame(
        geometry=geoms,
        crs=config["crs"],
        data={
            "id": np.arange(n, dtype="int64"),
            "status": rng.choice(["active", "pending", "archived"], size=n),
            "confidence": rng.uniform(0.0, 1.0, size=n).astype("float32"),
        },
    )
    return gdf

For lightweight interchange formats, generating synthetic GeoJSON for edge case testing provides a standardized approach to validate parser resilience against malformed coordinates, missing properties, and non-compliant CRS declarations. Schema validation is enforced via JSON Schema or Pydantic models before artifact promotion.
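As a stand-in for a JSON Schema or Pydantic model, the promotion check can be sketched in plain Python. The rules below mirror the attribute schema in the sample config (`id`, `status`, `confidence`); the function name and the error messages are assumptions for illustration.

```python
import math

ALLOWED_STATUS = {"active", "pending", "archived"}


def _is_number(value) -> bool:
    """True for finite ints and floats, excluding booleans."""
    return (
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and math.isfinite(value)
    )


def validate_point_feature(feature: dict) -> list[str]:
    """Return a list of problems; an empty list means the feature passes."""
    errors = []
    if feature.get("type") != "Feature":
        errors.append("type must be 'Feature'")

    geom = feature.get("geometry") or {}
    coords = geom.get("coordinates")
    if geom.get("type") != "Point":
        errors.append("geometry.type must be 'Point'")
    elif (
        not isinstance(coords, (list, tuple))
        or len(coords) < 2
        or not all(_is_number(c) for c in coords[:2])
    ):
        errors.append("coordinates must hold two finite numbers")

    props = feature.get("properties") or {}
    ident = props.get("id")
    if not isinstance(ident, int) or isinstance(ident, bool):
        errors.append("properties.id must be an integer")
    if props.get("status") not in ALLOWED_STATUS:
        errors.append("properties.status outside the enum")
    conf = props.get("confidence")
    if not _is_number(conf) or not 0.0 <= conf <= 1.0:
        errors.append("properties.confidence must be a number in [0, 1]")
    return errors
```

Returning every problem rather than raising on the first one keeps the gate output useful when a deliberately malformed edge-case fixture violates several rules at once.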

Production-Grade Validation: pytest

A generator is only trustworthy if a test asserts its contract. The following pytest module loads tolerance parameters from the same config the generator consumes, then asserts determinism, validity, CRS, and snap-grid compliance. It targets Shapely 2.x and GeoPandas 0.14+.

import numpy as np
import pytest
import shapely
import yaml

from generator import generate_synthetic_points


@pytest.fixture
def config():
    with open("synthetic_vector_config.yaml") as fh:
        return yaml.safe_load(fh)


def test_generation_is_deterministic(config):
    a = generate_synthetic_points(config)
    b = generate_synthetic_points(config)
    # identical seed -> byte-identical coordinates
    assert np.array_equal(
        shapely.get_coordinates(a.geometry.values),
        shapely.get_coordinates(b.geometry.values),
    )


def test_all_geometries_valid(config):
    gdf = generate_synthetic_points(config)
    assert gdf.geometry.is_valid.all()
    assert not gdf.geometry.is_empty.any()


def test_crs_is_pinned(config):
    gdf = generate_synthetic_points(config)
    assert gdf.crs is not None
    assert gdf.crs.to_epsg() == 32618


def test_snap_grid_compliance(config):
    grid = config["tolerance"]["snap_threshold"]
    gdf = generate_synthetic_points(config)
    snapped = shapely.set_precision(gdf.geometry.values, grid_size=grid)
    # snapping moves each coordinate by at most grid/2, so the result must stay within tolerance
    assert shapely.equals_exact(
        gdf.geometry.values, snapped, tolerance=grid
    ).all()

The determinism test is the load-bearing one: it is what converts a flaky generator into a regression baseline. If you split generation across workers for very large fixtures, follow the guidance in async execution for large datasets so seed partitioning stays reproducible, and lay out the suite as described in how to structure pytest-geo for large shapefiles.
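One way to keep seed partitioning reproducible is to derive one child seed per worker from the root seed with `numpy.random.SeedSequence.spawn`. The sketch below runs the workers sequentially for clarity; the function names and the even split of the feature count are assumptions.

```python
import numpy as np


def worker_chunk(child_seed, count, bounds):
    """Generate one worker's share of points from its own child seed."""
    rng = np.random.default_rng(child_seed)
    xmin, ymin, xmax, ymax = bounds
    return rng.uniform([xmin, ymin], [xmax, ymax], size=(count, 2))


def generate_partitioned(seed, total, n_workers, bounds):
    """Split `total` points across `n_workers` independent, seeded streams."""
    children = np.random.SeedSequence(seed).spawn(n_workers)
    base, extra = divmod(total, n_workers)
    counts = [base + (1 if i < extra else 0) for i in range(n_workers)]
    # In a real pipeline each (child, count) pair would go to a separate worker;
    # concatenating by worker index keeps the result independent of scheduling.
    chunks = [worker_chunk(c, k, bounds) for c, k in zip(children, counts)]
    return np.vstack(chunks)
```

The output depends on `n_workers` as well as the seed, so the worker count belongs in the configuration manifest alongside the seed.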

PostGIS and Database-Side Counterparts

When fixtures must live inside a spatial database — for integration tests against a live engine — generate server-side so the data never round-trips through the network. PostGIS exposes ST_GeneratePoints for deterministic point fields within a constraining polygon, and the same validity predicates the Python path uses:

-- Deterministic synthetic point field inside a bounding polygon (PostGIS 3.x)
SELECT setseed(0.42);

CREATE TABLE synthetic_points AS
SELECT
    dp.path[1] AS id,                   -- per-point index from ST_Dump; gen.id is the constant 1
    ST_SnapToGrid((dp).geom, 0.001) AS geom,
    (ARRAY['active', 'pending', 'archived'])[1 + floor(random() * 3)::int] AS status
FROM (
    SELECT 1 AS id,
           ST_GeneratePoints(
               ST_MakeEnvelope(500000, 4500000, 600000, 4600000, 32618),
               50000,
               42                              -- explicit seed for reproducibility
           ) AS mp
) gen,
LATERAL ST_Dump(gen.mp) AS dp;

-- Server-side validity gate, mirroring the pytest assertion
SELECT count(*) AS invalid_count
FROM synthetic_points
WHERE NOT ST_IsValid(geom);

Drive this from psycopg2 (or psycopg 3) with setseed() issued in the same session as the generation query, ideally in the same transaction so a connection pool cannot hand the two statements to different connections, so the seed governs every random() call. For connection-level isolation in CI, follow best practices for mocking PostGIS connections and the broader guidance on mocking geospatial data for tests.

Pipeline Integration and Observability

Validation gates require automated topology checks, schema validation, and performance profiling. Tools like ogrinfo, geopandas, and pytest chain together to assert geometry validity, attribute completeness, and spatial index efficiency. Authoritative references such as the GDAL/OGR command-line documentation and Shapely precision handling guidelines provide baseline configurations for tolerance enforcement and geometry repair.

A typical CI validation stage executes:

Schema assertion — validate column types, null constraints, and enum compliance.
Topology verification — run ST_IsValid or GEOS validity checks; fail on self-intersections or invalid rings.
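Checksum verification — recompute the SHA-256 of each generated artifact and compare it with the checksum recorded in the run manifest, and confirm that the manifest's configuration hash matches the config under test.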
Performance profiling — measure read/write throughput and spatial index build times against SLA thresholds.

Pin the geometry stack at the container level so GEOS, PROJ, and GDAL versions are identical across every runner — a libgeos minor-version bump can shift set_precision rounding at the last decimal and silently change checksums. Record the exact versions in the run manifest, and emit structured logs so failures are queryable rather than scraped from console output:

{
  "event": "synthetic_vector_gate",
  "artifact": "synthetic_points_v3.parquet",
  "feature_count": 50000,
  "invalid_geometries": 0,
  "crs": "EPSG:32618",
  "snap_grid_m": 0.001,
  "libgeos": "3.12.1",
  "proj": "9.4.0",
  "gdal": "3.8.4",
  "config_sha256": "9f2c…",
  "status": "pass"
}
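A log line of this shape can be produced with the standard library alone. In the sketch below the function name, its parameters, and the rule that the gate passes only when no invalid geometries are found are assumptions; the field names follow the example record.

```python
import json


def gate_event(artifact, feature_count, invalid_geometries, crs,
               snap_grid_m, versions, config_sha256):
    """Serialize one validation-gate result as a single JSON log line."""
    event = {
        "event": "synthetic_vector_gate",
        "artifact": artifact,
        "feature_count": feature_count,
        "invalid_geometries": invalid_geometries,
        "crs": crs,
        "snap_grid_m": snap_grid_m,
        **versions,                       # e.g. libgeos / proj / gdal versions
        "config_sha256": config_sha256,
        # Assumed rule: any invalid geometry fails the gate.
        "status": "pass" if invalid_geometries == 0 else "fail",
    }
    return json.dumps(event)


line = gate_event(
    artifact="synthetic_points_v3.parquet",
    feature_count=50000,
    invalid_geometries=0,
    crs="EPSG:32618",
    snap_grid_m=0.001,
    versions={"libgeos": "3.12.1", "proj": "9.4.0", "gdal": "3.8.4"},
    config_sha256="<digest from the run manifest>",
)
```

Emitting one JSON object per line lets log aggregators index the fields directly, which is what makes failures queryable.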

CRS verification belongs in this gate too; wire it in as described in automating CRS validation in CI pipelines. Artifacts that pass all gates are promoted to a versioned registry (S3, Azure Blob, or Git LFS) with immutable tags, and downstream suites consume them via deterministic paths so QA environments stay isolated from production drift.

While vector generation focuses on discrete features, comprehensive geospatial QA requires cross-modal validation. Synthetic vector outputs frequently serve as ground truth for Raster Mocking Techniques, enabling pixel-to-geometry alignment tests, zonal statistics verification, and multi-resolution resampling benchmarks. Synchronize vector and raster generation jobs using shared seeds, identical spatial extents, and aligned CRS definitions to maintain deterministic alignment across modalities. Cross-modal gates should verify rasterization fidelity (vector-to-raster conversion preserves topology and attribute aggregation), zonal-statistics accuracy within tolerance, and coordinate-transformation consistency under re-projection. The same comparison rigour applies across serialization formats — see comparing GeoJSON vs Shapefile outputs in tests.

Common Failure Modes and Gotchas

Snap-grid drift after a GEOS upgrade. A libgeos minor bump can change set_precision rounding at the final decimal, breaking checksum-based idempotency. Pin the stack and record versions in the manifest.
Seed leakage across modalities. Reusing one Generator for both attributes and geometry couples them; consuming attributes first shifts the geometry stream. Spawn independent child generators with rng.spawn().
Winding-order assumptions. GeoJSON (RFC 7946) wants right-hand-rule winding; some Shapely constructions produce clockwise outer rings. Normalize orientation before serializing to interchange formats.
Anti-meridian wrap in geographic CRS. Points near $\pm180°$ longitude produce envelopes that wrap incorrectly; split features at the anti-meridian or generate in a projected CRS and re-project last.
Float precision bloat in serialization. IEEE 754 doubles serialize with up to 17 significant digits, inflating GeoJSON and breaking deduplication hashes. Apply the snap grid before write, not after read.
Minimum-area threshold below grid resolution. If the area floor is smaller than $g^2$, snapped slivers slip past the filter. Keep $\text{min\_area} \ge g^2$.

Conclusion

Synthetic vector data generation is a disciplined engineering practice, not an ad-hoc scripting exercise. By enforcing idempotent pipelines, strict tolerance boundaries, deterministic seeding, and version-pinned geometry stacks, teams eliminate flaky tests, accelerate CI feedback, and guarantee byte-level reproducibility. As one generation modality within Test Data Generation & Mocking Strategies, it supplies the trustworthy inputs that every spatial assertion, projection check, and index operation depends on.

Test Data Generation & Mocking Strategies — parent discipline and the controls that apply across all generation modalities.
Raster Mocking Techniques — the gridded counterpart that consumes synthetic vectors as ground truth.
Edge Case Spatial Data Creation — deliberate degenerate geometries that exercise validator resilience.
Generating Synthetic GeoJSON for Edge Case Testing — interchange-format generation for parser-resilience tests.
Validating Polygon Topology with GeoPandas — the validation pass synthetic polygons feed into.