Generating Synthetic GeoJSON for Edge Case Testing

In modern geospatial data pipelines, validation failures rarely originate from well-formed, production-grade datasets. They surface when parsers, spatial indexes, or transformation engines encounter malformed coordinates, degenerate geometries, or RFC 7946 boundary violations that were never exercised during development. Generating synthetic GeoJSON for edge case testing, a core discipline within Test Data Generation & Mocking Strategies, lets QA engineers, data engineers, and platform teams reproduce spatial anomalies deterministically, validate parser resilience, and enforce schema compliance before data reaches staging or production.

Root-Cause Analysis of GeoJSON Validation Failures

When spatial ETL pipelines or vector ingestion services fail in CI/CD, the underlying causes typically fall into three categories:

Topological Degeneracy

Self-intersecting polygons, unclosed linear rings, and zero-area features frequently pass naive coordinate-range checks but fail OGC Simple Features validation. Duplicate consecutive vertices are a related hazard: many validity checks tolerate them, so they need a dedicated test of their own. Depending on the consumer, spatial databases like PostGIS or rendering engines like Mapbox GL JS may throw hard exceptions, silently drop features, or produce corrupted tiles. Without explicit topological guards, downstream consumers inherit silent data corruption.
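Several of these degeneracies can be detected with plain coordinate arithmetic. The following is a minimal standard-library sketch, not a replacement for a topology library: `diagnose_ring` and its problem labels are hypothetical names chosen for illustration, and the check does not detect self-intersection, which needs a full validity test such as the Shapely one used later in this article.

```python
from typing import Sequence

Ring = Sequence[Sequence[float]]


def signed_area(ring: Ring) -> float:
    """Shoelace formula; the wrap-around term is zero for a closed ring."""
    total = 0.0
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i][0], ring[i][1]
        x2, y2 = ring[(i + 1) % n][0], ring[(i + 1) % n][1]
        total += x1 * y2 - x2 * y1
    return total / 2.0


def diagnose_ring(ring: Ring) -> list[str]:
    """Return labels for the degeneracies found in one linear ring."""
    problems = []
    if len(ring) < 4:
        # RFC 7946: a linear ring has four or more positions
        problems.append("too_few_positions")
    if len(ring) >= 2 and list(ring[0]) != list(ring[-1]):
        problems.append("unclosed")
    if any(list(a) == list(b) for a, b in zip(ring, ring[1:])):
        problems.append("duplicate_consecutive_vertex")
    if signed_area(ring) == 0.0:
        # Exact comparison is acceptable for hand-built fixtures;
        # use a tolerance for computed coordinates.
        problems.append("zero_area")
    return problems
```

Returning a list of labels rather than a boolean lets one fixture assert on several failure modes at once.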

Coordinate System & Precision Drift

Mixed coordinate reference system assumptions (RFC 7946 mandates WGS 84 longitude/latitude), floating-point precision mismatches, and anti-meridian crossing artifacts break naive bounding-box calculations. An IEEE 754 double can serialize to as many as 17 significant digits, causing serialization bloat, truncation during database round-trips, or hash mismatches in deduplication layers. Geometries that cross the anti-meridian (±180° longitude) require explicit handling; otherwise, bounding boxes wrap incorrectly and spatial joins fail.
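The bounding-box failure is easy to reproduce. The sketch below contrasts a naive min/max box with one that follows the RFC 7946 convention of letting west exceed east when a box crosses the anti-meridian. Both function names are hypothetical, and the "widest longitude gap" rule is a heuristic that assumes the feature is compact rather than spanning most of the globe.

```python
def naive_bbox(coords: list[list[float]]) -> list[float]:
    """Plain min/max box; wrong for features that straddle ±180°."""
    lons = [c[0] for c in coords]
    lats = [c[1] for c in coords]
    return [min(lons), min(lats), max(lons), max(lats)]


def antimeridian_aware_bbox(coords: list[list[float]]) -> list[float]:
    """Return [west, south, east, north]; west > east signals a crossing."""
    lons = sorted(c[0] for c in coords)
    lats = [c[1] for c in coords]
    west, east = lons[0], lons[-1]
    # Gap that wraps around the anti-meridian, from the largest lon to the smallest
    widest_gap = lons[0] + 360.0 - lons[-1]
    # The widest empty stretch of longitude lies outside the box
    for left, right in zip(lons, lons[1:]):
        if right - left > widest_gap:
            widest_gap = right - left
            west, east = right, left
    return [west, min(lats), east, max(lats)]


# Three positions near Fiji, on both sides of the anti-meridian
fiji = [[179.0, -17.0], [-179.0, -16.0], [178.5, -16.5]]
print(naive_bbox(fiji))               # spans almost the whole globe
print(antimeridian_aware_bbox(fiji))  # narrow box with west > east
```

A fixture built from coordinates like these makes a useful regression test for any consumer that computes its own bounding boxes.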

RFC 7946 Non-Compliance

Missing type fields, improperly nested features arrays, invalid bbox ordering (RFC 7946 specifies [west, south, east, north], i.e. [minX, minY, maxX, maxY], not the latitude-first [minY, minX, maxY, maxX]), or mixed geometry types without explicit FeatureCollection wrapping violate the GeoJSON specification (RFC 7946). Production data rarely contains these anomalies because upstream systems sanitize or reject them, but downstream consumers must still handle them gracefully.
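A structural check for these violations needs no geometry library. The sketch below is illustrative only: `structural_problems` and `bbox_problems` are hypothetical helpers, they handle 2D bounding boxes only, and a latitude-first bbox can be caught only heuristically, when a value falls outside the valid latitude range.

```python
def bbox_problems(bbox: list) -> list[str]:
    """Check a 2D bbox against the RFC 7946 order [west, south, east, north]."""
    if len(bbox) != 4:
        return ["expected 4 values for a 2D bbox"]
    west, south, east, north = bbox
    problems = []
    if not (-90 <= south <= 90 and -90 <= north <= 90):
        problems.append("latitude out of range (axes may be swapped)")
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        problems.append("longitude out of range")
    if south > north:
        problems.append("south exceeds north")
    # west > east is legal: it marks a box that crosses the anti-meridian
    return problems


def structural_problems(obj: dict) -> list[str]:
    """List structural violations in a candidate FeatureCollection."""
    problems = []
    if obj.get("type") != "FeatureCollection":
        problems.append("type must be 'FeatureCollection'")
    features = obj.get("features")
    if not isinstance(features, list):
        problems.append("'features' must be an array")
        features = []
    for i, feature in enumerate(features):
        if not isinstance(feature, dict) or feature.get("type") != "Feature":
            problems.append(f"features[{i}] is not a Feature object")
            continue
        # Both members are required, although their value may be null
        for member in ("geometry", "properties"):
            if member not in feature:
                problems.append(f"features[{i}] lacks '{member}'")
    if obj.get("bbox") is not None:
        problems.extend(bbox_problems(obj["bbox"]))
    return problems
```

Collecting every problem instead of raising on the first one keeps a single malformed fixture useful for several assertions.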

Step-by-Step Resolution Architecture

To systematically address these failure modes, adopt a deterministic, schema-first generation pipeline aligned with Synthetic Vector Data Generation best practices.

1. Define Strict Validation Boundaries

Map RFC 7946 requirements alongside OGC Simple Features constraints before writing generation logic. Use Pydantic models or JSON Schema to enforce structural compliance, then layer geometry validation on top. This two-tier approach catches malformed JSON early and prevents invalid geometries from reaching spatial indexes.

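from typing import Literal

from pydantic import BaseModel, field_validator  # Pydantic v2 API

class GeoJSONFeature(BaseModel):
    type: Literal["Feature"]
    # Required member, but RFC 7946 allows null for unlocated features
    geometry: dict | None  # "X | None" syntax needs Python 3.10+
    properties: dict | None = None

    @field_validator("geometry")
    @classmethod
    def validate_geometry_structure(cls, v):
        if v is None:
            return v
        # A GeometryCollection carries "geometries" instead of "coordinates"
        payload_key = "geometries" if v.get("type") == "GeometryCollection" else "coordinates"
        if "type" not in v or payload_key not in v:
            raise ValueError(f"Geometry must contain 'type' and '{payload_key}'")
        return v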

2. Implement Deterministic Seeding

Use fixed random seeds or property-based testing frameworks to guarantee reproducible edge cases across CI runs. Python’s random module or hypothesis can generate coordinate arrays; an explicit seed is what makes the output identical from run to run.

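import random

SEED = 42
random.seed(SEED)  # seeds the module-level generator shared by the whole process

def generate_polygon_vertices(count: int = 5) -> list[list[float]]:
    """Return `count` random [lon, lat] positions (not yet a closed ring)."""
    return [
        [round(random.uniform(-180, 180), 6), round(random.uniform(-90, 90), 6)]
        for _ in range(count)
    ]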
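Seeding the module-level generator works, but any other code that draws from `random` shifts the sequence. One alternative, sketched here under the assumption that a valid base polygon is wanted, is a private `random.Random` instance that produces a closed, non-self-intersecting ring. `seeded_ring` and its parameters are hypothetical; the ring is built by placing vertices at increasing angles around a center, which keeps it simple and counterclockwise.

```python
import math
import random


def seeded_ring(
    seed: int,
    vertex_count: int = 6,
    center: tuple[float, float] = (0.0, 0.0),
    max_radius: float = 1.0,
) -> list[list[float]]:
    """Build a closed, counterclockwise [lon, lat] ring from a private RNG."""
    if vertex_count < 3:
        raise ValueError("a ring needs at least 3 distinct vertices")
    rng = random.Random(seed)  # private generator; global state is untouched
    ring = []
    for i in range(vertex_count):
        # Evenly spaced angles plus bounded jitter keep every angular gap
        # below 180 degrees, so the edges cannot cross each other.
        angle = 2 * math.pi * (i + rng.uniform(0.0, 0.5)) / vertex_count
        radius = rng.uniform(0.5 * max_radius, max_radius)
        ring.append([
            round(center[0] + radius * math.cos(angle), 6),
            round(center[1] + radius * math.sin(angle), 6),
        ])
    ring.append(list(ring[0]))  # close the ring
    return ring
```

Because the generator is local, calling `seeded_ring(42)` twice returns the same ring regardless of what other tests have drawn from the global generator in between.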

3. Apply Controlled Perturbations

Inject known failure vectors into otherwise valid base geometries. Common perturbations include:

  • Self-intersection: Swap two consecutive vertices of a convex ring to create a bowtie polygon (swapping opposite corners of a quadrilateral only reverses its winding).
  • Precision truncation: Clamp coordinates to 2–4 decimal places to simulate legacy system exports.
  • Empty geometries: Generate {"type": "Polygon", "coordinates": []} or null geometries.
  • Anti-meridian splits: Force longitude values across ±180° to test bounding-box wrapping logic.
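The perturbations listed above can be sketched as small pure functions that take a valid ring and return a damaged copy. All names here are hypothetical, and `to_bowtie` assumes a convex input ring: exchanging the second and third positions then makes two edges cross.

```python
import copy

EMPTY_POLYGON = {"type": "Polygon", "coordinates": []}


def to_bowtie(ring: list[list[float]]) -> list[list[float]]:
    """Swap the 2nd and 3rd positions of a closed convex ring."""
    if len(ring) < 5:
        raise ValueError("need a closed ring with at least 4 distinct vertices")
    out = copy.deepcopy(ring)  # leave the base geometry untouched
    out[1], out[2] = out[2], out[1]
    return out


def truncate_precision(ring: list[list[float]], places: int = 3) -> list[list[float]]:
    """Round every coordinate, as a legacy export might."""
    return [[round(x, places), round(y, places)] for x, y in ring]


def wrap_longitude(lon: float) -> float:
    """Fold a longitude into the range [-180, 180)."""
    return ((lon + 180.0) % 360.0) - 180.0


def shift_across_antimeridian(ring: list[list[float]], offset: float) -> list[list[float]]:
    """Translate a ring east by `offset` degrees and wrap at ±180°."""
    return [[wrap_longitude(x + offset), y] for x, y in ring]


square = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]
print(to_bowtie(square))
print(shift_across_antimeridian(square, 179.5))
```

Keeping each perturbation separate from the base generator means one seeded base ring can feed every failure mode, so a failing test points at the perturbation rather than at the random input.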

4. Validate Pre-Serialization

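Run geometries through shapely.is_valid and coordinate precision clamping before converting to GeoJSON. The Shapely library provides robust topology checks and repair utilities. Apply repair only to fixtures that are meant to be valid; for deliberately broken fixtures, assert that the validity check fails so the perturbation survives serialization.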

import shapely  # set_precision requires Shapely 2.0+
from shapely.geometry import shape, mapping
from shapely.validation import make_valid

def validate_and_clamp(geo_dict: dict, precision: int = 6) -> dict:
    """Repair an invalid geometry, then snap it to a fixed decimal grid."""
    geom = shape(geo_dict)
    if not geom.is_valid:
        # make_valid may change the geometry type (e.g. Polygon -> MultiPolygon)
        geom = make_valid(geom)
    # Snap coordinates to the requested decimal precision (e.g. 6 -> 1e-6 grid)
    clamped = shapely.set_precision(geom, grid_size=10 ** -precision)
    return mapping(clamped)

5. Serialize & Inject into CI/CD

Output standardized GeoJSON fixtures, attach metadata tags for test categorization, and route to CI test runners. Store fixtures in a version-controlled fixtures/ directory alongside pytest markers for parameterized execution.

import pytest
import json
from pathlib import Path

@pytest.fixture(params=["valid_polygon", "self_intersecting", "precision_drift", "anti_meridian_split"])
def geojson_edge_case(request):
    fixture_path = Path("tests/fixtures/geojson") / f"{request.param}.geojson"
    with open(fixture_path, encoding="utf-8") as f:  # GeoJSON is UTF-8 per RFC 7946
        return json.load(f)

Pipeline Integration & QA Workflows

Deterministic synthetic GeoJSON generation integrates seamlessly into modern data engineering workflows:

  1. Pre-commit Hooks: Run lightweight schema validation on generated fixtures before committing. Tools like pre-commit with jsonschema or geojsonlint prevent malformed payloads from entering the repository.
  2. CI/CD Test Matrices: Parameterize spatial ingestion tests across multiple GeoJSON variants. Use GitHub Actions or GitLab CI to run parallel validation jobs against PostGIS, DuckDB spatial, and cloud-native vector stores.
  3. Mock Service Injection: Serve synthetic fixtures via lightweight HTTP servers (e.g., pytest-httpserver or FastAPI mocks) to simulate third-party spatial APIs without network dependencies.
  4. Regression Tracking: Tag each fixture with a failure_mode and expected_behavior metadata field. When a parser update introduces a regression, the tagged fixture immediately surfaces the exact topological or precision violation.
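The regression-tracking idea in item 4 can be sketched as a single comparison between what a parser does and what the fixture says it should do. Everything below is an assumption made for illustration: the "accept" and "reject" values for expected_behavior, the `matches_expectation` helper, and `toy_parser`, which stands in for the real parser under test and only rejects unclosed polygon rings.

```python
from typing import Callable


def toy_parser(feature: dict) -> None:
    """Hypothetical stand-in for the parser under test."""
    geometry = feature["geometry"]
    if geometry is None or geometry["type"] != "Polygon":
        return
    for ring in geometry["coordinates"]:
        if ring and ring[0] != ring[-1]:
            raise ValueError("unclosed ring")


def matches_expectation(feature: dict, parser: Callable[[dict], None]) -> bool:
    """Compare the parser's outcome with the fixture's expected_behavior tag."""
    expected = feature["properties"]["expected_behavior"]
    try:
        parser(feature)
        actual = "accept"
    except ValueError:
        actual = "reject"
    return actual == expected


unclosed = {
    "type": "Feature",
    "geometry": {"type": "Polygon", "coordinates": [[[0, 0], [1, 0], [1, 1]]]},
    "properties": {"failure_mode": "unclosed_ring", "expected_behavior": "reject"},
}
print(matches_expectation(unclosed, toy_parser))
```

When the check fails, the fixture's failure_mode tag names the violation directly, which is what makes a regression quick to triage.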

Conclusion

Generating synthetic GeoJSON for edge case testing transforms spatial QA from reactive debugging to proactive validation. By enforcing deterministic seeding, applying controlled topological perturbations, and validating against RFC 7946 and OGC standards before serialization, engineering teams reduce their reliance on flaky production snapshots and keep spatial test suites reproducible. Integrating these fixtures into CI/CD pipelines verifies that parsers, spatial databases, and tiling engines handle degenerate geometries, precision drift, and specification violations gracefully, long before they impact production environments.