Comparing GeoJSON vs Shapefile Outputs in Tests

When geospatial pipelines must emit dual-format artifacts, QA engineers routinely encounter silent divergence between GeoJSON and Shapefile outputs. The core challenge in comparing GeoJSON vs Shapefile outputs in tests is not merely asserting equality, but normalizing two fundamentally different data models that encode spatial and tabular information under divergent constraints. Shapefiles enforce rigid schema limits, declare their coordinate reference system (CRS) in an optional sidecar .prj file, and allow only one geometry type per layer. GeoJSON follows RFC 7946, is defined on WGS84 longitude/latitude, supports nested JSON attributes, and serializes coordinates as decimal text whose precision depends on the writer. Without a structured normalization layer, parity assertions fail on trivial formatting differences rather than actual data corruption. This article details a production-grade debugging workflow, a minimal reproducible test harness, and CI prevention strategies aligned with modern Spatial Test Pattern Design & Implementation practices.

Root-Cause Analysis: Format-Specific Divergence

Failures in cross-format parity tests typically trace to five deterministic root causes:

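  1. Coordinate Precision & Rounding: Shapefile .shp geometry stores coordinates as 8-byte IEEE 754 doubles, whereas GeoJSON serializes them as decimal text. Writers commonly round that text: GDAL's GeoJSON driver, for example, exposes a COORDINATE_PRECISION option whose documented default is 15 decimal places, or 7 when RFC 7946 output is requested. Direct coordinate equality checks will therefore fail even when geometries are topologically identical.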
  2. Schema & Attribute Coercion: Shapefile .dbf restricts field names to 10 ASCII characters, caps string lengths at 254 bytes, and lacks native NULL support (often substituting empty strings or zeros). GeoJSON preserves original keys, nested objects, and explicit null values without truncation.
  3. CRS Declaration Mismatch: RFC 7946 GeoJSON mandates WGS84 coordinates in longitude/latitude order (commonly labeled EPSG:4326). Shapefiles rely on an external .prj file. If the pipeline writes projected coordinates into GeoJSON or omits the .prj, the two outputs no longer describe the same locations, and spatial joins and distance calculations diverge immediately.
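  4. Geometry Collection & Type Flattening: Shapefiles cannot store mixed geometry types in a single layer and have no GeometryCollection type. Exporters typically split GeometryCollection members across multiple records or drop them, and because the format does not distinguish single-part from multi-part shapes, a feature written as a MultiPolygon may read back as a Polygon (or the reverse).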
  5. Metadata & Encoding Drift: GeoJSON uses UTF-8 natively. Shapefiles historically use system-dependent encodings (e.g., ISO-8859-1, CP1252), causing character corruption in attribute strings during round-trip validation. The GDAL Shapefile Driver Documentation explicitly notes these encoding fallbacks.
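
Two of these causes (1 and 5) can be reproduced without any GIS stack. The sketch below uses only the Python standard library; the coordinate value, the 7-decimal rounding, and the CP1252 misread are illustrative assumptions, not output captured from a real pipeline.

```python
import json
import struct

# Hypothetical longitude carrying more digits than a text writer may keep.
lon = -122.419415512345

# Binary storage (8-byte little-endian double, as in .shp) round-trips exactly.
binary_roundtrip = struct.unpack("<d", struct.pack("<d", lon))[0]

# Text storage rounded to 7 decimals (an assumed writer setting) does not.
text_roundtrip = json.loads(json.dumps(round(lon, 7)))

print(binary_roundtrip == lon)            # exact match
print(text_roundtrip == lon)              # differs after rounding
print(abs(text_roundtrip - lon) < 1e-7)   # still within a small tolerance

# Encoding drift: UTF-8 bytes decoded with a legacy code page become mojibake.
name = "Café"
misread = name.encode("utf-8").decode("cp1252")
print(misread)
```

The rounded value is wrong for strict equality yet right for any sensible tolerance, which is why the normalization layer described next snaps coordinates before comparing them.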

Normalization Architecture for Cross-Format Validation

Robust parity testing requires a deterministic normalization pipeline before assertion. This aligns with Cross-Format Parity Testing by decoupling format-specific serialization from logical data equivalence. The normalization layer must execute three sequential operations:

  • CRS Harmonization: Project both datasets to a common CRS (typically EPSG:4326 for validation). Use pyproj to strip .prj ambiguity and force GeoJSON longitude/latitude ordering.
  • Geometry Canonicalization: Apply shapely operations to snap coordinates to a fixed precision, remove duplicate vertices, and standardize vertex ordering (e.g., clockwise for polygons). This step directly implements Geometry Validation Patterns and Topology Rule Enforcement to ensure structural equivalence regardless of export artifacts.
  • Attribute Schema Mapping: Truncate or hash GeoJSON keys to 10 characters, coerce null to Shapefile-compatible placeholders, and enforce UTF-8 decoding. Attribute & Metadata Checks must validate type casting (float vs string) and length constraints before comparison.
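
To make the canonicalization step concrete, here is a dependency-free sketch for a single polygon ring. `canonicalize_ring` and `signed_area` are hypothetical helpers written for this illustration, not shapely APIs; in practice `shapely.set_precision` and `shapely.normalize` cover the same ground. The sketch assumes a non-degenerate ring given as (x, y) tuples.

```python
from typing import List, Tuple

Coord = Tuple[float, float]


def signed_area(ring: List[Coord]) -> float:
    """Shoelace formula over an open ring; positive means counterclockwise."""
    return 0.5 * sum(
        x1 * y2 - x2 * y1
        for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1])
    )


def canonicalize_ring(ring: List[Coord], decimals: int = 8) -> List[Coord]:
    """Round, de-duplicate, orient clockwise, and rotate to a fixed start vertex."""
    # 1. Snap every vertex to a fixed precision.
    pts = [(round(x, decimals), round(y, decimals)) for x, y in ring]

    # 2. Drop consecutive duplicates, including the closing vertex.
    deduped: List[Coord] = []
    for p in pts:
        if not deduped or p != deduped[-1]:
            deduped.append(p)
    if len(deduped) > 1 and deduped[0] == deduped[-1]:
        deduped.pop()

    # 3. Enforce clockwise orientation (negative signed area).
    if signed_area(deduped) > 0:
        deduped.reverse()

    # 4. Rotate so the lexicographically smallest vertex comes first.
    start = deduped.index(min(deduped))
    rotated = deduped[start:] + deduped[:start]

    # 5. Re-close the ring.
    return rotated + rotated[:1]


# Same square: once counterclockwise, once clockwise with export noise
# and a different start vertex.
ccw = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]
cw_noisy = [(1.000000001, 1.0), (1.0, 0.0), (0.0, 0.0), (0.0, 1.0), (1.000000001, 1.0)]

print(canonicalize_ring(ccw) == canonicalize_ring(cw_noisy))
```

After canonicalization both rings reduce to the same vertex list, so a plain equality check becomes meaningful again.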

Production-Grade Test Harness Implementation

The following Python harness demonstrates how to operationalize normalization and assertion using geopandas, shapely, and pytest (with the pytest-asyncio plugin for the async test). Following Async Execution for Large Datasets, it offloads file reads to worker threads so that heavy I/O does not block the event loop on CI runners.

import asyncio

import geopandas as gpd
import pandas as pd
import pytest
import shapely
from shapely.geometry import box

# Configuration constants
COORD_PRECISION = 8    # decimal places kept when snapping coordinates
TOLERANCE_DEG = 1e-5   # degrees; ~1.11 m at the equator in EPSG:4326


def normalize_gdf(gdf: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Apply deterministic normalization for cross-format parity."""
    gdf = gdf.copy()  # never mutate the caller's frame
    geom_col = gdf.geometry.name

    # 1. CRS Harmonization (a missing CRS is left untouched; assert on it upstream)
    if gdf.crs is not None and gdf.crs.to_epsg() != 4326:
        gdf = gdf.to_crs(epsg=4326)

    # 2. Geometry Canonicalization: repair, snap to a grid, then normalize
    #    ring orientation and vertex order, which differ between writers
    gdf[geom_col] = gdf[geom_col].apply(
        lambda geom: shapely.normalize(
            shapely.set_precision(shapely.make_valid(geom), grid_size=10**-COORD_PRECISION)
        )
    )

    # 3. Attribute Coercion (Shapefile truncates field names to 10 chars;
    #    leave the active geometry column name untouched)
    gdf.columns = [c if c == geom_col else c[:10].upper() for c in gdf.columns]
    attr_cols = sorted(c for c in gdf.columns if c != geom_col)
    # Fill nulls in attribute columns only; the geometry column cannot hold ""
    gdf = gdf.fillna({c: "" for c in attr_cols})
    # Fix the column order so it does not depend on the source format
    gdf = gdf[[*attr_cols, geom_col]]

    # 4. Deterministic ordering by geometry WKT, then remaining attributes
    #    (geometries are not orderable directly, so sort on their WKT)
    gdf = gdf.assign(_wkt=gdf.geometry.to_wkt())
    return (
        gdf.sort_values(by=["_wkt", *attr_cols])
        .drop(columns="_wkt")
        .reset_index(drop=True)
    )


async def compare_datasets_async(geojson_path: str, shp_path: str) -> bool:
    """Async wrapper for large dataset parity validation."""
    loop = asyncio.get_running_loop()
    # Read both files concurrently in the default thread pool
    gdf_geojson, gdf_shp = await asyncio.gather(
        loop.run_in_executor(None, gpd.read_file, geojson_path),
        loop.run_in_executor(None, gpd.read_file, shp_path),
    )

    norm_geojson = normalize_gdf(gdf_geojson)
    norm_shp = normalize_gdf(gdf_shp)

    # Topology & Attribute Parity Assertion
    assert norm_geojson.shape == norm_shp.shape, "Feature count or schema mismatch"
    assert norm_geojson.geometry.geom_equals_exact(
        norm_shp.geometry, tolerance=TOLERANCE_DEG
    ).all(), "Geometry divergence detected"
    # Compare attributes separately; dtypes may legitimately differ by driver
    pd.testing.assert_frame_equal(
        pd.DataFrame(norm_geojson.drop(columns=norm_geojson.geometry.name)),
        pd.DataFrame(norm_shp.drop(columns=norm_shp.geometry.name)),
        check_dtype=False,
    )
    return True


@pytest.mark.asyncio  # requires the pytest-asyncio plugin
async def test_dual_format_parity(tmp_path):
    geojson_out = tmp_path / "output.geojson"
    shp_out = tmp_path / "output.shp"

    # Stand-in for the pipeline under test: write one frame in both formats.
    # In production, these files are generated by the pipeline itself.
    gdf = gpd.GeoDataFrame(
        {"name": ["a", "b"], "value": [1.5, 2.5]},
        geometry=[box(0, 0, 1, 1), box(2, 2, 3, 3)],
        crs="EPSG:4326",
    )
    gdf.to_file(geojson_out, driver="GeoJSON")
    gdf.to_file(shp_out, driver="ESRI Shapefile")

    assert await compare_datasets_async(str(geojson_out), str(shp_out))
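
Strict equality is brittle once coordinates have been reprojected or rounded by a writer. As a dependency-free illustration of tolerance-based comparison, the sketch below mirrors the idea behind shapely's `equals_exact` (same vertices, same order, within a tolerance) and adds a vertex-based Hausdorff distance. `coords_equal_within`, `discrete_hausdorff`, `TOL_DEG`, and the sample rings are hypothetical names and values chosen for this example.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float]


def coords_equal_within(a: Sequence[Coord], b: Sequence[Coord], tol: float) -> bool:
    """Vertex-by-vertex check: same length, same order, each axis within tol."""
    return len(a) == len(b) and all(
        abs(ax - bx) <= tol and abs(ay - by) <= tol
        for (ax, ay), (bx, by) in zip(a, b)
    )


def discrete_hausdorff(a: Sequence[Coord], b: Sequence[Coord]) -> float:
    """Largest distance from any vertex in one set to its nearest vertex in the other."""
    def directed(src: Sequence[Coord], dst: Sequence[Coord]) -> float:
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return max(directed(a, b), directed(b, a))


# The same square as read from two formats, with small drift in one of them.
geojson_ring = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0), (0.0, 0.0)]
shp_ring = [(0.0, 0.0), (0.0, 1.0000001), (1.0000001, 1.0), (1.0, 0.0), (0.0, 0.0)]

TOL_DEG = 1e-5  # assumed tolerance in degrees, roughly 1.1 m at the equator

print(geojson_ring == shp_ring)                                # strict equality
print(coords_equal_within(geojson_ring, shp_ring, TOL_DEG))    # ordered, with tolerance
print(discrete_hausdorff(geojson_ring, shp_ring) <= TOL_DEG)   # order-independent
```

The ordered check is cheaper but only works after vertex order has been canonicalized; the Hausdorff variant tolerates reordering at quadratic cost in the number of vertices.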

CI/CD Integration & Pipeline Prevention

Embedding cross-format validation into CI requires strategic thresholding and artifact caching to avoid flaky builds:

  • Tolerance-Based Assertions: Replace strict equality with spatial tolerance (shapely.equals_exact or Hausdorff distance thresholds). This accounts for unavoidable floating-point drift during coordinate transformations.
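  • Deterministic Export Flags: Force pipeline exporters to use explicit settings rather than driver defaults: with GDAL, COORDINATE_PRECISION=15 for the GeoJSON layer and ENCODING=UTF-8 for the Shapefile layer, plus an explicit CRS on every layer so that a .prj is always written. Document these in pipeline configuration as immutable QA contracts.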
  • Parallelized Validation: For datasets exceeding 100k features, partition by spatial index (e.g., sindex.query, or query_bulk in older geopandas releases) and run validation chunks concurrently. This aligns with Async Execution for Large Datasets to keep CI feedback loops under 3 minutes.
  • Fail-Fast on Schema Drift: Implement pre-flight checks that validate .dbf header constraints before geometry comparison. If field names exceed 10 characters or types mismatch, fail immediately with actionable logs rather than waiting for geometry assertion timeouts.
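
A pre-flight check of this kind can run on plain attribute dictionaries before any geometry is loaded. The following is a minimal sketch using only the standard library; `preflight_schema` and the sample record are hypothetical, and the limits are the .dbf constraints quoted earlier in this article (10-character field names, 254-byte strings).

```python
from typing import Any, Dict, List

MAX_FIELD_NAME = 10      # .dbf field-name limit in characters
MAX_STRING_BYTES = 254   # .dbf string-field limit in bytes


def preflight_schema(records: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable schema problems; an empty list means the check passed."""
    problems: List[str] = []
    names = sorted({key for rec in records for key in rec})

    # Names that are too long, and names that collide once truncated.
    seen: Dict[str, str] = {}
    for name in names:
        if len(name) > MAX_FIELD_NAME:
            problems.append(f"field '{name}' exceeds {MAX_FIELD_NAME} characters")
        short = name[:MAX_FIELD_NAME]
        if short in seen:
            problems.append(f"fields '{seen[short]}' and '{name}' collide as '{short}'")
        seen.setdefault(short, name)

    # Values a .dbf cannot hold: nested objects and over-long strings.
    for i, rec in enumerate(records):
        for key, value in rec.items():
            if isinstance(value, (dict, list)):
                problems.append(f"record {i}: field '{key}' holds a nested value")
            elif isinstance(value, str) and len(value.encode("utf-8")) > MAX_STRING_BYTES:
                problems.append(f"record {i}: field '{key}' exceeds {MAX_STRING_BYTES} bytes")
    return problems


# Hypothetical GeoJSON properties that would not survive a Shapefile export.
sample = [{
    "population_density": 12.5,
    "population_total": 100,
    "name": "x" * 300,
    "tags": ["a"],
}]
problems = preflight_schema(sample)
for line in problems:
    print(line)
```

In a CI job the result would typically feed a single assertion such as `assert not problems, "\n".join(problems)`, so the build fails with every violation listed at once.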

By treating format divergence as an expected engineering constraint rather than a bug, teams can build resilient validation layers that catch actual data corruption while ignoring serialization noise. This approach ensures that comparing GeoJSON vs Shapefile outputs in tests becomes a deterministic, pipeline-ready checkpoint rather than a source of flaky CI failures.