Comparing GeoJSON vs Shapefile Outputs in Tests

When a pipeline must emit the same feature collection as both GeoJSON and an ESRI Shapefile, the two artifacts almost never compare equal byte-for-byte, and a naive assert gdf_a.equals(gdf_b) fails on serialization noise long before it catches real corruption. This page shows how to assert logical parity between a GeoJSON driver export and an ESRI Shapefile driver export using geopandas.read_file, Shapely 2.x geometry predicates, and pytest, so the test flags genuine data loss while ignoring format-mandated rewrites. It sits beneath Cross-Format Parity Testing and reuses the same tolerance discipline you apply to every other geometric assertion.

The two formats encode the same features under fundamentally different constraints: Shapefile enforces rigid .dbf schema limits, an external .prj coordinate reference system, and single-geometry-type layers, while GeoJSON follows RFC 7946, assumes WGS 84, preserves nested attributes, and writes coordinates as decimal text at whatever precision the writer chooses. Without a normalization layer between read and assertion, parity tests fail on formatting, not fidelity.

Why format divergence happens at the engineering level

The divergence is deterministic, not random — five mechanisms in the driver layer rewrite data on write, and each one has a distinct failure signature:

Coordinate precision and rounding. Shapefile stores each coordinate as a binary IEEE 754 double, whereas GeoJSON serializes coordinates as decimal text and the GDAL GeoJSON driver caps the number of decimal places through COORDINATE_PRECISION (per the GDAL driver documentation, 7 when writing RFC 7946 output and 15 otherwise). Reprojection on either path adds further floating-point drift. Direct coordinate equality therefore fails even when geometries are topologically identical, which is why this check belongs next to Geometry Validation Patterns.
Schema and attribute coercion. The Shapefile .dbf caps field names at 10 ASCII characters, caps string fields at 254 bytes, and has no native NULL (drivers substitute empty strings or zeros). GeoJSON preserves original keys, nested objects, and explicit null. This is the domain of Attribute & Metadata Checks.
CRS declaration mismatch. RFC 7946 mandates WGS 84 coordinates in longitude, latitude order (commonly labeled EPSG:4326); Shapefile relies on a sidecar .prj. If the pipeline writes GeoJSON in a projected CRS or omits the .prj, the two reads carry different or missing CRS metadata, and distance and spatial-join results diverge.
Geometry-type flattening. A Shapefile layer cannot mix geometry types. Exporters flatten GeometryCollection and promote singles to Multi* (or vice versa), changing the WKT class even when vertices are intact — the concern that Topology Rule Enforcement guards against downstream.
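Encoding drift. GeoJSON is UTF-8 by contract; Shapefile attribute text has historically been written in legacy code pages (ISO-8859-1, CP1252), with the encoding declared, if at all, in a .cpg sidecar or the .dbf header, so accented attribute strings can be corrupted on round-trip. The GDAL Shapefile driver documentation describes the fallback it applies when no encoding is declared.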
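Two of these mechanisms can be reproduced without touching a driver. The sketch below uses only the standard library; `dbf_field_name` and `misdecode` are hypothetical helpers that imitate the 10-character name cap and a wrong-code-page read, not GDAL functions, and real drivers may handle colliding names differently.

```python
def dbf_field_name(key: str, limit: int = 10) -> str:
    """Hypothetical helper: imitate the 10-character .dbf field-name cap."""
    return key[:limit]


def misdecode(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode as UTF-8, then decode with a legacy code page (the round-trip bug)."""
    return text.encode("utf-8").decode(wrong_encoding)


# Schema coercion: two distinct GeoJSON keys collapse to one truncated name.
print(dbf_field_name("population_2020"))  # population
print(dbf_field_name("population_2021"))  # population

# Encoding drift: UTF-8 bytes read back as CP1252 turn into mojibake.
print(misdecode("café"))  # cafÃ©
```

Plain ASCII strings survive the wrong decode unchanged, which is why encoding drift often goes unnoticed until a fixture contains accented text.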

Tolerance model and parameter reference

Parity is asserted under an explicit tolerance, never bit-exact equality. For coordinate-level comparison the predicate reduces to a per-vertex distance bound $\tau$ in CRS units:

$$\lVert p_{\text{geojson}} - p_{\text{shp}} \rVert_2 \le \tau$$
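As a minimal illustration of the per-vertex bound, the following sketch compares two coordinate sequences with the standard library only. The function name `within_vertex_tolerance` and the sample coordinates are assumptions made for this example; it also assumes both sequences list their vertices in the same order.

```python
import math


def within_vertex_tolerance(coords_a, coords_b, tau):
    """True when each paired vertex is within Euclidean distance tau."""
    if len(coords_a) != len(coords_b):
        return False
    return all(math.dist(p, q) <= tau for p, q in zip(coords_a, coords_b))


# Hypothetical vertices: the second set drifts by about 1e-8 degrees.
geojson_ring = [(10.1234567, 45.7654321), (10.2, 45.8)]
shp_ring = [(10.12345671, 45.76543209), (10.2, 45.8)]

print(geojson_ring == shp_ring)                               # False
print(within_vertex_tolerance(geojson_ring, shp_ring, 1e-7))  # True
```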

For polygon parity, vertex distance is the wrong metric; use a symmetric-difference area ratio, where $A_\triangle$ is the area of geojson_geom.symmetric_difference(shp_geom):

$$\frac{A_\triangle}{\max(A_{\text{geojson}},\ A_{\text{shp}})} \le \tau_{\text{area}}$$
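The area ratio can be sketched with Shapely as follows. The helper name `area_ratio`, the sample boxes, and the 1 % budget `TAU_AREA` are assumptions chosen for illustration, not values prescribed by this page.

```python
from shapely.geometry import box


def area_ratio(geom_a, geom_b) -> float:
    """Symmetric-difference area divided by the larger of the two areas."""
    larger = max(geom_a.area, geom_b.area)
    if larger == 0.0:
        # Empty, point, or line inputs: the ratio would be 0/0.
        raise ValueError("area ratio is undefined for zero-area geometries")
    return geom_a.symmetric_difference(geom_b).area / larger


geojson_geom = box(0.0, 0.0, 1.0, 1.0)
shp_geom = box(0.0, 0.0, 1.0, 1.001)  # hypothetical drift: a thin sliver

TAU_AREA = 0.01  # assumed budget: 1 % of the larger polygon
print(area_ratio(geojson_geom, shp_geom) <= TAU_AREA)  # True
```

Note that the ratio is not capped at 1: two disjoint polygons of equal area give a ratio of 2, so any sensible `TAU_AREA` rejects them.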

The grid size used to snap both datasets to a common precision should be at least one decimal place coarser than the GeoJSON export precision so rounding never lands on a tie:

$$\tau = 10^{-d}, \quad d \le \texttt{COORDINATE\_PRECISION} - 1$$
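A small guard can check a chosen grid against this rule before any fixture is written. This is a sketch; `grid_is_coarse_enough` is a hypothetical helper name, and the small slack term only absorbs floating-point error in the logarithm.

```python
import math


def grid_is_coarse_enough(tau: float, coordinate_precision: int) -> bool:
    """Check that tau = 10**-d satisfies d <= coordinate_precision - 1."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    d = -math.log10(tau)
    return d <= coordinate_precision - 1 + 1e-9  # slack for float log10


print(grid_is_coarse_enough(1e-8, 15))  # True: 8 <= 14
print(grid_is_coarse_enough(1e-8, 7))   # False: 8 > 6
print(grid_is_coarse_enough(1e-6, 7))   # True: 6 <= 6
```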

Parameter / flag	Layer	Recommended value	Why it matters
`COORDINATE_PRECISION`	GeoJSON write	`15` for parity fixtures	Stops the driver truncating to 7 dp before comparison
`ENCODING`	Shapefile write/read	`UTF-8`	Prevents `CP1252` mojibake on accented attributes
`set_precision(grid_size=τ)`	Shapely 2.x	`1e-8` (≈1 mm at equator)	Common snap grid for both geometries
`equals_exact(tol)`	Shapely 2.x	`τ`	Vertex-order-sensitive structural compare
`to_crs(epsg=4326)`	GeoPandas	always	Removes `.prj` vs RFC 7946 ambiguity
field name truncation	`.dbf`	`[:10].upper()`	Truncation mirrors the `.dbf` 10-character limit; upper-casing makes key comparison case-insensitive

Step-by-step implementation

The harness below normalizes both reads through one deterministic function, then asserts under tolerance. It targets GeoPandas 0.14+, Shapely 2.x, and pytest 7+, and offloads the blocking I/O so it composes with Async Execution for Large Datasets in CI.

Step 1 — Harmonize CRS. Project both frames to EPSG:4326 so a missing or projected .prj cannot cause a phantom mismatch.

Step 2 — Canonicalize geometry. Repair with make_valid, then set_precision both sides onto the same grid so float drift collapses to equal vertices.

Step 3 — Coerce the schema. Truncate GeoJSON keys to the 10-character .dbf ceiling, upper-case them, and fill NULL with the empty-string placeholder Shapefile would have written.

Step 4 — Sort deterministically. Geometries are not orderable, so sort on WKT plus the remaining attributes and reset the index before comparing.

import asyncio
import geopandas as gpd
import shapely
import pytest
from shapely import make_valid

COORD_PRECISION = 8
TOLERANCE_DEG = 10 ** -COORD_PRECISION  # ~1 mm at the equator in EPSG:4326

def normalize_gdf(gdf: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Deterministic normalization for GeoJSON vs Shapefile parity."""
    gdf = gdf.copy()
    geom_col = gdf.geometry.name

    # Step 1: CRS harmonization
    if gdf.crs is not None and gdf.crs.to_epsg() != 4326:
        gdf = gdf.to_crs(epsg=4326)

    # Step 2: geometry canonicalization (repair + snap to common grid)
    gdf[geom_col] = gdf[geom_col].apply(
        lambda g: shapely.set_precision(make_valid(g), grid_size=TOLERANCE_DEG)
    )

    # Step 3: attribute coercion (10-char .dbf names; NULL -> "")
    gdf.columns = [c if c == geom_col else c[:10].upper() for c in gdf.columns]
    attr_cols = [c for c in gdf.columns if c != geom_col]
    # Fill attributes only: a null geometry cannot be filled with a string.
    gdf[attr_cols] = gdf[attr_cols].fillna("")

    # Step 4: deterministic ordering by geometry WKT, then attributes
    gdf = gdf.assign(_wkt=gdf.geometry.to_wkt())
    return (
        gdf.sort_values(by=["_wkt", *attr_cols])
        .drop(columns="_wkt")
        .reset_index(drop=True)
    )

async def read_parity_pair(geojson_path: str, shp_path: str) -> bool:
    """Read both exports off the event loop, then assert logical parity."""
    loop = asyncio.get_running_loop()
    gdf_geojson = await loop.run_in_executor(None, gpd.read_file, geojson_path)
    gdf_shp = await loop.run_in_executor(None, gpd.read_file, shp_path)

    norm_geojson = normalize_gdf(gdf_geojson)
    norm_shp = normalize_gdf(gdf_shp)

    assert norm_geojson.shape == norm_shp.shape, "feature/column count mismatch"
    assert norm_geojson.equals(norm_shp), "attribute or geometry divergence"
    return True
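Because Step 3 slices every attribute name to 10 characters, two long keys can end up with the same normalized name. The guard below is a sketch of a pre-check that could run on the GeoJSON column names before normalization; `truncation_collisions` is a hypothetical helper, not part of GeoPandas.

```python
from collections import defaultdict


def truncation_collisions(columns, limit=10):
    """Map each truncated, upper-cased name to the original keys sharing it."""
    groups = defaultdict(list)
    for name in columns:
        groups[name[:limit].upper()].append(name)
    return {short: names for short, names in groups.items() if len(names) > 1}


cols = ["parcel_id", "population_2020", "population_2021"]
print(truncation_collisions(cols))
# {'POPULATION': ['population_2020', 'population_2021']}
```

In a test, asserting that the returned dictionary is empty gives a clear failure message that names the offending keys instead of a generic parity mismatch.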

Verification pattern

Drive the harness from a pytest test that writes both formats from one in-memory frame, then asserts that they survive the round trip. The @pytest.mark.asyncio marker comes from the pytest-asyncio plugin, which must be installed alongside pytest. Force COORDINATE_PRECISION=15 on the GeoJSON write so the driver does not pre-truncate the fixture below your tolerance grid:

@pytest.mark.asyncio
async def test_geojson_shapefile_parity(tmp_path):
    src = gpd.read_file("fixtures/parcels.geojson")  # known-good input

    geojson_out = tmp_path / "out.geojson"
    shp_out = tmp_path / "out.shp"
    src.to_file(geojson_out, driver="GeoJSON",
                COORDINATE_PRECISION=15)
    src.to_file(shp_out, driver="ESRI Shapefile",
                ENCODING="UTF-8")

    assert await read_parity_pair(str(geojson_out), str(shp_out))

A single pytest -q tests/test_parity.py::test_geojson_shapefile_parity run is the fast confirmation that normalization is doing its job: flip COORDINATE_PRECISION to 3 and the test should fail, proving the gate actually detects precision loss rather than passing vacuously.

Failure modes and edge cases

Anti-meridian splitting. Geometries crossing ±180° longitude are split by some GeoJSON writers but stored as a single ring in Shapefile (or vice versa), so feature counts diverge even though no data is lost. Test parcels that straddle the date line separately and compare on merged geometry, not row count.
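Field-name collision after truncation. Two GeoJSON keys like population_2020 and population_2021 both truncate to the same 10-character stem, population. Depending on the writer, the second field is either renamed with a suffix (GDAL-based writers typically produce something like populati_1) or lost, while the normalizer's c[:10] slice yields two identically named columns on the GeoJSON side. Assert that the set of truncated names is unique before comparing values; otherwise the comparison fails for reasons unrelated to the data, or passes on a silently dropped column.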
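Empty and null geometries. Shapefile represents a missing geometry as a Null Shape record rather than GeoJSON's "geometry": null, and depending on the writer and reader the feature may come back as null, as an empty geometry, or be dropped. Guard with is_empty / is None before the area ratio: when both sides are empty the ratio is 0/0 and undefined, and when only one side is empty symmetric_difference returns the other geometry, so the ratio is 1 and the failure points at geometry drift rather than at the missing shape.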
Single vs multi promotion. A layer with one Polygon and one MultiPolygon forces the Shapefile driver to promote everything to MultiPolygon. Promote single-part geometries on both sides to their Multi* counterparts (for example, wrap a Polygon as MultiPolygon([polygon])) before equals_exact, or identical geometries compare unequal on type alone.
Mixed Z/M coordinates. Shapefile has dedicated PointZ/PolygonZ subtypes; if one export writes 2D [lon, lat] positions while the other retains [lon, lat, elev], the elevation is silently lost on one side. equals_exact ignores Z, so assert has_z parity explicitly when elevation is load-bearing.
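The multi-promotion and Z-parity guards from the list above can be sketched with Shapely 2.x as follows. `promote_to_multi` and `z_parity` are hypothetical helper names introduced for this example; the sketch leaves GeometryCollection and already-multi geometries untouched.

```python
from shapely.geometry import (
    LineString,
    MultiLineString,
    MultiPoint,
    MultiPolygon,
    Point,
    Polygon,
)

# Single-part type name -> Multi* constructor.
_MULTI = {
    "Point": MultiPoint,
    "LineString": MultiLineString,
    "Polygon": MultiPolygon,
}


def promote_to_multi(geom):
    """Wrap a single-part geometry in its Multi* counterpart."""
    if geom is None or geom.is_empty:
        return geom
    wrapper = _MULTI.get(geom.geom_type)
    return wrapper([geom]) if wrapper else geom


def z_parity(geom_a, geom_b) -> bool:
    """True when both geometries agree on whether they carry Z values."""
    return geom_a.has_z == geom_b.has_z


single = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
multi = MultiPolygon([single])

print(promote_to_multi(single).geom_type)                 # MultiPolygon
print(promote_to_multi(single).equals_exact(multi, 1e-8))  # True
print(z_parity(Point(1, 2), Point(1, 2, 3)))               # False
```

Applying `promote_to_multi` to both frames before the comparison keeps the type check symmetric, so neither format is treated as the reference.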

Conclusion

Treat format divergence as an expected engineering constraint, not a bug: normalize CRS, snap precision, mirror the .dbf schema rules, then assert under an explicit tolerance so the test catches real corruption and ignores serialization noise. For the full parity assertion family and its PostGIS counterparts, return to Cross-Format Parity Testing.