Attribute & Metadata Checks

Attribute and metadata checks are the contract layer within Spatial Test Pattern Design & Implementation: the assertions that validate everything about a spatial feature except the geometry itself. Where geometry validation patterns guard the coordinate tuple, this pattern guards the columns beside it — the zoning_code, the survey_date, the epsg declaration, the dataset-level lineage record — and the catalog metadata that makes a layer discoverable and auditable. These fields drive joins, styling, routing logic, regulatory reporting, and access control, so a silent type coercion or a missing controlled-vocabulary value corrupts downstream analytics just as surely as a self-intersecting polygon. This reference targets GIS QA engineers, data engineers, Python developers, and platform/DevOps teams who need a declarative, version-controlled way to assert the non-geometric contract and gate it before data reaches production.

The governing principle is schema-as-code: validation rules are declarative artifacts, version-controlled alongside the pipeline, evaluated deterministically, and decoupled from any single execution engine so the same contract runs against PostGIS, GeoParquet, GeoPackage, or a Shapefile. Like the strict tolerance thresholds that govern geometric assertions, attribute thresholds — acceptable null rates, numeric drift bounds, enum sets — are parameters, never hard-coded literals buried inside a check.

Assertion taxonomy

Every attribute or metadata rule resolves to one of a small set of assertion families. Each family has a natural tolerance strategy: some demand exact equality, others tolerate bounded numeric drift, and a few are governed by a rate threshold rather than a per-row verdict. The table maps the families to a recommended strategy, a typical threshold range, and the scope at which the check runs.

Assertion family	Tolerance strategy	Typical threshold	Scope	Severity
Data type / casting	exact, no implicit coercion	0 mismatches	per feature	hard gate
Domain / enum membership	exact set / regex	0 out-of-set values	per feature	hard gate
Numeric range / bounds	inclusive min–max	per-field config	per feature	hard gate
Numeric drift vs reference	relative error bound	`1e-6`–`1e-3`	per feature	hard gate
Nullability rate	proportion threshold	`0`–`0.05` null fraction	per column	warn → gate
Cardinality / uniqueness	exact duplicate count	0 duplicate keys	per column	hard gate
Cross-field predicate	boolean assertion	0 violations	per feature	hard gate
CRS / SRID declaration	exact match to expected	single authoritative code	per dataset	hard gate
Metadata completeness	required-field presence	100% of mandatory fields	per dataset	hard gate
Lineage checksum	byte-exact SHA-256	exact match	per dataset	hard gate
Vocabulary conformance	exact set / taxonomy	0 unmapped terms	per dataset	warn → gate

The recurring divide is per-feature versus per-dataset. Per-feature checks are embarrassingly parallel and slot directly into the chunked worker model described in async execution for large datasets; per-dataset metadata checks run once at the manifest level and are cheap. Numeric drift and null-rate thresholds are the only families that carry real tolerance math — everything else is an exact-match predicate.

Data-type and domain assertions

A type assertion enforces the declared storage type without silent coercion: an integer column must reject "12.0", a date column must reject "0000-00-00", and a float column must not absorb a string that pandas would happily upcast to object. A domain assertion narrows the type further to an allowed set — an enumerated land_use taxonomy, a regex for a parcel identifier, or a referential check against a lookup table. The predicate is exact: zero out-of-set values pass.

SCHEMA = {
    "parcel_id": {"dtype": "string", "regex": r"^[0-9]{2}-[0-9]{4}-[0-9]{3}$"},
    "land_use":  {"dtype": "string", "enum": {"residential", "commercial", "industrial", "vacant"}},
    "assessed_value": {"dtype": "int64", "min": 0, "max": 50_000_000},
}

The critical anti-pattern is letting the loader infer types. geopandas.read_file and pyarrow can silently promote a malformed integer column to object or double; the type check must run against an explicitly declared expectation, not against whatever dtype the loader guessed.
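One way to avoid trusting the loader is to keep attributes as text on load and match each raw value against the declared type. The sketch below is a minimal illustration under that assumption; `DECLARED`, `INT_LITERAL`, and `strict_type_violations` are hypothetical names, and only the `int64` and `string` declarations are handled.

```python
import pandas as pd

# Hypothetical declared contract, restating two fields of the SCHEMA above.
DECLARED = {
    "parcel_id": {"dtype": "string", "regex": r"[0-9]{2}-[0-9]{4}-[0-9]{3}"},
    "assessed_value": {"dtype": "int64"},
}

INT_LITERAL = r"[+-]?[0-9]+"  # rejects "12.0", "1e3", and padded values


def strict_type_violations(df: pd.DataFrame, declared: dict) -> dict:
    """Return {column: [row labels]} for non-null values that break the declared type.

    Assumes df was loaded with attributes kept as text (for example dtype=str),
    so the check sees the raw value rather than an inferred dtype.
    """
    violations = {}
    for col, spec in declared.items():
        if spec["dtype"] not in ("int64", "string"):
            raise NotImplementedError(f"no strict check for {spec['dtype']!r}")
        text = df[col].astype("string")  # missing values stay missing
        pattern = INT_LITERAL if spec["dtype"] == "int64" else spec.get("regex", r".*")
        # fullmatch yields NA for missing values; treat those as "not a match".
        matches = text.str.fullmatch(pattern).fillna(False).astype(bool)
        # Nulls are left to the nullability check, so only present values fail here.
        bad = df.index[(text.notna() & ~matches).to_numpy()]
        if len(bad):
            violations[col] = bad.tolist()
    return violations
```

Missing values are deliberately skipped so that the type gate and the null-rate gate report separately.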

Nullability, cardinality, and sentinel handling

Missingness in spatial data is rarely a clean NULL. It hides as an empty string, whitespace padding, or a sentinel value: -9999 for no-data elevation, 0000-00-00 for an unset date, 0 for an unmeasured field. A nullability check must first normalize these to a true null, then assert against a rate threshold rather than demanding zero nulls, because most real columns tolerate some missingness. For a column with $n$ rows and $m$ effective nulls, the column passes when the null fraction does not exceed the configured ceiling $\rho$:

$$\frac{m}{n} \le \rho, \qquad \rho \in [0,\ 0.05]\ \text{typical}$$

Cardinality is the complementary check: primary-key columns (parcel_id, feature_uuid) must be unique, so the duplicate count is asserted at exactly zero.

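# Values are compared after stripping, so whitespace-only entries collapse to "".
# "-9999.0" covers float columns, whose sentinel stringifies with a decimal part.
SENTINELS = {"", "-9999", "-9999.0", "0000-00-00", "N/A", "null"}

def effective_nulls(series):
    """Boolean mask of true nulls plus normalized sentinel values."""
    normalized = series.astype("string").str.strip()
    return normalized.isna() | normalized.isin(SENTINELS)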
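
The rate threshold and the uniqueness rule can be put together as a short sketch. It assumes pandas, restates the sentinel set so the block runs on its own, and uses hypothetical helper names (`null_rate`, `passes_null_ceiling`, `duplicate_keys`).

```python
import pandas as pd

# Restated so the block runs on its own; adjust to the project's sentinel set.
SENTINELS = {"", "-9999", "-9999.0", "0000-00-00", "N/A", "null"}


def effective_nulls(series: pd.Series) -> pd.Series:
    """Boolean mask of true nulls plus stripped sentinel values."""
    normalized = series.astype("string").str.strip()
    return (normalized.isna() | normalized.isin(SENTINELS)).astype(bool)


def null_rate(series: pd.Series) -> float:
    """m / n, where m counts effective nulls; an empty column reports 0.0."""
    n = len(series)
    return 0.0 if n == 0 else float(effective_nulls(series).sum()) / n


def passes_null_ceiling(series: pd.Series, ceiling: float) -> bool:
    """True when the null fraction does not exceed the configured ceiling."""
    return null_rate(series) <= ceiling


def duplicate_keys(series: pd.Series) -> list:
    """Key values that occur more than once, ignoring effective nulls."""
    present = series[~effective_nulls(series)]
    return sorted(present[present.duplicated(keep=False)].unique())
```

With 20 rows and one sentinel the fraction is 0.05, which passes a ceiling of 0.05; a second sentinel pushes it to 0.10 and fails. Nulls are excluded from the duplicate count so that a sparse key column is reported once, by the null-rate rule, rather than twice.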

Cross-field and business-logic assertions

The highest-value attribute checks are relational: they assert an invariant between columns rather than within one. A temporal interval must be ordered (end_date >= start_date); a derived field must equal its formula (area_ha == ST_Area(geom) / 10000 within tolerance, with the area computed in a metre-based projected CRS or on the geography type, since ST_Area on an EPSG:4326 geometry returns square degrees); a conditional rule fires only when a predicate holds (if status == 'sold' then sale_price is not null). These encode domain rules that no single-column schema can express.

def cross_field_violations(gdf):
    bad_interval = gdf["end_date"] < gdf["start_date"]
    sold_no_price = (gdf["status"] == "sold") & gdf["sale_price"].isna()
    return gdf.index[bad_interval | sold_no_price]

Cross-field rules that depend on the geometry (a computed area, a centroid-in-region test) sit at the boundary with topology rule enforcement, and the attribute gate should run first: a parcel with an invalid zoning_code should fail here before any adjacency or coverage rule consumes a spatial index slot on it.
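
The derived-field rule mentioned earlier (area_ha against the computed area) needs a tolerance rather than strict equality. A minimal sketch, assuming the area in square metres has already been computed elsewhere and using a hypothetical relative bound of 1e-6; `area_drift_violations` and the row layout are illustrative, not part of any library:

```python
import math

REL_TOL = 1e-6  # assumed relative error bound; tune per dataset


def area_drift_violations(rows, rel_tol=REL_TOL):
    """Return feature IDs whose stored area_ha drifts from the computed area.

    rows: iterable of (feature_id, area_ha, computed_area_m2) tuples.
    """
    bad = []
    for fid, area_ha, area_m2 in rows:
        expected_ha = area_m2 / 10_000  # square metres to hectares
        # Relative comparison: |a - b| <= rel_tol * max(|a|, |b|)
        if not math.isclose(area_ha, expected_ha, rel_tol=rel_tol):
            bad.append(fid)
    return bad
```

Keeping `rel_tol` a parameter lets the same check read its bound from the contract object instead of a literal.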

CRS / SRID declaration consistency

A spatial dataset carries its coordinate reference system in two places that must agree: the declared metadata (the GeoParquet CRS field, the .prj sidecar, the PostGIS srid column) and the actual coordinate magnitudes. A declaration check asserts the declared code matches the single authoritative EPSG code the pipeline expects, and fails fast rather than reprojecting silently — because a silent reprojection turns a metadata bug into corrupted coordinates. This is the attribute-side mirror of the CRS work covered in automating CRS validation in CI pipelines.

EXPECTED_CRS = "EPSG:4326"

def assert_crs(gdf):
    declared = None if gdf.crs is None else gdf.crs.to_string()
    assert declared == EXPECTED_CRS, f"CRS declaration {declared!r} != {EXPECTED_CRS}"

A coordinate-magnitude sanity check complements the declaration: if the CRS claims EPSG:4326 but any X value falls outside [-180, 180] or any Y value falls outside [-90, 90], the declaration is lying. Full precision-loss detection across conversions is its own topic; see testing coordinate precision loss during conversion.
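
The magnitude check can be sketched against a bounding box alone. The function below is an illustration with a hypothetical name; it assumes the bounds arrive as (minx, miny, maxx, maxy), the order GeoDataFrame.total_bounds uses.

```python
import math


def plausible_for_epsg4326(bounds) -> bool:
    """True when a bounding box fits inside the valid longitude/latitude range.

    bounds: (minx, miny, maxx, maxy), for example from GeoDataFrame.total_bounds.
    """
    minx, miny, maxx, maxy = (float(v) for v in bounds)
    if any(math.isnan(v) for v in (minx, miny, maxx, maxy)):
        return False  # empty layer: nothing available to confirm the declaration
    return -180.0 <= minx and maxx <= 180.0 and -90.0 <= miny and maxy <= 90.0
```

A pass is only a necessary condition: small projected coordinates near an origin also fit inside this range, so the check catches gross mislabeling rather than proving the declaration correct.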

Metadata completeness and standards conformance

Dataset-level metadata is validated separately from per-feature attributes. A completeness check asserts that every mandatory field required by the target standard (ISO 19115, FGDC, INSPIRE, or an internal catalog schema) is present and non-empty: title, abstract, spatial_reference, temporal_extent, lineage, and a responsible-party contact. A conformance check goes further and asserts each field’s value belongs to a controlled vocabulary (topic categories, keyword thesauri, role codes) so the dataset is discoverable and federates cleanly across catalogs.

ISO_19115_MANDATORY = {
    "title", "abstract", "spatial_reference",
    "temporal_extent", "lineage", "responsible_party",
}

def missing_metadata(record: dict) -> set:
    present = {k for k, v in record.items() if v not in (None, "", [])}
    return ISO_19115_MANDATORY - present
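
The conformance half of the check can be sketched the same way. The field names `topic_category` and `role` are assumptions about the catalog record, and the vocabularies are small illustrative subsets of ISO 19115 code lists rather than complete lists.

```python
# Illustrative subsets only; a real suite would load the full code lists
# from the target standard or the catalog.
CONTROLLED_VOCAB = {
    "topic_category": {"boundaries", "elevation", "environment",
                       "planningCadastre", "transportation"},
    "role": {"owner", "custodian", "pointOfContact", "publisher"},
}


def unmapped_terms(record: dict, vocab: dict = CONTROLLED_VOCAB) -> dict:
    """Return {field: sorted unmapped values} for vocabulary-controlled fields."""
    problems = {}
    for field_name, allowed in vocab.items():
        value = record.get(field_name)
        if value in (None, "", []):
            continue  # absence is the completeness check's job, not this one
        # Accept either a single term or a collection of terms.
        values = value if isinstance(value, (list, tuple, set)) else [value]
        bad = sorted(v for v in values if v not in allowed)
        if bad:
            problems[field_name] = bad
    return problems
```

An empty result means every controlled field maps into its vocabulary; missing fields are left to the completeness check so each failure is reported once.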

Lineage, provenance, and checksum verification

Lineage validation detects silent corruption and unauthorized modification between pipeline stages. The pattern computes a SHA-256 digest of each source artifact at ingestion, embeds it in the metadata record, and re-verifies it before any downstream stage consumes the file. A mismatch means the bytes changed in transit — a partial upload, a re-serialization, or tampering — regardless of what the file extension claims.

import hashlib

def sha256(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def verify_lineage(path, manifest):
    actual = sha256(path)
    assert actual == manifest["sha256"], f"lineage mismatch for {path}"
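
The ingestion side of the pattern, writing the digest into a manifest, can be sketched as follows. The manifest layout (a JSON object with `source`, `size_bytes`, and `sha256` keys) is an assumption for illustration, as are the function names.

```python
import hashlib
import json
import os


def sha256_file(path, chunk=1 << 20):
    """Stream the file in 1 MiB blocks so large artifacts never load fully."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def record_lineage(path, manifest_path):
    """Compute the digest at ingestion and persist it beside the artifact."""
    entry = {
        "source": os.path.basename(path),
        "size_bytes": os.path.getsize(path),
        "sha256": sha256_file(path),
    }
    with open(manifest_path, "w", encoding="utf-8") as f:
        json.dump(entry, f, indent=2, sort_keys=True)
    return entry


def lineage_intact(path, manifest_path) -> bool:
    """Recompute the digest and compare it with the recorded one."""
    with open(manifest_path, encoding="utf-8") as f:
        manifest = json.load(f)
    return sha256_file(path) == manifest["sha256"]
```

Returning a boolean instead of asserting lets the caller fold a mismatch into the same structured failure list as the other checks.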

Production-grade Python implementation

The harness below wraps the contract in a pytest test that runs five of the assertion families (CRS declaration, enum membership, numeric bounds, null-rate ceiling, and one cross-field rule) against a loaded GeoDataFrame, with the CRS, null-rate, enum, and bounds thresholds injected from a frozen config object rather than hard-coded. It targets GeoPandas 0.14+, Shapely 2.x, and pytest 7+, and returns a structured failure list (feature ID + rule code) so the result is machine-readable for CI artifacts.

# test_attribute_contract.py
from dataclasses import dataclass, field

import geopandas as gpd
import pytest


@dataclass(frozen=True)
class AttributeContract:
    expected_crs: str = "EPSG:4326"
    null_rate_ceiling: float = 0.05
    enums: dict = field(default_factory=lambda: {
        "land_use": {"residential", "commercial", "industrial", "vacant"},
    })
    numeric_bounds: dict = field(default_factory=lambda: {
        "assessed_value": (0, 50_000_000),
    })


# Compared after stripping; "-9999.0" covers float columns cast to string.
SENTINELS = {"", "-9999", "-9999.0", "0000-00-00", "N/A", "null"}


def _effective_nulls(series):
    s = series.astype("string").str.strip()
    return s.isna() | s.isin(SENTINELS)


def validate(gdf: gpd.GeoDataFrame, c: AttributeContract) -> list[tuple]:
    failures: list[tuple] = []

    # 1. CRS declaration — fail fast, never reproject silently.
    declared = None if gdf.crs is None else gdf.crs.to_string()
    if declared != c.expected_crs:
        failures.append(("__dataset__", f"crs_mismatch:{declared}"))

    # 2. Enum / domain membership.
    for col, allowed in c.enums.items():
        bad = gdf.index[~gdf[col].isin(allowed) & gdf[col].notna()]
        failures += [(fid, f"enum_violation:{col}") for fid in bad]

    # 3. Numeric bounds.
    for col, (lo, hi) in c.numeric_bounds.items():
        bad = gdf.index[(gdf[col] < lo) | (gdf[col] > hi)]
        failures += [(fid, f"out_of_bounds:{col}") for fid in bad]

    # 4. Null-rate ceiling (per column, rate-based not per row).
    #    Skip the active geometry column by name rather than assuming "geometry".
    for col in gdf.columns.drop(gdf.geometry.name):
        rate = _effective_nulls(gdf[col]).mean()
        if rate > c.null_rate_ceiling:
            failures.append(("__dataset__", f"null_rate:{col}={rate:.3f}"))

    # 5. Cross-field business logic.
    bad_interval = gdf.index[gdf["end_date"] < gdf["start_date"]]
    failures += [(fid, "interval_disorder") for fid in bad_interval]

    return failures


@pytest.mark.parametrize("dataset", ["fixtures/parcels.gpkg"])
def test_attribute_contract(dataset):
    contract = AttributeContract()  # frozen thresholds, version-controlled
    gdf = gpd.read_file(dataset)
    failures = validate(gdf, contract)
    assert not failures, f"{len(failures)} attribute violations: {failures[:10]}"
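
To make the failure list usable as a CI artifact, it can be summarized and serialized. The report shape below is an assumption, not a standard format, and `summarize_failures` and `write_report` are hypothetical helpers that consume the (feature_id, rule_code) tuples returned by validate().

```python
import json
from collections import Counter


def summarize_failures(failures, sample_size=10):
    """Collapse (feature_id, rule_code) tuples into a JSON-serializable report."""
    # Rule codes look like "enum_violation:land_use"; group on the part before ":".
    by_rule = Counter(code.split(":", 1)[0] for _, code in failures)
    return {
        "passed": not failures,
        "total": len(failures),
        "by_rule": dict(sorted(by_rule.items())),
        "sample": [
            {"feature_id": str(fid), "rule": code}
            for fid, code in failures[:sample_size]
        ],
    }


def write_report(failures, path):
    """Write the summary where the CI job can pick it up as an artifact."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(summarize_failures(failures), f, indent=2)
```

Feature IDs are stringified because index labels coming out of a GeoDataFrame are often NumPy integers, which the json module does not serialize.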

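For teams already standardized on expectation suites, the same contract expresses cleanly as Great Expectations checks, which produce a Data Docs artifact alongside the pass/fail verdict. The snippet is written against the 0.18 fluent API (ctx.sources); the 1.x releases restructured this entry point, so treat it as version-specific:

import great_expectations as gx

contract = AttributeContract()  # same frozen thresholds as the pytest harness
ctx = gx.get_context()
# In 0.18, read_dataframe returns a Validator that exposes the expect_* methods.
validator = ctx.sources.pandas_default.read_dataframe(gdf.drop(columns="geometry"))

validator.expect_column_values_to_be_in_set(
    "land_use", sorted(contract.enums["land_use"]))
lo, hi = contract.numeric_bounds["assessed_value"]
validator.expect_column_values_to_be_between("assessed_value", min_value=lo, max_value=hi)
validator.expect_column_values_to_not_be_null("parcel_id", mostly=1.0)
validator.expect_column_values_to_be_unique("parcel_id")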

Because both forms read thresholds from a single declared source, fixtures that exercise the edge cases — empty enums, sentinel-laden nulls, disordered intervals — should be generated with the test data generation and mocking strategies patterns rather than hand-copied from production.

PostGIS / database-side counterparts

When the layer lives in PostGIS, the cheapest place to assert most of this contract is the database, pushing set membership, null rates, and SRID checks into SQL so only the verdict crosses the wire. A CHECK constraint or a validation query enforces the same rules the Python harness does, server-side.

-- Per-feature domain + cross-field violations in one pass.
SELECT parcel_id,
       (land_use NOT IN ('residential','commercial','industrial','vacant')) AS bad_enum,
       (assessed_value < 0 OR assessed_value > 50000000)                    AS bad_bounds,
       (end_date < start_date)                                              AS bad_interval
FROM   parcels
WHERE  land_use NOT IN ('residential','commercial','industrial','vacant')
   OR  assessed_value < 0 OR assessed_value > 50000000
   OR  end_date < start_date;

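-- SRID declaration must match the authoritative code for every row.
-- An empty result is a pass; any returned row names an unexpected SRID.
SELECT DISTINCT ST_SRID(geom) AS srid
FROM   parcels
WHERE  ST_SRID(geom) <> 4326;

-- Column-level null rate as a single aggregate; NULLIF avoids division by zero on an empty table.
-- Only true NULLs are counted here; sentinel strings need the same normalization as the Python side.
SELECT count(*) FILTER (WHERE land_use IS NULL)::float / NULLIF(count(*), 0) AS null_rate
FROM   parcels;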

A psycopg2 wrapper runs these as part of the same orchestration as the Python checks, so a dataset already resident in the database is validated without a full round-trip into a GeoDataFrame. Keeping the SQL and the Python contract in sync is itself a parity concern, closely related to cross-format parity testing: the same enum set and bounds must survive whether the check runs in pandas or in Postgres.
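
One way to keep the two sides in sync is to render the SQL from the same contract values instead of maintaining it by hand. The sketch below is an assumption about how that could look, not an established helper: `build_violation_query` is hypothetical, it emits `%s` placeholders in the style psycopg2 expects, and it validates identifiers with a conservative pattern because table and column names cannot be passed as query parameters.

```python
import re

_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _ident(name: str) -> str:
    """Allow only plain identifiers, since they are interpolated into the SQL."""
    if not _IDENT.fullmatch(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


def build_violation_query(table, key, enums, numeric_bounds):
    """Return (sql, params) selecting keys of rows that break enum or bounds rules."""
    clauses, params = [], []
    for col, allowed in sorted(enums.items()):
        if not allowed:
            raise ValueError(f"empty enum set for {col!r}")
        placeholders = ", ".join(["%s"] * len(allowed))
        clauses.append(f"{_ident(col)} NOT IN ({placeholders})")
        params.extend(sorted(allowed))  # sorted for a deterministic parameter order
    for col, (lo, hi) in sorted(numeric_bounds.items()):
        clauses.append(f"({_ident(col)} < %s OR {_ident(col)} > %s)")
        params.extend([lo, hi])
    if not clauses:
        raise ValueError("contract defines no enum or bounds rules")
    sql = f"SELECT {_ident(key)} FROM {_ident(table)} WHERE " + " OR ".join(clauses)
    return sql, params
```

With psycopg2 the pair would be passed as `cursor.execute(sql, params)`, so the enum values and bounds travel as bound parameters and only vetted identifiers are interpolated.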

Pipeline integration

Attribute and metadata checks are the earliest hard gate in the pipeline, slotting beneath the core geospatial QA architecture as the prerequisite that lets every later geometric and topological stage assume a clean contract. Three integration controls keep them deterministic and observable.

Pre-merge and ingestion gates. Run the contract as a pre-commit hook on fixture changes and as a required check in the orchestration DAG (Airflow, Prefect, Dagster, or GitHub Actions). Reject non-compliant payloads at the ingestion edge: route valid features to staging and quarantine failures with their structured (feature_id, rule_code) payload rather than dropping them silently.
Schema-drift monitoring. Version the contract object and diff it across runs. Alert on unexpected column additions, type mutations, or a deprecated field reappearing — drift in the schema is as significant as a row-level violation, and a versioned expectation suite makes the change reviewable in a pull request.
Observability and pinned runtimes. Emit pass/fail ratios, per-column null distributions, and check latency to OpenTelemetry or Prometheus for SLO tracking, and pin the native stack (GDAL, PROJ, libgeos) in the runner image so type inference and CRS resolution behave identically between local and CI. Idempotent, side-effect-free checks make safe retries and parallel reprocessing trivial.
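
The schema-drift control above can be sketched as a plain dictionary diff. It assumes both sides are represented as {column_name: dtype_string} mappings; `diff_schema` is a hypothetical helper.

```python
def diff_schema(expected: dict, observed: dict) -> dict:
    """Compare two {column: dtype_string} mappings and report drift.

    observed can be built from a loaded frame, for example
    {c: str(t) for c, t in gdf.dtypes.items()}.
    """
    exp_cols, obs_cols = set(expected), set(observed)
    return {
        "added": sorted(obs_cols - exp_cols),
        "removed": sorted(exp_cols - obs_cols),
        "retyped": {
            col: (expected[col], observed[col])
            for col in sorted(exp_cols & obs_cols)
            if expected[col] != observed[col]
        },
    }


def has_drift(diff: dict) -> bool:
    """True when any column was added, removed, or changed type."""
    return bool(diff["added"] or diff["removed"] or diff["retyped"])
```

Because the output is sorted and contains only built-in types, it can be committed or attached to a pull request and diffed across runs.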

Common failure modes and gotchas

Loader-inferred types mask violations. read_file and pyarrow upcast a malformed integer column to object or double, so a naive dtype check passes against the guessed type. Assert against an explicitly declared schema, never against the loaded dtype.
Sentinels counted as valid data. -9999, 0, and 0000-00-00 are not null to pandas, so a null-rate check reads them as present and a bounds check may pass them. Normalize the sentinel set before any nullability or numeric assertion.
Silent CRS reprojection. A check that auto-reprojects a mismatched CRS hides a metadata bug and corrupts coordinates downstream. Fail fast on a declaration mismatch; never reproject inside a validation step.
Whitespace and Unicode in enums. A trailing space or a non-breaking space makes "vacant " fail set membership against "vacant". Strip and NFC-normalize string columns before enum and regex checks.
Encoding-driven false failures. A Shapefile .dbf read with the wrong code page (CP1252 vs UTF-8) mangles accented domain values and fails conformance for a non-data reason. Pin the encoding in the reader and test against an accented fixture.
Float equality on derived fields. Asserting area_ha == ST_Area(geom)/10000 with == fails on floating-point noise. Use a relative error bound, the same way strict tolerance thresholds govern geometric comparisons.
Metadata checked but lineage trusted. Validating that a sha256 field exists is not the same as recomputing and comparing it. A present-but-stale checksum passes a completeness check while the bytes have already drifted — always re-verify, never just assert presence.
Per-row gate on a tolerant column. Hard-failing a column that legitimately carries 3% nulls floods the quarantine and erodes trust in the gate. Reserve hard gates for keys and domains; use a rate ceiling for naturally sparse fields.
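
The whitespace and Unicode gotcha above has a small standard-library fix. The sketch assumes enum values are plain Python strings; `normalize_term` and `in_enum` are hypothetical names.

```python
import unicodedata


def normalize_term(value):
    """NFC-normalize and strip a string before enum or regex checks."""
    if value is None:
        return None
    # NFC composes sequences such as "e" + combining acute into a single "é".
    composed = unicodedata.normalize("NFC", value)
    # str.strip() removes Unicode whitespace, including the non-breaking space.
    return composed.strip()


def in_enum(value, allowed) -> bool:
    """Set membership after normalizing both the value and the allowed terms."""
    return normalize_term(value) in {normalize_term(a) for a in allowed}
```

Normalizing the allowed set as well as the value guards against a vocabulary file that was itself saved in a decomposed form.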

Conclusion

Attribute and metadata checks are what turn a geometrically valid dataset into a trustworthy one: a schema-as-code contract that asserts types, domains, cross-field invariants, CRS declarations, and standards-conformant lineage with the same determinism and version control that govern spatial tolerances. Treat them as the first hard gate of Spatial Test Pattern Design & Implementation, and every downstream geometric and topological assertion inherits a clean, audit-ready contract instead of inheriting silent data debt.