Glossary

Esker terms with grounded definitions.

Definitions favor "what it actually is" over conceptual descriptions.

binding

A bare name → <owner>/<name> mapping. Lives in pyproject.toml [tool.esker.datasets] (project-scoped) or ~/.esker/config.toml [datasets] (global). Project bindings can be version-pinned via esker.lock. See Bindings.

cadence

A free-form string ClassVar on a pipeline indicating how often it should run ("daily", "weekly"). Informational only: Esker has no scheduler, so the value changes no behavior. Surfaced in esker list only.

CompatReport

Output of the schema diff. A classification: breaking, additive, required_bump. See Compatibility.

consumer

Anyone who reads a published dataset (via esker.get, esker pull, esker view, etc.). The opposite of a publisher.

DatasetManifest

Sidecar metadata for a published dataset. One per run. The hub's source of truth for "what does this dataset look like at version V". POSTed to /datasets/<owner>/<name> on push. See Manifests.

DatasetRef

<owner>/<name>[@<version>]. Frozen value object. Versioned for write paths; versionless = "latest" for reads. Bare-name shortcut lives in bindings.resolve, not in the parser. Internal — users pass strings.
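
The ref grammar above can be sketched as a small parser. This is a minimal illustration, not the SDK's DatasetRef: the names RefSketch and parse_ref are hypothetical, and "acme" in the usage note is a made-up owner. Bare names are rejected here, matching the note that the bare-name shortcut lives in bindings.resolve rather than in the parser.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RefSketch:
    """Hypothetical stand-in for DatasetRef; not the SDK class."""

    owner: str
    name: str
    version: Optional[str] = None  # None reads as "latest"


def parse_ref(text: str) -> RefSketch:
    """Parse '<owner>/<name>[@<version>]'; bare names are rejected."""
    path, at, version = text.partition("@")
    owner, slash, name = path.partition("/")
    # Reject a missing slash, an empty owner or name, or a dangling '@'.
    if not slash or not owner or not name or (at and not version):
        raise ValueError(f"not an <owner>/<name>[@<version>] ref: {text!r}")
    return RefSketch(owner, name, version if at else None)
```

For example, parse_ref("acme/us.sec.companies@1.2.0") yields a versioned ref suitable for write paths, while parse_ref("acme/us.sec.companies") leaves version as None.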

DOMAIN_ID

The bare name of a dataset. Lowercase a–z0–9 dot-separated, ≥ 2 segments. First segment is the jurisdiction. Examples: us.sec.companies, ca.corporations.registry. Pattern: ^[a-z0-9]+(\.[a-z0-9]+)+$.
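
A quick way to check a candidate name against the pattern above. The regex is copied from this entry; the helper names is_domain_id and jurisdiction_of are hypothetical and not part of the SDK.

```python
import re

# Pattern copied verbatim from the DOMAIN_ID entry.
DOMAIN_ID_RE = re.compile(r"^[a-z0-9]+(\.[a-z0-9]+)+$")


def is_domain_id(name: str) -> bool:
    """True if name is lowercase, dot-separated, with at least two segments."""
    return DOMAIN_ID_RE.fullmatch(name) is not None


def jurisdiction_of(name: str) -> str:
    """Return the first dot-segment of a valid DOMAIN_ID."""
    if not is_domain_id(name):
        raise ValueError(f"not a DOMAIN_ID: {name!r}")
    return name.split(".", 1)[0]
```

Both examples from the entry pass, while single-segment names such as "us", names with uppercase letters, and names with empty segments are rejected.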

EskerEntity

The "what is this thing" record (entity-resolver concept). Carries esker_id, entity_type, jurisdiction, name, discovered_at. Currently not consumed by any path; deferred to a future release.

esker_id

The cross-dataset join key for a single conceptual entity. Format: esker:<jurisdiction>:<entity_type>:<native_id>. Two records about the same corporation share one esker_id.

Built at run time by EskerPipeline.run().
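
A minimal sketch of how the four parts could be assembled, assuming the jurisdiction is taken from the first dot-segment of DOMAIN_ID and native_id is read from the record's key field. build_esker_id is a hypothetical helper, the record values are made up, and the real logic lives in EskerPipeline.run().

```python
def build_esker_id(domain_id: str, entity_type: str, native_id: str) -> str:
    """Assemble esker:<jurisdiction>:<entity_type>:<native_id>."""
    # Assumption: the jurisdiction is the first dot-segment of DOMAIN_ID.
    jurisdiction = domain_id.split(".", 1)[0]
    return f"esker:{jurisdiction}:{entity_type}:{native_id}"


# Illustrative record; "cik" plays the role of the key field.
record = {"cik": "0000000001", "name": "Example Corp"}
example_id = build_esker_id("us.sec.companies", "corp", record["cik"])
```

Two datasets that describe the same corporation with the same jurisdiction, entity_type, and native_id would produce the same string, which is what makes it usable as a join key.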

esker_lineage_id

Per-record reference to a lineage batch (a unique (source_url, fetched_at) pair). UUID, stored as a column in parquet, joined against <domain>.lineage.json's batches[*].lineage_id.
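
The join described above can be sketched with plain dicts. The field names manifest_version, run_id, batches, lineage_id, source_url, fetched_at, and esker_lineage_id come from this glossary; every value below is made up, and batch_for is a hypothetical helper.

```python
# Illustrative shape of <domain>.lineage.json; all values are placeholders.
bundle = {
    "manifest_version": 1,
    "run_id": "run-0001",
    "batches": [
        {
            "lineage_id": "11111111-1111-1111-1111-111111111111",
            "source_url": "https://example.com/a.json",
            "fetched_at": "2024-01-01T00:00:00+00:00",
        },
        {
            "lineage_id": "22222222-2222-2222-2222-222222222222",
            "source_url": "https://example.com/b.json",
            "fetched_at": "2024-01-01T00:05:00+00:00",
        },
    ],
}

# Rows as they might appear after reading the parquet file.
rows = [
    {"name": "Example Corp", "esker_lineage_id": "22222222-2222-2222-2222-222222222222"},
]

# Index batches once, then resolve each row's lineage by id.
batches_by_id = {batch["lineage_id"]: batch for batch in bundle["batches"]}


def batch_for(row: dict) -> dict:
    """Return the lineage batch a record was produced from."""
    return batches_by_id[row["esker_lineage_id"]]
```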

EskerModel

Pydantic base for every domain record. frozen=True, extra="forbid". Subclass declares DOMAIN_ID and schema_version as ClassVars. Variants: draft (no esker_*) and published (with the three injected fields, via .published()). See Records.

EskerPipeline

ABC binding a SOURCE to a SCHEMA via a transform. Implementing class declares 6 ClassVars. .run(output_dir) produces a RunResult and writes parquet + lineage. See Three-class form.

EskerSource

ABC for raw data origins. Yields Fetched envelopes via fetch_all() (and optionally per-id fetch(id)). Provides per-id caching via fetch_cached. Subclasses declare SOURCE_ID. See Sources.

entity_type

Lowercase-letters-only token identifying the kind of entity a record describes (corp, rocket, curve). Embedded in esker_id.

Fetched

A (raw, source_url, fetched_at) envelope yielded by sources. The thing pipelines transform.

fixture

A (raw_<label>.json, expected_<label>.json) pair under a fixture dir. The harness diffs pipeline.transform(raw).model_dump(mode="json") against expected. See Fixtures.
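
A rough sketch of the fixture check, assuming the fixture pair is two JSON files in one directory. check_fixture is a hypothetical helper, not the harness itself. It compares parsed JSON structurally, whereas the actual harness may compare serialized output more strictly.

```python
import json
from pathlib import Path
from typing import Any, Callable


def check_fixture(fixture_dir: Path, label: str, transform: Callable[[dict], Any]) -> bool:
    """Run transform on raw_<label>.json and compare to expected_<label>.json."""
    raw = json.loads((fixture_dir / f"raw_{label}.json").read_text(encoding="utf-8"))
    expected = json.loads(
        (fixture_dir / f"expected_{label}.json").read_text(encoding="utf-8")
    )
    # transform is expected to return a model exposing model_dump(mode="json").
    actual = transform(raw).model_dump(mode="json")
    return actual == expected
```

Passing the pipeline's transform as the third argument mirrors the pipeline.transform(raw).model_dump(mode="json") expression in the entry.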

hub

The commercial half of Esker (esker.so). The SDK talks to it over HTTP via esker.client.hub. The default hub address is localhost:3001.

jurisdiction

First dot-segment of DOMAIN_ID. Coded as 2+ lowercase letters. Used to construct esker_id. Conventional values: us, ca, global, un, eu.

key (or _KEY)

The field on the model holding the native identifier (e.g. "cik" for SEC, "rocket_id" for SpaceX). Used by pipeline.run() to extract native_id and build esker_id.

LineageBatch / LineageBundle

LineageBatch: a unique (source_url, fetched_at, lineage_id) triple. LineageBundle: the manifest_version + run_id + list of batches written as <domain>.lineage.json. See Lineage.

owner / OwnerHandle

A publishing namespace. GitHub-style handle (1–39 chars, lowercase a–z0–9 with non-doubled hyphens). Single global namespace shared by users and orgs. Reserved words blocked. See Handles.
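
The handle rules can be approximated with a regex plus a length check. This sketch assumes, following the GitHub convention, that a handle may not start or end with a hyphen; that rule is not stated in this entry. Reserved words are not modeled, and is_owner_handle is a hypothetical name.

```python
import re

# Alphanumeric runs joined by single hyphens: no doubled hyphens and,
# by assumption, no leading or trailing hyphen.
HANDLE_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_owner_handle(handle: str) -> bool:
    """True for 1-39 character lowercase handles with non-doubled hyphens."""
    return 1 <= len(handle) <= 39 and HANDLE_RE.fullmatch(handle) is not None
```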

pipeline (decorator)

@pipeline("<domain>@<semver>", entity_type=, key=, url=|source=) — the shorthand that synthesizes the three classes (Source, Schema, Pipeline) from a plain class with annotations and a transform classmethod. The 80% authoring path. See Pipelines.

published variant

The EskerModel subclass with esker_id, esker_source_url, esker_lineage_id fields added. Class name: Published<X>. Generated by cls.published() and process-cached.

REGISTRY

Process-global dict[str, type[EskerPipeline]] populated at import time by @register / @pipeline. Discovered via the esker.pipelines entry-point group.

RunResult

The dataclass returned by EskerPipeline.run(). Carries identity (run_id, paths), timing, hashes, and a manifest() builder.

SemVer

<major>.<minor>.<patch> only. No pre-releases, no build metadata. Pattern ^\d+\.\d+\.\d+$.
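
A small sketch of strict parsing under the pattern above. parse_semver is a hypothetical helper; re.ASCII restricts \d to 0-9, which is an assumption about intent. As written, the pattern accepts leading zeros.

```python
import re
from typing import Tuple

# Pattern copied from the SemVer entry; re.ASCII limits \d to 0-9.
SEMVER_RE = re.compile(r"^\d+\.\d+\.\d+$", re.ASCII)


def parse_semver(text: str) -> Tuple[int, int, int]:
    """Parse '<major>.<minor>.<patch>' into a tuple of ints."""
    if SEMVER_RE.fullmatch(text) is None:
        raise ValueError(f"not a <major>.<minor>.<patch> version: {text!r}")
    major, minor, patch = text.split(".")
    return int(major), int(minor), int(patch)
```

Returning a tuple of ints gives numeric ordering for free, so 1.10.0 sorts after 1.9.0, which plain string comparison would get wrong.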

schema_version

A ClassVar on EskerModel subclasses giving the current version of this record's shape. Distinct from pipeline_version (which versions the transform code). Both go on the manifest. See Manifests.

source_id

SOURCE.SOURCE_ID — string identifying the origin. The decorator overrides this to domain_id. Default for BulkJsonSource / BulkCsvSource is "bulk-json" / "bulk-csv" if not overridden.

source_digest

sha256(<url1>\x00<iso1>\n<url2>\x00<iso2>\n…): a fingerprint of the unique fetch set in a run. Recorded on the manifest. Stable across reruns iff URLs and timestamps are stable, which is almost never the case because fetched_at = datetime.now().
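
One way the digest formula could be computed. Several details here are assumptions rather than documented behavior: pairs are de-duplicated and sorted, each pair ends with a newline, timestamps are rendered with datetime.isoformat(), and the payload is UTF-8 encoded. source_digest_sketch is a hypothetical name.

```python
import hashlib
from datetime import datetime
from typing import Iterable, Tuple


def source_digest_sketch(fetches: Iterable[Tuple[str, datetime]]) -> str:
    """Hash the unique (source_url, fetched_at) pairs of a run."""
    # Assumption: pairs are de-duplicated and sorted for a stable order.
    unique = sorted({(url, fetched_at.isoformat()) for url, fetched_at in fetches})
    # Assumption: every pair is written as <url>\x00<iso>\n.
    payload = "".join(f"{url}\x00{iso}\n" for url, iso in unique)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because fetched_at feeds the hash, two runs over identical URLs still produce different digests unless the timestamps also match.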

supersedes

run_id of the previous publish at the same schema_version. Backlink for the run-chain within a single version. None for first publish or first publish at a new version.
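
The backlink rule can be sketched as a lookup over publish history. The history shape, a list of (run_id, schema_version) pairs ordered oldest first, and the helper name supersedes_for are both assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple


def supersedes_for(
    history: Sequence[Tuple[str, str]], schema_version: str
) -> Optional[str]:
    """Return the run_id of the latest earlier publish at schema_version."""
    # Walk newest-first and stop at the first publish with the same version.
    for run_id, version in reversed(history):
        if version == schema_version:
            return run_id
    # First publish overall, or first publish at a new version.
    return None
```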

transform

User code. transform(self|cls, raw: dict) -> EskerModel. Pure function; the fixture harness compares its output byte-for-byte. No clocks, no RNG, no network.

unbound

A bare name with no project or global binding. bindings.resolve raises UnboundDatasetError.
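
A minimal sketch of the lookup, assuming project bindings are consulted before global ones; the precedence is not stated in this glossary. resolve_sketch and the local UnboundDatasetError class are stand-ins, not the SDK's bindings.resolve, and "acme" is a made-up owner.

```python
from typing import Mapping


class UnboundDatasetError(LookupError):
    """Stand-in for the SDK exception of the same name."""


def resolve_sketch(
    name: str,
    project: Mapping[str, str],
    global_: Mapping[str, str],
) -> str:
    """Map a bare name to '<owner>/<name>' or raise if it is unbound."""
    # Assumption: project-scoped bindings win over global ones.
    for table in (project, global_):
        if name in table:
            return table[name]
    raise UnboundDatasetError(name)


project_bindings = {"us.sec.companies": "acme/us.sec.companies"}
global_bindings = {"ca.corporations.registry": "acme/ca.corporations.registry"}
```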
