Manifests

The metadata document that describes a published dataset. What's in it, why each field is there, and how to use it.

A manifest is a single JSON document that describes one published version of a dataset. Every release has exactly one. It carries the things you can't get from looking at the parquet alone — who published it, when, against which schema, and a hash that lets you verify the bytes you downloaded match what was published.

You'll encounter manifests in three places:

  • The hub serves them at <owner>/<name>/manifest.json (or <owner>/<name>@<version>/manifest.json for a specific version).
  • The SDK fetches them on every esker.get() and esker pull to know which parquet to download and what hash to verify against.
  • The CLI prints a formatted view with esker manifest <ref>.

Reading a manifest

$ esker manifest archie/us.treasury.yields
  archie/us.treasury.yields@2.0.0
  pipeline   2.0.0
  source     us.treasury.yields
  records    4,205

  ingested   2026-05-03 23:00 UTC
  published  2026-05-03 23:00 UTC

  by         archiewyles@gmail.com

  content    sha256:79eeb5cb20…3e
  digest     sha256:5a7d2c1f8b…ab
  lineage    sha256:5ef35cc434…da

  run        a7076934-be56-477b-a030-250e4492ec93

The grouping reflects how you'd usually scan it: identity, then time, then who, then integrity, then the run ID for cross-referencing.

For the raw JSON, add --json or hit the URL directly:

esker manifest archie/us.treasury.yields --json
curl https://esker.so/archie/us.treasury.yields/manifest.json

Fields

class DatasetManifest(BaseModel):
    manifest_version: Literal["1.0"] = "1.0"
    owner: OwnerHandle
    domain_id: Name
    schema_version: SemVer
    pipeline_version: SemVer
    run_id: UUID
    published_at: AwareDatetime
    ingested_at: AwareDatetime
    record_count: int
    content_hash: str
    source_id: str
    source_digest: str
    produced_by: str
    lineage_hash: str | None = None
    supersedes: UUID | None = None

| field | meaning | when you'll care |
|---|---|---|
| manifest_version | Schema version of the manifest itself. Pinned to "1.0". | Almost never. Bumping it is a coordinated protocol change. |
| owner | The publishing handle. | Display, routing. |
| domain_id | The dataset's bare name. | Display, routing. |
| schema_version | The version of the record shape. | Picking which version to consume; reasoning about compatibility. |
| pipeline_version | The version of the transform code. | Investigating "did the data shape change or just the cleanup?" Often equal to schema_version. |
| run_id | UUID of this specific publish. | Cross-referencing logs, distinguishing two re-publishes of the same version. |
| published_at | When the hub recorded the publish. | "How fresh is this?" |
| ingested_at | When the source was actually fetched. | "Is the underlying data stale even if the publish is recent?" |
| record_count | Row count in the parquet. | Sanity checks; quick comparisons across releases. |
| content_hash | sha256 of the parquet file. | Integrity. The SDK verifies your downloaded bytes against this on every read. |
| source_id | String identifying the upstream source. | Investigating where the data came from. |
| source_digest | Hash of the unique fetch set. | Detecting whether two runs hit the same upstream URLs. |
| produced_by | Email of the user who ran the publish. | Auditing. |
| lineage_hash | sha256 of lineage.json. | Integrity check on the lineage sidecar. |
| supersedes | run_id of the previous publish at the same schema_version. | Walking the run-chain within a version. |
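
The two timestamps answer different freshness questions, and it can help to compute both ages side by side. A minimal sketch, assuming the raw JSON carries published_at and ingested_at as ISO 8601 strings (not confirmed by this page); parse_ts, freshness, and the sample values are illustrative, not part of the SDK.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(value: str) -> datetime:
    # Treat a trailing "Z" as UTC; older Pythons' fromisoformat() rejects it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def freshness(manifest: dict, now: datetime) -> dict:
    """Return how old the publish and the underlying fetch are."""
    published = parse_ts(manifest["published_at"])
    ingested = parse_ts(manifest["ingested_at"])
    return {
        "publish_age": now - published,
        "ingest_age": now - ingested,
        # Positive lag: the source was fetched some time before the publish.
        "ingest_lag": published - ingested,
    }


# Hypothetical values, for illustration only.
sample = {
    "published_at": "2026-05-03T23:00:00Z",
    "ingested_at": "2026-05-01T23:00:00+00:00",
}
report = freshness(sample, now=datetime(2026, 5, 4, 23, 0, tzinfo=timezone.utc))
print(report["publish_age"], report["ingest_age"], report["ingest_lag"])
```

A large ingest_lag is the "recent publish, stale data" case that the ingested_at row describes.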

schema_version vs pipeline_version

Two versions on the same manifest. They mean different things:

  • schema_version is a promise to consumers about the record shape. Bumping it triggers the compatibility gate.
  • pipeline_version is a note to yourself about the transform code. A bug fix that changes record values but not the schema bumps pipeline_version while leaving schema_version alone.

Most pipelines keep them in sync. Diverge them when:

  • You shipped a transform fix and want consumers to be able to tell "this version's data is different from last week's."
  • You're publishing two pipelines under the same schema (uncommon but supported).

When using the @pipeline decorator, pipeline_version defaults to whatever schema_version is in the ref — pass pipeline_version= explicitly to diverge them.
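
On the consumer side, the split lets you tell apart the kinds of change between two manifests. A rough sketch that works on plain dicts parsed from the manifest JSON; describe_change and the version numbers are hypothetical, not part of the SDK.

```python
def describe_change(old: dict, new: dict) -> str:
    """Classify what moved between two manifests of the same dataset."""
    if old["schema_version"] != new["schema_version"]:
        # The promise to consumers changed; the compatibility gate applies.
        return "record shape changed"
    if old["pipeline_version"] != new["pipeline_version"]:
        # Same shape, different transform code: values may differ.
        return "transform changed, record shape unchanged"
    return "re-publish of the same code"


# Made-up version numbers for illustration.
last_week = {"schema_version": "2.0.0", "pipeline_version": "2.0.0"}
this_week = {"schema_version": "2.0.0", "pipeline_version": "2.0.1"}
print(describe_change(last_week, this_week))
```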

supersedes

supersedes is the backlink from this run to the previous run at the same schema_version.

  • First publish ever: supersedes = None.
  • Re-publish v1.0.0 with new data: supersedes = <previous run_id>.
  • First publish of v2.0.0 after v1.0.0: supersedes = None (different version, different chain).

So you can walk back in time within a version to see every re-publish, and the chain breaks at the version boundary.

This is useful when investigating "when did this row first appear" or "what was the data three publishes ago."
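
Walking the chain is a linked-list traversal over supersedes. The sketch below assumes you have some way to look up a manifest by run_id; this page does not say whether the hub exposes one, so fetch_by_run_id is a hypothetical callable and the run IDs are placeholders.

```python
from typing import Callable, Iterator, Optional


def walk_chain(start: dict, fetch_by_run_id: Callable[[str], dict]) -> Iterator[dict]:
    """Yield manifests from `start` back to the first publish at this version."""
    seen = set()
    current: Optional[dict] = start
    while current is not None:
        if current["run_id"] in seen:  # guard against a malformed, cyclic chain
            raise ValueError("cycle in supersedes chain")
        seen.add(current["run_id"])
        yield current
        prev = current["supersedes"]
        # supersedes = None marks the first publish at this schema_version.
        current = fetch_by_run_id(prev) if prev is not None else None


# In-memory stand-in for the lookup, with placeholder run IDs.
runs = {
    "run-3": {"run_id": "run-3", "supersedes": "run-2"},
    "run-2": {"run_id": "run-2", "supersedes": "run-1"},
    "run-1": {"run_id": "run-1", "supersedes": None},
}
history = [m["run_id"] for m in walk_chain(runs["run-3"], runs.__getitem__)]
print(history)
```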

Hashes

Three hashes on the manifest. Every value is sha256:<hex>.

content_hash — sha256 of the entire parquet file. The SDK recomputes this on every esker.get() and esker pull to verify the bytes you have match the bytes that were published. A mismatch is fail-closed: the SDK raises rather than returning suspect data.
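
If you hold a parquet file outside the SDK, you can run the same kind of check yourself. This is a sketch of the idea using only the standard library, not the SDK's actual implementation; sha256_of_file and verify_content are illustrative names, and the temporary file stands in for a real download.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large parquets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def verify_content(path: Path, expected: str) -> None:
    """Fail closed: raise instead of returning when the hash does not match."""
    actual = sha256_of_file(path)
    if actual != expected:
        raise ValueError(f"content_hash mismatch: expected {expected}, got {actual}")


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "data.parquet"
    sample.write_bytes(b"stand-in bytes, not real parquet")
    expected = "sha256:" + hashlib.sha256(b"stand-in bytes, not real parquet").hexdigest()
    verify_content(sample, expected)  # matches, so nothing is raised
    try:
        verify_content(sample, "sha256:" + "0" * 64)
        tampered_detected = False
    except ValueError:
        tampered_detected = True
```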

source_digest — sha256 of <url1>\x00<iso1>\n<url2>\x00<iso2>\n… for each unique fetch in this run. A fingerprint of "which URLs did we hit and when," not a content hash. It's stable across reruns only if both the URLs and the timestamps are identical, which is rare since fetched_at defaults to datetime.now(). Useful as a debugging signal in one direction: two runs with the same source_digest definitely hit the same upstream URLs at the same times. Two runs with different digests may still have hit the same URLs, because a differing timestamp alone changes the digest.
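
To make the timestamp sensitivity concrete, here is a sketch that builds a digest in the documented line format. Several details are assumptions the format string does not settle: UTF-8 encoding, sorting the lines, a trailing newline after the last entry, and using datetime.isoformat() for the ISO part. The output is therefore not guaranteed to match what the SDK produces; the URL is made up.

```python
import hashlib
from datetime import datetime, timezone


def source_digest(fetches: list[tuple[str, datetime]]) -> str:
    # De-duplicate, then sort so the result does not depend on fetch order.
    # Both choices are assumptions, not confirmed SDK behaviour.
    unique = sorted({(url, ts.isoformat()) for url, ts in fetches})
    payload = "".join(f"{url}\x00{iso}\n" for url, iso in unique)
    return "sha256:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


url = "https://example.com/yields.csv"  # hypothetical upstream URL
t1 = datetime(2026, 5, 3, 23, 0, 0, tzinfo=timezone.utc)
t2 = datetime(2026, 5, 3, 23, 0, 1, tzinfo=timezone.utc)  # one second later

# A repeated fetch collapses into one entry, so the digest is unchanged.
same = source_digest([(url, t1)]) == source_digest([(url, t1), (url, t1)])
# The same URL fetched one second later yields a different digest.
differs = source_digest([(url, t1)]) != source_digest([(url, t2)])
print(same, differs)
```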

lineage_hash — sha256 of the lineage.json bytes. Lets a consumer verify the lineage sidecar wasn't tampered with using only the manifest.

The CLI shows hashes in short form (sha256:abcdef…1234); the JSON is always full-length.

Why two consecutive publishes of the same code give a different content_hash

Worth flagging because it surprises people. Each row in the parquet carries a fresh esker_lineage_id UUID, minted at run time. So:

  • Same source data + same code → different esker_lineage_id column → different parquet bytes → different content_hash.

The compatibility checker doesn't care, because it diffs schemas, not bytes. Re-publishes are linked through the supersedes chain, not by matching hashes. But if you expected "identical inputs produce identical hashes," now you know why they don't.
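
The effect is easy to reproduce without parquet. In the sketch below, JSON bytes stand in for the parquet file (an assumption made to keep the example dependency-free), and the rows are made up; only the esker_lineage_id column name comes from the text above.

```python
import hashlib
import json
import uuid


def publish_bytes(rows: list[dict], with_lineage_id: bool = True) -> bytes:
    """Serialize rows, optionally stamping each with a fresh per-run UUID."""
    out = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are left untouched
        if with_lineage_id:
            row["esker_lineage_id"] = str(uuid.uuid4())  # minted at run time
        out.append(row)
    return json.dumps(out, sort_keys=True).encode("utf-8")


def digest(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()


rows = [{"date": "2026-05-01", "tenor": "10Y", "yield_pct": 4.2}]  # made-up data

run_a = digest(publish_bytes(rows))
run_b = digest(publish_bytes(rows))
stable_a = digest(publish_bytes(rows, with_lineage_id=False))
stable_b = digest(publish_bytes(rows, with_lineage_id=False))
print(run_a != run_b, stable_a == stable_b)
```

Without the per-row ID the two serializations hash identically, which isolates the UUID column as the source of the difference in this toy setup.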

Hub URLs

GET  /datasets/<owner>/<name>                          latest manifest
GET  /datasets/<owner>/<name>@<version>                manifest at version
POST /datasets/<owner>/<name>                          upload (manifest body carries the version)

Versionless reads always return the latest published manifest. Uploads POST to the unversioned URL — the manifest's own schema_version and run_id are how the hub addresses the new release.
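
If you build these read paths by hand, a small helper keeps the versioned and versionless forms consistent. A sketch only: manifest_path is a hypothetical name, and percent-encoding each segment is a defensive assumption rather than a documented requirement of the hub.

```python
from typing import Optional
from urllib.parse import quote


def manifest_path(owner: str, name: str, version: Optional[str] = None) -> str:
    """Build the GET path for the latest manifest or for a pinned version."""
    path = f"/datasets/{quote(owner, safe='')}/{quote(name, safe='')}"
    if version is not None:
        path += f"@{quote(version, safe='')}"
    return path


latest = manifest_path("archie", "us.treasury.yields")
pinned = manifest_path("archie", "us.treasury.yields", "2.0.0")
print(latest)
print(pinned)
```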

Programmatic use

from esker import DatasetManifest
from esker.client import hub
from esker.schemas.ref import DatasetRef

ref = DatasetRef.parse("archie/us.treasury.yields@2.0.0")
manifest: DatasetManifest = hub.fetch_manifest(ref)

print(manifest.record_count)
print(manifest.content_hash)

DatasetManifest is a frozen Pydantic model — hashable, immutable, extra="forbid". Useful when you want to cross-reference a downloaded parquet against its manifest in your own code.

Cross-language use

A TypeScript consumer can use the mirrored type directly:

import type { DatasetManifest } from "@esker/types";

const r = await fetch(
  "https://esker.so/archie/us.treasury.yields/manifest.json",
);
const manifest: DatasetManifest = await r.json();

The hub's @esker/types package mirrors the Python DatasetManifest shape. Both must stay in lockstep — manifest_version: "1.0" is the protocol pin that tells you they do.

See also

  • Compatibility — what schema_version bumps mean
  • Lineage — the bundle the manifest references
  • Reading — how esker.get() consumes the manifest
  • Publishing — the push flow that builds it