Manifests
The metadata document that describes a published dataset. What's in it, why each field is there, and how to use it.
A manifest is a single JSON document that describes one published version of a dataset. Every release has exactly one. It carries the things you can't get from looking at the parquet alone — who published it, when, against which schema, and a hash that lets you verify the bytes you downloaded match what was published.
You'll encounter manifests in three places:
- The hub serves them at <owner>/<name>/manifest.json (or @<version>/manifest.json for a specific version).
- The SDK fetches them on every esker.get() and esker pull to know which parquet to download and what hash to verify against.
- The CLI prints a formatted view with esker manifest <ref>.
Reading a manifest
$ esker manifest archie/us.treasury.yields
archie/us.treasury.yields@2.0.0
pipeline 2.0.0
source us.treasury.yields
records 4,205
ingested 2026-05-03 23:00 UTC
published 2026-05-03 23:00 UTC
by archiewyles@gmail.com
content sha256:79eeb5cb20…3e
digest sha256:5a7d2c1f8b…ab
lineage sha256:5ef35cc434…da
run a7076934-be56-477b-a030-250e4492ec93
The grouping reflects how you'd usually scan it: identity, then time, then who, then integrity, then the run ID for cross-referencing.
For the raw JSON, add --json or hit the URL directly:
esker manifest archie/us.treasury.yields --json
curl https://esker.so/archie/us.treasury.yields/manifest.json
Fields
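from typing import Literal
from uuid import UUID

from pydantic import AwareDatetime, BaseModel

# OwnerHandle, Name, and SemVer are not imported here; they are assumed to be esker-defined types.
class DatasetManifest(BaseModel):
    manifest_version: Literal["1.0"] = "1.0"
    owner: OwnerHandle
    domain_id: Name
    schema_version: SemVer
    pipeline_version: SemVer
    run_id: UUID
    published_at: AwareDatetime
    ingested_at: AwareDatetime
    record_count: int
    content_hash: str
    source_id: str
    source_digest: str
    produced_by: str
    lineage_hash: str | None = None
    supersedes: UUID | None = None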
| field | meaning | when you'll care |
|---|---|---|
| manifest_version | Schema version of the manifest itself. Pinned to "1.0". | Almost never. Bumping it is a coordinated protocol change. |
| owner | The publishing handle. | Display, routing. |
| domain_id | The dataset's bare name. | Display, routing. |
| schema_version | The version of the record shape. | Picking which version to consume; reasoning about compatibility. |
| pipeline_version | The version of the transform code. | Investigating "did the data shape change or just the cleanup?" Often equal to schema_version. |
| run_id | UUID of this specific publish. | Cross-referencing logs, distinguishing two re-publishes of the same version. |
| published_at | When the hub recorded the publish. | "How fresh is this?" |
| ingested_at | When the source was actually fetched. | "Is the underlying data stale even if the publish is recent?" |
| record_count | Row count in the parquet. | Sanity checks; quick comparisons across releases. |
| content_hash | sha256 of the parquet file. | Integrity. The SDK verifies your downloaded bytes against this on every read. |
| source_id | String identifying the upstream source. | Investigating where the data came from. |
| source_digest | Hash of the unique fetch set. | Detecting whether two runs hit the same upstream URLs. |
| produced_by | Email of the user who ran the publish. | Auditing. |
| lineage_hash | sha256 of lineage.json. | Integrity check on the lineage sidecar. |
| supersedes | run_id of the previous publish at the same schema_version. | Walking the run-chain within a version. |
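The published_at and ingested_at pair answers two different freshness questions, and it can help to compute both. Here is a minimal sketch using only the standard library. It assumes the raw JSON serializes both timestamps as ISO 8601 strings with an explicit UTC offset (the model only tells us they are timezone-aware), and the sample values are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def staleness(manifest: dict, now: datetime) -> tuple[timedelta, timedelta]:
    """Return (age of the publish, age of the underlying fetch) relative to now."""
    published = datetime.fromisoformat(manifest["published_at"])
    ingested = datetime.fromisoformat(manifest["ingested_at"])
    return now - published, now - ingested


# Illustrative values only; the timestamp string format is an assumption.
sample = {
    "published_at": "2026-05-03T23:00:00+00:00",
    "ingested_at": "2026-05-01T23:00:00+00:00",
}
now = datetime(2026, 5, 4, 23, 0, tzinfo=timezone.utc)
publish_age, data_age = staleness(sample, now)
print(publish_age, data_age)
```

A publish that is a day old can still sit on data fetched three days ago, which is the case ingested_at exists to expose.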
schema_version vs pipeline_version
Two versions on the same manifest. They mean different things:
- schema_version is a promise to consumers about the record shape. Bumping it triggers the compatibility gate.
- pipeline_version is a note to yourself about the transform code. A bug fix that changes record values but not the schema bumps pipeline_version while leaving schema_version alone.
Most pipelines keep them in sync. Diverge them when:
- You shipped a transform fix and want consumers to be able to tell "this version's data is different from last week's."
- You're publishing two pipelines under the same schema (uncommon but supported).
When using the @pipeline decorator, pipeline_version defaults to whatever schema_version is in the ref — pass pipeline_version= explicitly to diverge them.
supersedes
supersedes is the backlink from this run to the previous run at the same schema_version.
- First publish ever: supersedes = None.
- Re-publish v1.0.0 with new data: supersedes = <previous run_id>.
- First publish of v2.0.0 after v1.0.0: supersedes = None (different version, different chain).
So you can walk back in time within a version to see every re-publish, and the chain breaks at the version boundary.
This is useful when investigating "when did this row first appear" or "what was the data three publishes ago."
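Walking the chain is a plain linked-list traversal over supersedes. The sketch below assumes you have already collected the manifests you care about into a dict keyed by run_id. How you gather them is up to you, and the run IDs here are hypothetical placeholders rather than real UUIDs.

```python
from typing import Optional


def walk_supersedes(run_id: str, manifests: dict[str, dict]) -> list[str]:
    """Follow supersedes backlinks from run_id back to the first publish of its version."""
    chain: list[str] = []
    seen: set[str] = set()
    current: Optional[str] = run_id
    while current is not None:
        if current in seen:
            # Defensive check: a well-formed chain never loops.
            raise ValueError(f"cycle in supersedes chain at {current}")
        seen.add(current)
        chain.append(current)
        current = manifests[current]["supersedes"]
    return chain


# Hypothetical runs: run-a and run-b are v1.0.0, run-c is the first v2.0.0 publish.
manifests = {
    "run-a": {"schema_version": "1.0.0", "supersedes": None},
    "run-b": {"schema_version": "1.0.0", "supersedes": "run-a"},
    "run-c": {"schema_version": "2.0.0", "supersedes": None},
}
chain_v1 = walk_supersedes("run-b", manifests)  # newest first
chain_v2 = walk_supersedes("run-c", manifests)  # stops at the version boundary
```

The v2.0.0 chain contains only its own run, because supersedes never crosses a schema_version boundary.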
Hashes
Three hashes on the manifest. Every value is sha256:<hex>.
content_hash — sha256 of the entire parquet file. The SDK recomputes this on every esker.get() and esker pull to verify the bytes you have match the bytes that were published. A mismatch is fail-closed: the SDK raises rather than returning suspect data.
source_digest is the sha256 of <url1>\x00<iso1>\n<url2>\x00<iso2>\n… for each unique fetch in this run. It is a fingerprint of "which URLs did we hit and when," not a content hash. It is stable across reruns only if URLs and timestamps are identical, which is rare since fetched_at defaults to datetime.now(). That makes it a one-directional debugging signal: two runs with the same source_digest definitely hit the same upstream URLs at the same times, while two runs with different digests differ in at least one URL or timestamp and may still have fetched identical URLs.
lineage_hash — sha256 of the lineage.json bytes. Lets a consumer verify the lineage sidecar wasn't tampered with using only the manifest.
The CLI shows hashes in short form (sha256:abcdef…1234); the JSON is always full-length.
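To make the source_digest layout concrete, here is a rough sketch of how such a digest could be assembled. Several details are assumptions: UTF-8 encoding, sorting the unique fetches, and a trailing newline after every entry. The real ordering and timestamp formatting are not specified here, so don't expect this to reproduce a published digest byte for byte.

```python
import hashlib


def source_digest(fetches: list[tuple[str, str]]) -> str:
    """Hash one 'url, NUL byte, ISO timestamp, newline' entry per unique fetch."""
    unique = sorted(set(fetches))  # assumption: deduplicate, then sort for determinism
    payload = "".join(f"{url}\x00{iso}\n" for url, iso in unique)
    return "sha256:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical fetch records: same URL, fetched a day apart.
run_1 = [("https://example.com/yields.csv", "2026-05-03T23:00:00+00:00")]
run_2 = [("https://example.com/yields.csv", "2026-05-04T23:00:00+00:00")]

same = source_digest(run_1) == source_digest(run_1 + run_1)  # duplicates collapse
different = source_digest(run_1) != source_digest(run_2)     # timestamp alone changes it
```

The second comparison shows why the digest rarely repeats: the URL is identical, but the fetch time is part of the hashed payload.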
Why two consecutive publishes of the same code give different content_hash
Worth flagging because it surprises people. Each row in the parquet carries a fresh esker_lineage_id UUID, minted at run time. So:
- Same source data + same code → different esker_lineage_id column → different parquet bytes → different content_hash.
The compatibility checker doesn't care — it diffs schemas, not bytes. The supersedes chain handles re-publishes properly. But if you expected "identical inputs produce identical hashes," now you know why they don't.
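The effect is easy to reproduce without any esker code. The sketch below is a toy model, not the real publish path: it uses JSON bytes as a stand-in for parquet and a made-up row, but the mechanism is the same, since a fresh UUID per row changes the serialized bytes and therefore the hash.

```python
import hashlib
import json
import uuid


def publish_hash(rows: list[dict]) -> str:
    """Stamp each row with a fresh lineage ID, serialize, and hash the bytes."""
    stamped = [{**row, "esker_lineage_id": str(uuid.uuid4())} for row in rows]
    # JSON stands in for parquet here; only the byte-level effect matters.
    blob = json.dumps(stamped, sort_keys=True).encode("utf-8")
    return "sha256:" + hashlib.sha256(blob).hexdigest()


rows = [{"date": "2026-05-01", "value": 4.25}]  # made-up row for illustration
first = publish_hash(rows)
second = publish_hash(rows)
print(first == second)
```

If you need to know whether two releases carry the same data, compare the rows with the esker_lineage_id column excluded rather than comparing content_hash.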
Hub URLs
GET /datasets/<owner>/<name> latest manifest
GET /datasets/<owner>/<name>@<version> manifest at version
POST /datasets/<owner>/<name> upload (manifest body carries the version)
Versionless reads always return the latest published manifest. Uploads POST to the unversioned URL — the manifest's own schema_version and run_id are how the hub addresses the new release.
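If you build read URLs by hand, the mapping from a ref to a path is mechanical. This is a small sketch that follows the two GET routes listed above. The helper name is hypothetical, it is not the SDK's own ref parser, and it returns only the path, leaving the host to you.

```python
def manifest_path(ref: str) -> str:
    """Turn 'owner/name' or 'owner/name@version' into the hub's manifest read path."""
    name_part, _, version = ref.partition("@")
    owner, _, name = name_part.partition("/")
    if not owner or not name:
        raise ValueError(f"expected <owner>/<name>[@<version>], got {ref!r}")
    path = f"/datasets/{owner}/{name}"
    # A versionless ref maps to the route that returns the latest manifest.
    return f"{path}@{version}" if version else path


latest = manifest_path("archie/us.treasury.yields")
pinned = manifest_path("archie/us.treasury.yields@2.0.0")
```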
Programmatic use
from esker import DatasetManifest
from esker.client import hub
from esker.schemas.ref import DatasetRef
ref = DatasetRef.parse("archie/us.treasury.yields@2.0.0")
manifest: DatasetManifest = hub.fetch_manifest(ref)
print(manifest.record_count)
print(manifest.content_hash)
DatasetManifest is a frozen Pydantic model — hashable, immutable, extra="forbid". Useful when you want to cross-reference a downloaded parquet against its manifest in your own code.
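For a parquet file you hold outside the SDK, the cross-check comes down to recomputing the sha256 and comparing it with manifest.content_hash. Here is a minimal standard-library sketch. The function names and the choice of ValueError are illustrative assumptions, not the SDK's actual behavior, and a temporary file stands in for a real download.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through sha256 and return it in sha256:<hex> form."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def verify(path: Path, content_hash: str) -> None:
    """Fail closed: raise if the file's bytes do not match the expected hash."""
    actual = file_sha256(path)
    if actual != content_hash:
        raise ValueError(f"content hash mismatch: expected {content_hash}, got {actual}")


# Demo with a temporary file standing in for a downloaded parquet.
with tempfile.TemporaryDirectory() as tmp:
    parquet = Path(tmp) / "data.parquet"
    parquet.write_bytes(b"example bytes")
    expected = file_sha256(parquet)
    verify(parquet, expected)  # matching hash: returns without raising
    try:
        verify(parquet, "sha256:" + "0" * 64)
        mismatch_raised = False
    except ValueError:
        mismatch_raised = True
```

In real use you would pass manifest.content_hash as the second argument. Reading in chunks keeps memory flat for large files.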
Cross-language use
A TypeScript consumer can use the mirrored type directly:
import type { DatasetManifest } from "@esker/types";
const r = await fetch(
  "https://esker.so/archie/us.treasury.yields/manifest.json",
);
const manifest: DatasetManifest = await r.json();
The hub's @esker/types package mirrors the Python DatasetManifest shape. Both must stay in lockstep — manifest_version: "1.0" is the protocol pin that tells you they do.
See also
- Compatibility: what schema_version bumps mean
- Lineage: the bundle the manifest references
- Reading: how esker.get() consumes the manifest
- Publishing: the push flow that builds it