# Manifests

> The metadata document that describes a published dataset. What's in it, why each field is there, and how to use it.

A manifest is a single JSON document that describes one published version of a dataset. Every release has exactly one. It carries the things you can't get from looking at the parquet alone — who published it, when, against which schema, and a hash that lets you verify the bytes you downloaded match what was published.

You'll encounter manifests in three places:

- The hub serves them at `<owner>/<name>/manifest.json` (or `@<version>/manifest.json` for a specific version).
- The SDK fetches them on every `esker.get()` and `esker pull` to know which parquet to download and what hash to verify against.
- The CLI prints a formatted view with `esker manifest <ref>`.

## Reading a manifest

```sh
$ esker manifest archie/us.treasury.yields
  archie/us.treasury.yields@2.0.0
  pipeline   2.0.0
  source     us.treasury.yields
  records    4,205

  ingested   2026-05-03 23:00 UTC
  published  2026-05-03 23:00 UTC

  by         archiewyles@gmail.com

  content    sha256:79eeb5cb20…3e
  digest     sha256:5a7d2c1f8b…ab
  lineage    sha256:5ef35cc434…da

  run        a7076934-be56-477b-a030-250e4492ec93
```

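The grouping reflects how you'd usually scan it: identity, then time, then who, then integrity, then the run ID for cross-referencing. The labels are shortened field names: `records` is `record_count`, `by` is `produced_by`, and `content`, `digest`, and `lineage` are `content_hash`, `source_digest`, and `lineage_hash`.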

For the raw JSON, add `--json` or hit the URL directly:

```sh
esker manifest archie/us.treasury.yields --json
curl https://esker.so/archie/us.treasury.yields/manifest.json
```
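If you're working from the raw JSON, the standard library is enough to read it. A minimal sketch, assuming the JSON keys match the field names in the Fields section; the sample document is hand-written and abridged for illustration, and `summarize` is a hypothetical helper rather than part of the SDK.

```python
import json

# Hand-written, abridged sample for illustration only; not real hub output.
RAW = """
{
  "manifest_version": "1.0",
  "owner": "archie",
  "domain_id": "us.treasury.yields",
  "schema_version": "2.0.0",
  "pipeline_version": "2.0.0",
  "record_count": 4205
}
"""


def summarize(raw: str) -> str:
    """Return a one-line summary of a manifest JSON document."""
    doc = json.loads(raw)
    # Refuse manifest versions this code was not written against.
    if doc["manifest_version"] != "1.0":
        raise ValueError(f"unsupported manifest_version: {doc['manifest_version']!r}")
    ref = f"{doc['owner']}/{doc['domain_id']}@{doc['schema_version']}"
    return f"{ref} ({doc['record_count']:,} records)"


print(summarize(RAW))  # archie/us.treasury.yields@2.0.0 (4,205 records)
```

Checking `manifest_version` first means an unexpected protocol change fails loudly instead of being misread.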

## Fields

```python
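from typing import Literal
from uuid import UUID

from pydantic import AwareDatetime, BaseModel

# OwnerHandle, Name, and SemVer are esker-defined types (imports omitted here).


class DatasetManifest(BaseModel):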
    manifest_version: Literal["1.0"] = "1.0"
    owner: OwnerHandle
    domain_id: Name
    schema_version: SemVer
    pipeline_version: SemVer
    run_id: UUID
    published_at: AwareDatetime
    ingested_at: AwareDatetime
    record_count: int
    content_hash: str
    source_id: str
    source_digest: str
    produced_by: str
    lineage_hash: str | None = None
    supersedes: UUID | None = None
```

| field              | meaning                                                        | when you'll care                                                                                |
| ------------------ | -------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- |
| `manifest_version` | Schema version of the manifest itself. Pinned to `"1.0"`.      | Almost never. Bumping it is a coordinated protocol change.                                      |
| `owner`            | The publishing handle.                                         | Display, routing.                                                                               |
| `domain_id`        | The dataset's bare name.                                       | Display, routing.                                                                               |
| `schema_version`   | The version of the record shape.                               | Picking which version to consume; reasoning about compatibility.                                |
| `pipeline_version` | The version of the transform code.                             | Investigating "did the data shape change or just the cleanup?" Often equal to `schema_version`. |
| `run_id`           | UUID of this specific publish.                                 | Cross-referencing logs, distinguishing two re-publishes of the same version.                    |
| `published_at`     | When the hub recorded the publish.                             | "How fresh is this?"                                                                            |
| `ingested_at`      | When the source was actually fetched.                          | "Is the underlying data stale even if the publish is recent?"                                   |
| `record_count`     | Row count in the parquet.                                      | Sanity checks; quick comparisons across releases.                                               |
| `content_hash`     | sha256 of the parquet file.                                    | Integrity. The SDK verifies your downloaded bytes against this on every read.                   |
| `source_id`        | String identifying the upstream source.                        | Investigating where the data came from.                                                         |
| `source_digest`    | Hash of the unique fetch set (URLs and fetch times).           | Detecting whether two runs made identical fetches.                                              |
| `produced_by`      | Email of the user who ran the publish.                         | Auditing.                                                                                       |
| `lineage_hash`     | sha256 of `lineage.json`.                                      | Integrity check on the lineage sidecar.                                                         |
| `supersedes`       | `run_id` of the previous publish at the same `schema_version`. | Walking the run-chain within a version.                                                         |
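The two timestamps answer different questions, so a freshness check usually wants both. A sketch, assuming timezone-aware datetimes as in the model above; `freshness`, its labels, and the three-day threshold are hypothetical choices for illustration, not SDK behavior.

```python
from datetime import datetime, timedelta, timezone


def freshness(published_at: datetime, ingested_at: datetime,
              now: datetime, max_age: timedelta) -> str:
    """Classify a manifest as 'fresh', 'stale-source', or 'stale-publish'."""
    if now - published_at > max_age:
        return "stale-publish"  # nothing has been published recently
    if now - ingested_at > max_age:
        return "stale-source"   # recent publish, but built from an old fetch
    return "fresh"


now = datetime(2026, 5, 10, tzinfo=timezone.utc)
status = freshness(
    published_at=datetime(2026, 5, 9, tzinfo=timezone.utc),
    ingested_at=datetime(2026, 5, 1, tzinfo=timezone.utc),
    now=now,
    max_age=timedelta(days=3),
)
print(status)  # stale-source
```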

## `schema_version` vs `pipeline_version`

Two versions on the same manifest. They mean different things:

- `schema_version` is a **promise to consumers** about the record shape. Bumping it triggers the [compatibility gate](https://esker.so/docs/protocol/compatibility.md).
- `pipeline_version` is a **note to yourself** about the transform code. A bug fix that changes record values but not the schema bumps `pipeline_version` while leaving `schema_version` alone.

Most pipelines keep them in sync. Diverge them when:

- You shipped a transform fix and want consumers to be able to tell "this version's data is different from last week's."
- You're publishing two pipelines under the same schema (uncommon but supported).

When using the [`@pipeline` decorator](https://esker.so/docs/sdk/pipelines.md), `pipeline_version` defaults to whatever `schema_version` is in the ref — pass `pipeline_version=` explicitly to diverge them.
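Read together, the two versions tell you what kind of change separates two manifests. A small sketch of that reasoning over plain dicts; `classify_change` and its return labels are hypothetical, not something the SDK provides.

```python
def classify_change(old: dict, new: dict) -> str:
    """Describe what differs between two manifests of the same dataset."""
    if old["schema_version"] != new["schema_version"]:
        return "schema"     # record shape changed: the compatibility gate applies
    if old["pipeline_version"] != new["pipeline_version"]:
        return "transform"  # same shape, different transform code
    return "republish"      # same versions, new run


last_week = {"schema_version": "2.0.0", "pipeline_version": "2.0.0"}
this_week = {"schema_version": "2.0.0", "pipeline_version": "2.0.1"}
print(classify_change(last_week, this_week))  # transform
```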

## `supersedes`

`supersedes` is the backlink from this run to the previous run at the same `schema_version`.

- First publish ever: `supersedes = None`.
- Re-publish v1.0.0 with new data: `supersedes = <previous run_id>`.
- First publish of v2.0.0 after v1.0.0: `supersedes = None` (different version, different chain).

So you can walk back in time within a version to see every re-publish, and the chain breaks at the version boundary.

This is useful when investigating "when did this row first appear" or "what was the data three publishes ago."
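Walking the chain is a linked-list traversal. A sketch, assuming you already have manifests indexed by `run_id` (how to look up a manifest by run ID on the hub isn't covered here); `walk_chain` is a hypothetical helper and the run IDs are generated for the example.

```python
from uuid import UUID, uuid4


def walk_chain(by_run_id: dict[UUID, dict], start: UUID) -> list[UUID]:
    """Follow `supersedes` backlinks from `start`, newest run first."""
    chain: list[UUID] = []
    current: UUID | None = start
    while current is not None:
        if current in chain:
            raise ValueError(f"cycle in supersedes chain at {current}")
        chain.append(current)
        current = by_run_id[current]["supersedes"]
    return chain


# Three publishes of 1.0.0, then the first publish of 2.0.0.
r1, r2, r3, r4 = (uuid4() for _ in range(4))
manifests = {
    r1: {"schema_version": "1.0.0", "supersedes": None},
    r2: {"schema_version": "1.0.0", "supersedes": r1},
    r3: {"schema_version": "1.0.0", "supersedes": r2},
    r4: {"schema_version": "2.0.0", "supersedes": None},
}

print(len(walk_chain(manifests, r3)))  # 3
print(len(walk_chain(manifests, r4)))  # 1
```

The second call returns a single run because the 2.0.0 chain starts fresh at the version boundary.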

## Hashes

Three hashes on the manifest. Every value is `sha256:<hex>`.

**`content_hash`**: sha256 of the entire parquet file. The SDK recomputes this on every `esker.get()` and `esker pull` to verify the bytes you have match the bytes that were published. A mismatch is fail-closed: the SDK raises rather than returning suspect data.

**`source_digest`**: sha256 of `<url1>\x00<iso1>\n<url2>\x00<iso2>\n…` for each unique fetch in this run. A _fingerprint_ of "which URLs did we hit and when," not a content hash. It's stable across reruns only if both URLs and timestamps are identical, which is rare since `fetched_at` defaults to `datetime.now()`. Useful as a debugging signal: two runs with the same `source_digest` definitely made the same fetches, but two runs with different digests may still have hit the same URLs at different times.

**`lineage_hash`**: sha256 of the `lineage.json` bytes. Lets a consumer verify, using only the manifest, that the lineage sidecar wasn't tampered with.

The CLI shows hashes in short form (`sha256:abcdef…1234`); the JSON is always full-length.
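Both file hashes can be recomputed with nothing but `hashlib`, and the `source_digest` layout is simple enough to sketch. Here `sha256_tag`, `verify`, and `source_digest` are hypothetical helpers, not SDK functions. The sort order, the trailing newline after the last entry, and the UTF-8 encoding are assumptions this page doesn't pin down, so the result isn't guaranteed to match a published digest.

```python
import hashlib


def sha256_tag(data: bytes) -> str:
    """Format a digest the way manifests store it: sha256:<hex>."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected: str) -> None:
    """Fail closed: raise if the bytes don't match the recorded hash."""
    actual = sha256_tag(data)
    if actual != expected:
        raise ValueError(f"hash mismatch: expected {expected}, got {actual}")


def source_digest(fetches: list[tuple[str, str]]) -> str:
    """Fingerprint (url, iso_timestamp) pairs as <url>\\x00<iso>\\n entries."""
    unique = sorted(set(fetches))  # assumed: deduplicated, in sorted order
    payload = "".join(f"{url}\x00{iso}\n" for url, iso in unique)
    return sha256_tag(payload.encode("utf-8"))  # assumed: UTF-8 encoding


blob = b"stand-in for parquet or lineage.json bytes"
verify(blob, sha256_tag(blob))  # matching bytes pass silently

# Same URL, different fetch time: the fingerprint changes.
a = source_digest([("https://example.com/a.csv", "2026-05-03T23:00:00+00:00")])
b = source_digest([("https://example.com/a.csv", "2026-05-04T23:00:00+00:00")])
print(a == b)  # False
```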

## Why two consecutive publishes of the same code give different `content_hash`

Worth flagging because it surprises people. Each row in the parquet carries a fresh `esker_lineage_id` UUID, minted at run time. So:

- Same source data + same code → different `esker_lineage_id` column → different parquet bytes → different `content_hash`.

The compatibility checker doesn't care, because it diffs schemas, not bytes. The `supersedes` chain handles re-publishes properly. But if you expected "identical inputs produce identical hashes," now you know why they don't.
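The effect is easy to reproduce without parquet at all. In this sketch, JSON bytes stand in for the parquet file and the rows are made up; only the `esker_lineage_id` column name comes from the text above.

```python
import hashlib
import json
import uuid

# Made-up sample rows, identical for both simulated runs.
ROWS = [
    {"date": "2026-05-01", "value": 1.0},
    {"date": "2026-05-02", "value": 2.0},
]


def publish_bytes(rows: list[dict]) -> bytes:
    """Serialize rows after stamping each with a fresh lineage UUID."""
    stamped = [{**row, "esker_lineage_id": str(uuid.uuid4())} for row in rows]
    return json.dumps(stamped, sort_keys=True).encode("utf-8")


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def digest_without_lineage(data: bytes) -> str:
    """Hash the same rows with the lineage column dropped."""
    rows = [
        {k: v for k, v in row.items() if k != "esker_lineage_id"}
        for row in json.loads(data)
    ]
    return digest(json.dumps(rows, sort_keys=True).encode("utf-8"))


first, second = publish_bytes(ROWS), publish_bytes(ROWS)
print(digest(first) == digest(second))                                  # False
print(digest_without_lineage(first) == digest_without_lineage(second))  # True
```

If you need to know whether two publishes carry the same data, compare them with the lineage column excluded rather than comparing `content_hash`.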

## Hub URLs

```
GET  /datasets/<owner>/<name>                          latest manifest
GET  /datasets/<owner>/<name>@<version>                manifest at version
POST /datasets/<owner>/<name>                          upload (manifest body carries the version)
```

Versionless reads always return the latest published manifest. Uploads POST to the unversioned URL — the manifest's own `schema_version` and `run_id` are how the hub addresses the new release.
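The read paths follow directly from the ref string. A sketch of that mapping; `manifest_path` is a hypothetical helper (the SDK has its own `DatasetRef` parser), and it returns only the path because the table above doesn't name a host.

```python
def manifest_path(ref: str) -> str:
    """Map 'owner/name' or 'owner/name@version' to the hub's GET path."""
    owner, _, rest = ref.partition("/")
    name, _, version = rest.partition("@")
    if not owner or not name:
        raise ValueError(f"expected 'owner/name[@version]', got {ref!r}")
    path = f"/datasets/{owner}/{name}"
    # No version means the latest published manifest.
    return f"{path}@{version}" if version else path


print(manifest_path("archie/us.treasury.yields"))        # /datasets/archie/us.treasury.yields
print(manifest_path("archie/us.treasury.yields@2.0.0"))  # /datasets/archie/us.treasury.yields@2.0.0
```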

## Programmatic use

```python
from esker import DatasetManifest
from esker.client import hub
from esker.schemas.ref import DatasetRef

ref = DatasetRef.parse("archie/us.treasury.yields@2.0.0")
manifest: DatasetManifest = hub.fetch_manifest(ref)

print(manifest.record_count)
print(manifest.content_hash)
```

`DatasetManifest` is a frozen Pydantic model — hashable, immutable, `extra="forbid"`. Useful when you want to cross-reference a downloaded parquet against its manifest in your own code.

## Cross-language use

A TypeScript consumer can use the mirrored type directly:

```ts
import type { DatasetManifest } from "@esker/types";

const r = await fetch(
  "https://esker.so/archie/us.treasury.yields/manifest.json",
);
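if (!r.ok) {
  throw new Error(`manifest fetch failed: ${r.status}`);
}
// r.json() is untyped, so this annotation is a compile-time claim, not a runtime check.
const manifest: DatasetManifest = await r.json();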
```

The hub's `@esker/types` package mirrors the Python `DatasetManifest` shape. Both must stay in lockstep — `manifest_version: "1.0"` is the protocol pin that tells you they do.

## See also

- [Compatibility](https://esker.so/docs/protocol/compatibility.md) — what `schema_version` bumps mean
- [Lineage](https://esker.so/docs/protocol/lineage.md) — the bundle the manifest references
- [Reading](https://esker.so/docs/sdk/reading.md) — how `esker.get()` consumes the manifest
- [Publishing](https://esker.so/docs/sdk/publishing.md) — the push flow that builds it
