Overview
What Esker is, what it does, and the three nouns to hold in your head.
Esker is trying to become the default publish target for normalized public data. GitHub for code; Hugging Face for models; Esker for data.
This is the documentation for the protocol, the Python SDK, the CLI, and the hub. The SDK is what `pip install esker` gives you; it is everything you need to author, publish, and consume datasets from a script, a notebook, a Lambda, or a long-running process. The hub at esker.so is where published datasets live.
What it does
Three jobs, in the order you'll meet them.
Authoring. You write a Python class describing one record of your dataset and a transform function turning one raw row from the source into one of those records. The framework handles schema emission, parquet writing, lineage capture, and manifest construction.
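As a rough illustration of the authoring shape, here is a minimal sketch of a record class and a per-row transform. It uses a stdlib frozen dataclass as a stand-in for the SDK's base class, so nothing here is the Esker API itself. The `name` field, the raw column names, and the zero-padding of `cik` are assumptions made for the example.

```python
from dataclasses import dataclass


# Stand-in for the SDK's record base class (hypothetical; plain stdlib here).
@dataclass(frozen=True)
class Company:
    cik: str
    ticker: str
    name: str  # illustrative field, not taken from a real schema


def transform(raw: dict) -> Company:
    """Turn one raw source row into one record. Pure: no I/O, no globals."""
    return Company(
        cik=str(raw["CIK"]).zfill(10),         # assumed raw column name
        ticker=raw["Ticker"].strip().upper(),  # assumed raw column name
        name=raw["Name"].strip(),              # assumed raw column name
    )


record = transform({"CIK": 320193, "Ticker": " aapl ", "Name": "Apple Inc. "})
print(record)
```

Because the function depends only on its input row, it can be unit-tested with plain dictionaries before any pipeline runs.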
Publishing. `esker push` runs the pipeline, gates the schema diff against the last published version, and uploads six artifacts to the hub: parquet, JSON Schema, Arrow IPC schema, TypeScript interface, lineage bundle, manifest. From that moment your dataset is addressable as `<owner>/<name>@<version>`.
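To make the idea of gating a schema diff concrete, the sketch below classifies a change between two field-to-type maps and checks that the next version bumps far enough. The exact rules Esker applies are not spelled out here. The sketch assumes a common SemVer reading: removing or retyping a field is major, adding a field is minor, anything else is patch. `required_bump` and `gate` are hypothetical helper names.

```python
def required_bump(old: dict[str, str], new: dict[str, str]) -> str:
    """Classify a schema change (assumed rules, not Esker's documented ones)."""
    removed = old.keys() - new.keys()
    retyped = {k for k in old.keys() & new.keys() if old[k] != new[k]}
    added = new.keys() - old.keys()
    if removed or retyped:
        return "major"
    if added:
        return "minor"
    return "patch"


def gate(last_version: str, next_version: str,
         old: dict[str, str], new: dict[str, str]) -> bool:
    """True when next_version bumps at least as much as the diff requires."""
    last = tuple(int(p) for p in last_version.split("."))
    nxt = tuple(int(p) for p in next_version.split("."))
    need = required_bump(old, new)
    if need == "major":
        return nxt[0] > last[0]
    if need == "minor":
        return nxt[:2] > last[:2]
    return nxt > last


old = {"cik": "str", "ticker": "str"}
new = {"cik": "str", "ticker": "str", "name": "str"}
print(required_bump(old, new), gate("1.2.0", "1.3.0", old, new))
```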
Consuming. Other code calls `esker.get("you/your-dataset")` and receives a polars DataFrame. The cache is content-hash verified on every read. A lockfile pins exact versions so the read is reproducible.
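The content-hash check can be pictured with a small stdlib sketch: hash the cached file, compare against the pinned digest, and refuse to read on mismatch. The choice of SHA-256, the shape of the lock entry, and the helper names are assumptions; the real cache and lockfile formats may differ.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large files are not loaded at once."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verified_read(path: Path, expected_sha256: str) -> bytes:
    """Return the file's bytes only if its hash matches the pinned digest."""
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise ValueError(f"cache corrupt: {path} hashes to {actual}")
    return path.read_bytes()


# A temporary file stands in for a cached data.parquet.
with tempfile.TemporaryDirectory() as tmp:
    cached = Path(tmp) / "data.parquet"
    cached.write_bytes(b"not really parquet")
    # Hypothetical lock entry: an exact version plus the content hash.
    pinned = {"version": "2.0.0", "sha256": sha256_of(cached)}
    data = verified_read(cached, pinned["sha256"])

    cached.write_bytes(b"tampered")  # simulate corruption after pinning
    try:
        verified_read(cached, pinned["sha256"])
        tamper_detected = False
    except ValueError:
        tamper_detected = True
```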
The model
Three nouns to hold in your head.
| noun | what | example |
|---|---|---|
| dataset | a published thing on the hub | archie/us.sec.companies@2.0.0 |
| schema | the record shape contract | {cik: str, ticker: str, ...} |
| entity | the real-world thing a record is about | a corporation, a rocket, a yield curve point |
A dataset has rows; each row describes (or is) an entity; the rows conform to a schema. Schemas evolve through SemVer; entities have stable IDs across datasets; datasets get re-published with new manifests over time.
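A dataset reference such as the one in the table splits into owner, name, and an optional version. The parser below is a minimal sketch of that split, assuming the `<owner>/<name>@<version>` form with the version omitted when a bare name is used. `DatasetRef` and `parse_ref` are hypothetical names, not SDK functions.

```python
from typing import NamedTuple, Optional


class DatasetRef(NamedTuple):
    owner: str
    name: str
    version: Optional[str]  # None when the reference carries no @<version>


def parse_ref(ref: str) -> DatasetRef:
    """Split '<owner>/<name>[@<version>]' into its parts."""
    path, _, version = ref.partition("@")
    owner, sep, name = path.partition("/")
    if not sep or not owner or not name:
        raise ValueError(f"expected <owner>/<name>[@<version>], got {ref!r}")
    return DatasetRef(owner, name, version or None)


print(parse_ref("archie/us.sec.companies@2.0.0"))
print(parse_ref("you/your-dataset"))
```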
How the pieces fit
    your project
    ├── pyproject.toml              entry-point group `esker.pipelines`
    ├── esker.lock                  pinned versions of consumed datasets
    └── src/your_pipelines/
        └── your_dataset.py         @pipeline + transform
                │
                │  esker push your.domain
                ▼
    esker.so/<owner>/<name>         data.parquet, schema.json, schema.arrow,
                                    schema.d.ts, lineage.json, manifest.json
                │
                │  esker.get("them/their-dataset")
                ▼
    ~/.esker/cache/<owner>/<name>/<version>/data.parquet
                │
                ▼
    polars.DataFrame
The SDK has no first-party knowledge of any specific data source. Pipelines live in your project: `pip install esker` gives you the abstractions, not the data.
What Esker is not
- Not an orchestrator. No DAGs, no triggers, no schedulers. `cadence` is metadata, not behavior. Use cron, Airflow, Dagster, or Prefect to schedule Esker runs.
- Not a query engine. `esker.get` returns a polars `DataFrame`. Bring your own analytics.
- Not a transformation framework. `transform` is a per-record pure function, not a SQL model. Esker is upstream of dbt.
- Not a data catalog. No tags, no business glossary, no SLAs.
- Not a real-time stream. Bulk fetches, batch parquet, occasional runs.
Design principles
- Minimalism is a product-level decision. The default is no decoration. Color is signal: red for errors, dim for secondary info. No emoji, no icons, no exclamation, no spinners.
- Types carry invariants, not convention. `EskerModel` is `frozen=True, extra="forbid"`. Records are values, not objects. Misuse fails loudly.
- The CLI voice is `git push`, not `dbt run`. Two-line success. No progress bars. No banners.
- Bind once, then live in bare-name space. Owner choice is one explicit moment per dataset. Code reads bare names; bindings disambiguate.
- No hidden behavior. No background jobs, no implicit retries, no upstream caching for bulk sources. What you see in the script is what runs.
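The "types carry invariants" principle can be illustrated without the SDK. The sketch below uses a stdlib frozen dataclass to approximate the two settings named above: mutation is rejected (the `frozen=True` behavior) and unknown fields are rejected at construction (the effect of `extra="forbid"`). It is an analogy under those assumptions, not how `EskerModel` is implemented.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Record:
    cik: str
    ticker: str


r = Record(cik="0000320193", ticker="AAPL")

# Mutation fails loudly instead of silently changing a published value.
try:
    r.ticker = "MSFT"
    mutated = True
except FrozenInstanceError:
    mutated = False

# An unknown field fails at construction rather than being dropped.
try:
    Record(cik="0000320193", ticker="AAPL", exchange="NASDAQ")
    accepted_extra = True
except TypeError:
    accepted_extra = False

# Records compare by value, not identity.
same_value = r == Record(cik="0000320193", ticker="AAPL")
```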
Where to go next
- Getting started for the end-to-end walkthrough.
- Pipelines if you have a dataset in mind.
- Reading if you only want to consume what others publish.
- CLI overview for every command and flag.