Caching

Every disk path the SDK touches and the env var that overrides it.

The SDK reads and writes a handful of paths on disk. Each one comes from an accessor in esker.config, and every accessor re-reads the environment on each call (no in-process caching), so an override set mid-process takes effect on the next call.

Paths

Function	Default	Env var	Purpose
`cache_dir()`	`./cache`	`ESKER_CACHE_DIR`	source-fetch cache (`fetch_cached`)
`consumer_cache_dir()`	`~/.esker/cache`	`ESKER_CONSUMER_CACHE_DIR`	downloaded-dataset cache (`esker.get`)
`output_dir()`	`./output`	`ESKER_OUTPUT_DIR`	default for `pipeline.run()` and CLI `--output`
`hub_url()`	`http://localhost:3001`	`ESKER_HUB_URL`	hub API base
`web_url()`	`http://localhost:3000`	`ESKER_WEB_URL`	hub web (used by `login`)
`credentials_path()`	`~/.esker/credentials`	`ESKER_CREDENTIALS_PATH`	login token, email, handle
`global_bindings_path()`	`~/.esker/config.toml`	`ESKER_GLOBAL_BINDINGS_PATH`	global `[datasets]` map
`http_timeout()`	`60` (seconds)	`ESKER_HTTP_TIMEOUT`	every outbound HTTP timeout
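
As an illustration of the "re-read on every call" behavior, here is a minimal sketch of what such accessors could look like. It is not the SDK's actual source: only the function names, defaults, and env var names come from the table above, and the `~` expansion and float conversion are assumptions.

```python
import os
from pathlib import Path


def cache_dir() -> Path:
    # Look up the env var at call time; nothing is memoized.
    return Path(os.environ.get("ESKER_CACHE_DIR", "./cache"))


def consumer_cache_dir() -> Path:
    # Assumption: "~" in the default (or an override) is expanded here.
    raw = os.environ.get("ESKER_CONSUMER_CACHE_DIR", "~/.esker/cache")
    return Path(raw).expanduser()


def http_timeout() -> float:
    # Assumption: the value is parsed as a number of seconds.
    return float(os.environ.get("ESKER_HTTP_TIMEOUT", "60"))
```

With accessors shaped like this, setting `os.environ["ESKER_CACHE_DIR"]` inside a running process changes what the next `cache_dir()` call returns, with no restart or reload step.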

Two cache dirs, two purposes

The two are easy to confuse, but they serve different audiences and are not interchangeable.

`cache_dir()` — pipeline-author cache

Used by EskerSource.fetch_cached(id). Layout:

./cache/
└── <SOURCE_ID>/
    └── <safe_id>.json    JSON envelope: {"raw": ..., "source_url": ..., "fetched_at": "..."}

safe_id is the record id with / and : each replaced by _. Other characters that are unsafe on some filesystems (such as * and ?) are passed through unchanged.
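
The substitution rule and the envelope layout can be sketched as follows. This is a hedged reconstruction from the description above, not the SDK's code: the helper names `safe_id`, `envelope_path`, and `write_envelope` are hypothetical.

```python
import json
from pathlib import Path


def safe_id(record_id: str) -> str:
    # Only "/" and ":" are replaced; "*", "?" and similar pass through.
    return record_id.replace("/", "_").replace(":", "_")


def envelope_path(cache_root: Path, source_id: str, record_id: str) -> Path:
    # <cache_root>/<SOURCE_ID>/<safe_id>.json
    return cache_root / source_id / f"{safe_id(record_id)}.json"


def write_envelope(path: Path, raw, source_url: str, fetched_at: str) -> None:
    # Create the per-source directory on demand, then write the envelope.
    path.parent.mkdir(parents=True, exist_ok=True)
    envelope = {"raw": raw, "source_url": source_url, "fetched_at": fetched_at}
    path.write_text(json.dumps(envelope))
```

Under this rule, two ids that differ only in `/` versus `:` (for example `a/b` and `a:b`) map to the same file name, which is worth keeping in mind when choosing ids.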

Bulk sources don't use this — they re-fetch the whole payload on every run. Per-id sources use it to avoid re-hitting their origin.

`consumer_cache_dir()` — dataset-consumer cache

Used by esker.get(ref). Layout:

~/.esker/cache/
└── <owner>/
    └── <domain_id>/
        └── <schema_version>/
            └── data.parquet

compute_content_hash runs on every esker.get call — even cached files are verified against the manifest. See Reading for the verification cost.
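
To make the verification cost concrete, the sketch below builds the cache path from the layout above and re-hashes the file before trusting it. The hash algorithm used by compute_content_hash is not stated here, so SHA-256 is an assumption, and all helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def cached_parquet_path(root: Path, owner: str, domain_id: str,
                        schema_version: str) -> Path:
    # <root>/<owner>/<domain_id>/<schema_version>/data.parquet
    return root / owner / domain_id / schema_version / "data.parquet"


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    # Stream the file in chunks so large datasets are not loaded into memory.
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_cache_valid(path: Path, expected_hash: str) -> bool:
    # A cached file counts only if it exists and matches the manifest hash.
    return path.is_file() and sha256_of_file(path) == expected_hash
```

Because the whole file is read on each check, the cost of a cache hit grows with dataset size even though no download happens.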

Output dir

CLI and library share one default: output_dir() (./output). Override via ESKER_OUTPUT_DIR or --output.

CLI: each command's --output defaults to output_dir(), so the env var applies when the flag is omitted.
Library: EskerPipeline.run() falls back to output_dir() when no output_dir= is passed.

As a result, CLI invocations and direct library calls write to the same place unless one of them overrides it.

After esker run (or pipeline.run()):

<output_dir>/
├── <DOMAIN_ID>.parquet
└── <DOMAIN_ID>.lineage.json

Filenames are not configurable: the stem is always the dotted DOMAIN_ID, followed by .parquet or .lineage.json. Re-running overwrites in place.
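
A small sketch of how those two file names follow from a DOMAIN_ID. The helper name `output_paths` and the example id are hypothetical. Note that a dotted id should be concatenated with the extension rather than passed through `Path.with_suffix`, which would replace the last dotted segment.

```python
from pathlib import Path
from typing import Tuple


def output_paths(output_dir: Path, domain_id: str) -> Tuple[Path, Path]:
    # Append extensions with string formatting: with_suffix() would treat
    # the last dotted segment of the DOMAIN_ID as an existing suffix.
    parquet = output_dir / f"{domain_id}.parquet"
    lineage = output_dir / f"{domain_id}.lineage.json"
    return parquet, lineage


parquet, lineage = output_paths(Path("output"), "finance.rates.daily")
print(parquet.name)  # finance.rates.daily.parquet
print(lineage.name)  # finance.rates.daily.lineage.json
```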

Auth files

~/.esker/credentials: JSON with token, user_email, owner_handle, expires_at. Mode 0600 (best-effort).

~/.esker/config.toml: TOML with a [datasets] table mapping bare names to <owner>/<name> (no version pinning at this scope).

The ~/.esker/ directory is created on first write, not eagerly.
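
The "created on first write" and "mode 0600 (best-effort)" behaviors could look roughly like this. It is a sketch under assumptions, not the SDK's implementation: `write_credentials` is a hypothetical name, and "best-effort" is interpreted here as ignoring a failed chmod.

```python
import json
import os
from pathlib import Path


def write_credentials(path: Path, payload: dict) -> None:
    # The parent directory is created lazily, on the first write.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload))
    try:
        # Restrict the file to its owner (read/write only).
        os.chmod(path, 0o600)
    except OSError:
        # Best-effort: some filesystems cannot honor POSIX modes.
        pass
```

A stricter variant would create the file with `os.open(..., 0o600)` so it never exists with wider permissions; the version above keeps the sketch short.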

Hub URL defaults are localhost

Out of the box the SDK assumes you're running esker-hub locally:

ESKER_HUB_URL = http://localhost:3001
ESKER_WEB_URL = http://localhost:3000

For production, set ESKER_HUB_URL=https://hub.esker.so (or the address of your own hub) and ESKER_WEB_URL=https://esker.so in your environment. The SDK has no .env file convention; it reads environment variables only.

HTTP timeout

ESKER_HTTP_TIMEOUT (default 60s) applies to every outbound HTTP call:

Source fetches in BulkJsonSource / BulkCsvSource.
Hub API calls (fetch_manifest, download_artifact_to, upload_*, search_datasets).
auth.fetch_whoami (used by login and whoami).
The CLI commands that hand-roll requests (config set-handle, transfer, visibility).

A wedged origin trips the timeout and surfaces as <TypeName>: <msg> (e.g. RemoteDisconnected: Remote end closed connection without response).
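
The `<TypeName>: <msg>` shape is the exception's class name followed by its message. A minimal sketch of that formatting, assuming a hypothetical helper named `format_error` (how the SDK actually builds the string is not shown here):

```python
def format_error(exc: BaseException) -> str:
    # Class name, a colon, then the exception's string form.
    return f"{type(exc).__name__}: {exc}"


try:
    raise TimeoutError("timed out")
except TimeoutError as exc:
    print(format_error(exc))  # TimeoutError: timed out
```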

Parquet outputs

What lands on disk after esker run:

<DOMAIN_ID>.parquet: record rows with the three injected esker_* columns.
<DOMAIN_ID>.lineage.json: LineageBundle, one batch per unique (source_url, fetched_at).
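
"One batch per unique (source_url, fetched_at)" amounts to grouping records by that pair. The sketch below shows the grouping only. The real LineageBundle schema is not documented here, so the output field `row_indices` and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_into_batches(records: List[dict]) -> List[dict]:
    # Key each record by its (source_url, fetched_at) pair.
    batches: Dict[Tuple[str, str], List[int]] = defaultdict(list)
    for index, record in enumerate(records):
        batches[(record["source_url"], record["fetched_at"])].append(index)
    # One output entry per unique pair, in first-seen order.
    return [
        {"source_url": url, "fetched_at": ts, "row_indices": rows}
        for (url, ts), rows in batches.items()
    ]
```

A bulk source that fetches everything in one request would produce a single batch under this grouping, while a per-id source would produce one batch per distinct fetch.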

After esker push, the hub additionally receives schema.json, schema.arrow, schema.d.ts, and manifest.json. Local files are unchanged from run.

After esker pull <ref>:

<output_dir>/<DOMAIN_ID>.parquet

Pull writes exactly one file. It fetches only the data.parquet artifact, so no lineage.json is written. To get lineage, request the artifact URL directly.