Caching
Every disk path the SDK touches and the env var that overrides it.
The SDK touches a handful of paths on disk. All are derived from esker.config accessors, which re-read the environment on every call (no in-process caching).
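A minimal sketch of what "re-read the environment on every call" implies, assuming each accessor is a thin wrapper over os.environ. The function bodies are illustrative stand-ins, not the SDK's source; only the accessor names, defaults, and env var names are taken from this page.

```python
import os
from pathlib import Path


def cache_dir() -> Path:
    # Illustrative stand-in: read the env var at call time, never memoize.
    return Path(os.environ.get("ESKER_CACHE_DIR", "./cache"))


def http_timeout() -> float:
    # Illustrative stand-in: the documented default is 60 seconds.
    return float(os.environ.get("ESKER_HTTP_TIMEOUT", "60"))


os.environ.pop("ESKER_CACHE_DIR", None)
before = cache_dir()  # falls back to the default

# An override set mid-process is picked up by the very next call.
os.environ["ESKER_CACHE_DIR"] = "/tmp/esker-cache"
after = cache_dir()
```

Under this assumption, tests and notebooks can redirect paths by mutating os.environ without reloading anything.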
Paths
| function | default | env var | what |
|---|---|---|---|
| cache_dir() | ./cache | ESKER_CACHE_DIR | source-fetch cache (fetch_cached) |
| consumer_cache_dir() | ~/.esker/cache | ESKER_CONSUMER_CACHE_DIR | downloaded-dataset cache (esker.get) |
| output_dir() | ./output | ESKER_OUTPUT_DIR | default for pipeline.run() and CLI --output |
| hub_url() | http://localhost:3001 | ESKER_HUB_URL | hub API base |
| web_url() | http://localhost:3000 | ESKER_WEB_URL | hub web (used by login) |
| credentials_path() | ~/.esker/credentials | ESKER_CREDENTIALS_PATH | login token, email, handle |
| global_bindings_path() | ~/.esker/config.toml | ESKER_GLOBAL_BINDINGS_PATH | global [datasets] map |
| http_timeout() | 60 (seconds) | ESKER_HTTP_TIMEOUT | every outbound HTTP timeout |
Two cache dirs, two purposes
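The two are easy to confuse, but they serve different roles: cache_dir() holds raw source fetches for pipeline authors, while consumer_cache_dir() holds downloaded datasets for consumers.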
cache_dir() — pipeline-author cache
Used by EskerSource.fetch_cached(id). Layout:
./cache/
└── <SOURCE_ID>/
    └── <safe_id>.json    JSON envelope: {"raw": ..., "source_url": ..., "fetched_at": "..."}
safe_id replaces / and : with _. Other unsafe filesystem chars (*, ?) aren't handled.
Bulk sources don't use this — they re-fetch the whole payload on every run. Per-id sources use it to avoid re-hitting their origin.
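The per-id layout above can be sketched as follows. This is an illustration under stated assumptions, not the SDK's code: safe_id and cache_path are written here as free functions, and the source id, record id, and envelope values are made up. Only the replacement rule and the envelope keys come from the text.

```python
import json
import tempfile
from pathlib import Path


def safe_id(record_id: str) -> str:
    # Only "/" and ":" are replaced, per the text; "*" and "?" pass through.
    return record_id.replace("/", "_").replace(":", "_")


def cache_path(root: Path, source_id: str, record_id: str) -> Path:
    # <cache_dir>/<SOURCE_ID>/<safe_id>.json
    return root / source_id / f"{safe_id(record_id)}.json"


root = Path(tempfile.mkdtemp())  # stands in for cache_dir()
path = cache_path(root, "example.source", "doi:10.1000/xyz")
path.parent.mkdir(parents=True, exist_ok=True)

# Envelope keys come from the layout above; the values are placeholders.
envelope = {
    "raw": {"title": "placeholder"},
    "source_url": "https://example.org/records/xyz",
    "fetched_at": "2024-01-01T00:00:00Z",
}
path.write_text(json.dumps(envelope))
```

One consequence of a replacement rule like this: ids that differ only in / versus : map to the same file name.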
consumer_cache_dir() — dataset-consumer cache
Used by esker.get(ref). Layout:
~/.esker/cache/
└── <owner>/
    └── <domain_id>/
        └── <schema_version>/
            └── data.parquet
compute_content_hash runs on every esker.get call — even cached files are verified against the manifest. See Reading for the verification cost.
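A sketch of the path layout and the verify-on-every-read behavior. The helper names, the owner and dataset values, and the use of SHA-256 are all assumptions for illustration; this page does not say how compute_content_hash works or which manifest field it is compared against.

```python
import hashlib
import tempfile
from pathlib import Path


def consumer_path(root: Path, owner: str, domain_id: str, schema_version: str) -> Path:
    # <consumer_cache_dir>/<owner>/<domain_id>/<schema_version>/data.parquet
    return root / owner / domain_id / schema_version / "data.parquet"


def file_digest(path: Path) -> str:
    # Stand-in for compute_content_hash; SHA-256 is an assumption.
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def is_cache_valid(path: Path, expected_digest: str) -> bool:
    # Even a cache hit pays a full read of the file to verify it.
    return path.exists() and file_digest(path) == expected_digest


root = Path(tempfile.mkdtemp())  # stands in for consumer_cache_dir()
p = consumer_path(root, "alice", "weather.daily", "1")
p.parent.mkdir(parents=True)
p.write_bytes(b"not really parquet")
expected = hashlib.sha256(b"not really parquet").hexdigest()
```

The point of the sketch is the cost model: verification scales with file size on every call, cached or not.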
Output dir
CLI and library share one default: output_dir() (./output). Override via ESKER_OUTPUT_DIR or --output.
- CLI: each --output defaults to output_dir(); the env var applies.
- Library: EskerPipeline.run() falls back to output_dir() when no output_dir= is passed.
Mixing CLI invocations and direct library calls writes to the same place.
After esker run (or pipeline.run()):
<output_dir>/
├── <DOMAIN_ID>.parquet
└── <DOMAIN_ID>.lineage.json
Filenames are unconditional — they always equal the dotted DOMAIN_ID. Re-running overwrites in place.
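A sketch of how the output location and file names could be resolved. The precedence (explicit argument, then ESKER_OUTPUT_DIR, then ./output) is inferred from the text above; resolve_output_dir and output_files are hypothetical helpers, and "weather.daily" is a made-up DOMAIN_ID.

```python
import os
from pathlib import Path
from typing import Optional, Tuple


def resolve_output_dir(explicit: Optional[str] = None) -> Path:
    # Assumed precedence: explicit argument, then env var, then ./output.
    if explicit is not None:
        return Path(explicit)
    return Path(os.environ.get("ESKER_OUTPUT_DIR", "./output"))


def output_files(domain_id: str, explicit: Optional[str] = None) -> Tuple[Path, Path]:
    out = resolve_output_dir(explicit)
    # The dotted DOMAIN_ID is used verbatim as the file stem.
    return out / f"{domain_id}.parquet", out / f"{domain_id}.lineage.json"


os.environ.pop("ESKER_OUTPUT_DIR", None)
parquet_path, lineage_path = output_files("weather.daily")
```

The names are built with string formatting rather than Path.with_suffix, which would replace the last dotted component of the DOMAIN_ID instead of appending to it.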
Auth files
~/.esker/credentials: JSON with token, user_email, owner_handle, expires_at. Mode 0600 (best-effort).
~/.esker/config.toml: TOML with a [datasets] table mapping bare names to <owner>/<name> (no version pinning at this scope).
The ~/.esker/ directory is created on first write, not eagerly.
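A sketch of a lazy, best-effort credentials write consistent with the description above. write_credentials is a hypothetical helper, a temporary directory stands in for the home directory, and every field value is a placeholder; only the field names and the 0600 mode come from the text.

```python
import json
import os
import tempfile
from pathlib import Path


def write_credentials(path: Path, creds: dict) -> None:
    # Create the parent directory lazily, on first write.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(creds))
    try:
        # Best-effort: some platforms and filesystems ignore POSIX modes.
        os.chmod(path, 0o600)
    except OSError:
        pass


home = Path(tempfile.mkdtemp())  # stands in for the user's home directory
cred_path = home / ".esker" / "credentials"
existed_before = cred_path.parent.exists()

write_credentials(cred_path, {
    "token": "placeholder-token",
    "user_email": "user@example.com",
    "owner_handle": "example",
    "expires_at": "2030-01-01T00:00:00Z",
})
```

Swallowing the chmod failure is what "best-effort" amounts to in this sketch: the file is still written, but its permissions are not guaranteed.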
Hub URL defaults are localhost
Out of the box the SDK assumes you're running esker-hub locally:
ESKER_HUB_URL = http://localhost:3001
ESKER_WEB_URL = http://localhost:3000
For production, set ESKER_HUB_URL=https://hub.esker.so (or wherever your hub is hosted) and ESKER_WEB_URL=https://esker.so in your environment. There is no .env file convention; configuration comes from environment variables only.
HTTP timeout
ESKER_HTTP_TIMEOUT (default 60s) applies to every outbound HTTP call:
- Source fetches in BulkJsonSource / BulkCsvSource.
- Hub API calls (fetch_manifest, download_artifact_to, upload_*, search_datasets).
- auth.fetch_whoami (used by login and whoami).
- The CLI commands that hand-roll requests (config set-handle, transfer, visibility).
A wedged origin trips the timeout and surfaces as <TypeName>: <msg> (e.g. RemoteDisconnected: Remote end closed connection without response).
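A sketch of the timeout and error-reporting shape described above, with no network involved. describe_error, call_with_timeout, and wedged_origin are hypothetical names; the accessor body is a stand-in that assumes the value is parsed as seconds from the environment.

```python
import os


def http_timeout() -> float:
    # Stand-in accessor: seconds, re-read from the environment on each call.
    return float(os.environ.get("ESKER_HTTP_TIMEOUT", "60"))


def describe_error(exc: BaseException) -> str:
    # "<TypeName>: <msg>", the shape quoted above.
    return f"{type(exc).__name__}: {exc}"


def call_with_timeout(do_request):
    # do_request is a hypothetical callable that accepts timeout= in seconds.
    try:
        return do_request(timeout=http_timeout()), None
    except Exception as exc:
        return None, describe_error(exc)


def wedged_origin(timeout: float):
    # Simulates an origin that never answers within the budget.
    raise TimeoutError(f"no response after {timeout:g}s")


os.environ["ESKER_HTTP_TIMEOUT"] = "5"
result, error = call_with_timeout(wedged_origin)
```

Here error holds the string "TimeoutError: no response after 5s" and result is None.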
Parquet outputs
What lands on disk after esker run:
- data.parquet: record rows with the three injected esker_* columns.
- data.lineage.json: LineageBundle, one batch per unique (source_url, fetched_at).
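The "one batch per unique (source_url, fetched_at)" rule can be illustrated with a plain grouping. The records below are hypothetical and the dict of lists is not the real LineageBundle structure, which this page does not describe.

```python
from collections import defaultdict

# Hypothetical records; only source_url and fetched_at matter for batching.
records = [
    {"id": "a", "source_url": "https://example.org/feed", "fetched_at": "2024-01-01T00:00:00Z"},
    {"id": "b", "source_url": "https://example.org/feed", "fetched_at": "2024-01-01T00:00:00Z"},
    {"id": "c", "source_url": "https://example.org/item/c", "fetched_at": "2024-01-01T00:05:00Z"},
]

# Group record ids under their (source_url, fetched_at) pair.
batches = defaultdict(list)
for rec in records:
    batches[(rec["source_url"], rec["fetched_at"])].append(rec["id"])
```

Records a and b share a fetch and land in one batch; c gets its own, so a bulk source tends to yield a single batch while a per-id source yields roughly one per record.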
After esker push, the hub additionally receives schema.json, schema.arrow, schema.d.ts, and manifest.json. Local files are unchanged from run.
After esker pull <ref>:
<output_dir>/<DOMAIN_ID>.parquet
One file. No lineage.json on pull — pull only fetches data.parquet. To get lineage, hit the artifact URL directly.