# Caching

> Every disk path the SDK touches and the env var that overrides it.

The SDK touches a handful of paths on disk. All are derived from `esker.config` accessors, which re-read the environment on every call (no in-process caching).
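As a rough illustration of what "re-read the environment on every call" implies, here is a minimal sketch. The `cache_dir` function below is a hypothetical stand-in, not the actual `esker.config` implementation; only the env var name and default come from the table in this page.

```python
import os
from pathlib import Path


def cache_dir() -> Path:
    # Hypothetical stand-in for the real accessor: the env var is read
    # at call time, so nothing is memoized between calls.
    return Path(os.environ.get("ESKER_CACHE_DIR", "./cache"))


os.environ.pop("ESKER_CACHE_DIR", None)
default_path = cache_dir()  # falls back to ./cache

os.environ["ESKER_CACHE_DIR"] = "/tmp/esker-cache"
overridden_path = cache_dir()  # picks up the change without a restart
```

The practical consequence is that changing an env var mid-process (for example in a test fixture) takes effect on the next accessor call.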

## Paths

| function                 | default                 | env var                      | what                                            |
| ------------------------ | ----------------------- | ---------------------------- | ----------------------------------------------- |
| `cache_dir()`            | `./cache`               | `ESKER_CACHE_DIR`            | source-fetch cache (`fetch_cached`)             |
| `consumer_cache_dir()`   | `~/.esker/cache`        | `ESKER_CONSUMER_CACHE_DIR`   | downloaded-dataset cache (`esker.get`)          |
| `output_dir()`           | `./output`              | `ESKER_OUTPUT_DIR`           | default for `pipeline.run()` and CLI `--output` |
| `hub_url()`              | `http://localhost:3001` | `ESKER_HUB_URL`              | hub API base                                    |
| `web_url()`              | `http://localhost:3000` | `ESKER_WEB_URL`              | hub web (used by `login`)                       |
| `credentials_path()`     | `~/.esker/credentials`  | `ESKER_CREDENTIALS_PATH`     | login token, email, handle                      |
| `global_bindings_path()` | `~/.esker/config.toml`  | `ESKER_GLOBAL_BINDINGS_PATH` | global `[datasets]` map                         |
| `http_timeout()`         | `60` (seconds)          | `ESKER_HTTP_TIMEOUT`         | every outbound HTTP timeout                     |

## Two cache dirs, two purposes

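The two cache directories are easy to confuse, but they serve different audiences: one is for pipeline authors, the other for dataset consumers.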

### `cache_dir()` — pipeline-author cache

Used by `EskerSource.fetch_cached(id)`. Layout:

```
./cache/
└── <SOURCE_ID>/
    └── <safe_id>.json    JSON envelope: {"raw": ..., "source_url": ..., "fetched_at": "..."}
```

`safe_id` is the record id with `/` and `:` replaced by `_`. Other characters that are unsafe on some filesystems (such as `*` and `?`) are left as-is.
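The substitution can be sketched as follows. The function names `safe_id` and `cache_file` are illustrative assumptions here; the SDK's own helper may be named or structured differently.

```python
from pathlib import Path


def safe_id(record_id: str) -> str:
    # Only "/" and ":" are replaced; everything else passes through.
    return record_id.replace("/", "_").replace(":", "_")


def cache_file(cache_root: Path, source_id: str, record_id: str) -> Path:
    # Mirrors the documented layout: <cache_root>/<SOURCE_ID>/<safe_id>.json
    return cache_root / source_id / f"{safe_id(record_id)}.json"


example = cache_file(Path("./cache"), "MY_SOURCE", "doi:10.1000/xyz")
```

Under this scheme two distinct ids such as `a/b` and `a:b` would map to the same file name, so ids that differ only in those characters could collide.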

Bulk sources skip this cache and re-fetch the whole payload on every run. Per-id sources use it to avoid hitting their origin again.
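A read-through cache over the envelope shown above might look like the sketch below. `fetch_cached_sketch` and its `fetch` callback are hypothetical, and the ISO 8601 timestamp format for `fetched_at` is an assumption; the real `fetch_cached` may differ on both points.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Callable


def fetch_cached_sketch(
    cache_file: Path, source_url: str, fetch: Callable[[str], Any]
) -> dict:
    # Cache hit: return the stored envelope without touching the origin.
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))

    # Cache miss: fetch once, wrap in the envelope, persist, return.
    envelope = {
        "raw": fetch(source_url),
        "source_url": source_url,
        # Assumed format; the page does not specify how fetched_at is encoded.
        "fetched_at": datetime.now(timezone.utc).isoformat(),
    }
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(envelope), encoding="utf-8")
    return envelope
```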

### `consumer_cache_dir()` — dataset-consumer cache

Used by `esker.get(ref)`. Layout:

```
~/.esker/cache/
└── <owner>/
    └── <domain_id>/
        └── <schema_version>/
            └── data.parquet
```

`compute_content_hash` runs on every `esker.get` call — even cached files are verified against the manifest. See [Reading](https://esker.so/docs/sdk/reading.md) for the verification cost.
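To make the cost of per-call verification concrete, here is a sketch of locating a cached file and checking it against an expected hash. Everything here is an assumption for illustration: the helper names are hypothetical, and SHA-256 is a stand-in since this page does not say which algorithm `compute_content_hash` uses.

```python
import hashlib
from pathlib import Path


def consumer_cache_file(
    root: Path, owner: str, domain_id: str, schema_version: str
) -> Path:
    # Mirrors the documented layout under the consumer cache root.
    return root / owner / domain_id / schema_version / "data.parquet"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    # Stream the file in chunks so large parquet files are not loaded whole.
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_cache_valid(path: Path, expected_hash: str) -> bool:
    # A cached file counts only if it exists and matches the manifest hash.
    return path.exists() and sha256_of(path) == expected_hash
```

Hashing reads the entire file, so the check scales with dataset size even when nothing is downloaded.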

## Output dir

CLI and library share one default: `output_dir()` (`./output`). Override via `ESKER_OUTPUT_DIR` or `--output`.

- CLI: every `--output` flag defaults to `output_dir()`, so the env var applies whenever the flag is omitted.
- Library: `EskerPipeline.run()` falls back to `output_dir()` when no `output_dir=` is passed.

Because both resolve the same default, CLI invocations and direct library calls write to the same place unless one of them is overridden.
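The resolution order described in this section (explicit argument, then env var, then `./output`) can be sketched like this. `resolve_output_dir` is a hypothetical helper, not an SDK function.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_output_dir(explicit: Optional[str] = None) -> Path:
    # 1. An explicit value (--output or output_dir=) wins.
    if explicit is not None:
        return Path(explicit)
    # 2. Otherwise the env var, 3. otherwise the documented default.
    return Path(os.environ.get("ESKER_OUTPUT_DIR", "./output"))
```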

After `esker run` (or `pipeline.run()`):

```
<output_dir>/
├── <DOMAIN_ID>.parquet
└── <DOMAIN_ID>.lineage.json
```

Filenames are not configurable: the stem is always the dotted `DOMAIN_ID`. Re-running overwrites in place.
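If you need to locate these files from your own code, a sketch along these lines should work. `output_files` is a hypothetical helper and `weather.daily` is a made-up domain id.

```python
from pathlib import Path
from typing import Tuple


def output_files(output_dir: Path, domain_id: str) -> Tuple[Path, Path]:
    # Append the extensions by string formatting. Path.with_suffix would
    # treat the last dotted segment of the domain id as a suffix and drop it.
    parquet = output_dir / f"{domain_id}.parquet"
    lineage = output_dir / f"{domain_id}.lineage.json"
    return parquet, lineage


parquet_path, lineage_path = output_files(Path("./output"), "weather.daily")
```

The comment about `with_suffix` matters because domain ids are dotted: `Path("weather.daily").with_suffix(".parquet")` yields `weather.parquet`, which is not the file the SDK writes.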

## Auth files

`~/.esker/credentials`: JSON with `token`, `user_email`, `owner_handle`, `expires_at`. Mode 0600 (best-effort).

`~/.esker/config.toml`: TOML with a `[datasets]` table mapping bare names to `<owner>/<name>` (no version pinning at this scope).

The `~/.esker/` directory is created on first write, not eagerly.
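Lazy directory creation and best-effort permissions could be implemented roughly as below. This is a sketch under assumptions: `write_credentials` is a hypothetical name, and the SDK may set the mode differently (for instance at file creation time).

```python
import json
import os
from pathlib import Path


def write_credentials(path: Path, payload: dict) -> None:
    # The parent directory is created only when something is written.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload), encoding="utf-8")
    try:
        # Best-effort: restrict to owner read/write where the OS supports it.
        os.chmod(path, 0o600)
    except OSError:
        pass  # e.g. filesystems without POSIX permission bits
```

A payload with the documented keys would be `{"token": ..., "user_email": ..., "owner_handle": ..., "expires_at": ...}`.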

## Hub URL defaults are localhost

Out of the box the SDK assumes you're running esker-hub locally:

```
ESKER_HUB_URL = http://localhost:3001
ESKER_WEB_URL = http://localhost:3000
```

For production, set `ESKER_HUB_URL=https://hub.esker.so` (or wherever your hub is hosted) and `ESKER_WEB_URL=https://esker.so` in your environment. There is no `.env` file convention, so these must be set as real environment variables.

## HTTP timeout

`ESKER_HTTP_TIMEOUT` (default 60s) applies to every outbound HTTP call:

- Source fetches in `BulkJsonSource` / `BulkCsvSource`.
- Hub API calls (`fetch_manifest`, `download_artifact_to`, `upload_*`, `search_datasets`).
- `auth.fetch_whoami` (used by `login` and `whoami`).
- The CLI commands that hand-roll requests (`config set-handle`, `transfer`, `visibility`).

An unresponsive origin trips the timeout, and the failure surfaces as `<TypeName>: <msg>` (e.g. `RemoteDisconnected: Remote end closed connection without response`).
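The `<TypeName>: <msg>` shape is what you get from formatting an exception's class name and message. The sketch below shows that formatting alongside a stand-in timeout accessor; both functions are hypothetical, and returning a `float` is an assumption since the page only gives the default as `60` seconds.

```python
import os


def http_timeout() -> float:
    # Stand-in accessor: read ESKER_HTTP_TIMEOUT at call time, default 60s.
    return float(os.environ.get("ESKER_HTTP_TIMEOUT", "60"))


def format_error(exc: BaseException) -> str:
    # Produces the "<TypeName>: <msg>" shape described above.
    return f"{type(exc).__name__}: {exc}"


message = format_error(TimeoutError("timed out"))
```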

## Parquet outputs

What lands on disk after `esker run`:

- `<DOMAIN_ID>.parquet`: record rows with the three injected `esker_*` columns.
- `<DOMAIN_ID>.lineage.json`: a `LineageBundle`, one batch per unique `(source_url, fetched_at)`.
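"One batch per unique `(source_url, fetched_at)`" amounts to a group-by over the records. The sketch below uses plain dicts with hypothetical `source_url` and `fetched_at` keys; the actual `LineageBundle` structure and the names of the injected columns are not specified on this page.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_into_batches(
    records: Iterable[dict],
) -> Dict[Tuple[str, str], List[dict]]:
    # Records sharing the same (source_url, fetched_at) land in one batch.
    batches: Dict[Tuple[str, str], List[dict]] = defaultdict(list)
    for record in records:
        key = (record["source_url"], record["fetched_at"])
        batches[key].append(record)
    return dict(batches)


rows = [
    {"id": 1, "source_url": "https://example.com/a", "fetched_at": "2024-01-01T00:00:00Z"},
    {"id": 2, "source_url": "https://example.com/a", "fetched_at": "2024-01-01T00:00:00Z"},
    {"id": 3, "source_url": "https://example.com/b", "fetched_at": "2024-01-01T00:00:00Z"},
]
batches = group_into_batches(rows)
```

A bulk source that fetches once per run would therefore tend to produce a single batch, while a per-id source produces one batch per distinct fetch.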

After `esker push`, the hub additionally receives `schema.json`, `schema.arrow`, `schema.d.ts`, and `manifest.json`. Local files are unchanged from `run`.

After `esker pull <ref>`:

- `<output_dir>/<DOMAIN_ID>.parquet`

One file. **No lineage.json on pull** — `pull` only fetches `data.parquet`. To get lineage, hit the artifact URL directly.

## See also

- [Reading](https://esker.so/docs/sdk/reading.md) — `esker.get` and `_ensure_local`
- [Sources](https://esker.so/docs/sdk/sources.md) — `fetch_cached` mechanics
- [Auth](https://esker.so/docs/cli/auth.md) — credentials file and login