Signed-off-by: Daniel Henry <iamdanhenry@gmail.com>
# notebook-tools

FastAPI service that:
- downloads PDFs from Paperless-ngx
- splits them into pages (JPEG)
- OCRs each page via your llama.cpp OpenAI-compatible endpoint
- converts each page back into a single-page PDF
- uploads **one Paperless document per page** (all uploads run **in parallel**; OCR stays **one page at a time** for VRAM)
- patches each uploaded document with:
  - `content` = OCR text
  - custom fields `notebook_id` (field id 1) and `notebook_page` (field id 2)
  - `document_type` = Paperless document type id (default **3**, configurable)

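The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: `process_pages`, `build_patch_payload`, and the `ocr_page` / `upload_page` callables are hypothetical names, the `custom_fields` list-of-`{field, value}` shape is an assumption about the Paperless-ngx document API, and starting each upload as soon as its page has been OCR'd is just one way to combine sequential OCR with parallel uploads.

```python
import asyncio
from typing import Awaitable, Callable


def build_patch_payload(
    ocr_text: str,
    notebook_id: str,
    page_number: int,
    *,
    notebook_id_field: int = 1,
    notebook_page_field: int = 2,
    document_type_id: int = 3,
) -> dict:
    """Build the JSON body used to patch one uploaded page (assumed shape)."""
    return {
        "content": ocr_text,
        "custom_fields": [
            {"field": notebook_id_field, "value": notebook_id},
            {"field": notebook_page_field, "value": page_number},
        ],
        "document_type": document_type_id,
    }


async def process_pages(
    pages: list[bytes],
    ocr_page: Callable[[bytes], Awaitable[str]],
    upload_page: Callable[[int, bytes, str], Awaitable[None]],
    upload_concurrency: int = 4,
) -> list[str]:
    """OCR pages strictly one at a time while uploads overlap in the background."""
    # 0 means "no cap", mirroring PAPERLESS_UPLOAD_CONCURRENCY in the example .env.
    limiter = asyncio.Semaphore(upload_concurrency) if upload_concurrency > 0 else None

    async def guarded_upload(number: int, page: bytes, text: str) -> None:
        if limiter is None:
            await upload_page(number, page, text)
        else:
            async with limiter:
                await upload_page(number, page, text)

    texts: list[str] = []
    uploads: list[asyncio.Task] = []
    for number, page in enumerate(pages, start=1):
        # Awaiting here keeps OCR sequential, so only one page uses VRAM at a time.
        text = await ocr_page(page)
        texts.append(text)
        # Schedule the upload without waiting for it to finish.
        uploads.append(asyncio.create_task(guarded_upload(number, page, text)))

    await asyncio.gather(*uploads)
    return texts
```

Passing the OCR and upload steps in as callables keeps the sketch independent of any particular HTTP client; in a real service they would wrap the llama.cpp and Paperless requests.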
## Setup

Install deps:

```bash
uv sync
```

Create a `.env` file (example below) and **do not commit it**.

## Run locally

```bash
uv run uvicorn notebook_tools.api:app --reload --host 0.0.0.0 --port 8080
```

Then open the docs at:
- `http://127.0.0.1:8080/docs` (same machine)
- `http://<your-lan-ip>:8080/docs` (other machines on your network)

If other machines still can’t connect, check your macOS firewall and any router/network rules.

## Docker

Build and run (pass env via file or `-e`; the app reads `.env` only if you mount it):

```bash
docker build -t notebook-tools:local .
docker run --rm -p 8080:8080 --env-file .env notebook-tools:local
```

`LLAMA_BASE_URL` / `PAPERLESS_BASE_URL` must be reachable **from inside the container** (use `host.docker.internal` on Docker Desktop, or your LAN IP, not `127.0.0.1` for services on the host).

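To make the reachability rule concrete, here is a small sketch that flags loopback URLs and rewrites them to a host alias before they are used inside a container. The helper names `is_loopback_url` and `for_container` are hypothetical and not part of notebook-tools; the sketch also drops any `user:password@` part of the URL, which is assumed to be absent here.

```python
import ipaddress
from urllib.parse import urlsplit, urlunsplit


def is_loopback_url(url: str) -> bool:
    """Return True if the URL's host is localhost or a loopback IP address."""
    host = urlsplit(url).hostname or ""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal, e.g. a DNS name such as paperless.example.com.
        return False


def for_container(url: str, host_alias: str = "host.docker.internal") -> str:
    """Swap a loopback host for an alias that resolves to the Docker host."""
    if not is_loopback_url(url):
        return url
    parts = urlsplit(url)
    netloc = host_alias if parts.port is None else f"{host_alias}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

For example, `for_container("http://127.0.0.1:9292")` yields a URL on `host.docker.internal:9292`, while a public `https://…` URL is returned unchanged.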
### Docker Compose

Save as `compose.yaml` (any directory with your `.env`):

```yaml
services:
  notebook-tools:
    image: git.danhenry.dev/daniel/notebook-tools:latest
    ports:
      - "8080:8080"
    env_file:
      - .env
    # Lets the container reach services bound on the host (e.g. llama on :9292).
    # Linux: requires Docker 20.10+ / Compose v2; omit on Docker Desktop if already available.
    extra_hosts:
      - "host.docker.internal:host-gateway"
```

```bash
docker compose pull && docker compose up
```

Log in to `git.danhenry.dev` first if the registry requires auth: `docker login git.danhenry.dev`.

For llama running **on the host**, set in `.env`:

```bash
LLAMA_BASE_URL="http://host.docker.internal:9292"
```

`PAPERLESS_BASE_URL` can stay a normal `https://…` URL if the container has network access to it.

CI: on push to `main`, [.github/workflows/build-docker.yml](.github/workflows/build-docker.yml) builds and pushes the image using the same secrets pattern as your other Gitea repos (`DOCKER_REGISTRY`, `DOCKER_USERNAME`, `DOCKER_PASSWORD`). For Docker Hub, set `DOCKER_REGISTRY` to `docker.io` (or leave it as your runner docs describe).

## Example `.env`

```bash
PAPERLESS_BASE_URL="https://paperless.example.com"
PAPERLESS_TOKEN="paste-token-here"

LLAMA_BASE_URL="http://127.0.0.1:9292"
LLAMA_MODEL="ggml-model-q4_k_m"

# Custom field ids in Paperless
PAPERLESS_CUSTOM_FIELD_NOTEBOOK_ID=1
PAPERLESS_CUSTOM_FIELD_NOTEBOOK_PAGE=2
PAPERLESS_DOCUMENT_TYPE_ID=3

# Optional: cap concurrent Paperless uploads (0 = unlimited)
PAPERLESS_UPLOAD_CONCURRENCY=4

# Rendering / OCR knobs
RENDER_DPI=200
OCR_MAX_TOKENS=1024
OCR_TEMPERATURE=0.0
```
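
As an illustration of how these variables might be read, the sketch below loads them into a typed settings object using only the standard library. It is not the service's actual configuration code: the `Settings` class and `load_settings` function are hypothetical, treating the four URL/token/model variables as required is an assumption, and the defaults for concurrency, DPI, and the OCR knobs are simply the example values above (only the field ids and document type default are stated in this README).

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    paperless_base_url: str
    paperless_token: str
    llama_base_url: str
    llama_model: str
    notebook_id_field: int = 1
    notebook_page_field: int = 2
    document_type_id: int = 3
    upload_concurrency: int = 4  # 0 = unlimited
    render_dpi: int = 200
    ocr_max_tokens: int = 1024
    ocr_temperature: float = 0.0


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from an environment-like mapping, failing fast on gaps."""

    def required(name: str) -> str:
        value = env.get(name, "").strip()
        if not value:
            raise ValueError(f"missing required setting: {name}")
        return value

    return Settings(
        # Strip a trailing slash so paths can be appended consistently.
        paperless_base_url=required("PAPERLESS_BASE_URL").rstrip("/"),
        paperless_token=required("PAPERLESS_TOKEN"),
        llama_base_url=required("LLAMA_BASE_URL").rstrip("/"),
        llama_model=required("LLAMA_MODEL"),
        notebook_id_field=int(env.get("PAPERLESS_CUSTOM_FIELD_NOTEBOOK_ID", "1")),
        notebook_page_field=int(env.get("PAPERLESS_CUSTOM_FIELD_NOTEBOOK_PAGE", "2")),
        document_type_id=int(env.get("PAPERLESS_DOCUMENT_TYPE_ID", "3")),
        upload_concurrency=int(env.get("PAPERLESS_UPLOAD_CONCURRENCY", "4")),
        render_dpi=int(env.get("RENDER_DPI", "200")),
        ocr_max_tokens=int(env.get("OCR_MAX_TOKENS", "1024")),
        ocr_temperature=float(env.get("OCR_TEMPERATURE", "0.0")),
    )
```

Accepting any mapping rather than reading `os.environ` directly makes the loader easy to exercise with a plain dictionary.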