notebook-tools
FastAPI service that:
- downloads PDFs from Paperless-ngx
- splits them into pages (JPEG)
- OCRs each page via your llama.cpp OpenAI-compatible endpoint
- converts each page back into a single-page PDF
- uploads one Paperless document per page (all uploads run in parallel; OCR stays one page at a time for VRAM)
- patches each uploaded document with:
  - content = OCR text
  - custom fields notebook_id (field id 1) and notebook_page (field id 2)
  - document_type = Paperless document type id (default 3, configurable)
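As a rough illustration of the patch step, the body sent for each uploaded page might look like the sketch below. The function name build_patch_payload and the value types (a string notebook id, an integer page number) are assumptions for illustration, not taken from the service's code. The custom_fields shape (a list of field/value pairs) follows the Paperless-ngx REST API as commonly documented, so check it against your Paperless version.

```python
from typing import Any


def build_patch_payload(
    ocr_text: str,
    notebook_id: str,
    page_number: int,
    *,
    notebook_id_field: int = 1,    # PAPERLESS_CUSTOM_FIELD_NOTEBOOK_ID
    notebook_page_field: int = 2,  # PAPERLESS_CUSTOM_FIELD_NOTEBOOK_PAGE
    document_type_id: int = 3,     # PAPERLESS_DOCUMENT_TYPE_ID
) -> dict[str, Any]:
    """Build a JSON body for patching one uploaded page document.

    Hypothetical helper: the real service may structure this differently.
    """
    return {
        "content": ocr_text,
        "custom_fields": [
            {"field": notebook_id_field, "value": notebook_id},
            {"field": notebook_page_field, "value": page_number},
        ],
        "document_type": document_type_id,
    }


# Example values are made up for illustration.
payload = build_patch_payload("Meeting notes ...", "nb-07", 12)
```

A body like this would typically be sent as a PATCH to the document's endpoint with the Paperless token in the Authorization header.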
Setup
Install deps:
uv sync
Create a .env file (example below) and do not commit it.
Run locally
uv run uvicorn notebook_tools.api:app --reload --host 0.0.0.0 --port 8080
Then open the docs at:
- http://127.0.0.1:8080/docs (same machine)
- http://<your-lan-ip>:8080/docs (other machines on your network)
If other machines still can’t connect, check your macOS firewall and any router/network rules.
Docker
Build and run (pass environment variables with --env-file or -e; the app reads a .env file only if you mount it into the container):
docker build -t notebook-tools:local .
docker run --rm -p 8080:8080 --env-file .env notebook-tools:local
LLAMA_BASE_URL / PAPERLESS_BASE_URL must be reachable from inside the container (use host.docker.internal on Docker Desktop, or your LAN IP, not 127.0.0.1 for services on the host).
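To catch the loopback mistake before starting the container, a small check along these lines can help. It is a standalone sketch, not part of notebook-tools, and the function name points_at_loopback is hypothetical. It only inspects the URL text and does not resolve DNS names.

```python
import ipaddress
from urllib.parse import urlparse


def points_at_loopback(base_url: str) -> bool:
    """Return True if the URL's host is localhost or a loopback IP literal."""
    host = urlparse(base_url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example host.docker.internal); this sketch
        # does not resolve names, so it reports False here.
        return False


for name, url in {
    "LLAMA_BASE_URL": "http://127.0.0.1:9292",
    "PAPERLESS_BASE_URL": "https://paperless.example.com",
}.items():
    if points_at_loopback(url):
        print(f"{name}={url} will not reach the host from inside a container")
```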
CI: on push to main, .gitea/workflows/build-docker.yml builds and pushes the image using the same secrets pattern as your other Gitea repos (DOCKER_REGISTRY, DOCKER_USERNAME, DOCKER_PASSWORD). For Docker Hub, set DOCKER_REGISTRY to docker.io (or leave it as your runner docs recommend).
Example .env
PAPERLESS_BASE_URL="https://paperless.example.com"
PAPERLESS_TOKEN="paste-token-here"
LLAMA_BASE_URL="http://127.0.0.1:9292"
LLAMA_MODEL="ggml-model-q4_k_m"
# Custom field ids in Paperless
PAPERLESS_CUSTOM_FIELD_NOTEBOOK_ID=1
PAPERLESS_CUSTOM_FIELD_NOTEBOOK_PAGE=2
PAPERLESS_DOCUMENT_TYPE_ID=3
# Optional: cap concurrent Paperless uploads (0 = unlimited)
PAPERLESS_UPLOAD_CONCURRENCY=4
# Rendering / OCR knobs
RENDER_DPI=200
OCR_MAX_TOKENS=1024
OCR_TEMPERATURE=0.0
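The "0 = unlimited" rule for PAPERLESS_UPLOAD_CONCURRENCY could be implemented roughly as follows. This is a sketch under assumptions, not the service's actual code: the names parse_upload_concurrency and upload_all are hypothetical, upload_one stands in for whatever coroutine performs a single Paperless upload, and treating an unset variable as 0 is a guess.

```python
import asyncio
import os
from typing import Awaitable, Callable, Mapping, Optional, Sequence, TypeVar

P = TypeVar("P")
R = TypeVar("R")


def parse_upload_concurrency(env: Optional[Mapping[str, str]] = None) -> int:
    """Read the cap from the environment; an unset variable is assumed to be 0."""
    env = os.environ if env is None else env
    return int(env.get("PAPERLESS_UPLOAD_CONCURRENCY", "0"))


async def upload_all(
    pages: Sequence[P],
    upload_one: Callable[[P], Awaitable[R]],
    limit: int,
) -> list:
    """Upload every page concurrently, with at most `limit` in flight.

    A limit of 0 (or less) means no cap at all.
    """
    # Create the semaphore inside the running event loop.
    sem = asyncio.Semaphore(limit) if limit > 0 else None

    async def guarded(page: P) -> R:
        if sem is None:
            return await upload_one(page)
        async with sem:
            return await upload_one(page)

    # gather preserves input order, so results line up with page order.
    return await asyncio.gather(*(guarded(p) for p in pages))
```

OCR would still run one page at a time before this step; only the uploads fan out.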