Refocus classifier on rich extraction and local dedupe only

Steve W
2026-04-09 18:18:35 +00:00
parent a1dcaf9a74
commit 1b2c7db924
7 changed files with 130 additions and 267 deletions

@@ -1,6 +1,6 @@
# email-classifier
FastAPI service that classifies email using a configurable LLM backend, enriches the output for human review, and can upsert Todoist tasks without creating duplicates.
FastAPI service that classifies email using a configurable LLM backend, returns richer structured extraction, and tracks duplicate classifications using fingerprint-based dedupe.
## Environment configuration
@@ -25,11 +25,9 @@ export LLM_API_KEY=your_minimax_key
export LLM_MODEL=MiniMax-M2.7
```
Optional Todoist sync:
Optional local dedupe store path:
```bash
export TODOIST_API_KEY=your_todoist_token
export TODOIST_PROJECT_ID=optional_project_id
export EMAIL_CLASSIFIER_DB_PATH=.data/email_classifier.db
```
@@ -37,9 +35,9 @@ export EMAIL_CLASSIFIER_DB_PATH=.data/email_classifier.db
### POST /classify
Backward-compatible top-level response fields are preserved.
This overhaul is intended to return richer extraction. Top-level compatibility is not required.
Optional request metadata for dedupe and richer sync:
Request example:
```json
{
@@ -47,8 +45,6 @@ Optional request metadata for dedupe and richer sync:
"subject": "Can you review this by Friday?",
"body": "Hi Daniel, please review the attached budget proposal."
},
"message_id": "<abc123@example.com>",
"thread_id": "thread-789",
"from_address": "sender@example.com",
"received_at": "2026-04-09T12:55:00Z",
"provider": "anthropic",
@@ -57,7 +53,7 @@ Optional request metadata for dedupe and richer sync:
}
```
Response now includes optional enrichment and Todoist sync info:
Response example:
```json
{
@@ -80,42 +76,43 @@ Response now includes optional enrichment and Todoist sync info:
"source_signals": ["request", "deadline"],
"dedupe_key": "..."
},
"todoist": {
"status": "created",
"task_id": "1234567890",
"comment_added": false,
"dedupe_match": "none",
"message": null
"dedupe": {
"status": "new",
"seen_count": 1,
"matched_on": "none",
"subject_key": "...",
"fingerprint": "..."
}
}
```
## Dedupe behavior
When Todoist sync is enabled and `needs_action=true`:
- first match by `message_id`
- then by `thread_id`
- then by normalized content fingerprint fallback
The API does not create or update Todoist tasks.
It only returns richer extraction and local dedupe metadata for downstream automation like n8n.
Behavior:
- no existing task: create Todoist task
- existing task, same classification: do not duplicate, mark `unchanged`
- existing task, changed classification/context: update task in place
- add a Todoist comment only when material context changed
Matching strategy:
- normalized subject plus sender-derived `subject_key`
- full content fingerprint fallback based on sender + normalized subject + cleaned body
Statuses:
- `new`: no prior similar email seen
- `duplicate`: same dedupe target and same extracted result as before
- `updated`: matched prior email, but extracted result changed
This is intentionally heuristic, not perfect.
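A minimal sketch of how the matching strategy and statuses described above could fit together. This is not the code in `app/sync.py`: the helper names (`normalize_subject`, `subject_key`, `fingerprint`, `dedupe_status`), the reply-prefix regex, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import re
from typing import Optional

# Hypothetical normalization rules (assumed, not taken from app/sync.py).
_REPLY_PREFIX = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)
_WHITESPACE = re.compile(r"\s+")


def normalize_subject(subject: str) -> str:
    """Strip leading Re:/Fw:/Fwd: prefixes, collapse whitespace, lowercase."""
    stripped = _REPLY_PREFIX.sub("", subject)
    return _WHITESPACE.sub(" ", stripped).strip().lower()


def _digest(*parts: str) -> str:
    # Join with a separator that is unlikely to occur in email text,
    # so ("ab", "c") and ("a", "bc") hash differently.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def subject_key(from_address: str, subject: str) -> str:
    """Coarse key: sender plus normalized subject."""
    return _digest(from_address.strip().lower(), normalize_subject(subject))


def fingerprint(from_address: str, subject: str, body: str) -> str:
    """Fallback key: sender, normalized subject, and whitespace-cleaned body."""
    cleaned_body = _WHITESPACE.sub(" ", body).strip().lower()
    return _digest(
        from_address.strip().lower(), normalize_subject(subject), cleaned_body
    )


def dedupe_status(previous_result: Optional[dict], current_result: dict) -> str:
    """Map a prior extraction (or None) and the current one to a status."""
    if previous_result is None:
        return "new"
    return "duplicate" if previous_result == current_result else "updated"
```

Under these assumptions, a reply such as `Re: Can you review this by Friday?` from the same sender produces the same `subject_key` as the original message, while the `fingerprint` changes whenever the cleaned body differs.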
## Architecture
- `app/classifier.py`: classification orchestration and Todoist sync handoff
- `app/classifier.py`: classification orchestration and dedupe handoff
- `app/prompts.py`: richer extraction prompt
- `app/sync.py`: dedupe, task rendering, Todoist upsert logic
- `app/dedupe_store.py`: SQLite-backed mapping store
- `app/todoist.py`: Todoist REST client
- `app/sync.py`: subject normalization, fingerprinting, dedupe application
- `app/dedupe_store.py`: SQLite-backed dedupe store
- `app/llm_adapters.py`: provider adapters
- `app/config.py`: LLM settings
## Notes
- `/classify` remains backward compatible at the top level.
- New request metadata fields are optional.
- Todoist sync safely no-ops when `TODOIST_API_KEY` is not configured.
- SQLite is used for lightweight production-safe dedupe tracking.
- No Todoist integration lives in this API.
- Dedupe is best-effort and designed to help downstream workflows avoid obvious duplicates.
- SQLite is used for lightweight local dedupe tracking.
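For illustration only, a SQLite-backed dedupe store along the lines described in the notes might look like the sketch below. The class name `DedupeStore`, the `seen` table schema, and the `record` method are hypothetical and are not taken from `app/dedupe_store.py`; the sketch defaults to an in-memory database, whereas the service would presumably open the path given by `EMAIL_CLASSIFIER_DB_PATH`.

```python
import json
import sqlite3
from typing import Tuple


class DedupeStore:
    """Hypothetical minimal store; schema and method names are assumptions."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS seen ("
            "fingerprint TEXT PRIMARY KEY, "
            "result_json TEXT NOT NULL, "
            "seen_count INTEGER NOT NULL)"
        )

    def record(self, fingerprint: str, result: dict) -> Tuple[str, int]:
        """Store the latest extraction and return (status, seen_count)."""
        # Serialize with sorted keys so equal dicts compare equal as text.
        result_json = json.dumps(result, sort_keys=True)
        row = self._conn.execute(
            "SELECT result_json, seen_count FROM seen WHERE fingerprint = ?",
            (fingerprint,),
        ).fetchone()
        if row is None:
            status, seen_count = "new", 1
        else:
            status = "duplicate" if row[0] == result_json else "updated"
            seen_count = row[1] + 1
        self._conn.execute(
            "INSERT OR REPLACE INTO seen (fingerprint, result_json, seen_count) "
            "VALUES (?, ?, ?)",
            (fingerprint, result_json, seen_count),
        )
        self._conn.commit()
        return status, seen_count
```

Note that `sqlite3.connect` creates the database file but not its parent directory, so a file path such as `.data/email_classifier.db` requires the `.data` directory to exist first.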