Use Outlook ids for classifier dedupe precedence
97
README.md
@@ -1,6 +1,6 @@
# email-classifier

FastAPI service that classifies email using a configurable LLM backend, returns richer structured extraction, and tracks duplicate classifications using fingerprint-based dedupe.
FastAPI service that classifies email using a configurable LLM backend, returns richer structured extraction, and tracks duplicate classifications using Outlook-aware dedupe.

## Environment configuration

@@ -31,88 +31,101 @@ Optional local dedupe store path:
export EMAIL_CLASSIFIER_DB_PATH=.data/email_classifier.db
```

## API
## Input shape

### POST /classify

This overhaul is intended to return richer extraction. Top-level compatibility is not required.

Request example:
Designed around real Outlook message payloads. Relevant fields:

```json
{
  "id": "AAMk...",
  "internetMessageId": "<...@...>",
  "conversationId": "AAQk...",
  "subject": "MB Printer",
  "bodyPreview": "Good morning, ...",
  "receivedDateTime": "2026-02-19T15:27:35Z",
  "sentDateTime": "2026-02-19T15:27:32Z",
  "hasAttachments": false,
  "importance": "normal",
  "isRead": false,
  "body": {
    "contentType": "html",
    "content": "..."
  }
}
```

API request example:

```json
{
  "id": "AAMk...",
  "internetMessageId": "<...@...>",
  "conversationId": "AAQk...",
  "bodyPreview": "Good morning, ...",
  "receivedDateTime": "2026-02-19T15:27:35Z",
  "sentDateTime": "2026-02-19T15:27:32Z",
  "hasAttachments": false,
  "importance": "normal",
  "isRead": false,
  "email_data": {
    "subject": "Can you review this by Friday?",
    "body": "Hi Daniel, please review the attached budget proposal."
    "subject": "MB Printer",
    "body": "<html>...</html>"
  },
  "from_address": "sender@example.com",
  "received_at": "2026-04-09T12:55:00Z",
  "provider": "anthropic",
  "base_url": "https://api.minimax.io/anthropic",
  "model": "MiniMax-M2.7"
}
```
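
As a rough illustration of how a caller might assemble this request from an Outlook message, the sketch below passes the Outlook identifiers through unchanged and nests the subject and body content under `email_data`. The helper name `build_classify_request`, its parameters, and the `PASSTHROUGH_FIELDS` tuple are hypothetical and not part of the service; only the field names are taken from the examples above.

```python
from typing import Any

# Outlook fields that the API request example carries over unchanged.
PASSTHROUGH_FIELDS = (
    "id", "internetMessageId", "conversationId", "bodyPreview",
    "receivedDateTime", "sentDateTime", "hasAttachments", "importance", "isRead",
)


def build_classify_request(
    message: dict[str, Any],
    from_address: str,
    received_at: str,
    **llm_settings: str,
) -> dict[str, Any]:
    """Hypothetical helper: map an Outlook message payload to a /classify body."""
    # Copy only the passthrough fields that are actually present.
    request: dict[str, Any] = {
        key: message[key] for key in PASSTHROUGH_FIELDS if key in message
    }
    # Subject and body content move under email_data, as in the request example.
    request["email_data"] = {
        "subject": message.get("subject", ""),
        "body": message.get("body", {}).get("content", ""),
    }
    request["from_address"] = from_address
    request["received_at"] = received_at
    # Optional LLM settings such as provider, base_url, model.
    request.update(llm_settings)
    return request
```

The resulting dict could then be sent as the JSON body of `POST /classify` with whatever HTTP client the caller already uses.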

Response example:
## Response example

```json
{
  "needs_action": true,
  "category": "question",
  "priority": "high",
  "task_description": "Review the budget proposal and respond by Friday",
  "reasoning": "Direct request with a deadline requires follow-up",
  "task_description": "Investigate MB Printer issue and reply",
  "reasoning": "The email appears to describe an issue requiring action.",
  "confidence": 0.91,
  "details": {
    "summary": "Budget proposal review requested with Friday deadline.",
    "suggested_title": "Review budget proposal and respond by Friday",
    "suggested_notes": "Requester asked for feedback on attached budget proposal before Friday.",
    "deadline": "Friday",
    "people": ["Daniel"],
    "summary": "Printer issue reported in the MB area.",
    "suggested_title": "Handle MB Printer issue",
    "suggested_notes": "Review the printer problem, identify urgency, and reply with next steps.",
    "deadline": null,
    "people": [],
    "organizations": [],
    "attachments_referenced": ["budget proposal"],
    "next_steps": ["Review attachment", "Reply with feedback"],
    "key_points": ["Deadline is Friday"],
    "source_signals": ["request", "deadline"],
    "attachments_referenced": [],
    "next_steps": ["Review issue", "Respond to sender"],
    "key_points": ["Printer issue reported"],
    "source_signals": ["request"],
    "dedupe_key": "..."
  },
  "dedupe": {
    "status": "new",
    "seen_count": 1,
    "matched_on": "none",
    "subject_key": "...",
    "message_id": "AAMk...",
    "conversation_id": "AAQk...",
    "fingerprint": "..."
  }
}
```

## Dedupe behavior
## Dedupe precedence

The API does not create or update Todoist tasks.
It only returns richer extraction and local dedupe metadata for downstream automation like n8n.

Matching strategy:
- normalized subject plus sender-derived `subject_key`
- full content fingerprint fallback based on sender + normalized subject + cleaned body
1. `id` for exact Outlook message match
2. `conversationId` for thread grouping
3. normalized subject + preview/body fingerprint fallback
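
The three-step precedence can be sketched roughly as follows. This is a minimal illustration, not the code in `app/sync.py`: the function names, the in-memory sets, the subject normalization rules (stripping `Re:`/`Fw:`/`Fwd:` prefixes, collapsing whitespace, lowercasing), and the SHA-256 fingerprint are all assumptions. Of the returned labels, only `"none"` appears in the response example; `"id"`, `"conversation_id"`, and `"fingerprint"` are placeholders.

```python
import hashlib
import re


def normalize_subject(subject: str) -> str:
    """Assumed normalization: drop reply/forward prefixes, collapse spaces, lowercase."""
    stripped = re.sub(r"^\s*((re|fw|fwd)\s*:\s*)+", "", subject, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", stripped).strip().lower()


def fallback_fingerprint(subject: str, text: str) -> str:
    """Hash of the normalized subject plus whitespace-collapsed preview/body text."""
    cleaned = re.sub(r"\s+", " ", text).strip().lower()
    payload = normalize_subject(subject) + "\n" + cleaned
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def match_prior(
    message: dict,
    seen_ids: set,
    seen_conversations: set,
    seen_fingerprints: set,
) -> str:
    """Return which key matched a previously seen email, in precedence order."""
    # 1. Exact Outlook message id.
    if message.get("id") in seen_ids:
        return "id"
    # 2. Same Outlook conversation (thread grouping).
    if message.get("conversationId") in seen_conversations:
        return "conversation_id"
    # 3. Heuristic fallback on normalized subject + preview text.
    fingerprint = fallback_fingerprint(
        message.get("subject", ""), message.get("bodyPreview", "")
    )
    if fingerprint in seen_fingerprints:
        return "fingerprint"
    return "none"
```

Checking the exact identifiers first means the heuristic fingerprint only decides the outcome when Outlook ids are missing or unseen.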

Statuses:
- `new`: no prior similar email seen
- `duplicate`: same dedupe target and same extracted result as before
- `updated`: matched prior email, but extracted result changed
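
One way to read the three statuses is as a comparison between the stored result for a matched email and the freshly extracted one. The sketch below assumes the prior result is available as a dict (or `None` when nothing matched); the function name `dedupe_status` and the use of plain dict equality are illustrative assumptions, not the service's actual comparison.

```python
from typing import Optional


def dedupe_status(previous_result: Optional[dict], current_result: dict) -> str:
    """Map a prior/current extraction pair to new, duplicate, or updated."""
    if previous_result is None:
        # No prior similar email was matched.
        return "new"
    if previous_result == current_result:
        # Same dedupe target and the same extracted result as before.
        return "duplicate"
    # Matched a prior email, but the extraction changed.
    return "updated"
```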

This is intentionally heuristic, not perfect.

## Architecture

- `app/classifier.py`: classification orchestration and dedupe handoff
- `app/prompts.py`: richer extraction prompt
- `app/sync.py`: subject normalization, fingerprinting, dedupe application
- `app/dedupe_store.py`: SQLite-backed dedupe store
- `app/llm_adapters.py`: provider adapters
- `app/config.py`: LLM settings
This is intentionally heuristic for the fallback path.

## Notes

- No Todoist integration lives in this API.
- Dedupe is best-effort and designed to help downstream workflows avoid obvious duplicates.
- Dedupe is local and intended to help downstream workflows avoid obvious duplicates.
- SQLite is used for lightweight local dedupe tracking.
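
For a sense of what lightweight local tracking with SQLite can look like, here is a minimal sketch using Python's standard `sqlite3` module. The table name `seen`, its columns, and the helpers `open_store` and `record_seen` are hypothetical; the real schema in `app/dedupe_store.py` is not shown in this diff and may differ.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a dedupe store; the schema here is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen ("
        "dedupe_key TEXT PRIMARY KEY, "
        "seen_count INTEGER NOT NULL)"
    )
    return conn


def record_seen(conn: sqlite3.Connection, dedupe_key: str) -> int:
    """Increment and return how many times this key has been seen."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "UPDATE seen SET seen_count = seen_count + 1 WHERE dedupe_key = ?",
            (dedupe_key,),
        )
        if cursor.rowcount == 0:
            # First sighting of this key.
            conn.execute(
                "INSERT INTO seen (dedupe_key, seen_count) VALUES (?, 1)",
                (dedupe_key,),
            )
    row = conn.execute(
        "SELECT seen_count FROM seen WHERE dedupe_key = ?", (dedupe_key,)
    ).fetchone()
    return row[0]
```

Passing a file path such as the value of `EMAIL_CLASSIFIER_DB_PATH` instead of `:memory:` would persist the counts between runs.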