[T15] T15 — KB Engine Architecture Gotchas

March 15, 2026 by Viet Nguyen

⚠️ HIGH | Category: Odoo Tips | ID: T15 | Owner: CEO

Summary

Nine gotchas from the six phases of building the KB Engine (FTS5 + ChromaDB + Neo4j). Each one caused a real bug during the build. Read this before modifying or extending the KB Engine.

---

1. doc_id collision: same basename, different directories

Problem: Several files named 2026-03-07.md live in different directories (sessions, intelligence, meetings). Using the basename as doc_id makes them overwrite one another in the index.

Fix: Use the format grandparent-parent:basename:

from pathlib import Path

# BAD: collides across directories
doc_id = Path(file).stem  # "2026-03-07"

# GOOD: unique per directory
parts = Path(file).parts
doc_id = f"{parts[-3]}-{parts[-2]}:{parts[-1]}"
# ".../sessions/company_hq/2026-03-07.md" → "sessions-company_hq:2026-03-07.md"

If the file has a frontmatter id: (e.g., T01), use the frontmatter id and ignore the path-based id.
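The two rules above (frontmatter id first, path-based id otherwise) can be combined into one helper. This is a minimal sketch: the names make_doc_id and frontmatter_id, the frontmatter regex, and the fallback for very short paths are illustrative assumptions, not the KB Engine's actual code.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: frontmatter is a leading block delimited by lines of "---".
FRONTMATTER = re.compile(r"\A---\s*\n(.*?)\n---", re.DOTALL)


def frontmatter_id(text: str) -> Optional[str]:
    """Return the value of `id:` from a leading frontmatter block, if present."""
    match = FRONTMATTER.match(text)
    if not match:
        return None
    for line in match.group(1).splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "id" and value.strip():
            return value.strip()
    return None


def make_doc_id(path: str, text: str = "") -> str:
    """Prefer the frontmatter id; otherwise build grandparent-parent:basename."""
    explicit = frontmatter_id(text)
    if explicit:
        return explicit
    parts = Path(path).parts
    if len(parts) < 3:
        # Hypothetical fallback for shallow paths: use the whole path.
        return Path(path).as_posix()
    return f"{parts[-3]}-{parts[-2]}:{parts[-1]}"


session_id = make_doc_id("kb/sessions/company_hq/2026-03-07.md")
meeting_id = make_doc_id("kb/meetings/company_hq/2026-03-07.md")
tip_id = make_doc_id("kb/tips/odoo/T01.md", "---\nid: T01\ntitle: x\n---\nbody")
```

Two files with the same basename now get distinct ids, while a curated entry keeps the short id from its frontmatter.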

---

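2. FTS5 Vietnamese: keep the diacritics

Problem: The default FTS5 tokenizer setting, remove_diacritics 1, strips diacritics, so "thuế" = "thue" and "phạt" = "phat". Vietnamese search loses precision because words that differ only by their marks become indistinguishable.

Fix: Create the FTS5 table with remove_diacritics 0: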

CREATE VIRTUAL TABLE kb_fts USING fts5(
    doc_id, title, content,
    tokenize='unicode61 remove_diacritics 0'
);

BM25 scores from FTS5 are negative (a more negative value means a better match), so take abs() before applying a boost.
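The effect of the tokenizer option can be checked with Python's built-in sqlite3 module, assuming the bundled SQLite was compiled with FTS5. The sketch below builds the same kb_fts table twice, once with remove_diacritics 0 and once with the default tokenizer, and applies abs() to the BM25 score. The sample rows and helper names are made up for illustration.

```python
import sqlite3
import unicodedata

# Illustrative rows: (doc_id, title, content)
ROWS = [
    ("d1", "ma", "ma"),
    ("d2", "má", "má"),
    ("d3", "thuế", "thuế"),
    ("d4", "phạt", "phạt"),
    ("d5", "hóa đơn", "hóa đơn"),
]


def build_index(tokenize: str) -> sqlite3.Connection:
    """Create an in-memory kb_fts table using the given tokenizer spec."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE kb_fts USING fts5("
        f"doc_id, title, content, tokenize='{tokenize}')"
    )
    # NFC keeps each accented letter as a single precomposed code point.
    rows = [tuple(unicodedata.normalize("NFC", v) for v in row) for row in ROWS]
    conn.executemany("INSERT INTO kb_fts VALUES (?, ?, ?)", rows)
    return conn


def search(conn: sqlite3.Connection, query: str):
    """Return (doc_id, score) pairs, best first, with abs() applied to BM25."""
    rows = conn.execute(
        "SELECT doc_id, bm25(kb_fts) FROM kb_fts "
        "WHERE kb_fts MATCH ? ORDER BY bm25(kb_fts)",
        (query,),
    ).fetchall()
    return [(doc_id, abs(raw)) for doc_id, raw in rows]


strict = build_index("unicode61 remove_diacritics 0")
loose = build_index("unicode61")  # default tokenizer: remove_diacritics 1

strict_hits = [doc_id for doc_id, _ in search(strict, "ma")]
loose_hits = [doc_id for doc_id, _ in search(loose, "ma")]
```

With diacritics kept, a query for "ma" matches only the unaccented document. With the default tokenizer it also matches "má".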

---

3. ChromaDB embedding: MUST use pre-computed embeddings

Problem: When you pass documents=, ChromaDB embeds them with its default model, all-MiniLM-L6-v2 (384 dims, English-only). That model is a poor fit for Vietnamese.

Fix: Use intfloat/multilingual-e5-base (768 dims), compute the embeddings yourself, and pass them to ChromaDB:

# Assumes model is a sentence-transformers SentenceTransformer("intfloat/multilingual-e5-base")

# BAD: ChromaDB falls back to its default MiniLM model
collection.add(ids=[doc_id], documents=[text])

# GOOD: pre-computed e5-base embeddings
embedding = model.encode("passage: " + text)
collection.add(ids=[doc_id], embeddings=[embedding.tolist()])

E5 prefix requirement (dropping the prefix noticeably degrades quality):

Documents: "passage: " + text

Queries: "query: " + text

The first embedding run over ~94 docs takes ~59s (model load + encode). Incremental updates take <200ms per doc.
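Forgetting the prefix on one side (usually the query) is an easy mistake, so it can help to route every call through two small wrappers. This is a sketch only: the Encoder type and the function names are hypothetical, and the encoder is passed in so the wrappers can be tested without loading the model.

```python
from typing import Callable, List, Sequence

# Hypothetical encoder signature: a batch of strings in, one vector per string out.
Encoder = Callable[[List[str]], List[List[float]]]

PASSAGE_PREFIX = "passage: "
QUERY_PREFIX = "query: "


def embed_passages(encode: Encoder, texts: Sequence[str]) -> List[List[float]]:
    """Embed documents for indexing, adding the E5 passage prefix."""
    return encode([PASSAGE_PREFIX + text for text in texts])


def embed_query(encode: Encoder, text: str) -> List[float]:
    """Embed a search query, adding the E5 query prefix."""
    return encode([QUERY_PREFIX + text])[0]


# Stand-in encoder that records what it receives; a real one would call the model.
seen: List[str] = []


def fake_encode(batch: List[str]) -> List[List[float]]:
    seen.extend(batch)
    return [[float(len(item))] for item in batch]


doc_vectors = embed_passages(fake_encode, ["thuế GTGT"])
query_vector = embed_query(fake_encode, "phạt chậm nộp")
```

With sentence-transformers, the encoder could be something like `lambda batch: model.encode(batch).tolist()`, assuming model is the loaded e5-base instance.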

---

4. Score normalization: cosine distance → similarity

Problem: ChromaDB returns a (cosine) distance, not a similarity. FTS5 returns a BM25 score on a different scale. The two cannot be compared directly.

Fix: Normalize to a 0-10 scale:

# ChromaDB cosine distance → similarity → scale 0-10
similarity = 1 - distance / 2    # cosine distance [0,2] → similarity [0,1]
semantic_score = similarity * 10  # scale to 0-10 so it is comparable with BM25

# FTS5 BM25 → abs(), then use directly (already on a suitable scale)
bm25_score = abs(raw_bm25_score)
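The conversion above can be wrapped in two small functions. One caveat worth checking: the [0, 2] range only holds when the collection really uses cosine distance. ChromaDB collections have historically defaulted to squared L2 unless created with metadata={"hnsw:space": "cosine"}. The function names and the clamping step below are assumptions added for illustration.

```python
def semantic_score(distance: float) -> float:
    """Map a cosine distance in [0, 2] to a 0-10 score (10 = identical direction)."""
    # Clamp to guard against tiny floating-point overshoot outside [0, 2].
    distance = min(max(distance, 0.0), 2.0)
    similarity = 1.0 - distance / 2.0
    return similarity * 10.0


def keyword_score(raw_bm25: float) -> float:
    """FTS5 reports BM25 as a negative number; flip it to a positive score."""
    return abs(raw_bm25)


identical = semantic_score(0.0)
orthogonal = semantic_score(1.0)
opposite = semantic_score(2.0)
```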

---

5. Neo4j graph normalization

Problem: Graph relevance scores are not on the same scale as the FTS5 and semantic scores.

Fix: Normalize by the maximum:

graph_score = relevance / max_relevance * 10

The graph weight is 0.20 in the 3-way hybrid (FTS5: 0.30, semantic: 0.50, graph: 0.20).
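A minimal sketch of how the max-normalization and the three weights might fit together. The function names are hypothetical, and treating a missing signal as 0 and guarding against a zero maximum are assumptions, not documented behavior of search.py.

```python
from typing import Dict

# Weights as stated above.
WEIGHTS = {"fts5": 0.30, "semantic": 0.50, "graph": 0.20}


def normalize_graph(relevance_by_doc: Dict[str, float]) -> Dict[str, float]:
    """Scale raw graph relevance to 0-10 by dividing by the maximum."""
    max_relevance = max(relevance_by_doc.values(), default=0.0)
    if max_relevance <= 0:
        # Avoid division by zero when the graph returned nothing useful.
        return {doc_id: 0.0 for doc_id in relevance_by_doc}
    return {
        doc_id: relevance / max_relevance * 10
        for doc_id, relevance in relevance_by_doc.items()
    }


def hybrid_score(fts5: float = 0.0, semantic: float = 0.0, graph: float = 0.0) -> float:
    """Weighted sum of the three signals, each assumed to be on a 0-10 scale."""
    return (
        WEIGHTS["fts5"] * fts5
        + WEIGHTS["semantic"] * semantic
        + WEIGHTS["graph"] * graph
    )


graph_scores = normalize_graph({"T01": 4.0, "T10": 1.0})
combined = hybrid_score(fts5=6.0, semantic=8.0, graph=graph_scores["T01"])
```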

---

6. 3-signal corroboration: bonus when multiple sources agree

Docs found by all 3 methods (keyword + semantic + graph) are more trustworthy. Apply a bonus:

| Signals | Bonus | Match type |
|---------|-------|------------|
| 1 | ×1.0 | keyword / semantic / graph |
| 2 | ×1.3 | hybrid / keyword+graph / semantic+graph |
| 3 | ×1.4 | hybrid+graph |

Config: multi_signal_bonus: 0.15 per additional signal.
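As an illustration, the multipliers can be applied with a simple lookup keyed by the number of agreeing signals. The sketch copies the multipliers from the table as-is and does not derive them from multi_signal_bonus, since the exact formula linking the two is not given here. The function and constant names are hypothetical.

```python
from typing import Iterable

# Multipliers copied from the table, keyed by the number of agreeing signals.
CORROBORATION_BONUS = {1: 1.0, 2: 1.3, 3: 1.4}
KNOWN_SIGNALS = {"keyword", "semantic", "graph"}


def corroborated_score(base_score: float, signals: Iterable[str]) -> float:
    """Multiply the base score by the bonus for the number of distinct signals."""
    hits = KNOWN_SIGNALS & set(signals)
    if not hits:
        return 0.0  # no retrieval method found the doc
    return base_score * CORROBORATION_BONUS[len(hits)]


single = corroborated_score(5.0, ["semantic"])
double = corroborated_score(5.0, ["keyword", "semantic"])
triple = corroborated_score(5.0, ["keyword", "semantic", "graph"])
```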

---

7. External docs boost 0.6x vs KB internal 3.0x

By design: internal KB entries (curated, verified) always rank above external docs (scraped, noisy).

# config.yaml boost.type
kb: 3.0        # KB entries — highest value
meeting: 1.5   # Meeting notes — decisions
session: 1.2   # Sessions — verbose but contextual
intel: 1.0     # Intelligence — baseline
external: 0.6  # External — noisy, lowest

If an external doc is important, convert it into a KB entry instead of raising its boost.
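To see what these boosts do to ranking, the sketch below applies them to two hypothetical results. The boost values mirror the config above. The function name, the sample results, and the neutral 1.0 default for unknown types are assumptions.

```python
from typing import List, Tuple

# Mirrors boost.type from config.yaml as listed above.
TYPE_BOOST = {"kb": 3.0, "meeting": 1.5, "session": 1.2, "intel": 1.0, "external": 0.6}


def boosted_score(score: float, doc_type: str) -> float:
    """Apply the per-type boost; unknown types are left unchanged (assumption)."""
    return score * TYPE_BOOST.get(doc_type, 1.0)


# Hypothetical results: (doc_id, raw_score, doc_type)
results: List[Tuple[str, float, str]] = [
    ("ext-article", 8.0, "external"),
    ("T01", 2.0, "kb"),
]

ranked = sorted(results, key=lambda r: boosted_score(r[1], r[2]), reverse=True)
```

Even with a raw score four times lower, the KB entry ranks first (6.0 vs 4.8), which is the intended bias.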

---

8. trafilatura > BeautifulSoup for web scraping

Problem: BeautifulSoup gives you the entire HTML, so you have to strip boilerplate, nav, ads, and footers yourself. That costs code and is imprecise.

Fix: Use trafilatura, which detects the main content and drops the boilerplate automatically:

# BAD: manual cleanup
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
text = soup.get_text()  # includes menus, footers, ads...

# GOOD: automatic content extraction
import trafilatura
text = trafilatura.extract(html)  # main content only

Combine with normalizer.py to auto-tag and generate frontmatter for ingested docs.
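One detail to handle in the ingest path: trafilatura.extract returns None when it finds no main content, so the result should be checked before indexing. Below is a small wrapper sketch. The name extract_main_text and the injectable extract parameter are illustrative, added so the helper can be exercised without the library installed.

```python
from typing import Callable, Optional


def extract_main_text(
    html: str,
    extract: Optional[Callable[[str], Optional[str]]] = None,
) -> str:
    """Return the page's main text, or an empty string if nothing was extracted."""
    if extract is None:
        # Imported lazily so the wrapper itself has no hard dependency.
        import trafilatura

        extract = trafilatura.extract
    text = extract(html)
    return text.strip() if text else ""


# Stand-in extractors for demonstration purposes.
empty = extract_main_text("<html></html>", extract=lambda html: None)
body = extract_main_text("<html>...</html>", extract=lambda html: "  Main body \n")
```

An empty return value is a reasonable signal to skip the page rather than index a blank document.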

---

9. Watchdog debounce 5s for batch reindex

Problem: The file watcher triggers a reindex every time a single file changes. A git pull or commit fires dozens of events in a row, so reindex runs pile up.

Fix: Debounce for 5 seconds, merging all changes inside the window into one batch reindex:

# watcher.py pattern
import time

DEBOUNCE_SECONDS = 5
pending_changes = set()
last_change_time = None

def on_modified(event):
    global last_change_time  # without this, the assignment only creates a local
    pending_changes.add(event.src_path)
    last_change_time = time.time()

def check_and_reindex():
    if pending_changes and (time.time() - last_change_time) >= DEBOUNCE_SECONDS:
        batch = set(pending_changes)  # snapshot before clearing
        pending_changes.clear()
        reindex_files(batch)  # SQLite + ChromaDB + Neo4j

A cron job (0 */6 * * *) runs a full reindex every 6 hours as a fallback.
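Because watchdog delivers events on its own observer thread, a lock around the shared state is a sensible addition. The class below is a sketch of the same debounce idea under that assumption. The Debouncer name, the injectable clock, and the poll method are illustrative and not taken from watcher.py.

```python
import threading
import time
from typing import Callable, Optional, Set


class Debouncer:
    """Collect changed paths and flush them once no change arrived for `delay` seconds."""

    def __init__(
        self,
        reindex: Callable[[Set[str]], None],
        delay: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._reindex = reindex
        self._delay = delay
        self._clock = clock
        self._pending: Set[str] = set()
        self._last_change: Optional[float] = None
        self._lock = threading.Lock()

    def notify(self, path: str) -> None:
        """Call from the file-watcher callback for every changed file."""
        with self._lock:
            self._pending.add(path)
            self._last_change = self._clock()

    def poll(self) -> bool:
        """Call periodically; returns True if a batch reindex was triggered."""
        with self._lock:
            if not self._pending or self._last_change is None:
                return False
            if self._clock() - self._last_change < self._delay:
                return False
            batch = set(self._pending)
            self._pending.clear()
        # Run the slow reindex outside the lock so new events are not blocked.
        self._reindex(batch)
        return True


# Demonstration with a fake clock instead of real waiting.
now = [0.0]
batches = []
debouncer = Debouncer(batches.append, delay=5.0, clock=lambda: now[0])
debouncer.notify("a.md")
now[0] = 1.0
debouncer.notify("b.md")
now[0] = 3.0
too_early = debouncer.poll()
now[0] = 6.5
flushed = debouncer.poll()
```

Each new event resets the window, so a burst of git events produces a single batch once the burst has been quiet for the full delay.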

---

Related

T01: search_read limit trap (same pattern: a default parameter causes a silent bug)

T10: KB Intelligence Roadmap (phase planning for the KB Engine)

Config: /root/company_hq/.kb-engine/config.yaml

Codebase: /root/company_hq/.kb-engine/ (search.py, vectorstore.py, graph.py, watcher.py, ingest.py)

📚 Published from Company Knowledge Base — T15
Last updated: 2026-03-14
Review by: 2026-06-12
