Problem
Teams lose hours searching wikis, PDFs, and past tickets. The goal: an assistant that answers questions from private company knowledge accurately, with citations — and says "I don't know" instead of hallucinating.
Architecture
Ingestion pipeline (parsing, semantic chunking, embedding into pgvector) feeding a FastAPI retrieval service with hybrid search (vector + keyword) and reranking. Claude generates grounded answers constrained to retrieved context, streaming into a Next.js chat UI with inline citations.
Challenges
- Chunking quality: naive fixed-size chunks destroyed answer quality; heading-aware semantic chunking fixed it.
- Groundedness: an answer-verification step checks claims against retrieved passages before responding.
Performance
High answer-acceptance rate in internal testing with strict citation coverage. (Draft metrics — replace with real numbers.)
Lessons Learned
Retrieval quality dominates model quality: most "LLM errors" were actually search errors.
Future Improvements
Multi-source connectors, conversational memory, and automated eval suite on a golden question set.