The Problem: ERP Data is Locked in Rows
ERPNext stores enormous amounts of business knowledge: every purchase order, sales pattern, and vendor performance metric. But extracting insight requires writing reports or knowing exactly which report and filters to use. Business users don't want to learn report filters; they want to ask "Which supplier had the most quality rejections last quarter?" in plain English.
Our Architecture
We built a RAG (Retrieval-Augmented Generation) pipeline with three components:
- Data Layer: A nightly sync job pulls key DocTypes (Sales Invoice, Purchase Order, Stock Ledger, Quality Inspection) from ERPNext's REST API and stores them in PostgreSQL with pgvector for semantic search.
- Retrieval Layer: LangChain orchestrates hybrid search, combining semantic similarity for context retrieval with metadata filters for date ranges and document types.
- Generation Layer: GPT-4o generates answers grounded in the retrieved context, with citations back to specific ERPNext document numbers.
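To make the retrieval step concrete, here is a minimal sketch of the hybrid idea: filter on metadata first, then rank the survivors by semantic similarity. It is a pure-Python stand-in, not our production code, which runs against pgvector through LangChain. The `Chunk` class, the `hybrid_search` function, the document numbers, and the two-dimensional toy embeddings are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from math import sqrt


@dataclass
class Chunk:
    doc_name: str           # ERPNext document number (hypothetical values below)
    doctype: str            # e.g. "Quality Inspection"
    posting_date: date
    text: str
    embedding: list[float]  # toy vector; real embeddings have many more dimensions


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(chunks, query_embedding, doctype, start, end, k=3):
    """Filter by document type and date range, then rank by similarity."""
    candidates = [
        c for c in chunks
        if c.doctype == doctype and start <= c.posting_date <= end
    ]
    ranked = sorted(
        candidates,
        key=lambda c: cosine_similarity(c.embedding, query_embedding),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical sample data for a question about Q1 2024 quality inspections.
chunks = [
    Chunk("QI-0001", "Quality Inspection", date(2024, 2, 10), "Rejected: supplier A", [1.0, 0.0]),
    Chunk("QI-0002", "Quality Inspection", date(2023, 11, 5), "Rejected: supplier B", [1.0, 0.0]),
    Chunk("PO-0001", "Purchase Order", date(2024, 2, 12), "Order to supplier A", [0.9, 0.1]),
    Chunk("QI-0003", "Quality Inspection", date(2024, 3, 1), "Accepted: supplier C", [0.0, 1.0]),
]
hits = hybrid_search(
    chunks, [1.0, 0.0], "Quality Inspection",
    date(2024, 1, 1), date(2024, 3, 31), k=2,
)
print([c.doc_name for c in hits])  # ['QI-0001', 'QI-0003']
```

The metadata filter removes the out-of-range inspection and the Purchase Order before any similarity is computed, so a semantically close but irrelevant document cannot crowd out the right ones. The `doc_name` carried on each hit is what lets the generation step cite a specific document number.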
Key Technical Challenges
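ERPNext's data model uses many child tables: the line items of a Sales Invoice, for example, live in separate rows linked to the parent document. We flatten these into document-level text chunks before embedding. We also found that chunking by transaction (one Sales Invoice = one chunk including all items) gave better retrieval accuracy than arbitrary text splitting, because each chunk keeps a transaction's header and line items together.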
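The flattening step can be sketched roughly as follows. This is an illustration under assumptions rather than our exact code: the function name `flatten_sales_invoice` and the text layout are hypothetical, and the input is assumed to be a Sales Invoice already fetched as a dictionary with its `items` child rows nested inside it.

```python
def flatten_sales_invoice(doc: dict) -> str:
    """Turn one Sales Invoice and its item rows into a single text chunk."""
    # Header line carries the parent-level fields.
    header = (
        f"Sales Invoice {doc['name']} | customer: {doc['customer']} | "
        f"date: {doc['posting_date']} | total: {doc['grand_total']}"
    )
    # One line per child-table row, so items stay with their invoice.
    item_lines = [
        f"- {row['item_code']}: qty {row['qty']} at rate {row['rate']} "
        f"(amount {row['amount']})"
        for row in doc.get("items", [])
    ]
    return "\n".join([header, *item_lines])


# Hypothetical invoice used only to show the output shape.
invoice = {
    "name": "SINV-0001",
    "customer": "Acme Ltd",
    "posting_date": "2024-02-10",
    "grand_total": 250.0,
    "items": [
        {"item_code": "WIDGET", "qty": 2, "rate": 100.0, "amount": 200.0},
        {"item_code": "BOLT", "qty": 10, "rate": 5.0, "amount": 50.0},
    ],
}
chunk_text = flatten_sales_invoice(invoice)
print(chunk_text)
```

Because the whole transaction ends up in one chunk, a question about a specific item retrieves the invoice number, customer, and date along with it, rather than an orphaned fragment of a line-item list.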
Results
The assistant answers 80% of business queries accurately without any report writing. The remaining 20% require clarification, which we handle with a follow-up prompt mechanism. Query latency is under 3 seconds for most questions.