The Problem: ERP Data is Locked in Rows

ERPNext stores enormous amounts of business knowledge: every purchase order, sales pattern, and vendor performance metric. But extracting insight means writing reports or knowing exactly which filters to apply. Business users don't want to learn report filters; they want to ask "Which supplier had the most quality rejections last quarter?" in plain English.

Our Architecture

We built a RAG (Retrieval-Augmented Generation) pipeline with three components:

  1. Data Layer: A nightly sync job pulls key DocTypes (Sales Invoice, Purchase Order, Stock Ledger, Quality Inspection) from ERPNext's REST API and stores them in PostgreSQL with pgvector for semantic search.
  2. Retrieval Layer: LangChain orchestrates hybrid search, combining semantic similarity for context retrieval with metadata filters for date ranges and document types.
  3. Generation Layer: GPT-4o generates answers grounded in the retrieved context, with citations back to specific ERPNext document numbers.
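
To make the retrieval step concrete, here is a minimal sketch of the kind of query that combines a pgvector similarity search with metadata filters. It shows the underlying SQL rather than the LangChain calls we actually use, and the table name erp_chunks, its columns (docname, doctype, posting_date, content, embedding), and the function names are assumptions made for illustration. The only pgvector-specific piece is the <=> operator, which computes cosine distance.

```python
from datetime import date
from typing import List, Optional, Sequence, Tuple


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format floats as a pgvector text literal such as '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_hybrid_query(
    embedding: Sequence[float],
    doctype: Optional[str] = None,
    date_from: Optional[date] = None,
    date_to: Optional[date] = None,
    k: int = 5,
) -> Tuple[str, List[object]]:
    """Return (sql, params) for a filtered nearest-neighbour search.

    Assumes a hypothetical table erp_chunks with a normalized
    posting_date column and a pgvector column named embedding.
    """
    filters: List[str] = []
    filter_params: List[object] = []
    if doctype is not None:
        filters.append("doctype = %s")
        filter_params.append(doctype)
    if date_from is not None:
        filters.append("posting_date >= %s")
        filter_params.append(date_from)
    if date_to is not None:
        filters.append("posting_date <= %s")
        filter_params.append(date_to)

    parts = [
        "SELECT docname, doctype, content,",
        "embedding <=> %s::vector AS distance",  # cosine distance in pgvector
        "FROM erp_chunks",
    ]
    if filters:
        parts.append("WHERE " + " AND ".join(filters))
    parts.append("ORDER BY distance LIMIT %s")

    # Placeholder order: query vector, then filter values, then the limit.
    params = [to_vector_literal(embedding), *filter_params, k]
    return " ".join(parts), params
```

The returned pair can be passed to any driver that uses %s placeholders (for example, cursor.execute(sql, params) in psycopg). Keeping the filter values as parameters rather than formatting them into the string avoids SQL injection from user-derived dates or document types.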

Key Technical Challenges

ERPNext's data model relies heavily on child tables: a Sales Invoice, for example, keeps its line items in a separate child table rather than on the invoice row itself. We flatten these into document-level text chunks before embedding. We also found that chunking by transaction (one Sales Invoice = one chunk, including all of its items) gave better retrieval accuracy than arbitrary text splitting.
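
The flattening step might look roughly like the sketch below. It assumes the invoice has already been fetched as a dict with its child rows under an items key, which is how ERPNext's REST API returns a single Sales Invoice. The choice of fields, the text layout, and the function name flatten_sales_invoice are illustrative assumptions, not our production format, and the sample invoice is made up.

```python
def flatten_sales_invoice(doc: dict) -> str:
    """Render one Sales Invoice and all its item rows as a single text chunk."""
    header = (
        f"Sales Invoice {doc['name']}, "
        f"customer {doc.get('customer', 'unknown')}, "
        f"posted {doc.get('posting_date', 'unknown')}, "
        f"grand total {doc.get('grand_total', 0)} {doc.get('currency', '')}"
    ).strip() + "."

    lines = [header]
    # Child table rows are kept inside the same chunk so that a question
    # about an item retrieves the invoice it belongs to.
    for row in doc.get("items", []):
        lines.append(
            f"- {row.get('item_code', '?')}: qty {row.get('qty', 0)} "
            f"at rate {row.get('rate', 0)}, amount {row.get('amount', 0)}"
        )
    return "\n".join(lines)


# Made-up sample data for illustration only.
sample_invoice = {
    "name": "SINV-0001",
    "customer": "Acme",
    "posting_date": "2024-03-01",
    "grand_total": 150.0,
    "currency": "USD",
    "items": [
        {"item_code": "ITEM-A", "qty": 2, "rate": 50.0, "amount": 100.0},
        {"item_code": "ITEM-B", "qty": 1, "rate": 50.0, "amount": 50.0},
    ],
}
chunk = flatten_sales_invoice(sample_invoice)
```

Because the document name sits in the first line of every chunk, the generation step can cite it directly when it answers.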

Results

The assistant answers 80% of business queries accurately without any report writing. The remaining 20% require clarification, which we handle with a follow-up prompt mechanism. Query latency is under 3 seconds for most questions.