NLP · 2025
Local RAG Pipeline for Small Corpora
Qwen2.5-1.5B to the rescue
01 / Problem
Most RAG tutorials use a hosted LLM (OpenAI, Anthropic, Groq), which means a network dependency, per-token cost, and your documents leaving your machine. I wanted a pipeline that runs entirely on local hardware, with no API key required and no data sent anywhere.
02 / Approach
The pipeline uses Qwen2.5-1.5B-Instruct (1.5B parameters) for generation and MiniLM-L6-v2 models for retrieval: a bi-encoder for sentence embeddings and a cross-encoder for reranking. Total model footprint is around 3 GB. Models are downloaded from Hugging Face and served locally with their native weights: no quantization, no llama.cpp. A FastAPI backend handles document chunking, embedding, FAISS retrieval, and reranking; a Vite frontend sends queries and streams responses. The cross-encoder reranking step runs after the initial FAISS retrieval and re-scores the candidates before the context is passed to the generator.
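The retrieve-then-rerank flow described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's actual code: `chunk_words`, `retrieve_then_rerank`, `toy_embed`, and `toy_cross_score` are hypothetical names, the chunk size and overlap are arbitrary, the brute-force cosine search stands in for the FAISS index, and the two toy scorers stand in for the MiniLM bi-encoder and cross-encoder.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def chunk_words(text: str, size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into overlapping windows of `size` words (sizes are illustrative)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(
    query: str,
    chunks: List[str],
    embed: Callable[[str], Vector],
    cross_score: Callable[[str, str], float],
    top_k: int = 10,
    top_n: int = 3,
) -> List[str]:
    """Shortlist top_k chunks by vector similarity, then keep the top_n after reranking."""
    # Stage 1: cheap similarity over every chunk. A real pipeline would embed
    # the chunks once and query a vector index instead of re-embedding here.
    q_vec = embed(query)
    shortlist = sorted(chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True)[:top_k]
    # Stage 2: costlier pairwise (query, chunk) scoring, applied to the shortlist only.
    return sorted(shortlist, key=lambda c: cross_score(query, c), reverse=True)[:top_n]


def toy_embed(text: str) -> List[float]:
    """Stand-in for a sentence-embedding model: letter-frequency vector."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


def toy_cross_score(query: str, passage: str) -> float:
    """Stand-in for a cross-encoder: number of words shared by query and passage."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))
```

Calling `retrieve_then_rerank(query, chunks, toy_embed, toy_cross_score)` returns the chunks that would be placed in the generator's prompt. Passing the two scorers in as callables keeps the control flow separate from the models, so real embedding and reranking models could replace the toy functions without changing the two-stage logic.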
03 / Result
The pipeline handles documents up to ~35 pages without stability issues. On an i5 with no GPU, latency runs 2–3 minutes for longer queries, which is workable for experimentation. On a Ryzen 9 + RTX 3060, retrieval and generation are substantially faster thanks to GPU acceleration. The reranking step measurably improves answer relevance over naive top-k retrieval alone.
04 / Demo
Interactive demo — coming soon