NLP · 2025

Local RAG Pipeline for Small Corpora

Unquantized Qwen2.5-1.5B to the rescue

RAG · Python · FastAPI · Qwen2.5 · FAISS · MiniLM

01 / Problem

Most RAG tutorials use a hosted LLM — OpenAI, Anthropic, Groq — which means a network dependency, per-token cost, and your documents leaving your machine. I wanted a pipeline that runs entirely on local hardware with no API key required and no data sent anywhere.

02 / Approach

The pipeline uses Qwen2.5-1.5B-Instruct (1.5B parameters) for generation and MiniLM-L6-v2 models for both sentence embeddings (bi-encoder) and cross-encoder reranking. Total model footprint is around 3 GB, most of it the generator's unquantized 16-bit weights. Models are downloaded from Hugging Face and served locally with no quantization and no llama.cpp, just the native weights. A FastAPI backend handles document chunking, embedding, FAISS retrieval, and reranking; a Vite frontend sends queries and streams responses. The cross-encoder runs after the initial FAISS retrieval and re-scores each candidate chunk against the query, so only the highest-scoring chunks are passed to the generator as context.
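The chunk, retrieve, rerank flow described above can be sketched roughly as follows. This is not the project's code: to keep it runnable with the standard library alone, a bag-of-words cosine stands in for MiniLM embeddings plus FAISS search, and a bigram-overlap scorer stands in for the cross-encoder. All function names, chunk sizes, and sample strings are illustrative assumptions.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def chunk_words(text: str, size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into overlapping word windows (sizes are illustrative)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):  # last window reached the end
            break
    return chunks


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: List[str], k: int) -> List[int]:
    """Stage 1: cheap vector similarity over every chunk.

    Stand-in for embedding the query and searching a FAISS index.
    Returns the indices of the k most similar chunks.
    """
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(c))), i) for i, c in enumerate(chunks)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:k]]


def toy_pair_score(query: str, passage: str) -> float:
    """Stand-in for a cross-encoder: scores the (query, passage) pair jointly
    by counting query bigrams that also appear in the passage."""
    q, p = tokenize(query), tokenize(passage)
    passage_bigrams = set(zip(p, p[1:]))
    return float(sum(1 for bg in zip(q, q[1:]) if bg in passage_bigrams))


def rerank(
    query: str,
    chunks: List[str],
    candidates: List[int],
    pair_score: Callable[[str, str], float],
    top_n: int,
) -> List[int]:
    """Stage 2: re-score only the retrieved candidates with the pair scorer."""
    ranked = sorted(candidates, key=lambda i: pair_score(query, chunks[i]), reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    docs = [
        "encoder cross reranks",
        "cross encoder reranks candidates after retrieval",
        "faiss stores vectors",
    ]
    question = "cross encoder reranks"
    candidates = retrieve(question, docs, k=2)
    best = rerank(question, docs, candidates, toy_pair_score, top_n=1)
    context = "\n\n".join(docs[i] for i in best)  # what the generator would see
    print(candidates, best)
    print(context)
```

In the toy data, the first stage ranks the scrambled chunk highest because word order is invisible to it, while the pair scorer prefers the chunk that matches the query's phrasing. That is the kind of correction a reranking stage is meant to provide; a real cross-encoder would supply the pair score in place of the bigram heuristic.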

03 / Result

Handles documents up to ~35 pages without stability issues. On an i5 with no GPU, latency runs 2–3 minutes for longer queries, which is workable for experimentation. On a Ryzen 9 + RTX 3060, retrieval and generation are substantially faster thanks to GPU acceleration. The reranking step measurably improves answer relevance over naive top-k retrieval alone.

04 / Demo

Interactive demo — coming soon

View on GitHub