An eval-first, debuggable RAG engine you can read like a tutorial.
Strata-RAG indexes a heterogeneous, multi-format document corpus and answers questions over it — with hybrid retrieval, cross-encoder reranking, a metadata sidecar for exact aggregation, a real eval harness, layered prompt-injection & PII defenses, and an adaptive red-team. Every design choice is commented with the why.
The core design insight
Real questions over a project corpus split into two classes — and a pure embedding-RAG silently fails the second. Strata-RAG keeps both, and routes each question to the index that can actually answer it.
1 · Semantic — “what is this about?”
Meaning/theme questions go to a vector index (Qdrant + HNSW) with hybrid dense + BM25 retrieval fused by Reciprocal Rank Fusion, then a cross-encoder re-rank. Retrieval is the ceiling on answer quality, so this path is measured, not assumed.
2 · Aggregation — “how many / which set?”
Counting, grouping and exact lookups go to a structured metadata sidecar (SQLite) via a templated, read-only query layer. A vector top-k cannot count or intersect — so the engine doesn’t pretend it can.
What’s inside
The pieces a real document-intelligence system needs — each built to be studied, and each measured or guarded rather than trusted.
Hybrid retrieval + rerank
Dense (sentence-transformers) + sparse (BM25) fused with RRF, then a cross-encoder re-ranker. Local embeddings — runs with no API key.
Exact-aggregation sidecar
A structured SQLite sidecar + a query router so “count / group / lookup” questions get exact answers instead of a hallucinated approximation.
Eval harness (not vibes)
Recall@K / Precision@K / MRR / nDCG for retrieval and LLM-as-judge faithfulness for generation — kept strictly separate, each with its blind spot named.
Injection & PII defenses
Spotlighting + sentinel-fencing of untrusted context, an injection scanner, and secret/PII redaction — the LLM is treated as an untrusted component.
Adaptive red-team
Obfuscation encoders (zero-width, homoglyph, morse…), multi-turn & indirect attacks, and a structural success-oracle that won’t over-report compromise.
Ingestion observability
A dry-run manifest (include/exclude-with-reason, coverage, blind-spots) and a chunk inspector — so you can see what the index will and won’t answer.
Open-core plugin seam
Pluggable source adapters register at import time; bring your own corpus without forking the engine. Generic core, private overlays.
Agentic chatbot
A query router + a ReAct agent with semantic-search and metadata tools, multi-turn
/chat, and a thin Streamlit UI that surfaces the engine’s signals.
Layered leak defense
A deterministic grep gate (pre-commit + CI) plus a semantic auditor — direct and structural leaks — for safe private→public open-core publishing.
Quickstart
Runs out of the box on a tiny synthetic, fully-fictional sample corpus — no API key, local embeddings. Only enrichment, generation and the judge need an LLM backend.
# clone + install git clone https://github.com/NikolaiSachok/Strata-RAG && cd Strata-RAG python3 -m venv .venv && source .venv/bin/activate pip install -e ".[dev]" # bring up the vector store, then plan the ingest (no embedding yet) docker compose up -d qdrant python -m rageval.ingest --dry-run # coverage manifest: what will/won't be indexed # build the index + serve the API python -m rageval.ingest uvicorn rageval.api:app --reload # → http://localhost:8000/docs
Deep-dive the design in the project wiki — Architecture, Design-Decisions, Evaluation, Red-Teaming — each written to teach the why, the failure mode each stage prevents, and when you’d choose differently.
The pipeline
Ingest builds two indexes in parallel; retrieval is hybrid, fused, reranked; generation is grounded and guarded; everything is measured.