Open-source · MIT · Python

An eval-first, debuggable RAG engine you can read like a tutorial.

Strata-RAG indexes a heterogeneous, multi-format document corpus and answers questions over it — with hybrid retrieval, cross-encoder reranking, a metadata sidecar for exact aggregation, a real eval harness, layered prompt-injection & PII defenses, and an adaptive red-team. Every design choice is commented with the why.

View on GitHub Read the Wiki Quickstart

Qdrant + HNSWdense + BM25 + RRF cross-encoder rerankRecall@K · nDCG · MRR LLM-as-judge faithfulnessprompt-injection guardrails adaptive red-teamFastAPI · Streamlit

The core design insight

Real questions over a project corpus split into two classes — and a pure embedding-RAG silently fails the second. Strata-RAG keeps both, and routes each question to the index that can actually answer it.

1 · Semantic — “what is this about?”

Meaning/theme questions go to a vector index (Qdrant + HNSW) with hybrid dense + BM25 retrieval fused by Reciprocal Rank Fusion, then a cross-encoder re-rank. Retrieval is the ceiling on answer quality, so this path is measured, not assumed.

2 · Aggregation — “how many / which set?”

Counting, grouping and exact lookups go to a structured metadata sidecar (SQLite) via a templated, read-only query layer. A vector top-k cannot count or intersect — so the engine doesn’t pretend it can.

What’s inside

The pieces a real document-intelligence system needs — each built to be studied, and each measured or guarded rather than trusted.

🔎

Hybrid retrieval + rerank

Dense (sentence-transformers) + sparse (BM25) fused with RRF, then a cross-encoder re-ranker. Local embeddings — runs with no API key.

🧮

Exact-aggregation sidecar

A structured SQLite sidecar + a query router so “count / group / lookup” questions get exact answers instead of a hallucinated approximation.

📏

Eval harness (not vibes)

Recall@K / Precision@K / MRR / nDCG for retrieval and LLM-as-judge faithfulness for generation — kept strictly separate, each with its blind spot named.

🛡️

Injection & PII defenses

Spotlighting + sentinel-fencing of untrusted context, an injection scanner, and secret/PII redaction — the LLM is treated as an untrusted component.

🎯

Adaptive red-team

Obfuscation encoders (zero-width, homoglyph, morse…), multi-turn & indirect attacks, and a structural success-oracle that won’t over-report compromise.

🔭

Ingestion observability

A dry-run manifest (include/exclude-with-reason, coverage, blind-spots) and a chunk inspector — so you can see what the index will and won’t answer.

🧩

Open-core plugin seam

Pluggable source adapters register at import time; bring your own corpus without forking the engine. Generic core, private overlays.

🤖

Agentic chatbot

A query router + a ReAct agent with semantic-search and metadata tools, multi-turn /chat, and a thin Streamlit UI that surfaces the engine’s signals.

🔐

Layered leak defense

A deterministic grep gate (pre-commit + CI) plus a semantic auditor — direct and structural leaks — for safe private→public open-core publishing.

Quickstart

Runs out of the box on a tiny synthetic, fully-fictional sample corpus — no API key, local embeddings. Only enrichment, generation and the judge need an LLM backend.

# clone + install
git clone https://github.com/NikolaiSachok/Strata-RAG && cd Strata-RAG
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

# bring up the vector store, then plan the ingest (no embedding yet)
docker compose up -d qdrant
python -m rageval.ingest --dry-run        # coverage manifest: what will/won't be indexed

# build the index + serve the API
python -m rageval.ingest
uvicorn rageval.api:app --reload          # → http://localhost:8000/docs

Deep-dive the design in the project wiki — Architecture, Design-Decisions, Evaluation, Red-Teaming — each written to teach the why, the failure mode each stage prevents, and when you’d choose differently.

The pipeline

Ingest builds two indexes in parallel; retrieval is hybrid, fused, reranked; generation is grounded and guarded; everything is measured.

adapters → classify → redact → chunk → embed → Qdrant/HNSW → hybrid dense + BM25 → RRF → cross-encoder rerank → grounded generate (+ injection guardrails) → eval (Recall@K / nDCG / MRR + LLM-judge) ‖ structured metadata sidecar → query router → exact aggregation