March 2025
When I started building the retrieval layer for MAiQ — our multi-tenant AI SaaS — I quickly realized that naive vector search alone was not going to cut it. Dense retrieval misses exact keyword matches. Sparse retrieval misses semantic similarity. The answer was a hybrid pipeline that combines both, then uses a cross-encoder reranker to pick the best results.
Dense retrieval (embeddings + cosine similarity) is great for semantic understanding — "cheap flight" matches "budget airline". But it struggles with exact terms like product codes, acronyms, or proper nouns. BM25 (sparse retrieval) is the opposite — it excels at keyword matching but has no concept of meaning.
Hybrid search runs both in parallel and merges the results. The merge strategy matters: we used Reciprocal Rank Fusion (RRF), a simple formula that is effectively parameter-free (its one constant, k, is conventionally fixed at 60) and works well in practice.
def reciprocal_rank_fusion(results_list: list[list[str]], k: int = 60) -> list[str]:
    # Accumulate a fused score for every document across all ranked lists
    scores: dict[str, float] = {}
    for results in results_list:
        for rank, doc_id in enumerate(results):
            # k dampens the influence of any single list's top ranks
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

We used PostgreSQL + pgvector instead of a dedicated vector DB like Milvus or Pinecone. The reason: we already had multi-schema PostgreSQL for multi-tenancy, and pgvector let us keep everything in one place — document metadata, chunk text, and embeddings — with full ACID guarantees.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE document_chunks (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    document_id UUID REFERENCES documents(id) ON DELETE CASCADE,
    chunk_text TEXT NOT NULL,
    embedding VECTOR(1536),  -- text-embedding-ada-002 produces 1536-dim vectors
    metadata JSONB
);

-- ANN index; ivfflat trades a little recall for query speed
CREATE INDEX ON document_chunks
    USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

We generated embeddings using Azure OpenAI's text-embedding-ada-002 model. Each chunk was 512 tokens with a 64-token overlap to preserve context across boundaries.
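Our exact chunker isn't shown here, but token-window chunking with overlap can be sketched as follows. This is a minimal illustration operating on an already-tokenized list; in practice a tokenizer such as tiktoken would produce the tokens:

```python
def chunk_tokens(
    tokens: list[str], chunk_size: int = 512, overlap: int = 64
) -> list[list[str]]:
    """Split a token sequence into fixed-size windows that overlap at the edges."""
    step = chunk_size - overlap  # advance 448 tokens per chunk at the defaults
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks
```

The overlap means the last 64 tokens of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still seen whole by at least one chunk.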
We used the rank_bm25 Python library for in-process BM25 scoring. The BM25 index was built per-tenant on document ingestion and cached in Redis. At query time, we retrieved the top-50 BM25 candidates and top-50 vector candidates, then merged them.
from rank_bm25 import BM25Okapi
class BM25Retriever:
    def __init__(self, chunks: list[str]):
        # Simple whitespace tokenization is good enough for keyword matching
        tokenized = [doc.lower().split() for doc in chunks]
        self.bm25 = BM25Okapi(tokenized)
        self.chunks = chunks

    def retrieve(self, query: str, top_k: int = 50) -> list[int]:
        tokenized_query = query.lower().split()
        scores = self.bm25.get_scores(tokenized_query)
        # Return chunk indices sorted by descending BM25 score
        return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]

Raw PDFs are notoriously hard to chunk well — tables, headers, and multi-column layouts destroy naive text extraction. We used LlamaParse to convert PDFs to clean Markdown before chunking. It handles tables as Markdown tables, preserves heading hierarchy, and significantly improves chunk quality for downstream retrieval.
from llama_parse import LlamaParse
parser = LlamaParse(result_type="markdown")

async def extract_pdf(file_path: str) -> str:
    # LlamaParse returns a list of Document objects for the parsed file
    documents = await parser.aload_data(file_path)
    return "\n\n".join(doc.text for doc in documents)

After merging BM25 + vector results (top ~30 candidates), we ran Cohere's rerank-english-v3.0 model. A cross-encoder reads the query and each candidate chunk together — this is much more accurate than embedding similarity but too slow to run over the whole corpus. The two-stage approach (cheap retrieval first, expensive reranking second) gives you the best of both worlds.
import cohere
co = cohere.Client(api_key=settings.COHERE_API_KEY)
def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=candidates,
        top_n=top_n,
    )
    # r.index points back into the original candidates list
    return [candidates[r.index] for r in response.results]

Compared to pure vector search, the hybrid pipeline with reranking improved retrieval precision on our internal eval set by ~34%. The biggest gains came on queries with specific technical terms and on multi-document corpora where a single embedding space was not expressive enough. LlamaParse also reduced "garbage chunk" rate from PDF ingestion by roughly 60% compared to PyPDF2.
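To show how the stages fit together, here is a hypothetical end-to-end sketch, not our production code. `bm25_retrieve` and `vector_retrieve` stand in for the BM25 retriever above and a pgvector similarity query, each returning chunk indices for a query; `rerank` stands in for the cross-encoder call:

```python
def hybrid_search(query, bm25_retrieve, vector_retrieve, rerank, chunks,
                  k: int = 60, top_n: int = 5) -> list[str]:
    """Two-stage hybrid retrieval: cheap candidates, RRF merge, expensive rerank."""
    # Stage 1: fetch top-50 candidates (as chunk indices) from each retriever
    ranked_lists = [bm25_retrieve(query, 50), vector_retrieve(query, 50)]
    # Stage 2: reciprocal rank fusion merges the two rankings
    scores: dict[int, float] = {}
    for results in ranked_lists:
        for rank, idx in enumerate(results):
            scores[idx] = scores.get(idx, 0) + 1 / (k + rank + 1)
    fused = sorted(scores, key=scores.get, reverse=True)[:30]
    # Stage 3: cross-encoder rerank on the short fused list only
    return rerank(query, [chunks[i] for i in fused], top_n)
```

Because the expensive cross-encoder only ever sees ~30 fused candidates, total latency stays dominated by the two cheap first-stage lookups.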