March 2025
When I started building the retrieval layer for MAiQ — our multi-tenant AI SaaS — I quickly realized that naive vector search alone was not going to cut it. Dense retrieval misses exact keyword matches. Sparse retrieval misses semantic similarity. The answer was a hybrid pipeline that combines both, then uses a cross-encoder reranker to pick the best results.
Dense retrieval (embeddings + cosine similarity) is great for semantic understanding — "cheap flight" matches "budget airline". But it struggles with exact terms like product codes, acronyms, or proper nouns. BM25 (sparse retrieval) is the opposite — it excels at keyword matching but has no concept of meaning.
Hybrid search runs both in parallel and merges the results. The merge strategy matters: we used Reciprocal Rank Fusion (RRF), a simple formula that is effectively parameter-free (its one constant, k, is conventionally fixed at 60) and works well in practice.
def reciprocal_rank_fusion(results_list: list[list[str]], k: int = 60) -> list[str]:
    # Accumulate a fused score for every document across all ranked lists
    scores: dict[str, float] = {}
    for results in results_list:
        for rank, doc_id in enumerate(results):
            # k dampens the influence of any single list's top ranks
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

We used PostgreSQL + pgvector instead of a dedicated vector DB like Milvus or Pinecone. The reason: we already had multi-schema PostgreSQL for multi-tenancy, and pgvector let us keep everything in one place — document metadata, chunk text, and embeddings — with full ACID guarantees.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE document_chunks (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    document_id UUID REFERENCES documents(id) ON DELETE CASCADE,
    chunk_text TEXT NOT NULL,
    embedding VECTOR(1536),  -- text-embedding-ada-002 produces 1536-dim vectors
    metadata JSONB
);

-- ANN index; ivfflat trades a little recall for query speed
CREATE INDEX ON document_chunks
    USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

We generated embeddings using Azure OpenAI's text-embedding-ada-002 model. Each chunk was 512 tokens with a 64-token overlap to preserve context across boundaries.
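Our exact chunker isn't shown here, but token-window chunking with overlap can be sketched as follows. This is a minimal illustration operating on an already-tokenized list; in practice a tokenizer such as tiktoken would produce the tokens:

```python
def chunk_tokens(
    tokens: list[str], chunk_size: int = 512, overlap: int = 64
) -> list[list[str]]:
    """Split a token sequence into fixed-size windows that overlap at the edges."""
    step = chunk_size - overlap  # advance 448 tokens per chunk at the defaults
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks
```

The overlap means the last 64 tokens of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still seen whole by at least one chunk.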
We used the rank_bm25 Python library for in-process BM25 scoring. The BM25 index was built per-tenant on document ingestion and cached in Redis. At query time, we retrieved the top-50 BM25 candidates and top-50 vector candidates, then merged them.
from rank_bm25 import BM25Okapi
class BM25Retriever:
    def __init__(self, chunks: list[str]):
        # Simple whitespace tokenization is good enough for keyword matching
        tokenized = [doc.lower().split() for doc in chunks]
        self.bm25 = BM25Okapi(tokenized)
        self.chunks = chunks

    def retrieve(self, query: str, top_k: int = 50) -> list[int]:
        tokenized_query = query.lower().split()
        scores = self.bm25.get_scores(tokenized_query)
        # Return chunk indices sorted by descending BM25 score
        return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]

Raw PDFs are notoriously hard to chunk well — tables, headers, and multi-column layouts destroy naive text extraction. We used LlamaParse to convert PDFs to clean Markdown before chunking. It handles tables as Markdown tables, preserves heading hierarchy, and significantly improves chunk quality for downstream retrieval.
from llama_parse import LlamaParse
parser = LlamaParse(result_type="markdown")

async def extract_pdf(file_path: str) -> str:
    # LlamaParse returns a list of Document objects for the parsed file
    documents = await parser.aload_data(file_path)
    return "\n\n".join(doc.text for doc in documents)

After merging BM25 + vector results (top ~30 candidates), we ran Cohere's rerank-english-v3.0 model. A cross-encoder reads the query and each candidate chunk together — this is much more accurate than embedding similarity but too slow to run over the whole corpus. The two-stage approach (cheap retrieval first, expensive reranking second) gives you the best of both worlds.
import cohere
co = cohere.Client(api_key=settings.COHERE_API_KEY)
def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=candidates,
        top_n=top_n,
    )
    # r.index points back into the original candidates list
    return [candidates[r.index] for r in response.results]

Compared to pure vector search, the hybrid pipeline with reranking improved retrieval precision on our internal eval set by ~34%. The biggest gains came on queries with specific technical terms and on multi-document corpora where a single embedding space was not expressive enough. LlamaParse also reduced "garbage chunk" rate from PDF ingestion by roughly 60% compared to PyPDF2.
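To show how the stages fit together, here is a hypothetical end-to-end sketch, not our production code. `bm25_retrieve` and `vector_retrieve` stand in for the BM25 retriever above and a pgvector similarity query, each returning chunk indices for a query; `rerank` stands in for the cross-encoder call:

```python
def hybrid_search(query, bm25_retrieve, vector_retrieve, rerank, chunks,
                  k: int = 60, top_n: int = 5) -> list[str]:
    """Two-stage hybrid retrieval: cheap candidates, RRF merge, expensive rerank."""
    # Stage 1: fetch top-50 candidates (as chunk indices) from each retriever
    ranked_lists = [bm25_retrieve(query, 50), vector_retrieve(query, 50)]
    # Stage 2: reciprocal rank fusion merges the two rankings
    scores: dict[int, float] = {}
    for results in ranked_lists:
        for rank, idx in enumerate(results):
            scores[idx] = scores.get(idx, 0) + 1 / (k + rank + 1)
    fused = sorted(scores, key=scores.get, reverse=True)[:30]
    # Stage 3: cross-encoder rerank on the short fused list only
    return rerank(query, [chunks[i] for i in fused], top_n)
```

Because the expensive cross-encoder only ever sees ~30 fused candidates, total latency stays dominated by the two cheap first-stage lookups.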