
RAG for RFP Responses: My Architecture Plan for BidScribe

AI · RAG · LLM · Architecture · BidScribe

Every RAG tutorial follows the same script: split documents, embed chunks, store in a vector database, retrieve top-k, send to LLM. Done.

It works in demos. I'm betting it falls apart in production. So before I build BidScribe — my planned AI-powered RFP response tool — I'm thinking carefully about the architecture.

Full disclosure: BidScribe is in the early idea/prototype stage. I haven't built this system yet. What follows is my architecture plan based on research, prototyping, and thinking through the problem. I'm sharing it because I think the reasoning is useful even before the implementation exists.

The Problem Space

RFP responses aren't blog posts. They're structured answers — often with headers, sub-questions, tables, and bullet points that form a logical unit. Naive chunking will split these apart, and retrieval will return fragments that lack context.

This is why I think a vanilla RAG tutorial approach won't cut it for this use case.

Chunking: Where I Think Most RAG Systems Will Fail

My Planned Approach

Semantic chunking over fixed-size. I want to chunk by logical units — a complete answer, a section, a coherent paragraph group. The chunks will vary in size from 200 to 2000 tokens. Uniform chunk size is a false goal.

Preserve metadata aggressively. Every chunk should carry:

  • Source document title
  • Section header hierarchy
  • Original question (if it was a Q&A pair)
  • Date and version
  • Tags/categories

This metadata isn't just for filtering — it gets injected into the prompt alongside the chunk content. An answer about "data security practices" means very different things depending on whether it came from a healthcare RFP or a financial services RFP.

Overlapping context windows. Instead of overlapping tokens between chunks, I plan to prepend parent context. Each chunk knows its place in the document hierarchy — like breadcrumbs: Document Title > Section > Subsection > This Chunk.

interface Chunk {
  id: string;
  content: string;
  embedding: number[];
  metadata: {
    sourceDocument: string;
    sectionPath: string[];   // ["Security", "Data Protection", "Encryption"]
    originalQuestion?: string;
    documentDate: string;
    tags: string[];
  };
}
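To make the breadcrumb idea concrete, here's a minimal sketch of rendering a chunk's hierarchy as a prefix string. `buildBreadcrumb` is a hypothetical helper of my own naming, and the interface below is a trimmed copy of the metadata shape above:

```typescript
// Mirrors the metadata fields from the Chunk interface above.
interface ChunkMetadata {
  sourceDocument: string;
  sectionPath: string[];
  originalQuestion?: string;
}

// Hypothetical helper: render "Document Title > Section > Subsection"
// so each chunk carries its place in the document hierarchy.
function buildBreadcrumb(meta: ChunkMetadata): string {
  return [meta.sourceDocument, ...meta.sectionPath].join(" > ");
}

const crumb = buildBreadcrumb({
  sourceDocument: "Security Whitepaper",
  sectionPath: ["Security", "Data Protection", "Encryption"],
});
// "Security Whitepaper > Security > Data Protection > Encryption"
```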

Embeddings: Research Suggests the Model Matters Less Than You Think

From what I've read and my early experiments, embedding model choice (OpenAI's text-embedding-3-large vs text-embedding-3-small vs Cohere's embed v3) matters less than the preprocessing pipeline.

What I plan to do before embedding:

  1. Strip formatting artifacts (HTML tags, markdown remnants, weird Unicode)
  2. Normalize whitespace and structure
  3. Prepend the section path as natural language: "In the context of Security > Data Protection > Encryption:"
  4. Include the original question when available

That last one should be huge for RFPs. If someone asks "How do you handle data encryption at rest?" and the knowledge base has the answer filed under "Storage Security Protocols," the question-enriched embedding should bridge that vocabulary gap.
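The four steps above can be sketched as a single preprocessing function. This is illustrative only — the regex-based tag stripping is enough for a sketch, but a real pipeline would use a proper HTML parser:

```typescript
// Sketch of the pre-embedding cleanup steps 1-4 described above.
function preprocessForEmbedding(
  raw: string,
  sectionPath: string[],
  originalQuestion?: string,
): string {
  const cleaned = raw
    .replace(/<[^>]+>/g, " ")   // 1. strip HTML tag remnants
    .replace(/[*_#`>]+/g, " ")  // 1. strip markdown remnants
    .replace(/\s+/g, " ")       // 2. normalize whitespace
    .trim();
  // 3. prepend the section path as natural language
  const context = `In the context of ${sectionPath.join(" > ")}:`;
  // 4. include the original question when available
  const question = originalQuestion ? `Q: ${originalQuestion}\n` : "";
  return `${question}${context} ${cleaned}`;
}

const embeddingInput = preprocessForEmbedding(
  "<p>All data is **encrypted** at rest.</p>",
  ["Security", "Data Protection", "Encryption"],
  "How do you handle data encryption at rest?",
);
```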

Retrieval: Hybrid Search Is the Plan

Pure vector similarity search probably isn't enough. Here's my reasoning.

The Keyword Problem

User asks: "What is your SOC 2 compliance status?"

Vector search will return chunks about "security certifications," "audit processes," and "compliance frameworks." All semantically related. But maybe none specifically mention SOC 2.

Meanwhile, there could be a chunk that says "We achieved SOC 2 Type II certification in March 2024" — but it's embedded in a broader section about company milestones, so it won't be the top vector match.

Planned Hybrid Approach

I'm planning to use Supabase with pgvector, combining vector similarity and PostgreSQL's built-in full-text search in the same query:

-- Planned retrieval query
-- (assumes chunks.fts is a tsvector column maintained over chunk content)
SELECT
  chunks.id,
  chunks.content,
  chunks.metadata,
  1 - (chunks.embedding <=> query_embedding) AS vector_score,
  ts_rank(chunks.fts, plainto_tsquery('english', query_text)) AS text_score
FROM chunks
WHERE chunks.workspace_id = $1
ORDER BY
  (0.7 * (1 - (chunks.embedding <=> query_embedding))) +
  (0.3 * ts_rank(chunks.fts, plainto_tsquery('english', query_text)))
  DESC
LIMIT 20;

The 0.7/0.3 weighting is a starting point — I'll tune it against real queries once I have test data.

Re-ranking

I plan to re-rank top-20 results using the LLM itself. Expensive, but I think it'll be worth it for catching relevance that embedding similarity misses.
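A rough shape of that re-ranking step: the `scoreRelevance` callback stands in for the LLM call (in practice an async request with a prompt like "rate this chunk's relevance to the query from 0 to 1") — it's an assumed interface, not any specific provider's API:

```typescript
interface Candidate {
  id: string;
  content: string;
  retrievalScore: number; // the hybrid score from the SQL query
}

// Sketch: re-order the top-k hybrid results by an LLM-assigned
// relevance score. scoreRelevance is synchronous here for clarity;
// a real implementation would be async.
function rerank(
  query: string,
  candidates: Candidate[],
  scoreRelevance: (query: string, content: string) => number,
  topK = 5,
): Candidate[] {
  return candidates
    .map((c) => ({ c, llmScore: scoreRelevance(query, c.content) }))
    .sort((a, b) => b.llmScore - a.llmScore)
    .slice(0, topK)
    .map((s) => s.c);
}
```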

Things I'm Thinking About

Staleness

Knowledge bases go stale. I'm planning:

  • Date-aware retrieval — boost recent chunks slightly
  • Version tracking — soft-delete old chunks when documents are re-uploaded
  • Confidence signals — flag answers that draw on outdated data
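The date-aware boost could be as simple as exponential decay. The 365-day half-life below is a tuning knob I made up for illustration, not a researched value:

```typescript
// Sketch: boost recent chunks via exponential decay by document age.
function recencyBoost(documentDate: string, now: Date = new Date()): number {
  const msPerDay = 1000 * 60 * 60 * 24;
  const ageDays = (now.getTime() - new Date(documentDate).getTime()) / msPerDay;
  const halfLifeDays = 365; // assumed tuning parameter
  return Math.pow(0.5, Math.max(ageDays, 0) / halfLifeDays);
}

// Blended into ranking as a small multiplier, e.g.:
// finalScore = hybridScore * (0.9 + 0.1 * recencyBoost(chunk.metadata.documentDate));
```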

When RAG Should Say "I Don't Know"

This is critical for RFP responses. A confident wrong answer in a proposal is worse than no answer. My planned approach:

  1. Retrieval threshold — flag anything below 0.6 similarity
  2. Explicit instruction — system prompt says never to fabricate details
  3. Source attribution — every answer links back to source chunks
  4. Draft mode — everything is a draft that a human reviews

That last point is philosophical as much as technical. RAG systems should augment human judgment, not replace it. Especially when the output goes to a client.
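The retrieval-threshold check (step 1 above) is simple to express. The 0.6 cutoff matches the plan; results that don't clear it get surfaced to the human reviewer rather than sent to the LLM as grounding:

```typescript
interface Retrieved {
  content: string;
  similarity: number;
  sourceId: string;
}

// Sketch: gate retrieval results on a similarity threshold. If nothing
// clears the bar, the system should say "I don't know" instead of
// generating from weak context.
function gateRetrieval(
  results: Retrieved[],
  threshold = 0.6,
): { confident: Retrieved[]; answerable: boolean } {
  const confident = results.filter((r) => r.similarity >= threshold);
  return { confident, answerable: confident.length > 0 };
}
```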

Multi-tenancy

Each workspace needs its own isolated knowledge base. Supabase's Row Level Security handles this cleanly:

ALTER TABLE chunks ENABLE ROW LEVEL SECURITY;

CREATE POLICY "Users can only access their workspace chunks"
ON chunks FOR SELECT
USING (workspace_id IN (
  SELECT workspace_id FROM workspace_members
  WHERE user_id = auth.uid()
));

What I'd Tell My Future Self

Start with evaluation. Before building the full system, create a test set of query-answer pairs. Measure quality from day one.
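Even a day-one evaluation can be tiny: recall@k over a hand-built set of (query, expected chunk) pairs. The `retrieve` signature below is an assumption — it stands in for whatever retrieval function is under test:

```typescript
interface EvalCase {
  query: string;
  expectedChunkIds: string[]; // chunks a correct retrieval should surface
}

// Sketch: fraction of test queries where at least one expected chunk
// appears in the top-k retrieved results.
function recallAtK(
  cases: EvalCase[],
  retrieve: (query: string, k: number) => string[],
  k = 5,
): number {
  if (cases.length === 0) return 0;
  let hits = 0;
  for (const c of cases) {
    const returned = new Set(retrieve(c.query, k));
    if (c.expectedChunkIds.some((id) => returned.has(id))) hits++;
  }
  return hits / cases.length;
}
```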

Invest in the ingestion pipeline. Document parsing (especially PDFs and tables) will probably take more time than the actual RAG logic.

Don't over-engineer the vector store. Supabase with pgvector should handle tens of thousands of chunks without issues. Start simple. Migrate only when needed.

The Bottom Line

I haven't built this yet. But I've done enough research and prototyping to have conviction about the architecture. RAG in production is going to be 20% retrieval algorithm and 80% everything else — data quality, chunking strategy, metadata, evaluation, and knowing when the system should shut up instead of guessing.

I'll write about the actual implementation as I build it. For now, this is the plan. If you're thinking about RAG for structured documents, I hope this gives you a useful starting framework.