Choosing a model and vector store for RAG without overengineering it
RAG (Retrieval-Augmented Generation) is sold as the straightforward way to make an LLM “know your data”. In practice, most RAG pain comes from two decisions made too early and with too little context:
- Which model(s) to use for embeddings and generation.
- Which vector store to use to retrieve the right chunks quickly and reliably.
You can spend months chasing “best practice” architectures (hybrid search, reranking, multi-stage retrieval, complex chunking pipelines) only to discover the real blocker: the data is messy, the questions are ambiguous, and nobody agreed what “good results” means.
This article is about making those two choices with a risk-and-cost lens: scale, quality, latency, operational load, and money. The aim is not the theoretically perfect system, but one that is good enough for the outcomes you actually need.
Start with reality: RAG is an information retrieval system with an LLM bolted on
Most RAG failures are retrieval failures, not generation failures.
If the system retrieves irrelevant chunks, the model will produce confident nonsense. If it retrieves too much, the prompt gets bloated, costs rise, and accuracy can fall. If it retrieves too little, the model guesses.
So when selecting a model and a vector store, treat it like building search:
- What is the corpus size and growth rate?
- How often does it change?
- How fresh must the answers be?
- What does “correct” mean and how will you measure it?
- What are your latency and cost budgets?
If you cannot answer those questions, optimise for simplicity and changeability. Most teams choose the wrong stack by designing for scale and sophistication they do not yet have.
Define “good results” before you buy anything
“Quality” in RAG needs a definition you can test. Agree on a small evaluation set:
- 30-100 representative questions (real user questions if possible).
- For each, an expected source document (or at least what “good grounding” looks like).
- A scoring approach:
  - Answer correctness (did it answer the question?)
  - Grounding (is it supported by retrieved sources?)
  - Retrieval quality (were the top-k chunks actually relevant?)
  - Refusal behaviour (does it say “I don’t know” when it should?)
Without this, model selection becomes opinion. With it, you can run A/B comparisons and make trade-offs explicitly.
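As a sketch of how small this can start: the harness below scores only retrieval hit rate (did the expected source appear in the top-k?), with a fake `retrieve` function standing in for your real retriever. All names and the fake index are illustrative.

```python
# Minimal evaluation harness sketch: measures retrieval hit rate only.
# `retrieve` is a stand-in for your real retrieval function; the fake
# index below exists only so the example runs.

def retrieve(question, k=5):
    fake_index = {
        "How do I reset my password?": [("kb-auth-01", "...")],
        "What is the refund window?": [("kb-billing-07", "...")],
    }
    return fake_index.get(question, [])[:k]

eval_set = [
    {"question": "How do I reset my password?", "expected_doc": "kb-auth-01"},
    {"question": "What is the refund window?", "expected_doc": "kb-billing-07"},
    {"question": "Can I export my data?", "expected_doc": "kb-data-03"},
]

def hit_rate(cases, k=5):
    hits = 0
    for case in cases:
        retrieved_ids = {doc_id for doc_id, _ in retrieve(case["question"], k)}
        if case["expected_doc"] in retrieved_ids:
            hits += 1
    return hits / len(cases)

print(f"retrieval hit rate: {hit_rate(eval_set):.2f}")  # 2 of 3 -> 0.67
```

Answer correctness and grounding usually need human or model-assisted judging; start with the retrieval metric because it is cheap to automate and catches most failures.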
Model choices: separate embeddings from generation
A pragmatic approach is to treat RAG as two different modelling problems:
- Embedding model: “Can I represent documents and queries so similar things end up near each other?”
- Generation model: “Given the retrieved context, can I answer accurately and consistently?”
You can change these independently, and you should.
Embedding model: what matters in practice
Embedding model choice is mostly about retrieval quality per cost and latency. Key factors:
- Domain fit: general embeddings can be fine for general corpora; highly specialised domains (legal, medical, internal jargon) may benefit from a stronger embedding model or domain-tuned embeddings.
- Language support: if your corpus or users are multilingual, pick embeddings that handle those languages well, or maintain separate indexes per language.
- Vector dimension and storage: larger vectors can mean better representation, but cost more to store and can be slower to search. This matters at scale.
- Throughput and price: ingestion pipelines can become expensive if you re-embed frequently.
A sensible baseline:
- Start with a strong general embedding model, keep chunking conservative, and measure retrieval quality.
- Only move to more complex approaches (hybrid search, reranking, fine-tuning embeddings) when you can show measurable improvement on your evaluation set.
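To make the “measure retrieval quality” step concrete: at small scale you do not even need a vector store to experiment, because brute-force cosine similarity is exact and fast up to tens of thousands of vectors. The sketch below uses random vectors as stand-ins for real embeddings; the dimension and corpus size are illustrative.

```python
import math
import random

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

random.seed(0)
DIM = 8  # real embedding models use hundreds to thousands of dimensions

# Stand-in "embeddings": in a real system these come from your embedding model.
corpus = {f"chunk-{i}": [random.gauss(0, 1) for _ in range(DIM)] for i in range(100)}
query = corpus["chunk-42"]  # querying with a known vector should rank it first

# Brute-force nearest neighbours: exact, no index, fine at small scale.
top = sorted(corpus, key=lambda cid: cosine(query, corpus[cid]), reverse=True)[:3]
print(top[0])  # chunk-42
```

Querying with a vector already in the corpus must rank that chunk first (similarity 1.0), which doubles as a sanity check on your similarity function before you trust it with real embeddings.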
Generation model: what matters in practice
For the generation model, focus less on “smartest model available” and more on behaviour:
- Instruction following: does it reliably use the provided context and cite it?
- Hallucination control: does it avoid inventing facts when context is missing?
- Output consistency: does it follow your response format?
- Latency: especially if this is interactive.
- Context window: if you routinely need lots of retrieved text, context length matters.
- Cost: most RAG cost at scale is tokens.
A common pattern that works well:
- Use a smaller, cheaper model for most queries.
- Escalate to a larger model for “hard” queries (low retrieval confidence, high business impact, or user explicitly requests deeper analysis).
This keeps cost under control without committing everything to the most expensive model.
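A hedged sketch of that escalation pattern, assuming your retriever exposes a top similarity score; the model identifiers and threshold are placeholders to tune against your evaluation set, not recommendations:

```python
# Query routing sketch: cheap model by default, escalate on weak
# retrieval or high stakes. All names and thresholds are placeholders.

CHEAP_MODEL = "small-model"      # hypothetical identifier
EXPENSIVE_MODEL = "large-model"  # hypothetical identifier
SIMILARITY_THRESHOLD = 0.75      # tune against your evaluation set

def choose_model(top_similarity: float, high_impact: bool,
                 user_requested_depth: bool) -> str:
    if user_requested_depth or high_impact:
        return EXPENSIVE_MODEL
    if top_similarity < SIMILARITY_THRESHOLD:
        # Weak retrieval: a stronger model copes better with thin context.
        return EXPENSIVE_MODEL
    return CHEAP_MODEL

print(choose_model(0.82, high_impact=False, user_requested_depth=False))  # small-model
print(choose_model(0.60, high_impact=False, user_requested_depth=False))  # large-model
```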
Vector store choices: treat it as a product decision, not a hobby
The vector store is where you pay operational complexity. Your options typically fall into three buckets:
- Dedicated vector databases (purpose-built)
- Search engines with vector support (vector + keyword / hybrid)
- Relational databases with vector support (good enough for small/medium scale)
The right choice depends on scale, uptime requirements, and your team’s willingness to run infrastructure.
What “scale” means in RAG
Scale is not just “how many documents”.
It is:
- Number of vectors (chunks, not documents)
- Update frequency (append-only vs constant churn)
- Query rate (requests per second)
- Latency SLO (what’s acceptable end-to-end)
- Multi-tenancy (do you need strict isolation per customer?)
Rule of thumb: if your chunking produces 200 chunks per document, your “10,000 documents” is actually “2 million vectors”. That is where storage, indexing, and query performance start to matter.
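The arithmetic is worth writing down before choosing infrastructure. A back-of-envelope sketch (the dimension below is a common one, but substitute your embedding model’s):

```python
# Back-of-envelope sizing: chunks become vectors, vectors become bytes.
docs = 10_000
chunks_per_doc = 200
dim = 1536            # a common embedding dimension; yours may differ
bytes_per_float = 4   # float32

vectors = docs * chunks_per_doc
raw_bytes = vectors * dim * bytes_per_float
print(f"{vectors:,} vectors, ~{raw_bytes / 1e9:.1f} GB raw (before index overhead)")
```

Index structures, replicas, and metadata typically multiply the raw figure, so treat this as a floor, not an estimate of the bill.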
Dedicated vector DBs
Strengths:
- Purpose-built for approximate nearest neighbour (ANN) search, with the tuning knobs to match.
- Often easier scaling and performance for large vector counts.
- Common features: filtering, metadata, hybrid search, replication.
Costs:
- Another system to run, secure, monitor, and back up.
- Vendor lock-in risk.
- Pricing can surprise you (storage, indexing, read/write ops).
When it’s “good enough”:
- You have large vector counts and consistent query volume.
- Latency matters.
- You expect to grow and you want purpose-built capabilities.
Search engines with vector support
Strengths:
- One platform for keyword + vector (hybrid search).
- Good operational tooling in many orgs already.
- Hybrid search often improves quality on messy corpora (IDs, codes, product names, exact phrases).
Costs:
- More tuning to get ANN right.
- Can be heavier operationally than simpler stores.
- You may pay for scale you don’t need if you run a full search cluster for a small RAG feature.
When it’s “good enough”:
- You already operate a search engine.
- Your data benefits from keyword matching as well as semantic similarity.
- You want a straightforward path to hybrid retrieval.
Relational DB with vectors
Strengths:
- Simplest operational footprint if you already run a relational database.
- Metadata filtering and tenant isolation are straightforward.
- Good choice for early-stage products and moderate scale.
Costs:
- Vector search performance will plateau earlier than it would in a specialised system.
- Fewer retrieval features out of the box.
- You may end up migrating if vectors and QPS grow significantly.
When it’s “good enough”:
- You are at small/medium scale (or uncertain).
- You want to ship quickly and keep operations simple.
- Your query volume is modest and latency tolerance is reasonable.
Quality levers that matter more than the vendor choice
Teams often blame the model or vector store when the real issues are upstream. The biggest practical quality levers are:
Chunking strategy
- Too big: retrieval finds the right document but includes lots of irrelevant text.
- Too small: you lose context, answers become thin or wrong.
- Overlap: helps continuity, but increases vector count and cost.
A pragmatic default: chunk by headings/sections where possible, not purely by token count. Structure beats maths.
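A minimal version of heading-based chunking, assuming markdown input; real documents need more care (code blocks, tables, deeply nested sections), so treat this as a starting point rather than a finished pipeline:

```python
import re

def chunk_by_headings(markdown_text, max_chars=2000):
    """Split on markdown headings; fall back to size-based splits only
    when a single section is too long."""
    # Zero-width split: keep each heading attached to its own section.
    sections = re.split(r"(?m)^(?=#{1,3} )", markdown_text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
        else:
            # Oversized section: hard-split by size as a last resort.
            for i in range(0, len(section), max_chars):
                chunks.append(section[i:i + max_chars])
    return chunks

doc = "# Intro\nShort intro.\n\n## Setup\nInstall steps.\n\n## Usage\nRun it."
for chunk in chunk_by_headings(doc):
    print(repr(chunk))
```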
Metadata and filtering
If you don’t filter by tenant, product, document type, or time range, retrieval becomes noisy fast. Good metadata is often a bigger win than a “better” embedding model.
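The principle is just “filter first, rank second”. A toy sketch with in-memory chunks and illustrative field names:

```python
# Filter-then-search sketch: restrict candidates by metadata before
# similarity scoring. Field names and values are illustrative.
chunks = [
    {"id": "c1", "tenant": "acme", "doc_type": "faq"},
    {"id": "c2", "tenant": "acme", "doc_type": "policy"},
    {"id": "c3", "tenant": "beta", "doc_type": "faq"},
]

def search(chunks, tenant, doc_type=None):
    # Hard tenant filter first: isolation is non-negotiable.
    candidates = [c for c in chunks if c["tenant"] == tenant]
    if doc_type is not None:
        candidates = [c for c in candidates if c["doc_type"] == doc_type]
    # ...then run vector similarity over `candidates` only.
    return [c["id"] for c in candidates]

print(search(chunks, tenant="acme", doc_type="faq"))  # ['c1']
```

Most stores express the same idea as a metadata filter on the query; the win is that irrelevant tenants and document types never compete for the top-k slots.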
Hybrid retrieval and reranking (only when justified)
Hybrid search (keyword + vector) and reranking can dramatically improve results, but only when you have evidence that semantic search alone is missing key hits.
But they add cost and latency:
- Hybrid can increase query complexity.
- Reranking adds another model call or compute step.
Use them when your evaluation set shows meaningful uplift, not because it’s trendy.
Confidence and refusal behaviour
A production RAG system should be allowed to say “I don’t know”. This usually requires:
- Retrieval confidence heuristics (e.g., similarity thresholds)
- Guardrails in prompting (no answer without sources)
- UX that makes “no answer” acceptable
This is often the difference between “usable” and “dangerous”.
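Wiring those three together can start as simply as this sketch, where the threshold is a placeholder to tune on your evaluation set and `generate` is your model call, injected so the example stays self-contained:

```python
# Refusal sketch: if retrieval confidence is below a threshold, refuse
# instead of generating. The threshold is a placeholder to tune.
REFUSAL_THRESHOLD = 0.70

def answer_or_refuse(retrieved, generate):
    """`retrieved` is a list of (similarity, chunk) pairs; `generate`
    is your model call, passed in to keep this sketch self-contained."""
    if not retrieved or max(sim for sim, _ in retrieved) < REFUSAL_THRESHOLD:
        return "I don't know based on the available sources."
    context = "\n\n".join(chunk for _, chunk in retrieved)
    return generate(context)

def fake_generate(context):
    # Stand-in for the real model call.
    return f"Answer grounded in {len(context)} chars of context."

print(answer_or_refuse([(0.55, "weak match")], fake_generate))    # refuses
print(answer_or_refuse([(0.91, "strong match")], fake_generate))  # answers
```

Raw similarity scores are a crude confidence signal, but even a crude refusal beats a confident fabrication.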
Cost model: where the money goes
RAG costs tend to cluster in three places:
Embedding ingestion
- One-off cost for initial embedding.
- Ongoing cost for updates and re-embedding when chunking changes.
Vector storage and search
- Storage cost scales with number of vectors and dimension.
- Query cost scales with QPS and index type.
- Multi-tenancy can multiply indexes if you isolate per customer.
Generation tokens
- The silent killer at scale.
- Longer retrieved context increases input tokens.
- Verbose outputs increase output tokens.
Practical cost controls:
- Cap top-k and context size; prioritise precision.
- Summarise or compress long documents at ingest (carefully).
- Cache retrieval results for repeated questions.
- Route most queries to a cheaper model, escalate selectively.
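To see why capping context matters, price out a query. The per-token prices below are placeholders, not any provider’s real rates:

```python
# Rough per-query generation cost. Prices are placeholders (per 1M
# tokens); substitute your provider's actual numbers.
PRICE_IN = 0.50    # $ per 1M input tokens (placeholder)
PRICE_OUT = 1.50   # $ per 1M output tokens (placeholder)

def query_cost(context_tokens, prompt_tokens, output_tokens):
    input_tokens = context_tokens + prompt_tokens
    return (input_tokens * PRICE_IN + output_tokens * PRICE_OUT) / 1_000_000

# 5 chunks of 400 tokens vs 10 chunks of 400 tokens, same answer length:
small = query_cost(5 * 400, 300, 250)
large = query_cost(10 * 400, 300, 250)
print(f"per query: ${small:.6f} vs ${large:.6f}")
print(f"per month at 1M queries: ${small * 1e6:,.0f} vs ${large * 1e6:,.0f}")
```

Doubling top-k here adds roughly two-thirds to the bill for the same answer length, which is why precision in retrieval pays for itself at volume.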
A simple selection approach that avoids regret
If you want a decision process that works in most organisations:
1. Define the evaluation set and success criteria.
2. Pick a baseline embedding model + baseline generation model.
3. Start with the simplest vector store that meets your current scale and tenancy needs.
4. Measure retrieval quality and end-to-end correctness.
5. Only add complexity in this order:
   - Better chunking + metadata
   - Hybrid retrieval (if exact matching matters)
   - Reranking (if top-k precision is weak)
   - Model upgrades (if generation behaviour is the blocker)
   - Infrastructure upgrade (if performance/scale is the blocker)
This keeps you from buying a Ferrari when you need a van.
Closing thought
Selecting a model and vector store for RAG is not about finding the “best” technology. It is about choosing the cheapest stack that reliably produces acceptable answers at your scale, with an upgrade path when you need it.
Agree what “good” looks like, measure it, and treat every extra moving part as something you will have to run at 3am when it breaks.