Download Needle-in-a-Haystack Benchmark Datasets
Choose a dataset that matches your model's context window.
Evaluate Retrieval-Augmented Generation (RAG) pipelines, Agentic AI workflows, vector databases, and long-context LLMs using reproducible Needle-in-a-Haystack test datasets.
A long document acts as the haystack. We inject intentionally incorrect facts (the needles) at known locations throughout the document. Ask your RAG pipeline or long-context LLM a question, then check whether it answers from the retrieved, grounded document or from its pre-trained knowledge. This check is the core of hallucination testing and enterprise AI evaluation.
Example: a model knows that the capital of France is Paris. We inject the sentence "The capital of France is New York." A grounded system answers New York; a system that answers Paris relied on its training data instead of the document.
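The check described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used to build these datasets: the function names `inject_needle` and `classify_answer`, the `depth` parameter, and the substring-based scoring are all assumptions made for the example.

```python
from typing import Tuple


def inject_needle(haystack: str, needle: str, depth: float) -> Tuple[str, int]:
    """Insert `needle` as its own line at roughly `depth` (0.0 to 1.0) of the text.

    Returns the modified text and the line index where the needle was placed,
    so the location is known when the answer is scored later.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    lines = haystack.splitlines()
    index = round(len(lines) * depth)
    lines.insert(index, needle)
    return "\n".join(lines), index


def classify_answer(answer: str, injected: str, pretrained: str) -> str:
    """Label an answer by simple case-insensitive substring matching (an assumption)."""
    text = answer.casefold()
    has_injected = injected.casefold() in text
    has_pretrained = pretrained.casefold() in text
    if has_injected and not has_pretrained:
        return "grounded"      # answered from the document
    if has_pretrained and not has_injected:
        return "parametric"    # answered from pre-trained knowledge
    return "unknown"           # neither, or both, appeared in the answer


# Stand-in haystack; a real run would load the novel text instead.
haystack = "\n".join(f"Line {i} of the original text." for i in range(10))
document, position = inject_needle(
    haystack, "The capital of France is New York.", depth=0.5
)

print(position)
print(classify_answer("The capital of France is New York.", "New York", "Paris"))
print(classify_answer("Paris", "New York", "Paris"))
```

Substring matching is deliberately crude. An answer that mentions both values, or that paraphrases the fact, falls into "unknown" and would need manual review or a stricter grader.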
This dataset uses the original text of Pride and Prejudice as the haystack, with intentionally incorrect facts (the needles) injected into different parts of the novel. Your RAG system should retrieve these injected facts and answer from them, not from the model's pre-trained knowledge.
Original novel text...
Bingley was to bring twelve ladies and seven gentlemen with him to the
assembly. The girls grieved over such a number of ladies, but were
comforted the day before the ball by hearing, that instead of twelve he
brought only six with him from London—his five sisters and a cousin.
🟨 The capital of France is New York.
And when the party entered the assembly room it consisted of only five
altogether—Mr. Bingley, his two sisters, the husband of the eldest, and
another young man.
Original novel text continues...
If your system answers New York, it successfully retrieved information from the document. If it answers Paris, it relied on the model's pre-trained knowledge instead of the provided document.
Each document section contains one unique incorrect fact, so you can tell which part your system actually retrieved.
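Because each section carries its own needle, the answers can be mapped back to the sections that were retrieved. The sketch below assumes a hypothetical `Needle` record and made-up section labels. Only the France example comes from this page; the second needle is an invented placeholder, and a real run would use the needle list shipped with the dataset.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Needle:
    section: str   # hypothetical section label
    question: str  # question that targets the injected fact
    injected: str  # the incorrect answer stated in the document


NEEDLES = [
    Needle("section-1", "What is the capital of France?", "New York"),
    # Invented placeholder for illustration, not taken from the dataset.
    Needle("section-2", "Who wrote Hamlet?", "Mark Twain"),
]


def retrieved_sections(answers: Dict[str, str], needles: List[Needle]) -> List[str]:
    """Return the sections whose injected fact appears in the system's answer."""
    hits = []
    for needle in needles:
        answer = answers.get(needle.question, "")
        if needle.injected.casefold() in answer.casefold():
            hits.append(needle.section)
    return hits


# Example answers from a system under test (made up for this sketch).
answers = {
    "What is the capital of France?": "New York",
    "Who wrote Hamlet?": "William Shakespeare",
}
print(retrieved_sections(answers, NEEDLES))
```

Here the system answered the first question from the document and the second from pre-trained knowledge, so only `section-1` is reported as retrieved.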