Needle-in-a-Haystack

Test Data Generator for RAG and Agentic AI Evaluation

Evaluate Retrieval-Augmented Generation (RAG) pipelines, Agentic AI workflows, vector databases, and long-context LLMs using reproducible Needle-in-a-Haystack test datasets.

A long document acts as the haystack. We inject intentionally incorrect facts (the needles) at known locations throughout the document. Ask your RAG pipeline or long-context LLM a question, then check whether it answers from the retrieved document or from its pre-trained knowledge. That check is the core of hallucination testing and enterprise AI evaluation.

Example: a model knows that the capital of France is Paris. We inject the sentence "The capital of France is New York." A grounded system answers New York; if it answers Paris, it relied on its training data instead of the document.
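As a rough illustration of the injection step, the sketch below places each needle after a randomly chosen paragraph and records where it went. The function name inject_needles, the paragraph-boundary placement, and the seed parameter are assumptions made for this example; they are not taken from the generator itself.

```python
import random


def inject_needles(haystack: str, needles: list[str], seed: int = 0) -> tuple[str, list[int]]:
    """Insert each needle after a distinct paragraph of the haystack.

    Returns the modified text and the (sorted) indices of the original
    paragraphs that each needle follows. Hypothetical helper, not the
    generator's real implementation.
    """
    paragraphs = haystack.split("\n\n")
    if len(needles) > len(paragraphs):
        raise ValueError("more needles than paragraphs")

    # A fixed seed keeps the generated dataset reproducible.
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(len(paragraphs)), len(needles)))
    needle_at = dict(zip(positions, needles))

    out = []
    for index, paragraph in enumerate(paragraphs):
        out.append(paragraph)
        if index in needle_at:
            out.append(needle_at[index])  # needle becomes its own paragraph
    return "\n\n".join(out), positions


haystack = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
needle = "The capital of France is New York."
text, positions = inject_needles(haystack, [needle])
```

Because the positions are returned alongside the text, an evaluation can later check not only whether the needle was found but also where in the document it was hidden.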

How Needle-in-a-Haystack Testing Works

This dataset uses the original text of Pride and Prejudice as the haystack, with needles injected into different parts of the novel. Your RAG system should retrieve these injected facts and answer from them, not from the model's pre-trained knowledge.

Original novel text...

Bingley was to bring twelve ladies and seven gentlemen with him to the
assembly. The girls grieved over such a number of ladies, but were
comforted the day before the ball by hearing, that instead of twelve he
brought only six with him from London—his five sisters and a cousin.
🟨 The capital of France is New York.
And when the party entered the assembly room it consisted of only five
altogether—Mr. Bingley, his two sisters, the husband of the eldest, and
another young man.

Original novel text continues...

📖 Haystack: the original Pride and Prejudice novel.
🪡 Needle: an intentionally incorrect fact injected into the document.
✅ Expected behavior: a document-grounded RAG system should answer using the injected fact.

Question: What is the capital of France?
Expected answer: New York

If your system answers New York, it successfully retrieved information from the document. If it answers Paris, it relied on the model's pre-trained knowledge instead of the provided document.
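The pass/fail rule described here can be sketched as a small classifier. This is a minimal example that assumes simple case-insensitive substring matching; classify_answer and its labels are hypothetical names, and a real evaluation may need stricter matching or an LLM judge.

```python
def classify_answer(answer: str, needle_answer: str, parametric_answer: str) -> str:
    """Label a system answer as grounded, parametric, or unclear.

    needle_answer is the injected (incorrect) fact, e.g. "New York".
    parametric_answer is what the model knows from training, e.g. "Paris".
    """
    text = answer.casefold()
    has_needle = needle_answer.casefold() in text
    has_parametric = parametric_answer.casefold() in text

    if has_needle and not has_parametric:
        return "grounded"    # answered from the document
    if has_parametric and not has_needle:
        return "parametric"  # answered from pre-trained knowledge
    return "unclear"         # neither, or both, were mentioned


label = classify_answer("The capital of France is New York.", "New York", "Paris")
```

Answers that mention both values, such as a model that quotes the document and then corrects it, are labeled unclear here so that they can be reviewed by hand.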

Injected Needles for RAG Evaluation

Each document section contains one unique incorrect fact, so you can tell which part of the document your system actually retrieved.
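As a hypothetical illustration of how unique per-section needles localize retrieval, the sketch below checks which needles appear in the chunks a retriever returned. The section names and the second needle are invented for this example and are not part of the dataset.

```python
def sections_hit(retrieved_chunks: list[str], needle_by_section: dict[str, str]) -> list[str]:
    """Return the sections whose needle appears in the retrieved chunks."""
    joined = "\n".join(retrieved_chunks)
    return [
        section
        for section, needle in needle_by_section.items()
        if needle in joined
    ]


# Hypothetical mapping: one unique needle per document section.
needle_by_section = {
    "section-1": "The capital of France is New York.",
    "section-2": "Water boils at 50 degrees Celsius at sea level.",  # invented needle
}

retrieved = [
    "comforted the day before the ball by hearing ...",
    "The capital of France is New York.",
]
hits = sections_hit(retrieved, needle_by_section)
```

Since every needle is unique, a hit identifies exactly one section, and an empty result means the retriever missed every injected fact.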

Download Needle-in-a-Haystack Benchmark Datasets

Choose a dataset that matches your model's context window.
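One way to make this choice is to pick the largest dataset that still leaves room for the question and the answer. The sketch below assumes you already know each dataset's token count for your tokenizer; the dataset names, sizes, and the reserve parameter are hypothetical and do not come from the catalog.

```python
from typing import Optional


def pick_dataset(context_window: int, dataset_tokens: dict[str, int], reserve: int = 1024) -> Optional[str]:
    """Pick the largest dataset that fits in the context window.

    reserve is the number of tokens kept free for the prompt, the
    question, and the model's answer.
    """
    budget = context_window - reserve
    fitting = {name: size for name, size in dataset_tokens.items() if size <= budget}
    if not fitting:
        return None  # nothing fits; use retrieval or a smaller dataset
    return max(fitting, key=fitting.get)


# Hypothetical token counts, for illustration only.
catalog = {"small": 8_000, "medium": 32_000, "large": 128_000}
choice = pick_dataset(context_window=32_768, dataset_tokens=catalog)
```

With these made-up numbers the 32,000-token dataset does not fit a 32,768-token window once 1,024 tokens are reserved, so the smaller dataset is chosen.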
