Building a custom RAG benchmark
Learn to generate evaluation datasets directly from your vector database.
Evaluating RAG systems is critical to ensure they retrieve relevant information and generate accurate responses. A comprehensive RAG benchmark should give you insight into:
Retrieval effectiveness: How well your system finds relevant documents
Answer generation: The accuracy and relevance of generated responses
System robustness: Performance across different question types and complexities
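To make the first dimension concrete, retrieval effectiveness is often summarized with a hit rate or recall@k over the benchmark's expected contexts. Here is a minimal sketch of such a metric; the function name, inputs, and substring-matching rule are illustrative and not part of Phinity:
# Illustrative retrieval metric: fraction of expected contexts that appear
# among the top-k retrieved documents (simple substring matching).
def context_recall_at_k(expected_contexts, retrieved_docs, k=5):
    top_k = retrieved_docs[:k]
    hits = sum(1 for ctx in expected_contexts if any(ctx in doc for doc in top_k))
    return hits / len(expected_contexts) if expected_contexts else 0.0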
In this tutorial we will use ChromaDB, an open-source vector database. You will load documents into a collection and generate diverse, realistic user queries that test reasoning over single and multiple documents. Under the hood, Phinity constructs a knowledge graph from the documents and synthesizes query-answer-context test cases from that graph.
Steps
Set up environment
# Install Phinity
pip install phinitydata
os.environ["OPENAI_API_KEY"] = 'api-key'
import os
from phinitydata.testset.rag_generator import TestsetGenerator
import chromadb
from phinitydata.connectors.chromadb import ChromaDBConnector
# Disable tokenizers parallelism warnings
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# Set up your OpenAI API key
if "OPENAI_API_KEY" not in os.environ:
print("Please set your OpenAI API key first:")
print("export OPENAI_API_KEY='your-api-key-here'")
print("or")
api_key = input("Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = api_key
Data prep
For this tutorial, we'll create some example documents to generate test questions from. In a real scenario, you would use your own documents or knowledge base.
# Sample documents about different topics
# These will be our knowledge base for the RAG benchmark
doc1 = """
Phinity is a comprehensive synthetic data generation platform designed for AI applications.
It helps create realistic test data for training and evaluating AI systems without using real user data.
The platform specializes in generating question-answer pairs for RAG (Retrieval Augmented Generation) systems.
It can create various types of questions including simple factual questions, complex multi-hop questions,
and abstract questions that require synthesizing information from multiple sources.
Phinity uses advanced natural language processing techniques to ensure the generated data is diverse and realistic.
The platform integrates with vector databases like ChromaDB to generate evaluation data directly from stored documents.
"""
doc2 = """
ChromaDB is an open-source vector database designed specifically for AI applications.
It efficiently stores and retrieves vector embeddings, which are numerical representations of text, images, or other data.
Vector databases are essential components of RAG systems, enabling semantic search beyond simple keyword matching.
ChromaDB offers high-performance similarity search, allowing developers to find the most relevant documents for a given query.
It supports various embedding models and can be deployed either in-memory for development or as a persistent database for production.
The database is designed to scale horizontally and handle millions of embeddings efficiently.
ChromaDB's Python client makes it easy to integrate with machine learning pipelines and LLM-based applications.
"""
Loading documents into ChromaDB
Now let's set up ChromaDB and create a collection to store our documents:
print("Setting up ChromaDB collection...")
# Initialize ChromaDB
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(
    name="test_collection",
    metadata={"description": "Test collection for RAG evaluation"}
)
# Add documents to ChromaDB
collection.add(
    documents=[doc1, doc2],
    ids=["doc1", "doc2"],
    metadatas=[{"source": "phinity_docs"}, {"source": "chromadb_docs"}]
)
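Before generating test cases, it can help to confirm that the documents are stored and retrievable. A quick sanity check using ChromaDB's standard query API (the example query text is arbitrary):
# Sanity check: confirm documents are stored and semantic search works
print(f"Documents in collection: {collection.count()}")
results = collection.query(query_texts=["What is a vector database?"], n_results=1)
print("Top match:", results["documents"][0][0][:80], "...")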
Creating a connector and generating the test set
Now we'll use Phinity to generate a test set from our ChromaDB collection:
print("\nGenerating test cases from ChromaDB collection...")
# Initialize TestsetGenerator and connector
generator = TestsetGenerator()
connector = ChromaDBConnector(collection)
# Generate QA pairs from the collection
testset = generator.generate_from_connector(
    connector=connector,
    testset_size=4
)
We also support customizing the query distribution, with credit to the RAGAS framework's query synthesizers. The query distribution parameter is a dictionary mapping question types to their relative frequencies; Phinity uses these weights to determine how many questions of each type to generate when creating your benchmark.
# Optional: Customize query distribution
testset = generator.generate_from_connector(
    connector=connector,
    testset_size=4,
    query_distribution={
        "single_hop_specific": 0.5,   # Simple factual questions
        "multi_hop_abstract": 0.25,   # Retrieval of multiple documents, open-ended answer
        "multi_hop_specific": 0.25    # Retrieval of multiple documents, specific answer
    }
)
Display and export
print("\nGenerated test cases:")
for qa in testset.qa_pairs:
    print(f"\nInput Query: {qa.question}")
    print(f"Expected Answer: {qa.answer}")
    print("Retrieved Context:")
    for ctx in qa.context:
        print(f"- {ctx.strip()}")
    print("--------------------------------------------------")
# Export results
output_file = "rag_testset.json"
testset.to_json(output_file)
print(f"\nExported test cases to {output_file}")
After cleaning and verifying that the queries are realistic, you can use your test set to:
Evaluate your current RAG system (see the sketch below for a minimal retrieval check)
Compare different systems: run the same benchmark against multiple RAG implementations
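As an example of the first use, you can measure how often your retriever surfaces the expected context for each generated query. The sketch below queries the same ChromaDB collection with each test question and checks whether any expected context snippet appears in the top results; the substring-based matching rule is illustrative and not part of Phinity:
# Illustrative retrieval evaluation: query the collection with each generated
# question and check whether any expected context appears in the top results.
hits = 0
for qa in testset.qa_pairs:
    results = collection.query(query_texts=[qa.question], n_results=2)
    retrieved_docs = results["documents"][0]
    if any(ctx.strip() in doc for ctx in qa.context for doc in retrieved_docs):
        hits += 1
print(f"Context hit rate: {hits / len(testset.qa_pairs):.2f}")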