Building a custom RAG benchmark
Learn to generate evaluation datasets directly from your vector database.
Evaluating RAG systems is critical to ensure they retrieve relevant information and generate accurate responses. A comprehensive RAG benchmark should give you insight into:
Retrieval effectiveness: How well your system finds relevant documents
Answer generation: The accuracy and relevance of generated responses
System robustness: Performance across different question types and complexities
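To make the first dimension concrete, retrieval effectiveness is often summarized with a hit rate or recall@k over the benchmark's expected contexts. Here is a minimal sketch of such a metric; the function name, inputs, and substring-matching rule are illustrative and not part of Phinity:
# Illustrative retrieval metric: fraction of expected contexts that appear
# among the top-k retrieved documents (simple substring matching).
def context_recall_at_k(expected_contexts, retrieved_docs, k=5):
    top_k = retrieved_docs[:k]
    hits = sum(1 for ctx in expected_contexts if any(ctx in doc for doc in top_k))
    return hits / len(expected_contexts) if expected_contexts else 0.0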
In this tutorial we will use ChromaDB, an open-source vector database. You will load documents into a collection and generate diverse, realistic user queries that test reasoning over single and multiple documents. Under the hood, Phinity constructs a knowledge graph from the documents and synthesizes query-answer-context test cases from that graph.
Steps
Set up environment
# Install Phinity
pip install phinitydata
os.environ["OPENAI_API_KEY"] = 'api-key'
import os
from phinitydata.testset.rag_generator import TestsetGenerator
import chromadb
from phinitydata.connectors.chromadb import ChromaDBConnector
# Disable tokenizers parallelism warnings
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# Set up your OpenAI API key
if "OPENAI_API_KEY" not in os.environ:
print("Please set your OpenAI API key first:")
print("export OPENAI_API_KEY='your-api-key-here'")
print("or")
api_key = input("Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = api_key
Data prep
For this tutorial, we'll create some example documents to generate test questions from. In a real scenario, you would use your own documents or knowledge base.
# Sample documents about different topics
# These will be our knowledge base for the RAG benchmark
doc1 = """
Phinity is a comprehensive synthetic data generation platform designed for AI applications.
It helps create realistic test data for training and evaluating AI systems without using real user data.
The platform specializes in generating question-answer pairs for RAG (Retrieval Augmented Generation) systems.
It can create various types of questions including simple factual questions, complex multi-hop questions,
and abstract questions that require synthesizing information from multiple sources.
Phinity uses advanced natural language processing techniques to ensure the generated data is diverse and realistic.
The platform integrates with vector databases like ChromaDB to generate evaluation data directly from stored documents.
"""
doc2 = """
ChromaDB is an open-source vector database designed specifically for AI applications.
It efficiently stores and retrieves vector embeddings, which are numerical representations of text, images, or other data.
Vector databases are essential components of RAG systems, enabling semantic search beyond simple keyword matching.
ChromaDB offers high-performance similarity search, allowing developers to find the most relevant documents for a given query.
It supports various embedding models and can be deployed either in-memory for development or as a persistent database for production.
The database is designed to scale horizontally and handle millions of embeddings efficiently.
ChromaDB's Python client makes it easy to integrate with machine learning pipelines and LLM-based applications.
"""
Loading documents into ChromaDB
Now let's set up ChromaDB and create a collection to store our documents:
print("Setting up ChromaDB collection...")
# Initialize ChromaDB
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(
    name="test_collection",
    metadata={"description": "Test collection for RAG evaluation"}
)
# Add documents to ChromaDB
collection.add(
    documents=[doc1, doc2],
    ids=["doc1", "doc2"],
    metadatas=[{"source": "phinity_docs"}, {"source": "chromadb_docs"}]
)
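Before generating test cases, it can help to confirm that the documents are stored and retrievable. A quick sanity check using ChromaDB's standard query API (the example query text is arbitrary):
# Sanity check: confirm documents are stored and semantic search works
print(f"Documents in collection: {collection.count()}")
results = collection.query(query_texts=["What is a vector database?"], n_results=1)
print("Top match:", results["documents"][0][0][:80], "...")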
Creating a connector and generating the test set
Now we'll use Phinity to generate a test set from our ChromaDB collection:
print("\nGenerating test cases from ChromaDB collection...")
# Initialize TestsetGenerator and connector
generator = TestsetGenerator()
connector = ChromaDBConnector(collection)
# Generate QA pairs from the collection
testset = generator.generate_from_connector(
    connector=connector,
    testset_size=4
)
We also support customizing the query distribution, with credit to the RAGAS framework's query synthesizers. The query distribution parameter is a dictionary mapping question types to their relative frequencies; Phinity uses these weights to determine how many questions of each type to generate when creating your benchmark.
# Optional: Customize query distribution
testset = generator.generate_from_connector(
    connector=connector,
    testset_size=4,
    query_distribution={
        "single_hop_specific": 0.5,   # Simple factual questions
        "multi_hop_abstract": 0.25,   # Retrieval of multiple documents, open-ended answer
        "multi_hop_specific": 0.25    # Retrieval of multiple documents, specific answer
    }
)
Display and export
print("\nGenerated test cases:")
for qa in testset.qa_pairs:
    print(f"\nInput Query: {qa.question}")
    print(f"Expected Answer: {qa.answer}")
    print("Retrieved Context:")
    for ctx in qa.context:
        print(f"- {ctx.strip()}")
    print("--------------------------------------------------")
# Export results
output_file = "rag_testset.json"
testset.to_json(output_file)
print(f"\nExported test cases to {output_file}")
After cleaning and verifying that the queries are realistic, you can use your test set to:
Evaluate your current RAG system (see the sketch below for a minimal retrieval check)
Compare different systems: run the same benchmark against multiple RAG implementations
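As an example of the first use, you can measure how often your retriever surfaces the expected context for each generated query. The sketch below queries the same ChromaDB collection with each test question and checks whether any expected context snippet appears in the top results; the substring-based matching rule is illustrative and not part of Phinity:
# Illustrative retrieval evaluation: query the collection with each generated
# question and check whether any expected context appears in the top results.
hits = 0
for qa in testset.qa_pairs:
    results = collection.query(query_texts=[qa.question], n_results=2)
    retrieved_docs = results["documents"][0]
    if any(ctx.strip() in doc for ctx in qa.context for doc in retrieved_docs):
        hits += 1
print(f"Context hit rate: {hits / len(testset.qa_pairs):.2f}")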