SFTGenerator

The SFTGenerator class generates instruction-response pairs for supervised fine-tuning (SFT) of language models. It uses an "evolve-then-verify" approach, with optional document grounding, to create diverse, contextually relevant instruction sets.

Overview

SFTGenerator applies instruction evolution strategies to a set of seed instructions to produce varied training data. It can work from the seed instructions alone or ground the output in specific documents, in which case the generated questions are checked to be answerable from the provided context.

Installation

pip install phinitydata

Basic Usage

from phinitydata.testset.sft_generator import SFTGenerator
import os

# Initialize with API key
api_key = os.getenv("OPENAI_API_KEY")
generator = SFTGenerator(
    api_key=api_key,
    llm_model="gpt-4o-mini",
    temperature=0.7
)

# Define seed instructions
seed_instructions = [
    "What is machine learning?",
    "Explain how neural networks work"
]

# Generate evolved instructions
results = generator.generate(
    seed_instructions=seed_instructions,
    target_samples=10,
    domain_context="artificial intelligence",
    verbose=True,
    # export_format defaults to "json"; set it to match the .jsonl path
    export_format="jsonl",
    export_path="evolved_instructions.jsonl"
)

Initialization Parameters

  • api_key (Optional[str], default=None): OpenAI API key. If not provided, the generator reads the OPENAI_API_KEY environment variable

  • llm_model (str, default="gpt-4o-mini"): Model to use for generation

  • embedding_model (Optional[Embeddings], default=None): Custom embedding model

  • vector_store_type (str, default="faiss"): Type of vector store to use ("faiss" or "chroma")

  • temperature (float, default=0.7): Temperature for generation

Key Methods

generate

Generate domain-specific instruction-response pairs using a population-based evolution approach, optionally grounded in documents.

Parameters:

  • seed_instructions (List[str], required): Initial instruction seeds

  • documents (Optional[List[str]], default=None): Documents used for grounding/verification

  • target_samples (int, default=10): Number of samples to generate

  • domain_context (str, default=""): Context about the domain for guiding generation

  • evolution_config (Optional[Dict], default=None): Configuration for evolution strategies

  • strict_grounding (bool, default=False): Whether to strictly verify answerability

  • verbose (bool, default=False): Whether to print detailed progress

  • export_format (Literal["json", "jsonl"], default="json"): Format to export results

  • export_path (Optional[str], default=None): Path to export results to

Returns:

A dictionary containing:

  • samples: List of generated instruction-response pairs

  • metrics: Statistics about the generation process
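A minimal sketch of reading the returned dictionary. Only the top-level samples and metrics keys come from the description above; the per-sample field names (instruction, response) are placeholders assumed for illustration, so check them against the actual output before relying on them.

```python
from typing import Any, Dict, List


def summarize_results(results: Dict[str, Any]) -> List[str]:
    """Return one preview line per generated sample."""
    lines = []
    for i, sample in enumerate(results.get("samples", []), start=1):
        # "instruction" is an assumed field name; fall back to the raw sample.
        if isinstance(sample, dict):
            instruction = sample.get("instruction", "<missing instruction>")
        else:
            instruction = str(sample)
        lines.append(f"{i}. {instruction}")
    return lines


# Stand-in for the value returned by generator.generate(...)
example_results = {
    "samples": [
        {"instruction": "What is machine learning?", "response": "..."},
    ],
    "metrics": {},
}

for line in summarize_results(example_results):
    print(line)
```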

add_evolution_strategy

Add a custom evolution strategy to the generator.

Parameters:

  • name (str): Unique name for the strategy

  • description (str): Description of what the strategy does

  • prompt_template (str): Template for evolving prompts with placeholders

Built-in Evolution Strategies

The SFTGenerator includes several built-in evolution strategies, each with its own prompt template for transforming instructions in different ways.

1. Deepening Strategy

Description: Makes instructions more detailed and specific while keeping their length roughly the same.

Prompt Template:

2. Concretizing Strategy

Description: Adds concrete examples or scenarios while keeping the instruction's length roughly the same.

Prompt Template:

3. Reasoning Strategy

Description: Asks for reasoning or step-by-step explanations.

Prompt Template:

4. Comparative Strategy

Description: Transforms the prompt to include comparison elements.

Prompt Template:

Custom Evolution Strategies

SFTGenerator lets you add custom evolution strategies tailored to a specific domain or application, so that the generated instructions carry that domain's characteristics, terminology, and patterns.

When to Create Custom Evolution Strategies

Consider creating custom evolution strategies when:

  1. You need to generate data for a specialized domain (finance, healthcare, legal, etc.)

  2. You want instructions to follow specific formats or include domain-specific elements

  3. You want to emphasize certain types of questions within your dataset

  4. You need to adapt the complexity level for specific use cases

Example: Adding a Domain-Specific Strategy

Here's an example of adding a custom strategy for financial domain instruction evolution:
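A minimal sketch of registering such a strategy through add_evolution_strategy, using the three parameters documented above. The strategy name, the template wording, and the {instruction} placeholder are assumptions made for illustration; the placeholder names the library actually expects are not stated here.

```python
FINANCIAL_TEMPLATE = (
    "Rewrite the instruction below so that it involves a realistic "
    "financial scenario (for example, portfolio risk or cash-flow analysis) "
    "while staying answerable.\n\n"
    "Original instruction: {instruction}\n"  # placeholder name is assumed
    "Evolved instruction:"
)


def register_financial_strategy(generator) -> None:
    """Register a finance-oriented strategy on an SFTGenerator-like object."""
    generator.add_evolution_strategy(
        name="financial_scenario",  # hypothetical strategy name
        description="Recasts the instruction as a realistic financial scenario.",
        prompt_template=FINANCIAL_TEMPLATE,
    )
```

With a real generator, this would be called as register_financial_strategy(generator) before generator.generate(...).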

Example: Technical Documentation Strategy
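One way to sketch a technical-documentation strategy is to keep its definition in a small data class, preview the filled template locally, and only then register it. StrategySpec, the strategy name, and the {instruction} placeholder are hypothetical helpers introduced for this sketch and are not part of the library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrategySpec:
    """Local container for the three add_evolution_strategy arguments."""
    name: str
    description: str
    prompt_template: str

    def preview(self, instruction: str) -> str:
        """Fill the template locally to check its wording."""
        return self.prompt_template.format(instruction=instruction)


TECH_DOCS = StrategySpec(
    name="technical_documentation",  # hypothetical strategy name
    description="Turns the instruction into a question a reader of API docs would ask.",
    prompt_template=(
        "Rewrite the instruction below as a question about configuring, "
        "calling, or troubleshooting the described component.\n\n"
        "Original instruction: {instruction}\n"  # placeholder name is assumed
        "Evolved instruction:"
    ),
)


def register(generator, spec: StrategySpec) -> None:
    """Pass the spec's fields to an SFTGenerator-like object."""
    generator.add_evolution_strategy(
        name=spec.name,
        description=spec.description,
        prompt_template=spec.prompt_template,
    )


print(TECH_DOCS.preview("Explain how neural networks work"))
```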

Document Grounding

When documents are provided, the generator ensures evolved instructions remain answerable from the document context:

  1. Verifies answerability against the document set

  2. Repairs instructions that drift from the document context

  3. Simplifies overly complex instructions that cannot be fully answered

  4. Supports partial answerability for complex instructions

Document Grounding Example
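A minimal sketch of a grounded run, using only the generate parameters documented above (seed_instructions, documents, target_samples, strict_grounding, verbose). The wrapper function and the sample documents are illustrative assumptions; the generator argument is expected to be an SFTGenerator instance created as in Basic Usage.

```python
from typing import Any, Dict, List


def generate_grounded(generator, seeds: List[str], documents: List[str],
                      target_samples: int = 5) -> Dict[str, Any]:
    """Run generation with answerability strictly verified against documents."""
    return generator.generate(
        seed_instructions=seeds,
        documents=documents,
        target_samples=target_samples,
        strict_grounding=True,  # verify answerability strictly
        verbose=True,
    )


# Illustrative inputs
documents = [
    "Gradient descent updates parameters in the direction that reduces the loss.",
    "A learning rate that is too large can cause training to diverge.",
]
seeds = ["What does gradient descent do?"]
```

With a real generator, the call would be results = generate_grounded(generator, seeds, documents).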

Vector Database Integration

The SFTGenerator supports integration with vector databases for efficient document retrieval:
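A minimal sketch of selecting the vector store at construction time. It relies only on the vector_store_type parameter and its two documented values ("faiss" and "chroma"); the make_generator helper is a hypothetical convenience that takes the class as an argument so the check can run without the library installed.

```python
import os

SUPPORTED_VECTOR_STORES = ("faiss", "chroma")  # values listed in the parameter docs


def make_generator(generator_cls, vector_store_type: str = "faiss"):
    """Construct a generator after validating the vector store name."""
    if vector_store_type not in SUPPORTED_VECTOR_STORES:
        raise ValueError(
            f"vector_store_type must be one of {SUPPORTED_VECTOR_STORES}, "
            f"got {vector_store_type!r}"
        )
    return generator_cls(
        api_key=os.getenv("OPENAI_API_KEY"),
        llm_model="gpt-4o-mini",
        vector_store_type=vector_store_type,
    )
```

With the library installed, this would be used as make_generator(SFTGenerator, vector_store_type="chroma").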

Error Handling

The generator handles API failures, unverifiable instructions, and similar issues internally. When generating at scale, set verbose=True to track progress in detail.
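For long runs, an outer retry loop can serve as an additional safeguard around the call. The sketch below is not part of the library: generate_with_retry is a hypothetical helper, and because the exception types the generator may raise are not documented here, it catches Exception broadly and retries with exponential backoff.

```python
import time
from typing import Any, Callable, Dict


def generate_with_retry(generate_fn: Callable[..., Dict[str, Any]],
                        max_attempts: int = 3,
                        base_delay: float = 1.0,
                        **kwargs: Any) -> Dict[str, Any]:
    """Call generate_fn(**kwargs), retrying with exponential backoff on failure."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return generate_fn(**kwargs)
        except Exception as exc:  # exception types are unknown, so catch broadly
            last_error = exc
            if attempt < max_attempts:
                # Wait base_delay, 2*base_delay, 4*base_delay, ...
                time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(
        f"generation failed after {max_attempts} attempts"
    ) from last_error
```

With a real generator, the call would look like generate_with_retry(generator.generate, seed_instructions=seed_instructions, target_samples=10, verbose=True).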

Advanced Configuration

Evolution Configuration

You can customize the evolution process by providing a configuration dictionary:

Output Format Options

The generator supports different output formats:
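The export_format parameter accepts "json" and "jsonl". The sketch below reads an exported file back in either format; it assumes only the standard meaning of the two formats (one JSON document per file versus one JSON value per line) and makes no assumption about the fields inside. parse_export and load_export are hypothetical helpers, not library functions.

```python
import json
from pathlib import Path
from typing import Any, Union


def parse_export(text: str, export_format: str) -> Any:
    """Parse exported text according to its format ("json" or "jsonl")."""
    if export_format == "jsonl":
        # One JSON value per non-empty line
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if export_format == "json":
        return json.loads(text)
    raise ValueError(f"unsupported export_format: {export_format!r}")


def load_export(path: Union[str, Path], export_format: str) -> Any:
    """Read an exported file from disk and parse it."""
    return parse_export(Path(path).read_text(encoding="utf-8"), export_format)
```

For the file written in Basic Usage, this would be called as load_export("evolved_instructions.jsonl", "jsonl").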