SFTGenerator
The SFTGenerator class is designed for generating high-quality instruction-response pairs for supervised fine-tuning (SFT) of language models. It uses an "evolve-then-verify" approach with document grounding to create diverse, contextually relevant instruction sets.
Overview
SFTGenerator evolves seed instructions through a set of evolution strategies to produce varied training data. It can generate data from seed instructions alone or ground it in specific documents; when documents are supplied, it checks that the generated questions are answerable from the provided context.
Installation
pip install phinitydata
Basic Usage
import os

from phinitydata.testset.sft_generator import SFTGenerator

# Initialize with an API key read from the environment
api_key = os.getenv("OPENAI_API_KEY")
generator = SFTGenerator(
    api_key=api_key,
    llm_model="gpt-4o-mini",
    temperature=0.7,
)

# Define seed instructions
seed_instructions = [
    "What is machine learning?",
    "Explain how neural networks work",
]

# Generate evolved instructions and export them as JSON Lines
results = generator.generate(
    seed_instructions=seed_instructions,
    target_samples=10,
    domain_context="artificial intelligence",
    verbose=True,
    export_format="jsonl",  # match the .jsonl extension; the default is "json"
    export_path="evolved_instructions.jsonl",
)
Initialization Parameters
api_key (Optional[str], default=None): OpenAI API key. If not provided, the OPENAI_API_KEY environment variable is used.
llm_model (str, default="gpt-4o-mini"): Model to use for generation.
embedding_model (Optional[Embeddings], default=None): Optional custom embedding model.
vector_store_type (str, default="faiss"): Type of vector store to use ("faiss" or "chroma").
temperature (float, default=0.7): Temperature for generation.
Key Methods
generate
Generate domain-specific instruction-response pairs with document grounding, using a population-based evolution approach.
Parameters:
seed_instructions (List[str], required): Initial instruction seeds
documents (Optional[List[str]], default=None): Optional documents for grounding/verification
target_samples (int, default=10): Number of samples to generate
domain_context (str, default=""): Context about the domain for guiding generation
evolution_config (Optional[Dict], default=None): Configuration for evolution strategies
strict_grounding (bool, default=False): Whether to strictly verify answerability
verbose (bool, default=False): Whether to print detailed progress
export_format (Literal["json", "jsonl"], default="json"): Format to export results
export_path (Optional[str], default=None): Optional path to export results
Returns:
A dictionary containing:
samples: List of generated instruction-response pairs
metrics: Statistics about the generation process
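As a minimal sketch of working with the return value, the helper below reads only the two documented top-level keys, samples and metrics. The helper name summarize_results is hypothetical, and because the fields inside each sample and inside metrics are not specified here, the code treats them as opaque values.

```python
from typing import Any, Dict


def summarize_results(results: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize the dictionary returned by generate().

    Only the documented top-level keys ("samples" and "metrics") are read;
    missing keys fall back to empty containers.
    """
    samples = results.get("samples", [])
    metrics = results.get("metrics", {})
    return {"num_samples": len(samples), "metrics": metrics}


# Stand-in for a real generate() result; the sample fields are placeholders.
example = {"samples": [{"placeholder": 1}, {"placeholder": 2}], "metrics": {}}
print(summarize_results(example))
```

With a real run, the same helper would be called as summarize_results(results) on the value returned by generator.generate(...).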
add_evolution_strategy
Add a custom evolution strategy to the generator.
Parameters:
name (str): Unique name for the strategy
description (str): Description of what the strategy does
prompt_template (str): Template for evolving prompts with placeholders
Built-in Evolution Strategies
The SFTGenerator includes several built-in evolution strategies, each with its own prompt template for transforming instructions in different ways.
1. Deepening Strategy
Description: Makes instructions more detailed and specific while maintaining length.
Prompt Template:
2. Concretizing Strategy
Description: Adds concrete examples or scenarios while maintaining length.
Prompt Template:
3. Reasoning Strategy
Description: Asks for reasoning or step-by-step explanations.
Prompt Template:
4. Comparative Strategy
Description: Transforms the prompt to include comparison elements.
Prompt Template:
Custom Evolution Strategies
One of the most powerful features of SFTGenerator is the ability to add custom evolution strategies tailored to specific domains or applications. This allows you to generate instructions that have domain-specific characteristics, terminology, and patterns.
When to Create Custom Evolution Strategies
Consider creating custom evolution strategies when:
You need to generate data for a specialized domain (finance, healthcare, legal, etc.)
You want instructions to follow specific formats or include domain-specific elements
You want to emphasize certain types of questions within your dataset
You need to adapt the complexity level for specific use cases
Example: Adding a Domain-Specific Strategy
Here's an example of adding a custom strategy for financial domain instruction evolution:
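A minimal sketch of such a strategy, using the documented add_evolution_strategy parameters (name, description, prompt_template). The strategy name, description, and template wording are illustrative, and the {instruction} placeholder is an assumption: check the built-in templates for the placeholder names the library actually substitutes. The generator is passed in as an argument so the snippet runs on its own.

```python
# Hypothetical template; "{instruction}" is an assumed placeholder name.
FINANCIAL_TEMPLATE = (
    "Rewrite the following instruction so that it involves a concrete "
    "financial scenario (for example, a balance sheet, a portfolio, or a "
    "loan), keeping it answerable and about the same length.\n\n"
    "Original instruction: {instruction}\n"
    "Rewritten instruction:"
)


def register_financial_strategy(generator) -> None:
    """Register an illustrative finance strategy on an SFTGenerator instance."""
    generator.add_evolution_strategy(
        name="financial_scenario",  # illustrative name, not a built-in
        description="Recasts instructions as concrete financial scenarios.",
        prompt_template=FINANCIAL_TEMPLATE,
    )
```

With a real generator, this would be called as register_financial_strategy(generator) before calling generate.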
Example: Technical Documentation Strategy
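The sketch below shows one way a technical documentation strategy might be registered, with a small guard that rejects templates missing the expected placeholder. The helper register_checked_strategy, the strategy name, and the template text are hypothetical, and {instruction} is an assumed placeholder name rather than one confirmed by the library.

```python
# Hypothetical template; "{instruction}" is an assumed placeholder name.
TECH_DOC_TEMPLATE = (
    "Rewrite the following instruction as a question a developer might ask "
    "while reading technical documentation, such as how to configure, call, "
    "or troubleshoot a specific feature.\n\n"
    "Original instruction: {instruction}\n"
    "Rewritten instruction:"
)


def register_checked_strategy(generator, name, description, prompt_template,
                              placeholder="{instruction}"):
    """Register a strategy after checking the template has the placeholder."""
    if placeholder not in prompt_template:
        raise ValueError(f"prompt_template must contain {placeholder}")
    generator.add_evolution_strategy(
        name=name,
        description=description,
        prompt_template=prompt_template,
    )
```

A call might look like register_checked_strategy(generator, "technical_documentation", "Recasts instructions as documentation questions.", TECH_DOC_TEMPLATE).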
Document Grounding
When documents are provided, the generator ensures evolved instructions remain answerable from the document context:
Verifies answerability against the document set
Repairs instructions that drift from the document context
Simplifies overly complex instructions that cannot be fully answered
Supports partial answerability for complex instructions
Document Grounding Example
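A minimal sketch of a grounded run, using only parameters listed for generate (seed_instructions, documents, target_samples, strict_grounding, verbose). The wrapper name generate_grounded is hypothetical; the generator is passed in so the snippet does not depend on an API key to be read or tested.

```python
from typing import Any, Dict, List


def generate_grounded(generator, seeds: List[str], documents: List[str],
                      target_samples: int = 5) -> Dict[str, Any]:
    """Run generate() with documents and strict answerability checks."""
    if not documents:
        raise ValueError("at least one document is required for grounding")
    return generator.generate(
        seed_instructions=seeds,
        documents=documents,       # texts the instructions must be answerable from
        target_samples=target_samples,
        strict_grounding=True,     # strictly verify answerability
        verbose=True,
    )
```

With an initialized SFTGenerator, a call might look like generate_grounded(generator, ["What does the warranty cover?"], [warranty_text]), where warranty_text is a string holding the document contents.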
Vector Database Integration
The SFTGenerator supports integration with vector databases for efficient document retrieval:
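As a sketch, the vector store is selected at construction time through the documented vector_store_type parameter, which accepts "faiss" or "chroma". The helper build_generator is hypothetical; it takes the generator class as an argument so the snippet can be run without the package installed.

```python
import os

SUPPORTED_VECTOR_STORES = ("faiss", "chroma")


def build_generator(generator_cls, vector_store_type: str = "faiss"):
    """Construct a generator with the requested vector store backend."""
    if vector_store_type not in SUPPORTED_VECTOR_STORES:
        raise ValueError(
            f"vector_store_type must be one of {SUPPORTED_VECTOR_STORES}"
        )
    return generator_cls(
        api_key=os.getenv("OPENAI_API_KEY"),
        llm_model="gpt-4o-mini",
        vector_store_type=vector_store_type,
        temperature=0.7,
    )
```

With the package installed, a Chroma-backed generator would be created with build_generator(SFTGenerator, "chroma").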
Error Handling
The generator includes robust error handling for API failures, unverifiable instructions, and other potential issues. When generating at scale, it's recommended to use the verbose=True option for detailed progress tracking.
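For large runs, a caller-side retry wrapper can complement the generator's own handling. The sketch below is not part of the library: call_with_retries is a hypothetical helper, and it catches Exception broadly because the specific exception types the generator raises are not documented here.

```python
import time
from typing import Any, Callable


def call_with_retries(fn: Callable[[], Any], max_attempts: int = 3,
                      base_delay: float = 1.0) -> Any:
    """Call fn(), retrying with exponential backoff when it raises."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))
```

A generation call could then be wrapped as call_with_retries(lambda: generator.generate(seed_instructions=seeds, verbose=True)).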
Advanced Configuration
Evolution Configuration
You can customize the evolution process by providing a configuration dictionary:
Output Format Options
The generator supports different output formats:
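The export_format parameter accepts "json" or "jsonl", and export_path sets the destination file. As a sketch of reading either format back with the standard library, the hypothetical helper below returns whatever was parsed; the schema of the exported records is not specified here, so no field names are assumed.

```python
import json
from pathlib import Path
from typing import Any


def load_export(path: str, export_format: str = "json") -> Any:
    """Load a file written with export_format "json" or "jsonl"."""
    text = Path(path).read_text(encoding="utf-8")
    if export_format == "json":
        # A single JSON document.
        return json.loads(text)
    if export_format == "jsonl":
        # One JSON value per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    raise ValueError('export_format must be "json" or "jsonl"')
```

For example, load_export("evolved_instructions.jsonl", "jsonl") would return a list with one parsed entry per line of the file.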
