The SFTGenerator class is designed for generating high-quality instruction-response pairs for supervised fine-tuning (SFT) of language models. It uses an "evolve-then-verify" approach with document grounding to create diverse, contextually relevant instruction sets.
Overview
SFTGenerator evolves a small set of seed instructions into a larger, more varied set by applying evolution strategies (deepening, concretizing, reasoning, comparative, or custom ones). It can generate data from scratch or ground it in specific documents, so that the generated questions are answerable from the provided context.
Installation
pip install phinitydata
Basic Usage
import os
from phinitydata.testset.sft_generator import SFTGenerator
# Initialize with API key
api_key = os.getenv("OPENAI_API_KEY")
generator = SFTGenerator(
    api_key=api_key,
    llm_model="gpt-4o-mini",
    temperature=0.7
)
# Define seed instructions
seed_instructions = [
    "What is machine learning?",
    "Explain how neural networks work"
]
# Generate evolved instructions
results = generator.generate(
    seed_instructions=seed_instructions,
    target_samples=10,
    domain_context="artificial intelligence",
    verbose=True,
    export_path="evolved_instructions.jsonl"
)
Initialization Parameters
api_key (Optional[str], default=None): OpenAI API key. If not provided, the generator looks for the OPENAI_API_KEY environment variable
llm_model (str, default="gpt-4o-mini"): Model to use for generation
embedding_model (Optional[Embeddings], default=None): Optional custom embedding model
vector_store_type (str, default="faiss"): Type of vector store to use ("faiss" or "chroma")
temperature (float, default=0.7): Temperature for generation
Key Methods
generate
Generate domain-specific instruction-response pairs with document grounding using a population-based evolution approach.
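add_evolution_strategy
Register a custom evolution strategy that can then be referenced by name in evolution_config.
name (str): Name used to reference the strategy
description (str): Description of what the strategy does
prompt_template (str): Template for evolving prompts, with placeholders such as {domain_summary}, {original_prompt}, and {recent_history}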
Built-in Evolution Strategies
The SFTGenerator includes several built-in evolution strategies, each with its own prompt template for transforming instructions in different ways.
1. Deepening Strategy
Description: Makes instructions more detailed and specific while maintaining length.
Prompt Template:
Create a more specific version of the given prompt.
The evolved prompt should:
1. Add specific details or requirements THAT CAN BE ANSWERED from the document context
2. Maintain similar length to the original prompt (within 20% longer/shorter)
3. Keep the core intent clear and focused
4. NOT introduce elements that aren't supported by the document context
5. Return ONLY the evolved prompt without any prefix
DOMAIN CONTEXT: {domain_summary}
ORIGINAL PROMPT: {original_prompt}
LENGTH GUIDE: Keep close to {original_length} words
RECENT INSTRUCTIONS (avoid repeating these):
{recent_history}
Create a focused, specific version:
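To make the placeholders concrete, here is a minimal sketch of how a template like this could be filled in. It assumes the placeholders are substituted with Python's str.format and that {original_length} is a whitespace-separated word count; the library's internal formatting code is not shown in this document, so fill_template and the shortened template below are hypothetical.

```python
# Hypothetical illustration: str.format substitution and the word-count
# definition of {original_length} are assumptions, not the library's code.
DEEPENING_TEMPLATE = """Create a more specific version of the given prompt.
DOMAIN CONTEXT: {domain_summary}
ORIGINAL PROMPT: {original_prompt}
LENGTH GUIDE: Keep close to {original_length} words
RECENT INSTRUCTIONS (avoid repeating these):
{recent_history}
Create a focused, specific version:"""


def fill_template(template, original_prompt, domain_summary, recent):
    """Fill the four placeholders used by the built-in templates."""
    return template.format(
        domain_summary=domain_summary,
        original_prompt=original_prompt,
        # Word count of the original prompt, used as the length guide.
        original_length=len(original_prompt.split()),
        # One recent instruction per line, so the model can avoid repeats.
        recent_history="\n".join(f"- {item}" for item in recent),
    )


filled = fill_template(
    DEEPENING_TEMPLATE,
    original_prompt="What is machine learning?",
    domain_summary="artificial intelligence",
    recent=["Explain how neural networks work"],
)
print(filled)
```

Templates that omit a placeholder (the reasoning and comparative templates have no {original_length}) still work with this approach, because str.format ignores unused keyword arguments.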
2. Concretizing Strategy
Description: Adds concrete examples or scenarios while maintaining length.
Prompt Template:
Create a version with concrete examples that are answerable from the document context.
The evolved prompt should:
1. Include specific examples from the document context
2. Maintain similar length to the original prompt (within 20% longer/shorter)
3. Keep the core question clear and focused
4. NOT add unnecessary complexity
5. Return ONLY the evolved prompt without any prefix
DOMAIN CONTEXT: {domain_summary}
ORIGINAL PROMPT: {original_prompt}
LENGTH GUIDE: Keep close to {original_length} words
RECENT INSTRUCTIONS (avoid repeating these):
{recent_history}
Create a version with concrete examples:
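Both the deepening and concretizing templates ask the model to stay within 20% of the original length. If you want to check that constraint on your own outputs, a simple word-count comparison is one option. The helper below is a hypothetical post-processing check, not part of the SFTGenerator API, and it assumes length is measured in whitespace-separated words.

```python
def within_length_tolerance(original, evolved, tolerance=0.20):
    """Return True if the evolved prompt's word count is within
    `tolerance` (as a fraction) of the original prompt's word count."""
    original_words = len(original.split())
    evolved_words = len(evolved.split())
    if original_words == 0:
        # An empty original only matches an empty evolved prompt.
        return evolved_words == 0
    return abs(evolved_words - original_words) <= tolerance * original_words


seed = "Explain how neural networks work"
evolved = "Explain how convolutional neural networks detect edges in images"
print(within_length_tolerance(seed, evolved))
```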
3. Reasoning Strategy
Description: Asks for reasoning or step-by-step explanations.
Prompt Template:
Rewrite the given prompt to focus on reasoning or explanations.
The evolved prompt should:
1. Ask for step-by-step explanations
2. Request reasoning behind concepts
3. Maintain the core intent of the original prompt
4. Return ONLY the evolved prompt without any prefix like "Revised Prompt:"
DOMAIN CONTEXT: {domain_summary}
ORIGINAL PROMPT: {original_prompt}
RECENT INSTRUCTIONS (avoid repeating these):
{recent_history}
Create a version that focuses on reasoning:
4. Comparative Strategy
Description: Transforms the prompt to include comparison elements.
Prompt Template:
Rewrite the given prompt to include comparative analysis.
The evolved prompt should:
1. Ask to compare or contrast related concepts
2. Include evaluation of different approaches or perspectives
3. Maintain the core intent of the original prompt
4. Return ONLY the evolved prompt without any prefix like "Revised Prompt:"
DOMAIN CONTEXT: {domain_summary}
ORIGINAL PROMPT: {original_prompt}
RECENT INSTRUCTIONS (avoid repeating these):
{recent_history}
Create a version with comparative elements:
Custom Evolution Strategies
SFTGenerator lets you add custom evolution strategies tailored to specific domains or applications, so that generated instructions carry domain-specific characteristics, terminology, and patterns.
Custom strategies are useful when:
You need to generate data for a specialized domain (finance, healthcare, legal, etc.)
You want instructions to follow specific formats or include domain-specific elements
You want to emphasize certain types of questions within your dataset
You need to adapt the complexity level for specific use cases
Example: Adding a Domain-Specific Strategy
The following example adds a custom strategy for the financial domain and then uses it alongside the built-in deepening strategy. It reuses api_key and seed_instructions from Basic Usage:
# Initialize generator
generator = SFTGenerator(api_key=api_key)
# Add custom financial domain strategy
generator.add_evolution_strategy(
    name="financial_domain",
    description="Adapts instructions to include financial terminology and concepts",
    prompt_template="""
Rewrite the given prompt to be specific to the financial domain.
The evolved prompt should:
1. Use financial terminology and concepts
2. Reference financial scenarios or use cases
3. Maintain the core intent of the original prompt
4. Be clear and professional
DOMAIN CONTEXT: {domain_summary}
ORIGINAL PROMPT: {original_prompt}
RECENT INSTRUCTIONS (avoid repeating these):
{recent_history}
Create a finance-specific version:
"""
)
# Use the custom strategy in generation
results = generator.generate(
    seed_instructions=seed_instructions,
    evolution_config={
        "strategies": ["financial_domain", "deepening"],
        "weights": [0.7, 0.3]
    },
    domain_context="investment and portfolio management",
    target_samples=10
)
Example: Technical Documentation Strategy
generator.add_evolution_strategy(
    name="technical_docs",
    description="Creates instructions focused on technical documentation comprehension",
    prompt_template="""
Transform the given prompt into a technical documentation comprehension task.
The evolved prompt should:
1. Focus on understanding technical concepts from documentation
2. Ask for explanations of functions, parameters, or system architecture
3. Maintain the core intent of the original prompt
4. Be specific to software/API documentation contexts
DOMAIN CONTEXT: {domain_summary}
ORIGINAL PROMPT: {original_prompt}
RECENT INSTRUCTIONS (avoid repeating these):
{recent_history}
Create a technical documentation-focused version:
"""
)
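Before registering a custom template, it can help to confirm that it contains the placeholders used in the examples above. The sketch below uses the standard library's string.Formatter to list a template's fields. The required set is an assumption based on the templates shown in this document, and missing_placeholders is a hypothetical helper rather than a library function.

```python
from string import Formatter

# Assumption: these are the placeholders every custom template in this
# document uses; the library may accept more or fewer.
REQUIRED_PLACEHOLDERS = {"domain_summary", "original_prompt", "recent_history"}


def missing_placeholders(template, required=REQUIRED_PLACEHOLDERS):
    """Return the set of required placeholder names absent from the template."""
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion);
    # field_name is None for trailing literal text.
    found = {field for _, field, _, _ in Formatter().parse(template) if field}
    return set(required) - found


template = """
Rewrite the given prompt to be specific to the financial domain.
DOMAIN CONTEXT: {domain_summary}
ORIGINAL PROMPT: {original_prompt}
Create a finance-specific version:
"""
print(missing_placeholders(template))
```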
Document Grounding
When documents are provided, the generator ensures evolved instructions remain answerable from the document context:
Verifies answerability against the document set
Repairs instructions that drift from the document context
Simplifies overly complex instructions that cannot be fully answered
Supports partial answerability for complex instructions
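The verification step is performed by the generator itself and its implementation is not described here. To give a rough intuition for what "answerable from the document context" means, the toy sketch below scores an instruction by the fraction of its content words that appear in the documents. This keyword-overlap heuristic is a deliberately simple stand-in, not the library's method, and the stopword list and 0.6 threshold are arbitrary assumptions.

```python
import re

# Small, arbitrary stopword list for the toy heuristic.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "how",
             "what", "do", "does", "and", "or", "for", "on"}


def content_words(text):
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS}


def coverage(instruction, documents):
    """Fraction of the instruction's content words found in any document."""
    words = content_words(instruction)
    if not words:
        return 0.0
    doc_words = set().union(*(content_words(d) for d in documents))
    return len(words & doc_words) / len(words)


def is_grounded(instruction, documents, threshold=0.6):
    """Toy answerability check: enough of the instruction is covered."""
    return coverage(instruction, documents) >= threshold


documents = [
    "Machine learning is a subset of artificial intelligence...",
    "Neural networks are composed of layers of interconnected nodes...",
]
print(is_grounded("What is machine learning?", documents))
print(is_grounded("How do quantum computers factor integers?", documents))
```

A coverage value between 0 and the threshold loosely corresponds to the partial answerability case listed above, where an instruction might be repaired or simplified rather than discarded.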
Document Grounding Example
# Document grounding example
documents = [
    "Machine learning is a subset of artificial intelligence...",
    "Neural networks are composed of layers of interconnected nodes..."
]
results = generator.generate(
    seed_instructions=seed_instructions,
    target_samples=5,
    documents=documents,  # Provide documents for grounding
    domain_context="artificial intelligence",
    strict_grounding=True,  # Enforce strict answerability verification
    export_path="grounded_instructions.jsonl"
)
Vector Database Integration
The SFTGenerator supports integration with vector databases for efficient document retrieval:
# Using ChromaDB for document storage
generator = SFTGenerator(
    api_key=api_key,
    vector_store_type="chroma"  # Use ChromaDB instead of FAISS
)
# Generated instructions will be verified against ChromaDB document store
Error Handling
The generator handles API failures, unverifiable instructions, and similar issues during generation. When generating at scale, pass verbose=True to get detailed progress tracking.
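For long runs you may also want your own safety net around the call. The sketch below is a generic retry wrapper with exponential backoff. It is not part of SFTGenerator, and because this document does not list the exception types the library raises, it catches Exception broadly; narrowing that is advisable once you know which errors are transient.

```python
import time


def call_with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() and retry on failure, doubling the delay each time.

    Re-raises the last exception once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:  # Assumption: any exception may be transient
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)


# Possible usage with a generator created as in Basic Usage:
# results = call_with_retries(
#     lambda: generator.generate(
#         seed_instructions=seed_instructions,
#         target_samples=10,
#         verbose=True,
#     )
# )
```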
Advanced Configuration
Evolution Configuration
You can customize the evolution process by providing a configuration dictionary:
evolution_config = {
    "strategies": ["deepening", "concretizing", "reasoning", "comparative"],
    "weights": [0.3, 0.3, 0.2, 0.2]  # Probability weights for each strategy
}
results = generator.generate(
    seed_instructions=seed_instructions,
    evolution_config=evolution_config,
    target_samples=10
)
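To illustrate what probability weights imply, the sketch below validates a configuration of this shape and draws one strategy per evolution step with random.choices. How SFTGenerator actually samples strategies is not documented here, so treat validate_evolution_config and pick_strategy as hypothetical helpers showing one plausible reading of the weights.

```python
import random
from collections import Counter


def validate_evolution_config(config):
    """Check that strategies and weights line up and weights are usable."""
    strategies = config["strategies"]
    weights = config["weights"]
    if len(strategies) != len(weights):
        raise ValueError("strategies and weights must have the same length")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative and sum to more than 0")


def pick_strategy(config, rng=random):
    """Draw one strategy name with probability proportional to its weight."""
    validate_evolution_config(config)
    return rng.choices(config["strategies"], weights=config["weights"], k=1)[0]


evolution_config = {
    "strategies": ["deepening", "concretizing", "reasoning", "comparative"],
    "weights": [0.3, 0.3, 0.2, 0.2],
}
rng = random.Random(0)  # Seeded for repeatable draws
counts = Counter(pick_strategy(evolution_config, rng) for _ in range(1000))
print(counts)
```

Over many draws the counts should roughly follow the weights, so a strategy weighted 0.3 is picked about 1.5 times as often as one weighted 0.2.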
Output Format Options
The generator supports different output formats:
# Export as JSONL (one JSON object per line)
results = generator.generate(
    seed_instructions=seed_instructions,
    export_format="jsonl",
    export_path="instructions.jsonl"
)
# Export as JSON (full structured object)
results = generator.generate(
    seed_instructions=seed_instructions,
    export_format="json",
    export_path="instructions.json"
)
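Once a file has been exported, it can be read back with the standard library alone. The sketch below loads either format based on the file extension. The record schema is not specified in this document, so the loader returns whatever JSON it finds without assuming field names, and load_export is a hypothetical helper.

```python
import json
from pathlib import Path


def load_export(path):
    """Load an exported file.

    .jsonl files become a list with one parsed object per non-empty line;
    any other file is parsed as a single JSON document.
    """
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".jsonl":
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return json.loads(text)


# Possible usage after running the exports above:
# records = load_export("instructions.jsonl")
# print(len(records), "records loaded")
```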