The SFTGenerator class is designed for generating high-quality instruction-response pairs for supervised fine-tuning (SFT) of language models. It uses an "evolve-then-verify" approach with document grounding to create diverse, contextually relevant instruction sets.
Overview
SFTGenerator evolves seed instructions through a set of strategies (deepening, concretizing, reasoning, and comparative) to produce diverse training data. It can generate data from scratch or ground it in specific documents, so that the generated questions are answerable from the provided context.
Installation
pip install phinitydata
Basic Usage
from phinitydata.testset.sft_generator import SFTGenerator
import os

# Initialize with API key
api_key = os.getenv("OPENAI_API_KEY")
generator = SFTGenerator(
    api_key=api_key,
    llm_model="gpt-4o-mini",
    temperature=0.7
)

# Define seed instructions
seed_instructions = [
    "What is machine learning?",
    "Explain how neural networks work"
]

# Generate evolved instructions
results = generator.generate(
    seed_instructions=seed_instructions,
    target_samples=10,
    domain_context="artificial intelligence",
    verbose=True,
    export_path="evolved_instructions.jsonl"
)
Initialization Parameters
api_key (Optional[str], default=None): OpenAI API key. If not provided, the OPENAI_API_KEY environment variable is used
llm_model (str, default="gpt-4o-mini"): Model to use for generation
embedding_model (Optional[Embeddings], default=None): Optional custom embedding model
vector_store_type (str, default="faiss"): Type of vector store to use ("faiss" or "chroma")
temperature (float, default=0.7): Temperature for generation
Key Methods
generate
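Generate domain-specific instruction-response pairs with document grounding using a population-based evolution approach.
Parameters:
seed_instructions (List[str]): Seed instructions to evolve, as passed in the Basic Usage example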
documents (Optional[List[str]], default=None): Optional documents for grounding/verification
target_samples (int, default=10): Number of samples to generate
domain_context (str, default=""): Context about the domain for guiding generation
evolution_config (Optional[Dict], default=None): Configuration for evolution strategies
strict_grounding (bool, default=False): Whether to strictly verify answerability
verbose (bool, default=False): Whether to print detailed progress
export_format (Literal["json", "jsonl"], default="json"): Format to export results
export_path (Optional[str], default=None): Optional path to export results
Returns:
A dictionary containing:
samples: List of generated instruction-response pairs
metrics: Statistics about the generation process
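A short sketch of how the returned dictionary might be consumed. Only the top-level `samples` and `metrics` keys come from the description above; the per-sample field names (`instruction`, `response`) and the metric name in the stand-in data are assumptions for illustration and may differ in the actual library.

```python
from typing import Any, Dict, List, Tuple


def split_results(results: Dict[str, Any]) -> Tuple[List[dict], Dict[str, Any]]:
    """Return (samples, metrics), tolerating a missing key."""
    samples = results.get("samples", [])
    metrics = results.get("metrics", {})
    return samples, metrics


# Stand-in for the value returned by generator.generate(...).
# The inner field names and the metric name are hypothetical.
example_results = {
    "samples": [
        {"instruction": "What is machine learning?", "response": "..."},
        {"instruction": "Explain how neural networks work", "response": "..."},
    ],
    "metrics": {"total_samples": 2},
}

samples, metrics = split_results(example_results)
print(f"{len(samples)} samples generated")
for key, value in metrics.items():
    print(f"{key}: {value}")
```

Inspecting one real sample with `print(results["samples"][0])` is the safest way to confirm the actual field names before building on them.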
add_evolution_strategy
Add a custom evolution strategy to the generator.
Parameters:
name (str): Unique name for the strategy
description (str): Description of what the strategy does
prompt_template (str): Template for evolving prompts with placeholders
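To make the placeholder mechanism concrete, here is a minimal sketch of how a template could be filled in. The placeholder names (`{domain_summary}`, `{original_prompt}`, `{original_length}`, `{recent_history}`) are the ones that appear in this page's templates; the use of `str.format`, the helper `fill_template`, and the sample "concise" template are assumptions, not the library's confirmed internals.

```python
from typing import List

# Hypothetical template, written in the same shape as the built-in ones.
TEMPLATE = """Rewrite the given prompt to be more concise.
DOMAIN CONTEXT: {domain_summary}
ORIGINAL PROMPT: {original_prompt}
RECENT INSTRUCTIONS (avoid repeating these):
{recent_history}
Create a concise version:"""


def fill_template(template: str, original_prompt: str,
                  domain_summary: str, recent: List[str]) -> str:
    """Substitute placeholder values into a strategy template.

    Placeholders the template does not use (here, original_length)
    are simply ignored by str.format.
    """
    return template.format(
        domain_summary=domain_summary,
        original_prompt=original_prompt,
        original_length=len(original_prompt.split()),
        recent_history="\n".join(f"- {item}" for item in recent) or "(none)",
    )


prompt = fill_template(
    TEMPLATE,
    original_prompt="What is machine learning?",
    domain_summary="artificial intelligence",
    recent=["Explain how neural networks work"],
)
print(prompt)
```

If substitution does work this way, any literal braces in a custom template would need to be doubled (`{{` and `}}`) to survive formatting.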
Built-in Evolution Strategies
The SFTGenerator includes several built-in evolution strategies, each with its own prompt template for transforming instructions in different ways.
1. Deepening Strategy
Description: Makes instructions more detailed and specific while maintaining length.
Prompt Template:
2. Concretizing Strategy
Description: Adds concrete examples or scenarios while maintaining length.
Prompt Template:
3. Reasoning Strategy
Description: Asks for reasoning or step-by-step explanations.
Prompt Template:
4. Comparative Strategy
Description: Transforms the prompt to include comparison elements.
Prompt Template:
Custom Evolution Strategies
SFTGenerator lets you add custom evolution strategies tailored to a specific domain or application, so that generated instructions carry that domain's characteristics, terminology, and patterns.
Consider adding a custom strategy when:
You need to generate data for a specialized domain (finance, healthcare, legal, etc.)
You want instructions to follow specific formats or include domain-specific elements
You want to emphasize certain types of questions within your dataset
You need to adapt the complexity level for specific use cases
Example: Adding a Domain-Specific Strategy
Here's an example of adding a custom strategy for financial domain instruction evolution:
Example: Technical Documentation Strategy
Document Grounding
When documents are provided, the generator ensures evolved instructions remain answerable from the document context:
Verifies answerability against the document set
Repairs instructions that drift from the document context
Simplifies overly complex instructions that cannot be fully answered
Supports partial answerability for complex instructions
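The idea of full versus partial answerability can be illustrated with a deliberately crude stand-in. The sketch below scores an instruction by how many of its content words occur in the documents; this lexical-overlap heuristic, the thresholds, and the names `content_words` and `grounding_level` are all assumptions made for illustration. The library's own verification is not described here and is likely more sophisticated.

```python
import re
from typing import List, Set

STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "and", "how",
             "what", "why", "do", "does", "it", "for", "on", "with"}


def content_words(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens with common stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS}


def grounding_level(instruction: str, documents: List[str],
                    full: float = 0.8, partial: float = 0.4) -> str:
    """Classify an instruction as 'full', 'partial', or 'none' by word overlap."""
    words = content_words(instruction)
    if not words:
        return "none"
    doc_words = set().union(*(content_words(d) for d in documents))
    coverage = len(words & doc_words) / len(words)
    if coverage >= full:
        return "full"
    if coverage >= partial:
        return "partial"
    return "none"


docs = [
    "Machine learning is a subset of artificial intelligence.",
    "Neural networks are composed of layers of interconnected nodes.",
]
for question in [
    "What is machine learning?",
    "How do neural networks use backpropagation?",
    "Describe quantum annealing hardware",
]:
    print(grounding_level(question, docs), "-", question)
```

In a pipeline like the one described above, a "partial" result would be a natural trigger for the repair or simplification step, and "none" for discarding the evolved instruction.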
Document Grounding Example
Vector Database Integration
The SFTGenerator supports integration with vector databases for efficient document retrieval:
Error Handling
The generator handles API failures, unverifiable instructions, and other errors that occur during generation. When generating at scale, set verbose=True to get detailed progress tracking.
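For long runs it can also help to guard the call site itself. The following is a minimal sketch of a generic retry wrapper with exponential backoff; `call_with_retries` is a hypothetical helper, not part of the library, and the simulated failure stands in for a transient API error.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(fn: Callable[[], T], max_attempts: int = 3,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying with exponential backoff on any exception."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.2f}s")
            sleep(delay)
    raise RuntimeError("unreachable")


# Simulated flaky call: fails twice, then succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated API failure")
    return {"samples": [], "metrics": {}}


result = call_with_retries(flaky, base_delay=0.01)
print(result)
```

With a real generator the call would be wrapped in a lambda, for example `call_with_retries(lambda: generator.generate(seed_instructions=seed_instructions, target_samples=10, verbose=True))`.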
Advanced Configuration
Evolution Configuration
You can customize the evolution process by providing a configuration dictionary:
Prompt template for the Deepening Strategy:
Create a more specific version of the given prompt.
The evolved prompt should:
1. Add specific details or requirements THAT CAN BE ANSWERED from the document context
2. Maintain similar length to the original prompt (within 20% longer/shorter)
3. Keep the core intent clear and focused
4. NOT introduce elements that aren't supported by the document context
5. Return ONLY the evolved prompt without any prefix
DOMAIN CONTEXT: {domain_summary}
ORIGINAL PROMPT: {original_prompt}
LENGTH GUIDE: Keep close to {original_length} words
RECENT INSTRUCTIONS (avoid repeating these):
{recent_history}
Create a focused, specific version:
Prompt template for the Concretizing Strategy:
Create a version with concrete examples that are answerable from the document context.
The evolved prompt should:
1. Include specific examples from the document context
2. Maintain similar length to the original prompt (within 20% longer/shorter)
3. Keep the core question clear and focused
4. NOT add unnecessary complexity
5. Return ONLY the evolved prompt without any prefix
DOMAIN CONTEXT: {domain_summary}
ORIGINAL PROMPT: {original_prompt}
LENGTH GUIDE: Keep close to {original_length} words
RECENT INSTRUCTIONS (avoid repeating these):
{recent_history}
Create a version with concrete examples:
Prompt template for the Reasoning Strategy:
Rewrite the given prompt to focus on reasoning or explanations.
The evolved prompt should:
1. Ask for step-by-step explanations
2. Request reasoning behind concepts
3. Maintain the core intent of the original prompt
4. Return ONLY the evolved prompt without any prefix like "Revised Prompt:"
DOMAIN CONTEXT: {domain_summary}
ORIGINAL PROMPT: {original_prompt}
RECENT INSTRUCTIONS (avoid repeating these):
{recent_history}
Create a version that focuses on reasoning:
Prompt template for the Comparative Strategy:
Rewrite the given prompt to include comparative analysis.
The evolved prompt should:
1. Ask to compare or contrast related concepts
2. Include evaluation of different approaches or perspectives
3. Maintain the core intent of the original prompt
4. Return ONLY the evolved prompt without any prefix like "Revised Prompt:"
DOMAIN CONTEXT: {domain_summary}
ORIGINAL PROMPT: {original_prompt}
RECENT INSTRUCTIONS (avoid repeating these):
{recent_history}
Create a version with comparative elements:
# Example: adding a domain-specific strategy
# Initialize generator
generator = SFTGenerator(api_key=api_key)

# Add custom financial domain strategy
generator.add_evolution_strategy(
    name="financial_domain",
    description="Adapts instructions to include financial terminology and concepts",
    prompt_template="""
Rewrite the given prompt to be specific to the financial domain.
The evolved prompt should:
1. Use financial terminology and concepts
2. Reference financial scenarios or use cases
3. Maintain the core intent of the original prompt
4. Be clear and professional
DOMAIN CONTEXT: {domain_summary}
ORIGINAL PROMPT: {original_prompt}
RECENT INSTRUCTIONS (avoid repeating these):
{recent_history}
Create a finance-specific version:
"""
)

# Use the custom strategy in generation
results = generator.generate(
    seed_instructions=seed_instructions,
    evolution_config={
        "strategies": ["financial_domain", "deepening"],
        "weights": [0.7, 0.3]
    },
    domain_context="investment and portfolio management",
    target_samples=10
)
# Example: technical documentation strategy
generator.add_evolution_strategy(
    name="technical_docs",
    description="Creates instructions focused on technical documentation comprehension",
    prompt_template="""
Transform the given prompt into a technical documentation comprehension task.
The evolved prompt should:
1. Focus on understanding technical concepts from documentation
2. Ask for explanations of functions, parameters, or system architecture
3. Maintain the core intent of the original prompt
4. Be specific to software/API documentation contexts
DOMAIN CONTEXT: {domain_summary}
ORIGINAL PROMPT: {original_prompt}
RECENT INSTRUCTIONS (avoid repeating these):
{recent_history}
Create a technical documentation-focused version:
"""
)
# Document grounding example
documents = [
    "Machine learning is a subset of artificial intelligence...",
    "Neural networks are composed of layers of interconnected nodes..."
]

results = generator.generate(
    seed_instructions=seed_instructions,
    target_samples=5,
    documents=documents,  # Provide documents for grounding
    domain_context="artificial intelligence",
    strict_grounding=True,  # Enforce strict answerability verification
    export_path="grounded_instructions.jsonl"
)
# Using ChromaDB for document storage
generator = SFTGenerator(
    api_key=api_key,
    vector_store_type="chroma"  # Use ChromaDB instead of FAISS
)
# Generated instructions will be verified against the ChromaDB document store
# Evolution configuration: which strategies to use and how often
evolution_config = {
    "strategies": ["deepening", "concretizing", "reasoning", "comparative"],
    "weights": [0.3, 0.3, 0.2, 0.2]  # Probability weights for each strategy
}

results = generator.generate(
    seed_instructions=seed_instructions,
    evolution_config=evolution_config,
    target_samples=10
)
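The weights in the configuration above are described as probability weights for each strategy. One way to picture that is weighted random selection, sketched below with the standard library's `random.choices`. Whether the library samples strategies this way is an assumption, and `pick_strategy` is a hypothetical helper.

```python
import random
from collections import Counter
from typing import Dict, List, Optional


def pick_strategy(config: Dict[str, List],
                  rng: Optional[random.Random] = None) -> str:
    """Draw one strategy name with probability proportional to its weight."""
    rng = rng or random.Random()
    strategies, weights = config["strategies"], config["weights"]
    if len(strategies) != len(weights):
        raise ValueError("strategies and weights must have the same length")
    return rng.choices(strategies, weights=weights, k=1)[0]


config = {
    "strategies": ["deepening", "concretizing", "reasoning", "comparative"],
    "weights": [0.3, 0.3, 0.2, 0.2],
}

# Draw many times with a fixed seed and compare observed shares to the weights.
rng = random.Random(0)
counts = Counter(pick_strategy(config, rng) for _ in range(10_000))
for name, weight in zip(config["strategies"], config["weights"]):
    print(f"{name}: configured {weight:.2f}, observed {counts[name] / 10_000:.3f}")
```

Under this reading, the two lists must be the same length and in the same order, which is why the sketch validates that before sampling.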
# Export as JSONL (one JSON object per line)
results = generator.generate(
    seed_instructions=seed_instructions,
    export_format="jsonl",
    export_path="instructions.jsonl"
)

# Export as JSON (full structured object)
results = generator.generate(
    seed_instructions=seed_instructions,
    export_format="json",
    export_path="instructions.json"
)
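Reading the exported files back is a quick way to check them before training. The sketch below relies only on what the comments above state: JSONL holds one JSON object per line, and JSON holds a single structured object. The helper `load_export` and the contents of the stand-in files are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def load_export(path: str) -> Any:
    """Load a .jsonl file as a list of objects, or any other file as one JSON value."""
    file_path = Path(path)
    text = file_path.read_text(encoding="utf-8")
    if file_path.suffix == ".jsonl":
        # One JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return json.loads(text)


# Write small stand-in files so the example runs without calling the generator.
with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = Path(tmp) / "instructions.jsonl"
    jsonl_path.write_text(
        '{"instruction": "a"}\n{"instruction": "b"}\n', encoding="utf-8"
    )
    json_path = Path(tmp) / "instructions.json"
    json_path.write_text(
        json.dumps({"samples": [], "metrics": {}}), encoding="utf-8"
    )

    records = load_export(str(jsonl_path))
    full = load_export(str(json_path))

print(len(records), "JSONL records; JSON keys:", sorted(full))
```

JSONL tends to suit streaming into training pipelines line by line, while the single JSON object keeps samples and metrics together in one file.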