SFTGenerator

The SFTGenerator class generates instruction-response pairs for supervised fine-tuning (SFT) of language models. It uses an "evolve-then-verify" approach: seed instructions are evolved into new variants and, when documents are supplied, each variant is checked for answerability against them.

Overview

SFTGenerator evolves seed instructions using a set of built-in strategies (deepening, concretizing, reasoning, comparative) plus any custom strategies you register. It can generate data from seed instructions alone or ground it in documents you provide, so that the generated questions are answerable from that context.

Installation

pip install phinitydata

Basic Usage

from phinitydata.testset.sft_generator import SFTGenerator
import os

# Initialize with API key
api_key = os.getenv("OPENAI_API_KEY")
generator = SFTGenerator(
    api_key=api_key,
    llm_model="gpt-4o-mini",
    temperature=0.7
)

# Define seed instructions
seed_instructions = [
    "What is machine learning?",
    "Explain how neural networks work"
]

# Generate evolved instructions
results = generator.generate(
    seed_instructions=seed_instructions,
    target_samples=10,
    domain_context="artificial intelligence",
    verbose=True,
    export_format="jsonl",  # match the .jsonl path; the default is "json"
    export_path="evolved_instructions.jsonl"
)

Initialization Parameters

api_key (Optional[str], default=None): OpenAI API key. If not provided, the OPENAI_API_KEY environment variable is used
llm_model (str, default="gpt-4o-mini"): Model to use for generation
embedding_model (Optional[Embeddings], default=None): Custom embedding model
vector_store_type (str, default="faiss"): Type of vector store to use ("faiss" or "chroma")
temperature (float, default=0.7): Temperature for generation

Key Methods

`generate`

Generate domain-specific instruction-response pairs with document grounding using a population-based evolution approach.

def generate(
    self,
    seed_instructions: List[str],
    documents: Optional[List[str]] = None,
    target_samples: int = 10,
    domain_context: str = "",
    evolution_config: Optional[Dict] = None,
    strict_grounding: bool = False,
    verbose: bool = False,
    export_format: Literal["json", "jsonl"] = "json",
    export_path: Optional[str] = None
) -> Dict

Parameters:

seed_instructions (List[str], required): Initial instruction seeds
documents (Optional[List[str]], default=None): Optional documents for grounding/verification
target_samples (int, default=10): Number of samples to generate
domain_context (str, default=""): Context about the domain for guiding generation
evolution_config (Optional[Dict], default=None): Configuration for evolution strategies
strict_grounding (bool, default=False): Whether to strictly verify answerability
verbose (bool, default=False): Whether to print detailed progress
export_format (Literal["json", "jsonl"], default="json"): Format to export results
export_path (Optional[str], default=None): Optional path to export results

Returns:

A dictionary containing:

samples: List of generated instruction-response pairs
metrics: Statistics about the generation process

`add_evolution_strategy`

Add a custom evolution strategy to the generator.

def add_evolution_strategy(
    self,
    name: str,
    description: str,
    prompt_template: str
) -> None

Parameters:

name (str): Unique name for the strategy
description (str): Description of what the strategy does
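prompt_template (str): Template for evolving prompts. The templates in this guide use the placeholders {domain_summary}, {original_prompt}, and {recent_history}, plus {original_length} in the length-guided ones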
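
Because a misspelled placeholder is easy to miss inside a long template, it can help to list a template's placeholders before registering it. The following is a minimal sketch using only the standard library. The names find_placeholders, unknown_placeholders, and ALLOWED are illustrative and not part of phinitydata, and the allowed set is inferred from the templates shown in this guide rather than from the library's source.

```python
from string import Formatter

# Placeholder names seen in the templates in this guide. This is an assumption
# about what the generator fills in, not a list taken from the library itself.
ALLOWED = {"domain_summary", "original_prompt", "original_length", "recent_history"}


def find_placeholders(template: str) -> set[str]:
    """Return the named {placeholders} used in a format-style template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


def unknown_placeholders(template: str) -> set[str]:
    """Return placeholders that are not in ALLOWED (likely typos)."""
    return find_placeholders(template) - ALLOWED


template = (
    "DOMAIN CONTEXT: {domain_summary}\n"
    "ORIGINAL PROMPT: {original_prompt}\n"
    "RECENT INSTRUCTIONS (avoid repeating these):\n"
    "{recent_history}\n"
)
print(sorted(find_placeholders(template)))
print(unknown_placeholders("ORIGINAL PROMPT: {orignal_prompt}"))  # note the typo
```

An empty result from unknown_placeholders suggests the template only uses names that appear in this guide; whether the generator accepts other names is not covered here.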

Built-in Evolution Strategies

The SFTGenerator includes several built-in evolution strategies, each with its own prompt template for transforming instructions in different ways.

1. Deepening Strategy

Description: Makes instructions more detailed and specific while keeping their length within 20% of the original.

Prompt Template:

Create a more specific version of the given prompt.
The evolved prompt should:
1. Add specific details or requirements THAT CAN BE ANSWERED from the document context
2. Maintain similar length to the original prompt (within 20% longer/shorter)
3. Keep the core intent clear and focused
4. NOT introduce elements that aren't supported by the document context
5. Return ONLY the evolved prompt without any prefix

DOMAIN CONTEXT: {domain_summary}
ORIGINAL PROMPT: {original_prompt}
LENGTH GUIDE: Keep close to {original_length} words

RECENT INSTRUCTIONS (avoid repeating these):
{recent_history}

Create a focused, specific version:
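
The length rule in the Deepening template (within 20% of the original) can also be checked after generation. The sketch below is a hypothetical post-filter, not something the library is documented to run. It counts whitespace-separated words, which is an assumption about how length is measured, and within_length_budget is an illustrative name.

```python
def within_length_budget(original: str, evolved: str, tolerance: float = 0.2) -> bool:
    """True if evolved's word count is within +/- tolerance of original's."""
    original_len = len(original.split())
    evolved_len = len(evolved.split())
    if original_len == 0:
        # Nothing to compare against; only an empty evolution matches.
        return evolved_len == 0
    return abs(evolved_len - original_len) <= tolerance * original_len


original = "Explain how neural networks work"  # 5 words
# 6 words: one word over a 5-word original is exactly the 20% limit.
print(within_length_budget(original, "Explain how convolutional neural networks work"))
# 9 words: well outside the limit.
print(within_length_budget(original, "Explain in detail how deep neural networks learn features"))
```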

2. Concretizing Strategy

Description: Adds concrete examples or scenarios while keeping the length within 20% of the original.

Prompt Template:

Create a version with concrete examples that are answerable from the document context.
The evolved prompt should:
1. Include specific examples from the document context
2. Maintain similar length to the original prompt (within 20% longer/shorter)
3. Keep the core question clear and focused
4. NOT add unnecessary complexity
5. Return ONLY the evolved prompt without any prefix

DOMAIN CONTEXT: {domain_summary}
ORIGINAL PROMPT: {original_prompt}
LENGTH GUIDE: Keep close to {original_length} words

RECENT INSTRUCTIONS (avoid repeating these):
{recent_history}

Create a version with concrete examples:

3. Reasoning Strategy

Description: Asks for reasoning or step-by-step explanations.

Prompt Template:

Rewrite the given prompt to focus on reasoning or explanations.
The evolved prompt should:
1. Ask for step-by-step explanations
2. Request reasoning behind concepts
3. Maintain the core intent of the original prompt
4. Return ONLY the evolved prompt without any prefix like "Revised Prompt:"

DOMAIN CONTEXT: {domain_summary}
ORIGINAL PROMPT: {original_prompt}

RECENT INSTRUCTIONS (avoid repeating these):
{recent_history}

Create a version that focuses on reasoning:

4. Comparative Strategy

Description: Transforms the prompt to include comparison elements.

Prompt Template:

Rewrite the given prompt to include comparative analysis.
The evolved prompt should:
1. Ask to compare or contrast related concepts
2. Include evaluation of different approaches or perspectives
3. Maintain the core intent of the original prompt
4. Return ONLY the evolved prompt without any prefix like "Revised Prompt:"

DOMAIN CONTEXT: {domain_summary}
ORIGINAL PROMPT: {original_prompt}

RECENT INSTRUCTIONS (avoid repeating these):
{recent_history}

Create a version with comparative elements:

Custom Evolution Strategies

SFTGenerator lets you add custom evolution strategies tailored to a specific domain or application, so that generated instructions carry that domain's characteristics, terminology, and patterns.

When to Create Custom Evolution Strategies

Consider creating custom evolution strategies when:

You need to generate data for a specialized domain (finance, healthcare, legal, etc.)
You want instructions to follow specific formats or include domain-specific elements
You want to emphasize certain types of questions within your dataset
You need to adapt the complexity level for specific use cases

Example: Adding a Domain-Specific Strategy

The following example adds a custom strategy that evolves instructions toward the financial domain:

# Initialize generator
generator = SFTGenerator(api_key=api_key)

# Add custom financial domain strategy
generator.add_evolution_strategy(
    name="financial_domain",
    description="Adapts instructions to include financial terminology and concepts",
    prompt_template="""
    Rewrite the given prompt to be specific to the financial domain.
    The evolved prompt should:
    1. Use financial terminology and concepts
    2. Reference financial scenarios or use cases
    3. Maintain the core intent of the original prompt
    4. Be clear and professional
    
    DOMAIN CONTEXT: {domain_summary}
    ORIGINAL PROMPT: {original_prompt}
    
    RECENT INSTRUCTIONS (avoid repeating these):
    {recent_history}
    
    Create a finance-specific version:
    """
)

# Use the custom strategy in generation
results = generator.generate(
    seed_instructions=seed_instructions,
    evolution_config={
        "strategies": ["financial_domain", "deepening"],
        "weights": [0.7, 0.3]
    },
    domain_context="investment and portfolio management",
    target_samples=10
)

Example: Technical Documentation Strategy

generator.add_evolution_strategy(
    name="technical_docs",
    description="Creates instructions focused on technical documentation comprehension",
    prompt_template="""
    Transform the given prompt into a technical documentation comprehension task.
    The evolved prompt should:
    1. Focus on understanding technical concepts from documentation
    2. Ask for explanations of functions, parameters, or system architecture
    3. Maintain the core intent of the original prompt
    4. Be specific to software/API documentation contexts
    
    DOMAIN CONTEXT: {domain_summary}
    ORIGINAL PROMPT: {original_prompt}
    
    RECENT INSTRUCTIONS (avoid repeating these):
    {recent_history}
    
    Create a technical documentation-focused version:
    """
)

Document Grounding

When documents are provided, the generator ensures evolved instructions remain answerable from the document context:

Verifies answerability against the document set
Repairs instructions that drift from the document context
Simplifies overly complex instructions that cannot be fully answered
Supports partial answerability for complex instructions
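
The bullets above describe a verify, repair, and simplify sequence. The sketch below shows one way such a sequence could be wired together; it is not phinitydata's implementation. The callables verify, repair, and simplify are hypothetical stand-ins for LLM-backed steps, and the toy versions use plain keyword matching only so that the example runs offline.

```python
from typing import Callable, List, Optional


def ground_instruction(
    instruction: str,
    documents: List[str],
    verify: Callable[[str, List[str]], bool],
    repair: Callable[[str, List[str]], str],
    simplify: Callable[[str], str],
) -> Optional[str]:
    """Return an answerable version of instruction, or None if none is found."""
    if verify(instruction, documents):
        return instruction
    # First try to pull the instruction back toward the documents.
    repaired = repair(instruction, documents)
    if verify(repaired, documents):
        return repaired
    # Fall back to a simpler form of the original instruction.
    simplified = simplify(instruction)
    if verify(simplified, documents):
        return simplified
    return None


# Toy stand-ins (assumptions for illustration only).
def toy_verify(instruction: str, documents: List[str]) -> bool:
    """Answerable if every long word of the instruction occurs in the documents."""
    corpus = " ".join(documents).lower()
    words = [w.strip("?,.").lower() for w in instruction.split()]
    return all(w in corpus for w in words if len(w) > 7)


def toy_repair(instruction: str, documents: List[str]) -> str:
    """Keep only the first clause of the instruction."""
    return instruction.split(",")[0].rstrip("?") + "?"


def toy_simplify(instruction: str) -> str:
    """Keep only the first four words."""
    return " ".join(instruction.split()[:4])


docs = ["Neural networks are composed of layers of interconnected nodes."]
drifted = "How are neural networks composed, and how do transformers differ?"
print(ground_instruction(drifted, docs, toy_verify, toy_repair, toy_simplify))
```

Passing the steps in as callables keeps the control flow testable without an API key; in practice each step would be an LLM call.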

Document Grounding Example

# Document grounding example
documents = [
    "Machine learning is a subset of artificial intelligence...",
    "Neural networks are composed of layers of interconnected nodes..."
]

results = generator.generate(
    seed_instructions=seed_instructions,
    target_samples=5,
    documents=documents,  # Provide documents for grounding
    domain_context="artificial intelligence",
    strict_grounding=True,  # Enforce strict answerability verification
    export_format="jsonl",  # Match the .jsonl path; the default is "json"
    export_path="grounded_instructions.jsonl"
)

Vector Database Integration

The SFTGenerator supports integration with vector databases for efficient document retrieval:

# Using ChromaDB for document storage
generator = SFTGenerator(
    api_key=api_key,
    vector_store_type="chroma"  # Use ChromaDB instead of FAISS
)

# Generated instructions will be verified against ChromaDB document store

Error Handling

The generator handles API failures, unverifiable instructions, and similar issues during generation. When generating at scale, pass verbose=True to get detailed progress output.

Advanced Configuration

Evolution Configuration

You can customize the evolution process by providing a configuration dictionary:

evolution_config = {
    "strategies": ["deepening", "concretizing", "reasoning", "comparative"],
    "weights": [0.3, 0.3, 0.2, 0.2]  # Probability weights for each strategy
}

results = generator.generate(
    seed_instructions=seed_instructions,
    evolution_config=evolution_config,
    target_samples=10
)
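
The comment in the configuration above calls the weights probability weights. Assuming this means one strategy is sampled per evolution step in proportion to its weight (the library's actual selection logic is not shown here), the behaviour can be sketched with the standard library. The helpers validate_config and pick_strategy are illustrative and not part of phinitydata, and the requirement that weights sum to 1.0 is an assumption based on the examples in this guide.

```python
import math
import random
from typing import Dict, List, Optional


def validate_config(config: Dict[str, List]) -> None:
    """Raise ValueError if strategies and weights are inconsistent."""
    strategies, weights = config["strategies"], config["weights"]
    if len(strategies) != len(weights):
        raise ValueError("strategies and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    # Assumption: weights are probabilities, as in the examples in this guide.
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights should sum to 1.0")


def pick_strategy(config: Dict[str, List], rng: Optional[random.Random] = None) -> str:
    """Sample one strategy name in proportion to its weight."""
    rng = rng or random.Random()
    return rng.choices(config["strategies"], weights=config["weights"], k=1)[0]


config = {
    "strategies": ["deepening", "concretizing", "reasoning", "comparative"],
    "weights": [0.3, 0.3, 0.2, 0.2],
}
validate_config(config)

# Tally 1,000 draws to see the rough proportions.
rng = random.Random(0)
counts = {name: 0 for name in config["strategies"]}
for _ in range(1000):
    counts[pick_strategy(config, rng)] += 1
print(counts)
```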

Output Format Options

The generator supports different output formats:

# Export as JSONL (one JSON object per line)
results = generator.generate(
    seed_instructions=seed_instructions,
    export_format="jsonl",
    export_path="instructions.jsonl"
)

# Export as JSON (full structured object)
results = generator.generate(
    seed_instructions=seed_instructions,
    export_format="json",
    export_path="instructions.json"
)
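
Reading the exported files back needs only the standard library. The following is a minimal sketch: load_export is a hypothetical helper, not part of phinitydata, and it makes no assumption about the keys inside each record, since those are not specified here.

```python
import json
from pathlib import Path
from typing import Any


def load_export(path: str) -> Any:
    """Load an export file, choosing the parser from the file extension."""
    text = Path(path).read_text(encoding="utf-8")
    if path.endswith(".jsonl"):
        # JSONL: one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # JSON: the whole file is a single structured object.
    return json.loads(text)
```

For example, load_export("instructions.jsonl") would return a list with one entry per line, while load_export("instructions.json") would return the single top-level object.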
