Supervised Fine-Tuning (SFT) on high-quality, domain-specific data is critical for getting LLMs to excel at specialized tasks. However, creating diverse, high-quality datasets by hand is time-consuming and expensive. This tutorial shows how to use Phinity to generate synthetic in-domain SFT data efficiently.
Supervised Fine-Tuning (SFT) data consists of instruction-response pairs used to fine-tune language models for specific tasks or domains. Quality SFT data should be:
Diverse: Covering a range of scenarios and question types
Complex: Including challenging examples that push the model's capabilities
Domain-specific: Focused on your target application area
Realistic: Representing real-world use cases
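As a rough illustration of the first criterion, the sketch below represents SFT data as instruction-response pairs and scores how lexically similar the instructions are to each other. The token-set Jaccard measure and the helper names (tokens, jaccard, mean_pairwise_similarity) are assumptions chosen for illustration, not part of Phinity; a lower mean similarity only loosely suggests a more diverse set.

```python
import re
from itertools import combinations


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity; 1.0 means identical vocabularies."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def mean_pairwise_similarity(instructions: list[str]) -> float:
    """Average Jaccard similarity over all instruction pairs."""
    pairs = list(combinations(instructions, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# SFT data is a list of instruction-response pairs.
sft_pairs = [
    {"instruction": "What is type 2 diabetes?",
     "response": "A condition characterized by insulin resistance."},
    {"instruction": "What are the treatment options for diabetes?",
     "response": "Lifestyle modifications, oral medications, and sometimes insulin."},
    {"instruction": "How does type 2 diabetes affect different organs?",
     "response": "It involves the pancreas, liver, and muscle and fat cells."},
]

score = mean_pairwise_similarity([p["instruction"] for p in sft_pairs])
print(f"Mean pairwise instruction similarity: {score:.2f}")
```

Lexical overlap is a crude proxy; embedding-based similarity would catch paraphrases that share few words.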
One of the hardest parts of synthetic data generation at scale is diversity. Methods like WizardLM Evol-Instruct and Genetic Instruct were developed to generate diverse instructions at scale: they repeatedly create new prompts from a user-provided seed set by "evolving" them within the domain. Think of a never-ending family tree: prompts give birth to new prompts, picking up mutations with each generation. Now you have millions of new family members from a starting set of two parents 🤠 (unfortunately, we have to eliminate many of these family members later during in-domain quality filtering).
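The family-tree idea can be sketched as a small loop: pick a parent from the pool, pick a mutation strategy according to its weight, and add the child back to the pool so that it can become a parent itself. This is a toy under stated assumptions, not Phinity's or WizardLM's implementation: real systems ask an LLM to rewrite the prompt, whereas the deepen and concretize functions here are hypothetical string templates that only make the loop runnable.

```python
import random


def deepen(prompt: str) -> str:
    """Toy 'deepening' mutation (a real system would call an LLM here)."""
    return prompt.rstrip("?") + ", and what underlying mechanisms explain it?"


def concretize(prompt: str) -> str:
    """Toy 'concretizing' mutation (a real system would call an LLM here)."""
    return prompt.rstrip("?") + ", using one specific example?"


MUTATIONS = {"deepening": deepen, "concretizing": concretize}


def evolve(seeds, target_samples, weights, rng, max_attempts=1000):
    """Grow a pool of prompts by repeatedly mutating randomly chosen parents."""
    pool = list(seeds)
    samples = []
    names = list(MUTATIONS)
    attempts = 0
    while len(samples) < target_samples and attempts < max_attempts:
        attempts += 1
        parent = rng.choice(pool)
        # Strategies are sampled in proportion to their weights.
        strategy = rng.choices(names, weights=weights, k=1)[0]
        child = MUTATIONS[strategy](parent)
        if child in pool:
            continue  # skip exact duplicates
        pool.append(child)  # children can become parents in later rounds
        samples.append({"instruction": child, "parent": parent, "strategy": strategy})
    return samples


samples = evolve(
    seeds=["What is type 2 diabetes?", "What are the treatment options for diabetes?"],
    target_samples=5,
    weights=[0.5, 0.5],
    rng=random.Random(0),
)
for s in samples:
    print(f"[{s['strategy']}] {s['parent']!r} -> {s['instruction']!r}")
```

Because children rejoin the pool, the number of reachable prompts grows with every generation, which is why a later filtering pass is needed.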
Phinity has customizable Evol-Instruct built in. You can generate instructions either from scratch or from your own documents.
Quick Start: Medical Domain with Documents
This tutorial demonstrates how to use the SFTGenerator class to create domain-specific instruction-response pairs grounded in medical documents about diabetes.
pip install phinitydata
Setup and Initialization
from phinitydata.testset.sft_generator import SFTGenerator
import os

# Create output directory
os.makedirs("generated_data", exist_ok=True)

# Initialize the generator
generator = SFTGenerator()
1
Prepare Documents
# Example medical documents about diabetes
medical_docs = [
"""
Type 2 diabetes is characterized by insulin resistance and relative insulin deficiency.
Key symptoms include increased thirst (polydipsia), frequent urination (polyuria),
increased hunger (polyphagia), and unexplained weight loss. The condition is often
associated with obesity and physical inactivity. Management typically involves
lifestyle modifications, oral medications, and sometimes insulin therapy.
""",
"""
The pathophysiology of type 2 diabetes involves multiple organs:
1. Pancreas: Decreased insulin production from beta cells
2. Liver: Increased glucose production
3. Muscle and fat cells: Reduced glucose uptake due to insulin resistance
This creates a cycle of elevated blood glucose levels and metabolic dysfunction.
""",
"""
Treatment options for type 2 diabetes include:
- Metformin: First-line medication that reduces liver glucose production
- Sulfonylureas: Increase insulin secretion from pancreas
- GLP-1 receptor agonists: Enhance insulin release and reduce appetite
- DPP-4 inhibitors: Prolong the action of incretin hormones
Regular monitoring of blood glucose and HbA1c levels is essential.
"""
]
The documents above are examples; Phinity also supports connecting to ChromaDB.
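If your documents live on disk rather than in the script, a list like medical_docs can be built by reading text files. The sketch below is one way to do that with the standard library; load_documents, its parameters, and the file names are hypothetical helpers for illustration and are not part of Phinity.

```python
import tempfile
from pathlib import Path


def load_documents(folder, pattern="*.txt", min_chars=1):
    """Read every matching file in a folder into a list of strings."""
    docs = []
    for path in sorted(Path(folder).glob(pattern)):
        text = path.read_text(encoding="utf-8").strip()
        if len(text) >= min_chars:  # skip empty or whitespace-only files
            docs.append(text)
    return docs


# Demo with a temporary folder so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a_overview.txt").write_text(
        "Type 2 diabetes is characterized by insulin resistance.", encoding="utf-8"
    )
    Path(tmp, "b_empty.txt").write_text("   ", encoding="utf-8")
    docs_from_disk = load_documents(tmp)

print(f"Loaded {len(docs_from_disk)} document(s)")
```

The resulting list of strings has the same shape as the medical_docs list above.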
2
Configure Evolution
# Seed instructions that the evolution starts from
seeds = [
    "What is type 2 diabetes?",
    "How does type 2 diabetes affect different organs?",
    "What are the treatment options for diabetes?"
]

results = generator.generate(
    seed_instructions=seeds,
    documents=medical_docs,
    target_samples=5,
    domain_context="medical knowledge about type 2 diabetes - RULE: USE ONLY INFORMATION FROM THE PROVIDED DOCUMENTS",
    evolution_config={
        "max_generations": 3,
        "strategies": ["deepening", "concretizing"],
        "weights": [0.5, 0.5]
    },
    strict_grounding=True,  # Ensures instructions are answerable from documents
    verbose=True,
    export_format="jsonl",
    export_path="generated_data/medical_instructions.jsonl"
)
The domain_context string is injected into the evolution prompts, so put the context and rules for your domain in this field. Read more about the evolution prompts and creating custom evolution strategies in the SFTGenerator documentation.
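To make "injected into the evolution prompts" concrete, here is a purely illustrative sketch of how a domain context string might be formatted into a prompt. Phinity's actual prompt templates are not reproduced here: EVOLUTION_TEMPLATE and build_evolution_prompt are hypothetical names, and the wording of the template is an assumption.

```python
# Hypothetical template; the real evolution prompts may look quite different.
EVOLUTION_TEMPLATE = (
    "You are rewriting an instruction for a specific domain.\n"
    "Domain context: {domain_context}\n"
    "Strategy: {strategy}\n"
    "Original instruction: {parent}\n"
    "Rewritten instruction:"
)


def build_evolution_prompt(parent: str, strategy: str, domain_context: str) -> str:
    """Fill the template so the domain rules reach the rewriting model."""
    return EVOLUTION_TEMPLATE.format(
        domain_context=domain_context, strategy=strategy, parent=parent
    )


prompt = build_evolution_prompt(
    parent="What is type 2 diabetes?",
    strategy="deepening",
    domain_context=(
        "medical knowledge about type 2 diabetes - "
        "RULE: USE ONLY INFORMATION FROM THE PROVIDED DOCUMENTS"
    ),
)
print(prompt)
```

Since the same string accompanies every evolution step, short and explicit rules in it are likely to be followed more consistently than long background text.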
3
Generate
# Print results with document sources
print("\n=== Generated Medical Instructions ===")
for i, sample in enumerate(results['samples'], 1):
    print(f"\nInstruction {i}:")
    print(f"Q: {sample['instruction']}")
    print(f"A: {sample['response'][:150]}...")  # Print first 150 chars of response
    print("\nRelevant Documents:")
    for j, doc in enumerate(sample.get('relevant_documents', []), 1):
        print(f"Doc {j}: {doc[:100]}...")
    print("-" * 80)
4
Examine and Filter Generations
Examine the generated file and filter the data for relevance and quality, using an aligned LLM as a judge (coming soon) or your own custom verifiers.
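A custom verifier can be as simple as a lexical grounding check. The sketch below scores each response by the fraction of its longer words that also occur in the sample's relevant documents and keeps only samples above a threshold. It assumes the record fields shown in the output file (instruction, response, relevant_documents); the function names, the word-length cutoff, and the 0.6 threshold are illustrative assumptions, not Phinity features.

```python
import json
import re


def content_tokens(text: str) -> set[str]:
    """Lowercase tokens longer than three characters (a crude stopword filter)."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if len(t) > 3}


def grounding_score(response: str, documents: list[str]) -> float:
    """Fraction of the response's content tokens that appear in the documents."""
    resp = content_tokens(response)
    if not resp:
        return 0.0
    doc_tokens = content_tokens(" ".join(documents))
    return len(resp & doc_tokens) / len(resp)


def filter_samples(samples: list[dict], min_score: float = 0.6) -> list[dict]:
    """Keep samples whose responses are mostly supported by their documents."""
    return [
        s for s in samples
        if grounding_score(s["response"], s.get("relevant_documents", [])) >= min_score
    ]


def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


doc = "Metformin: First-line medication that reduces liver glucose production"
demo_samples = [
    {"instruction": "What does metformin do?",
     "response": "Metformin is a first-line medication that reduces liver glucose production.",
     "relevant_documents": [doc]},
    {"instruction": "Who is at risk?",
     "response": "Risk increases with age, particularly for those over 45 years.",
     "relevant_documents": [doc]},
]
kept = filter_samples(demo_samples)
print([s["instruction"] for s in kept])
```

A check like this matters even with strict_grounding=True: in the example output file, the fifth response lists age, family history, and ethnicity as risk factors, none of which appear in the three source documents.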
Full Script
from phinitydata.testset.sft_generator import SFTGenerator
import os


def medical_doc_example():
    # Initialize generator
    generator = SFTGenerator()

    # Example medical documents
    medical_docs = [
        """
        Type 2 diabetes is characterized by insulin resistance and relative insulin deficiency.
        Key symptoms include increased thirst (polydipsia), frequent urination (polyuria),
        increased hunger (polyphagia), and unexplained weight loss. The condition is often
        associated with obesity and physical inactivity. Management typically involves
        lifestyle modifications, oral medications, and sometimes insulin therapy.
        """,
        """
        The pathophysiology of type 2 diabetes involves multiple organs:
        1. Pancreas: Decreased insulin production from beta cells
        2. Liver: Increased glucose production
        3. Muscle and fat cells: Reduced glucose uptake due to insulin resistance
        This creates a cycle of elevated blood glucose levels and metabolic dysfunction.
        """,
        """
        Treatment options for type 2 diabetes include:
        - Metformin: First-line medication that reduces liver glucose production
        - Sulfonylureas: Increase insulin secretion from pancreas
        - GLP-1 receptor agonists: Enhance insulin release and reduce appetite
        - DPP-4 inhibitors: Prolong the action of incretin hormones
        Regular monitoring of blood glucose and HbA1c levels is essential.
        """
    ]

    # Seed instructions focused on medical content
    seeds = [
        "What is type 2 diabetes?",
        "How does type 2 diabetes affect different organs?",
        "What are the treatment options for diabetes?"
    ]

    print("\n=== Medical Document-Grounded Instruction Generation ===")
    print(f"Using {len(medical_docs)} medical documents")
    print(f"Starting with {len(seeds)} seed instructions")

    # Generate with strict document grounding
    results = generator.generate(
        seed_instructions=seeds,
        documents=medical_docs,
        target_samples=5,
        domain_context="medical knowledge about type 2 diabetes - RULE: USE ONLY INFORMATION FROM THE PROVIDED DOCUMENTS",
        evolution_config={
            "max_generations": 3,
            "strategies": ["deepening", "concretizing"],
            "weights": [0.5, 0.5]
        },
        strict_grounding=True,
        verbose=True,
        export_format="jsonl",
        export_path="generated_data/medical_instructions.jsonl"
    )

    # Print results with document sources
    print("\n=== Generated Medical Instructions ===")
    for i, sample in enumerate(results['samples'], 1):
        print(f"\nInstruction {i}:")
        print(f"Q: {sample['instruction']}")
        print("\nRelevant Documents:")
        for j, doc in enumerate(sample['relevant_documents'], 1):
            print(f"Doc {j}: {doc[:100]}...")
        print("-" * 80)


if __name__ == "__main__":
    # Create output directory
    os.makedirs("generated_data", exist_ok=True)
    medical_doc_example()
Output Logs
=== Medical Document-Grounded Instruction Generation ===
Using 3 medical documents
Starting with 3 seed instructions
=== Starting Instruction Generation ===
Seeds: 3
Target samples: 5
Evolving with deepening:
Parent: What is type 2 diabetes?
New instruction: What are the causes and symptoms of type 2 diabetes?
Response preview: Type 2 diabetes is primarily caused by insulin resistance and a relative deficiency of insulin production. Key contributing factors include obesity and physical inactivity, which are often associated ...
--------------------------------------------------------------------------------
Evolving with deepening:
Parent: How does type 2 diabetes affect different organs?
Evolving with deepening:
Parent: What are the treatment options for diabetes?
New instruction: What are the recommended treatment options for type 2 diabetes?
Response preview: The recommended treatment options for type 2 diabetes include:
1. **Metformin**: This is the first-line medication commonly prescribed. It works by reducing liver glucose production, thereby helping ...
--------------------------------------------------------------------------------
Evolving with deepening:
Parent: What are the causes and symptoms of type 2 diabetes?
New instruction: What are the primary causes and common symptoms of type 2 diabetes?
Response preview: Type 2 diabetes is primarily caused by insulin resistance and relative insulin deficiency. The key factors associated with its development include obesity and physical inactivity.
Common symptoms of...
--------------------------------------------------------------------------------
Evolving with concretizing:
Parent: What are the recommended treatment options for type 2 diabetes?
New instruction: What are the recommended treatment options for managing type 2 diabetes, such as metformin or lifestyle changes?
Response preview: The recommended treatment options for managing type 2 diabetes include a combination of lifestyle changes and medications.
1. **Lifestyle Changes**: These are foundational in managing type 2 diabete...
--------------------------------------------------------------------------------
Evolving with deepening:
Parent: What are the primary causes and common symptoms of type 2 diabetes?
New instruction: What are the leading risk factors and prevalent symptoms associated with type 2 diabetes?
Response preview: Type 2 diabetes is primarily characterized by insulin resistance and relative insulin deficiency, and it is associated with several leading risk factors and prevalent symptoms.
**Leading Risk Factors...
--------------------------------------------------------------------------------
=== Generated Medical Instructions ===
Instruction 1:
Q: What are the causes and symptoms of type 2 diabetes?
Relevant Documents:
Doc 1:
Type 2 diabetes is characterized by insulin resistance and relative insulin deficiency.
...
Doc 2:
The pathophysiology of type 2 diabetes involves multiple organs:
1. Pancreas: Decre...
Doc 3:
Treatment options for type 2 diabetes include:
- Metformin: First-line medication t...
--------------------------------------------------------------------------------
Instruction 2:
Q: What are the recommended treatment options for type 2 diabetes?
Relevant Documents:
Doc 1:
Treatment options for type 2 diabetes include:
- Metformin: First-line medication t...
Doc 2:
Type 2 diabetes is characterized by insulin resistance and relative insulin deficiency.
...
Doc 3:
The pathophysiology of type 2 diabetes involves multiple organs:
1. Pancreas: Decre...
--------------------------------------------------------------------------------
Instruction 3:
Q: What are the primary causes and common symptoms of type 2 diabetes?
Relevant Documents:
Doc 1:
Type 2 diabetes is characterized by insulin resistance and relative insulin deficiency.
...
Doc 2:
The pathophysiology of type 2 diabetes involves multiple organs:
1. Pancreas: Decre...
Doc 3:
Treatment options for type 2 diabetes include:
- Metformin: First-line medication t...
--------------------------------------------------------------------------------
Instruction 4:
Q: What are the recommended treatment options for managing type 2 diabetes, such as metformin or lifestyle changes?
Relevant Documents:
Doc 1:
Treatment options for type 2 diabetes include:
- Metformin: First-line medication t...
Doc 2:
Type 2 diabetes is characterized by insulin resistance and relative insulin deficiency.
...
Doc 3:
The pathophysiology of type 2 diabetes involves multiple organs:
1. Pancreas: Decre...
--------------------------------------------------------------------------------
Instruction 5:
Q: What are the leading risk factors and prevalent symptoms associated with type 2 diabetes?
Relevant Documents:
Doc 1:
Type 2 diabetes is characterized by insulin resistance and relative insulin deficiency.
...
Doc 2:
The pathophysiology of type 2 diabetes involves multiple organs:
1. Pancreas: Decre...
Doc 3:
Treatment options for type 2 diabetes include:
- Metformin: First-line medication t...
--------------------------------------------------------------------------------
Output File (JSONL)
{"instruction": "What are the causes and symptoms of type 2 diabetes?", "response": "Type 2 diabetes is primarily caused by insulin resistance and a relative deficiency of insulin production. Key contributing factors include obesity and physical inactivity, which are often associated with the development of the condition.\n\nThe symptoms of type 2 diabetes can manifest as:\n- Increased thirst (polydipsia)\n- Frequent urination (polyuria)\n- Increased hunger (polyphagia)\n- Unexplained weight loss\n\nThe pathophysiology of type 2 diabetes involves dysfunction in multiple organs:\n1. **Pancreas**: There is a decrease in insulin production from the beta cells.\n2. **Liver**: There is an increase in glucose production.\n3. **Muscle and fat cells**: These cells exhibit reduced glucose uptake due to insulin resistance.\n\nThis interplay of factors leads to sustained elevated blood glucose levels and a state of metabolic dysfunction. Regular monitoring of blood glucose and HbA1c levels is crucial for managing the condition.", "parent": "What is type 2 diabetes?", "strategy": "deepening", "relevant_documents": ["\n Type 2 diabetes is characterized by insulin resistance and relative insulin deficiency.\n Key symptoms include increased thirst (polydipsia), frequent urination (polyuria),\n increased hunger (polyphagia), and unexplained weight loss. The condition is often\n associated with obesity and physical inactivity. Management typically involves\n lifestyle modifications, oral medications, and sometimes insulin therapy.\n ", "\n The pathophysiology of type 2 diabetes involves multiple organs:\n 1. Pancreas: Decreased insulin production from beta cells\n 2. Liver: Increased glucose production\n 3. 
Muscle and fat cells: Reduced glucose uptake due to insulin resistance\n This creates a cycle of elevated blood glucose levels and metabolic dysfunction.\n ", "\n Treatment options for type 2 diabetes include:\n - Metformin: First-line medication that reduces liver glucose production\n - Sulfonylureas: Increase insulin secretion from pancreas\n - GLP-1 receptor agonists: Enhance insulin release and reduce appetite\n - DPP-4 inhibitors: Prolong the action of incretin hormones\n Regular monitoring of blood glucose and HbA1c levels is essential.\n "], "metadata": {"strategy": "deepening", "parent": "What is type 2 diabetes?", "grounding_type": "strict"}}
{"instruction": "What are the recommended treatment options for type 2 diabetes?", "response": "The recommended treatment options for type 2 diabetes include:\n\n1. **Metformin**: This is the first-line medication commonly prescribed. It works by reducing liver glucose production, thereby helping to lower blood sugar levels.\n\n2. **Sulfonylureas**: These medications increase insulin secretion from the pancreas, which can help lower blood glucose levels.\n\n3. **GLP-1 receptor agonists**: These agents enhance insulin release from the pancreas and also help reduce appetite, contributing to better blood sugar control and potential weight loss.\n\n4. **DPP-4 inhibitors**: These medications prolong the action of incretin hormones, which helps to regulate blood sugar levels.\n\nIn addition to these pharmacological treatments, regular monitoring of blood glucose and HbA1c levels is essential to managing type 2 diabetes effectively. \n\nManagement of the condition typically also involves lifestyle modifications, which may include changes in diet and increased physical activity, as well as, in some cases, insulin therapy.", "parent": "What are the treatment options for diabetes?", "strategy": "deepening", "relevant_documents": ["\n Treatment options for type 2 diabetes include:\n - Metformin: First-line medication that reduces liver glucose production\n - Sulfonylureas: Increase insulin secretion from pancreas\n - GLP-1 receptor agonists: Enhance insulin release and reduce appetite\n - DPP-4 inhibitors: Prolong the action of incretin hormones\n Regular monitoring of blood glucose and HbA1c levels is essential.\n ", "\n Type 2 diabetes is characterized by insulin resistance and relative insulin deficiency.\n Key symptoms include increased thirst (polydipsia), frequent urination (polyuria),\n increased hunger (polyphagia), and unexplained weight loss. The condition is often\n associated with obesity and physical inactivity. 
Management typically involves\n lifestyle modifications, oral medications, and sometimes insulin therapy.\n ", "\n The pathophysiology of type 2 diabetes involves multiple organs:\n 1. Pancreas: Decreased insulin production from beta cells\n 2. Liver: Increased glucose production\n 3. Muscle and fat cells: Reduced glucose uptake due to insulin resistance\n This creates a cycle of elevated blood glucose levels and metabolic dysfunction.\n "], "metadata": {"strategy": "deepening", "parent": "What are the treatment options for diabetes?", "grounding_type": "strict"}}
{"instruction": "What are the primary causes and common symptoms of type 2 diabetes?", "response": "Type 2 diabetes is primarily caused by insulin resistance and relative insulin deficiency. The key factors associated with its development include obesity and physical inactivity. \n\nCommon symptoms of type 2 diabetes include:\n- Increased thirst (polydipsia)\n- Frequent urination (polyuria)\n- Increased hunger (polyphagia)\n- Unexplained weight loss\n\nThese symptoms arise due to the underlying pathophysiology, which involves multiple organs: the pancreas exhibits decreased insulin production, the liver increases glucose production, and muscle and fat cells show reduced glucose uptake due to insulin resistance. This leads to elevated blood glucose levels and metabolic dysfunction.", "parent": "What are the causes and symptoms of type 2 diabetes?", "strategy": "deepening", "relevant_documents": ["\n Type 2 diabetes is characterized by insulin resistance and relative insulin deficiency.\n Key symptoms include increased thirst (polydipsia), frequent urination (polyuria),\n increased hunger (polyphagia), and unexplained weight loss. The condition is often\n associated with obesity and physical inactivity. Management typically involves\n lifestyle modifications, oral medications, and sometimes insulin therapy.\n ", "\n The pathophysiology of type 2 diabetes involves multiple organs:\n 1. Pancreas: Decreased insulin production from beta cells\n 2. Liver: Increased glucose production\n 3. 
Muscle and fat cells: Reduced glucose uptake due to insulin resistance\n This creates a cycle of elevated blood glucose levels and metabolic dysfunction.\n ", "\n Treatment options for type 2 diabetes include:\n - Metformin: First-line medication that reduces liver glucose production\n - Sulfonylureas: Increase insulin secretion from pancreas\n - GLP-1 receptor agonists: Enhance insulin release and reduce appetite\n - DPP-4 inhibitors: Prolong the action of incretin hormones\n Regular monitoring of blood glucose and HbA1c levels is essential.\n "], "metadata": {"strategy": "deepening", "parent": "What are the causes and symptoms of type 2 diabetes?", "grounding_type": "strict"}}
{"instruction": "What are the recommended treatment options for managing type 2 diabetes, such as metformin or lifestyle changes?", "response": "The recommended treatment options for managing type 2 diabetes include a combination of lifestyle changes and medications. \n\n1. **Lifestyle Changes**: These are foundational in managing type 2 diabetes and typically involve:\n - Dietary modifications to promote a balanced diet that helps control blood sugar levels.\n - Increased physical activity to enhance insulin sensitivity and aid in weight management.\n - Weight loss, if necessary, to reduce insulin resistance.\n\n2. **Medications**: When lifestyle changes alone are insufficient, several medications may be prescribed:\n - **Metformin**: This is the first-line medication recommended for reducing liver glucose production and improving insulin sensitivity.\n - **Sulfonylureas**: These medications work by increasing insulin secretion from the pancreas.\n - **GLP-1 Receptor Agonists**: These enhance insulin release and reduce appetite, helping with weight management.\n - **DPP-4 Inhibitors**: These prolong the action of incretin hormones, which helps to control blood sugar levels.\n\nAdditionally, regular monitoring of blood glucose and HbA1c levels is essential to assess the effectiveness of the treatment plan and make necessary adjustments. 
\n\nOverall, effective management of type 2 diabetes relies on a comprehensive approach that combines these treatment options tailored to the individual\u2019s needs.", "parent": "What are the recommended treatment options for type 2 diabetes?", "strategy": "concretizing", "relevant_documents": ["\n Treatment options for type 2 diabetes include:\n - Metformin: First-line medication that reduces liver glucose production\n - Sulfonylureas: Increase insulin secretion from pancreas\n - GLP-1 receptor agonists: Enhance insulin release and reduce appetite\n - DPP-4 inhibitors: Prolong the action of incretin hormones\n Regular monitoring of blood glucose and HbA1c levels is essential.\n ", "\n Type 2 diabetes is characterized by insulin resistance and relative insulin deficiency.\n Key symptoms include increased thirst (polydipsia), frequent urination (polyuria),\n increased hunger (polyphagia), and unexplained weight loss. The condition is often\n associated with obesity and physical inactivity. Management typically involves\n lifestyle modifications, oral medications, and sometimes insulin therapy.\n ", "\n The pathophysiology of type 2 diabetes involves multiple organs:\n 1. Pancreas: Decreased insulin production from beta cells\n 2. Liver: Increased glucose production\n 3. Muscle and fat cells: Reduced glucose uptake due to insulin resistance\n This creates a cycle of elevated blood glucose levels and metabolic dysfunction.\n "], "metadata": {"strategy": "concretizing", "parent": "What are the recommended treatment options for type 2 diabetes?", "grounding_type": "strict"}}
{"instruction": "What are the leading risk factors and prevalent symptoms associated with type 2 diabetes?", "response": "Type 2 diabetes is primarily characterized by insulin resistance and relative insulin deficiency, and it is associated with several leading risk factors and prevalent symptoms.\n\n**Leading Risk Factors:**\n1. **Obesity:** Excess body weight is one of the most significant risk factors for developing type 2 diabetes.\n2. **Physical Inactivity:** A sedentary lifestyle contributes to the risk of developing insulin resistance and type 2 diabetes.\n3. **Age:** Risk increases with age, particularly for those over 45 years.\n4. **Family History:** A family history of diabetes can elevate an individual's risk.\n5. **Ethnicity:** Certain ethnic groups, including African American, Hispanic, Native American, and Asian American, are at higher risk.\n\n**Prevalent Symptoms:**\n1. **Increased Thirst (Polydipsia):** Individuals often experience excessive thirst due to elevated blood glucose levels.\n2. **Frequent Urination (Polyuria):** High blood sugar levels can lead to increased urination as the body attempts to eliminate excess glucose.\n3. **Increased Hunger (Polyphagia):** Despite eating, individuals may feel more hungry due to the body\u2019s inability to utilize glucose effectively.\n4. **Unexplained Weight Loss:** Some people may lose weight without trying due to the body breaking down fat and muscle for energy when it cannot use glucose.\n\nRecognition of these risk factors and symptoms is crucial for early diagnosis and management of type 2 diabetes.", "parent": "What are the primary causes and common symptoms of type 2 diabetes?", "strategy": "deepening", "relevant_documents": ["\n Type 2 diabetes is characterized by insulin resistance and relative insulin deficiency.\n Key symptoms include increased thirst (polydipsia), frequent urination (polyuria),\n increased hunger (polyphagia), and unexplained weight loss. 
The condition is often\n associated with obesity and physical inactivity. Management typically involves\n lifestyle modifications, oral medications, and sometimes insulin therapy.\n ", "\n The pathophysiology of type 2 diabetes involves multiple organs:\n 1. Pancreas: Decreased insulin production from beta cells\n 2. Liver: Increased glucose production\n 3. Muscle and fat cells: Reduced glucose uptake due to insulin resistance\n This creates a cycle of elevated blood glucose levels and metabolic dysfunction.\n ", "\n Treatment options for type 2 diabetes include:\n - Metformin: First-line medication that reduces liver glucose production\n - Sulfonylureas: Increase insulin secretion from pancreas\n - GLP-1 receptor agonists: Enhance insulin release and reduce appetite\n - DPP-4 inhibitors: Prolong the action of incretin hormones\n Regular monitoring of blood glucose and HbA1c levels is essential.\n "], "metadata": {"strategy": "deepening", "parent": "What are the primary causes and common symptoms of type 2 diabetes?", "grounding_type": "strict"}}
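Each exported record carries a parent field, so the "family tree" of evolved instructions can be reconstructed from the file. The sketch below computes how many evolution steps separate each instruction from a seed. The generation_depths helper is hypothetical; the demo records keep only the instruction and parent fields from the output above, and the responses are omitted for brevity.

```python
def generation_depths(records: list[dict], seeds: list[str]) -> dict[str, int]:
    """Map each instruction to its distance (in evolution steps) from a seed."""
    depth = {seed: 0 for seed in seeds}
    pending = list(records)
    progress = True
    # Repeatedly resolve records whose parent depth is already known.
    while pending and progress:
        progress = False
        for rec in list(pending):
            parent = rec["parent"]
            if parent in depth:
                depth[rec["instruction"]] = depth[parent] + 1
                pending.remove(rec)
                progress = True
    return depth


seeds = [
    "What is type 2 diabetes?",
    "How does type 2 diabetes affect different organs?",
    "What are the treatment options for diabetes?",
]
records = [
    {"instruction": "What are the causes and symptoms of type 2 diabetes?",
     "parent": "What is type 2 diabetes?"},
    {"instruction": "What are the recommended treatment options for type 2 diabetes?",
     "parent": "What are the treatment options for diabetes?"},
    {"instruction": "What are the primary causes and common symptoms of type 2 diabetes?",
     "parent": "What are the causes and symptoms of type 2 diabetes?"},
    {"instruction": "What are the recommended treatment options for managing type 2 diabetes, such as metformin or lifestyle changes?",
     "parent": "What are the recommended treatment options for type 2 diabetes?"},
    {"instruction": "What are the leading risk factors and prevalent symptoms associated with type 2 diabetes?",
     "parent": "What are the primary causes and common symptoms of type 2 diabetes?"},
]

depths = generation_depths(records, seeds)
for rec in records:
    print(depths[rec["instruction"]], rec["instruction"])
```

In this run the deepest instruction sits three steps from its seed, which matches the max_generations value of 3 used in the configuration.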