In-domain SFT

Generating custom SFT datasets

Supervised fine-tuning (SFT) on high-quality, domain-specific data is critical for getting LLMs to excel at specialized tasks. However, creating diverse, high-quality datasets by hand is time-consuming and expensive. This tutorial shows how to use Phinity to generate synthetic in-domain SFT data efficiently.

SFT data consists of instruction-response pairs used to fine-tune language models for specific tasks or domains. Quality SFT data should be:

  • Diverse: Covering a range of scenarios and question types

  • Complex: Including challenging examples that push the model's capabilities

  • Domain-specific: Focused on your target application area

  • Realistic: Representing real-world use cases
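
As a minimal sketch of what an instruction-response pair can look like on disk, the snippet below stores each pair as one JSON object per line (JSONL). The class name `SFTExample` and the field names `instruction` and `response` are assumptions made for illustration; they are not confirmed to match the schema Phinity writes.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SFTExample:
    """One instruction-response pair. Field names are illustrative only."""
    instruction: str
    response: str


def to_jsonl(examples):
    """Serialize examples as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(ex), ensure_ascii=False) for ex in examples)


def from_jsonl(text):
    """Parse JSON Lines text back into SFTExample objects, skipping blank lines."""
    return [SFTExample(**json.loads(line)) for line in text.splitlines() if line.strip()]


examples = [
    SFTExample(
        instruction="What is the difference between type 1 and type 2 diabetes?",
        response=(
            "Type 1 diabetes is an autoimmune condition in which the body makes "
            "little or no insulin. Type 2 diabetes involves insulin resistance."
        ),
    ),
]

jsonl_text = to_jsonl(examples)
restored = from_jsonl(jsonl_text)
```

Keeping one record per line makes it easy to stream, sample, and filter large generated datasets without loading the whole file.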

One of the hardest parts of generating synthetic data at scale is diversity. Methods like WizardLM Evol-Instruct and Genetic Instruct were developed to enable diverse instruction generation at scale. They repeatedly create new prompts from a user-provided seed set by "evolving" them within the domain. Think of a never-ending family tree: prompts give birth to new prompts, with mutations added at each generation. Now you have millions of new family members from a starting set of two parents 🤠 (unfortunately, many of these family members have to be eliminated later during in-domain quality filtering).
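
The family-tree idea can be sketched as a small loop. This is a toy illustration, not Phinity's or WizardLM's implementation: in Evol-Instruct-style pipelines an LLM rewrites each prompt, whereas the hypothetical `MUTATIONS` string templates below only stand in for that call. The function names and parameters are likewise assumptions.

```python
import random

# Hypothetical mutation templates. A real pipeline would ask an LLM to
# rewrite the prompt; plain string templates stand in for that call here.
MUTATIONS = [
    "{prompt} Explain your reasoning step by step.",
    "{prompt} Answer for a patient who also has a second chronic condition.",
    "{prompt} Give a concrete example.",
]


def evolve(prompt, rng):
    """Produce one child prompt by applying a randomly chosen mutation."""
    return rng.choice(MUTATIONS).format(prompt=prompt)


def evolve_population(seeds, generations, children_per_prompt=2, seed=0):
    """Grow a prompt population from seeds, one generation at a time."""
    rng = random.Random(seed)   # fixed seed keeps the toy run reproducible
    population = list(seeds)    # every prompt produced so far, seeds first
    frontier = list(seeds)      # prompts that will be mutated next
    seen = set(seeds)           # used to drop exact duplicates
    for _ in range(generations):
        next_frontier = []
        for parent in frontier:
            for _ in range(children_per_prompt):
                child = evolve(parent, rng)
                if child not in seen:
                    seen.add(child)
                    population.append(child)
                    next_frontier.append(child)
        frontier = next_frontier
    return population


seed_prompts = [
    "How is type 2 diabetes diagnosed?",
    "What lifestyle changes help manage blood sugar?",
]
evolved = evolve_population(seed_prompts, generations=3)
```

Because each generation mutates the previous generation's children, the population can grow geometrically, which is why a later filtering pass is needed.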

Phinity includes a customizable Evol-Instruct implementation. You can generate instructions either from scratch or from your documents.

Quick Start: Medical Domain with Documents

This tutorial demonstrates how to use the SFTGenerator class to create domain-specific instruction-response pairs grounded in medical documents about diabetes.

pip install phinitydata

Setup and Initialization

from phinitydata.testset.sft_generator import SFTGenerator
import os

# Create output directory
os.makedirs("generated_data", exist_ok=True)

# Initialize the generator
generator = SFTGenerator()

1. Prepare Documents
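
For illustration, documents can be held as a plain list of strings. The texts and the names `medical_docs` and `validate_documents` below are placeholders chosen for this sketch; how the list is passed to `SFTGenerator` is not shown, since that call signature is not given in this section.

```python
# Placeholder documents: short, general statements standing in for real
# source material such as clinical guidelines or patient leaflets.
medical_docs = [
    "Type 2 diabetes is a chronic condition in which the body becomes "
    "resistant to insulin or does not produce enough of it, leading to "
    "elevated blood glucose.",
    "Insulin is a hormone produced by the pancreas that helps glucose "
    "move from the blood into cells.",
]


def validate_documents(docs):
    """Strip whitespace and drop empty or non-string entries before generation."""
    cleaned = [d.strip() for d in docs if isinstance(d, str) and d.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty document is required")
    return cleaned


documents = validate_documents(medical_docs)
```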

The documents in this step are examples. Phinity also supports connecting ChromaDB as a document source.

2. Configure Evolution

The domain summary is injected into the evolution prompts, so use this field for the context and rules of your domain. Read more about the evolution prompts and creating custom evolution strategies in the SFTGenerator documentation.
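
To make "injected into the evolution prompts" concrete, here is a minimal sketch of the idea using ordinary string formatting. The template text, `EVOLUTION_TEMPLATE`, `build_evolution_prompt`, and the `strategy` parameter are all hypothetical; Phinity's actual prompts and configuration keys may differ.

```python
# Hypothetical evolution prompt. The real templates used by the library
# are not shown in this tutorial and may look different.
EVOLUTION_TEMPLATE = (
    "You are rewriting instructions for the following domain.\n"
    "Domain summary: {domain_summary}\n\n"
    "Rewrite the instruction below so that it is more {strategy}, "
    "while staying within the domain.\n"
    "Instruction: {instruction}"
)


def build_evolution_prompt(instruction, domain_summary, strategy="complex"):
    """Fill the template so every evolution step sees the same domain rules."""
    return EVOLUTION_TEMPLATE.format(
        domain_summary=domain_summary.strip(),
        strategy=strategy,
        instruction=instruction.strip(),
    )


domain_summary = (
    "Patient-facing questions about diabetes management. "
    "Stay within the provided documents and avoid individual dosing advice."
)
prompt = build_evolution_prompt("How is type 2 diabetes diagnosed?", domain_summary)
```

Since the summary is repeated in every evolution prompt, short and specific rules tend to be more useful than a long background description.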

3. Generate

4. Examine and Filter Generations

Examine the output file and filter the data for relevance and quality, using an aligned LLM as a judge (coming soon) or other custom verifiers.
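
A custom verifier can be as simple as a function that accepts or rejects one record. The sketch below applies a list of such functions to JSONL lines. It assumes each record has `instruction` and `response` fields, which may not match the actual output schema; the verifier names, keyword list, and length threshold are illustrative.

```python
import json


def min_length(record, n=20):
    """Reject records whose response is too short to be useful."""
    return len(str(record.get("response", "")).strip()) >= n


def mentions_domain(record, keywords=("diabetes", "insulin", "glucose")):
    """Crude relevance check: at least one domain keyword must appear."""
    text = f"{record.get('instruction', '')} {record.get('response', '')}".lower()
    return any(keyword in text for keyword in keywords)


def filter_jsonl(lines, verifiers):
    """Keep only the records that parse as JSON objects and pass every verifier."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the run
        if isinstance(record, dict) and all(check(record) for check in verifiers):
            kept.append(record)
    return kept


sample_lines = [
    '{"instruction": "What does insulin do?", "response": "Insulin helps glucose move from the blood into cells."}',
    '{"instruction": "What is the capital of France?", "response": "The capital of France is Paris."}',
    '{"instruction": "Define diabetes.", "response": "Short."}',
    "not valid json",
]
kept = filter_jsonl(sample_lines, [min_length, mentions_domain])
```

`filter_jsonl` accepts any iterable of lines, so an open file handle for the generated JSONL file can be passed in place of `sample_lines`. Keyword checks like this are only a first pass; an LLM judge or task-specific verifier would catch subtler quality problems.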

Full Script

Output Logs

Output File (JSONL)
