Synthetic Data for LLM Training

How I think about using generated training data for large language model training.

Robot writing synthetic data for its children :D

Here are my notes on synthetic data. Take them with a grain of salt, as I am new to this area.

Let’s say you have a generative AI (generative machine learning) problem, so you have some input data and some corresponding output data. But you have only a tiny quantity of this training data.

Synthetic data is data generated by a machine learning model. Here we will primarily discuss large language models (LLMs). Synthetic data helps when you need to fill in missing, hard-to-collect data for your task. This could be something people rarely write down or say. For example, human problem-solving thoughts usually don’t get written down. Let’s Verify Step by Step uses synthetic data to create PRM800K, a dataset of step-by-step math reasoning that is not available elsewhere.

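This kind of step-by-step data is rare, and with an additional rating of each step it is even rarer. In the sample solution below, the last step is wrong (5x = 6x-14 gives x = 14, not 7), which is the kind of error a step-level label can mark: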

Let's call the numerator x.

So the denominator is 3x-7.

We know that x/(3x-7) = 2/5.

So 5x = 2(3x-7).

5x = 6x-14.

So x = 7.

The rarer and more critical the data, the more impactful the synthetic data can be.

If nothing like the required data was present during the model's pretraining, you won’t be able to prompt-instruct the model to perform the required action. Few-shot examples can help, but the more complex the problem, the more examples you are likely to need, and those are costly to write by hand.

Another advantage of synthetic data is that they can be cleaner than data collected from random places on the internet. The disadvantage is that they can be less diverse, less representative than the real data, and contain more hallucinations.

Why Does Synthetic Data Make Sense?

Real data costs human time. It can contain copyrighted material or personally identifiable information (PII), and it can be noisy, incomplete, or irrelevant. Synthetic data can be a way around these problems. But you cannot create the required data synthetically out of thin air.

What Is Needed to Create Synthetic Data?

You may need:

  1. Either a more general model that was essentially trained on something similar to the target data needed,
  2. Or you need some real data that is close to the required data and perform only an easy modification to match the required data distribution.

Another way of looking at 2 is to use other data, which the LLM was trained on, to shift the output distribution the way you want. For example, you can polish the LLM's behavior by making it consistent with selected good patterns from some parts of its training data, retrieved and applied through prompting instructions. This leads to good synthetic data.

Check the terms of service to make sure you comply with the provider’s conditions. For example, one provider allows training LLMs on the synthetic output of its GPT-4-level Large model, in contrast to OpenAI’s policy, which does not allow that.

Another alternative is open-source models, e.g., Mixtral, which is very open thanks to its Apache-2.0 license and uses only about 13B active parameters per token. You can buy Mixtral inference from many providers, for example OpenRouter. As of 2024-03-18, the prices for Mixtral are similar to GPT-3.5.

Avoid Llama-70b, because it is a partially open model with use restrictions, including a ban on producing synthetic data for training other models (“You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Llama 2 or derivative works thereof).”).

Levels of Synthetic Data

Here is a spectrum of decreasing human involvement, or in other words increasing human leverage, in data creation and model development.

  1. Fully manual: You would love to train on data from people that is fully manually written and verified.
  2. Cleaned-up manual data: The data is manually written but rewritten, rephrased, or cleaned by a machine, then verified.
  3. The data is entirely generated but manually verified, and each sample is labeled.
  4. The data is entirely generated and rated by a machine. The trained model is evaluated on a small human-labeled subset.
  5. Self-aligning or self-polishing: The data is entirely generated and rated by a machine. The trained model is evaluated on machine-generated data.
  6. Autonomous: Self-improving from the environment and from generated data. The Q-transformer is a step towards that.

Curation of Synthetic Data

Garbage-in implies garbage-out. The more complex the data to generate, or the more distant it is from the real training data, the harder it is to synthesize correct data to train on.

In some problems, verification is easier than generation, so you can remove the invalid data from the generated data. For example, the goal may be to generate a program function that passes an executable set of tests. In this case, verifying that the generated function is correct is very easy: you run the tests. Another example could be playing chess.
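A minimal sketch of this kind of test-based filtering follows. The candidate strings stand in for functions written by an LLM, and the names `CANDIDATES`, `TESTS`, `passes_tests`, and `filter_valid` are made up for illustration. Running untrusted generated code with `exec` is unsafe outside a sandbox, so treat this as a toy only.

```python
# Hypothetical candidates, standing in for functions written by an LLM.
CANDIDATES = [
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    return a - b",   # wrong logic
    "def add(a, b) return a + b",          # syntax error
]

# Hypothetical test cases: (arguments, expected result).
TESTS = [((1, 2), 3), ((0, 0), 0), ((-1, 5), 4)]


def passes_tests(source, func_name, tests):
    """Return True only if the code defines func_name and passes every test."""
    namespace = {}
    try:
        exec(source, namespace)  # NOTE: sandbox this for real model output
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


def filter_valid(candidates, func_name, tests):
    """Keep only the candidates that pass the whole test set."""
    return [c for c in candidates if passes_tests(c, func_name, tests)]


valid = filter_valid(CANDIDATES, "add", TESTS)
print(len(valid), "of", len(CANDIDATES), "candidates kept")
```

Only samples that survive the filter would go into the training set, so the quality of the test set bounds the quality of the data.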

Constitutional AI uses an LLM to label synthetically generated responses to follow specific rules (constitution about harm, bias, and more).

Below is a prompt from Self-Alignment with Instruction Backtranslation for self-curation of synthetic data:

Below is an instruction from an user and a candidate answer. Evaluate whether or
not the answer is a good example of how AI Assistant should respond to the user’s
instruction. Please assign a score using the following 5-point scale:
1: It means the answer is incomplete, vague, off-topic, ...
2: It means the answer addresses most of the asks from the user. ...
3: It means the answer is helpful but not written by an AI Assistant. ...
4: It means the answer is written from an AI assistant’s perspective with a clear focus of addressing the instruction. ...
5: It means it is a perfect answer from an AI Assistant. ...
Please first provide a brief reasoning you used to derive the rating score, and then write "Score: <rating>" in the last line.
<generated instruction>
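Because the prompt asks the rater to end with a "Score: <rating>" line, the rating can be parsed and used as a filter. Here is a rough sketch of that step, not the paper's code. The sample dictionaries, the `rater_output` key, and the `min_score` threshold are assumptions for illustration.

```python
import re

# Matches a 1-5 rating such as "Score: 4".
SCORE_PATTERN = re.compile(r"Score:\s*([1-5])\b")


def parse_score(rater_output):
    """Return the 1-5 rating from the last line, or None if it is missing."""
    lines = rater_output.strip().splitlines()
    if not lines:
        return None
    match = SCORE_PATTERN.search(lines[-1])
    return int(match.group(1)) if match else None


def curate(samples, min_score=4):
    """Keep samples whose parsed score is at least min_score."""
    kept = []
    for sample in samples:
        score = parse_score(sample["rater_output"])
        if score is not None and score >= min_score:
            kept.append(sample)
    return kept


# Hypothetical rater outputs for three generated samples.
samples = [
    {"id": "a", "rater_output": "Clear and complete.\nScore: 5"},
    {"id": "b", "rater_output": "Helpful but reads like a blog post.\nScore: 3"},
    {"id": "c", "rater_output": "The rater forgot the final line."},
]
kept_ids = [s["id"] for s in curate(samples, min_score=4)]
print(kept_ids)
```

Samples with a missing or malformed score are dropped rather than guessed, which trades some data volume for cleaner labels.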

Expanding Human Data

WizardLM Evol Instruct uses human-written coding examples to generate more difficult (complex) examples with GPT-3.5. Then, they use the synthetic dataset to fine-tune and improve performance on coding problems in general.

Generating Instructions from Outputs (self-augmentation)

In some cases, generating the questions (inputs) is easy given the answers (outputs). This inverted generation also allows you to control the distribution of the labeled samples. In the case of text classification, it is easier to generate a text that matches the given category label. Another example of this method is Self-Alignment with Instruction Backtranslation, which also involves self-curation.
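For the text classification case, the inverted generation could look roughly like the sketch below. The prompt wording, the label names, and the `fake_llm` placeholder are all assumptions, and a real setup would call an actual model in its place.

```python
from collections import Counter


def build_prompt(label):
    """Ask for a text that matches the given category label."""
    return f"Write one short product review with a {label} sentiment."


def generate_dataset(generate, target_counts):
    """Generate labeled samples with exactly the requested count per label."""
    dataset = []
    for label, count in target_counts.items():
        for _ in range(count):
            text = generate(build_prompt(label))
            dataset.append({"text": text, "label": label})
    return dataset


def fake_llm(prompt):
    """Placeholder for a real model call; echoes the prompt so the sketch runs."""
    return f"[generated for: {prompt}]"


# The label distribution is chosen up front instead of being inherited from real data.
target_counts = {"positive": 2, "negative": 2, "neutral": 1}
dataset = generate_dataset(fake_llm, target_counts)
label_counts = Counter(row["label"] for row in dataset)
print(label_counts)
```

The label is known before the text exists, so no separate labeling pass is needed, although the generated text may still fail to match the label and may need curation.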


The model cannot learn out of thin air, but if you can reformulate or abstract the given problems into other problems, which the model trained more on, you can reapply the lessons and improve the model by polishing it this way.

For example, verifying and rating the response quality may be easier than writing the reply so that the model can self-improve on the given task this way.
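One way to turn such ratings into training data is to keep the best and worst rated candidates as a preference pair. This is only a sketch under assumptions: `toy_rater` is a hypothetical stand-in for an LLM-based rater, and the chosen/rejected field names are illustrative.

```python
def build_preference_pair(prompt, candidates, rate):
    """Rate candidates and return a chosen/rejected pair, or None if there is no contrast."""
    if len(candidates) < 2:
        return None
    scored = [(rate(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    worst_score, worst = min(scored, key=lambda pair: pair[0])
    if best_score == worst_score:
        return None  # all candidates rated the same, nothing to learn from
    return {"prompt": prompt, "chosen": best, "rejected": worst}


def toy_rater(prompt, answer):
    """Hypothetical rater; a real setup would ask a model to score the answer."""
    if answer.strip() == "4":
        return 5
    if "not sure" in answer:
        return 2
    return 1


pair = build_preference_pair("What is 2 + 2?", ["5", "4", "I am not sure."], toy_rater)
print(pair)
```

Prompts where all candidates receive the same rating are skipped, since they carry no preference signal.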

Examples of Synthetic Data Applications

What synthetic data tools can you use today?

Teknium’s Nous-Hermes-2-Mistral-7B-DPO

Mistral 7B fine-tunes trained on the open Hermes synthetic datasets, which were generated with OpenAI models, are among the best open-source models in this parameter-count category.

Nous-Hermes-2-Mistral-7B-DPO was trained on:

  • Supervised fine-tuning synthetic dataset teknium/OpenHermes-2.5, generated by GPT-4, with about 1,000,000 instructions/chats.
  • Direct preference optimization dataset, likely also GPT-4 generated (synthetic).

DSPy Python Library

DSPy can take just tens of labeled examples and high-level LLM chains and generate prompts to use, generate synthetic data, and fine-tune a smaller model.

This library generally helps you build prompt chains or pipelines where the LLMs have well-defined inputs and outputs and various tools like RAG.

The library abstracts away manual prompt engineering and instead optimizes the prompts for you, such that you only focus on structured and documented inputs and outputs. DSPy seems much more practical than LangChain.

The library uses a selected larger model (GPT-3.5 or Llama2 13b) to generate prompts and few-shot examples for your smaller LLM like T5. Not only that, you can compose an entire pipeline or prompt chain. Generating and optimizing the prompts (MIPRO) within the prompt chain is called compiling. For example, DSPy can generate reasoning examples and optionally fine-tune a smaller LLM on them. The question is, how good are the examples? Un-compiled chains use zero-shot prompting.
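To make the idea of generating few-shot examples with a larger model concrete, here is a plain-Python sketch. It does not use DSPy's API. The `toy_teacher`, the lookup table of teacher outputs, and the exact-match metric are assumptions for illustration.

```python
# Hypothetical outputs of the larger "teacher" model, including one mistake.
TEACHER_OUTPUTS = {
    "What is 2 + 3?": {"reasoning": "2 plus 3 equals 5.", "answer": "5"},
    "What is 4 + 4?": {"reasoning": "4 plus 4 equals 9.", "answer": "9"},
    "What is 1 + 6?": {"reasoning": "1 plus 6 equals 7.", "answer": "7"},
}


def toy_teacher(question):
    """Stand-in for a call to the larger model."""
    return TEACHER_OUTPUTS[question]


def exact_match(example, prediction):
    """Metric: the predicted answer must equal the labeled answer."""
    return prediction["answer"] == example["answer"]


def bootstrap_demos(train_set, teacher, metric, max_demos=2):
    """Collect teacher outputs that pass the metric, for reuse as few-shot demos."""
    demos = []
    for example in train_set:
        prediction = teacher(example["question"])
        if metric(example, prediction):
            demos.append({"question": example["question"], **prediction})
        if len(demos) >= max_demos:
            break
    return demos


def build_prompt(demos, question):
    """Format the demos plus the new question as a few-shot prompt for the smaller model."""
    parts = [
        f"Question: {d['question']}\nReasoning: {d['reasoning']}\nAnswer: {d['answer']}"
        for d in demos
    ]
    parts.append(f"Question: {question}\nReasoning:")
    return "\n\n".join(parts)


train_set = [
    {"question": "What is 2 + 3?", "answer": "5"},
    {"question": "What is 4 + 4?", "answer": "8"},
    {"question": "What is 1 + 6?", "answer": "7"},
]
demos = bootstrap_demos(train_set, toy_teacher, exact_match)
prompt = build_prompt(demos, "What is 3 + 3?")
print(prompt)
```

The labeled examples only need final answers. The reasoning text comes from the teacher and is kept only when the answer checks out, which partly addresses the question of how good the examples are.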

Here is an example DSPy project in a video.

Research Papers Relevant to Synthetic Data

Below is a partial list, and I may expand on this in the future.

Let’s Verify Step by Step

A math reasoning dataset. Contributions of this paper are:

  1. Process supervision can train more reliable reward models than outcome supervision. The state-of-the-art process-supervised reward model solves 78.2% of problems from a representative subset of the MATH test set.
  2. A large reward model can reliably approximate human supervision for smaller reward models and can be used to efficiently conduct large-scale data collection ablations.
  3. Active learning leads to a 2.6× improvement in the data efficiency of process supervision.
  4. The full process supervision dataset, PRM800K: 800K step-level labels across 75K solutions to 12K problems. Each step in a solution is assigned a positive, negative, or neutral label.
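As far as I understand the paper, a process-supervised reward model scores a whole solution as the probability that every step is correct, computed as the product of the per-step probabilities, and that score is used to pick the best of several sampled solutions. A small sketch of that scoring follows. The probabilities are invented numbers, not model outputs.

```python
import math


def solution_score(step_probs):
    """Probability that every step is correct: the product of per-step probabilities."""
    return math.prod(step_probs)


def best_of_n(solutions):
    """Pick the sampled solution with the highest reward-model score."""
    return max(solutions, key=lambda s: solution_score(s["step_probs"]))


# Hypothetical per-step correctness probabilities for two sampled solutions.
solutions = [
    {"answer": "x = 7", "step_probs": [0.99, 0.98, 0.97, 0.96, 0.95, 0.05]},
    {"answer": "x = 14", "step_probs": [0.99, 0.98, 0.97, 0.96, 0.95, 0.90]},
]
best = best_of_n(solutions)
print(best["answer"])
```

A single low-probability step pulls the whole score down, which is how step-level labels let the reward model reject a solution that is mostly right but ends in a wrong step.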

Chain-of-Abstraction Reasoning

An LLM is trained to generate reasoning steps (chains) that use general tools like a calculator or search. The tools are then executed in the order given in the reasoning chain, where the output of one tool call can be the input of another.

The word "abstraction" here expresses that the reasoning chains leave out the specifics, which are then filled in by reusable, general problem-solving tools.
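Executing such a chain can be sketched as below. The chain format, the `y1`/`y2` placeholder names, and the two toy tools are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical general-purpose tools.
TOOLS = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
}

# Each step: (name of the result, tool name, arguments).
# String arguments refer to the results of earlier steps.
chain = [
    ("y1", "add", [20, 35]),
    ("y2", "multiply", ["y1", 2]),
]


def run_chain(chain, tools):
    """Run tool calls in order, substituting earlier results for placeholders."""
    results = {}
    for name, tool, args in chain:
        resolved = [results[a] if isinstance(a, str) else a for a in args]
        results[name] = tools[tool](*resolved)
    return results


results = run_chain(chain, TOOLS)
print(results)  # {'y1': 55, 'y2': 110}
```

The model only has to produce the structure of the chain, while the tools supply the concrete values.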

Llama-70b was used to generate the synthetic data to train on.

Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

The smaller teacher model generates inputs and labels. The bigger model can learn to outperform its weaker teacher if allowed to be “over-confident.” However, this approach is not proven to work in all situations and still shows an upper performance limit.

Created on 20 Feb 2024. Updated on: 18 Mar 2024.

Copyright © Vaclav Kosar. All rights reserved. Not investment, financial, medical, or any other advice. No guarantee of information accuracy.