Synthetic Data Generation

The synthesize command creates synthetic data sets by combining values drawn from real data with fully synthesized values.

You can upload CSV, JSON, or NDJSON files to the cellar and thaw them as seeds for the synthesize command.

You can also provide natural language prompts to the synthesize command describing the synthetic data sets you would like to generate.

Note: The synthesize command requires authentication when using the --prompt flag, so you'll first need to configure a credential named openai that contains your OpenAI API key.

How it works

The synthesize command takes zero or more arguments, each of which must refer to columns in the previous stage's results. The values in those columns become the pool of possible values that synthesize selects from when building each synthetic row.

Columns can be linked together by passing them as a single comma-delimited argument, such as "product,price". If columns are linked, then all values synthesized for the linked columns will come from the same row.

Let's take a look at an example.

Query

seed '[
  {
    "product": "phone",
    "price": 800,
    "country": "US",
    "user": "Nemo",
    "user_id": "zxcv-asdf-1234-5678"
  },
  {
    "product": "computer",
    "price": 1000,
    "country": "FR",
    "user": "Dory",
    "user_id": "1234-5678-asdf-zxcv"
  },
  {
    "country": "CA",
    "user": "Marlin",
    "user_id": "0978-6543-asdf-yuio"
  },
  {
    "country": "GB"
  }
]'
|| synthesize "product,price" "user,user_id" "country"
--prompt "add a timestamp (named ts), a guid (named tx_id)"
--count 100

Synthesize Arguments

By providing the argument "product,price", we are telling the synthesize command that the product and price columns are linked, meaning that there is a 1:1 relationship between the two.

Every synthesized row with a given product value will carry the price that appeared alongside that product in the previous stage's results, which seed the synthesize command.

The same logic applies to the "user,user_id" argument: user and user_id are linked to each other, independently of the product and price pair.

The "country" argument is standalone in this example. This means that any country value from the seed results can be selected for a row, regardless of which product or user values that row contains.
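The linking rules above can be modeled in a few lines of Python. This is a minimal sketch for building intuition, not the actual implementation of the synthesize command: the function name synthesize_rows is hypothetical, and the handling of seed rows that lack a linked column (they are skipped for that group) is an assumption.

```python
import random

# Seed rows copied from the example query above.
SEED_ROWS = [
    {"product": "phone", "price": 800, "country": "US",
     "user": "Nemo", "user_id": "zxcv-asdf-1234-5678"},
    {"product": "computer", "price": 1000, "country": "FR",
     "user": "Dory", "user_id": "1234-5678-asdf-zxcv"},
    {"country": "CA", "user": "Marlin", "user_id": "0978-6543-asdf-yuio"},
    {"country": "GB"},
]


def synthesize_rows(seed_rows, groups, count, rng=None):
    """Hypothetical model of linked-column sampling.

    Each entry in `groups` is a comma-delimited string such as "product,price".
    Columns inside one group are copied from the same seed row; separate
    groups are drawn independently of each other.
    """
    rng = rng or random.Random()
    parsed = [[col.strip() for col in group.split(",")] for group in groups]
    # Assumption: a seed row is a candidate for a group only if it
    # contains every column in that group.
    candidates = [
        [row for row in seed_rows if all(col in row for col in cols)]
        for cols in parsed
    ]
    results = []
    for _ in range(count):
        new_row = {}
        for cols, rows in zip(parsed, candidates):
            if not rows:
                continue  # no seed row has all of this group's columns
            source = rng.choice(rows)
            for col in cols:
                new_row[col] = source[col]
        results.append(new_row)
    return results


rows = synthesize_rows(
    SEED_ROWS, ["product,price", "user,user_id", "country"], count=100
)
```

In this model a row can combine the computer/1000 pair with Marlin's user and user_id and the country US, because the three groups are sampled separately, but it can never pair computer with 800.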

Synthesize Flags

The --prompt flag allows us to provide a natural language prompt detailing what fully synthetic values to include in the result.

The --count flag determines how many results will be synthesized.

Results

This query will generate 100 results resembling the below:

| product | price | user | user_id | country | ts | tx_id |
| --- | --- | --- | --- | --- | --- | --- |
| phone | 800 | Dory | 1234-5678-asdf-zxcv | FR | 2022-06-29T13:38:37.413Z | e00024d9-3969-4243-8388-53ec0b76e31f |
| ... | ... | ... | ... | ... | ... | ... |
| computer | 1000 | Dory | 1234-5678-asdf-zxcv | GB | 2021-06-17T02:24:32.357Z | v98gh36f-e2f3-9867-9e11-f54hj7asd221 |
| computer | 1000 | Marlin | 0978-6543-asdf-yuio | US | 2020-05-14T02:24:56.857Z | b83be58d-e8e4-4952-9e99-f20bd4ece530 |
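One way to sanity-check synthesized output against the linking rule is to confirm that every combination of linked values also appears together in some seed row. The sketch below assumes the results have been exported as a list of dictionaries. The helper name linkage_violations and the sample output rows are hypothetical and are not part of the synthesize command.

```python
def linkage_violations(seed_rows, synthesized_rows, linked_columns):
    """Return synthesized rows whose linked values never appear together in one seed row."""
    allowed = {
        tuple(row[col] for col in linked_columns)
        for row in seed_rows
        if all(col in row for col in linked_columns)
    }
    return [
        row for row in synthesized_rows
        if tuple(row.get(col) for col in linked_columns) not in allowed
    ]


# user and user_id pairs taken from the seed data in the example query.
seed_rows = [
    {"user": "Nemo", "user_id": "zxcv-asdf-1234-5678"},
    {"user": "Dory", "user_id": "1234-5678-asdf-zxcv"},
    {"user": "Marlin", "user_id": "0978-6543-asdf-yuio"},
]
# Hypothetical output rows: the second one pairs Dory with Marlin's user_id.
synthesized_rows = [
    {"user": "Dory", "user_id": "1234-5678-asdf-zxcv"},
    {"user": "Dory", "user_id": "0978-6543-asdf-yuio"},
]
bad = linkage_violations(seed_rows, synthesized_rows, ["user", "user_id"])
print(bad)
```

An empty list means every row respects the link for that group of columns. The same check can be repeated for each linked argument, for example ["product", "price"].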

Additional Examples

Prompt only

To generate synthetic results that don't use any real data, we can simply use the --prompt flag and provide a natural language prompt describing the data set we would like to synthesize. Notice that zero arguments are provided.

Query

synthesize
--prompt "add a random product (named product) which is either a tablet, a phone, or a computer, add a random price (named price), add a timestamp (named ts), a guid (named tx_id)"
--count 100

Results

This will generate 100 results resembling the below:

| product | price | ts | tx_id |
| --- | --- | --- | --- |
| phone | 206 | 2022-06-29T13:38:37.413Z | e00024d9-3969-4243-8388-53ec0b76e31f |
| ... | ... | ... | ... |
| computer | 600 | 2020-05-14T02:24:56.857Z | b83be58d-e8e4-4952-9e99-f20bd4ece530 |
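For comparison, rows with the same shape can be produced locally with the Python standard library. This is only a stand-in to illustrate the four fields described in the prompt, not how the synthesize command generates values. The function name fake_rows, the price range, and the date range are assumptions chosen for illustration.

```python
import random
import uuid
from datetime import datetime, timedelta, timezone

PRODUCTS = ["tablet", "phone", "computer"]


def fake_rows(count, rng=None):
    """Build rows shaped like the prompt-only results (hypothetical helper)."""
    rng = rng or random.Random()
    start = datetime(2020, 1, 1, tzinfo=timezone.utc)
    three_years_ms = 3 * 365 * 24 * 60 * 60 * 1000
    rows = []
    for _ in range(count):
        # Random UTC timestamp within three years of `start`, millisecond precision.
        ts = start + timedelta(milliseconds=rng.randrange(three_years_ms))
        rows.append({
            "product": rng.choice(PRODUCTS),
            "price": rng.randint(100, 1500),  # assumed price range
            "ts": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "tx_id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
        })
    return rows


sample = fake_rows(100)
```

The difference with the --prompt flag is that the fields are described in natural language rather than coded by hand.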

Combination with thaw

We can also use the thaw command to load an uploaded or frozen file and use it as the data set from which synthesize selects values.

First, let's freeze a sample data set.

seed '[
  {
    "product": "phone",
    "price": 800,
    "country": "US",
    "user": "Nemo",
    "user_id": "zxcv-asdf-1234-5678"
  },
  {
    "product": "computer",
    "price": 1000,
    "country": "FR",
    "user": "Dory",
    "user_id": "1234-5678-asdf-zxcv"
  },
  {
    "country": "CA",
    "user": "Marlin",
    "user_id": "0978-6543-asdf-yuio"
  },
  {
    "country": "GB"
  }
]'
|| freeze synthetic-demo

This data set now exists in the cellar with the name synthetic-demo and can be accessed with the thaw command.

thaw synthetic-demo
|| synthesize "product,price" "user,user_id" "country"
--prompt "add a timestamp (named ts), a guid (named tx_id)"
--count 100

Results

This query will generate 100 results resembling the below:

| product | price | user | user_id | country | ts | tx_id |
| --- | --- | --- | --- | --- | --- | --- |
| phone | 800 | Dory | 1234-5678-asdf-zxcv | FR | 2022-06-29T13:38:37.413Z | e00024d9-3969-4243-8388-53ec0b76e31f |
| ... | ... | ... | ... | ... | ... | ... |
| computer | 1000 | Dory | 1234-5678-asdf-zxcv | GB | 2021-06-17T02:24:32.357Z | v98gh36f-e2f3-9867-9e11-f54hj7asd221 |
| computer | 1000 | Marlin | 0978-6543-asdf-yuio | US | 2020-05-14T02:24:56.857Z | b83be58d-e8e4-4952-9e99-f20bd4ece530 |

Combination with api/open

As with thaw, we can take the output of a query that uses the api or open commands and pass it to the synthesize command to generate synthetic data from those results.

Query

api get https://pokeapi.co/api/v2/pokemon
|| normalize results
|| synthesize "name"
--prompt "add a first name (named first_name), and a guid (named tx_id)"
--count 100

Results

This query will generate 100 results resembling the below:

| name | first_name | tx_id |
| --- | --- | --- |
| charizard | Steve | e00024d9-3969-4243-8388-53ec0b76e31f |
| ... | ... | ... |
| bulbasaur | Linda | b83be58d-e8e4-4952-9e99-f20bd4ece530 |