
synthesize

synthesize column1 ... columnN

Synthesizes results from a provided set of columns, which can be grouped together. A prompt flag can also be supplied to provide a GPT prompt describing the desired synthesized data set.

arguments:

column

One or more columns to use as a source of synthesized values. Can also be a comma-delimited list of columns ("col1,col2,col3") to denote that the columns are linked within a row, meaning that selecting a value from one column should also select the values for the other columns from the same row (type: string). See the example below.
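For instance, assuming hypothetical columns first_name and last_name that should stay paired within a row, plus an independent city column, an invocation might look like the following (the column names and flag syntax are illustrative, not prescriptive):

synthesize "first_name,last_name" city --count 100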

flags:

--appendStage

Used to append the results from a previous stage to the current stage. Provide a label, stage index, or boolean true to append the previous results; see the example under --labelStage.

--cache

A boolean (true/false) that determines whether to use the cache. Most commands default to true.

--checkpoint

Format: "{CHECKPOINT NAME}:{COLUMN}". Stores the value of the provided column (taken from the first row of results) under the provided name for use as a checkpoint in scheduled queries or other stages. The stored value can be accessed using $CHECKPOINTS.{CHECKPOINT NAME}$.
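For example, assuming a hypothetical column last_seen, the following sketch stores its first-row value under the name lastRun (flag syntax illustrative):

synthesize last_seen --checkpoint "lastRun:last_seen"

A later stage or scheduled query could then reference the stored value as $CHECKPOINTS.lastRun$.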

--count

The number of results to synthesize.

--credential

An OpenAI credential to use in place of the default credential named 'openai' that this command requires.

--filter

A filter to run on the command results before completing the command. If not provided, no filter is run on the results.

--guid

Adds a column populated with random GUIDs.

--ignoreEmpty

Ignore empty values in previous results. If this is false, then empty values will be used as a source of synthesized values.

--labelStage

Used to label a stage with a user-provided label. See the example below.
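As a sketch, a stage could be labeled and a later synthesize stage could append its results by that label (the stage label, column names, and flag syntax are illustrative):

synthesize name --labelStage baseNames
synthesize email --appendStage baseNames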

--prompt

A GPT prompt describing the desired synthesized data set. Requires an OpenAI credential named 'openai' (or one supplied via --credential) to be configured. See the example below.
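For example, a hypothetical request for fabricated customer records might look like this (the column names, prompt text, and flag syntax are illustrative):

synthesize full_name email --count 25 --prompt "realistic customer contact records for a retail store"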

--prompt.frequency_penalty

Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.

--prompt.logit_bias

Modify the likelihood of specified tokens appearing in the completion.

--prompt.model

ID of the model to use.

--prompt.presence_penalty

Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.

--prompt.temperature

What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.

--prompt.top_p

An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.
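As a sketch, the prompt.* pass-through parameters can be combined to tune generation; the column name, prompt, and values below are illustrative:

synthesize product_name --count 10 --prompt "plausible product names for a hardware store" --prompt.temperature 0.2 --prompt.top_p 0.9

OpenAI generally recommends altering temperature or top_p, but not both.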

--randomizeHash

Randomizes the stage hash, even if args and flags are the same.

--retry

The number of times to retry the synthesizing operation if it fails.

--source

Do not run the synthesize operation; only return the source data.

--stats

Controls whether a stats calculation is run on a stage after it completes.

--table

A comma-separated list of columns to include in the command results. If not provided, all columns will be included. See the example below.
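For example, assuming hypothetical columns name, email, and city, the following limits the output to two of them (flag syntax illustrative):

synthesize name email city --count 50 --table "name,email"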

--type

Each command has a default type, either "mapping" or "reducing". Some commands can operate as either: when "reducing", they operate on all rows at once; when "mapping", they operate on one row at a time.

support

AMI_ENTERPRISE AMI_FREE AMI_PRO BINARY_ENTERPRISE BINARY_FREE BINARY_PRO DESKTOP_ENTERPRISE DESKTOP_FREE DESKTOP_PRO DOCKER_ENTERPRISE DOCKER_FREE DOCKER_PRO