

synthesize column1 ... columnN

Synthesizes results from a provided set of columns, which can be grouped together. It also accepts a prompt flag that supplies a GPT prompt describing the desired synthesized data set.



One or more columns to use as a source of synthesized values. An argument can also be a comma-delimited list of columns ("col1,col2,col3") to indicate that the columns are linked within a row: selecting a value from one column also selects the values for the other columns from the same row. (type: string)
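The difference between linked and independent columns can be illustrated with a minimal Python sketch. It is not the command's implementation; `synthesize_rows` and `parse_column_args` are hypothetical helpers, and uniform random selection of source rows is an assumption.

```python
import random

def parse_column_args(args):
    """Split arguments such as "city,country" into groups of linked columns."""
    return [[c.strip() for c in arg.split(",") if c.strip()] for arg in args]

def synthesize_rows(rows, column_groups, count, seed=None):
    """Build `count` synthetic rows from `rows` (a list of dicts).

    Columns in the same group are linked: their values always come from
    one source row. Separate groups are sampled independently.
    """
    rng = random.Random(seed)
    out = []
    for _ in range(count):
        new_row = {}
        for group in column_groups:
            source = rng.choice(rows)  # one source row per group (assumed uniform)
            for col in group:
                new_row[col] = source[col]
        out.append(new_row)
    return out

rows = [
    {"city": "Paris", "country": "France", "plan": "free"},
    {"city": "Lima", "country": "Peru", "plan": "pro"},
    {"city": "Oslo", "country": "Norway", "plan": "team"},
]
groups = parse_column_args(["city,country", "plan"])
sample = synthesize_rows(rows, groups, count=5, seed=1)
```

In this sketch every synthetic row keeps a valid city/country pair, while `plan` may come from any source row.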



Appends the results from a previous stage to the current stage. Provide a label, a stage index, or boolean true to append the previous results.


A boolean value (true/false) that determines whether to use the cache. Most commands default to true.


Format: "{CHECKPOINT NAME}:{COLUMN}". Stores the value of the given column (from the first row of results) under the given name, for use as a checkpoint in scheduled queries or other stages. The stored value can be accessed using $CHECKPOINTS.{CHECKPOINT NAME}$.
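As a rough illustration of the checkpoint format, the Python sketch below parses a "{CHECKPOINT NAME}:{COLUMN}" value, stores the first-row value, and expands a $CHECKPOINTS.{CHECKPOINT NAME}$ reference. All function names are hypothetical, and the set of characters allowed in a checkpoint name is an assumption.

```python
import re

def parse_checkpoint_flag(value):
    """Split "name:column" into its two parts."""
    name, sep, column = value.partition(":")
    if not sep or not name or not column:
        raise ValueError('expected "{CHECKPOINT NAME}:{COLUMN}"')
    return name, column

def store_checkpoint(checkpoints, flag_value, results):
    """Record the column value from the first result row, if there is one."""
    name, column = parse_checkpoint_flag(flag_value)
    if results:
        checkpoints[name] = results[0][column]
    return checkpoints

def expand_checkpoints(text, checkpoints):
    """Replace $CHECKPOINTS.name$ references; unknown names are left as-is."""
    def repl(match):
        key = match.group(1)
        return str(checkpoints[key]) if key in checkpoints else match.group(0)
    # Assumption: names consist of letters, digits, and underscores.
    return re.sub(r"\$CHECKPOINTS\.([A-Za-z0-9_]+)\$", repl, text)

checkpoints = store_checkpoint({}, "last_id:id", [{"id": 42}, {"id": 7}])
query = expand_checkpoints("id > $CHECKPOINTS.last_id$", checkpoints)
```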


The number of results to synthesize.


An OpenAI credential to override the default one named 'openai' required by this command.


A filter to run on the command results before completing the command. If not provided, no filter is run on the results.


Adds a column populated with random GUIDs.


Ignore empty values in previous results. If this is false, then empty values will be used as a source of synthesized values.
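The effect of ignoring empty values can be sketched as follows. This is an illustration only: `candidate_values` is a hypothetical helper, and treating `None` and the empty string as "empty" is an assumption.

```python
def candidate_values(rows, column, ignore_empty=True):
    """Collect the values of `column` that may be used as synthesis sources."""
    values = [row.get(column) for row in rows]
    if ignore_empty:
        # Assumption: "empty" means None or an empty string.
        values = [v for v in values if v is not None and v != ""]
    return values

rows = [{"a": "x"}, {"a": ""}, {"a": None}, {}]
kept = candidate_values(rows, "a", ignore_empty=True)
everything = candidate_values(rows, "a", ignore_empty=False)
```

With empty values ignored, only `"x"` remains as a source; otherwise the empty entries can be selected like any other value.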


Used to label a stage with a user provided label.


A GPT prompt to detail the desired synthesized data set. Requires an OpenAI credential named 'openai' to be configured.


Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.


Modify the likelihood of specified tokens appearing in the completion.


ID of the model to use.


Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
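The frequency and presence penalties can be pictured as adjustments to a token's logit before sampling. The sketch below follows the formula OpenAI has published for these parameters (subtract the count times the frequency penalty, plus a one-time presence penalty if the token has appeared at all); the function name and dict-based representation are illustrative assumptions, not the service's internals.

```python
from collections import Counter

def apply_penalties(logits, generated_tokens,
                    frequency_penalty=0.0, presence_penalty=0.0):
    """Return a copy of `logits` adjusted for tokens already generated."""
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, logit in logits.items():
        c = counts.get(token, 0)
        seen = 1.0 if c > 0 else 0.0
        # Frequency penalty grows with each repeat; presence penalty is flat.
        adjusted[token] = logit - c * frequency_penalty - seen * presence_penalty
    return adjusted

adjusted = apply_penalties(
    {"a": 1.0, "b": 1.0},
    generated_tokens=["a", "a"],
    frequency_penalty=0.5,
    presence_penalty=0.25,
)
```

Here the already-used token `"a"` ends up with a lower logit than the unused token `"b"`, which makes repetition less likely.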


What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.
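To see why a higher temperature makes output more random, consider a temperature-scaled softmax. The Python sketch below is a generic illustration, not the model's actual sampling code, and it only handles strictly positive temperatures.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by `temperature`."""
    if temperature <= 0:
        # A temperature of 0 needs separate handling; this sketch rejects it.
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
focused = softmax_with_temperature(logits, 0.2)
varied = softmax_with_temperature(logits, 0.8)
```

At 0.2 nearly all probability mass sits on the highest-scoring token, while at 0.8 the mass is spread more evenly across the alternatives.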


An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.
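Nucleus sampling can be sketched as a filter over a probability distribution: keep the most likely tokens until their cumulative probability reaches top_p, then renormalize. The helper below is a generic illustration with a hypothetical name, not the model's implementation.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose mass reaches `top_p`."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    # Renormalize so the kept probabilities sum to 1.
    return {token: p / total for token, p in kept}

probs = {"a": 0.5, "b": 0.3, "c": 0.2}
narrow = top_p_filter(probs, 0.1)
wider = top_p_filter(probs, 0.7)
```

With top_p set to 0.1 only the single most likely token survives; with 0.7 the two most likely tokens are kept and the least likely one is dropped.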


Randomizes the stage hash, even if args and flags are the same.


The number of times to retry the synthesizing operation if it fails.
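A retry count of this kind is commonly implemented as a loop around the failing operation. The sketch below assumes that a value of N means one initial attempt plus up to N retries; how the command actually counts attempts is not stated here, and `run_with_retries` is a hypothetical name.

```python
def run_with_retries(operation, retries):
    """Call `operation` until it succeeds or the retries are exhausted."""
    last_error = None
    for _ in range(retries + 1):  # assumption: initial attempt + `retries` retries
        try:
            return operation()
        except Exception as exc:
            last_error = exc
    raise last_error

calls = {"n": 0}

def flaky():
    """Fail on the first two calls, then succeed."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

result = run_with_retries(flaky, retries=2)
```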


Do not run the synthesize operation, only return the source.


Controls if a stats calculation is run on a stage after it completes.


A comma-separated list of columns to include in the command results. If not provided, all columns are included.


Each command has a default type, either "mapping" or "reducing". Some commands can operate as either: when "reducing", they operate on all rows at once; when "mapping", they operate on one row at a time.
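The distinction between the two types can be shown with a small Python sketch. The helper names are hypothetical and the example only illustrates the calling pattern, not how commands are dispatched internally.

```python
def run_mapping(rows, fn):
    """Mapping: call `fn` once per row and collect the per-row results."""
    return [fn(row) for row in rows]

def run_reducing(rows, fn):
    """Reducing: call `fn` once with the full list of rows."""
    return fn(rows)

rows = [{"name": "ada"}, {"name": "grace"}]

# Mapping example: transform each row on its own.
upper = run_mapping(rows, lambda row: {"name": row["name"].upper()})

# Reducing example: compute one result from all rows together.
summary = run_reducing(rows, lambda all_rows: {"count": len(all_rows)})
```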