
· 2 min read
crul team

Authenticate and query more Business Apps for export to your Data Lake in Parquet format. Are we missing something? Let us know!

More Business Apps for Data Lakes​

Authenticate with more than 20 Business App services, now including Okta for Custom Authorization Servers, Microsoft 365, and Zoom.

Export in Parquet Data File Format​

Schedule and store incremental diffs in Apache Parquet format. Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

And much much more...​

Read more of more...

Join our Community​

Come hang out and ask us any questions. Some of the features and fixes in this release come from your requests!

Crul’n IT,
Nic and Carl

· 2 min read
crul team

We've packed plenty of your requests into this one! Authenticate and query Business Apps for export to your Data Lake. Maintain state across query stages using the new --variable flag. Paginate through REST and SOAP APIs with ease. Are we missing something? Let us know!

Business Apps for Data Lakes​

Authenticate and query Workday, Salesforce, Reltio, Mulesoft and Zoom. Store Business App event data to your preferred Data Lake.

Read more...

Query Variables​

While developing queries, you will often find yourself needing a value from a few stages ago that has since been filtered out. Some examples would be credentials, dates, or other bits of state. This release introduces query variables for simple stage-level state setting and retrieval.

Read more...

API Pagination​

Paginate through any type of REST or SOAP API specification.

Read more...

Global controls​

Disable caching globally for certain development and production use cases. Turn off Domain Throttling for increased outbound request throughput - use at your own risk!

And much much more...​

Read more of more...

Join our Community​

Come hang out and ask us any questions. Some of the features and fixes in this release come from your requests!

Crul’n IT,
Nic and Carl

· 6 min read
crul team

Getting clean data is hard enough, but sometimes, you need more than clean data. Just like in a house or an apartment, pushing the clutter into the closet doesn't make it go away. You need to tidy your data.

It's one thing to have tabular data with nice columns and logical rows. It's another to have data that is ready for analysis. Tidy data is data that is ready for analysis: it's organized in a way that makes it easy to work with, manipulate, visualize, and model.

What is tidy data?

The tidy data concept was introduced by Hadley Wickham in his 2014 paper, Tidy Data. And it still rings true today.

Simply put, tidy data is a dataset where:

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.

Although this structure is not always necessary for high-quality analysis, it can often make a big difference in the ease of analysis.

Tidy data in crul

The de facto data format in crul is the table, and the query language makes it easy to normalize (similar to flattening), rename, table, and untable data dynamically. Other common operations like joining and appending are also possible.

Often, to get tidy data you'll need to "melt" your columns. This is where crul's new melt command shines. More on this shortly, but first, what does it mean to "melt" data?

Melting data

Melting data is the process of taking columns and turning them into rows. This is often necessary when you have a dataset that has multiple columns that represent the same thing. For example, if you have a dataset that has a column for each year, you might want to melt the data so that you have a single column for the year and a single column for the value.

Data prior to melting:

song    artist    2019    2020    2021    2022    2023
song1   artist1    100     200     300     400     500
song2   artist2    200     300     400     500     600
song3   artist3    300     400     500     600     700

Data after melting:

song    artist    year    plays
song1   artist1   2019      100
song1   artist1   2020      200
song1   artist1   2021      300
...     ...       ...       ...
song2   artist2   2022      500
...     ...       ...       ...
song3   artist3   2023      700

By melting the data, you can now easily analyze, visualize, and model the data by year.

Melting data in crul

The melt command in crul makes it easy to melt data. It takes a list of columns to melt, melts the data in those columns, and keeps the rest. You can then rename the columns to whatever you want and continue processing, download the results as a CSV or JSON file, or push them to a third-party store (like an S3 bucket).

Let's see an example of melting the data from the previous example.

We'll assume our data is in a file called plays.csv that we have uploaded to the cellar. It will be the same as the data in the previous example.

thaw plays.csv
|| melt 2019 2020 2021 2022 2023
|| rename column year

You can also provide wildcards to the melt command. For example, if you wanted to melt all columns that start with 20, you could do the following:

thaw plays.csv
|| melt 20*
|| rename column year

More examples from the tidy data paper

Let's take two examples from the tidy data paper and see how we can melt the data in crul.

Example 1: Billboard top 100​

We'll start with the billboard charts dataset from the tidy data paper. You can find the dataset here.

We'll first upload that csv to the cellar so we can thaw it into our pipeline.

thaw billboard.csv

Raw Billboard

Notice that we have observations in our columns, specifically the billboard rank at different weeks in columns x1st.week, x2nd.week, etc. This is not tidy!

Let's melt all columns that fit the regex pattern x.* (x1st.week, x2nd.week, etc.).

thaw billboard.csv
|| melt x.*
|| rename value.week rank
|| rename column week

Tidy Billboard

From here we can do a little more cleanup and renaming of columns, construct timestamps, or process otherwise, but our data is now effectively "molten".

Example 2: Tuberculosis​

thaw tb.csv

Raw TB

Notice that we have observations in our columns, specifically the number of cases for different categories/dates in columns new_sp_m04, new_sp_m514, etc. This is not tidy!

Let's melt all columns that fit the regex pattern new_sp.* (new_sp_m04, new_sp_m514, etc.).

We are also using the untable command to remove an unwanted row that will match our pattern.

Finally, we use a combination of the fillEmpty and filter commands to filter out null values. This is optional; in fact, you might want these empty values in your results for analysis, or you may want to fill them with a different default and leave them in!

thaw tb
|| untable new_sp
|| melt new_sp.*
|| fillEmpty --filler "EMPTY"
|| filter "(value != 'EMPTY')"

Tidy TB

From here we can do a little more cleanup and renaming of columns, construct timestamps, or process otherwise, but our data is now effectively "molten".

Why use crul for tidy data?

The advantage of using crul for tidy data is the ability to both access the data and process it quickly in one place. Crul's caching tiers make it easy to iteratively design your dataset. You can also configure a schedule to automatically build data sets and optionally push them to one or more of 30+ common stores.

You can take advantage of other powerful commands in combination with the melt command. For example, incorporate semi-synthetic data generation with the synthesize command, incorporate prompting with the prompt command, or enrich/seed your data sets from web or API content with the open and api commands.
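As a rough sketch of the idea (reusing the plays.csv example from above; the prompt wording and token substitution are only illustrative and assume the melted column names from the earlier examples):

thaw plays.csv
|| melt 20*
|| rename column year
|| prompt "In one sentence, describe how popular $song$ by $artist$ was in $year$"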

Happy melting!

Join our Community​

Come hang out and ask us any questions.

Join our discord

· One min read
crul team

Turn SOAP APIs into data sets with ease. Tidy up your data with the melt command. Support for XML data sources and uploads. Are we missing something? Let us know!

SOAP API support​

Make SOAP API requests using the soap command. Transform SOAP APIs into clean, tabular data sets with the benefits of crul's caching, domain policies, and powerful command library.

Read more...

Melt data​

Apply Tidy Data principles through the new melt command. Perfect for nested data when the normalize command isn't quite what you are looking for.

Read more...

And much much more...​

Read more of more...

Join our Community​

Come hang out and ask us any questions. Some of the features and fixes in this release come from your requests!

Crul’n IT,
Nic and Carl

· One min read
crul team

Map, reduce, expand your mind and data using JavaScript and JSON. Are we missing something? Let us know!

Data Processing with JavaScript​

Embed JavaScript for custom data processing within a crul query using the evaluate command. Include your favorite JavaScript data manipulation libraries like mathjs or Lodash, or load custom ESM modules.

Read more...

Syntax Highlighting​

Beautiful syntax highlighting support for all your JSONs and JavaScripts in the crul query bar. Use the triple back tick ```javascript/json notation and look like a pro.
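For example, wrapping an illustrative JSON payload like the one below in the ```json notation will light it up in the query bar (the payload itself is just a sample, not tied to any particular command):

```json
{
  "model": "gpt-3.5-turbo",
  "messages": [{"role": "user", "content": "Write a haiku about APIs"}]
}
```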

Read more...

And much much more...​

Read more of more...

Join our Community​

Come hang out and ask us any questions. Some of the features and fixes in this release come from your requests!

Crul’n IT,
Nic and Carl

· 2 min read
crul team

We're going multi-dimensional in this release: query GraphQL APIs, run curl scripts, generate Vector Embeddings, query using Semantic Search, and explore the developer API Documentation. Are we missing something? Let us know!

Vector Embeddings and Semantic Search

Dynamically generate Vector Embeddings from API and Web data. Persist vector embeddings into a vector database such as Pinecone, and semantically query on the fly. Query pre-populated vector databases for performant search.

Read more...

Run curl Scripts and Query GraphQL APIs​

Take your curl and GraphQL games to the next level... Run, paginate, cache, schedule and securely authenticate curl and GraphQL queries. Transform and persist curl and GraphQL differential results to 30+ stores. Generate synthetic datasets and Vector Embeddings using curl and GraphQL results data. Use generative AI to create curl/GraphQL scripts/queries to run.

Read more on curl...
Read more on GraphQL...

Developer API Documentation​

The crul API allows for programmatic access to core crul services and resources. This includes dispatching queries and results retrieval, as well as create, read, update, delete (CRUD) operations on core crul resources such as scheduled queries, credentials, domain policies, and more.

Read more...

And much much more...​

Read more of more...

Join our Community​

Come hang out and ask us any questions. Some of the features and fixes in this release come from your requests!

Crul’n IT,
Nic and Carl

· 2 min read
crul team

Whoa, we've included Prompts and Synthetic Data Generation in this release. Are we missing something? Let us know!

Chainable GPT Prompts​

Integrate GPT into your data pipeline. Send, chain, and reuse prompts with fine-grained sampling, likelihood, penalty, bias, and model control.

Seed prompts with API and Web data. Recursively generate prompts from other prompts. Lose your mind.

Read more...

Synthetic Data Generation​

Create synthetic data sets using both real data in combination with fully synthesized values. Use natural language prompts describing the synthetic data sets you would like to generate.

Read more...

And much much more...​

Read more of more...

Join our Community​

Come hang out and ask us any questions. Some of the features and fixes in this release come from your requests!

Crul’n IT,
Nic and Carl

· 5 min read
crul team

Intro​

Many of you have probably already tried using OpenAI's API to interact with available AI models. crul's natural API interaction and expansion principles make it easy to create generative AI workflows that dynamically generate and chain together prompts.

Although this blog is limited to OpenAI's API and ChatGPT related models, you can use the same concepts to chain together or distribute prompts across multiple models.

First contact​

Let's start small and run a single prompt.

Note: The OpenAI API requires auth, so you'll first need to configure a basic credential containing your OpenAI API key with the name openai.

Running a single prompt​

This query uses the prompt command to run a simple prompt using OpenAI's API.

prompt "Write a haiku about APIs"

Simple prompt

Note: If you rerun this query, you'll get back the cached results. You can bypass the cache using the --cache false global flag.
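For instance, a minimal sketch applying that flag to the query above (exact flag placement may differ slightly; check the docs):

prompt "Write a haiku about APIs" --cache false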

Another way to run this query is with a template from crul's query library. You can think of this template as a macro for the expanded query described in the Template explanation at the end of this post. This template will show you the underlying implementation of the prompt command using the api command.

Chaining prompts​

One prompt is cool enough, but let's chain some prompts together with some expansion. In this query, we'll prompt for a list of 5 cities, then split the comma separated response and normalize into 5 rows, then use a different prompt that includes the values ($response$) from the first prompt.

prompt "give me a list of 5 cities on earth, comma separated. like Paris, Los Angeles,etc."
|| split response ","
|| normalize response
|| prompt "what is the distance in miles to $response$ from San Francisco. Respond only with the numeric value. So if the answer is 3000 miles, respond with just 3000, no miles, commas or other punctuation or units of measurement"

Chained prompt

Seed prompts from web content​

We can manually fill in the prompt, or generate one/many from a previous set of results. For example, let's get all the headlines from the Hacker News homepage, then ask OpenAI to create a haiku based on each headline. Notice the $headline$ token in the prompt which allows us to create dynamic, previous stage dependent prompts.

open https://news.ycombinator.com/news
|| filter "(nodeName == 'A' and parentElement.attributes.class == 'titleline')"
|| rename innerText headline
|| prompt "Write a haiku about the following headline: $headline$"

Note: This query will have outbound requests throttled by domain policies, which defaults to 1 request per second per domain. This is also the throttle for the OpenAI API as of this blog post, so all good there!

HN Haikus

That's kind of cute! Let's try a more complex prompt and translate each title to French.

Translate web content​

What's different about the next query is that we'll merge the headlines into a single row containing a @@@ delimited string to pass into a single request. This isn't necessary; however, it reduces the number of requests we make in the prompt command's stage.

The last two stages (extract and normalize) of the query will extract the JSON response and expand it into rows.

For similar queries, you will need to write your prompt accordingly to let the model understand the data structure you are providing it with.

open https://news.ycombinator.com/news
|| filter "(nodeName == 'A' and parentElement.attributes.class == 'titleline')"
|| mergecolumn innerText --asArray false --delimiter "@@@"
|| rename innerText headlines
|| prompt "Respond only with an array of json objects containing the original headline in a headline key and the translation in a translation key. Translate each of the headlines contained in the following @@@ delimited string to French: $headlines$"
|| extract response
|| normalize

HN Translations

Using different models​

The prompt command currently defaults to OpenAI's gpt-3.5-turbo model. To override it, see the prompt command's --prompt.model flag.
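For example, a minimal sketch (the model name here is only an illustration of the flag, not a specific recommendation):

prompt "Write a haiku about APIs" --prompt.model gpt-4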

Summary​

Have fun playing with API and web content as seeds for LLM prompts! The possibilities are endless, and crul is a great tool for quickly trying out new ideas and creating ML-powered workflows.

Possible next steps using crul?

  • Schedule the query and export the results to a data store of your choice to automate a workflow.

  • Use the responses as inputs to another API request.

  • Download the results to CSV/JSON for further processing elsewhere.

We're working on a few improvements to transform web content into JSON-friendly strings. Until then, you may run into classic (annoying) issues with string escaping in the --data object.

Please let us know if you run into any issues!

Pretty cool no? Or terrifying, you decide.

Template explanation​

The Prompt OpenAI template is essentially the query below. It's a pretty standard authenticated API request using the crul language, where we set headers for Content-Type and Authorization and provide a data payload.

To run this query, you'll need to replace the $template...$ tokens with explicit values or tokens of your own. For example $template.apiKey$ could be replaced with $CREDENTIALS.openai$.

api post https://api.openai.com/v1/chat/completions
--headers '{
"Content-Type": "application/json",
"Authorization": "Bearer $template.apiKey$",
}'
--data '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "$template.prompt$"}],
}'

· 2 min read
crul team

Holy moly what a wild month it has been! The ticker keeps ticking on downloads and we’ve included our users' top requests in this release. Are we missing something? Let us know!

API/Web Proxy Support​

A proxy server can now be used as a gateway between crul and the internet for core Web (open, requests, scrape, form) and API (api) commands. Proxy servers are a useful way to provide varying levels of functionality, security, and privacy depending on your use case, needs, or company policy.

Support for http, https, socks4, socks5 and pac proxies comes out of the box. Party at Crul’s - bring your own proxy!

Related Links: Proxy web page request || Proxy API request

Headless Browser Stealth​

Remotely controlling a web browser leaves traces called browser fingerprints. Browser fingerprints can be used to distinguish a remotely controlled browser from a normal user controlled web browser.

By default, in Crul 1.1.0 'Browser Stealth' mode is enabled, hiding the browser's remote control state by erasing the browser fingerprint that is associated with a non-human user.

Read more...

Fetch ZIP Archives to Scan and Extract​

ZIP archives can now be fetched (api command) with the ability to scan metadata and read/serialize specific files into tabular format.

Read more...

New commands for HTML content extraction​

The new parseHTMLTable command allows for simplified conversion of HTML tables into Crul’s tabular format. Easy to process and export as a CSV! The new parseArticle command can turn web pages containing an article into a standardized data structure of article content, headline, author, etc.

Related Links: HTML table extraction || HTML article extraction

And much much more...​

Read more of more...

Join our Community​

We started up the Crul discord channel recently, come hang out and ask us any questions. Some of the features and fixes in this release come from your requests!

Crul’n IT,
Nic and Carl

· 7 min read
crul team

First off, we're so grateful to all of you who took the time to read our Hacker News post and try out crul. It's our 1.0.x, and we had no idea what to expect; seeing so many of you jump in has really fueled our spirits.

We know we have a lot to improve, but we hope you found it useful and intriguing. We could really use your help in understanding more about your use cases, and what interested you about crul.

Alright, let's go back in time and behind the scenes. 🎞️

Tuesday February 21st 10AM:

We call off our planned launch and decide to push it a week. Why? Nic becomes convinced we shouldn't leave potential Windows and Linux users out to dry. Carl agrees. Native binaries could be tough, but a docker image? - possible. Let's give it a shot. After all, you only launch (for the first time) once - YOL(FTFT)O?

Our new launch date? One week from now - February 28th, 8AM. ⏰

Wednesday February 22nd, 3PM:

First successful build of our docker image! 🐳 Lots of tweaks and CICD still needed, but we are now feeling good about our planned release date. Carl adds in some nice enhancements to our query bar command suggest section and clears up a few long standing bugs with the query bar.

Friday February the 24th, 6PM:

Both our docker and Mac OS 1.0.3 builds are done; we upload and start tying up loose ends on the crul.com homepage and enhancing the documentation. 🏁

Saturday February the 25th, 9AM:

Our account page looks pretty ... rough. Before you all saw it, it was just 4 big ugly purple buttons - it got the job done, but we could make it look better. Carl calls for a refresh. 🧼

By the end of the day it's looking much more polished. Nice work!

Meanwhile we are tweaking our post for Hacker News; we want it to sound like it's coming from us, two engineers who love what they're building but are figuring things out as they go. 🗺️

Sunday February 26th, 9AM:

Today we'll run end-to-end tests, proofread things (far from done - we're still catching things!), keep adding documentation, examples, etc. Get a little rest as well; we hope to have a big week ahead. 🛏️

Monday February 27th, 10AM:

There's something up with upgrades, it's inconsistent but does fail sometimes. Ouch! 😱 Nic spends most of the day diagnosing, no luck. We decide to move forward, we'll fix it in time for the next upgrade.

Tuesday February 28th, 7AM:

We're up early, at least for Nic; Carl likes to get the worm. 🪱 Just getting everything in place ahead of submitting.

Tuesday February 28th, 8AM:

Submit! Post a message to our Slack, "Hey check this out - we're on Hacker News!"

Major OOPS, stay tuned. ⭕

Tuesday February 28th, 9:30AM:

Hm, we're still in shownew. It's been an hour and a half and we haven't moved to the show section of HN, which is just a click off the home page and should get us a little more traffic. We've got (~7) upvotes and a comment? What's going on? 😬

We draft an email for dang/Hacker News.

Tuesday February 28th, 10:10AM:

We get a response from HN: we screwed up badly by posting to our Slack channel - just a couple of quick upvotes from our Slack friends and we were flagged. We really screwed up here, but fortunately it's not egregious, and we get a second chance. 🤭🤭🤭

I can't reiterate this enough: do NOT post the link to your HN post after submission! It's clearly outlined not to solicit upvotes or comments, but we took this too literally. We convinced ourselves that a simple "Hey check this out - we're on Hacker News!" did not explicitly solicit interaction, but, of course, it goes against the spirit of the submission rules - we were dumb and lucky, don't make the same mistake!

Tuesday February 28th, 10:15AM:

We hit the show section, traffic starts coming in, upvotes start coming in, comments start coming in.

OMG!

Tuesday February 28th, 10:30AM:

We're getting good steady traffic for the first time - ever? So exciting, we're glued to the post and our traffic metrics! We also get on the front page!!! 🤩

Tuesday February 28th, 2:00PM:

Still in the 15-25 rank range on the home page, lots of discussion, sign ups and inbound, it's exciting but nerve-wracking. 😹

Tuesday February 28th, 8:00PM:

Still (!) in the 15-25 rank range on the home page, still (!) steady traffic and downloads, we are so thrilled. We're adding tickets to our backlog like crazy based on what we are hearing back. 😊

Wednesday March 1st, 8:00AM:

Can't fully remember if we were still on the front page 24 hours later, but we're pretty sure - it's a bit of a blur and I doubt either of us slept well from the excitement. The post is still doing well, we are still getting traffic and plenty of sign ups. 🚘

Wednesday March 1st, 10:00AM:

Responding to tickets and questions and trying to wrap our heads around how to figure out what people found cool about crul, and whether it lived up to their expectations. 🤷‍♂️

|| timestamp (Present day)

The last week has been a whirlwind 🌀, and it is still ongoing now as our sign ups keep ticking up. We hope you enjoyed a little taste of the behind the scenes. 🎬

What's next?​

We've bootstrapped crul and are excited to continue developing it into our larger vision, but we truly need your help. We built what we wanted to build, what we thought was cool and interesting and powerful, but now we could really use your perspective. 👀

What did/do you want to do with crul? Were you able to? What made sense, what was confusing? Would you be willing to share your thoughts here?

We would also love to chat with you, and if you've got 15 minutes to tell us your thoughts and maybe hear some epic stories from Carl, you'll be first in line for a commemorative "i used crul before it was cool" mug! ☕

Lastly, if you have an enterprise use case, we would love to collaborate with you as a design partner. We're a lot of fun to work with, dedicated, and care deeply about solving your problems. Hell, we'll probably make it free if you ask nicely! 😉

TL;DR: Launched on Hacker News, did well, was exciting, now what? Help us out?

Stumbling along as we go,

Nic and Carl (Founders of Crul, Inc.)