Introduction

Let's discover crul in less than 5 minutes!

What is crul?

The name crul is a mashup of "crawl" (as in web crawling) and "curl", one of our favorite tools for interacting with the web. crul was built out of a desire to transform the web (open, SaaS, API, and dark) into a dynamically explorable data set in real time.

The crul query language allows you to transform web pages and API requests into a shapeable data set, with built-in concepts of expansion into new links, and a processing language to filter, transform, and export your data set to a growing collection of common destinations.

The crul query language

The crul query language is designed with the express purpose of interacting with web content through a small number of core commands, further detailed below.

The crul language also includes a large number of processing commands, which allow for sequential processing of data. For example, we can start with the core api command to make an API request, then pipe the results to the find command, which searches them for "keyword". Here's what that query might look like:

api get https://... || find "keyword"
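
To make the piping model concrete, here is a minimal Python sketch of one way to think about stages separated by ||: each stage receives the rows produced by the previous stage and returns a new set of rows. It is an illustration only, not how crul is implemented; run_pipeline, find, and the sample rows are hypothetical stand-ins.

```python
from functools import reduce
from typing import Callable

Row = dict
Stage = Callable[[list], list]


def find(keyword: str) -> Stage:
    """Hypothetical stand-in for a find stage: keep rows where any value contains the keyword."""
    def stage(rows: list) -> list:
        return [r for r in rows if any(keyword in str(v) for v in r.values())]
    return stage


def run_pipeline(rows: list, *stages: Stage) -> list:
    """Feed the output of each stage into the next, left to right."""
    return reduce(lambda acc, stage: stage(acc), stages, rows)


# Pretend these rows came back from an API request (made-up data).
api_rows = [
    {"id": 1, "title": "a keyword appears here"},
    {"id": 2, "title": "nothing to see"},
]

matches = run_pipeline(api_rows, find("keyword"))
```

Because every stage has the same shape (rows in, rows out), stages can be chained in any number and order, which is the property the || operator relies on in this mental model.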

Many commands include a number of possible arguments and flags to further control the way they operate.

Core commands

api

The api command allows you to make REST requests to an endpoint, and returns the response in a tabular form. Currently this command supports XML, CSV, and JSON response formats, and will return other formats as raw data.
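
As a rough illustration of what "tabular form" can mean for a nested JSON response, the Python sketch below flattens each object into a single row with dotted column names. The flatten helper, the sample payload, and the dotted naming scheme are assumptions made for this example; the columns crul actually produces may differ.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into one dict keyed by dotted paths."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{index}."))
    else:
        # Leaf value: drop the trailing dot added by the parent level.
        flat[prefix[:-1]] = obj
    return flat


# Made-up response body standing in for a JSON API response.
body = '[{"name": "a", "meta": {"id": 1, "tags": ["x", "y"]}}, {"name": "b", "meta": {"id": 2, "tags": []}}]'

rows = [flatten(record) for record in json.loads(body)]
```

Here each record in the response becomes one row, and nested fields such as meta.id become their own columns. Empty containers contribute no columns in this sketch.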

open

The open command opens a web page, fulfills network requests, renders JavaScript, and so on, then processes the page's content into a tabular data set.

requests

The requests command opens a web page and monitors its network requests. Once the page has fully loaded, the response includes a rich data set with request sources and destinations, full request and response payloads, timing data, and more.

Query Examples

Open Web Content

Let's start with an example of a query that would get us article text from all the headlines on a news webpage.

We could create a query that opens the homepage of a news website, filters for the article links, expands and opens each of those links, and finally filters for each article's text content.

open https://www.[my news website].com /* open the news homepage */
|| filter "attributes.class == 'article_link'" /* filter for links to articles */
|| open $attributes.href$ /* open all article links */
|| filter "attributes.class == 'article_text'" /* filter for article text */
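
The filter and expansion steps in the query above can be approximated in Python as follows. This is a sketch under assumptions: the rows, the example.com URLs, and the expand helper are hypothetical, and the $column$ substitution shown here only mimics the token syntax that appears in the query, not crul's actual implementation.

```python
import re

# Matches tokens of the form $column.name$ inside a command template.
TOKEN = re.compile(r"\$([^$\s]+)\$")


def expand(template: str, rows: list) -> list:
    """Build one command per row by replacing $column$ tokens with that row's values."""
    commands = []
    for row in rows:
        try:
            commands.append(TOKEN.sub(lambda m: str(row[m.group(1)]), template))
        except KeyError:
            continue  # skip rows that lack a referenced column
    return commands


# Made-up rows standing in for the elements of an opened homepage.
rows = [
    {"attributes.class": "article_link", "attributes.href": "https://example.com/a"},
    {"attributes.class": "nav_link", "attributes.href": "https://example.com/about"},
    {"attributes.class": "article_link", "attributes.href": "https://example.com/b"},
]

# Equivalent in spirit to: filter "attributes.class == 'article_link'"
links = [r for r in rows if r.get("attributes.class") == "article_link"]

# Equivalent in spirit to: open $attributes.href$
commands = expand("open $attributes.href$", links)
```

The point of the sketch is that a single templated stage fans out into one request per remaining row, so the filter before it directly controls how many pages get opened.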

Constructing filters can sometimes be a little trickier than this example! Fortunately, you can often use the tabular data set to find multiple filters that together represent the part(s) of the page you are looking for.

API Content

You can also interact with APIs, where the concepts of filtering, expansion, and searching are very natural.

api get https://pokeapi.co/api/v2/pokemon /* request a list of pokemon */
|| normalize results /* normalize/flatten the data set */
|| api get $url$ /* open each pokemon entry referenced by the url in the normalized data */
|| find "charizard" /* find the word charizard in the results */
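
To show why the normalize step matters before expanding on $url$, here is a small Python sketch that turns one response row holding a list into one row per list element. The normalize function is a hypothetical approximation written for this example, and the response is a hand-written, trimmed sample shaped like a list response with a results array; whether crul keeps the parent fields alongside each element is not covered here, and this sketch drops them.

```python
def normalize(rows, column):
    """Turn each row holding a list under `column` into one row per list element."""
    out = []
    for row in rows:
        items = row.get(column)
        if not isinstance(items, list):
            out.append(row)  # nothing to split; keep the row as is
            continue
        for item in items:
            # Dict elements become rows directly; scalars are wrapped in a one-column row.
            out.append(item if isinstance(item, dict) else {column: item})
    return out


# Hand-written sample, trimmed to two entries for illustration.
response = {
    "count": 2,
    "results": [
        {"name": "bulbasaur", "url": "https://pokeapi.co/api/v2/pokemon/1/"},
        {"name": "ivysaur", "url": "https://pokeapi.co/api/v2/pokemon/2/"},
    ],
}

rows = normalize([response], "results")
urls = [row["url"] for row in rows]
```

After this step every row carries its own url value, which is what allows a following stage to issue one request per entry.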

Scheduling and sending to third-party destinations

Once we have written queries, we can turn them into simple reusable templates, and/or schedule them to run on an interval. We can also freeze query results to different destinations for further processing. For example, with our initial news article content query:

open https://www.[my news website].com
|| filter "attributes.class == 'article_link'"
|| open $attributes.href$
|| filter "attributes.class == 'article_text'"
|| freeze --store "kafka" /* send the results to a preconfigured kafka topic */