📄️ Quickstart 1: Finding Expired GitHub GPG Keys
Click through to watch a recorded video version of this crul quickstart!
📄️ Quickstart 2: Retrieving comments from top posts on Hacker News
📄️ Expanding links from a webpage (Hacker News)
A common use case for the crul query language is taking advantage of expanding stages to open many pages from a single webpage and return the results as a consolidated data set. For example, we may have a recipe site with many recipes listed in a recipe directory. We can use crul to get links to all the recipes in the directory, then expand each of those links and filter for recipe ingredients or other content of interest.
📄️ Expanding links from an API (Hacker News)
In this example, we will get back the last 10 items from the Hacker News API. Comments, stories, etc. are all considered items, each with a unique id and metadata indicating its type (comment, story, and so on). We will first get the largest item id (the most recent item) and use it to construct a range of item ids to query. We will then use this range in an expanding stage that makes an API request for each item.
📄️ Fetch a ZIP Archive and Scan/Extract
ZIP archives can be fetched remotely via the api command, allowing you to scan entry metadata and extract/convert entries to datasets.
📄️ Getting product prices from a webpage (Shopify)
This query will extract product prices from a Shopify-powered site.
📄️ Querying an authenticated API (Twitter)
Many APIs require some form of authentication. This can be a token, an OAuth flow, or another mechanism. The api command can send requests with custom headers and data payloads, and includes a --bearer flag, among other auth-related flags, to support many forms of API authentication.
📄️ Querying an asynchronous API (Splunk Query)
Many services, such as query engines like GCP BigQuery, AWS Athena, and Splunk, have asynchronous dispatch APIs for running queries. You dispatch a query against the service and get back a job id, which you can then poll for status/completion before accessing the results. This is a common API pattern, and it is supported by the crul api command syntax.
📄️ Querying an API Through a Proxy
An API request can be easily routed through a proxy. Supported proxy protocols include http, https, socks4, socks5, and pac.
📄️ Querying a Web Page Through a Proxy
A web page request can be easily routed through a proxy. Supported proxy protocols include http, https, socks4, and socks5.
📄️ Interacting with the OpenAI API
📄️ Paginating a blog
There are several approaches to paginating web content in crul, but the simplest involves understanding the URLs of paginated content. Often, a blog directory or other paginated content will indicate new pages through a /page/1, /page/2, etc. pattern in the URL. We can use this structure in combination with crul's expanding stages to paginate through a directory of blog posts.
📄️ Paginating an API (pokeapi)
Many APIs return paginated responses, meaning that not all results are available in a single request; instead, the response includes a pointer to the next set of results. This pointer could be a hash value, a page number, an offset, or an explicit link. The crul api command can handle many types of pagination using the --pagination.* set of flags.
📄️ Exporting results
Exporting results to third-party data lakes or other destinations is straightforward with the help of the freeze command. With it, we can easily push the results of our query directly to a preconfigured third-party store, or save the results locally to a file.
📄️ Capturing network requests
In addition to rendering web content and processing web pages into tabular data sets, crul can capture the network lifecycle of loading a webpage using the requests command. This command lets us capture the performance and request/response content of third-party network requests, along with a rich assortment of metadata for performance and security monitoring. It's also incredibly easy to use.
📄️ How to find filters
Using the open or api command often generates a large data set, and it can be tricky to find the right filters. For example, if we are trying to get all headlines from a news site, how do we know which filters describe a headline?
📄️ How to use branching
The branch command is a powerful way to split a query mid-execution and operate on a stage in two different ways before joining the results back together. For example, you could apply one filter in one branch and a totally different one in another, interact further with each set of filtered data, and then join them back together into a single consolidated data set.
📄️ Converting an HTML Table to a Dataset
HTML tables allow web developers to arrange data into rows and columns. A table in HTML consists of table cells arranged in rows and columns, which can be easily converted to a dataset using the parseHTMLTable command.
📄️ Converting an HTML Page to an Article
An HTML document can be easily converted to an article using the parseArticle command.