Building a data feed of GitHub issues to Cribl
Building a data feed from a query is a common use case of crul. In this example, we'll deploy a crul query that builds a dynamic data set from a REST API and delivers it to Cribl on a scheduled interval. We'll also explore how the diff command can be used to construct data feeds that only send new results.
We want to capture newly created issues from a few GitHub repos and push them to Cribl once an hour.
Building our query
Our first step will be writing a query that gets back a set of issues. With the api command, this is straightforward.
api get https://api.github.com/repos/nodejs/node/issues?per_page=100
The api command is very powerful: it can handle authentication in various forms (including tokens, keys, OAuth, etc.), paginate through results, set checkpoints, send custom headers and data payloads, and much more!
To expand this query to issues from multiple repos, we'll use the seed command with a few specific repositories. We could also dynamically fetch the repos in a particular organization with an api request, or use the thaw command to read repositories previously saved in a csv/json file and uploaded to the cellar.
|| api get $repository$/issues?per_page=100
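Put together, the multi-repo query looks something like the sketch below. The exact seed syntax and the specific repository URLs are illustrative assumptions; substitute the repos you care about.

```
seed repository = "https://api.github.com/repos/nodejs/node",
                  "https://api.github.com/repos/denoland/deno",
                  "https://api.github.com/repos/golang/go"
|| api get $repository$/issues?per_page=100
```

Each value seeded into the repository column is expanded into the `$repository$` token of the second stage, producing one API request per repo.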
Great! Now we have the most recent issues from each of the three provided repositories (up to 100 per repo, 300 total).
Many APIs return paginated responses, meaning that not all results are available in a single request; instead, each response includes a pointer to the next set of results. This pointer could be a hash value, a page number, an offset, or an explicit link. The crul api command can handle many types of pagination using the --pagination.* set of flags.
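As a sketch, fetching more than one page might look like the following. The specific flag names here are assumptions for illustration; check the api command reference for the --pagination.* flags that match your API's pagination style.

```
api get https://api.github.com/repos/nodejs/node/issues?per_page=100
  --pagination.enabled true --pagination.maxPages 5
```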
Many APIs require some form of authentication. This can be a token, an OAuth flow, or another mechanism. The api command is able to send requests with custom headers and data payloads, and also includes a --bearer flag, among other auth-related flags, to support many forms of API authentication.
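For example, GitHub accepts a personal access token as a bearer token (which also raises your rate limits). A sketch of passing one with the --bearer flag, with a placeholder for your own token:

```
api get https://api.github.com/repos/nodejs/node/issues?per_page=100 --bearer <your-github-token>
```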
Configuring a Store, Scheduling and Sending Incremental Diffs to Cribl
We now need to export this data set to Cribl and schedule it to run on a 1 hour interval. However, if we simply schedule the above query, we'll likely send many duplicate issues to Cribl, as most repos will not create more than 100 issues per hour. This is why we'll need to use the diff command to send over only new issues that have not already been sent.
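Conceptually, the feed query just appends a diff stage to the retrieval query. A minimal sketch (the bare diff invocation is an assumption; the wizard below configures the exact stage and flags for you):

```
api get https://api.github.com/repos/nodejs/node/issues?per_page=100
|| diff
```

On each scheduled run, only the issues that did not appear in the previous run's results make it past the diff stage.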
Since this diff/export/schedule step is so common, we can use crul's scheduled export wizard. Let's walk through each of the steps.
Step 1: Select a store or create a new one
The first step of the scheduled export wizard is to select a store, or configure a new one. Let's assume we are going to create a new one. If you have an existing store (possibly created by a previous run of the export wizard), you can simply select it from the dropdown.
Step 2: Select a schedule interval
Select a time interval for this query to run on. It will run shortly after creation and then on the set interval.
CAUTION! If you pick a short interval (less than 5 mins), you may run into issues with the crul cache. Ensure that stages in your query that you do not wish to be cached set the --cache flag to false, e.g. api ... --cache false. When in doubt, set --cache false on all stages.
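Applied to our issues query, disabling the cache on every stage might look like the sketch below (the repository URL is just an example, and the bare diff stage is an assumption):

```
api get https://api.github.com/repos/nodejs/node/issues?per_page=100 --cache false
|| diff --cache false
```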
Step 3: Select a diff strategy
There are a few diff strategies.
The rotating diff is the most commonly used diff strategy. It works by comparing the current results to the current diff file, then returning the results that do not exist in the diff file. Finally, it overwrites the diff file with all the current results, including ones that already existed in the diff file.
As an illustration, suppose the old diff file contains issues A and B, and the current query returns A, B, and C:
Content of old diff file: A, B
Results prior to the diff command: A, B, C
Results after the diff command: C (the only new issue)
Contents of diff file after the diff command: A, B, C
Store ALL results and append new ones. This can lead to big, growing diff files and is NOT recommended unless the result sets are small and/or you are regularly cleaning older entries out of the diff file.
Send the whole set of results each time.
Step 4: Confirm and deploy!
Check that the details look correct and hit Submit to deploy the data feed. It will start running on a schedule.
Note that crul must be continuously running for the scheduled queries to run. We recommend using the docker image as a long-running service when creating data feeds.
We've now seen how simple it is to convert a query into a data feed that populates one of 30+ destinations on a schedule, while maintaining a diff to ensure that only new results are sent over.
Any crul query can be turned into a feed using these steps! Have a web page that you would like turned into a data feed? No problem! Need to turn a REST API into a data feed? We got you!
Looks like that wizard added a few stages to my query - how do those work?
With the exception of the scheduling step, the wizard really only adds two stages to our data retrieval query. The first added stage is the diff command, which compares the current results to previously sent results and only returns the new ones. The second exports those new results to the store you configured in the wizard.
Once you understand these commands, you can construct powerful data feeds yourself and schedule them manually, or just use the wizard!
Join our community
Come hang out and ask us any questions. Many of the features and fixes in crul come from our user requests!