Expanding links from a webpage (Hacker News)

A common use case for the crul query language is using expanding stages to open many pages linked from a single webpage and return the results as one consolidated data set. For example, we may have a recipe site with many recipes listed in a recipe directory. We can use crul to get links to all the recipes in the directory, then expand each of those links and filter for recipe ingredients or any other data of interest.

Example: Hacker News Comments

Full Query

open https://news.ycombinator.com/news
|| find comments
|| filter "(nodeName == 'A') and (parentElement.attributes.class == 'subline')"
|| open $attributes.href$
|| filter "(attributes.class == 'comment')"

open https://news.ycombinator.com/news
|| find comments
|| filter "(nodeName == 'A') and (parentElement.attributes.class == 'subline')"

The first stage will open the Hacker News site and process the page into a tabular structure. Think of crul as a browser that opens this page, renders the content, fulfills network requests, etc., and then converts that rendered content into a tabular format.
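To make the idea of "a page as a table" concrete, here is a minimal Python sketch that flattens a small HTML fragment into one row per element. It is only an illustration, not how crul works internally: the `RowBuilder` class and the sample markup are hypothetical, and the column names are borrowed from the query above (`nodeName`, `attributes.*`, `parentElement.attributes.class`), plus an assumed `innerText` column.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}  # elements with no closing tag


class RowBuilder(HTMLParser):
    """Hypothetical helper: flatten HTML into one row (dict) per element."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._stack = []  # indices into self.rows for the currently open elements

    def handle_starttag(self, tag, attrs):
        parent = self.rows[self._stack[-1]] if self._stack else None
        row = {
            "nodeName": tag.upper(),
            "parentElement.attributes.class": (
                parent.get("attributes.class") if parent else None
            ),
            "innerText": "",
        }
        # One flat column per attribute, e.g. attributes.href
        for name, value in attrs:
            row[f"attributes.{name}"] = value
        self.rows.append(row)
        if tag not in VOID_TAGS:
            self._stack.append(len(self.rows) - 1)

    def handle_endtag(self, tag):
        # Close the most recently opened element with this tag name, if any
        for pos in range(len(self._stack) - 1, -1, -1):
            if self.rows[self._stack[pos]]["nodeName"] == tag.upper():
                del self._stack[pos:]
                break

    def handle_data(self, data):
        # Text counts toward every open ancestor, similar to innerText
        for index in self._stack:
            self.rows[index]["innerText"] += data


def to_rows(html):
    builder = RowBuilder()
    builder.feed(html)
    builder.close()
    return builder.rows


# Made-up fragment loosely resembling one story's metadata line
SAMPLE = (
    '<span class="subline">'
    '<span class="score">10 points</span>'
    '<a href="item?id=1">42 comments</a>'
    "</span>"
)

rows = to_rows(SAMPLE)
for row in rows:
    print(row["nodeName"], row.get("attributes.href"), row["parentElement.attributes.class"])
```

Each element becomes a row whose columns can then be searched and filtered by later stages.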

Next, we use find with the keyword comments. This narrows our data set to only the rows that contain the string comments somewhere in their values.

We then provide a filter expression that narrows the result set to just the links to comment sections: it keeps only anchor (A) elements whose parent element has the class subline. We now have a list of links to comments to pass into our next expanding stage.
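The effect of these two stages can be sketched in Python on a handful of made-up rows. This only models the semantics described above and is not crul's implementation: the `find` and `is_comment_link` helpers are hypothetical, the row data is invented, and plain substring matching is an assumption about how find compares values.

```python
# Invented rows standing in for the table produced by the open stage
rows = [
    {"nodeName": "SPAN", "attributes.class": "score",
     "parentElement.attributes.class": "subline", "innerText": "10 points"},
    {"nodeName": "A", "attributes.href": "item?id=1",
     "parentElement.attributes.class": "subline", "innerText": "42 comments"},
    {"nodeName": "A", "attributes.href": "newcomments",
     "parentElement.attributes.class": "pagetop", "innerText": "comments"},
    {"nodeName": "TD", "attributes.class": "subtext",
     "parentElement.attributes.class": None, "innerText": "10 points | 42 comments"},
]


def find(rows, keyword):
    """Keep rows where any column value contains the keyword (assumed substring match)."""
    return [row for row in rows if any(keyword in str(value) for value in row.values())]


def is_comment_link(row):
    """Mirror of the filter expression: an A element whose parent has class 'subline'."""
    return (
        row.get("nodeName") == "A"
        and row.get("parentElement.attributes.class") == "subline"
    )


found = find(rows, "comments")                        # drops the "10 points" row
links = [row for row in found if is_comment_link(row)]
hrefs = [row["attributes.href"] for row in links]
print(hrefs)
```

Note that find alone still leaves rows that merely mention comments, such as the navigation link and the enclosing table cell; the filter expression is what isolates the actual comment links.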

...
|| open $attributes.href$
|| filter "(attributes.class == 'comment')"

With our list of links to comments, we will open each link asynchronously (throttled/limited by domain policy and available browser workers) and then filter the results to only include elements on the page that contain a comment.
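As a rough mental model of an expanding stage, the Python sketch below opens one page per input row with a cap on concurrency, merges the per-page rows into one data set, and then filters it. Everything here is an assumption for illustration: `fake_open` returns invented rows instead of touching the network, `MAX_WORKERS` is a hypothetical stand-in for crul's domain policy and browser worker limits, and the example.com URLs are placeholders.

```python
import asyncio

MAX_WORKERS = 2  # hypothetical concurrency cap, not a real crul setting


async def fake_open(url):
    """Stand-in for rendering a page: returns invented rows, no network access."""
    await asyncio.sleep(0.01)  # simulate page load time
    return [
        {"url": url, "attributes.class": "comment", "innerText": f"a comment on {url}"},
        {"url": url, "attributes.class": "reply", "innerText": "reply"},
    ]


async def expand(rows, column, limit=MAX_WORKERS):
    """Open the URL found in `column` for every row, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def open_one(row):
        async with semaphore:
            return await fake_open(row[column])

    # gather preserves input order, so results line up with the input rows
    pages = await asyncio.gather(*(open_one(row) for row in rows))
    # Consolidate the per-page results into a single data set
    return [row for page in pages for row in page]


# Placeholder links playing the role of the filtered rows from the previous stage
links = [{"attributes.href": f"https://example.com/item?id={i}"} for i in range(1, 4)]

all_rows = asyncio.run(expand(links, "attributes.href"))
comments = [row for row in all_rows if row.get("attributes.class") == "comment"]
print(len(all_rows), len(comments))
```

The semaphore plays the part of the throttle: all links are scheduled up front, but only a limited number of pages are open at any moment.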

We now have a data set containing most of the comments from the top posts on Hacker News.

Note: There could be some missing comments due to possible expandable sections, but this is beyond the scope of this example!