Converting an HTML Page to an Article

An HTML document can be converted to an article with the parseArticle command.

Let's take a look at an example.

Example 1

Full Query

open https://www.crul.com/blog/2023-03-07-tales-hn-front-page --html
|| filter "(nodeName == 'HTML')"
|| parseArticle outerHTML

Stage 1: Open a web page

open https://www.crul.com/blog/2023-03-07-tales-hn-front-page --html

Open a web page in a browser and wait for all JavaScript and external assets to load. We use the --html flag to include the HTML source of the rendered web page for each returned element.

NOTE: The --html flag has speed implications, as it includes both the outerHTML and innerHTML for each element.

Stage 2: Filtering for the HTML document

...
|| filter "(nodeName == 'HTML')"

The filter keeps only the rows whose nodeName is HTML, that is, the root html element of the page. The outerHTML of that row contains the whole document.
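
To make the filtering step concrete, here is a minimal Python sketch of the same idea. It is not how crul implements filter. The rows list and the filter_rows helper are hypothetical and only mimic a result set with one row per element.

```python
# Hypothetical rows, loosely modeled on a result set with one row per element.
rows = [
    {"nodeName": "HTML", "outerHTML": "<html><body><p>Hi</p></body></html>"},
    {"nodeName": "BODY", "outerHTML": "<body><p>Hi</p></body>"},
    {"nodeName": "P", "outerHTML": "<p>Hi</p>"},
]


def filter_rows(rows, node_name):
    """Keep only the rows whose nodeName equals node_name."""
    return [row for row in rows if row.get("nodeName") == node_name]


# Only the root element survives; its outerHTML holds the full document.
html_rows = filter_rows(rows, "HTML")
```

Because a page has a single root html element, the filter is expected to leave one row, which is then passed to the next stage.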

Stage 3: Convert the HTML document to an article data set

...
|| parseArticle outerHTML

The parseArticle command parses the HTML in the outerHTML column and converts it into a data set that contains multiple article columns and a hash of the content.
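
As a rough illustration of what such a conversion involves, the Python sketch below turns an HTML string into a single row with a few article-like fields and a content hash. It is an assumption-laden stand-in, not crul's implementation: the column names (title, textContent, hash), the paragraph-only extraction, and the choice of SHA-256 are all hypothetical.

```python
import hashlib
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collects the <title> text and the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._current = None  # "title" or "p" while inside that element

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._current = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "p":
            self.paragraphs[-1] += data


def parse_article(outer_html):
    """Return one row with hypothetical article columns and a content hash."""
    parser = ArticleParser()
    parser.feed(outer_html)
    parser.close()
    text = "\n".join(p.strip() for p in parser.paragraphs if p.strip())
    return {
        "title": parser.title.strip(),
        "textContent": text,
        # SHA-256 is an assumption; the source does not name the hash used.
        "hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }


row = parse_article(
    "<html><head><title>Demo</title></head>"
    "<body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
)
```

A hash of the extracted content is useful for detecting whether an article changed between two runs without comparing the full text.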