Converting an HTML Table to a Dataset

HTML tables let web developers arrange data into rows and columns. Because an HTML table is made up of cells organized into rows and columns, it can be converted to a dataset with the parseHTMLTable command.

Let's take a look at a pair of examples.

Example 1

Full Query

echo "<table>
<tr>
<th>Company</th>
<th>Contact</th>
<th>Country</th>
</tr>
<tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
</tr>
<tr>
<td>Centro comercial Moctezuma</td>
<td>Francisco Chang</td>
<td>Mexico</td>
</tr>
</table>"
|| parseHTMLTable echo

Stage 1: Making the sample HTML table

echo "<table>
<tr>
<th>Company</th>
<th>Contact</th>
<th>Country</th>
</tr>
<tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
</tr>
<tr>
<td>Centro comercial Moctezuma</td>
<td>Francisco Chang</td>
<td>Mexico</td>
</tr>
</table>"

The echo command generates a column named echo whose value is the first argument passed, in this case the HTML source of the table.

Stage 2: Converting the HTML table into a dataset

...
|| parseHTMLTable echo

This stage uses the parseHTMLTable command to construct a dataset from the echo cell value.

The HTML table is parsed and converted into a dataset that contains five columns: Company, Contact, Country, hash, and sequence. Company, Contact, and Country are the HTML table columns. hash is the MD5 hash of a row's values. sequence is the ordinal position of the row, which is 0-based.
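To make the shape of the resulting dataset concrete, here is a minimal Python sketch of the behavior described above. It is not the implementation of parseHTMLTable; the names TableParser and parse_html_table are hypothetical, and the way the row values are combined before hashing (joined with "|") is an assumption, so the hash values it produces should not be expected to match the real command.

```python
import hashlib
from html.parser import HTMLParser

SAMPLE = """<table>
<tr><th>Company</th><th>Contact</th><th>Country</th></tr>
<tr><td>Alfreds Futterkiste</td><td>Maria Anders</td><td>Germany</td></tr>
<tr><td>Centro comercial Moctezuma</td><td>Francisco Chang</td><td>Mexico</td></tr>
</table>"""


class TableParser(HTMLParser):
    """Collects the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_html_table(html):
    """Return one dict per data row, keyed by the header row's cell text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    dataset = []
    for sequence, cells in enumerate(body):  # sequence is 0-based
        record = dict(zip(header, cells))
        # Assumption: the hash input is the row's values joined with "|".
        joined = "|".join(cells).encode("utf-8")
        record["hash"] = hashlib.md5(joined).hexdigest()
        record["sequence"] = sequence
        dataset.append(record)
    return dataset


rows = parse_html_table(SAMPLE)
for row in rows:
    print(row)
```

Each printed record carries the three table columns plus hash and sequence, mirroring the five columns listed above.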

Example 2

Full Query

open https://www.w3schools.com/html/html_tables.asp --html
|| filter "(nodeName == 'TABLE')"
|| head 1
|| parseHTMLTable outerHTML

Stage 1: Open a web page

open https://www.w3schools.com/html/html_tables.asp --html

Open a web page in a browser and wait for all JavaScript and external assets to load. We use the --html flag to include the HTML source of the rendered web page for each returned element.

NOTE: The --html flag has speed implications, as it includes both the outerHTML and the innerHTML for each element.

Stage 2-3: Filtering for the first table

...
|| filter "(nodeName == 'TABLE')"
|| head 1

The filter command keeps only the rows that are TABLE elements. We then pluck out the first table with the head command, whose argument limits the number of rows returned, here to 1.
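The effect of these two stages can be sketched in Python as a predicate filter followed by a row limit. This is only an illustration of the idea: the elements list is made-up sample data, and filter_rows and head are hypothetical helpers, not part of the query language.

```python
from itertools import islice

# Made-up rows standing in for the elements returned by the open stage.
elements = [
    {"nodeName": "DIV", "outerHTML": "<div>intro</div>"},
    {"nodeName": "TABLE", "outerHTML": "<table><tr><td>first</td></tr></table>"},
    {"nodeName": "TABLE", "outerHTML": "<table><tr><td>second</td></tr></table>"},
]


def filter_rows(rows, predicate):
    """Lazily yield only the rows for which the predicate is true."""
    return (row for row in rows if predicate(row))


def head(rows, n):
    """Return at most the first n rows."""
    return list(islice(rows, n))


tables = filter_rows(elements, lambda row: row["nodeName"] == "TABLE")
first_table = head(tables, 1)
print(first_table)
```

Only the first matching TABLE row survives, which is the row the next stage parses.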

Stage 4: Converting the HTML table to a dataset

...
|| parseHTMLTable outerHTML

Using the parseHTMLTable command, we convert the full HTML table source found in the outerHTML column into a dataset.

NOTE: outerHTML contains an HTML element itself together with its inner contents, whereas innerHTML contains its inner contents only.
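The difference between the two can be illustrated with a small Python sketch. It uses the standard library's XML parser as a stand-in for a browser DOM, which is an assumption that only holds for well-formed markup such as the snippet below; outer_html and inner_html are hypothetical helper names.

```python
import xml.etree.ElementTree as ET


def outer_html(element):
    """Serialize the element itself together with everything inside it."""
    return ET.tostring(element, encoding="unicode")


def inner_html(element):
    """Serialize only what sits between the element's opening and closing tags."""
    parts = [element.text or ""]
    for child in element:
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


table = ET.fromstring("<table><tr><td>Germany</td></tr></table>")
print(outer_html(table))  # <table><tr><td>Germany</td></tr></table>
print(inner_html(table))  # <tr><td>Germany</td></tr>
```

Because the enclosing table tags appear only in the outer form, outerHTML is the column passed to parseHTMLTable in this example.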