Skip to main content

Getting product prices from a webpage (Shopify)

This query will extract product prices from a Shopify powered site.

It works by first filtering for elements on the page representing a product listing. This listing elements contains child elements containing the price, product name, product image, and associated product details.

Each listing will be hashed, which will allow for grouping them after further processing.

We then expand the product listing element into all of its child elements (price, name, etc.), filter for the elements we are interested in, then group the elements by their parent (product listing element) hash. Now with some renaming and tabling of our dataset, we get back a dataset of product and price.

Full Query​

open https://www.tentree.ca/collections/mens-shorts --html --hashtml
|| filter "(attributes.class == 'justify-between product-attr-container mt-2 relative')"
|| sequence
|| html innerHTML
|| filter "(_html.nodeName == 'A') or (_html.attributes.class == 'flex' or _html.attributes.class == 'text-discount-price')"
|| excludes _html.innerHTML "line-through"
|| table _html.innerText outerHTMLHash _sequence
|| groupBy outerHTMLHash
|| rename _group.0._html.innerText product
|| rename _group.1._html.innerText price
|| sort _group.0._sequence --order "ascending"
|| addcolumn time $TIMESTAMP.ISO$
|| table product price time

Stage 1: Opening the page​

open https://www.tentree.ca/collections/mens-shorts --html --hashtml

The first stage will open a URL, and set the html and --hashtml flags to include the raw html of each page element in the results, as well as hashes of the html for future grouping.

Stage 2: Filtering for products​

|| filter "(attributes.class == 'justify-between product-attr-container mt-2 relative')"

The second stage will filter the page data to only include elements matching the filter expression. This filter expression is only including elements on the page that represent the containers of product listings (includes the product image, name, price, etc. that will need to be expanded out - see stage 4).

Stage 3: Adding as sequence number​

|| sequence

The third stage will add a _sequence column to each row containing the row number using the sequence command.

Stage 4: Extracting html from products​

|| html innerHTML

The innerHTML column of this data set contains raw HTML. In the fourth stage, we will use the html command to expand this html element into all its child elements.

Stage 5: Filtering for products and prices​

|| filter "(_html.nodeName == 'A') or (_html.attributes.class == 'text-discount-price')"

The fifth stage will filter the page data to only include elements matching the filter expression. This filter expression will include product links and prices.

Stage 6: Excluding non sale prices​

|| excludes _html.innerHTML "line-through"

The sixth stage will use the excludes command to exclude any row containing the value line-through in the _html.innerHTML column. The data set contains both the regular price and the sale price, so this stage will remove the regular price entry if the item is on sale.

Stage 7: Removing columns​

|| table _html.innerText outerHTMLHash _sequence

The seventh stage will use the table command to include only relevant columns.

Stage 8: Grouping by HTML hash​

|| groupBy outerHTMLHash

The eighth stage will group page elements by the parent hash calculated per element in the first stage of this query (through the --hashtml flag). This stage uses the groupBy command.

Stage 9: Renaming column​

|| rename _group.0._html.innerText product

The ninth stage will rename the _group.0._html.innerText column to product.

Stage 10: Renaming column​

|| rename _group.1._html.innerText price

The tenth stage will rename the _group.1._html.innerText column to price.

Stage 11: Sorting results by page order​

|| sort _group.0._sequence --order "ascending"

The eleventh stage will sort the results according to the previously added _sequence number to preserve the order of elements as they initially appeared on the page.

Stage 12: Adding a timestamp​

|| addcolumn time $TIMESTAMP.ISO$

The twelth stage will add a timestamp column with the current ISO timestamp to each row.

Stage 13: Removing columns​

|| table product price time

The thirteenth stage will use the table command to include only the product, price, and time columns in the final set of results.