Don’t always scrape HTML: use a website’s XHR requests instead

ImportJSON is so powerful that it lets you extract data from dynamic web pages!

Goodbye HTML, Long Live the Twitter API

The web has been changing for a while. The need for better performance and mobile access has spread the use of APIs to load webpage content.

What does it mean?
Whereas content used to be loaded in one big chunk when the page loaded, modern websites do it in two steps:

  • Load the minimum data on page load so the user sees something quickly
  • Make a call to an API that requests the required data. This takes some time, and the website’s developers are usually nice enough to show you an attractive spinner while you wait

There is a good chance that the data we’re looking for comes from these API requests.
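To see why those API responses are so convenient, here is a minimal sketch in Python. The endpoint and payload are made up for illustration, but they mimic what a typical XHR response looks like: clean JSON that you can read directly, with no HTML parsing at all.

```python
import json

# A hypothetical XHR response body, as a site's internal API might return it.
# Real endpoints and payload shapes vary from site to site.
response_body = '{"user": {"name": "Ada", "followers": 19000}}'

data = json.loads(response_body)
print(data["user"]["followers"])  # → 19000
```

Compare that one-line lookup with hunting for the same number inside a tangle of nested `<div>` tags.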

This paradigm shift has some pros and cons.

Pros

  • No HTML knowledge is required anymore
  • The data returned is usually well structured and easy to understand. It is therefore often a better strategy to get content from the API than to scrape the HTML content of a page.
  • Once we find the right call, it’s usually easy to modify the call’s parameters to output similar data (a different profile id, location, …)
  • In many cases it’s possible to access private data
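That third point is worth a quick sketch. Once you have one working API call, swapping a single query parameter is often all it takes to fetch similar data. The endpoint and parameter names below are hypothetical; the technique is what matters.

```python
from urllib.parse import urlencode, urlparse, parse_qs, urlunparse

# Hypothetical endpoint found in the Network tab:
base = "https://example.com/api/v1/profiles?profile_id=42&locale=en"

def with_param(url, key, value):
    """Return a copy of `url` with one query parameter replaced."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    query[key] = [str(value)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))

# The same endpoint, now pointing at another profile:
print(with_param(base, "profile_id", 99))
```

Run the modified URL through the same request and you get the same data structure for a different profile, no extra reverse-engineering required.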

Cons

  • There are thousands of ways for a website to load data from an API, and finding the right call to the right API endpoint can be a real hunt
  • Developers can protect their API in certain cases, which can dramatically complicate access to it outside the scope of the website

Extracting data from a website’s APIs, step by step

Here is a short step-by-step guide to extracting data from a website’s APIs:

  • Load the page and check visually if the data you’re looking for is displayed at first sight or if you need to wait a moment. If a spinner shows up, it’s a good sign!
  • Open Chrome Developer Tools with Fn+F12 (Mac) or F12 (PC) and go to the Network tab
  • Reload the page so the tool can record the requests made under the hood
  • In the filter options, hit “XHR”. If the website has loaded some XHR requests, your data is likely in there!
  • Open the Network search with Cmd+F (Mac) or Ctrl+F (PC). Make sure to click somewhere in the Network tab first, or Chrome will open its usual search tool
  • Type a few words or the value of the content you want to extract. If you’re looking for a number, type it as a plain number even if the webpage formats it differently. For example, for $19,000.00, type 19000.
    See what comes up in the search results. Discard results that don’t output JSON.
    You might need to analyze each search result to check that the resulting JSON contains the information you’re looking for
  • Under the “Name” header, right-click on the file highlighted by the search and choose Copy > Copy as cURL
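If you want to replay that copied request outside the spreadsheet, a small Python sketch can pull the URL and headers out of a “Copy as cURL” command. This is a deliberately minimal parser, assuming a simple command with only `-H` flags; real copied commands often carry extra flags (`--data-raw`, `--compressed`, …) that it ignores.

```python
import shlex

def parse_curl(cmd):
    """Extract the URL and any -H headers from a simple 'Copy as cURL'
    command. A sketch only: other curl flags are not handled."""
    tokens = shlex.split(cmd)
    url, headers = None, {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "curl":
            pass  # skip the command name itself
        elif tok == "-H":
            i += 1  # the next token is the "Name: value" header string
            name, _, value = tokens[i].partition(": ")
            headers[name] = value
        elif not tok.startswith("-"):
            url = tok  # the only bare token left is the request URL
        i += 1
    return url, headers

url, headers = parse_curl(
    "curl 'https://example.com/api/items' -H 'Accept: application/json'"
)
print(url)      # → https://example.com/api/items
print(headers)  # → {'Accept': 'application/json'}
```

You could then hand `url` and `headers` to any HTTP client to reproduce the browser’s call.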

In your spreadsheet:

  • Make sure you have installed ImportJSON and activated it (Add-ons > ImportJSON > Activate)
  • Paste your freshly copied cURL request in A1
  • In A2, type =IMPORTJSON(A1)

There you go! You should see the full set of data in one column.

That’s just the start! Now it is time to manipulate your JSON and make good use of its data.
Learn how you can filter your freshly imported JSON
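To get a feel for the kind of manipulation involved, here is a rough Python sketch of how nested JSON can be flattened into path/value pairs, loosely in the spirit of how a tool like ImportJSON maps JSON paths to spreadsheet columns. The exact path format is an assumption for illustration, not ImportJSON’s actual implementation.

```python
def flatten(obj, prefix=""):
    """Flatten nested JSON into {'/path/to/key': value} pairs."""
    rows = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            rows.update(flatten(value, f"{prefix}/{key}"))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            rows.update(flatten(value, f"{prefix}/{i}"))
    else:
        rows[prefix] = obj  # a leaf value: record it under its full path
    return rows

print(flatten({"user": {"name": "Ada", "stats": {"followers": 19000}}}))
# → {'/user/name': 'Ada', '/user/stats/followers': 19000}
```

Once the data is in this flat shape, filtering comes down to keeping only the paths you care about.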
