Scraping, i.e. the extraction of data from websites, is one of the most widely used techniques on the web. The uses are innumerable: competitor analysis, SEO positioning, content generation, influencer tracking, and much more.
Once the data is extracted, you have to process it, and for that, nothing beats our good old spreadsheet. In short, Scraper + Google Sheets is a marketer’s paradise!
One of the most coveted targets for scraping is of course our dear Google. Scraping Google allows you to:
- Know the position of your site for several keywords
- Understand your environment by analyzing the sites that rank best and follow the positions of competitors
- Analyze the most effective content
- Understand the keywords and sentence formulations that work best
So here is the ultimate guide to generating a table of more than 500 Google results in Google Sheets. You are then free to use the content according to your needs.
The basics of scraping involve two things: a URL and a path to the element of the page to be extracted.
Luckily for us (and sometimes unluckily), Google often does things well! The URL of a search is easily modifiable to adapt it to our needs.
No need to analyze all the optional parameters here; we will focus on 3 of them:
- q=keyword+key, indicating the query (words separated by +)
- hl=fr, indicating the language of the results
- num=100, the number of results to return. The default value is 10
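To see how these three parameters assemble into a search URL, here is a minimal Python sketch (the helper name `google_search_url` is my own invention, not part of any tool mentioned here):

```python
from urllib.parse import urlencode

def google_search_url(query: str, hl: str = "en", num: int = 100) -> str:
    """Build a Google search URL from the q, hl and num parameters."""
    # urlencode replaces spaces with "+", matching the q=keyword+key format
    params = {"q": query, "hl": hl, "num": num}
    return "https://www.google.com/search?" + urlencode(params)

print(google_search_url("bamboo toothbrush"))
# https://www.google.com/search?q=bamboo+toothbrush&hl=en&num=100
```

The same string could of course be built by hand; the point is simply that the three parameters are ordinary query-string fields.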
Come on, let’s try them on a keyword that will do our planet some good…
The second step is the description of our element! No need for a master’s degree in engineering here, but you do need some basic knowledge of HTML.
XPath is a language for describing elements of an XML document. And the good news is that, HTML being structurally very close to XML, XPath also applies to our beloved web pages.
A simple example: //h1 selects all the h1 headings of the page, the double slash indicating that the element may appear anywhere in the document, not necessarily as a direct child of the root.
In our case, the XPath will be a little more complex:
//div[@class="g"] indicates the containers of all results, that is, all div elements of class g.
With XPath Helper:
And by appending //h3, we grab the title of each container.
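To make the two XPaths concrete, here is a small Python sketch using the standard library’s limited XPath support. The markup is an invented miniature of a results page, not Google’s real HTML:

```python
import xml.etree.ElementTree as ET

# Toy, well-formed snippet mimicking the result containers described above
html = """
<html><body>
  <div class="g"><a href="https://example.com/a"><h3>First result</h3></a></div>
  <div class="g"><a href="https://example.com/b"><h3>Second result</h3></a></div>
</body></html>
"""

root = ET.fromstring(html)
containers = root.findall('.//div[@class="g"]')      # //div[@class="g"]
titles = [c.find('.//h3').text for c in containers]  # then //h3 inside each
print(titles)  # ['First result', 'Second result']
```

Each container is selected first, then the title is resolved inside it, which is exactly the two-step logic of the selectors above.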
Are you still following me? Now, let’s compile this in our Google Sheets function
Let’s first install the IMPORTFROMWEB addon from the G Suite Marketplace
Open a spreadsheet …
Psst… another secret by the way: in the browser address bar, just type sheets.new
In the menu, go to Add-ons > IMPORTFROMWEB > Activate Add-on.
Wait a moment …
Note: Google and many sites do not always make life easy for scraping tools.
The engine behind IMPORTFROMWEB will usually find a way to load the page in less than 20 seconds, but in some cases more time is needed.
If you see #PENDING_REQUESTS, open the add-on sidebar through the menu: Add-ons > IMPORTFROMWEB > Open sidebar, then in the Monitor tab click on “Refresh pending requests”
You should now see something like this:
IMPORTFROMWEB has the particularity of accepting ranges of cells in place of a single selector, thus allowing it to return more information about our search results.
Using the previous formula:
|2| //div[@class="g"]//h3 | //div[@class="g"]//h3/../@href | //div[@class="g"]//div[@class="slpf"] |
IMPORTFROMWEB accepts a 3rd parameter: a range of cells containing options, which makes the function more powerful and flexible.
The list of options is available here.
In our example we will need baseSelector.
To complete our table, let’s assume that we want to return the rating given to some results.
We could add the XPath to our range of selectors, like this:
Ouch! Since not all the results have a rating, the cells are no longer synchronized. For instance, the rating in D4 does not correspond to the result on line 4.
The option baseSelector is our remedy!
|4| =IMPORTFROMWEB(A2, A3:D3, A3:B3) |
By modifying each XPath and passing //div[@class="g"] as the baseSelector, we tell the function to search inside the elements matched by the baseSelector.
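Conceptually, this is what keeps the rows aligned: iterate over the base containers and resolve each selector relative to the container, emitting an empty cell when an element is missing. Here is a rough Python sketch of that behavior (the `span` of class `rating` is invented markup for illustration; this is not the add-on’s internal code):

```python
import xml.etree.ElementTree as ET

# Toy results: only some carry a rating element
html = """
<html><body>
  <div class="g"><h3>A</h3><span class="rating">4.5</span></div>
  <div class="g"><h3>B</h3></div>
  <div class="g"><h3>C</h3><span class="rating">3.9</span></div>
</body></html>
"""
root = ET.fromstring(html)

rows = []
for container in root.findall('.//div[@class="g"]'):  # the baseSelector
    title = container.find('.//h3').text
    rating_el = container.find('.//span[@class="rating"]')
    # Missing rating -> empty cell, so the row stays aligned with its result
    rating = rating_el.text if rating_el is not None else ""
    rows.append((title, rating))

print(rows)  # [('A', '4.5'), ('B', ''), ('C', '3.9')]
```

Without the base, the titles and ratings would be two independent lists of different lengths, which is exactly the desynchronization we just saw.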
Let’s see it in action…
As seen before,
//div[@class="g"] actually describes the containers for each result.
Hence, after executing the function, we have all the elements in place again:
Like the selectors, the URL parameter also accepts a range of cells.
Here’s how to get, in a few clicks, the first 100 results for multiple keywords!
Let’s slightly modify our spreadsheet to generate a list of Google Search URLs:
|1| bamboo toothbrush | =ARRAYFORMULA("https://www.google.com/search?q="&SUBSTITUTE(A1:A5," ","+")&"&hl=en&num=100") |
|2| bamboo toothbrush subscription |
|3| cheap bamboo toothbrush |
|4| bamboo toothbrush review |
|5| buy bamboo toothbrush |
Which should result as follows:
Let’s now reference the range of URLs in our function
=IMPORTFROMWEB(B1:B5, A9:D9, A10:B10)
All the results for each XPath are now concentrated in a single cell! This is because IMPORTFROMWEB is now processing a 3-dimensional table.
We will have to add another option: stackPages.
By switching stackPages to TRUE, we tell the function that the Google pages form one single sequence of results and must therefore be stacked in the same table.
As our sheet becomes slightly more complex, I have reorganized it for greater clarity. Here is the result:
However, we no longer know which keyword each line refers to.
Another option will help us here: showUrls.
With showUrls set to TRUE, a new column appears showing the URL of the source page for each line.
Using a good old lookup method (VLOOKUP, FILTER, or INDEX & MATCH), we can return the keyword for each result.
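The lookup is a plain join on the URL column. A Python sketch of the same idea, with invented sample rows (the URLs follow the pattern built by our ARRAYFORMULA):

```python
# Map each search URL back to its keyword (a VLOOKUP-style join)
keywords = ["bamboo toothbrush", "cheap bamboo toothbrush"]
url_to_keyword = {
    "https://www.google.com/search?q=" + k.replace(" ", "+") + "&hl=en&num=100": k
    for k in keywords
}

# Rows as returned with showUrls=TRUE: (source URL, result title)
rows = [
    ("https://www.google.com/search?q=bamboo+toothbrush&hl=en&num=100", "Brush Co"),
    ("https://www.google.com/search?q=cheap+bamboo+toothbrush&hl=en&num=100", "Cheap Brush"),
]
labelled = [(url_to_keyword[url], title) for url, title in rows]
print(labelled)
# [('bamboo toothbrush', 'Brush Co'), ('cheap bamboo toothbrush', 'Cheap Brush')]
```

In the spreadsheet, VLOOKUP plays the role of `url_to_keyword`: the URL column is the key, the keyword column is the returned value.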
Here is the final result:
Get the working spreadsheet here: