WordPress is one of the most widespread content management systems in the world. It is hard to pin down an exact number, but its market share usually sits between 40% and 60% depending on the study you look at. While WordPress can power many different kinds of projects, it is especially popular for blogging.
With our custom function, ImportFromWeb, you can easily scrape and extract data from any WordPress site. Whether you want to know when a competitor publishes a new article, or keep an eye on your own posts because many people contribute to your website, the formula makes scraping WordPress websites straightforward.
In this tutorial, we’ll explain how you can quickly scrape useful data without having to write complex code. It will be as easy as using a regular Excel function!
How does it work?
It uses the ImportFromWeb add-on that we built.
ImportFromWeb provides a simple function to extract data from any website. You just need the URLs of the pages you want to load and the location of the elements you want to extract from those pages. How cool is that?
The only pieces of information you need are:
- A URL or a list of URLs you want to scrape data from
- The XPath or CSS selector of the elements you want to extract. If you are not familiar with this concept, don’t worry: we’ll provide the selectors you need throughout this guide (there’s a quick example right after this list).
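To give you a feel for it, here is a minimal example. The URL is just a placeholder and //h1 is the selector for a page’s main heading:

=IMPORTFROMWEB("https://example.com/sample-post/", "//h1")

That single formula loads the page and returns whatever the selector matches.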
What are we going to scrape?
There are plenty of websites we could scrape for this tutorial. For the sake of the example, we are going to use Backlinko, a blog well known in the SEO industry, as it runs on WordPress.
Let’s imagine that we work for a competitor and we know that the owner often refreshes his content, adding to it or improving it. What we want is a Google Sheet with some of his URLs, from which we scrape each post’s title and update date.
Indeed, we want to know when Backlinko updates its content so we can do at least as well, if not better, and keep our rankings.
Get the URLs we are going to scrape
The first step in scraping data from a list of posts on this website is simply to get their URLs. There are several ways to achieve that:
- Hand-pick a selection of URLs we want to track for some reason
- Find a way to get a comprehensive list
In our case, we are going to use our formula to get a comprehensive list. We know that Backlinko publishes a sitemap for its posts, so we can scrape it to pull the post URLs straight into our Google Sheet.
The XPath //url/loc should work with any sitemap and is actually pretty handy if you work in the SEO field.
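In practice, the call could look like the one below. The sitemap address is an assumption on our side (WordPress SEO plugins typically publish a post sitemap at an address like this), so check the site’s /sitemap.xml or robots.txt to find the real one:

=IMPORTFROMWEB("https://backlinko.com/post-sitemap.xml", "//url/loc")

The formula returns one post URL per row, which gives us the list we’ll work with in the next steps.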
List the XPaths you need
As explained at the beginning, to scrape data from a WordPress blog with ImportFromWeb, you need the XPath of each element you want to extract.
For our example, we only need two:
- Title: //h1
- Update date: //time/@datetime[1]
If you need to extract something else, you can follow our XPath tutorial, or get in touch with us and we’d be happy to help.
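Before scraping the whole list, it’s worth testing the selectors on a single post. The URL below is just a placeholder for any post URL returned by the sitemap step:

=IMPORTFROMWEB("https://backlinko.com/example-post/", "//h1")
=IMPORTFROMWEB("https://backlinko.com/example-post/", "//time/@datetime[1]")

If both calls return a value, the selectors are ready to be used at scale.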
Add the XPaths to the Google Sheet
These XPaths will be used by our formula, so you need to add them to the Google Sheet you use to extract your data.
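The exact layout is up to you; in the rest of this tutorial we assume the selectors sit in the first row and the URLs in the first column, for example:

B1: //h1
C1: //time/@datetime[1]
A2, A3, A4, …: the post URLs pulled from the sitemap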
Install and activate ImportFromWeb
To use our custom formula, you need to install it from the Google Workspace Marketplace and then activate it under Add-ons > IMPORTFROMWEB > Activate add-on.
This step is mandatory to be able to use the formula.
Run the function
We can now run our function, which needs only two parameters to work correctly:
=IMPORTFROMWEB(url_to_crawl, Xpath_of_elements_to_scrape)
How simple and cool is that?
To compare the scraped information easily, just add headers to your table, then run the function. Here we wrap ImportFromWeb in TRANSPOSE() because, by default, results are stacked vertically rather than horizontally.
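With the layout we assumed above (URLs in column A, selectors in B1:C1), the formula for the first URL could look like this. Passing the selectors as a cell range is an assumption on our side, so adapt it to however you stored them:

=TRANSPOSE(IMPORTFROMWEB(A2, $B$1:$C$1))

Drag the formula down for the remaining URLs and each row fills up with the title and the update date of the corresponding post.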
Even though most of the data comes back correctly, the update date is missing for one post and we get a “#SELECTOR_RETURNS_NULL” error. This is not a problem: it simply means that this particular post has never been updated, and our XPath targets the update date, which in our example is different from the published date.
Depending on the website you scrape, the two dates may share the same XPath, but on Backlinko they do not.
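If you would rather flag those posts than leave the error in the sheet, a small helper column can do it. The cell reference C2 is hypothetical, and we assume the error shows up either as a spreadsheet error or as the literal text “#SELECTOR_RETURNS_NULL” in the result cell:

=IF(IFERROR(TO_TEXT(C2), "#SELECTOR_RETURNS_NULL")="#SELECTOR_RETURNS_NULL", "never updated", C2)

This keeps the scraped date when there is one and prints “never updated” otherwise.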
Using this Google Sheet, you can easily track, and be alerted, every time Backlinko updates one of its articles. That’s awesome if you need to benchmark competitors proactively without having to open their websites.
Did you find this tool useful?
Has it helped you create your own content benchmarks? We’d love to hear your feedback so that we can keep providing the best information on web scraping.