Scrape post data from WordPress
WordPress is one of the most widespread CMS in the world. It is hard to get a real number, but its market share is usually between 40% and 60% based on the study you are looking at. Even if WordPress can be used for a lot of different projects, it is especially popular in the blogging sector.
With our custom-built function, ImportFromWeb, you can easily scrape and extract data from any WordPress site. Whether you need to track when a competitor is publishing a new article or track your own articles because a lot of persons are contributing to your website, the formula can be very useful to scrape data from WordPress websites.
In this tutorial, we’ll explain how you can scrape useful data quickly without having to worry to write complex code. It will be as easy as using a regular Excel function!
Our function, build on top of Google Sheets extending its functionalities, can extract any element available on a webpage. You can retrieve directly in Google Sheets any information you need and cross it with other data sources. How cool is that?
The only pieces of information you need are:
- An URL or a list of URLs you want to scrape data from
- The Xpath of the elements you want to extract. If you are not familiar with this concept, don’t be afraid because we’ll provide them to you during this tutorial.
We can choose a lot of websites to scrape data from for this tutorial. For the sake of the example, we are going to use Backlinko, a blog well-known in the SEO industry, as it uses WordPress.
Let’s imagine that we work for the competition and that we know that the owner often updates its content – adding more or better content. One thing we might want is a Google Sheets with some of its URLs, scraping the title, and the update date.
Indeed, we want to know when Backlinko is updating its content to do at least better than him if we want to keep our rankings.
The first step to scrape data from a list of contents from this website is simply to get their URLs. There are several ways to achieve that:
- Hand-pick a selection of URLs we want to track for some reason
- Find a way to get a comprehensive list
In our case, we are going to our formula to get a comprehensive list. Indeed, we know that Backlinko is using a sitemap for its posts, hence we can scrape its content to retrieve just the URLs in our Google Sheets.
The Xpath //url/loc should work with any sitemap and is actually pretty handy if you are working in the SEO field.
As explained at the beginning, if you want to scrape data from a WordPress blog using ImportFromWeb, you need to find the Xpath of the elements you want.
For our example, we just need 2:
- Title: //h1
- Update date: //time/@datetime
These Xpath will be used by our formula, hence you need to add it to the Google Sheets you use to extract your data.
To use our custom formula, you need to install it from the Google Marketplace and then activate it under Add-ons > IMPORTFROMWEB > Activate add-on.
This step is mandatory to be able to use the formula.
5) Run the function
We can now run our function, which needs only two parameters to work correctly:
How simple and cool is that?
To compare easily the scraped information, just add headers to your table:
And then use the function. Here we use a TRANSPOSE() function with our ImportFromWeb because, by default, results are stacked vertically and not horizontally.
Even if the data seem to be retrieved correctly, it seems that the update date is not included correctly and we face a “#SELECTOR_RETURNS_NULL” error. This is not an issue, but simply because this specific content has never been updated and our Xpath is designed to retrieve an update date, which is different from a published date in our example.
Based on the website you want to scrape, the Xpath may be the same, but in our example, it is not.
Using this Google Sheets, you can easily track and be alerted every time Backlinko updates one of its articles. Awesome if you need to do competitor benchmarking proactively without having to open a website.