Documentation

Get started

ImportFromWeb is the easiest way to start scraping data from any website. Find out how you can start filling your new blank sheet

Check that the add-on is enabled in your current spreadsheet

Type =IMPORTFROMWEB(). If #NAME! is displayed, have a look at our FAQ:
When I write =IMPORTFROMWEB() in a cell, it shows the error #NAME?

Create your first function

Importing data is as easy as typing

=IMPORTFROMWEB([the url you want to search], [an XPath or CSS selector to describe the content you want to import])

Try

=IMPORTFROMWEB("https://www.imdb.com/chart/boxoffice?ref_=nv_ch_cht", "td.ratingColumn")
None GIF

Open the ImportFromWeb sidebar

In the Google Sheets menu, go to
Add-ons ▶︎ ImportFromWeb | Easy Web Scraping ▶︎ ? Open sidebar.

It will load a sidebar on the right side giving you a lot of support to create your function and discover all the capabilities.

Syntax

Simple syntax

IMPORTFROMWEB(urls, xpaths/CSS)

With options

The function allows a series of options like rendering Javascript content, choosing a location, setting up the duration of the cache, etc.

IMPORTFROMWEB(urls, xpaths/CSS, options)

With regexps

For advanced users, ImportFromWeb lets you manipulate the content with regular expressions. it’s especially useful to clean up text, replace commas in numbers or remove unwanted characters

IMPORTFROMWEB(urls, xpaths/CSS, regexps, replaceBy, options)

See more below

Parameters

inputs (value or range)

(required) Specify the source that you want to get content from.

A source can be either

  • a single URL or a reference to one cell or a range of cells that contain URLs.
  • An HTML code
  • The ID of an HTML file that is hosted in a Google Drive. The file has to be public

There is a limit of 50 inputs per function. If you’re looking to extract more than 50 pages, consider splitting the inputs into various IMPORTFROMWEB() functions.

xpaths or CSS selectors (value or range)

(required) One or a list of XPaths or CSS selectors describing all the elements to retrieve in the pages.

See how to work with URLs and xpaths

CSS selectors are also supported. They are less powerful than XPaths to find complex elements in complex HTML pages, but they usually do the job and are also easier to build.

Standard CSS Selectors don’t allow to load values contained in attributes. With ImportFromWeb it is possible by indicating /attributeName just after the element.
For example div a img/src should return the source of the image.

regexps (value or range) – optional

(optional) One or a list of regular expressions to extract or replace specific parts from the content returned by each selector.

  • If the next argument replaceBy is not filled or NULL, it will extract some text from the content.
  • If replaceBy has a list of values, they will be used to replace the extracted content.

Each regexp from the list is associated to one selector from the list of selectors in such way that the first regexp will extract content from the results of the first selector, etc.

It works the same way as the JavaScript String match() function

Use regexps in order to clean up your data.

Check this article for the most useful regex rules to convert numbers, dates or currencies: Useful regex rules to clean up your imported data.

Note that, because they’re optional and mostly reserved to advanced users, regexps don’t appear in the tooltip. regexps should always be the 3rd argument.

replaceBy (value or range) – optional

(optional) One or a list of texts to replace the content extracted by the regexps.

Each string of the list is associated to one regex from the list of regexs in such way that the first string will replace the content extracted by the first regex, etc.

Note that, because it’s optional and mostly reserved to advanced users, replaceBy don’t appear in the tooltip. replaceBy should always be the 4th argument.

Options (table or text) – optional

The choice of options is what makes the function so powerful and flexible.

Options can be added in two ways:

  • A text of keys and values

Example:

=IMPORTFROMWEB("https://www.imdb.com/chart/boxoffice?ref_=nv_ch_cht", "td.ratingColumn", "countryCode:us, jsRendering, contractStack")

Bear in mind that when the value is omitted, the function will interpret it as TRUE. Hence, the example above is equal to

"countryCode:us, jsRendering:true, contractStack:true"

(Spaces after the comma are not compulsory)

  • A reference to a two-column range.

The left column defines option names and the second column contains corresponding values.

Only list the options for which you want to override the default values.

Some options might be limited or unavailable according to your subscription. Please check out our plans for more details

jsRendering (TRUE | FALSE)

  • TRUE: forces the engine to render the page with Javascript
  • FALSE: Never runs JavaScript.
  • Default is FALSE.

For speed and resource consumption purposes, the engine does not run JavaScript on a page by default. We recommend enabling it only if needed on a web page.

Note that rendering JavaScript content will generally take more time. If 30 seconds have passed, the function displays #PENDING_REQUEST while our server fetches the page.

countryCode (us, ca, de, fr, es, br, mx, in, jp, cn, au)

To retrieve localized content, use the code from the following list of countries to load the page from a specific location:

  • us: United States
  • ca: Canada
  • de: Germany
  • fr: France
  • es: Spain
  • br: Brazil
  • mx: Mexico
  • in: India
  • jp: Japan
  • cn: China
  • au: Australia

cacheLifespan (hours)

Google Sheets controls how the functions recalculate according to different behaviours (modifying cells, opening the spreadsheet, …)
To avoid that IMPORTFROMWEB fetches the target page constantly, the content is cached 24 hours by default. Therefore, whenever IMPORTFROMWEB recalculates, the content is retrieved from the cache.

You can customize the cache lifespan by setting the cacheLifespan option to the number of hours you will want the cache to last.
Bear in mind that the function will not necessarily update directly after the cache expires but only when Google Sheets lets the function recalculate.
Use automatic triggers if you need the function to update periodically.

A new request will be counted whenever the function has to fetch the content from the target page.

cacheLifespan is available from the Premium plan. The maximum value for cacheLifespan depends on your plan: 15 days (360 hours) for Premium, 30 days (720 hours) for High Volume.

join (TRUE | FALSE | Character)

When a selector returns various entries, the results are typically split so they expand into the surrounding cells.

  • TRUE: All values returned by a selector are compressed in one cell
  • A character: Results are concatenated in one cell and separated by this character.
  • Default is FALSE.

Note: You cannot use the comma-separated notation for join if you want to use the “,” character to join the results.

contractStack (TRUE | FALSE)

contractStack forces the function to display one page per row instead of expanding naturally to the neighbouring cells.

Default is FALSE.

showUrls (TRUE | FALSE)

Whether to show the URL of the page for each result displayed.

Default is FALSE

baseSelector (XPath or CSS selector)

The base selector will be the root from which the other selectors will be evaluated.

It is useful for search results as it ensures that each result will be aligned on its base.

Here in a example:

image.png

In that case, we’ve selected the baseSelector so the function understands that it should retrieve elements relatively to the containers of each search result, hence forcing elements of a same search result to align on the same row (check the ratings on column C)

shake (TRUE | FALSE)

Functions in Google Sheets recalculate only on some conditions. One condition is that it detects a change in the function definition. That is what shake is for!

The shake option is just a way to tell Google Sheets that the function definition has changed and that it has to recalculate. It doesn’t matter if the value is TRUE or FALSE. What matters instead is that it goes from one state to another.

Note that shake does not necessarily fetch the latest data from the target page. It tries beforehand to retrieve the cached data. If the cache is too old (more than 24 hours) or the function displayed an error, then it will fetch the target page.

Use it to:

  • “shakes” the function in case the function returned an error (starting with #)
  • retrieve pending requests (as indicated by the #PENDING_REQUEST message)

A convenient use is to couple it with a tick box (Menu > Insert > Tick box)

outputErrors (TRUE | FALSE)

Set to TRUE, it will output #ERROR! when an error occurs.

By default the function returns custom messages when an error occurs (such as #SELECTOR_NOT_VALID). Those messages are not recognized by Google Sheets as proper errors. Therefore you cannot use the ISERROR or IFERROR functions.

Turn outputErrors to TRUE in order to tell the function to return a Google Sheets standard error that can be handled with the ISERROR or IFERROR functions.

The downside of it is that the #ERROR! message displayed by Google Sheets doesn’t tell much about the type of error. Pass the mouse over the cell to see more information about the error.

Default is FALSE

isXML (TRUE | FALSE)

Use this option if the source code is plain XML instead of HTML.

Default is FALSE

hideTableHeaders (TRUE | FALSE)

In case the element returned is a table and the table has headers, you can define whether or not you want to display the headers.

You can also use hideTableRowsHeaders and hideTableColumnsHeaders

Data arrangement

ImportFromWeb is smart enough to detect the best arrangement for the returned data.

For example, if the function is filled with a reference of URLs arranged in a row, it will dispose the values of each page in columns.
Let’s say 3 URLs are in A1:C1, then by calling the function =IMPORTFROMWEB(A1:C1, ... ), the output of page 1 will be in column A, page 2 in column B and page 3 in column C.

On a similar way, if the URLs are in one column and the selectors are in one row, it will create a table with results for each page in rows.
In case you invert the arrangement of the input data, the resulting table will be transposed.

The function will always try to expand in rows and columns as much as it can. You can try to constrain the function by disabling spitResults. As a consequence, if a selector returns several values, it will constrain them into one cell.

The contractStack option can be used to force the function to detect that the target pages share the same type of content, and therefore should be displayed as a long list. it is particularly useful in case you want to display results for multiple pages of search results.

The sidebar

The IMPORTFROMWEB_getMeta() function

IMPORTFROMWEB_getMeta() is an additional function to IMPORTFROMWEB() available within the same add-on.
It takes as reference a cell that contains an IMPORTFROMWEB() function and returns additional information about the results.

Simple usage

IMPORTFROMWEB_getMeta() allows to monitor the cells containing an IMPORTFROMWEB() function.
It gives similar information to what can be seen in the Monitor tab from the sidebar.

=IMPORTFROMWEB_getMeta(A1)

A1 being a cell that contains IMPORTFROMWEB()

The output is a range of two columns with the metadatas’ keys in the 1st column and their respective values in the 2nd column, like

values[[[“Product Manager”]]]
updateDate2019-05-02
updateTime22:18:12
executionDurationMs9009
sourceCodeUrlsa link to the page as seen by the function

Syntax

reference (range)

A reference to a cell containing the IMPORTFROMWEB() function

metaName (optional)

If filled, returns only the value corresponding to the metadata key

split (boolean)

If FALSE and metaName is not filled, returns all the results compressed in one cell
Default to TRUE

Output

values

A JSON object representing the results of the function

updateDate

The date of the last update

updateTime

The time of the last update

updateDate and updateTime are particularly useful to control if the results returned are fresh

executionDurationMs

How long it took for the function to execute. The duration can vary according to the number of urls in the first parameter and whether the engine had to run javascript in order to retrieve the elements

SourceCodeUrls

Returns the url of an HTML version of the page as seen by the engine.

This information is useful for debugging.

Set up automatic updates

ImportFromWeb functions can be executed automatically even though the spreadsheet is closed. Setting up an automatic trigger can be done through the sidebar.

Only available in Premium and High Volume plans.

Debugging the results

When it comes to website data extraction, a lot of things can happen during the process. By default, the function displays verbose errors in order to make it easier to debug.
If the function is wrapped into other Google Sheets functions, the errors might not be displayed correctly. A first reflex should be to test the function alone and see what it outputs.

When you see an error (marked by a #), go to the sidebar > Monitor tab where you will find details about the error and, if the page has been loaded, a link to the page as seen by the function.

Frequently asked questions

Go to the FAQ section