Lecture: Intro to Scraping using Google Sheets

Scraping is using a tool to capture information from online sources (web page, database, word doc, pdf, etc.) and put it into a form that you can use.

There are many tools and techniques for scraping. Which one you use depends on where the information is and what format it is in.

For example, the book Scraping for Journalists covers how to scrape:
-an HTML document
-multiple web pages
-poorly formatted web page
-a series of websites that have same CMS
-databases
-PDF
-CSV file
-Excel spreadsheets
-ASPX pages
-etc

There are lots of scraping tools. For example:

OutWit Hub
Chrome extension Scraper
Scraperwiki
• IFTTT – Scrape Tweets into a spreadsheet
Tabula – Scrape pdfs

Each of these are unique, but if you understand some basic concepts of scraping you can figure out which to use and how to do it.

One of the key concepts of scraping is structure.

• Look for structure, which a scraper can recognize.
• Set up a scraper to do repetitive tasks.
• The more structure, the more repetition, the easier to scrape.

For example, the tool Tabula is able to recognize structure within text-based pdf’s. (Note: it does not work with scanned pdfs). Here is a demo of how it works

Now, we are going to creating your first HTML scraper using Google Sheets

Open a Crome browser.

Log into your Google Rowan account (or another Gmail account) and click on Drive.

Click on NEW button. Select Google Sheet.

Give it the title My First Scraper.

Open a second Browser Window. Go to Wikipedia’s List of the most streamed songs on Spotify.

In your spreadsheet, go to cell A1

Type =IMPORTHTML

This formula imports data from a table or list within an HTML page.

The formula is =IMPORTHTML(“url”, “query”, index)

Type in the formula

=IMPORTHTML(“https://en.wikipedia.org/wiki/List_of_most-streamed_songs_on_Spotify”,”table”, 2)

Why does this work?

This scraping formula has two main elements:

1. function
2. parameters

importHTML is the function. Functions do things. This one imports data from a table or list within an HTML page.

Everything within the parentheses are parameters. Parameters determine how the function works. In this case, there are three parameters: URL, table, and number 1

Within this parameter are there two elements:

1. Strings
2. Index

Strings are a series of words or phrases. In this case the URL and table, which are both in “”.

Index – The third parameter is not in quotes because it is a number – it is the number of the table we are looking for. In this case, 1.

Go to the formula in A1 and replace table 1 with 2. This is the second table on the page.

Now replace the 2 with a 3. This is the third table on the page.

How did I know which one to put in my formula? Look at the HTML code.

View the source of the page

View > Developer > View Source

You will see the HTML for the page.

Now search for tables

Edit > Find > Find
Type in <table

You will see the where tables are inserted into the web page.

When you put 1 in your formula you are telling GoogleDocs to get info from the first table.

Another common html structure is a list.

For example, go to a Wikipedia List of prisons by country

This web page is organized as a list.

In A1 of your Google spreadsheet, type
=IMPORTHTML(“https://en.wikipedia.org/wiki/List_of_prisons”, “list”, 1)

Try replacing the number 1 with 2, 3, 4, 5, etc. What do you get? Why?

With “list”, the formula is looking for the tags <ul> which means unordered list or bullet points.

View the source for this page

View the source of the Crime log page.
View > Developer > View Source

You will see the HTML for the page.

Now search for tables

Edit > Find > Find
Type in <ul

You will see there are 140 of them.

You can look for them or you can just keep trying by trial and error until you get what you want.

Trial and error are key to scraping. If you don’t get what you want, try something else.

FINALLY, WE NEED TO GET THE DATA IN A FORM THAT WE CAN ANALYZE.

Select all of the data in your spreadsheet by clicking in cell in upper left hand corner of your spreadsheet – above A and 1. The entire spreadsheet will turn blue.

From Google Sheets menu select

Edit > Copy

Create a new sheet by clicking on the + in the bottom left hand corner

From Google Sheets menu select

Edit > Paste Special > Paste Values Only

This strips the scraping formula out of the spreadsheet and allows you to do things with the data (sort, filter, pivot, etc).

GOOGLE SHEETS HAS A NUMBER OF SCRAPING FUNCTIONS FOR DIFFERENT TYPES OF FILES

For example:

=IMPORTXML – Imports data from any of various structured data types including XML, HTML, CSV, TSV, and RSS and ATOM XML feeds.

=IMPORTDATA – Imports data at a given url in .csv (comma-separated value) or .tsv (tab-separated value) format.

=IMPORTFEED – Imports a RSS or ATOM feed.

=IMPORTRANGE – Imports a range of cells from a specified spreadsheet.

NOW YOU TRY – IN CLASS ACTIVITY

1. Find an online list of every Super Bowl winner from 1966-present. Scrape it into a Google sheet.

2. Create a new sheet. Find an online list of every Super Bowl MVP from 1966-present. Scrape it.

3. Create a new sheet. Find an online list of the current Philadelphia Eagles roster of players. Scrape it.

TELL ME WHEN YOU ARE DONE AND I’LL COME LOOK.

This entry was posted in Lecture. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s