How to Extract Information From a Website Using AppleScript
In previous tutorials we learned how to make AppleScript open a web page, how to use AppleScript to fill out forms on a web page, and how to click buttons on web pages with AppleScript. Today we are going to learn how to extract data from web pages using AppleScript!
In a later tutorial I will teach how to put all of this information together to make a fully automated application that collects and/or inputs multiple bits of data from a website or websites.
For this first example we are going to use Google Chrome’s inspect element tool and grab the first line of a Google search result.
If you have not read my previous two tutorials on clicking and inputting data, this will not make much sense. Please read those first!
As in the previous examples, we are going to need either an ID, Class, or Name, or, if all else fails, the tag that contains the information that we want.
First go to the web page that you would like to grab information from…
We are going to grab the first headline from a search for how to peel a banana and pull it into AppleScript. Right-click on the element you want to grab and click on inspect element to bring up the source code.
If you are following along the code should look something like this…
It looks like we do not have an ID or Name to go off of, so we will have to use Class.
Grabbing Data from a Website Using Class
First, paste this code into the top of your AppleScript Doc…
to getInputByClass(theClass, num) -- defines a function with two inputs, theClass and num
	tell application "Safari" -- tells AS that we are going to use Safari
		set input to do JavaScript "document.getElementsByClassName('" & theClass & "')[" & num & "].innerHTML;" in document 1 -- uses JavaScript to set the variable input to the information we want
	end tell
	return input -- tells the function to return the value of the variable input
end getInputByClass
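To see what getInputByClass actually sends to Safari, it can help to build the JavaScript string on its own and look at it in Script Editor. The sketch below does only that; buildClassQuery is a hypothetical helper name used for illustration and is not part of the tutorial's script.

```applescript
-- Hypothetical helper (an assumption, not part of the original script):
-- builds the same JavaScript string as getInputByClass without talking to Safari.
to buildClassQuery(theClass, num)
	return "document.getElementsByClassName('" & theClass & "')[" & num & "].innerHTML;"
end buildClassQuery

buildClassQuery("r", 0)
-- result: "document.getElementsByClassName('r')[0].innerHTML;"
```

The class name is wrapped in single quotes inside the JavaScript, and the number ends up between the square brackets as the index of the element to read.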
Now that we have our data scraper function we can take our first stab at pulling the info. Enter the following code in your AppleScript doc to get the data…
getInputByClass("r", 0)
In this instance the 0 refers to which headline we would like to pull.
The first result would be 0, the second 1, third 2, etc…
When we try out our code we get…
Hmm, it is good that we have the information that we want, but we also picked up a lot of the HTML. To get rid of the HTML we are going to use AppleScript's text item delimiters.
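If text item delimiters are new to you, here is a minimal sketch of what they do: they tell AppleScript where to cut a piece of text into items. The sample string and the variable names savedDelimiters and parts are made up for illustration.

```applescript
-- Remember the current delimiters so they can be restored afterwards
set savedDelimiters to AppleScript's text item delimiters

-- Cut the sample text wherever "</a>" appears
set AppleScript's text item delimiters to "</a>"
set parts to text items of "<a href='#'>Peel a banana</a> trailing"

-- Put the delimiters back the way we found them
set AppleScript's text item delimiters to savedDelimiters

parts
-- result: {"<a href='#'>Peel a banana", " trailing"}
```

Restoring the saved delimiters at the end matters because the setting is shared by the rest of the script.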
Enter the following code into the top of your AppleScript doc…
to extractText(searchText, startText2, endText)
	set tid to AppleScript's text item delimiters
	set startText1 to "x"
	set searchText to ("x" & searchText)
	set AppleScript's text item delimiters to startText1
	set endItems to text item -1 of searchText
	set AppleScript's text item delimiters to endText
	set beginningToEnd to text item 1 of endItems
	set AppleScript's text item delimiters to startText2
	set finalText to (text items 2 thru -1 of beginningToEnd) as text
	set AppleScript's text item delimiters to tid
	return finalText
end extractText
We can use this function to pull out the text that sits between two pieces of markup. First we need to set our grabbed text to a variable.
set theText to getInputByClass("r", 0)
Next we set up the call to our function. This function takes 3 parameters.
The first is searchText. This is going to be what we retrieved from our getInputByClass function, or theText above.
Second is the startText2 parameter. In order to get this we need to look at our code for what comes right before the information we want to extract and what comes directly after. I’ll explain…
This is the result of our getInputByClass function:
"<a href="http://www.instructables.com/id/The-correct-way-to-peel-a-banana/" onmousedown="return rwt(this,'','','','2','AFQjCNEPeA-Fa4A9BF4tiZnULdYQoDAvmA','tWzuP2eryJt6miXyiNuHRQ','0CCIQFjAB','','',event)">The correct way to peel a banana – Instructables</a>"
We want to extract "The correct way to peel a banana – Instructables", which is between
<a href="http://www.instructables.com/id/The-correct-way-to-peel-a-banana/" onmousedown="return rwt(this,'','','','2','AFQjCNEPeA-Fa4A9BF4tiZnULdYQoDAvmA','tWzuP2eryJt6miXyiNuHRQ','0CCIQFjAB','','',event)">
and
</a>
We are going to set startText2 to "> (a double quote followed by a greater-than sign), which is the last part of:
<a href="http://www.instructables.com/id/The-correct-way-to-peel-a-banana/" onmousedown="return rwt(this,'','','','2','AFQjCNEPeA-Fa4A9BF4tiZnULdYQoDAvmA','tWzuP2eryJt6miXyiNuHRQ','0CCIQFjAB','','',event)">
And then we set endText to:
</a>
Which comes directly after the information that we want.
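Now we enter this into our AppleScript. Note that the double quote inside the start text has to be escaped with a backslash, otherwise the script will not compile…
set theResult to extractText(theText, "\">", "</a>")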
and get the result: "The correct way to peel a banana – Instructables"
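As far as I can tell, extractText stops with an error when the start text is not in the grabbed HTML, and because it first splits on the letter "x" it only searches the part of the text after the last "x". Below is a minimal sketch of a more forgiving variant. It is not the handler used in this tutorial: extractBetween, startMarker and endMarker are hypothetical names, and returning missing value when the start marker is absent is a design choice I am assuming would be convenient.

```applescript
-- Hypothetical variant (an assumption, not the tutorial's handler):
-- returns the text between the first startMarker and the next endMarker,
-- or missing value when the start marker cannot be found.
to extractBetween(sourceText, startMarker, endMarker)
	set savedDelimiters to AppleScript's text item delimiters
	try
		-- Split on the start marker and keep everything after its first occurrence
		set AppleScript's text item delimiters to startMarker
		set pieces to text items of sourceText
		if (count of pieces) < 2 then error "start marker not found"
		set afterStart to (items 2 thru -1 of pieces) as text
		-- Keep everything before the first end marker
		set AppleScript's text item delimiters to endMarker
		set resultText to text item 1 of afterStart
	on error
		set resultText to missing value
	end try
	-- Always restore the delimiters, even after a failure
	set AppleScript's text item delimiters to savedDelimiters
	return resultText
end extractBetween

extractBetween("<a href=\"#\">Peel</a>", "\">", "</a>")
-- result: "Peel"
```

If the end marker is missing, this sketch returns everything after the start marker, so check the result before using it.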
See below for other ways to grab data. If you read the previous tutorials mentioned at the beginning of this article, you will understand how to use the code below.
Grabbing Data from a Website Using ID
to getInputById(theId)
	tell application "Safari"
		set input to do JavaScript "document.getElementById('" & theId & "').innerHTML;" in document 1
	end tell
	return input
end getInputById
Grabbing Data from a Website Using Tag
to getInputByTag(theTag, num) -- defines a function with two inputs, theTag and num
	tell application "Safari" -- tells AS that we are going to use Safari
		set input to do JavaScript "document.getElementsByTagName('" & theTag & "')[" & num & "].innerHTML;" in document 1
	end tell
	return input
end getInputByTag
Grabbing Data from a Website Using Name
to getInputByName(theName, num) -- defines a function with two inputs, theName and num
	tell application "Safari" -- tells AS that we are going to use Safari
		set input to do JavaScript "document.getElementsByName('" & theName & "')[" & num & "].innerHTML;" in document 1
	end tell
	return input
end getInputByName
Hi,
So I managed to do the getinputbyid section, But when I tried to copy and paste the extracttext code, it came up with an error:
Syntax error
expected “,” but found identifier.
I believe that this was at the “ofbeginningtoend” area. Any suggestions?
Hey Kay,
If you send me your applescript, or even that snippet of code you believe is the issue I can take a look for you.
Or more specifically, what are you putting here:
extractText(?,?,?)
How can I use AppleScript to get information from a list of URLs? I need to retrieve one specific piece of information, the URL of the img, for each URL in the list. There are many thousands of them, so I can't proceed manually.
Hey Michael,
Can you send me an example of what you are looking to do? I sent you an email so we can talk more in depth.
Hello there,
Would it be possible to go more in depth with this? This would be very useful for work but I’m trying to extract text from a simple table and output it in a certain format so that I can paste it into another program.
Hey Ryan,
I sent you an email. Shoot me an example and I’ll see how we can make it work for you.
Hey Samuel, this is brilliant. I have pretty much combined all this information together and have it all working. I was hoping there is a way to use the extracted information to add it to an input box on the site. Sounds spammy, I know, but to explain a little: we joined a wedding photographer directory where couples enquire about availability. To inform them of our availability we have to load up a site and input all our data over and over again, including a randomly generated number. Help very much appreciated!
Hey Tom, Check out my other post How to fill out forms on a website using applescript. If this post doesn’t answer your question shoot me an email and I’ll help you out.
Hey Samuel, yeh followed that and it worked out great thank you. Was basically a combination of both posts. Have zapped you an email anyway 🙂
Hey Sam, just thought I would see if you ever got chance to look at this?
All the best
Great scripts! Exactly what I´ve been looking for.
I had many problems with extractText, so I changed it to this:
to extractText2(searchInText, textPre, textPost)
	set tid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to textPre
	set searchInText to second text item of searchInText
	set AppleScript's text item delimiters to textPost
	set searchInText to first text item of searchInText
	set finalText to searchInText
	set AppleScript's text item delimiters to tid
	return finalText
end extractText2
Thanks Ignacio! That part of the script can be very finicky… I’m glad you figured this out and shared it.
Hi, great Tutorial.
Is it possible to get the data for multiple nums?
Hey Has thanks! Yes just run the code twice for example:
set numOne to getInputByClass("r", 0)
set numTwo to getInputByClass("r", 1)
or you can use a repeat loop:
set x to 0
repeat 3 times
	set someVar to getInputByClass("r", x)
	-- Enter code here for what you want to do with someVar…
	set x to x + 1 -- moves to the next number
end repeat
when i do the first part it says missing value why?
thanks!
There could be a number of different reasons. Please email me your code and the source code and I’ll take a look.
Hi Samuel,
I'm new to JavaScript and am having trouble getting this code to work on El Capitan & Chrome 52. When I step through the code, it looks like Safari isn't finding any elements with the classname. However, if I manually call getElementsByTagName with the appropriate tag and index in Chrome's JavaScript console, I get exactly what I'm looking for. Any thoughts? Here's the entire AppleScript document:
use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions
to getInputByClass(theClass, num) -- defines a function with two inputs, theClass and num
	tell application "Google Chrome" -- tells AS that we are going to use Safari
		activate first tab of first window
		set input to execute javascript "document.body.getElementsByClassName('" & theClass & "')[" & num & "].innerHTML;" -- uses JavaScript to set the variable input to the information we want
	end tell
	return input -- tells the function to return the value of the variable input
end getInputByClass
getInputByClass("moduletable", 1)
Thanks for any insight you can provide!
Hey Brad,
You have to use Safari; this code will not work with Chrome.