AppleScript Screen Scraping to Excel Example
NEW! AppleScript Maker Beta: Hey Everyone, I am working on a new AppleScript tool that is going to BLOW YOUR MIND! After reading this tutorial use this tool to easily build your AppleScripts. Click Here!
In previous tutorials we learned how to extract information from a web page. In this tutorial I am going to show you another real world way to extract information and then output it to an excel document.
If have not read my previous tutorials on clicking, extracting, and inputting data from a web page, this will not make much sense. Please view these first!
In this example we are going to be grabbing the previous episode number of a few top television shows and then outputting them to an Excel sheet.
Scraping the Text
First we are going to go to next-episode.net and inspect the element for the first top show Game of Thrones.
It looks like there is no ID, class, or name element… but we can grab it by the “a” attribute tag.
Here is a nifty solution I came up with to find the element and then click on it.
First place this at the top of your AppleScript doc.
to getInputByTag(theTag, num) -- defines a function with two inputs, theTag and num tell application "Safari" --tells AS that we are going to use Safari set input to do JavaScript " document.getElementsByTagName('" & theTag & "')[" & num & "].innerHTML;" in document 1 end tell return input end getInputByTag
Also place this at the top of your doc, it will let you click on an element by tag name:
to clicktagName(thetagName, elementnum) tell application "Safari" do JavaScript "document.getElementsByTagName('" & thetagName & "')[" & elementnum & "].click();" in document 1 end tell end clicktagName
Next place this in your code to search for and click on an element:
set x to 0 -- this is our test variable set tvshow to "Game of Thrones" repeat 1000 times -- this is going to repeat with every element number until what we are searching for is found set link to getInputByTag("a", x) -- this is going to retrieve the input of the element with tag a and the element number x if tvshow is in link then -- checks to see if Game of Thrones is in the output exit repeat -- if its found then we stop looking end if set x to x + 1 -- if its not found we continue to the next element end repeat clicktagName("a", x) -- clicks the tag that was found to have game of thrones in it
Now, when you run this script it will open the page for Game of Thrones.
Next we need to take a look at our desired information… the previous episode.
It looks like, again, there is no way to grab the information directly. There is no name, class, or ID to reference for us to pull.
So… in this case we need to back up and grab the id from the div that contains what we are looking for. Here we are going to use the id=”previous_episode”
To do this, input the following code into the top of your AppleScript:
to getInputById(theId) tell application "Safari" set input to do JavaScript " document.getElementById('" & theId & "').innerHTML;" in document 1 end tell return input end getInputById
Next we add the following code to our Applescript under what we already have. I would add a delay of a second or two to make time for the page to load.
delay 2 getInputById("previous_episode")
When we run this we get this whole mess of HTML it should look like this:
<div class=”headline” style=”padding-left:0px;”><h2>Previous Episode</h2></div>
<div></div>
<div style=”margin-top:5px;”><div class=”subheadline” style=”margin-top:0px;top:1;position:absolute”><h3>Name:</h3></div><div style=”display:inline-block;width:260px;margin-left:90px;”>The Dance of Dragons</div></div>
<div></div>
<div class=”subheadline”><h3>Date:</h3></div>Sun Jun 07, 2015<div></div>
<div class=”subheadline”><h3>Season:</h3></div>5<div></div>
<div style=”margin-top:5px;”><div class=”subheadline” style=”margin-top:0px;top:1;position:absolute”>Episode:</div><div style=”display:inline-block;width:260px;margin-left:90px;”>9</div></div>
<div></div>
<div style=”margin-top:5px;”><div class=”subheadline” style=”margin-top:0px;top:1;position:absolute”>Summary:</div><div id=”lastEpSummary” style=”display:inline-block;width:260px;margin-left:90px;”><a href=”javascript: void(0)” onclick=”getSummary2(‘lastEpSummary’, ‘http://www.tvrage.com/Game_of_Thrones/episodes/1065797704’)”>Episode Summary</a></div></div>
<div></div>
If we take a look at this we find the following with our desired information right in the middle.
Episode:</div><div style=”display:inline-block;width:260px;margin-left:90px;”>9</div></div>
Next we are going to run our extraction script to pull out the episode number, place this at the top of your page:
to extractBetween(SearchText, startText1, startText2, endText) set tid to AppleScript's text item delimiters set AppleScript's text item delimiters to startText1 set endItems to text item -1 of SearchText set AppleScript's text item delimiters to endText set beginningToEnd to text item 1 of endItems set AppleScript's text item delimiters to startText2 set finalText to (text items 2 thru -1 of beginningToEnd) as text set AppleScript's text item delimiters to tid return finalText end extractBetween
Next place this code in your Script:
set prevEpisodeDiv to getInputById("previous_episode") -- grabs mass of HTML set startText1 to "Episode:</div>" -- looks for something that says Episode set startText2 to "<div style="display:inline-block;width:260px;margin-left:90px;">" -- looks for something that matches this line of text after episode set endText to "</div>" -- looks for </div> after start text 1 and 2 set prevEpisodeNumber to extractBetween(prevEpisodeDiv, startText1, startText2, endText) --grabs what is inbetween the start text and the end text which is our episode number!
Using AppleScript to Write to Excel
Next we want to output this to an Excel doc. So we will tell applescript to open Excel, make a new doc, add headers, and paste the show and previous episode number into cells A2 and B2:
Use this script to accomplish this:
tell application "Microsoft Excel" activate make new document set value of cell ("a1") to "TV Show" set value of cell ("b1") to "Previous Episode Number" set value of cell ("a2") to tvshow set value of cell ("b2") to prevEpisodeNumber end tell
Awesome right??
Automating the Scrape
Ok this is cool… but this isn’t really automation if we have to do this for every page…
Now here is where the fun begins with AppleScript’s repeat function. In order to do a list of shows we need to reconfigure the order of the applescript a bit and add in a few things.
First we are going to change the “Make the new Excel document” part of our script and place first instead of last…Keep the whole writing to the excel part at the bottom of the script. (below is a full example of what the finished script should look like)
QUICK EXCEL TUTORIAL
After that we are going to make a list of all the shows we want to grab. I am going to do this by copying the sidebar of the site to an Excel doc. Paste the shows into cells A2 and below, we need a blank cell in order to do the formula.
Next copy this formula into cell B2 and copy it down. (Click on cell B2 and double click on the box at the bottom right corner)
=CONCATENATE(B1,””””,A2,””””,”,”)
Now copy the last cell in column B that you are using.
And in the top of your AppleScript doc write this:
set tvShows to {}
Then paste your list into the middle of the two brackets. Make sure to delete the comma for the last item in our list.
It should now look like this:
set tvShows to {"Game of Thrones", "The Big Bang Theory", "The Walking Dead", "Homeland", "Suits", "Arrow", "True Blood", "Modern Family", "Sherlock", "Supernatural", "Under the Dome", "New Girl", "Marvel's Agents of S.H.I.E.L.D.", "The Blacklist", "The Vampire Diaries", "Person of Interest", "Grey's Anatomy", "Vikings", "Once Upon a Time"}
Now above the line that says:
set x to 0
You are going to write:
repeat with tvshow in tvShows
delete the line below that that says:
set tvshow to "Game of Thrones"
And for the very last line in your script write:
end repeat
Ok sorry that was a lot really fast…
What we are doing here is setting the script to go through our list of shows one by one.
The last thing we need to do is set up Excel to move to the next line every time a new show’s information is pulled.
To do this write this code above repeat with tvshow in tvShows :
set y to 2
Then change the part where you are writing your info to excel to to this:
tell application "Microsoft Excel" set value of cell ("a" & y) to tvshow set value of cell ("b" & y) to prevEpisodeNumber end tell
And finally right above end repeat write:
set y to y + 1
And you are done!
I know that last part kind of got hairy… so here is a recap of what the final script should look like. I hope you found this example useful!
to getInputById(theId) tell application "Safari" set input to do JavaScript " document.getElementById('" & theId & "').innerHTML;" in document 1 end tell return input end getInputById to clicktagName(thetagName, elementnum) tell application "Safari" do JavaScript "document.getElementsByTagName('" & thetagName & "')[" & elementnum & "].click();" in document 1 end tell end clicktagName to getInputByTag(theTag, num) -- defines a function with two inputs, theTag and num tell application "Safari" --tells AS that we are going to use Safari set input to do JavaScript " document.getElementsByTagName('" & theTag & "')[" & num & "].innerHTML;" in document 1 end tell return input end getInputByTag to extractBetween(SearchText, startText1, startText2, endText) set tid to AppleScript's text item delimiters set AppleScript's text item delimiters to startText1 set endItems to text item -1 of SearchText set AppleScript's text item delimiters to endText set beginningToEnd to text item 1 of endItems set AppleScript's text item delimiters to startText2 set finalText to (text items 2 thru -1 of beginningToEnd) as text set AppleScript's text item delimiters to tid return finalText end extractBetween tell application "Microsoft Excel" activate make new document set value of cell ("a1") to "TV Show" set value of cell ("b1") to "Previous Episode Number" end tell set tvShows to {"Game of Thrones", "The Big Bang Theory", "The Walking Dead", "Homeland", "Suits", "Arrow", "True Blood", "Modern Family", "Sherlock", "Supernatural", "Under the Dome", "New Girl", "Marvel's Agents of S.H.I.E.L.D.", "The Blacklist", "The Vampire Diaries", "Person of Interest", "Grey's Anatomy", "Vikings", "Once Upon a Time"} set y to 2 repeat with tvshow in tvShows set x to 0 -- this is our test variable repeat 1000 times -- this is going to repeat with every element number until what we are searching for is found set link to getInputByTag("a", x) -- this is going to retrieve the input of the element with tag a and the element number x if tvshow is in link then -- checks to see if Game of Thrones is in the output exit repeat -- if its found then we stop looking end if set x to x + 1 -- if its not found we continue to the next element end repeat clicktagName("a", x) -- clicks the tag that was found to have game of thrones in it delay 2 set prevEpisodeDiv to getInputById("previous_episode") -- grabs mass of HTML set startText1 to "Episode:</div>" -- looks for something that says Episode set startText2 to "<div style="display:inline-block;width:260px;margin-left:90px;">" -- looks for something that matches this line of text after episode set endText to "</div>" -- looks for </div> after start text 1 and 2 set prevEpisodeNumber to extractBetween(prevEpisodeDiv, startText1, startText2, endText) --grabs what is inbetween the start text and the end text which is our episode number! tell application "Microsoft Excel" set value of cell ("a" & y) to tvshow set value of cell ("b" & y) to prevEpisodeNumber end tell set y to y + 1 end repeat