[How-To] How to parse a web page

electron

THIS TUTORIAL IS A DRAFT

By popular request, I decided to write a small tutorial on how to parse a web page (also known as screen scraping) and grab the data you are looking for. The example is a script for the HomeSeer automation software and uses the hs.GetUrl function. It's written in VBScript, but many of these functions exist in other languages too, so this is a good starting point no matter what language you want to use.

In this example, I am going to grab the data from http://www.randomjoke.com/topic/oneliners.php which shows a random one-liner/joke every time you load the page. The functions I used for this page are also the ones you would use when parsing other websites. Everyone has their own "coding" style, so there are different (and more efficient) ways of doing the same thing, but I am trying to keep things easy here.

These are the most important functions used in this example:
HomeSeer Functions:
hs.GetUrl
hs.WriteLog

VBScript Functions and Statements:
Dim (http://www.devguru.com/Technologies/vbscript/quickref/dim.html)
Replace (http://www.devguru.com/Technologies/vbscript/quickref/replace.html)
InStr (http://www.devguru.com/Technologies/vbscript/quickref/instr.html)
Mid (http://www.devguru.com/Technologies/vbscript/quickref/mid.html)

Most scripts are made up of subs and functions. A sub executes a series of instructions without returning anything, while a function is designed to do something with the data it was given and return the result. All HomeSeer scripts have a sub named main(); it can take parameters, but this example doesn't need any.
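To make the difference concrete, here is a minimal sketch. The names AddGreeting and ShowGreeting are made up for illustration, and WScript.Echo is used so it can be run outside HomeSeer with cscript (inside HomeSeer you would use hs.WriteLog instead).

```vbscript
' A Function computes a value and returns it by assigning to its own name.
Function AddGreeting(strName)
  AddGreeting = "Hello, " & strName
End Function

' A Sub runs statements but does not return anything.
Sub ShowGreeting(strName)
  WScript.Echo AddGreeting(strName)
End Sub

ShowGreeting "HomeSeer"   ' prints: Hello, HomeSeer
```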

Before we do the actual processing, we declare the variables we want to use in this script. It's a good habit to follow a variable naming convention to keep things organized. In this case, instead of using "Site", "Document" and "Body", I used "strSite", "strDocument" and "strBody", showing that these variables are meant to hold strings. (VBScript itself stores every variable as a Variant, so the prefix is only a convention; other common subtypes are Integer and Boolean.)
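If you are curious what a variable actually holds, the built-in TypeName function reports it. This is just a small sketch to run with cscript (WScript.Echo is not available inside HomeSeer), and it shows why the "str" prefix is a convention rather than something the language enforces.

```vbscript
Dim strBody
WScript.Echo TypeName(strBody)   ' prints: Empty (declared, nothing assigned yet)

strBody = "hello"
WScript.Echo TypeName(strBody)   ' prints: String

' Nothing stops you from putting a number in a "str" variable,
' which is exactly why sticking to the naming convention helps.
strBody = 42
WScript.Echo TypeName(strBody)   ' prints: Integer
```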


sub main()
  ' We use 'Dim' to declare the variables we want to use
  Dim strSite
  Dim strDocument
  Dim strBody
  Dim x

'hs.GetUrl' requires us to split the URL into several parts: the host (strSite), the document we want to retrieve (strDocument), and the port (80, the default HTTP port).

  strSite = "www.randomjoke.com"
  strDocument = "/topic/oneliners.php"

We use the 'hs.GetUrl' function to connect to the website, retrieve the HTML code, and store it in 'strBody'. Once we have the data, we remove all the tabs, line feeds and carriage returns by replacing them with "" (nothing). chr(9) is the ASCII code for the tab character, while chr(10) and chr(13) are the ASCII codes for line feed and carriage return (together, what the Enter key produces on Windows). This doesn't have to be done, but it does make parsing easier, since removing all CRs and LFs puts everything on one line.

  strBody = hs.GetUrl(strSite,strDocument,false,80)
  strBody = Replace (strBody, chr(9), "")
  strBody = Replace (strBody, chr(10),"")
  strBody = Replace (strBody, chr(13), "")
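As a side note, you can try the clean-up step on its own without fetching anything. This sketch uses a made-up sample string (strSample) and WScript.Echo, so it runs with cscript outside HomeSeer; vbTab and vbCrLf are VBScript's built-in constants for chr(9) and chr(13) & chr(10).

```vbscript
Dim strSample

' A small stand-in for a downloaded page, with a tab and two line breaks in it.
strSample = "<P>" & vbCrLf & vbTab & "Hello" & vbCrLf & "<CENTER>"

strSample = Replace(strSample, Chr(9), "")    ' remove tabs
strSample = Replace(strSample, Chr(10), "")   ' remove line feeds
strSample = Replace(strSample, Chr(13), "")   ' remove carriage returns

WScript.Echo strSample   ' prints: <P>Hello<CENTER>
```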

This is a snippet of the HTML source code that contains the actual quote. You can view the source by going to the site with your browser, right click and select 'View Source'. I'll highlight the keywords we are using here to grab the data.
<MAP NAME="backnewMap3"><AREA SHAPE="rect" COORDS="105,4,189,27" HREF="/topic/oneliners.php?53480"><AREA SHAPE="rect" COORDS="39,33,162,58" HREF="../topiclist.html"></MAP><IMG SRC="/images/backnew.gif" WIDTH="197" HEIGHT="77" ALIGN="top"
      BORDER="0" NATURALSIZEFLAG="3" USEMAP="#backnewMap3" ISMAP ALT="next joke|back to topic list"></P>
<P>
If we learn by our mistakes then I am getting a fantastic education.
<CENTER>
<div align="center">
<p>

The Instr() function tells us the position of a given string (or 0 if it isn't found), and Mid() lets us select part of a string by specifying the start position and, optionally, the number of characters to take (Instr is how you figure out where the data you are looking for starts). When you are looking for data in a web page, try to use a unique 'keyword'. In this case, "topic list" occurs only once and brings us pretty close to the data we are looking for. Once you get to where the quote starts, you have to figure out where the quote ends, so you can chop off the rest of the useless HTML.
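Here is a quick sketch of how these two functions behave on a short made-up string (strText), to run with cscript. Note that the last argument of Instr (1) asks for a text comparison, which ignores upper and lower case.

```vbscript
Dim strText, x
strText = "alpha<p>beta<c>gamma"

' Text comparison (last argument = 1) ignores case, so "<P>" matches "<p>".
x = InStr(1, strText, "<P>", 1)
WScript.Echo x                             ' prints: 6

' Mid with two arguments returns everything from the start position on.
WScript.Echo Mid(strText, x + 3)           ' prints: beta<c>gamma

' Mid with three arguments returns that many characters (a length, not an end position).
WScript.Echo Mid(strText, x + 3, 4)        ' prints: beta

' InStr returns 0 when the search string is not found.
WScript.Echo InStr(1, strText, "zzz", 1)   ' prints: 0
```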

Look for the keyword "topic list" and return its position.
x = Instr(1,strBody,"topic list",1)
---
Update strBody, starting at the position returned by the "topic list" search in the previous line, basically chopping off a big chunk of useless HTML code.
strBody = Mid(strBody,x)
strBody's value is now: topic list"></P><P>If we learn by our mistakes then I am getting a fantastic education.<CENTER>...
---
Now we move closer to the quote by looking for <p>, which is immediately followed by the quote. The last argument (1) makes the search case-insensitive, so it also matches <P>. We add 3 to the returned position since "<p>" is 3 characters long and we want the data starting AFTER <p> (try removing the + 3 and you will see that strBody still begins with the tag).
x = Instr(1,strBody, "<p>",1) + 3
---
Update strBody again.
strBody = Mid(strBody,x)
strBody's value is now: If we learn by our mistakes then I am getting a fantastic education.<CENTER>...
---
We now want to remove all data after the quote. The first characters following the quote are "<CENTER>", so we look for "<c" and subtract 1 to get the position of the last character of the quote.
x = Instr(1,strBody, "<c",1) - 1
---
Update strBody again, but this time give Mid both a start position (the first character) and a length (x), so the result ends just before "<c" begins.
strBody = Mid(strBody,1,x)
strBody's value is now: If we learn by our mistakes then I am getting a fantastic education.

In this example, I am just going to print the result to the log window, but you can use e.g. hs.SetDeviceString "v5", strBody to update a device with the result.

  hs.writelog "debug", strBody

end sub

This is what the finished script looks like:

sub main()

  Dim strSite
  Dim strDocument
  Dim strBody
  Dim x

  strSite = "www.randomjoke.com"
  strDocument = "/topic/oneliners.php"
  strBody = hs.GetUrl(strSite,strDocument,false,80)

  strBody = Replace (strBody, chr(9), "")
  strBody = Replace (strBody, chr(10),"")
  strBody = Replace (strBody, chr(13), "")


  x = Instr(1,strBody,"topic list",1)
  strBody = Mid(strBody,x)
  x = Instr(1,strBody, "<p>",1) + 3
  strBody = Mid(strBody,x)
  x = Instr(1,strBody, "<c",1) - 1
  strBody = Mid(strBody,1,x)



  hs.writelog "debug", strBody

end sub
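One thing the script above does not handle is a page layout change: if a keyword is missing, Instr returns 0 and Mid will raise an error when given 0 as the start position. Below is one possible way to guard against that. GetBetween is a hypothetical helper I made up for this sketch (it is not a HomeSeer or VBScript built-in), and strHtml is a hard-coded stand-in for the downloaded page so the example can be run with cscript.

```vbscript
' Hypothetical helper: returns the text between strStart and strEnd,
' or "" if either marker cannot be found (case-insensitive search).
Function GetBetween(strText, strStart, strEnd)
  Dim lngStart, lngEnd
  GetBetween = ""

  lngStart = InStr(1, strText, strStart, 1)
  If lngStart = 0 Then Exit Function        ' start marker missing
  lngStart = lngStart + Len(strStart)       ' skip past the marker itself

  lngEnd = InStr(lngStart, strText, strEnd, 1)
  If lngEnd = 0 Then Exit Function          ' end marker missing

  GetBetween = Mid(strText, lngStart, lngEnd - lngStart)
End Function

Dim strHtml
' Stand-in for the cleaned-up page source (tabs and line breaks already removed).
strHtml = "back to topic list""></P><P>If we learn by our mistakes then I am getting a fantastic education.<CENTER><div>"

' Jump to the unique keyword first, then take what sits between <p> and <c.
strHtml = Mid(strHtml, InStr(1, strHtml, "topic list", 1))
WScript.Echo GetBetween(strHtml, "<p>", "<c")
' prints: If we learn by our mistakes then I am getting a fantastic education.
```

Inside a HomeSeer script you would call hs.WriteLog instead of WScript.Echo, and could log a warning when GetBetween comes back empty.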


Some good VBscript references:
http://www.devguru.com/Technologies/vbscri...ript_intro.html
 
Thanks Electron. I'm going to play with this later on tonight. Gonna use the DECENT weather today to work on my truck. Hey...maybe I should have used this script to grab the weather and have HS tell me "Bill, get off your butt and work on your truck"! ;)
 
lol, I do the exact same thing with my supertrigger plugin. I have the temperature in a virtual device, so I just told supertrigger to announce "it's so cold outside" when the temperature drops below 21F (just as a test). We had nice weather up to yesterday, we are back into our lake effect snow rhythm.
 