Whenever the AATA mobile ride track is up, I have a script that runs periodically to decode its web pages and turn them into data. The pages are formatted for humans, so I have to screen-scrape them to reconstruct the original data values.
There are a lot of tools for screen scraping, each appealing to programmers with a particular view of the web and how things are put together. The one I find myself recommending when someone who sounds like a programmer asks is Beautiful Soup (for Python) or Rubyful Soup (for Ruby), at least in part because its author's pragmatic attitude suits the task at hand:
You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like.
Neither does this parser.
Rubyful Soup's page says "no longer maintained, try hpricot instead"; I looked for hpricot and got some 404s. Not sure really what's up with that.
As for what I really do, rather than what I recommend: most of the code I write that actually does anything useful is less than a page long and mostly full of regular expressions. Raw HTML mostly resists simple-minded regex parsing, but if you constrain the world enough, you can get most data out of most pages by first running the page through a preprocessor that normalizes it into something sane.
My favorite for that effort is the old school "HTML Tidy" application, which has a zillion options by now for taking your weird web pages and making them pass syntax checks.
Thus the source of the main loop for the "mobile ride guide decoder":
curl -s -o "$TMP/route.5.html" "http://mobile.theride.org/rideguide_m.asp?route=5"
tidy -f /dev/null -q -w 0 -o "$TMP/route.5.tidy.html" "$TMP/route.5.html"
perl ./mobiletocsv.pl < "$TMP/route.5.tidy.html" >> route.5.log.csv
and the "mobiletocsv" Perl script is very simple-minded:
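#!perl
# Decode one tidied route page on stdin into CSV rows on stdout.
while (<>) {
    # Timestamp the page itself reports, e.g. "AS OF 14:05".
    /AS OF (\d+:\d\d)/ && ($curtime = $1);
    # Wall-clock time on this machine when the line is processed.
    $realtime = `date +%H:%M`;
    chomp $realtime;
    # Schedule adherence in minutes: positive is behind, negative is ahead.
    /(\d+) min behind/ && ($late = $1);
    /(\d+) min ahead/ && ($late = -$1);
    /(\d+) on time/i && ($late = 0);
    # Bus number plus destination or direction.
    /(\d\d\d) to (Ann Arbor|Ypsilanti)/ && ($busno = $1, $dest = $2);
    /(\d\d\d) (OutBound|Loop)/ && ($busno = $1, $dest = $2);
    # Current location of the bus.
    /^@ (.*)<br>/ && ($curloc = $1);
    # A "place hh:mm" line completes a record; print one CSV row for it.
    /^(.*) (\d+:\d\d)/ && ($newloc = $1, $timepoint = $2,
        print "$realtime,$curtime,$late,$busno,$dest,$curloc,$newloc,$timepoint\n");
}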
It's probably not resistant to aggressively malformed input, and I simply don't run it when the result is "please check your timetable for times".
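That "don't run it" check could be automated along these lines. This is only a sketch, not what I actually run: the function names has_live_data and decode_route are made up for illustration, the TMP default is an assumption, and it assumes the placeholder message survives the tidy step on a single line.

```shell
#!/bin/sh
# Sketch: skip the decoder when the page carries no live data.
# has_live_data and decode_route are hypothetical names.

TMP="${TMP:-/tmp}"   # assumed default; set TMP yourself to override

# Succeed (exit 0) only when the file is non-empty and does NOT contain
# the "please check your timetable" placeholder.
has_live_data() {
    [ -s "$1" ] || return 1
    if grep -qi 'please check your timetable' "$1"; then
        return 1
    fi
    return 0
}

# Fetch, tidy, and decode one route, appending to that route's CSV log.
decode_route() {
    route="$1"
    raw="$TMP/route.$route.html"
    clean="$TMP/route.$route.tidy.html"
    curl -s -o "$raw" "http://mobile.theride.org/rideguide_m.asp?route=$route" || return 1
    # tidy can exit non-zero on mere warnings, so its status is ignored.
    tidy -f /dev/null -q -w 0 -o "$clean" "$raw"
    # No live data is not an error; just log nothing this time around.
    has_live_data "$clean" || return 0
    perl ./mobiletocsv.pl < "$clean" >> "route.$route.log.csv"
}
```

With those defined, "decode_route 5" would stand in for the three commands above, and a cron job could call it for whichever routes are of interest.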
The first and second iterations of this code tried to do it all without a tidy step, which was hard mainly because I had to parse syntax and parse data in the same pass. This version is compact enough that it might reasonably be a jumping-off point for future enhancements.
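To illustrate why separating the two helps: once the page has been normalized so that each item sits on its own line, an anchored pattern per field is enough. The sketch below uses sed rather than Perl, and the sample page is invented, shaped only to match the patterns in mobiletocsv.pl; the real tidied output may well look different. The function names are hypothetical too.

```shell
#!/bin/sh
# Invented sample of tidied output; not captured from the real site.
sample_page() {
    cat <<'EOF'
<p>AS OF 14:05<br>
123 to Ann Arbor<br>
3 min behind<br>
@ Example St and Sample Ave<br>
</p>
EOF
}

# Print the current location: the text between a leading "@ " and "<br>".
current_location() {
    sed -n 's/^@ \(.*\)<br>.*$/\1/p'
}

# Print minutes late: positive when behind, negative when ahead.
# Assumes the number is the first run of digits on its line.
minutes_late() {
    sed -n \
        -e 's/^[^0-9]*\([0-9][0-9]*\) min behind.*/\1/p' \
        -e 's/^[^0-9]*\([0-9][0-9]*\) min ahead.*/-\1/p'
}
```

Piping the sample through either function (for example "sample_page | current_location") pulls out one field without any knowledge of HTML syntax, which is the whole point of letting tidy deal with the markup first.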
Still waiting for the bus data to come back...
