Writing Intelligent Web Crawlers: Consume HTML the 'Right Way'™
Time: Aug 30, 10:00 a.m.
Location: SI 060

No one ever wants to scrape HTML for data, but all too often the information cannot be accessed any other way. While most crawlers are treated as hacks, new libraries make manipulating HTML almost as easy as XML. In this session we will not only cover the basics of good web crawler design (fault tolerance, retry queues, multiple threads, interval mapping, and mechanization tools) but also study workarounds for common pitfalls, as well as tools to help reverse-engineer dynamic sites.
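As a rough illustration of how a retry queue and multiple worker threads can fit together, here is a minimal sketch using only the Python standard library. It is not taken from the session materials: the names `LinkExtractor` and `crawl`, the `fetch` callback, and the `max_retries` and `num_workers` parameters are all assumptions made for this example, and the parser is the standard `html.parser`, not one of the newer libraries the talk refers to.

```python
import queue
import threading
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_retries=3, num_workers=4):
    """Crawl from seed_urls with worker threads and a retry queue.

    fetch is supplied by the caller: it takes a URL and returns HTML,
    raising an exception on failure. Returns (pages, failed), where
    pages maps each URL to the absolute links found on it and failed
    maps each abandoned URL to its last error message.
    """
    work = queue.Queue()
    lock = threading.Lock()
    pages, failed = {}, {}
    seen = set(seed_urls)
    for url in seen:
        work.put((url, 0))

    def worker():
        while True:
            item = work.get()
            if item is None:  # sentinel: no more work
                work.task_done()
                return
            url, attempts = item
            try:
                parser = LinkExtractor()
                parser.feed(fetch(url))
                links = [urljoin(url, href) for href in parser.links]
                with lock:
                    pages[url] = links
                    for link in links:
                        if link not in seen:
                            seen.add(link)
                            work.put((link, 0))
            except Exception as exc:
                if attempts + 1 < max_retries:
                    # Put the URL back on the queue for another attempt.
                    work.put((url, attempts + 1))
                else:
                    with lock:
                        failed[url] = str(exc)
            finally:
                work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    work.join()  # wait until every queued URL is done or abandoned
    for _ in threads:
        work.put(None)
    for t in threads:
        t.join()
    return pages, failed
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP client and lets it be exercised against canned pages. A real crawler would also want politeness delays, domain restrictions, and a depth limit, all of which are omitted here.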

About Kevin Kubasik

Kevin Kubasik (My Parents Basement Software)

Kevin Kubasik is an avid open source developer with a passion for learning. At 14 he got involved in his first project (Beagle) and never looked back. Almost 6 years later he is a member of the GNOME Foundation, an Ubuntu Member and Packager, and a full-time Computer Science student at Neumont University (having transferred from Case Western Reserve).

UTOSC 2008 Sponsors

