Network programming for PyS60 (XIV)
by Marcelo Barros
Have you heard about Beautiful Soup? Beautiful Soup is a Python HTML/XML parser with many useful methods for collecting data from pages and for navigating, searching, and modifying the parse tree. In this post I will show some usage examples of BeautifulSoup and a link checker application for S60, combining BeautifulSoup and urllib.
Suppose you have a large (and confusing) HTML file. How about displaying it fully organized and indented? That is possible with the prettify() method. Just create a BeautifulSoup object and feed it your HTML, like below:
from BeautifulSoup import BeautifulSoup

html = u"""<html><body><h1 style="text-align:center">Heading 1</h1>
<p>Page content goes here. <h2>And here.</h2></p><a href="http://croozeus.com/blogs" alt="Croozeus link">Croozeus</a><br/>
<a href="http://www.python.org">Python</a><br/>
<h1>The end.</h1></body>
</html>"""

soup = BeautifulSoup(html)
print soup.prettify()
The output is below:
<html>
 <body>
  <h1 style="text-align:center">
   Heading 1
  </h1>
  <p>
   Page content goes here.
   <h2>
    And here.
   </h2>
  </p>
  <a href="http://croozeus.com/blogs" alt="Croozeus link">
   Croozeus
  </a>
  <br />
  <a href="http://www.python.org">
   Python
  </a>
  <br />
  <h1>
   The end.
  </h1>
 </body>
</html>
The parse tree offers nice operations like findAll(). For instance, how about printing all links in the page? Just pass a dictionary with all the tags you want as the argument:
links = soup.findAll({'a':True})
for link in links:
    print "-->", link['href'].encode('utf-8')
The output is below:
--> http://croozeus.com/blogs
--> http://www.python.org
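Each item returned by findAll() is a tag object, so the anchor text is available too. Here is a minimal sketch, reusing the soup object from above (string holds the tag's text when the tag has exactly one string inside):

for link in soup.findAll({'a':True}):
    # Tag.string is the single text child of the tag
    print link.string, "-->", link['href'].encode('utf-8')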
Now suppose you want to add an alt attribute to every link, modifying the parse tree. Simple:
for link in links:
    link['alt'] = link['href']
print soup.prettify()
The output is below:
...
<a href="http://croozeus.com/blogs" alt="http://croozeus.com/blogs">
 Croozeus
</a>
<br />
<a href="http://www.python.org" alt="http://www.python.org">
 Python
</a>
...
As you can see, each link behaves like a dictionary, so modifying it is really straightforward.
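Since tags behave like dictionaries, the usual dictionary idioms apply to them as well. A small sketch, assuming the soup from the previous examples (where we just added the alt attributes):

link = soup.a                         # first 'a' tag in the tree
print link.has_key('alt')             # True, we added it above
print link.get('title', u'no title')  # get() with a default avoids KeyError
del link['alt']                       # attributes can be removed, too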
The contents between <p> and </p> can be retrieved with find() and the contents attribute. find() returns only the first matching tag, not all of them, and contents is where BeautifulSoup stores everything inside that tag. The result is a list of elements; in this case, the first is a string and the second is a new parseable element of the tree:
p = soup.find('p')
print p.contents
print p.contents[0]
print p.contents[1]
print p.contents[1].contents
The output is below:
[u'Page content goes here.\n', <h2>And here.</h2>]
Page content goes here.
<h2>And here.</h2>
[u'And here.']
There is a complete set of functions for navigating the tree. For instance, it is possible to start our search at the first ‘p’ tag and then locate its child with the following code:
p = soup.p
h2 = p.findChild()
print h2
The output is below:
<h2>And here.</h2>
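Besides findChild(), every element keeps references to its neighbours, so you can walk up and sideways in the tree as well. A short sketch with a few of these helpers (same soup as before):

h2 = soup.h2
print h2.parent.name              # the enclosing tag ('p' in our page)
print h2.findParent('body').name  # search upwards for a specific tag
print h2.previousSibling          # the text node just before <h2>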
Or start the search at the first ‘h1’ tag, but now looking for the next siblings:
h1 = soup.h1
while h1:
    print h1
    h1 = h1.findNextSibling('h1')
The output is below:
<h1 style="text-align:center">Heading 1</h1>
<h1>The end.</h1>
Time to use this knowledge in a new application: a basic link checker for S60 devices. The idea is to download the contents of a given URL and check every link inside it using BeautifulSoup and urllib. I had problems with some pages when running on the mobile (Python for S60 1.9.5 only): sometimes BeautifulSoup failed while the same pages worked under Python 2.6. The code is below. The time required to detect a bad link could be decreased if a timeout could be set; however, while the PC version of urllib supports timeouts, this is not available in PyS60.
# -*- coding: utf-8 -*-
# Marcelo Barros de Almeida
# marcelobarrosalmeida@gmail.com
# License: GPL 3
import sys
try:
    # http://discussion.forum.nokia.com/forum/showthread.php?p=575213
    # Try to import 'btsocket' as 'socket' - ignored on versions < 1.9.x
    sys.modules['socket'] = __import__('btsocket')
except ImportError:
    pass
import socket
from BeautifulSoup import BeautifulSoup
import os
import e32
import urllib
import hashlib
from appuifw import *

class LCOpener(urllib.FancyURLopener):
    """ For mediawiki it is necessary to change the http agent. See:
        http://wolfprojects.altervista.org/changeua.php
        http://stackoverflow.com/questions/120061/fetch-a-wikipedia-article-with-python
    """
    version = 'Mozilla/5.0'

class LinkChecker(object):
    def __init__(self):
        self.lock = e32.Ao_lock()
        self.dir = "e:\\linkchecker"
        if not os.path.isdir(self.dir):
            os.makedirs(self.dir)
        self.apo = None
        self.url = u''
        self.running = False
        app.title = u"Link Checker"
        app.screen = "normal"
        app.menu = [(u"Check URL", self.check_url),
                    (u"About", self.about),
                    (u"Exit", self.close_app)]
        self.body = Text()
        app.body = self.body

    def close_app(self):
        self.lock.signal()

    def sel_access_point(self):
        """ Select and set the default access point.
            Return the access point object if the selection was done or None if not.
        """
        aps = socket.access_points()
        if not aps:
            note(u"No access points available", "error")
            return None
        ap_labels = map(lambda x: x['name'], aps)
        item = popup_menu(ap_labels, u"Access points:")
        if item is None:
            return None
        apo = socket.access_point(aps[item]['iapid'])
        socket.set_default_access_point(apo)
        return apo

    def about(self):
        note(u"Link checker by Marcelo Barros (marcelobarrosalmeida@gmail.com)", "info")

    def check_url(self):
        if self.running:
            note(u"There is a checking already in progress", u"info")
            return
        self.running = True
        url = query(u"URL to check", "text", self.url)
        if url is not None:
            self.url = url
            self.apo = self.sel_access_point()
            if self.apo:
                self.body.clear()
                self.run_checker()
        self.running = False

    def run_checker(self):
        self.body.add(u"* Downloading page: %s ...\n" % self.url)
        fn = os.path.join(self.dir, 'temp.html')
        try:
            urllib.urlretrieve(self.url, fn)
        except Exception, e:
            self.body.add(repr(e))
            return
        self.body.add(u"* Parsing links ...\n")
        page = open(fn, 'rb').read()
        try:
            soup = BeautifulSoup(page)
        except:
            self.body.add(u"* BeautifulSoup error when decoding html. Aborted.")
            return
        tags = soup.findAll({'img': True, 'a': True})
        links = {}
        bad_links = []
        for tag in tags:
            if tag.has_key('href'):
                link = tag['href']
            elif tag.has_key('src'):    # 'img' tags keep their target in 'src'
                link = tag['src']
            else:
                link = u''
            # just check external links (internal links are not handled)
            if link.startswith(u'http'):
                link = link.split(u'#')[0]
                # using a hash to avoid repeated links
                h = hashlib.md5()
                h.update(link.encode('utf-8'))
                links[h.digest()] = link
        nl = len(links)
        for n, k in enumerate(links):
            link = links[k]
            msg = u"[%d/%d] Checking %s " % (n + 1, nl, link)
            self.body.add(msg)
            (valid, info) = self.check_link(link.encode('utf-8'))
            if valid:
                msg = u"==> Passed\n"
            else:
                msg = u"==> Failed: %s\n" % info
                bad_links.append(link)
            self.body.add(msg)
        msg = u"* Summary: %d links (%d failed)\n" % (nl, len(bad_links))
        self.body.add(msg)
        for link in bad_links:
            self.body.add(u"==> %s failed\n" % link)
        self.body.add(u"* Finished")

    def check_link(self, link):
        """ Check if link (encoded in utf-8) exists.
            Return (True,'') or (False,'error message').
        """
        try:
            page = LCOpener().open(link)
        except Exception, e:
            return (False, unicode(repr(e)))
        else:
            return (True, u'')

lc = LinkChecker()
lc.lock.wait()    # keep the application alive until 'Exit' signals the lock
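The timeout limitation mentioned above can be worked around on the PC side. A minimal sketch, for desktop Python 2.x only (socket.setdefaulttimeout() applies to all new sockets; the URL is just a placeholder):

import socket
import urllib

socket.setdefaulttimeout(10)   # connections stalled for more than 10s raise an error

try:
    urllib.urlretrieve('http://example.com/', 'temp.html')
except Exception, e:
    print 'Failed fast:', e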
Pretty useful functions out there :)!