Network programming for PyS60 (XIV)

by Marcelo Barros

Have you heard about Beautiful Soup? Beautiful Soup is a Python HTML/XML parser with a lot of useful methods for collecting data from pages and for navigating, searching, and modifying the parse tree. In this post I will show some usage examples of BeautifulSoup and a link checker application for S60 that combines BeautifulSoup and urllib.

Suppose you have a large (and messy) HTML file. How about displaying it fully organized and indented? That is what the prettify() method does. Just create a BeautifulSoup object and feed it your HTML, like below:

from BeautifulSoup import BeautifulSoup
 
html = u"""<html><body><h1 style="text-align:center">Heading 1</h1>
<p>Page content goes here.
<h2>And here.</h2></p><a href="http://croozeus.com/blogs" 
alt="Croozeus link">Croozeus</a><br/>
<a href="http://www.python.org">Python</a><br/>
<h1>The end.</h1></body>
</html>"""
 
soup = BeautifulSoup(html)
print soup.prettify()

The output is below:

<html>
 <body>
  <h1 style="text-align:center">
   Heading 1
  </h1>
  <p>
   Page content goes here.
   <h2>
    And here.
   </h2>
  </p>
  <a href="http://croozeus.com/blogs" alt="Croozeus link">
   Croozeus
  </a>
  <br />
  <a href="http://www.python.org">
   Python
  </a>
  <br />
  <h1>
   The end.
  </h1>
 </body>
</html>

The parse tree supports nice operations like findAll(). For instance, how about printing all links in the page? Just pass a dictionary with the tags you want as the argument:

links = soup.findAll({'a':True})
for link in links:
    print "-->", link['href'].encode('utf-8')

The output is below:

--> http://croozeus.com/blogs
--> http://www.python.org

Now suppose you want to add the alt attribute to all links, modifying the parse tree. Simple:

for link in links:
    link['alt'] = link['href']
print soup.prettify()

The output is below:

...
  <a href="http://croozeus.com/blogs" alt="http://croozeus.com/blogs">
   Croozeus
  </a>
  <br />
  <a href="http://www.python.org" alt="http://www.python.org">
   Python
  </a>
...

As you can see, each link behaves like a dictionary, so it is really straightforward to modify.
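
Tags support the rest of the dictionary-style API as well, so you can read, test and delete attributes. A small sketch, continuing the example above (BeautifulSoup 3 API):

for link in soup.findAll('a'):
    print link.get('alt')      # returns None if the attribute is missing
    if link.has_key('alt'):
        del link['alt']        # attributes can be removed too
    print link.attrs           # remaining attributes as (name, value) pairs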

The contents between <p> and </p> can be retrieved with find() and the contents attribute. find() only looks for the first matching tag, and contents is where BeautifulSoup stores everything contained in that tag. The result is a list of elements; in this case, the first is a string and the second is a new parseable element of the tree:

p = soup.find('p')
print p.contents
print p.contents[0]
print p.contents[1]
print p.contents[1].contents

The output is below:

[u'Page content goes here.\n', <h2>And here.</h2>]
Page content goes here.
<h2>And here.</h2>
[u'And here.']
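
To make that distinction explicit, you can check each element's type; NavigableString and Tag are both exported by the BeautifulSoup module (a small sketch continuing the example above):

from BeautifulSoup import NavigableString, Tag

for element in p.contents:
    if isinstance(element, NavigableString):
        print "string:", element.strip()
    elif isinstance(element, Tag):
        print "tag   :", element.name   # a Tag, itself navigable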

There is a complete set of functions for navigating the tree. For instance, it is possible to start the search at the first ‘p’ tag and then locate its child with the following code:

p = soup.p
h2 = p.findChild()
print h2

The output is below:

<h2>And here.</h2>

Or start the search at the first ‘h1’ tag, this time looking for the next siblings:

h1 = soup.h1
while h1:
    print h1
    h1 = h1.findNextSibling('h1')

The output is below:

<h1 style="text-align:center">Heading 1</h1>
<h1>The end.</h1>

Time to use this knowledge in a new application: a basic link checker for S60 devices. The idea is to download the contents of a URL and check every link inside it, using BeautifulSoup and urllib. I had problems with some pages when running on the phone (Python for S60 1.9.5 only): sometimes BeautifulSoup failed on pages that parsed fine under Python 2.6. The code is below. The time needed to detect a bad link could be reduced if a timeout could be set; the PC version of urllib supports a timeout, but that is not available in PyS60.

[Screenshots: typing the URL, checking in progress, and the results screen]

# -*- coding: utf-8 -*-
# Marcelo Barros de Almeida
# marcelobarrosalmeida@gmail.com
# License: GPL 3
import sys
try:
    # http://discussion.forum.nokia.com/forum/showthread.php?p=575213
    # Try to import 'btsocket' as 'socket' - ignored on versions < 1.9.x
    sys.modules['socket'] = __import__('btsocket')
except ImportError:
    pass
import socket
from BeautifulSoup import BeautifulSoup
import os
import e32
import urllib
import hashlib
from appuifw import *
 
class LCOpener(urllib.FancyURLopener):
    """ For mediawiki it is necessary to change the http agent.
        See:
        http://wolfprojects.altervista.org/changeua.php
        http://stackoverflow.com/questions/120061/fetch-a-wikipedia-article-with-python
    """
    version = 'Mozilla/5.0'
 
class LinkChecker(object):
    def __init__(self):
        self.lock = e32.Ao_lock()
        self.dir = "e:\\linkchecker"
        if not os.path.isdir(self.dir):
            os.makedirs(self.dir)
        self.apo = None
        self.url = u''
        self.running = False
        app.title = u"Link Checker"
        app.screen = "normal"
        app.menu = [(u"Check URL",self.check_url),
                    (u"About", self.about),
                    (u"Exit", self.close_app)]
        self.body = Text()
        app.body = self.body
 
    def close_app(self):
        self.lock.signal()
 
    def sel_access_point(self):
        """ Select and set the default access point.
            Return the access point object if the selection was done or None if not
        """
        aps = socket.access_points()
        if not aps:
            note(u"No access points available","error")
            return None
 
        ap_labels = map(lambda x: x['name'], aps)
        item = popup_menu(ap_labels,u"Access points:")
        if item is None:
            return None
 
        apo = socket.access_point(aps[item]['iapid'])
        socket.set_default_access_point(apo)
 
        return apo
 
    def about(self):
        note(u"Link checker by Marcelo Barros (marcelobarrosalmeida@gmail.com)","info")
 
    def check_url(self):
        if self.running:
            note(u"There is a checking already in progress",u"info")
            return
        self.running = True
        url = query(u"URL to check", "text", self.url)
        if url is not None:
            self.url = url
            self.apo = self.sel_access_point()
            if self.apo:
                self.body.clear()
                self.run_checker()
        self.running = False
 
    def run_checker(self):
        self.body.add(u"* Downloading page: %s ...\n" % self.url)
        fn = os.path.join(self.dir,'temp.html')
        try:
            urllib.urlretrieve(self.url,fn)
        except Exception, e:
            self.body.add(repr(e))
            return
        self.body.add(u"* Parsing links ...\n")
        page = open(fn,'rb').read()
        try:
            soup = BeautifulSoup(page)
        except:
            self.body.add(u"* BeautifulSoup error when decoding html. Aborted.")
            return
        tags = soup.findAll({'img':True,'a':True})
        links = {}
        bad_links = []
        for n,tag in enumerate(tags):
            if tag.has_key('href'):
                link = tag['href']
            elif tag.has_key('src'):
                # img tags keep their link in the src attribute
                link = tag['src']
            else:
                link = u''
            # just check external links 
            if link.startswith(u'http'):
                # not handling internal links
                link = link.split(u'#')[0]
                # using a hash to avoid repeated links
                h = hashlib.md5()
                h.update(link.encode('utf-8'))
                links[h.digest()] = link
        nl = len(links)
        for n,k in enumerate(links):
            link = links[k]
            msg = u"[%d/%d] Checking %s " % (n+1,nl,link)
            self.body.add(msg)
            (valid,info) = self.check_link(link.encode('utf-8'))
            if valid:
                msg = u"==> Passed\n"
            else:
                msg = u"==> Failed: %s\n" % info
                bad_links.append(link)
            self.body.add(msg)
        msg = u"* Summary: %d links (%d failed)\n" % (nl,len(bad_links))
        self.body.add(msg)
        for link in bad_links:
            self.body.add(u"==> %s failed\n" % link)
        self.body.add(u"* Finished")
 
    def check_link(self,link):
        """ Check if link (encoded in utf-8) exists.
            Return (True,'') or (False,'error message')
        """
        try:
            page = LCOpener().open(link)
        except Exception, e:
            return (False,unicode(repr(e)))
        else:
            return (True,u'')
 
lc = LinkChecker()
lc.lock.wait() # keep the application alive until Exit is selected
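
As mentioned above, detecting a dead link is slow because PyS60's urllib offers no way to set a timeout. On desktop Python a global socket timeout is honoured by urllib, so a workaround there looks like the sketch below; I have not verified whether btsocket on the phone accepts setdefaulttimeout, hence the guard.

import socket
import urllib

# Sketch only: on desktop Python 2.6 this makes urllib connections give up
# after roughly 10 seconds. PyS60's btsocket may not expose setdefaulttimeout.
if hasattr(socket, 'setdefaulttimeout'):
    socket.setdefaulttimeout(10)

try:
    urllib.urlopen('http://croozeus.com/blogs').read(64)
    print "Link looks alive"
except Exception, e:
    print "Failed:", repr(e)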
