MYGALE :: API


 

Extending Mygale to make it suit your specific needs is not difficult (presuming you know at least a little about programming in Python). This document describes the Mygale API and other things you need to know for extending the program.

 

First, here's the general process:

  • For each known news site, get the base page (HTML) and extract URLs from it, using a custom extractor object.
  • If there is a next page, get that page too and extract URLs from it as well. Repeat until there is no next page.
  • After collecting the URLs of all the news sites, write the URL data to a Python file (usually called urls.py). (This is simply the format I chose to export it in.)
  • To display the data, a URLDatabase instance imports urls.py. This object can be used for sorting, filtering etc., and it writes an HTML file (output.html) that presents the articles.
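
The process above can be sketched roughly as follows. This is only an illustration, not Mygale's actual code: StubExtractor, collect(), export(), the PAGES table and the news.example.com URLs are all made up for the example, and the "parsing" is faked so the sketch can run without network access.

```python
import pprint


class StubExtractor:
    """Hypothetical stand-in for an extractor; NOT Mygale's real class."""

    def __init__(self):
        self.urls = []        # list of (url, description, date) tuples
        self.nextpage = None  # URL of the next article list page, if any

    def feed(self, html):
        # Fake "parsing": each line is "url|description" or "next|url".
        for line in html.splitlines():
            left, right = line.split("|", 1)
            if left == "next":
                self.nextpage = right
            else:
                self.urls.append((left, right, "0000-00-00"))


def collect(start_url, fetch, extractor_class):
    """Follow .nextpage links until there is none, gathering all tuples."""
    collected, page, seen = [], start_url, set()
    # The 'seen' set is an extra safety net added for this sketch, so a
    # site whose pages link to each other in a circle cannot loop forever.
    while page is not None and page not in seen:
        seen.add(page)
        extractor = extractor_class()  # a fresh instance for every page
        extractor.feed(fetch(page))
        collected.extend(extractor.urls)
        page = extractor.nextpage
    return collected


def export(urls):
    """Render the tuples as Python source text, in the spirit of urls.py."""
    return "urls = " + pprint.pformat(urls) + "\n"


# Canned pages instead of real HTTP requests (assumed data).
PAGES = {
    "http://news.example.com/":
        "http://news.example.com/a.html|Story A\n"
        "next|http://news.example.com/p2",
    "http://news.example.com/p2":
        "http://news.example.com/b.html|Story B",
}

collected = collect("http://news.example.com/", PAGES.__getitem__, StubExtractor)
source = export(collected)
```

The text in source is valid Python, so importing a file with that content gives back the same list of tuples.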

 

URLExtractor

The basic building block for extractors is the URLExtractor class (extractors/urlextractor.py). This is the base class; all extractors derive from it. Usually, when rolling your own extractor, you'll want to do the same. (It *is* possible, though, to roll your own class that is not a subclass of URLExtractor, or even of sgmllib.SGMLParser... as long as the class can handle the correct method calls.)

 

The following methods and attributes are supported by URLExtractor:

 

urls

As the name indicates, a list of URLs collected (so far). Every item of the list is a 3-tuple (url, description, date).

prefix

Can be set if the URLs found are relative; that is, they are written as 'file.html' rather than 'http://news.source.com/file.html'. In such cases, you can set .prefix to e.g. 'http://news.source.com/' to make the URLs absolute.

nextpage

For news sites with article lists of multiple pages, this contains the URL to the next page of articles. Starts out as None.

filter(url)

Returns 1 if the given URL needs to be added to the collection; 0 otherwise. Note that at this stage we only have a URL and no description, so the decision cannot depend on the description. This method is called when scanning the HTML document for URLs.
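
As an illustration, here is a minimal sketch of a filter() that is consulted while the HTML is being scanned. It is not the real URLExtractor: it uses html.parser.HTMLParser from the current standard library rather than sgmllib, and the class name FilteringExtractor, the '/articles/' rule and the sample HTML are assumptions made up for the example.

```python
from html.parser import HTMLParser


class FilteringExtractor(HTMLParser):
    """Hypothetical extractor; shows where filter() fits into the scan."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self._href = None  # href of the link currently being read, if kept
        self._text = []    # pieces of that link's description

    def filter(self, url):
        # Assumed rule for this sketch: keep only links under /articles/.
        return 1 if "/articles/" in url else 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Only the URL is known here; the description comes later.
            if href and self.filter(href):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            desc = "".join(self._text).strip()
            self.urls.append((self._href, desc, "0000-00-00"))
            self._href = None


parser = FilteringExtractor()
parser.feed('<a href="/articles/1.html">First story</a> '
            '<a href="/about.html">About us</a>')
parser.close()
```

After feeding the sample HTML, parser.urls holds only the '/articles/1.html' link; the 'About us' link was rejected by filter().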

get_urls()

Not usually overridden. Returns all URLs collected (so far). If .prefix is set, and the URLs do not start with 'http://', then the prefix will be prepended to them.
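
The prefix behaviour might look something like the sketch below. SketchExtractor is a hypothetical stand-in that only models .urls, .prefix and get_urls(); checking each URL individually is an assumption on my part, and the real method may well differ in detail.

```python
class SketchExtractor:
    """Hypothetical stand-in; only models .urls, .prefix and get_urls()."""

    def __init__(self, prefix=None):
        self.urls = []        # (url, description, date) tuples
        self.prefix = prefix  # e.g. 'http://news.source.com/'

    def get_urls(self):
        if not self.prefix:
            return list(self.urls)
        result = []
        for url, desc, date in self.urls:
            # Assumption: every URL is checked on its own, so absolute
            # URLs pass through untouched and relative ones get the prefix.
            if not url.startswith("http://"):
                url = self.prefix + url
            result.append((url, desc, date))
        return result


ex = SketchExtractor(prefix="http://news.source.com/")
ex.urls.append(("file.html", "Relative link", "0000-00-00"))
ex.urls.append(("http://other.example.com/x.html", "Absolute link", "0000-00-00"))
absolute = ex.get_urls()
```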

nextpagefinder(url, desc)

Overridden if the news site's article list is spread over multiple pages. This method attempts to find the URL that points to the next page, by inspecting both URL and description. (Sometimes it's really easy, and you only have to match a description like "Next..." or something.)

If the link to the next page is found, the .nextpage attribute will be set to that URL, otherwise it will remain None.

If a next page is found, Mygale will get the HTML of that page, and create a new extractor instance to process it (which in turn can yield another next page, etc. etc.).
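
For the easy case, where the link text starts with "Next", an override could be as simple as this sketch. PagedExtractor is a hypothetical stand-in rather than a real URLExtractor subclass, and the sample links are made up.

```python
class PagedExtractor:
    """Hypothetical stand-in; only models .nextpage and nextpagefinder()."""

    def __init__(self):
        self.nextpage = None  # stays None unless a next-page link is seen

    def nextpagefinder(self, url, desc):
        # Assumed rule: the link whose text starts with "next" is the one.
        if desc.strip().lower().startswith("next"):
            self.nextpage = url


ex = PagedExtractor()
links = [
    ("story1.html", "Some story"),
    ("list2.html", "Next..."),
    ("story2.html", "Another story"),
]
for url, desc in links:
    ex.nextpagefinder(url, desc)
```

Here ex.nextpage ends up as 'list2.html'; on a page without such a link it would still be None.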

getdate(url, desc)

Attempts to extract an article's date, using its URL and description. By default, this returns the so-called null-date "0000-00-00". As you will see, this method often remains untouched, because a considerable number of sites don't provide any date info that can be read by the extractor.
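
When a site does put the date in its URLs, an override might look like this sketch. The /YYYY/MM/DD/ URL layout, the DatedExtractor class and the example URLs are assumptions for illustration; each site needs its own pattern.

```python
import re

NULL_DATE = "0000-00-00"
# Assumed URL layout for this sketch: .../YYYY/MM/DD/...
DATE_IN_URL = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")


class DatedExtractor:
    """Hypothetical stand-in; only models getdate()."""

    def getdate(self, url, desc):
        match = DATE_IN_URL.search(url)
        if match:
            # Join the three captured groups into 'YYYY-MM-DD'.
            return "-".join(match.groups())
        return NULL_DATE  # fall back to the null-date


ex = DatedExtractor()
dated = ex.getdate("http://news.example.com/2004/03/17/story.html", "A story")
undated = ex.getdate("http://news.example.com/story.html", "A story")
```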

finalize()

This is called when all the parsing and extracting is done. If you want to do something with .nextpage, .urls, or other attributes, then this is the place. Often used to add .prefix to .nextpage, if necessary.
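
A sketch of the typical use, making .nextpage absolute once parsing is done. FinalizingExtractor is a hypothetical stand-in, and the 'http://' check mirrors the one described for get_urls() rather than anything confirmed about the real code.

```python
class FinalizingExtractor:
    """Hypothetical stand-in; only models .prefix, .nextpage, finalize()."""

    def __init__(self, prefix=None):
        self.prefix = prefix
        self.nextpage = None

    def finalize(self):
        # Make a relative next-page link absolute; leave everything else.
        if (self.nextpage and self.prefix
                and not self.nextpage.startswith("http://")):
            self.nextpage = self.prefix + self.nextpage


ex = FinalizingExtractor(prefix="http://news.source.com/")
ex.nextpage = "list2.html"  # as if nextpagefinder() had found it
ex.finalize()
```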

URLDatabase

 

...

 

 

 

<TODO: add an annotated extractor file as an example>

 

 

 

