MYGALE :: API
Extending Mygale to make it suit your specific needs is not difficult (presuming
you know at least a little about programming in Python). This document describes
the Mygale API and other things you need to know for extending the program.
First, here's the general process:
- For all the news sites known, get the base page (HTML) and extract URLs
from it, using a custom extractor object.
- If there is a next page, get that page too and extract URLs from it as well.
Repeat until there is no next page.
- After collecting all the URLs of all the news sites, write the URL data
to a Python file (usually called urls.py). (This is simply the format I chose
to export it in.)
- When displaying these data, the file urls.py is imported by a URLDatabase
instance. This object can be used for sorting, filtering etc., and it will
write an HTML file (output.html) that presents the articles.
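The collecting steps above can be sketched roughly as follows. This is only an illustration, not Mygale's actual code: collect_urls, FakeExtractor, PAGES and the "url|description" page format are all made up for the example, and the feed() call is an assumption borrowed from the SGMLParser interface. Only .urls, .nextpage, get_urls() and finalize() come from the API described below.

```python
class FakeExtractor:
    """Stand-in for a real extractor; mimics only the documented attributes."""

    def __init__(self):
        self.urls = []
        self.nextpage = None

    def feed(self, html):
        # A "page" here is just lines of "url|description"; a line that
        # starts with "next|" names the next page. Real pages are HTML.
        for line in html.splitlines():
            left, right = line.split("|", 1)
            if left == "next":
                self.nextpage = right
            else:
                self.urls.append((left, right, "0000-00-00"))

    def finalize(self):
        pass

    def get_urls(self):
        return self.urls


def collect_urls(start_url, fetch, make_extractor):
    """Follow the chain of next pages for one site and gather all URLs."""
    collected, seen, url = [], set(), start_url
    while url is not None and url not in seen:  # 'seen' guards against loops
        seen.add(url)
        extractor = make_extractor()  # a fresh extractor for every page
        extractor.feed(fetch(url))
        extractor.finalize()
        collected.extend(extractor.get_urls())
        url = extractor.nextpage
    return collected


# Two fake pages; the first one links to the second.
PAGES = {
    "p1": "a.html|First article\nnext|p2",
    "p2": "b.html|Second article",
}
all_urls = collect_urls("p1", PAGES.__getitem__, FakeExtractor)
```

Creating a new extractor per page mirrors the description of nextpagefinder further down; the 'seen' set is an extra safety net that the text does not mention.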
URLExtractor
The basic building block for extractors is the URLExtractor class (extractors/urlextractor.py).
This is the base class; all extractors derive from it. Usually, when rolling
your own extractor, you'll want to do the same. (It *is* possible, though, to
roll your own class that is not a subclass of URLExtractor, or even of sgmllib.SGMLParser...
as long as the class can handle the correct method calls.)
The following methods and attributes are supported by URLExtractor:
urls
As the name indicates, a list of URLs collected (so far). Every item of the
list is a 3-tuple (url, description, date).
prefix
Can be set if the URLs found are relative; that is, they are written as 'file.html'
rather than 'http://news.source.com/file.html'. In such cases, you can set
.prefix to e.g. 'http://news.source.com/' to make the URLs absolute.
nextpage
For news sites with article lists of multiple pages, this contains the URL
to the next page of articles. Starts out as None.
filter(url)
Returns 1 if the given URL needs to be added to the collection; 0 otherwise.
Note that at this stage we only have a URL and no description, so the decision
cannot depend on the description. This method is called when scanning the HTML
document for URLs.
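As an illustration, a filter that keeps only article pages might look like the sketch below. The class name StoryFilter and the '/stories/' path are invented for the example; in a real extractor this method would normally live on a URLExtractor subclass.

```python
class StoryFilter:
    """Hypothetical example; in practice this would subclass URLExtractor."""

    def filter(self, url):
        # Keep only links below /stories/ that point at an HTML page.
        # Only the URL is available here, not the link description.
        if "/stories/" in url and url.endswith(".html"):
            return 1
        return 0
```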
get_urls()
Not usually overridden. Returns all URLs collected (so far). If .prefix is
set, and the URLs do not start with 'http://', then the prefix will be prepended
to them.
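The prefix rule can be pictured with a small standalone function. This is a sketch of the behaviour as described here, not the actual get_urls() code; the name absolutize is made up.

```python
def absolutize(urls, prefix=None):
    """Return (url, description, date) tuples with relative URLs prefixed."""
    if not prefix:
        return list(urls)
    result = []
    for url, desc, date in urls:
        # Only URLs that do not already start with 'http://' get the prefix.
        if not url.startswith("http://"):
            url = prefix + url
        result.append((url, desc, date))
    return result
```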
nextpagefinder(url, desc)
Overridden if the news site's article list is spread over multiple pages.
This method attempts to find the URL that points to the next page, by inspecting
both URL and description. (Sometimes it's really easy, and you only have to
match a description like "Next..." or something.)
If the link to the next page is found, the .nextpage attribute will be set
to that URL, otherwise it will remain None.
If a next page is found, Mygale will get the HTML of that page, and create
a new extractor instance to process it (which in turn can yield another next
page, etc. etc.).
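For the easy case mentioned above, where the link text starts with "Next", an override could look something like this. The class name PagedSite is hypothetical, and keeping only the first match is a choice made for this sketch.

```python
class PagedSite:
    """Hypothetical example; in practice this would subclass URLExtractor."""

    def __init__(self):
        self.nextpage = None  # stays None if no next-page link is found

    def nextpagefinder(self, url, desc):
        # Remember the first link whose description starts with "next",
        # ignoring case and surrounding whitespace.
        if self.nextpage is None and desc.strip().lower().startswith("next"):
            self.nextpage = url
```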
getdate(url, desc)
Attempts to extract an article's date, using its URL and description. By
default, this returns the so-called null date "0000-00-00". As you will see,
this method is often left untouched, because many sites don't provide any
date information that the extractor can read.
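When a site does put the date in its URLs, an override might look like the sketch below. The '/YYYY/MM/DD/' URL layout and the class name DatedSite are assumptions, as is the "YYYY-MM-DD" result format, which is merely inferred from the shape of the null date.

```python
import re

# Matches a date embedded in the path, e.g. ".../2003/04/17/story.html".
DATE_IN_URL = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")


class DatedSite:
    """Hypothetical example; in practice this would subclass URLExtractor."""

    def getdate(self, url, desc):
        match = DATE_IN_URL.search(url)
        if match:
            return "%s-%s-%s" % match.groups()
        return "0000-00-00"  # fall back to the null date
```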
finalize()
This is called when all the parsing and extracting is done. If you want to
do something with .nextpage, .urls, or other attributes, then this is the
place. Often used to add .prefix to .nextpage, if necessary.
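The typical use, adding .prefix to .nextpage, could be sketched like this. PrefixingExtractor is a made-up name, and the 'http://' check simply reuses the rule described for get_urls().

```python
class PrefixingExtractor:
    """Hypothetical example; in practice this would subclass URLExtractor."""

    def __init__(self, prefix=None):
        self.prefix = prefix
        self.nextpage = None

    def finalize(self):
        # Make a relative next-page link absolute, if there is one.
        if (self.nextpage and self.prefix
                and not self.nextpage.startswith("http://")):
            self.nextpage = self.prefix + self.nextpage
```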
URLDatabase
...
<TODO: add an annotated extractor file as an example>
<mygale> <python>