MYGALE



2002.07.12: Mygale is now a project on Sourceforge.

Due to important changes in the code, some of the explanations below are not accurate anymore. Updated documentation will follow Real Soon Now.



:: What is it? ::

Mygale is a news-gathering webcrawler. In other words, it's a little program that crawls the web, looking for (new) articles related to Python.
When run regularly (e.g. daily), it provides you with an update of new Python-related articles.

:: Download/License ::

Mygale is published under the GPL.
The latest version can be downloaded at Sourceforge.

:: News sites currently supported ::

  • Linux Journal
  • Linux Today
  • Dr. Dobbs
  • IBM Developerworks
  • Byte
  • Infoworld
  • Newsforge
  • linux.com
  • O'Reilly Network
  • ZDNet
  • LinuxProgramming /* new in 0.6.2. */
  • Sourceforge
  • Activestate Python Cookbook
  • Savannah
  • InformIT /* new in 0.6.3. */
Note that sites like Sourceforge, Savannah and Python Cookbook are also included. These are not really news sites, but the release of a new Python project or snippet is worth an entry in the list, methinks. (If you disagree, you are of course free to edit servers.py and remove the sites you don't like.)

:: How it works ::

The principle is simple. Mygale has a list of news sites it is going to search, for example, Linux Journal, Newsforge, Dr. Dobbs, etc. For each of these sites, it downloads a base page and extracts URLs to articles from the HTML. If the page leads to more pages with articles, then those will undergo the same treatment.

Because all these sites store and display their article info in different ways, there are extractor objects for every news site, dealing with the specific page format.
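
To make the idea of per-site extractor objects a bit more concrete, here is a minimal sketch. It is not Mygale's actual code: the class names (LinkCollector, Extractor, ExampleSiteExtractor), the is_article() hook and the example.com URL are all made up for illustration, and it uses the modern Python standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs from the <a> tags of a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


class Extractor:
    """Hypothetical base class; each news site gets its own subclass."""

    base_url = ""

    def is_article(self, href):
        raise NotImplementedError

    def extract(self, html):
        collector = LinkCollector()
        collector.feed(html)
        # Resolve relative links against the site's base URL.
        return [
            (urljoin(self.base_url, href), title)
            for href, title in collector.links
            if self.is_article(href)
        ]


class ExampleSiteExtractor(Extractor):
    """Assumed page format: article links start with /article/."""

    base_url = "http://news.example.com/"

    def is_article(self, href):
        return href.startswith("/article/")
```

Only the is_article() rule differs per site in this sketch; a real site may well need its own parsing for titles and dates too.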

Retrieving the articles is the first step in the process. The second one is the generation of an output file (in HTML), with the articles, usually sorted in a specific way, possibly filtered, etc.
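
The second step could look roughly like the sketch below. It assumes each article is a dict with "date", "title" and "url" keys and that dates are "YYYY-MM-DD" strings; both the render() function and that data layout are assumptions for illustration, not Mygale's real output code.

```python
from html import escape


def render(articles):
    """Return an HTML list of articles, newest first."""
    # ISO-style date strings sort correctly as plain strings.
    ordered = sorted(articles, key=lambda a: a["date"], reverse=True)
    items = [
        '<li>%s <a href="%s">%s</a></li>'
        % (escape(a["date"]), escape(a["url"], quote=True), escape(a["title"]))
        for a in ordered
    ]
    return "<ul>\n%s\n</ul>" % "\n".join(items)
```

Filtering would simply be a matter of dropping items from the list before it is handed to render().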

For extending or changing Mygale, have a look at the Mygale API.

:: Command line interface ::

mygale [options] <command> [<argument>]

where command can be:
  • get:
    get all gets articles from all the news sites, while get <newssite> only gets articles from a specific one.
  • display:
    display all displays all articles; display <newssite> displays articles from a specific one; display persistent displays all articles from the persistent database.
  • list:
    prints the names of all known news sources to stdout.
  • clear:
    ...
Normally, you'll want to use mygale get all and mygale display all.
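
For illustration, the command/argument part of that syntax could be parsed as in the sketch below. This is an assumption about how one might do it with argparse, not the way Mygale itself handles its command line; the [options] part is left out on purpose, and parse_args() is a made-up name.

```python
import argparse

# Command names as listed above.
COMMANDS = ("get", "display", "list", "clear")


def parse_args(argv):
    """Parse '<command> [<argument>]'; options are not handled here."""
    parser = argparse.ArgumentParser(prog="mygale")
    parser.add_argument("command", choices=COMMANDS)
    parser.add_argument("argument", nargs="?", default=None)
    return parser.parse_args(argv)
```

With this, parse_args(["get", "all"]) yields command "get" and argument "all", while parse_args(["list"]) leaves the argument as None.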

:: Known bugs and problems ::

  • date problems
  • irrelevant articles
  • duplicate articles
First of all, Mygale isn't perfect. Extracting info like dates or a meaningful title from an HTML document is not as easy as it sounds.

Sometimes you're lucky, and the date is part of the URL, but all too often this is not the case. Articles without a date usually get the so-called null-date, "0000-00-00". This is far from ideal: new articles are supposed to show up at the beginning of the output document, but articles from sites with no (extractable) date info would end up at the other end!

To avoid this problem, Mygale attempts to substitute a date. To do so, it must know if an article is "new"; if that is the case, the article gets the current date rather than the null-date. This only works well if Mygale is run regularly, though. Running it only once a week, say, gives every article that appeared during that week the same date.
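
The substitution rule can be sketched as follows. The assign_date() function and the "known" dict (URLs seen in earlier runs, mapped to the date stored for them) are assumptions made up for this example; how Mygale really tracks "new" articles may differ.

```python
import datetime

NULL_DATE = "0000-00-00"


def assign_date(url, extracted_date, known, today=None):
    """Pick a date for an article.

    known: dict mapping URLs seen in earlier runs to their stored date.
    """
    # A real date extracted from the page or URL always wins.
    if extracted_date and extracted_date != NULL_DATE:
        return extracted_date
    # Seen before: keep whatever date was stored back then.
    if url in known:
        return known[url]
    # New article without a date: substitute the current date.
    return (today or datetime.date.today()).isoformat()
```

The optional today parameter is only there so the behaviour can be checked with a fixed date.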

Aside from the date problem, the spider is dependent on the search mechanisms of the news sites. This doesn't always work well. When searching for "python", they often return articles which only mention this word once, and which are unrelated to Python programming. While this can't be helped, some searches also return articles that don't mention Python at all! You wonder how such hits end up in the result list...
So the program cannot guarantee that all articles are relevant. Nor can it guarantee that there are no duplicate articles: the program will find and eliminate duplicate URLs, but articles from different sites that are essentially the same will stay in. A notorious example of this is Python-URL. /* This problem might be fixed in the future. */
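
Eliminating duplicate URLs is the easy part, as the sketch below shows. It assumes articles are dicts with a "url" key and treats two URLs as equal if they only differ in their #fragment; both the unique_by_url() name and that normalization are my assumptions here. Note that it does nothing about the same article appearing under different URLs.

```python
from urllib.parse import urldefrag


def unique_by_url(articles):
    """Keep the first article for each URL, preserving order."""
    seen = set()
    result = []
    for article in articles:
        # Ignore the fragment so page.html and page.html#top count as one.
        key = urldefrag(article["url"]).url
        if key not in seen:
            seen.add(key)
            result.append(article)
    return result
```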

Aside from these known problems, which may or may not be solved in future versions, there might be some real bugs. When I see them, I will (try to) fix them, but I only have so much time...
If you encounter a bug, you can mail me, or try to fix it yourself (which will be greatly appreciated). (After all, it is open source. ;-)