MYGALE 2002.07.12: Mygale is now a project on Sourceforge.
:: How it works ::

The principle is simple. Mygale has a list of news sites to search, for example Linux Journal, Newsforge, Dr. Dobb's, etc. For each of these sites, it downloads a base page and extracts the URLs of articles from the HTML. If the page leads to more pages with articles, those undergo the same treatment. Because all these sites store and display their article info in different ways, there is an extractor object for every news site that deals with its specific page format.

Retrieving the articles is the first step in the process. The second is generating an output file (in HTML) containing the articles, usually sorted in a specific way, possibly filtered, etc.

For extending or changing Mygale, have a look at the Mygale API.

:: Command line interface ::

mygale [options] <command> [<argument>]

where command can be:
:: Known bugs and problems ::
Sometimes you're lucky, and the date is part of the URL, but all too often this is not the case. Articles without a date usually get the so-called null-date, "0000-00-00". This is far from ideal, especially when new articles are supposed to show up at the beginning of the output document: articles from sites with no (extractable) date info would end up at the other end!

To avoid this problem, Mygale attempts to substitute a date. To do so, it must know whether an article is "new"; if so, the article gets the current date rather than the null-date. This only works well if Mygale is run regularly, though: running it, say, once a week gives every new article found in that run the same date.

Aside from the date problem, the spider depends on the search mechanisms of the news sites, and these don't always work well. When searching for "python", they often return articles that mention the word only once and are unrelated to Python programming. While this can't be helped, some searches also return articles that don't mention Python at all! You wonder how such hits end up in the result list... In any case, the program cannot guarantee that all articles are relevant.

Nor can it guarantee that there are no duplicate articles: the program finds and eliminates duplicate URLs, but articles from different sites that are essentially the same stay in. A notorious example of this is Python-URL. /* This problem might be fixed in the future. */

Aside from these known problems, which may or may not be solved in future versions, there might be some real bugs. When I see them, I will (try to) fix them, but I only have so much time... When you encounter a bug, you can mail me, or try to fix it yourself (which will be greatly appreciated). (After all, it is open source. ;-)
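The date substitution and duplicate-URL elimination described in this section can be illustrated with a minimal sketch. This is not Mygale's actual code: the function names, the dictionary layout of an article, and the `seen_urls` set (URLs remembered from earlier runs) are all assumptions made for illustration. Only the null-date value "0000-00-00" comes from the text.

```python
from datetime import date

NULL_DATE = "0000-00-00"  # the null-date mentioned in the text


def drop_duplicate_urls(articles):
    """Keep only the first article for each URL (hypothetical helper).

    Articles with different URLs but the same content are NOT detected,
    which matches the limitation described above.
    """
    seen = set()
    unique = []
    for article in articles:
        if article["url"] in seen:
            continue
        seen.add(article["url"])
        unique.append(article)
    return unique


def assign_dates(articles, seen_urls, today=None):
    """Give undated *new* articles the current date (hypothetical helper).

    seen_urls is assumed to hold the URLs found in earlier runs; an
    undated article whose URL is in that set keeps the null-date.
    """
    today = today or date.today().isoformat()
    result = []
    for article in articles:
        article = dict(article)  # copy, so the input is left untouched
        if article.get("date", NULL_DATE) == NULL_DATE:
            is_new = article["url"] not in seen_urls
            article["date"] = today if is_new else NULL_DATE
        result.append(article)
    return result


def newest_first(articles):
    """Sort by date string, newest first; null-dates end up last."""
    return sorted(articles, key=lambda a: a["date"], reverse=True)
```

Because the dates are "YYYY-MM-DD" strings, plain string comparison orders them chronologically, and the null-date sorts after every real date when the list is reversed. The sketch also shows why infrequent runs are a problem: every new undated article found in one run receives the same `today` value.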