The web is the largest collection of information known, and it is open for anyone to analyze. So, while “hard” AGI (Artificial General Intelligence) may be in a three-decade-long slump, using the web as a vast database for data mining, analysis, and machine learning is a fruitful field of endeavor. In addition, the web makes millions of users available to give rapid feedback to ingenious programmers. The book "Programming Collective Intelligence" is aimed squarely at using this collected data and feedback from the web to build intelligent applications.


I acquired the book when it first came out in October 2007. After reading the first third of it and scanning the rest, I “loaned” it out to my computer scientist nephew but soon began to miss it. After acquiring a second copy, I began a more thorough reading and used tidbits of it in my own web scripting exploits. Thus, in contrast to my usual habit of devouring good books in a wild weekend orgy of reading, I have slowly absorbed the book’s contents over the course of a year, and I find that to be a valuable way to use it. It is a useful hands-on reference for active web programming.


The book covers many varieties of web data mining and analysis, and provides actual, working code to support each subject. The code is all in Python, which is especially useful to Python hackers, but it is also quite readable and understandable to just about any programmer.


Many of the algorithms and techniques covered are accompanied by an open source tool for doing the analysis, which makes it very easy to experiment with and use the coding techniques. There are also discussions of several open APIs for various web sites, such as Yahoo, Delicious, MySpace, etc. At the end of the day, however, I find the most useful tool to be Beautiful Soup, the Python library for screen scraping sites that offer no standard API. The book covers it along with the Python Imaging Library, pysqlite, NumPy, matplotlib, and Mark Pilgrim's Universal Feed Parser.


Author Toby Segaran states in the preface that no specialized mathematical knowledge is required, and he does a good job of explaining every concept he introduces. Speaking as someone with an undergraduate degree in math, I find that the coverage of statistical analysis is really the heart of the book. Mathematical and statistical topics covered include Euclidean distance, Pearson correlation coefficients, weighted means, Tanimoto coefficients, conditional probabilities, Gini impurity, entropy, variance, Gaussian functions, and dot products. I believe the math is accessible to most programmers, but YMMV. Do not buy the book unless you are interested in statistical analysis.
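
To give a flavor of the similarity measures named above, here is a minimal sketch in plain Python. It is an illustration written for this review, not code taken from the book; the function names are my own, and I assume the two inputs are equal-length, non-empty lists of numbers (or sets, for the Tanimoto coefficient).

```python
from math import sqrt


def euclidean_similarity(a, b):
    """Map Euclidean distance to a similarity in (0, 1]; identical inputs give 1.0."""
    distance = sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length, non-empty sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = sum((x - mean_a) ** 2 for x in a)
    spread_b = sum((y - mean_b) ** 2 for y in b)
    if spread_a == 0 or spread_b == 0:
        return 0.0  # assumption: treat a constant sequence as uncorrelated
    return covariance / sqrt(spread_a * spread_b)


def tanimoto(set_a, set_b):
    """Tanimoto coefficient for sets: size of the intersection over size of the union."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)
```

For example, pearson([1, 2, 3], [2, 4, 6]) is 1.0 up to floating-point rounding, because the second list is a perfect linear function of the first, even though the Euclidean distance between the two lists is not zero.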


Each chapter explores a different way of analyzing and using data collected from the web to solve a particular kind of problem. One chapter explores web crawlers and page ranking and includes a version of Google's PageRank. Another chapter shows how to make a site that recommends movies to a user based on how similar their likes and dislikes are to those in a large database of other users' ratings, and applies the same approach to bookmarks culled from del.icio.us using that site's standard API. Other kinds of web apps covered include the use of Google Maps and the analysis of word frequency in textual material. There is also coverage of spam filtering software.
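
The recommendation idea can be sketched in a few lines: score each item the user has not rated by a weighted mean of other users' ratings, with each rating weighted by how similar that other user is. The code below is a hedged illustration of that idea, not the book's implementation; the function names, the distance-based similarity, and the tiny ratings table (made-up users and films) are all assumptions of mine.

```python
from math import sqrt


def shared_item_similarity(mine, theirs):
    """Distance-based similarity over items both users rated; 0.0 if none are shared."""
    shared = [item for item in mine if item in theirs]
    if not shared:
        return 0.0
    distance = sqrt(sum((mine[item] - theirs[item]) ** 2 for item in shared))
    return 1.0 / (1.0 + distance)


def recommend(ratings, user, similarity=shared_item_similarity):
    """Rank items the user has not rated by a similarity-weighted mean rating."""
    totals = {}
    weights = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore users with nothing in common
        for item, score in other_ratings.items():
            if item in ratings[user]:
                continue  # only recommend unseen items
            totals[item] = totals.get(item, 0.0) + sim * score
            weights[item] = weights.get(item, 0.0) + sim
    ranked = [(totals[item] / weights[item], item) for item in totals]
    return sorted(ranked, reverse=True)


# Hypothetical data, invented purely for illustration.
ratings = {
    "ann": {"Film A": 5.0, "Film B": 1.0},
    "bob": {"Film A": 5.0, "Film B": 1.0, "Film C": 4.0},
    "cat": {"Film A": 1.0, "Film B": 5.0, "Film C": 2.0, "Film D": 3.0},
}
suggestions = recommend(ratings, "ann")
```

Here "bob" agrees with "ann" on every shared film, so his opinion of "Film C" dominates its predicted score, while "cat", who disagrees with her, contributes very little.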


Algorithms that are given extensive coverage include Bayesian classifiers, decision tree classifiers, neural networks, genetic programming and genetic algorithms, support vector machines, k-nearest neighbors, hierarchical and k-means clustering, multidimensional scaling, non-negative matrix factorization, and optimization with cost functions and simulated annealing. By now you should be getting the idea that this is a very technical book, but don't let that scare you off. If you are interested in this sort of thing, the book will walk you through it and help you use the working code it freely offers. It does not try to make you a world class expert in each and every technique, which of course would be impossible in a book that summarizes such a broad area.
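
As one small example from that list, simulated annealing minimizes a cost function by trying random changes, always accepting improvements, and accepting worse solutions with a probability that shrinks as a "temperature" cools. The sketch below is my own minimal version under stated assumptions: solutions are lists of floats, and the parameter names and default values are illustrative choices rather than anything prescribed by the book.

```python
import math
import random


def anneal(cost, start, step=1.0, temperature=100.0, cooling=0.95, rng=None):
    """Minimize cost(solution) over lists of floats; returns (best, best_cost)."""
    rng = rng or random.Random()
    current = list(start)
    current_cost = cost(current)
    best, best_cost = list(current), current_cost
    while temperature > 0.01:
        # Nudge one randomly chosen coordinate by a small random amount.
        i = rng.randrange(len(current))
        candidate = list(current)
        candidate[i] += rng.uniform(-step, step)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / temperature).
        if delta < 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
        temperature *= cooling
    return best, best_cost


def sum_of_squares(v):
    """Toy cost function with its minimum at the origin."""
    return sum(x * x for x in v)
```

Calling anneal(sum_of_squares, [5.0, -3.0]) returns the best solution seen and its cost. Because the best solution is tracked separately from the current one, the result is never worse than the starting point, though with so few iterations it will not necessarily be close to the true minimum.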


Web sites, blog entries, and similar data sets can be analyzed in many different ways to find similarities and differences, and filtered to pick out the significant features of the data from all the noise. One section of the book goes through political blogs to find these kinds of trends in the data. Another chapter analyzes stock market data from Wall Street to find trends and tendencies.


Genetic programming and genetic algorithms are covered, and working code is introduced that can be used to play around with genetic programming in all sorts of situations. This is a real strength of the book: it walks you through the creation of object-oriented, working code that you can apply to problems and applications completely unrelated to the specific problems explored in the book. For instance, the file gp.py, created for genetic programming, is applied in the book to a simple AI game and also to analyzing a specific mathematical function. However, the classes and functions in gp.py can be used equally well to apply genetic programming to other subjects. The same is true for the book's coverage of neural networks and other techniques. The book creates jumping off points from which you can hack and explore in endless directions and for endless hours. I would not be surprised if the book starts a few intrepid souls down pathways that lead to significant new applications. For most of us, it will be an educational exercise that at least allows us to better appreciate the creations of those intrepid few.
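
To show how little code the evolutionary idea needs, here is a sketch of a simple genetic algorithm. Note that this is deliberately the simpler cousin of the tree-based genetic programming described above: it evolves lists of numbers rather than programs. It is not the book's code; the function name, parameters, and defaults are my own assumptions, and it expects a population of at least two solutions, each with at least two elements.

```python
import random


def evolve(cost, population, generations=50, elite=0.2,
           mutation_rate=0.3, step=0.5, rng=None):
    """Evolve lists of floats toward lower cost; returns the best individual."""
    rng = rng or random.Random()
    population = [list(p) for p in population]
    n_elite = max(2, int(elite * len(population)))
    for _ in range(generations):
        # Keep the fittest individuals unchanged (elitism).
        population.sort(key=cost)
        survivors = population[:n_elite]
        children = []
        while len(survivors) + len(children) < len(population):
            if rng.random() < mutation_rate:
                # Mutation: copy a survivor and nudge one element.
                child = list(rng.choice(survivors))
                i = rng.randrange(len(child))
                child[i] += rng.uniform(-step, step)
            else:
                # Crossover: splice the front of one parent onto the back of another.
                mother, father = rng.sample(survivors, 2)
                cut = rng.randrange(1, len(mother))
                child = mother[:cut] + father[cut:]
            children.append(child)
        population = survivors + children
    return min(population, key=cost)


def sum_of_squares(v):
    """Toy cost function with its minimum at the origin."""
    return sum(x * x for x in v)
```

Because the elite survivors are carried over unchanged each generation, the best cost in the population can never get worse. Swapping in a different cost function is all it takes to point the same loop at a different problem, which is the kind of reuse the book encourages.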


I think the book is accessible to any programmer who is not afraid of a little math. If you are a Python programmer who is interested in Web 2.0 type applications, you really can't live without this book, so to speak. It was published in late 2007 and will stay fresh for at least a decade because the kinds of statistics and mathematical analysis it covers do not change over time. Even if you don't actually code Web 2.0 type apps, it is interesting to see how this stuff works behind the scenes, so reading the book is highly educational. For me, the book will be a source of scripting fun for many years. It is a book that I will turn to again and again. If you've ever wondered "how does Amazon make recommendations of books I might like?" then this is the book to answer your question.