Tuesday, December 9, 2008

Getting Started in Collective Intelligence

Have you ever wanted to write a program which learns from user feedback and recommends the perfect song, news article, stock pick, or significant other? Is your startup’s alternate revenue model listed as “Win Netflix Prize”? While a bit more complicated than your average web application, the skills needed to create an application using machine learning, natural language processing, or information retrieval algorithms can be learned using tools accessible to the leanest of startups.

Since going out on my own as a startup founder, I’ve found that the field has some pros and cons for bootstrapping. Because the number of features available to improve your algorithm is effectively infinite, processor management becomes a factor from day one. The other key challenge is to resist the lure of academia and to put product first. The possibility of advancing the state of the art is absolutely the most exciting thing about this (or any) science, but as a startup founder it’s important to focus on pragmatic improvements over technology for its own sake. If it helps, remember that the discipline of a solution focus can be a great lens for innovation. However you strike this balance, plan to do some serious homework and a lot of trial and error.

Start Studying

If you’ve got a background in Computer Science with exposure to Linear Algebra and Statistics, you’re in great shape to get going; otherwise you may need to brush up on those areas as you go. Without further ado, these are my favorite sources for self-study:

Programming Collective Intelligence – The classic programmer’s introduction to machine learning algorithms. If your trade is in software development, this is a soft place to start. All of the scary math equations are hidden at the back of the book, which helps when you’re just getting familiar - though you’ll need to face them eventually. I suggest coding the examples and learning the equations behind them as you go along, or you’ll find yourself backtracking later. Bonus tip - the source code for all the examples is available here.

The Carnegie Mellon Slides – Andrew Moore’s slideshows provide a more solid foundation in applied Bayesian statistics. There are many others out there, but his wry sense of humor makes these an entertaining place to get comfortable with the math side. He covers advanced topics as well, but for those I personally like…

The Stanford Lectures – Comprehensive online video courses on Natural Language Processing and Machine Learning. If you take one thing away from this guide, check these out.

JMLR – The Journal of Machine Learning Research. Free papers representing research’s bleeding edge. The papers are great for thinking about creative startup ideas when you’re ready for them.
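
To give a flavor of the kind of exercise these sources build toward, here is a minimal sketch of a user-based recommender in plain Python. It is not taken from any of the sources above: the ratings data, the function names, and the choice of cosine similarity are all assumptions made for illustration.

```python
from math import sqrt

# Hypothetical ratings data: user -> {item: score from 1 to 5}.
ratings = {
    "ann": {"song_a": 5.0, "song_b": 3.0, "song_c": 4.0},
    "bob": {"song_a": 4.0, "song_b": 2.0, "song_c": 5.0, "song_d": 4.0},
    "cat": {"song_a": 1.0, "song_b": 5.0, "song_d": 2.0},
}


def cosine_similarity(a, b):
    """Cosine similarity over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def recommend(user, ratings):
    """Rank items the user has not rated by similarity-weighted average score."""
    totals, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, score in other_ratings.items():
            if item in ratings[user]:
                continue  # only score items the user has not seen
            totals[item] = totals.get(item, 0.0) + sim * score
            weights[item] = weights.get(item, 0.0) + sim
    ranked = [(totals[item] / weights[item], item) for item in totals]
    return sorted(ranked, reverse=True)


recommendations = recommend("ann", ratings)
print(recommendations)
```

Each unseen item gets a predicted score that leans toward the opinions of the most similar users. A real application would need far more data and a similarity measure chosen by experiment.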

Utility Belt


As you’re studying, you’ll want to get acquainted with some handy tools of the trade.

WordNet – Practical applications of synsets have their strengths and weaknesses, but overall WordNet deserves its ubiquity.

Calais – Dead simple API from Reuters that returns keywords for a given piece of text. Great especially when you’re starting out.

Link Grammar – Necessary for breaking up sentences and finding parts of speech. I personally like tools based on the Penn Treebank, but I don’t know any that are free for commercial use. The Stanford Parser or Python’s extensive NLTK package are also useful for evaluation.

LDC – The Linguistic Data Consortium is the source for “Big Data”. I would caution the thrifty bootstrapper to resist the seductive power of the Google Corpus and its 5-gram goodness unless you know for sure it’s what you need - it takes a lot of cores to query that much data effectively. The highest quality data for your app will always be the data you generate from your actual source materials, but there are many instances where a large amount of data really helps.
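
Before reaching for a large corpus, it can be worth counting n-grams over your own source materials first. The following is a minimal sketch of that idea using only the Python standard library. The `ngram_counts` helper and the sample sentence are hypothetical, and the whitespace tokenizer is a deliberate simplification.

```python
from collections import Counter


def ngram_counts(text, n=2):
    """Count n-grams in text, using naive lowercase whitespace tokenization."""
    tokens = text.lower().split()
    # Slide a window of n tokens across the text; empty if the text is too short.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical sample standing in for your actual source materials.
sample = "the cat sat on the mat and the cat slept"
bigrams = ngram_counts(sample, n=2)
print(bigrams.most_common(3))
```

The same function with `n=5` produces 5-grams, which makes it easy to see how quickly the table grows as the text gets longer.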

Source code for most of the common machine learning models is available online, so once you know which approach you want to evaluate, a quick Google search should lead you to a version to play with. Always read the license fine print carefully if you plan to use it commercially, or just plan to write your own version.

Python is a natural choice for experimentation, given the PCI examples and the NLTK. The Stanford guys use Java, and most of the packaged engines are written in C/C++. In the end, though, expect to mix and match depending on the application, so don’t get too hung up on the platform. My current codebase is actually based in Ruby but has used all of the languages above for various tasks.

As with other topics in software engineering, the key to a successful implementation is judicious use of the right algorithms for the job. Making those choices well requires experience and perspective, and the latter is more difficult to get through self-study than at a university - you miss out on a professor’s point of view on the comparative advantages of different approaches. On the other hand, there is a wealth of information available online to develop experience, and often practical implementations and academic theory diverge in emphasis – especially in the startup world.

Adam is the creator of Semantic Gifts, a gift recommender that mines social media. He’s been developing software for over a decade, and has an uncommon fascination with grammar rules.
