Monday, February 2, 2009

Semantic Gifts: Valentine's Day Edition

Just in time for Valentine's Day shopping, a new version of Semantic Gifts ( http://semanticgifts.com ) went live today. Representing much more than new romantically themed gift ideas, the update approaches the problem of gift recommendation using social media in some fundamentally different ways.

The first obvious change is the target audience: guys looking for gifts for their girlfriends or wives. A romantic gesture is a particularly challenging purchase for many men, and can be much more stressful than the more open-ended gifts handled by the original Semantic Gifts. Additionally, because Valentine's Day is a more tightly "themed" holiday, there is naturally a smaller scope of pertinent gifts, and we found they fit well into more of a categorization model.

Though the app is targeted at guys looking for gifts for their significant others, girls can participate as well, with a short questionnaire on the types of gifts they'd like to receive.

The results have moved to a categorical model - instead of suggesting a particular product, the engine now recommends one of ten types of gifts that most closely match the text streams entered. In a move toward further transparency the system also provides some feedback on the observations that led up to the recommendation, something that the users of the first edition indicated would be helpful.

From a technology perspective the engine is largely unchanged. The first edition leaned more on profiling users against archetypal sources using topic modelling techniques, while the direct feedback of the Valentine's Edition and some better extraction algorithms put more mass on the statistical side this time around.

Check out the new version and let us know what you think - many of the changes (and not-changed things) were based on feedback from the first edition. And stay tuned for the next version: we've got some exciting ideas in the works.

Wednesday, January 28, 2009

On Facebook, FriendFeed

One of the more difficult decisions working on the latest Valentine's Day version of Semantic Gifts was the removal of Facebook and FriendFeed as source options for gift recommendations. In my Christmas edition I had the ability to use Facebook, FriendFeed, Twitter, or an open http:// string. For the Valentine's edition I've decided to use only Twitter. Though my extreme minimalism was the driving force behind each removal, each source had its own list of pros and cons.

I get many things from Facebook integration. Name, sex, location, friend ratios, and occasionally interests, hobbies, movies, etc. But as the API stands I can't get the text of the recipient's comments, posts, or historical status updates. Technically speaking I have named entities to extract but I don't have natural language to process, and Semantic Gifts is very much an NLP driven app. The first response from everyone is "Aren't you cutting out a wide swath of your potential userbase?" and it's true. There are lots of people who use Facebook who aren't yet on Twitter. But without free writing samples I can't demonstrate what Semantic Gifts can do, and there's nothing I can say based on a list of interests that you couldn't figure out on your own. It's actually worse than that, because a given user will assume I can get everything that person has ever said on Facebook, which isn't possible, and will thus be doubly disappointed.

I'd add that in conjunction with other sources Facebook is awesome, though, and is slated to return for the next iteration of Semantic Gifts "Classic". The hard data used to seed the algorithm is instead coming from a Girls' page this time around. Each has advantages and disadvantages.

Twitter is the 500 lb gorilla. I loathe the idea of Semantic Gifts being categorized as a "Twitter app", but in general it gives us the data we need to make Semantic Gifts do what it does. And it is pretty awesome. Sooner or later everyone is going to catch on that by combining microblog pithiness with modern concept mining you can extract some pretty good scores without waiting for ratings or search strings, and everyone will do recommendation this way. Twitter is good, and would be better if everyone used it the same way...

Which is what makes FriendFeed much harder to exclude from this round. On the one hand it seems much more heterogeneous - you have all kinds of different primary sources. But when you look closer at the text that comes back... the ubiquity of opinion expression that is our best recommendation fuel... you find topic introduction and discussion in a wonderfully contextualized way. I am coming around to the idea that for our type of recommendation to truly take over it will be running against lifestreams, not microblogs.

In the meantime the traffic numbers haven't justified the real estate for FriendFeed in the Valentine's version. I expect it to stay on for the classic edition and potentially become my primary source in the future (I am looking forward to that).

A little design commentary is necessary to put the cuts in context... the primary driver to remove anything at all is a strong (possibly relentless) drive for minimalism in the interface. The longer I do this the more strongly I believe each element has to justify its continued presence every major release.

Which is easy for me to preach about because Semantic Gifts is a casual app with a nascent technology in a not quite emerging market, but the exercise of touching each item on your site with a skeptical eye and forcing yourself to justify not commenting it out is an approach that I think should be applied to any application.

Tuesday, December 9, 2008

Getting Started in Collective Intelligence

Have you ever wanted to write a program which learns from user feedback and recommends the perfect song, news article, stock pick, or significant other? Is your startup’s alternate revenue model listed as “Win Netflix Prize”? While a bit more complicated than your average web application, the skills needed to create an application using machine learning, natural language processing, or information retrieval algorithms can be learned using tools accessible to the leanest of startups.

Since going out on my own as a startup founder I’ve found that the field has some pros and cons for bootstrapping. Since the number of features available to improve your algorithm is effectively infinite, the issue of processor management becomes a factor from day one. The other key challenge is to resist the lure of academia and to put product first. The possibility of advancing the state of the art is absolutely the most exciting thing about this (or any) science, but as a startup founder it’s important to focus on pragmatic improvements over technology for its own sake. If it helps, remember that the discipline of a solution focus can be a great lens for innovation. However you are able to strike this balance, plan to do some serious homework and a lot of trial and error.

Start Studying

If you’ve got a background in Computer Science with exposure to Linear Algebra and Statistics you’re in great shape to get going; otherwise you may need to brush up on those areas as you go. Without further ado, these are my favorite sources for self study:

Programming Collective Intelligence – The classic programmer’s introduction to machine learning algorithms. If your trade is in software development, this is a soft place to start. All of the scary math equations are hidden at the back of the book which helps when you’re just getting familiar - though you’ll need to face them eventually. I suggest coding the examples and learning the equations behind them as you go along, or you’ll find yourself backtracking later. Bonus tip - the source code for all the examples is available here.

The Carnegie Mellon Slides – Andrew Moore’s slideshows provide a more solid foundation in applied Bayesian statistics. There are many others out there but his wry sense of humor makes these an entertaining place to get comfortable with the math side. He covers advanced topics as well but for those I personally like…

The Stanford Lectures – Comprehensive online video courses on Natural Language Processing and Machine Learning. If you take one thing away from this guide, check these out.

JMLR – The Journal of Machine Learning Research. Free papers representing research’s bleeding edge. The papers are great for thinking about creative startup ideas when you’re ready for them.

Utility Belt


As you’re studying, you’ll want to get acquainted with some handy tools of the trade.

Wordnet – Practical applications of synsets have their strengths and weaknesses, but overall Wordnet deserves its ubiquity.

Calais – Dead simple API from Reuters that returns keywords for a given piece of text. Great especially when you’re starting out.

Link Grammar – Necessary for breaking up sentences and finding parts of speech. I personally like tools based on the Penn Treebank, but I don’t know any that are free for commercial use. The Stanford Parser or Python’s extensive NLTK package are also useful for evaluation.

LDC – This is the source for “Big Data”. I would caution the thrifty bootstrapper to resist the seductive power of the Google Corpus and its 5-gram goodness unless you know for sure it’s what you need - it takes a lot of cores to query that much data effectively. The highest quality data for your app will always be the data you generate from your actual source materials, but there are many instances where a large amount of data really helps.

There are online source code versions of most of the common machine learning models available, so when you know what approach you want to evaluate a quick Google search should lead you to a version to play with. Always make sure to read the license fine print carefully if you plan to use it commercially, or just plan to write your own version.

Python is a natural choice for experimentation for the PCI examples and the NLTK. The Stanford guys use Java, and most of the packaged engines are written in C/C++. In the end expect to mix and match depending on the application, however, so don’t get too hung up on the platform. My current codebase is actually based in Ruby but has used all of the languages above for various tasks.

Like other topics in software engineering the key to a successful implementation is in a judicious use of the right algorithms for the job. Making those choices well requires experience and perspective, and the latter is something more difficult to get through self-study than at a university - you miss out on a professor’s point of view on the comparative advantages of different approaches. On the other hand there is a wealth of information available online to develop experience, and often practical implementations and academic theory diverge in emphasis – especially in the startup world.

Adam is the creator of Semantic Gifts, a gift recommender that mines social media. He’s been developing software for over a decade, and has an uncommon fascination with grammar rules.

Tuesday, November 25, 2008

Semantic Gifts

The holidays are coming right up, and for many the season means a perennial challenge selecting the right gift for family, loved ones, and secret santas. Well fret no longer because there is a new way to find the perfect idea, Semantic Gifts (http://semanticgifts.com).

Here's how to use it:

Step 1: Select whatever information you have about their Facebook, Twitter, FriendFeed, or other public profile.
Step 2: Get gift ideas chosen for their unique interests.

Like any good oracle, if the recommendations miss you can always try again by selecting "More please." The app will return gift ideas until it runs out of picks it feels enthusiastic about.

Gifts are chosen using an algorithm that finds matches using two approaches. First a natural language processing filter analyzes the things your friend has said and matches the concepts they've discussed to gifts associated with similar topics. Second, a topic modeler looks at everything they've said and creates a profile, making some guesses about what they might be like. We can then recommend gifts that we've identified for people with similar overall interests. It's this combination of "what are they like" as well as "what interests can we determine they have" that can give us a good idea about what gift they'd enjoy.

Semantic Gifts is the first consumer-facing app using the analytics suite created over the past few months under the Morning Set banner. The engine could be described as a concept mining/information retrieval program with a heavy focus on NLP using "dirty" sources like RSS and microblog feeds, so gift recommendation is a perfect fit for the technology. The two big challenges that Semantic Gifts posed for the engine were the tendency to overfit because of the often small input size, and the need to perform the analysis very quickly.

The first problem was solved by creating tiers of analysis intensity based on how much information our crawler is able to extract. Very small and very large inputs are worked over lightly, though for opposite reasons - it's easy to draw too many conclusions when you have little to work with, and you don't need to work very hard at all when they've said a lot. Fortunately the system is tuned in such a way that average blog and Twitter feed sizes fall right in between.
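A minimal sketch of the tiering idea, in Python. The token thresholds and tier names below are made-up placeholders for illustration; the real cut-offs and the work done in each tier are not described in this post.

```python
def analysis_tier(token_count, small=200, large=20000):
    """Pick an analysis intensity from the amount of extracted text.

    `small` and `large` are hypothetical thresholds, not the values
    used by Semantic Gifts.
    """
    if token_count < small:
        # Little text: go light so we don't draw too many conclusions.
        return "light"
    if token_count > large:
        # Lots of text: the signal is already strong, light work is enough.
        return "light"
    # Typical blog and Twitter feed sizes land here and get the full pass.
    return "full"


for n in (50, 3000, 50000):
    print(n, analysis_tier(n))
```

Both ends of the range map to the same light tier here; a fuller version would probably keep them apart, since the reasons for going easy are opposite.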

The second problem was speed. Especially when you start to re-index statistical machine learning models at a corpus large enough to be useful, you can blow the budget of a bootstrapped startup just on EC2 processor spikes. When developing the analytics suite this problem was avoided by handling it asynchronously. Give me your data today, then come back tomorrow for your model. For Semantic Gifts that clearly wouldn't work, so we looked for components of the suite which would give us the most bang for the processor buck and also ways to "pre-bake" as much as possible. Though there is a substantial laundry list of things to add, the compromise appears successful so far.

Anyway I hope Semantic Gifts is as fun to use as it was to make, and I look forward to getting some good feedback.

Wednesday, October 22, 2008

How Rackspace can keep Slicehost awesome

Extracted from an email I sent my contact at Rackspace, here are the things I love about Slicehost that I hope Rackspace maintains:

1. Self serve sizing on the fly. I can move from one plan to another automatically online and the only changes to my environment are the resources available and the bill at the end of the month.

2. Self serve, unlimited free server wipes and reinstallation. Was more of an issue early on, but knowing I have a reset button is great. The last straw with my previous host was when they tried to charge me for this and I knew Slicehost would do it for free.

3. Root access and Ubuntu Hardy. It's a wonderful thing.

4. Really great documentation. Baby steps instructions for locking down and setting up your environment that are the best I've ever used. "PickledOnion" is a particularly gifted technical writer.

Tuesday, October 21, 2008

Fun with Capitalization

Though I'm still disinclined to talk about high level topics in NLP/ML/Semantic Hype I thought it'd be fun to bring up some little discussions about specific features. Today I'm going to talk about which features I consider when estimating how much more likely a capitalized word is to be picked as a keyword in a document.

So the clean class of capitalization consists of three states: lower case, Capitalized, and ALLCAPS. I treat camelCase as lower which is working ok for my purposes so far, though I haven't parsed any Java blogs yet.
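A rough Python sketch of that three-state split. The function name and the handling of edge cases (tokens with no letters, single-letter words) are my own assumptions for illustration.

```python
def cap_state(word):
    """Return 'allcaps', 'capitalized', or 'lower' for a single token."""
    letters = [c for c in word if c.isalpha()]
    if not letters:
        # Numbers and punctuation carry no capitalization signal.
        return "lower"
    if len(letters) > 1 and all(c.isupper() for c in letters):
        return "allcaps"
    if letters[0].isupper():
        return "capitalized"
    # Plain lower case and camelCase both end up here.
    return "lower"


for token in ("lily", "Lily", "LILY", "camelCase"):
    print(token, cap_state(token))
```

A single upper-case letter such as "I" is reported as capitalized rather than all caps, which is a judgment call.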

Moving outward there are two positional features that are immediately required, "Begins a Sentence" and "In the Title". Both of these pull the capitalization coefficient down, though both are positive indicators in aggregate.

For my purposes I don't currently take it further into literal positioning - whether a capital letter is in the 14th position in the document or sentence is statistically relevant, even more so because I deal with small documents, but its range is within my margin of error so it isn't the best way to spend my time right now.

Grammar is certainly a bigger deal, but so far I haven't seen it diverge much from its component probabilities. What I mean by that is that the odds that a word tagged NNP (proper noun) is a keyword plus the odds that a capitalized word is a keyword is roughly the same as the odds that an NNP Capital is a keyword. It isn't something I've crunched though, so I may adjust my opinion down the road.

Same goes for "grammar position" - whether the word was preceded by a verb or a pronoun has significant predictive relevance for keywording but my instinct is that capitalization is barely associated with that number - again the component probabilities have sway.

After that we get into the magical, non-linear universe of composite features that aren't in the scope of this blog. I mention it mainly because I believe dependence features for grammar and literal position become relevant at that cardinality.

Another capitalization topic that is harder to winnow out with seed data (because we present one permutation to the user) is words that appear with multiple capitalization profiles across the corpus and in a particular document. Three cases can throw us a life preserver - Person's Name (the lower instance was either a typo or we need to disassociate their scores as in "Lily picked a lily"), Begins a Sentence, and Appears in Title (the lower instance is probably the correct one). Easy peasy, but it's the cases where we don't have this helping us which are challenging to me.

A most likely cap profile in document weighted along with cap profile in corpus is the most intuitive approach ( 43 times we saw the word appear mid sentence capitalized and 23 times lower so when we present it we'll capitalize it) but I see two problems. First, we're keywording very short documents. We'll very rarely see the same word appearing but handled differently mid-sentence twice in the same document, so we lose that nice context. Second is that my gut says there are features at work that are definable better than simple probability...
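The "intuitive approach" above can be sketched in a few lines of Python. The `doc_weight` multiplier and the tie-break toward the capitalized form are assumptions for illustration, not tuned values.

```python
from collections import Counter


def pick_cap_profile(doc_counts, corpus_counts, doc_weight=3.0):
    """Choose the surface form with the highest weighted mid-sentence count.

    doc_counts / corpus_counts map a surface form ("Lily", "lily") to how
    often it was seen mid-sentence. doc_weight is a hypothetical multiplier
    that lets the current document outvote the corpus.
    """
    scores = Counter()
    for form, n in corpus_counts.items():
        scores[form] += n
    for form, n in doc_counts.items():
        scores[form] += doc_weight * n
    if not scores:
        return None
    # On a tie, prefer the form that starts with an upper-case letter.
    return max(scores, key=lambda form: (scores[form], form[:1].isupper()))


# The 43-versus-23 case from the text, with nothing seen in the document.
print(pick_cap_profile({}, {"Lily": 43, "lily": 23}))
```

With short documents `doc_counts` will usually be empty or hold a single sighting, which is exactly the first problem described above.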

Across the corpus we can use an author's upcase or downcase bias coefficient (not too hard to figure out). If they have a strong one to help us we're set, but this is easy to overfit unless you're using microblogs where MR. ALLCAPS is fairly common. Which we are. But what about in the case where our problem is in two instances from the same author? What does the instance which is bucking their capitalization bias tell us separately for an upcase or a downcase bias?
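One possible definition of such a bias coefficient, sketched in Python. The formula (an author's capitalization rate minus the corpus-wide rate) is an assumption made for illustration; the post does not spell out how the coefficient is actually computed.

```python
def cap_rate(tokens):
    """Share of word tokens that start with an upper-case letter.

    Pass mid-sentence tokens only, so sentence starts don't inflate the rate.
    """
    words = [t for t in tokens if t[:1].isalpha()]
    if not words:
        return 0.0
    return sum(t[0].isupper() for t in words) / len(words)


def case_bias(author_tokens, corpus_rate):
    """Positive means an upcase bias, negative a downcase bias (hypothetical)."""
    return cap_rate(author_tokens) - corpus_rate


shouter = ["LOVE", "THIS", "NEW", "phone"]
mumbler = ["love", "this", "new", "phone"]
print(case_bias(shouter, corpus_rate=0.25))
print(case_bias(mumbler, corpus_rate=0.25))
```

An instance that bucks a strongly positive or strongly negative score is the interesting one to look at more closely.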

When trying to figure out whether to weight more heavily the downcased or the upcased instance, what we're really asking is whether the downcased version was "uncharacteristic carelessness" or the upcased version was "uncharacteristic overemphasis". Remember they had a version of each. Two good ways to help determine that are the capitalization of the words around the word - whether they're anomalous the same way (up or down uncharacteristically - but be careful, as it is extremely easy to cause an echo chamber effect if you let each word affect each of the others) and slightly more cleverly how many hits we get from the slang table compared to the author's normal "slang quotient". I have a theory that emphasis and carelessness will both hit the slang table with a similar spike - though on different words. Similarly we can compare their misspelling rate on that sentence to their normal rate of misspellings, and we expect to see a similar uptick when they are being uncharacteristically careless or prone to overemphasis.

If none of that helps me decide, I just capitalize the darn thing and move on.

I should qualify all that by pointing out we're just talking about how to present the keyword as a final product. When we read a doc we store the cap profile of everything we get then downcase and stem everything for crunching and comparison. The challenge I'm discussing here is the best algorithm to "reconstitute" the stems when you have multiple cap profiles for the same word.
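To make the store-then-normalize step concrete, here is a small Python sketch. `normalize` is a stand-in that only lower-cases; a real stemmer would slot in there, and the `profiles` structure is an assumed layout rather than the actual storage format.

```python
from collections import Counter, defaultdict


def normalize(token):
    # Stand-in for "downcase and stem"; a real stemmer would go here.
    return token.lower()


def read_doc(tokens, profiles):
    """Record each token's surface form, then return normalized tokens."""
    normalized = []
    for tok in tokens:
        key = normalize(tok)
        profiles[key][tok] += 1  # remember how the word was actually written
        normalized.append(key)
    return normalized


profiles = defaultdict(Counter)
crunchable = read_doc(["Lily", "picked", "a", "lily"], profiles)
print(crunchable)
print(dict(profiles["lily"]))
```

Reconstituting a keyword then means picking one surface form out of `profiles[key]`, which is where the multiple-profile problem shows up.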

So that's where I currently am on answering the question "If a word is capitalized, how much more likely is it a keyword than if it was not?"

Thursday, October 9, 2008

Tricks for language agnosticism

When evaluating different approaches for a non-trivial problem like NLP I've found the libraries created by others to be invaluable for benchmarking different techniques. Unfortunately these libraries, kits, and code snippets are written in every language under the sun and are in various stages of broken, so some native language programming is generally required. For me the ability to try out different algorithms in any language is critical for being able to be the sole technical resource working on challenging problems. Here are some general tricks I've learned:

1. Develop an "interlingua"
In the spirit of proto-Esperanto and the machine translation concept, the idea of an interlingua in programming for me is a set of algorithm syntax broad enough to cover most programming tasks but general enough to be applied in most languages. Though language-unique techniques are often critical for tuning a production-ready app, straightforward code expressions are ideal for iterative programming cycles and ramping up newer programmers. I usually use simple loops and conditions and break apart multi-stage tasks onto their own lines. Basic classes and methods in a standard MVC structure work in most environments. One exception to this approach: ORM. When using a language which can handle my SQL for me, I'm happy to do whatever it takes to get that off my plate.

2. Use batch processes liberally
One of the hardest elements of creating a multi-lingual application is the "putty" layer - making code in one language talk to the other. Interoperability is eventually necessary for threaded or synchronous tasks, but if you're just testing an approach you can often skip this layer. Try to find a way for both codebases to talk to the same data layer, and then run one as a batch process. MySQL extensions are fairly ubiquitous these days.
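A minimal sketch of the shared-data-layer trick in Python. It uses the standard library's sqlite3 as a stand-in for a shared MySQL database, and the table names, columns, and the token-count "analysis" are all hypothetical. The point is only that the other codebase writes rows, and this script picks them up as a batch without any direct interop.

```python
import sqlite3


def make_db(path=":memory:"):
    """Open the shared store and make sure both (hypothetical) tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS documents (id INTEGER PRIMARY KEY, body TEXT);
        CREATE TABLE IF NOT EXISTS results (doc_id INTEGER PRIMARY KEY, token_count INTEGER);
        """
    )
    return conn


def run_batch(conn):
    """Process every document with no result yet; return how many were handled."""
    rows = conn.execute(
        "SELECT d.id, d.body FROM documents d "
        "LEFT JOIN results r ON r.doc_id = d.id "
        "WHERE r.doc_id IS NULL"
    ).fetchall()
    for doc_id, body in rows:
        # Placeholder analysis: the other language's library would do real work.
        conn.execute(
            "INSERT INTO results (doc_id, token_count) VALUES (?, ?)",
            (doc_id, len(body.split())),
        )
    conn.commit()
    return len(rows)


conn = make_db()
# Pretend these rows were written by the code in the other language.
conn.executemany(
    "INSERT INTO documents (body) VALUES (?)",
    [("give me your data today",), ("come back tomorrow",)],
)
print(run_batch(conn))
print(run_batch(conn))
```

Running the batch twice is safe because already-processed rows are skipped, which makes it easy to schedule from cron or a rake task.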

3. Get good at scripting
This goes right along with number two... the better you are at scripting Perl, Python, or Ruby, the more you can massage the data going in and out of the unfamiliar languages' code and the less native programming you'll need to do for testing purposes.

4. Invest in the native runtime
When experimenting with a new language it is important to be able to do trial and error and iterate quickly. It is worth the investment to write good make/ant/rake files and check the source into Subversion for testing. This will sound silly, but also make sure you write down the directory structures and execution commands for each environment you'll need to remember.

5. Have an Escape Plan
Obviously you don't want to end up with a seven-headed hydra of a production application, so you should have a sense of how you're going to detangle your technology stack. Generally this means rewriting the modules using the approach you end up with in one of your home languages. Personally I prefer a two language stack - one interpreted language which is fast to develop in and one compiled language which performs well.

There's no one language which offers the libraries that all languages do, and for more intensive research-related tasks a multi-language approach can save a huge amount of time. If you follow the KISS principle and the "build one to throw away" approach, you'll be able to get hands-on experience with many more techniques in areas like computational linguistics, image mining, or AI than you could with a single stack.