Wednesday, October 22, 2008

How Rackspace can keep Slicehost awesome

Extracted from an email I sent my contact at Rackspace, here are the things I love about Slicehost that I hope Rackspace maintains:

1. Self serve sizing on the fly. I can move from one plan to another automatically online, and the only changes to my environment are the resources available and the bill at the end of the month.

2. Self serve, unlimited free server wipes and reinstallation. Was more of an issue early on, but knowing I have a reset button is great. The last straw with my previous host was when they tried to charge me for this and I knew Slicehost would do it for free.

3. Root access and Ubuntu Hardy. It's a wonderful thing.

4. Really great documentation. Baby steps instructions for locking down and setting up your environment that are the best I've ever used. "PickledOnion" is a particularly gifted technical writer.

Tuesday, October 21, 2008

Fun with Capitalization

Though I'm still disinclined to talk about high level topics in NLP/ML/Semantic Hype, I thought it'd be fun to bring up some little discussions about specific features. Today I'm going to talk about which features I consider when deciding how much more likely a capitalized word is to be picked as a keyword in a document.

So the clean class of capitalization consists of three states: lower case, Capitalized, and ALLCAPS. I treat camelCase as lower which is working ok for my purposes so far, though I haven't parsed any Java blogs yet.
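As a rough sketch of those three states (the function name and the single-letter rule are my own assumptions for illustration, not anything from a library):

```python
def cap_profile(word: str) -> str:
    """Return 'lower', 'capitalized', or 'allcaps' for a single token."""
    letters = [c for c in word if c.isalpha()]
    # Assumption: a lone capital letter such as "I" counts as capitalized,
    # not as ALLCAPS, so ALLCAPS needs at least two letters.
    if len(letters) > 1 and all(c.isupper() for c in letters):
        return "allcaps"
    if letters and word[0].isupper():
        return "capitalized"
    # camelCase starts with a lowercase letter, so it falls through to 'lower'.
    return "lower"
```

So `cap_profile("camelCase")` comes back as `"lower"`, which matches the shortcut described above.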

Moving outward there are two positional features that are immediately required, "Begins a Sentence" and "In the Title". Both of these pull the capitalization coefficient down, though both are positive indicators in aggregate.

For my purposes I don't currently take it further into literal positioning - whether a capital letter is in the 14th position in the document or sentence is statistically relevant, even more so because I deal with small documents, but its range is within my margin of error so it isn't the best way to spend my time right now.

Grammar is certainly a bigger deal, but so far I haven't seen it diverge much from its component probabilities. What I mean by that is that the odds that a word tagged NNP (proper noun) is a keyword plus the odds that a capitalized word is a keyword is roughly the same as the odds that an NNP Capital is a keyword. It isn't something I've crunched though, so I may adjust my opinion down the road.

Same goes for "grammar position" - whether the word was preceded by a verb or a pronoun has significant predictive relevance for keywording but my instinct is that capitalization is barely associated with that number - again the component probabilities have sway.

After that we get into the magical, non-linear universe of composite features that aren't in the scope of this blog. I mention it mainly because I believe dependence features for grammar and literal position become relevant at that cardinality.

Another capitalization topic that is harder to winnow out with seed data (because we present one permutation to the user) is words that appear with multiple capitalization profiles across the corpus and in a particular document. Three cases can throw us a life preserver - Person's Name (the lower instance was either a typo or we need to disassociate their scores, as in "Lily picked a lily"), Begins a Sentence, and Appears in Title (the lower instance is probably the correct one). Easy peasy, but it's the cases where we don't have this helping us which are challenging to me.

Taking the most likely cap profile in the document, weighted along with the cap profile in the corpus, is the most intuitive approach (43 times we saw the word appear mid sentence capitalized and 23 times lower, so when we present it we'll capitalize it), but I see two problems. First, we're keywording very short documents. We'll very rarely see the same word appearing but handled differently mid-sentence twice in the same document, so we lose that nice context. Second is that my gut says there are features at work that can be defined better than by simple probability...
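Here is a minimal sketch of that intuitive approach, just to make the arithmetic concrete. The function name, the `doc_weight` value, and the tie-break toward capitalizing are all assumptions for illustration, not tuned numbers:

```python
from collections import Counter

def pick_cap_profile(doc_counts, corpus_counts, doc_weight=3.0):
    """Pick 'capitalized' or 'lower' from mid-sentence counts.

    doc_counts and corpus_counts map a profile name to how often it was seen.
    doc_weight is a hypothetical knob for how much more we trust the document.
    """
    scores = Counter()
    for profile, n in corpus_counts.items():
        scores[profile] += n
    for profile, n in doc_counts.items():
        scores[profile] += doc_weight * n
    if not scores:
        return "capitalized"  # no evidence at all: capitalize by default
    # On a tied score, prefer any profile other than 'lower'.
    return max(scores, key=lambda p: (scores[p], p != "lower"))
```

With corpus counts of 43 capitalized and 23 lower and nothing useful in the document, this picks `"capitalized"`; a short document rarely has enough instances to overturn that, which is the first problem above.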

Across the corpus we can use an author's upcase or downcase bias coefficient (not too hard to figure out). If they have a strong one to help us we're set, but this is easy to overfit unless you're using microblogs where MR. ALLCAPS is fairly common. Which we are. But what about in the case where our problem is in two instances from the same author? What does the instance which is bucking their capitalization bias tell us separately for an upcase or a downcase bias?
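One way the bias coefficient could be defined, as a sketch only: the post doesn't pin down a formula, so the definition below (net share of words the author upcased or downcased against expectation) and the `"upper"`/`"lower"` labels are my assumptions.

```python
def case_bias(observations):
    """Return a bias in [-1, 1]: positive = upcase bias, negative = downcase bias.

    observations is an iterable of (observed, expected) pairs, each either
    'upper' or 'lower'. Where 'expected' comes from (a dictionary, the corpus
    majority, etc.) is left open here.
    """
    ups = downs = total = 0
    for observed, expected in observations:
        total += 1
        if observed == "upper" and expected == "lower":
            ups += 1      # author capitalized something usually written lower
        elif observed == "lower" and expected == "upper":
            downs += 1    # author downcased something usually capitalized
    return (ups - downs) / total if total else 0.0
```

An author like MR. ALLCAPS would land close to 1.0, and the instance that bucks a strong bias is the interesting one.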

When trying to figure out whether to weight more heavily the downcased or the upcased instance, what we're really asking is whether the downcased version was "uncharacteristic carelessness" or the upcased version was "uncharacteristic overemphasis". Remember they had a version of each. Two good ways to help determine that are the capitalization of the words around the word - whether they're anomalous the same way (up or down uncharacteristically - but be careful as this is extremely easy to cause an echo chamber effect if you let each word affect each of the others) and slightly more cleverly how many hits we get from the slang table compared to the author's normal "slang quotient". I have a theory that emphasis and carelessness will both hit the slang table with a similar spike - though on different words. Similarly we can compare their misspelling rate on that sentence to their normal rate of misspellings, and we expect to see a similar uptick when they are being uncharacteristically careless or prone to overemphasis.

If none of that helps me decide, I just capitalize the darn thing and move on.

I should qualify all that by pointing out we're just talking about how to present the keyword as a final product. When we read a doc we store the cap profile of everything we get, then downcase and stem everything for crunching and comparison. The challenge I'm discussing here is the best algorithm to "reconstitute" the stems when you have multiple cap profiles for the same word.
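To make the store-then-reconstitute flow concrete, here is a stripped-down sketch. The names are hypothetical, `normalize` only downcases (a real stemmer would slot in there), and "most frequently seen form wins" is the naive baseline rather than the algorithm I'd want to end up with:

```python
from collections import Counter, defaultdict

def normalize(token):
    # Stand-in for "downcase and stem": this sketch only downcases.
    return token.lower()

def index_surface_forms(tokens):
    """Map each normalized key to a count of the surface forms seen for it."""
    seen = defaultdict(Counter)
    for token in tokens:
        seen[normalize(token)][token] += 1
    return seen

def reconstitute(key, seen):
    """Return the most frequently seen surface form for a normalized key."""
    forms = seen.get(key)
    if not forms:
        return key  # never saw it: fall back to the normalized form
    return forms.most_common(1)[0][0]
```

The crunching and comparison happen on the keys; the stored counts are only consulted at the end when a keyword has to be shown to someone.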

So that's where I currently am on answering the question "If a word is capitalized, how much more likely is it a keyword than if it was not?"

Thursday, October 9, 2008

Tricks for language agnosticism

When evaluating different approaches for a non-trivial problem like NLP I've found the libraries created by others to be invaluable for benchmarking different techniques. Unfortunately these libraries, kits, and code snippets are written in every language under the sun and are in various stages of broken, so some native language programming is generally required. For me the ability to try out different algorithms in any language is critical for being able to be the sole technical resource working on challenging problems. Here are some general tricks I've learned:

1. Develop an "interlingua"
In the spirit of proto-Esperanto and the machine translation concept, the idea of an interlingua in programming for me is a set of algorithm syntax broad enough to cover most programming tasks but general enough to be applied in most languages. Though language-unique techniques are often critical for tuning a production-ready app, straightforward code expressions are ideal for iterative programming cycles and ramping up newer programmers. I usually use simple loops and conditions and break apart multi-stage tasks onto their own lines. Basic classes and methods in a standard MVC structure work in most environments. One exception to this approach: ORM. When using a language which can handle my SQL for me, I'm happy to do whatever it takes to get that off my plate.

2. Use batch processes liberally
One of the hardest elements of creating a multi-lingual application is the "putty" layer - making code in one language talk to the other. Interoperability is eventually necessary for threaded or synchronous tasks, but if you're just testing an approach you can often skip this layer. Try to find a way for both codebases to talk to the same data layer, and then run one as a batch process. MySQL extensions are fairly ubiquitous these days.

3. Get good at scripting
This goes right along with number two... the better you are at scripting Perl, Python, or Ruby, the more you can massage the data going in and out of the unfamiliar languages' code and the less native programming you'll need to do for testing purposes.

4. Invest in the native runtime
When experimenting with a new language it is important to be able to do trial and error and iterate quickly. It is worth the investment to write good make/ant/rake files and check the source into subversion for testing. This will sound silly, but also make sure you write down the directory structures and execution commands for each environment you'll need to remember.

5. Have an Escape Plan
Obviously you don't want to end up with a seven-headed hydra of a production application, so you should have a sense of how you're going to detangle your technology stack. Generally this means rewriting the modules using the approach you end up with in one of your home languages. Personally I prefer a two language stack - one interpreted language which is fast to develop in and one compiled language which performs well.

There's no one language which offers the libraries that all languages do, and for more intensive research-related tasks a multi-language approach can save a huge amount of time. If you follow the KISS principle and the "build one to throw away" approach, you'll be able to get hands-on experience with many more techniques in areas like computational linguistics, image mining, or AI than you could with a single stack.