Though I'm still disinclined to talk about high-level topics in NLP/ML/Semantic Hype, I thought it'd be fun to bring up some little discussions about specific features. Today I'm going to talk about how I decide which features to consider when estimating how much more likely a capitalized word is to be picked as a keyword in a document.
So the clean class of capitalization consists of three states: lower case, Capitalized, and ALLCAPS. I treat camelCase as lower, which is working OK for my purposes so far, though I haven't parsed any Java blogs yet.
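A minimal sketch of that three-way split, just to make the states concrete. The function name and the edge-case choices (single letters like "I" count as Capitalized, tokens with no letters count as lower, and anything starting with an upper-case letter that isn't all caps counts as Capitalized) are assumptions for illustration, not rules I've tuned.

```python
def cap_profile(word):
    """Classify a token as 'lower', 'capitalized', or 'allcaps'."""
    letters = [ch for ch in word if ch.isalpha()]
    if not letters:
        # Assumption: numbers and punctuation carry no capitalization signal.
        return "lower"
    if len(letters) > 1 and all(ch.isupper() for ch in letters):
        return "allcaps"
    if letters[0].isupper():
        return "capitalized"
    # camelCase starts with a lower-case letter, so it lands here.
    return "lower"


for token in ["keyword", "Keyword", "NLP", "camelCase"]:
    print(token, "->", cap_profile(token))
```

Counting only alphabetic characters means a token like "MR." still comes out as allcaps despite the trailing period.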
Moving outward, there are two positional features that are immediately required: "Begins a Sentence" and "In the Title". Both of these pull the capitalization coefficient down, since a capital in either spot is often just convention, though both are positive indicators in aggregate.
For my purposes I don't currently take it further into literal positioning. Whether a capital letter is in the 14th position in the document or sentence is statistically relevant, even more so because I deal with small documents, but its effect is within my margin of error, so it isn't the best way to spend my time right now.
Grammar is certainly a bigger deal, but so far I haven't seen it diverge much from its component probabilities. What I mean by that is that the odds that a word tagged NNP (proper noun) is a keyword, combined with the odds that a capitalized word is a keyword, come out roughly the same as the odds that a capitalized NNP is a keyword. It isn't something I've crunched though, so I may adjust my opinion down the road.
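If I ever do crunch it, one possible reading of the "component probabilities" claim is that the two features add up on a log-odds scale, so the joint case shows no extra interaction. The sketch below checks that under stated assumptions: the four groups (neither feature, NNP only, capitalized only, both), the add-half smoothing, and every count in the usage example are made up for illustration and are not measurements.

```python
import math


def log_odds(keyword_count, total_count):
    """Log-odds of being a keyword, with add-half smoothing to avoid log(0)."""
    misses = total_count - keyword_count
    return math.log((keyword_count + 0.5) / (misses + 0.5))


def interaction(neither, nnp_only, cap_only, both):
    """Each argument is a (keyword_count, total_count) pair for one group.

    Returns how far the joint lift departs from the sum of the two
    single-feature lifts. Values near 0 suggest the components explain
    the joint case; large values suggest a real interaction.
    """
    base = log_odds(*neither)
    lift_nnp = log_odds(*nnp_only) - base
    lift_cap = log_odds(*cap_only) - base
    lift_both = log_odds(*both) - base
    return lift_both - (lift_nnp + lift_cap)


# Hypothetical counts, chosen so the odds multiply almost exactly.
print(interaction((100, 1100), (200, 1200), (300, 1300), (600, 1600)))
```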
Same goes for "grammar position" - whether the word was preceded by a verb or a pronoun has significant predictive relevance for keywording, but my instinct is that capitalization is barely associated with that number. Again, the component probabilities hold sway.
After that we get into the magical, non-linear universe of composite features that aren't in the scope of this blog. I mention it mainly because I believe dependence features for grammar and literal position become relevant at that cardinality.
Another capitalization topic that is harder to winnow out with seed data (because we present one permutation to the user) is words that appear with multiple capitalization profiles across the corpus and in a particular document. Three cases can throw us a life preserver - Person's Name (the lower instance was either a typo or we need to disassociate their scores, as in "Lily picked a lily"), Begins a Sentence, and Appears in Title (the lower instance is probably the correct one). Easy peasy, but it's the cases where we don't have this helping us that are challenging to me.
Taking the most likely cap profile in the document, weighted along with the cap profile in the corpus, is the most intuitive approach (say we saw the word capitalized mid-sentence 43 times and lower 23 times, so we capitalize it when we present it), but I see two problems. First, we're keywording very short documents. We'll very rarely see the same word handled differently mid-sentence twice in the same document, so we lose that nice context. Second, my gut says there are features at work that can be defined better than simple probability...
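Here's a rough sketch of that intuitive approach, blending the in-document share of each cap profile with its corpus share. The function names, the `doc_weight` parameter, and its default of 0.5 are assumptions for illustration. Ties and empty counts fall through to capitalized, which matches my "just capitalize the darn thing" fallback.

```python
# Order doubles as the tie-break preference: capitalized wins ties.
PROFILES = ("capitalized", "lower", "allcaps")


def shares(counts):
    """Turn raw counts per profile into fractions (all zeros if no data)."""
    total = sum(counts.values())
    return {p: (counts.get(p, 0) / total if total else 0.0) for p in PROFILES}


def pick_profile(doc_counts, corpus_counts, doc_weight=0.5):
    """Pick the cap profile with the best blended document/corpus share."""
    doc = shares(doc_counts)
    corpus = shares(corpus_counts)
    scores = {
        p: doc_weight * doc[p] + (1 - doc_weight) * corpus[p] for p in PROFILES
    }
    # max() returns the first maximal item, so PROFILES order breaks ties.
    return max(PROFILES, key=lambda p: scores[p])


corpus = {"capitalized": 43, "lower": 23}
print(pick_profile({}, corpus))            # no in-document evidence
print(pick_profile({"lower": 1}, corpus))  # one lower-case use in the document
```

With short documents the document side is usually empty or a single observation, which is exactly the first problem above: one in-document instance can swing the whole decision.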
Across the corpus we can use an author's upcase or downcase bias coefficient (not too hard to figure out). If they have a strong one to help us we're set, but this is easy to overfit unless you're using microblogs, where MR. ALLCAPS is fairly common. Which we are. But what about the case where our problem is two instances from the same author? What does the instance that is bucking their capitalization bias tell us, and does it tell us something different for an upcase bias than for a downcase bias?
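One hedged way to get such a coefficient is the author's upcase rate minus the corpus rate, shrunk toward zero when we've seen little from that author so a couple of shouty tokens don't overfit. The function name, the `prior_strength` parameter, and the numbers in the usage lines are hypothetical; this is a sketch of the idea, not the exact coefficient I use.

```python
def upcase_bias(author_up, author_total, corpus_rate, prior_strength=50.0):
    """Estimate how much more (or less) an author upcases than the corpus.

    author_up     -- mid-sentence tokens the author wrote Capitalized or ALLCAPS
    author_total  -- all mid-sentence tokens seen from the author
    corpus_rate   -- corpus-wide fraction of mid-sentence tokens that are upcased
    prior_strength -- pseudo-token count pulling small samples toward the corpus
    Positive means an upcase bias, negative a downcase bias.
    """
    smoothed = (author_up + prior_strength * corpus_rate) / (
        author_total + prior_strength
    )
    return smoothed - corpus_rate


# Made-up numbers: a heavy ALLCAPS author versus one seen only twice.
print(upcase_bias(900, 1000, corpus_rate=0.1))
print(upcase_bias(2, 2, corpus_rate=0.1))
```

The second author upcased everything we saw, but with only two tokens the estimate stays close to zero.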
When trying to figure out whether to weight the downcased or the upcased instance more heavily, what we're really asking is whether the downcased version was "uncharacteristic carelessness" or the upcased version was "uncharacteristic overemphasis". Remember, they had a version of each. There are two good ways to help determine that. The first is the capitalization of the words around the word - whether they're anomalous in the same way, up or down uncharacteristically. Be careful here, as it's extremely easy to create an echo chamber effect if you let each word affect each of the others. The second, slightly more clever, is how many hits we get from the slang table compared to the author's normal "slang quotient". I have a theory that emphasis and carelessness will both hit the slang table with a similar spike - though on different words. Similarly, we can compare their misspelling rate in that sentence to their normal rate of misspellings, and we expect to see a similar uptick when they are being uncharacteristically careless or prone to overemphasis.
If none of that helps me decide, I just capitalize the darn thing and move on.
I should qualify all that by pointing out we're just talking about how to present the keyword as a final product. When we read a doc we store the cap profile of everything we get, then downcase and stem everything for crunching and comparison. The challenge I'm discussing here is the best algorithm to "reconstitute" the stems when you have multiple cap profiles for the same word.
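To make the store-then-reconstitute flow concrete, here's a small sketch. Everything in it is an assumption for illustration: `toy_stem` is a stand-in for a real stemmer, the index keeps whole surface forms rather than just profiles, and the "most frequent form wins" rule is the naive baseline, not the smarter algorithm I'm after.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Placeholder stemmer: downcase and strip a trailing 's'."""
    w = word.lower()
    return w[:-1] if w.endswith("s") and len(w) > 3 else w


def index_forms(tokens):
    """Record every surface form seen, keyed by its downcased stem."""
    seen = defaultdict(Counter)
    for tok in tokens:
        seen[toy_stem(tok)][tok] += 1
    return seen


def reconstitute(stem, seen):
    """Pick a presentable form for a stem; capitalize if we know nothing."""
    forms = seen.get(stem)
    if not forms:
        return stem.capitalize()
    return forms.most_common(1)[0][0]


seen = index_forms(["Lily", "picked", "a", "lily", "Lily"])
print(reconstitute("lily", seen))
```

Note that this baseline happily merges "Lily" the person and "lily" the flower under one stem, which is the disassociation problem mentioned earlier.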
So that's where I currently am on answering the question "If a word is capitalized, how much more likely is it a keyword than if it was not?"