What Makes Two Pieces of Content Similar?

Thinking more about how to dampen the web's echo chamber, I wrote a Python script that uses OpenCalais to come up with a "similarity score" for two web pages you feed it. The output I'm expecting is a basic "yes" or "no" that answers the question, "Are these two articles about the same thing?" OpenCalais analyzes text and determines a list of topics the text is about, along with a relevance score for each. My script takes this report for the two pages, finds all the topics that overlap, sums and weights the relevance scores, and comes up with an overall similarity index.
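To make the idea concrete, here is a minimal sketch of one way such a score could be computed. It is not the exact formula from my script. It assumes each OpenCalais report has already been reduced to a plain dictionary mapping topic name to relevance (a number between 0 and 1), and it uses a weighted Jaccard overlap as the "sum and weight" step. The function names, the 0.5 threshold, and the sample topics and numbers are all made up for illustration.

```python
def topic_similarity(topics_a, topics_b):
    """Weighted Jaccard overlap of two {topic: relevance} dicts, in [0, 1].

    Assumption: relevance values are non-negative numbers, e.g. 0..1.
    """
    all_topics = set(topics_a) | set(topics_b)
    if not all_topics:
        return 0.0
    # For each topic, the smaller relevance counts as "shared" weight
    # and the larger relevance counts toward the total weight.
    shared = sum(min(topics_a.get(t, 0.0), topics_b.get(t, 0.0)) for t in all_topics)
    total = sum(max(topics_a.get(t, 0.0), topics_b.get(t, 0.0)) for t in all_topics)
    return shared / total if total else 0.0


def same_story(topics_a, topics_b, threshold=0.5):
    """Turn the score into the yes/no answer; the threshold is a guess to tune."""
    return topic_similarity(topics_a, topics_b) >= threshold


# Hypothetical topic reports, not real OpenCalais output.
hp_article = {"Hewlett-Packard": 0.9, "Palm": 0.8, "iPad": 0.2}
ipad_review = {"iPad": 0.9, "Apple": 0.7, "3G": 0.5}

score = topic_similarity(hp_article, ipad_review)
print(round(score, 3), same_story(hp_article, ipad_review))
```

With a measure like this, a topic that is only a minor mention in one article contributes little to the score, which is the behavior I want for snippets that merely share a page with the main story.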

This works fairly well, but produces some false positives for articles that are similar but not about exactly the same thing. For instance, this article about HP buying Palm comes up as moderately similar to this Mashable review of the iPad 3G. This is probably because the HP article's page includes a snippet below the fold about the iPad. Granted, my script doesn't rate the two as closely related, but in my mind they're only distantly related, and the iPad snippet that shares a page with the HP article isn't even a review of the iPad 3G, as the Mashable article is.

So what refinements can I include that might avoid false positives like this? It occurred to me that the headlines of closely similar articles are also closely similar. One idea I had is to run the headlines through OpenCalais as well, and sum and weight those results along with the contextual analysis of the articles themselves. This might help to avoid the case above, but if one article's headline is, hypothetically, "Apple iPad 3G Jailbreak Released" and the other's is "Apple Releases iPad 3G," OpenCalais might not be able to tell that these are in fact not about the same thing. I could also skip OpenCalais and just count the overlapping words in the two headlines, but that probably won't work in this case either.

But what if I apply that method to the whole of the article? Count the number of overlapping words in the two articles, with the exception of generic words like "this," "that," "is," "was," etc., and then use this analysis to increase or decrease the similarity score from above. This may work.
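Here is a rough sketch of that word-overlap idea, under some assumptions of my own: words are lowercased and split with a simple regular expression, the stop-word list is a tiny hand-picked sample rather than a real one, overlap is measured as shared distinct words divided by all distinct words, and the adjustment is a plain weighted blend with an arbitrary weight of 0.5. The names `content_words`, `word_overlap`, and `adjusted_score` are hypothetical.

```python
import re

# A tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = frozenset({
    "a", "an", "the", "this", "that", "is", "was", "are", "were",
    "of", "to", "in", "on", "and", "or", "it", "for", "with", "as",
})


def content_words(text):
    """Return the set of distinct lowercase words in text, minus stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {w for w in words if w not in STOP_WORDS}


def word_overlap(text_a, text_b):
    """Shared distinct words divided by all distinct words, in [0, 1]."""
    a, b = content_words(text_a), content_words(text_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def adjusted_score(topic_score, text_a, text_b, weight=0.5):
    """Blend a topic-based score with word overlap; weight is a guess to tune."""
    return (1 - weight) * topic_score + weight * word_overlap(text_a, text_b)


headline_a = "Apple iPad 3G Jailbreak Released"
headline_b = "Apple Releases iPad 3G"
print(word_overlap(headline_a, headline_b))  # 3 shared of 6 distinct words -> 0.5
```

Run on the two hypothetical headlines, the overlap comes out at 0.5 even though the stories are different, which is why I suspect counting words on headlines alone won't be enough. Whole articles give the count far more words to work with.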

But all of this really raises the question: how do our minds do this, and so easily too? It's an amazing, and completely ordinary, skill.
