First, apologies for the sheer laziness and inertia on my part in not updating this blog at all. After starting it, I was bogged down by the fact that there are so many good technical blogs on AI and NLP that I didn't know what I should write about. Slowly I have come to realize that keeping a research blog is like keeping a research and learning log: a record of the things that excited you, which you want the world to know excited you. So here I begin with a small article on our winning hack in the science category at the university HackU! event conducted by Yahoo! R&D at IIT Madras.
Our hack was called Tweets of Interest. The motivation is simple: we follow many users on Twitter because their interests match ours, but that does not mean they always tweet interesting things. If you follow a researcher on Twitter, a tweet about their recent paper interests you more than a tweet about an adventure they recently undertook. Wouldn't it be nice if there were an application that polled your timeline, took your interests into account, and mailed the 'interesting' tweets to you?
The app has a GUI that takes your Twitter user id and a comma-separated list of words specifying your interests. I won't go much into the implementation details because they are pretty straightforward. It was written in Python (and I fell in love with it :) ), and the other components (the mail server, extracting tweets using the Twitter API, OAuth authentication, and so on) are all solved problems with sufficient documentation available. Since it was a 24-hour implementation, you can imagine the dirtiest programming practices that came to the fore to deliver the app in the quickest time. Instead, let me dwell on the science behind this hack (and before I forget, let me mention that it was entered in the science category).
Instead of a naive keyword-based approach (searching hashtags or exact matching of words), we tried a more semantic approach: LSA, or Latent Semantic Analysis. This is a well-known algorithm in the field of IR, so I won't dwell on it either (and no, I am not an escapist). Each tweet is treated as a document, and the words across all the tweets form the term-document matrix. Each word in a tweet is augmented with its synonyms/synsets from WordNet to get an expanded term set. We then apply LSA to the term-document matrix to obtain the concept space, and the tweets are projected onto this space along with the query, which is simply the interests the user specified. A simple cosine similarity then gives a ranking of the tweets, and the top tweets are mailed off and waiting for you in your inbox. The approach worked quite well: a query with the interests given as google,papers returned tweets that talked about search.
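The pipeline above can be sketched end to end. This is a minimal illustration, not the hack's actual code: the corpus, the tiny synonym table (standing in for NLTK's WordNet synsets), and the choice of k = 2 concept dimensions are all assumptions made for the example.

```python
import numpy as np

# Toy corpus standing in for a user's timeline (illustrative data only)
tweets = [
    "new paper on search ranking accepted",
    "google announces search index update",
    "great hike in the mountains today",
    "reading papers on information retrieval",
]
interests = "google,papers"  # the comma-separated interests from the GUI

# Tiny hand-rolled synonym table; the real app pulled synsets from WordNet
SYNONYMS = {"paper": ["article"], "papers": ["articles"], "search": ["retrieval"]}

def tokens(text):
    """Tokenize and augment each word with its synonyms."""
    words = text.replace(",", " ").lower().split()
    return words + [s for w in words for s in SYNONYMS.get(w, [])]

# Build the term-document count matrix A (rows = terms, columns = tweets)
docs = [tokens(t) for t in tweets]
vocab = sorted({w for d in docs for w in d})
index = {w: i for i, w in enumerate(vocab)}
A = np.zeros((len(vocab), len(tweets)))
for j, d in enumerate(docs):
    for w in d:
        A[index[w], j] += 1

# LSA: truncated SVD of A, keeping k latent "concept" dimensions
k = 2
U, S, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk = U[:, :k], S[:k]
doc_vecs = (np.diag(Sk) @ Vt[:k]).T        # each tweet as a k-dim concept vector

# Fold the query into the same concept space: q_hat = Sk^{-1} Uk^T q
q = np.zeros(len(vocab))
for w in tokens(interests):
    if w in index:
        q[index[w]] += 1
q_hat = np.diag(1.0 / Sk) @ Uk.T @ q

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Rank tweets by cosine similarity to the query in concept space
ranking = sorted(range(len(tweets)),
                 key=lambda j: cosine(doc_vecs[j], q_hat), reverse=True)
for j in ranking:
    print(f"{cosine(doc_vecs[j], q_hat):+.3f}  {tweets[j]}")
```

On this toy data the hiking tweet, which shares no terms (or synonyms) with the query, lands at the bottom of the ranking, while the search- and paper-related tweets rise to the top.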
But there is still a lot of work to be done, for this approach may not work for real-world tweets that mention named entities. For example, a tweet about ECIR will be completely missed, since the app has no idea what ECIR is. Adding world knowledge may lead to a better system, and that is a work in progress.