Customer satisfaction survey by sentiment analysis of tweets using R

June 28, 2013

Heydi ho.

I'm back to continue my intense love affair with R, this time taking things a notch beyond plain popularity analysis: sentiment analysis of tweets, to compare how happy people are with competing products.

This experiment follows Jeffrey Breen's brilliant blog post, but with different products and brands. I won't go into the details of how it was done (the set of slides on that page sums up everything); I'll go straight to my findings.

Also, a word of caution: the post mentioned above was written in 2011, when Twitter was on version 1.0 of its API. That has since given way to version 1.1, and the primary change is that every request to the API must be authenticated using OAuth. To see how to do that, refer to the first part of my post on Twitter popularity analysis using R.

Once you have the text of the tweets, you can follow the instructions in the aforementioned post. My choice of comparison was phone brands, i.e. the big ones of the day: Apple, Samsung, HTC, Sony and Nokia.

So for Apple, I fetched tweets with the keyword "Apple", then "iPhone" and then "iPad", and used rbind (row bind) to combine them into one complete Apple dataframe of tweets.

Next, for Samsung, the keywords were "Samsung" and "Galaxy". For HTC, it was just "HTC", because a search with the keyword "One" would return lots of irrelevant results. For Sony, I searched only for "Xperia", and for Nokia, "Nokia" and "Lumia".
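The row bind step can be sketched as below. This is only an illustration: the three small data frames are made-up stand-ins for the results of the keyword searches, and the column names `text` and `brand` are my assumptions, not something taken from the original code.

```r
# Hypothetical stand-ins for the results of three keyword searches.
# In the real experiment each data frame would hold the fetched tweets.
apple_kw  <- data.frame(text = c("Apple event today", "love my Apple laptop"),
                        stringsAsFactors = FALSE)
iphone_kw <- data.frame(text = "new iPhone camera is great",
                        stringsAsFactors = FALSE)
ipad_kw   <- data.frame(text = c("iPad battery died again", "reading on my iPad"),
                        stringsAsFactors = FALSE)

# rbind stacks data frames that share the same columns, row by row.
apple <- rbind(apple_kw, iphone_kw, ipad_kw)

# Tag every row with its brand so the brands can be compared later.
apple$brand <- "Apple"

nrow(apple)
```

The same pattern would apply to the other brands, each with its own keyword list.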

In the end I had 1000 tweets for each brand. Following Jeffrey's post, I loaded the lexicon of positive and negative English words into R and calculated a score for each tweet. The calculation is simple: match each word of the tweet against the list of positive words and increment the tweet's score by the number of matches, then match each word against the list of negative words and decrement the score by the number of matches. Add it all up and you have the tweet's score. So a score of +2 means a net of two positive words in the tweet, while a score of -1 means a net of one negative word.
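Here is a minimal sketch of that scoring in base R. It is not the function from Jeffrey's post: `score_tweets` is a name I made up for illustration, and the two word lists are tiny toy lexicons standing in for the real one.

```r
# Toy word lists for illustration only; the real lexicon is far larger.
pos_words <- c("good", "great", "love", "brilliant")
neg_words <- c("bad", "awful", "hate", "mediocre")

# score_tweets is a hypothetical helper, not the function from the original post.
score_tweets <- function(tweets, pos_words, neg_words) {
  vapply(tweets, function(tweet) {
    # Lower-case the tweet and strip punctuation, control characters and digits.
    clean <- tolower(gsub("[[:punct:][:cntrl:][:digit:]]", "", tweet))
    # Split the cleaned text into words on whitespace.
    words <- unlist(strsplit(clean, "[[:space:]]+"))
    # Each positive match adds one, each negative match subtracts one.
    sum(words %in% pos_words) - sum(words %in% neg_words)
  }, integer(1), USE.NAMES = FALSE)
}

tweets <- c("I love this phone, great screen",
            "Awful battery, I hate it",
            "It is a phone")
score_tweets(tweets, pos_words, neg_words)
# 2 -2 0
```

Note that a word repeated in a tweet is counted every time it appears, which matches the "number of matches" wording above.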

Is this technique foolproof? No, it fails miserably at sarcasm. A tweet like "You are unmatched and brilliant in your peerless efforts to be mediocre" gets a positive score (+3 - 1 = +2). But in a sample of 1000 tweets we can assume such cases are rare, and that they are largely offset by other tweets.

Plotting histograms of the scores for each brand gave this nice, cheerful-looking graphic.

Next, once you have the score of each tweet, count the very positive tweets (score > +2) and the very negative tweets (score < -2) for each brand. The final score of a brand is the very positive tweets as a percentage of all strongly opinionated tweets: 100 × very positive / (very positive + very negative).
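The arithmetic can be sketched like this. `brand_score` is a hypothetical helper name, the scores are made up, and the rounding to a whole number and the NA for "no strong opinions" are my own choices; only the thresholds come from the description above.

```r
# brand_score is a hypothetical helper; the thresholds follow the text above.
brand_score <- function(scores) {
  very_pos <- sum(scores > 2)    # very positive tweets
  very_neg <- sum(scores < -2)   # very negative tweets
  # With no strongly opinionated tweets the ratio is undefined.
  if (very_pos + very_neg == 0) return(NA_real_)
  round(100 * very_pos / (very_pos + very_neg))
}

# Made-up tweet scores, only to show the arithmetic.
scores <- c(3, 4, 3, -3, 0, 1, -1, 5)
brand_score(scores)
# 80
```

Here four tweets are very positive and one is very negative, so the brand scores 100 × 4 / 5 = 80. Tweets with scores between -2 and +2 are ignored entirely.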

The bar plot of the final scores was as follows:

Now, Nokia's unusually high score got me thinking, so I dug into Nokia's dataframe and found the culprit: a certain user who had been tweeting in Filipino, whose last 85 tweets were identical (containing the words "Nokia Lumia" and "power"), and who had been retweeted over 200 times.
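One way to spot this kind of repetition is to count identical tweet texts. The sketch below uses a made-up data frame with assumed column names `user` and `text`. It also shows how duplicates could be dropped before scoring, which is an optional extra step and not something done for the scores reported here.

```r
# Made-up data frame standing in for one brand's tweets; column names are assumptions.
nokia <- data.frame(
  user = c("a", "b", "b", "b", "c"),
  text = c("Lumia camera is nice", "Nokia Lumia power", "Nokia Lumia power",
           "Nokia Lumia power", "my Nokia fell again"),
  stringsAsFactors = FALSE
)

# Count how often each exact tweet text occurs, most frequent first.
text_counts <- sort(table(nokia$text), decreasing = TRUE)
head(text_counts, 3)

# Optionally keep only the first copy of each text before scoring.
nokia_unique <- nokia[!duplicated(nokia$text), ]
nrow(nokia_unique)
```

Exact matching only catches verbatim repeats; retweets with an added prefix or comment would need some extra cleaning first.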

Otherwise, the result is quite clear. HTC leads with 80, followed closely by Samsung (79) and Sony (76). Apple finished last with 60. Surprised? I wasn't.

Updated for automobile brands.

Ever the auto enthusiast, I decided to do the same experiment with the following biggies from the car industry: Audi, Ford, Hyundai, Honda, Mercedes-Benz, Renault, Toyota and Volkswagen. BMW and General Motors failed to make it owing to a Twitter JSON parsing problem whose cause I'm unsure of myself. Nevertheless, here are the findings.

Forgive me that the Audi bars are so light in colour. You should be able to see them nevertheless.

The final scores were as follows:

So Mercedes-Benz is the clear winner (100), with Toyota second at 92. Renault, with 86, is the only other brand above 80. Ford is at 76, and Hyundai and VW are at 74. Rather surprisingly, Audi and Honda bring up the rear with 71 apiece.