Studying Twitter Sentiment of Football Superstars


Micro-blogging activity has grown phenomenally over the last few years. Platforms like Twitter offer an easy outlet for people to express their opinions, and companies are increasingly interested in capturing these insights about customer behaviour and preferences, which could help generate more revenue. The staggering amount of data that these sites generate cannot be analysed manually. Enter sentiment analysis, the field in which we teach machines to understand human sentiment.

Traditionally, sentiment analysis, which falls under the umbrella term ‘text mining’, has focused on larger pieces of text like movie reviews or news articles. On Twitter, however, people post informal messages of up to 140 characters called tweets. Analysing sentiment from these tiny pieces of text is challenging because of their unstructured nature: internet slang, abbreviations, non-conventional spelling and grammar, hashtags, URLs and emoticons are just some of the complexities that need to be addressed.

Understanding the nuances of any language is another major problem. Training machines to deal with complex grammatical structures, ambiguity, irony, sarcasm etc. is still an active area of research. Identifying the target for which we are analysing sentiment is also important. For instance, a tweet comparing two players using a qualifier like ‘better’ or ‘worse’ would be labelled positive or negative depending on the target.

Our objectives

  1. Performing sentiment analysis on Twitter data.
  2. Extracting sentiment and gauging popularity of different players of the English Premier League from their Twitter footprint.

The basic question we ask in this project is whether a given tweet about a football player is positive, negative or neutral. The aggregate sentiment for a player is then used to gauge that player's overall popularity online. This information could be useful for companies looking for potential endorsees.

Overview of methodology

We collected tweets using the Twitter API in Python. The tweet data consisted of text and metadata such as the time zone, the Twitter handle of the poster and the re-tweet count. We performed cleaning operations and feature extraction on the noisy data (discussed later) and manually labelled about 4669 unique tweets.

Using this training data, we trained several classifiers (Random Forest, Maximum Entropy, SVM, Naïve Bayes etc.). We selected the best classifier by testing on a dataset of about 1399 tweets, of which 234 were negative, 380 were neutral and 785 were positive.

The next part of the problem is more exploratory in nature. Based on the inferences of our previous analysis, we try to determine which players evoke the most sentiment (positive or negative) online, how a player's performance affects their popularity ratings and how two players compare against each other.

Performance metric

We rarely observe a balanced frequency distribution of data across the different classes. Accuracy, the ratio of correct classifications to total classifications, thus becomes a skewed measure of performance. To use accuracy as a performance metric, we should interpret the results relative to the baseline case, i.e. the accuracy obtained by always predicting the most frequent class.
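As a rough illustration of the baseline case, the sketch below takes the class counts of the test set described earlier (234 negative, 380 neutral, 785 positive) and computes the accuracy of a classifier that always predicts the majority class. The function name is ours and is only meant as an illustration.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy of a classifier that always predicts the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total


# Class counts of the test set quoted in the methodology overview
test_counts = {"negative": 234, "neutral": 380, "positive": 785}

baseline = majority_baseline_accuracy(test_counts)
print(f"Majority-class baseline accuracy: {baseline:.1%}")
```

Any reported accuracy is best read against this figure (roughly 56% for this test set) rather than against 0%.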

An alternative performance measure is the F-score, the harmonic mean of precision and recall: F = 2 × precision × recall / (precision + recall).

Predicted \ Actual | True | False
True | True Positive (TP) | False Positive (FP)
False | False Negative (FN) | True Negative (TN)

Precision is the fraction of predicted positives that are correct: TP/(TP+FP).

Recall (or sensitivity) is the fraction of actual positives that are correctly identified: TP/(TP+FN).
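The definitions above can be sketched in a few lines of Python. The per-class counts used here are made up purely for illustration and are not results from this project. The macro average simply takes the unweighted mean of the per-class F-scores.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F-score) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def macro_f1(per_class_counts):
    """Unweighted mean of the per-class F-scores."""
    scores = [precision_recall_f1(*counts)[2] for counts in per_class_counts.values()]
    return sum(scores) / len(scores)


# Hypothetical (TP, FP, FN) counts per class, for illustration only
example_counts = {
    "positive": (80, 10, 20),
    "neutral": (30, 15, 10),
    "negative": (20, 5, 5),
}

for label, counts in example_counts.items():
    p, r, f = precision_recall_f1(*counts)
    print(f"{label:8s} precision={p:.3f} recall={r:.3f} F={f:.3f}")
print(f"macro-averaged F = {macro_f1(example_counts):.3f}")
```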

Data cleaning, feature extraction and reduction

The cleaning, feature extraction and tokenization were done in R using regular expressions and the R packages tm and koRpus. We performed the following cleaning steps on the tweets:

  1. Remove duplicate tweets, tweets with no relation to the EPL and other junk tweets. We also removed Twitter handles, re-tweets (RTs) and Unicode for special characters (currency symbols, ellipses etc.).
  2. Split words in camel case: Several tweets (especially ones that contain hashtags) have conjoined words written in camel case, e.g. BringItOn, YouOnlyLiveOnce etc. We split these into their individual components.
  3. Repeated letters: Some users happily discard all accepted norms of grammar and spelling and prefer their own creative spellings. To bring down the number of features, we reduce any run of a repeated letter to at most two occurrences. For example, Helllooo would be converted to Helloo; similarly, soooo happppy would be converted to soo happy.
  4. Emoticons: Tweets are rich in emoticons, and users frequently use them to express their emotions. To gain insights from these emoticons, we prepared an emoticon dictionary that maps an emoticon (in Unicode format) to its sentiment.
Emoticon dictionary for mapping unicode to text


  5. Slang: We create a slang/abbreviation dictionary and map the terms to their corresponding full forms. It also contains some commonly misspelt words; for example, alryt, alrite etc. are mapped to the word alright. An excerpt is shown below:
Slang word | Mapping
awsm | awesome
b4 | before
bcuz | because
kinda | kind of
lol | laughing out loud
  6. Tweets with comparison: A tweet may contain qualifiers like ‘better’ or ‘worse’, for example:

“Ozil is way better than Coutinho in terms of Big Chances created per game”

The word better appears after the player Ozil, so this tweet is positive for Ozil but negative for Coutinho. In all tweets about Ozil, we replace better with the indicator term KPOS, and in all tweets about Coutinho we replace better with the indicator term KNEG. We follow a similar approach if the tweets contain the qualifier ‘worse’.
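The text-rewriting steps described so far (camel-case splitting, squeezing repeated letters, slang expansion and comparison tagging) can be sketched with regular expressions. The original work was done in R; the Python below is only an illustrative approximation, the function names are ours, and the slang dictionary contains just the entries from the excerpt above. The comparison rule assumes that the sign depends only on whether the player's name appears before or after the qualifier, as in the Ozil example.

```python
import re

# Excerpt of a slang dictionary (only the entries listed in the text)
SLANG = {"awsm": "awesome", "b4": "before", "bcuz": "because",
         "kinda": "kind of", "lol": "laughing out loud"}


def split_camel_case(text):
    # Insert a space wherever a lowercase letter is followed by an uppercase one
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)


def squeeze_repeats(text):
    # Reduce runs of three or more identical letters to two letters
    return re.sub(r"([A-Za-z])\1{2,}", r"\1\1", text)


def expand_slang(text, slang=SLANG):
    # Replace each whitespace-separated word that has a dictionary entry
    return " ".join(slang.get(word.lower(), word) for word in text.split())


def tag_comparison(text, player):
    """Replace 'better'/'worse' with KPOS/KNEG from the point of view of `player`."""
    player_match = re.search(re.escape(player), text, flags=re.IGNORECASE)
    if player_match is None:
        return text

    def replace(match):
        player_first = player_match.start() < match.start()
        if match.group(0).lower() == "better":
            favourable = player_first
        else:  # 'worse' flips the sign
            favourable = not player_first
        return "KPOS" if favourable else "KNEG"

    return re.sub(r"\b(better|worse)\b", replace, text, flags=re.IGNORECASE)


tweet = "Ozil is way better than Coutinho in terms of Big Chances created per game"
print(split_camel_case("BringItOn"))
print(squeeze_repeats("soooo happppy"))
print(expand_slang("awsm goal b4 halftime"))
print(tag_comparison(tweet, "Ozil"))
print(tag_comparison(tweet, "Coutinho"))
```

A position-based rule like this is deliberately simple; it would mislabel tweets with negation or with more than two players, which a fuller treatment would need to handle.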

The visualizations below show decision trees of the most important features used in a 2-class classification problem, without and with the KPOS/KNEG feature (TRUE denotes negative sentiment and FALSE denotes positive sentiment).

DTREE without using the KPOS/KNEG feature
DTree when KPOS/KNEG feature is used
  7. We perform some more filtering operations, such as removing punctuation, numbers and extra whitespace and applying Porter stemming, and then construct our corpus for tokenization.
  8. Tokenisation: We use the tm package to create our document-term frequency matrix, which we then tokenize using built-in R functions.
  9. Lexicon scoring: We use Bing Liu’s opinion lexicon, which consists of 2006 positive words and 4783 negative words. After tokenizing our corpus, we count the occurrences of positive and negative words in each tweet and create a feature that measures the normalized score of net-positive sentiment.
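A lexicon score of this kind might look like the sketch below. The exact normalization used in the project is not stated, so dividing the net count by the number of matched words is an assumption on our part, and the two tiny word sets are stand-ins rather than entries quoted from the actual lexicon.

```python
def lexicon_score(tokens, positive_words, negative_words):
    """Net-positive sentiment normalized to the range [-1, 1].

    Assumed normalization: (positives - negatives) / (positives + negatives).
    Tweets with no lexicon matches score 0.
    """
    pos = sum(1 for token in tokens if token in positive_words)
    neg = sum(1 for token in tokens if token in negative_words)
    matched = pos + neg
    return (pos - neg) / matched if matched else 0.0


# Stand-in word sets for illustration; the real lexicon is far larger
POSITIVE = {"good", "great", "brilliant"}
NEGATIVE = {"bad", "poor", "awful"}

tokens = "what a great goal but awful defending and poor passing".split()
score = lexicon_score(tokens, POSITIVE, NEGATIVE)
print(f"lexicon score = {score:.3f}")
```

Normalizing by the number of matched words keeps the feature on the same scale for short and long tweets; dividing by the total token count would be an equally plausible choice.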

Experimental findings

We experimented with several classifiers: Naïve Bayes, Decision Trees, Random Forest, SVM, Logit Boost and MaxEnt. Their performance on the 3-way sentiment classification problem (positive/neutral/negative) is shown in the chart below. Naïve Bayes performed the worst; Random Forest did much better but took more computation time; Logit Boost performed slightly better than Random Forest; and MaxEnt performed best by a clear margin (macro-averaged F-score of 0.96). For our further analysis we therefore restricted ourselves to the MaxEnt classifier.

F-scores for 3-class classification of tweets

A note on the Maximum Entropy Classifier (MaxEnt)

The Maximum Entropy model assumes that, in the absence of additional information or constraints, the distribution is uniform (entropy is maximum). For example, consider a completely biased coin (probability of heads = 1): the entropy associated with the outcome of tossing the coin is zero, as there is no uncertainty regarding its outcome. For an unbiased coin, however, a toss is equally likely to come up heads or tails, so uncertainty, and hence entropy, is at its maximum.
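The coin example can be checked numerically with the Shannon entropy H = -Σ p log p. The sketch below uses logarithms to base 2, so entropy is measured in bits; the choice of base is our assumption, since the text does not specify units.

```python
import math


def binary_entropy(p_heads):
    """Shannon entropy (in bits) of a coin with P(heads) = p_heads."""
    entropy = 0.0
    for p in (p_heads, 1.0 - p_heads):
        if p > 0:  # treat 0 * log(0) as 0
            entropy -= p * math.log2(p)
    return entropy


for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"P(heads) = {p:.1f}  H = {binary_entropy(p):.3f} bits")
```

The entropy is zero for the fully biased coin and peaks at one bit for the fair coin.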

Entropy (H) vs probability of heads appearing

Adding more features or constraints lowers the maximum entropy and brings the model closer to the actual distribution of the data (i.e. increases the likelihood of obtaining that data). In the context of sentiment analysis, the features are the words as well as our engineered features like the lexicon score. Each feature contributes a weighted vote towards the possible classes (positive/negative/neutral) of the tweet, and the net score for each class is given by a linear combination of all features.

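More formally, the conditional probability of a class c given a document d can be expressed as:

$$P(c \mid d, \lambda) = \frac{\exp\left(\sum_i \lambda_i f_i(c, d)\right)}{\sum_{c'} \exp\left(\sum_i \lambda_i f_i(c', d)\right)}$$

Here the λ_i are the weights assigned to the different features f_i; they are found by maximizing the log of the conditional probabilities of the training data.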

The MaxEnt model does not assume conditional independence of features (unlike Naïve Bayes) and thus proves to be much better when dealing with correlated features.
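To make the scoring step concrete, the sketch below computes class probabilities in the standard MaxEnt form: the weights of the active (feature, class) pairs are summed per class, exponentiated and normalized over the classes. It is not the implementation of the R maxent package, and the weights are hypothetical values chosen by hand rather than learned from data.

```python
import math


def maxent_probabilities(active_features, weights, classes):
    """P(c | d) proportional to exp(sum of weights of (feature, class) pairs active in d)."""
    scores = {
        c: sum(weights.get((feature, c), 0.0) for feature in active_features)
        for c in classes
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    exp_scores = {c: math.exp(s - top) for c, s in scores.items()}
    normalizer = sum(exp_scores.values())
    return {c: value / normalizer for c, value in exp_scores.items()}


CLASSES = ("positive", "neutral", "negative")

# Hypothetical weights (lambda values); a real model learns these from training data
WEIGHTS = {
    ("great", "positive"): 1.5,
    ("KPOS", "positive"): 1.0,
    ("awful", "negative"): 2.0,
}

print(maxent_probabilities(["great", "KPOS"], WEIGHTS, CLASSES))
print(maxent_probabilities([], WEIGHTS, CLASSES))  # no evidence: uniform distribution
```

With no active features every class gets the same probability, which is the maximum-entropy behaviour described earlier.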

For our implementation we used the maxent package in R which is specially designed to minimize memory consumption on very large datasets like the sparse document-term matrices created in text mining.

To understand which features mattered most for the accuracy of our MaxEnt classifier, we used an elimination approach. The chart below shows the effect on accuracy of removing a particular engineered feature, e.g. the lexicon score, emoticon mapping or slang mapping. We find that the engineered features improve accuracy: a classifier that uses all of them reaches 97.35%, compared with 96.9% for a classifier that uses none of them, a gain of 0.45 percentage points. Removing the lexicon score leads to the largest drop in accuracy, so it is indeed an important feature for our analysis. We considered only unigrams here, as accuracy did not change significantly when bigrams/trigrams were included. This is perhaps because tweets are too short for these features to be useful.

Comparison of accuracy for different features

More details about the maximum entropy classifier can be found in Chris Manning’s course on Natural Language Processing.

Determining popularity of football players

After classification, we use Tableau to create visualisations that show how tweets about a particular player vary over time. The plots below show the overall sentiment (sum of all ratings) for a player over time. Jamie Vardy became a sensation online after his record 11-game scoring run in the Premier League, even trumping Messi in ratings. A comparison chart for the two is shown below.

Time series plot of sentiment ratings of Messi and Vardy
Time series plot of ratings of Messi and Vardy

To get a clearer picture of the popularity of players across geographies, we plotted heat maps depicting the average sentiment for a player across different countries. The chart below shows the sentiment heat map for Harry Kane. All tweets that did not contain a geo-location were mapped to Madagascar.
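The two aggregations behind these charts, the sum of ratings per day and the average rating per country, can be sketched as follows. The project used Tableau for this step; the Python below is only an illustration, the tweet records are invented, and ratings are assumed to take the values -1, 0 and +1.

```python
from collections import defaultdict

# Hypothetical classified tweets: (date, country or None, rating in {-1, 0, +1})
tweets = [
    ("2015-11-21", "United Kingdom", 1),
    ("2015-11-21", None, 1),
    ("2015-11-21", "India", -1),
    ("2015-11-28", "United Kingdom", 1),
    ("2015-11-28", "India", 0),
]


def sentiment_by_day(rows):
    """Overall sentiment per day: the sum of all ratings on that day."""
    totals = defaultdict(int)
    for day, _country, rating in rows:
        totals[day] += rating
    return dict(totals)


def mean_sentiment_by_country(rows, missing_label="Madagascar"):
    """Average rating per country; tweets without a location go to `missing_label`."""
    sums, counts = defaultdict(int), defaultdict(int)
    for _day, country, rating in rows:
        key = country if country is not None else missing_label
        sums[key] += rating
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


print(sentiment_by_day(tweets))
print(mean_sentiment_by_country(tweets))
```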

Heatmap showing sentiment for Harry Kane. Ratings vary from -1 (negative) to +1 (positive). NAs mapped to Madagascar.

We discuss the magic quadrant representation for footballers next. The magic quadrant puts the share (or total number) of tweets on the x axis and the percentage of positive tweets on the y axis, creating four classes into which players can be slotted from a marketing perspective.

Magic Quadrant for some Premier League players

This visualization is an intuitive way to compare potential investments both within a domain and across domains. It's old news now that Tata signed Lionel Messi as their brand icon in India, and if you look at the magic quadrant for the top footballers today this seems perfectly logical. The first quadrant features the stars of this era, personalities who score highly both on the quantity of content they generate and on the overall positivity associated with them. Usual suspects like Messi, Neymar and Ronaldo feature in this quadrant. The 2nd quadrant features critically acclaimed players. These players may be talked about less than the stars but show a high proportion of positive responses. The 3rd quadrant is a marketing team’s nightmare: unless they are terrific players, they wouldn’t bring in much revenue. The 4th quadrant represents the players who can be the villains of the world, so to say. They may be hated, but they definitely know how to be in the headlines.
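Slotting players into the four quadrants amounts to thresholding the two axes. The sketch below assumes both axes are expressed as fractions between 0 and 1 and uses a cut-off of 0.5 on each; the text does not state where the lines were drawn, so these cut-offs, like the player figures, are hypothetical.

```python
def magic_quadrant(tweet_share, positive_share, share_cutoff=0.5, positive_cutoff=0.5):
    """Return the quadrant number (1 to 4) for a player.

    1: talked about a lot and positively (stars)
    2: talked about less but positively (critically acclaimed)
    3: talked about less and not positively
    4: talked about a lot but not positively (villains)
    """
    high_volume = tweet_share >= share_cutoff
    high_positive = positive_share >= positive_cutoff
    if high_volume and high_positive:
        return 1
    if high_positive:
        return 2
    if not high_volume:
        return 3
    return 4


# Hypothetical (tweet share, positive share) values for illustration
players = {
    "Player A": (0.80, 0.75),
    "Player B": (0.20, 0.85),
    "Player C": (0.10, 0.30),
    "Player D": (0.70, 0.25),
}

for name, (volume, positive) in players.items():
    print(f"{name}: quadrant {magic_quadrant(volume, positive)}")
```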

Both the trend charts and the heat maps for each of these 4 types of players show obvious differences. Minute-level line charts that show instantaneous shifts in ratings after events like goals, send-offs and controversial statements allow PR and marketing teams to act almost immediately when something alarming is picked up. Another decision this could help drive is whether to buy or sell a player. Decisions like these have a direct impact on revenue in terms of shirt sales, tickets sold and merchandise bought. For example, the ISL team Atlético de Kolkata tries to buy ageing foreign players as its icons, as they earn more revenue off the field.

Resources and further reading

The presentation can be found here: Group 2_presentation. For a primer on Twitter sentiment analysis, these resources are useful to get started with:

  1. B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval.
  2. A. Pak and P. Paroubek. Twitter as a Corpus for Sentiment Analysis and Opinion Mining.
  3. A. Agarwal et al. Sentiment Analysis of Twitter Data.
  4. Saif Mohammad’s talk at EMNLP-2014.

CDS Group-2

Ritwik Moghe|Pranita Khandelwal|Riju Bhattacharyya|Sushant Rajput