US ELECTIONS – SENTIMENT AND NETWORK ANALYSIS BASED ON TWITTER DATA FEED
Social media has become a norm of life for billions of people. With an ever-increasing amount of time spent online sharing information, opinions and emotions, sites like Facebook, Twitter and Google+ have become data hot spots, waiting to be retrieved and analysed.
Elections are no exception. US President Barack Obama used Facebook and Twitter during his campaigning. In India, too, PM Modi made Facebook and Twitter his primary tools to spread his message and propaganda. The BJP and the Congress were even reported to have hired a ‘Social Media Army’ (source: Firstpost) dedicated to this very purpose.
It comes as no surprise that the twitterati also responded overwhelmingly. Posts and discussions debating the pros and cons became commonplace. ‘Opinion Wars’ were rife. Everybody online had something to say and promote.
Several studies have been done before to tap the sweeping potential of social media data to analyse certain issues. Sentiment Analysis, as the genre is broadly called, refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials (source: Wikipedia).
In this project, we attempt to replicate one such study for the upcoming US elections. We also attempt a modified network analysis of the tentative affiliations of the masses with regard to the political parties.
Twitter's search API only returns tweets from the recent past (roughly the last week). We therefore streamed the live Twitter feed for 20 days for a given set of keywords (justification in ‘Assumptions’ below).
The data collected concerns a subset of the actual voting masses, which for practical purposes should not be assumed to be a uniformly random sample. Possible reasons include non-inclusion of the older populations, frequent over-reaction, ephemeral stand on issues and non-sampling fluctuations.
Several users deny permission for location tracking. The network analysis has been done based only on the available ones. Our assumption here is that the locations are missing at random, which can be justified in the light of lack of conclusive study on a possible skew in location restriction pattern amongst users.
The data collected is restricted to certain keywords concerning just the candidates. The rationale is to exclude possible biases resulting from over-reactions to non-influential topics. Given the short span of data collection, our focus was on reactions to major events, which we assumed would have a much greater impact in the long run than short bursts of emotion.
Tweets have a distinctive lexical structure, and their analysis poses a special problem. We chose to ignore terms that were infrequent, assuming that most of a tweet's content is carried by its hashtags and by words expressing emotion.
With 341 days to go until the 2016 US Presidential Elections, there has been plenty of interesting talk about candidates such as the 15-year-old Deez Nuts, the potential woman president Hillary Clinton, the billionaire Republican Donald Trump, and Bernie Sanders.
There are two major parties: the Democratic Party (left wing, blue) and the Republican Party (right wing, red). Within each party, candidates announce their candidacy for the presidential election (which they may withdraw during the pre-election year), typically speaking against the other party while trying to outshine the other candidates in their own. In the run-up to the election (March 2015 to November 2016), debates are held within each party to help it decide on one candidate as its nominee for president (the pre-elections). This is when the candidates address issues related to the economy, foreign policy and immigration. Every day is a new opportunity for candidates to highlight their candidacy and change the perceptions of US citizens in their favour. But who will finally win the elections? Which party currently leads in public support, and which one is planning a strategy against the other?
Our project, sentiment analysis using tweets from Twitter, attempts to answer some of these questions by scoring the parties based on tweet text.
We captured all tweets carrying the hashtags of any of the candidates and analyzed the data by assigning sentiment scores based on the tweet text. In the process, we performed clustering (on the basis of left and right wings) and further classification to find the two candidates most likely to win the presidency next year, based on current trends.
The data went through several stages of analysis, some of them independent. We introduce the procedures as smoothly and briefly as possible, and have tabulated our experiences with different packages for reference.
Data Collection: Since we did not have any readily available data at our hands, data collection was also a big step. It was important to get tweets continuously over a long period of time along with as many details as possible.
We started with the twitteR package in R and collected tweets, but soon faced problems storing them in files. With the limited RAM on our laptops, it was hard to keep accumulating tweets, and R would eventually crash. Moreover, error handling was very complex.
Whenever we encountered a tweet mentioning the name of any US election candidate or one of their hashtags, we captured it. For this, we used the Python tweepy package to stream data and stored it in JSON files every day. The data was later converted to CSV to make it more efficient to process in R.
We kept this up for a period of 30 days (the whole of November 2015), changing the filename as the day changed, and ended up with about 40 GB of data. That’s a fat tweet!
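The day-wise file rotation described above can be sketched as follows. This is a minimal illustration, not the collection script we actually ran: the class name `DailyTweetWriter`, the `tweets_YYYY-MM-DD.json` naming scheme and the one-JSON-object-per-line layout are all assumptions made for the example. The tweepy streaming callback is deliberately left out; it would simply call `write()` on each incoming tweet.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


class DailyTweetWriter:
    """Append raw tweets to one JSON-lines file per calendar day (hypothetical helper)."""

    def __init__(self, out_dir, prefix="tweets"):
        self.out_dir = Path(out_dir)
        self.out_dir.mkdir(parents=True, exist_ok=True)
        self.prefix = prefix

    def path_for(self, when):
        # The file name changes as soon as the calendar day changes.
        return self.out_dir / f"{self.prefix}_{when:%Y-%m-%d}.json"

    def write(self, tweet, when=None):
        # Default to the current UTC time when no timestamp is supplied.
        when = when or datetime.now(timezone.utc)
        path = self.path_for(when)
        # Append and close on every tweet, so nothing accumulates in RAM.
        with path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(tweet, ensure_ascii=False) + "\n")
        return path
```

Writing each tweet straight to disk trades some speed for a flat memory footprint, which is the problem that made the in-memory R approach crash.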
Tweets are filled with abbreviations and evolving lexical structures, which proved to be a major challenge. We built on the existing tm package in R to work around this issue.
Data cleaning was a multi-step process. The first level of cleaning was done in Python while converting to CSV: we removed emojis and other follower-related information.
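A rough sketch of this first cleaning pass is given below. The emoji ranges are an approximation rather than an exhaustive list, and the set of columns kept (`KEEP_FIELDS`) is an assumption for illustration; the field names `id_str`, `created_at`, `text`, `lang` and `user.location` follow the classic tweet JSON layout, and the input is assumed to hold one JSON object per line.

```python
import csv
import json
import re

# Approximate emoji / pictograph ranges (an assumption, not exhaustive).
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF\uFE0F\u200D]"
)

# Columns kept in the CSV; follower-related fields are simply not copied.
KEEP_FIELDS = ["id", "created_at", "text", "lang", "location"]


def clean_text(text):
    """Strip emojis and collapse runs of whitespace."""
    return " ".join(EMOJI_RE.sub("", text).split())


def tweet_to_row(tweet):
    """Flatten one tweet dict into the columns listed in KEEP_FIELDS."""
    user = tweet.get("user") or {}
    return {
        "id": tweet.get("id_str", ""),
        "created_at": tweet.get("created_at", ""),
        "text": clean_text(tweet.get("text", "")),
        "lang": tweet.get("lang", ""),
        "location": user.get("location") or "",
    }


def json_lines_to_csv(json_path, csv_path):
    """Convert a JSON-lines file to CSV and return the number of rows written."""
    count = 0
    with open(json_path, encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=KEEP_FIELDS)
        writer.writeheader()
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank keep-alive lines
            writer.writerow(tweet_to_row(json.loads(line)))
            count += 1
    return count
```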
In R, we used a standard dictionary of positive and negative words rather than creating a customized one, accepting a slight dip in the accuracy of the word clouds in return. To counter this loss, we classified tweets by party affiliation and then assigned scores (both positive and negative) to them.
We made the following assumptions. First, we would only use scores generated from the tweet text rather than from the entire tweet record. Second, since only two parties are being monitored, all scores are converted to positive: a negative score for one party is counted as a positive score for the opposing party.
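The two assumptions can be illustrated with a small sketch. The word lists below are toy stand-ins for the standard dictionary we used in R, and the function names are hypothetical; only the sign-flipping rule is taken from the text.

```python
import re

# Toy stand-ins for a standard opinion lexicon (assumption for illustration).
POSITIVE = {"good", "great", "win", "support", "love"}
NEGATIVE = {"bad", "worst", "lose", "hate", "liar"}

OPPONENT = {"democratic": "republican", "republican": "democratic"}


def text_score(text):
    """Count positive words minus negative words in the tweet text only."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def party_credit(text, party):
    """Return (credited_party, non_negative_score).

    A negative score for `party` is counted as a positive score
    of the same magnitude for the opposing party.
    """
    score = text_score(text)
    if score < 0:
        return OPPONENT[party], -score
    return party, score
```

For example, a tweet affiliated with the Republican side whose text scores -2 would be credited as +2 to the Democratic side.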
We began by trying various classification methods to identify the affiliation of a random tweet. But as complexity grew and time ran short, we had to make certain assumptions to simplify the task, given the limited computational capacity of R.
Our assumption about tweets with long-term impact was helpful here: we chose to keep only those tweets that contained certain hashtags, including candidate names and their agendas.
Thus, we ended up with the tweets auto-classified, with a negligible error rate.
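A hedged sketch of this hashtag-based auto-classification follows. The keyword-to-party map is illustrative only (it covers the candidates named in this write-up, and the keywords themselves are assumptions), and dropping tweets that mention both sides is a simplification chosen for the example, not necessarily what our R code did.

```python
import re

# Illustrative keyword -> party map; the real hashtag list was longer.
HASHTAG_PARTY = {
    "hillary": "democratic",
    "clinton": "democratic",
    "bernie": "democratic",
    "sanders": "democratic",
    "omalley": "democratic",
    "trump": "republican",
}


def hashtags(text):
    """Return the lower-cased hashtags of a tweet, without the leading '#'."""
    return [tag.lower() for tag in re.findall(r"#(\w+)", text)]


def affiliation(text):
    """Label a tweet by the party its hashtags point to.

    Returns None when no known hashtag is present or when hashtags
    from both parties appear (ambiguous tweets are filtered out).
    """
    parties = set()
    for tag in hashtags(text):
        for keyword, party in HASHTAG_PARTY.items():
            if keyword in tag:
                parties.add(party)
    return parties.pop() if len(parties) == 1 else None
```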
Sentiments can be analyzed along various available factors: the number of users, their language, their location, or the candidates themselves. We ran tests with various packages, namely Sentiment and tm.
Due to time constraints, we did not dwell on the nuances of the packages and decided to go ahead with the tm sentiment analyser and the tm general enquirer.
We then had to choose a method of analysis and a notion of distance for comparing candidates. We moved from the Naive Bayes method to various more complex algorithms, creating linear and nonlinear but consistent distances, and finally formed generalized views using timeline plots, mean plots, day-wise trends, cumulative proportions and stacked box plots. We assembled all our results in a dashboard created with the shiny package in R.
Looking at the generalized, timeline and day-wise plots, we could make the following inferences.
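The day-wise trends and cumulative proportions behind those plots amount to a simple aggregation, sketched below. The record layout `(day, party, score)` and the function names are assumptions for illustration, days are assumed to be ISO date strings so that they sort chronologically, and the plotting itself (done in R with shiny) is not reproduced.

```python
from collections import defaultdict


def daily_totals(records):
    """Sum scores per day and party.

    records: iterable of (day, party, score) tuples.
    Returns {day: {party: total_score}}.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for day, party, score in records:
        totals[day][party] += score
    return {day: dict(by_party) for day, by_party in totals.items()}


def cumulative_proportions(totals, parties=("democratic", "republican")):
    """Return [(day, {party: share of all score accumulated so far})]."""
    running = {party: 0.0 for party in parties}
    out = []
    for day in sorted(totals):
        for party in parties:
            running[party] += totals[day].get(party, 0.0)
        grand = sum(running.values())
        shares = {p: (running[p] / grand if grand else 0.0) for p in parties}
        out.append((day, shares))
    return out
```

Each tuple in the result corresponds to one point on a cumulative-proportion timeline, with the shares of the two parties summing to 1 whenever any score has been recorded.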
- Hillary Clinton tends to use actions such as praising women to win their support. She also avoids traditional methods to set herself apart from other candidates.
- The right wing candidate Donald Trump is using disruptive methods to win support. He uses
- Currently Hillary Clinton has majority in the Democratic party (followed by Bernie Sanders then Martin O’Malley)
- There is no clear majority among the Republicans. However, Donald Trump is currently leading the pack
- Hence, Hillary Clinton from Democratic party and Donald Trump from Republican party seem to be the likely winners of the pre-elections
- When pitted against each other, Donald Trump surpasses Hillary Clinton as of 3rd December 2015.
However, this cannot be taken as the final verdict of the presidential elections since these observations are based on a month’s data while there are still 11 months to go for the final elections.
We also note some spikes in popularities of the parties. Combined with the overall graph, we notice the following:
- Popularity of the democratic party is going down by the day
- 5 Nov: Mark Everson withdrew his candidacy for the Republican nomination. The Republicans took a popularity dip and the Democratic party received a boost.
- 14 Nov: Second Democratic debate, in which Hillary Clinton emerged as the winner of the polls, resulting in a huge spike for the Democratic party.
- Last week of November: in the wake of the Paris attacks (13 Nov), the Republicans took the event seriously, and Donald Trump spoke to the public about destroying ISIS completely. This was a turnaround, with the Republicans surpassing the Democratic party in popularity.
We used a rather non-traditional network analysis method. We contacted Victor from UCSB, who helped us understand the concept of Social Network Distance (SND). He helped us with the code as well, but in the end we were unable to obtain a useful notion of distance in the given time frame. SND is a distance measure designed for comparing snapshots of a social network containing polar (competing) opinions (source: UCSB).
We tried to increase and simplify our understanding of the candidates but sadly, there were no visible clusters among the candidates even though every candidate was from one of the two parties in reality. We then resorted to day-wise and overall generalized analysis of the candidates to form our inferences.
There were many lessons to be learned from this. Whether or not sampling is performed, the data is very skewed, and classification and clustering should always be approached with the mindset that clusters may not exist, for any number of reasons. While it is tempting to expect clusters and force candidates into them when a large amount of data is available (as in our case), overfitting would always lead to inappropriate results.
We started very ambitiously with the project. The initial plans were followed by extreme experimentation with packages and different hypotheses were tested. What followed out of the commotion was the realization of various limitations in the domain of memory and algorithmic complexity.
The data was very big to begin with, and the initial processing posed many problems. R was a major roadblock in processing big files. Then came the packages – their applications were limited to the purposes they were specifically designed for – thus leaving very little control of the code implementation for large datasets.
The failure to obtain the network graph was one of the major setbacks to our initial plans. In hindsight, one reason for it was our lack of definitive, concrete knowledge about the notion of distances.
This study was done purely for academic purposes, with limited resources and time. Anyone using the inferences for business decisions does so at their own risk. The contributors will, however, be happy to share the details of the project on request.
Our team ardently supports the Open Source Community, and acknowledges their role in silently making this world a better place. We would like to thank the R-Community, CRAN, and StackOverflow for their immense contributions.
Dr. Sourav Sen Gupta, Indian Statistical Institute, Kolkata, took us on a roller-coaster ride across various techniques and aspects of data analysis. He has been inspirational throughout the semester and has helped us develop a knack for the field.
We would also like to acknowledge the contributions of the following:
- Google search
- Victor from UCSB
Chandra Bhanu Jha, Niten Singh, Deepu Unnikrishnan, Madhur Modi
IIM Calcutta, ISI Kolkata, IIT Kharagpur