Modelling of Customer Behavior


In an environment where the only constant is innovation, the high-tech and telecommunications sectors are grappling with rapid changes in customer behavior and the competitive landscape. As the industry shifts its focus from products to solutions and emphasizes mobility and cloud-based delivery models, high-tech and telecommunications companies are rethinking all aspects of their go-to-market strategy:

  • How do we compete in an environment where customers are demanding outcome-focused, managed services delivered on a subscription basis?

  • How can we develop deep customer insights when often we don’t have a direct relationship with the end user?

  • What do we do with all the data generated, and how do we leverage it to make smart business decisions?

  • How can we anticipate or, better yet, drive changes in the market that will give us a competitive advantage?

Understanding customer behavior and purchase patterns is essential for making better business decisions in this competitive world.


The challenge is to model the customers of a telecom company and predict their propensity to buy an add-on service. The telecom sector is trying to leverage the data it generates and, in turn, is rethinking its business strategies in order to maximize ROI. In today’s world, many factors influence a customer’s decision to buy a product or service, so it is essential to reach out to the masses in an efficient manner.

Our objective is to build a predictive model that uses the data at hand to predict whether a customer will opt for the add-on or not. The performance of the model can be judged by the accuracy of its predictions. We will discuss the various methods and techniques required to build the model; some familiarity with random forests and a few R packages will help in following along.



We take a holistic approach to solving the problem, following the steps given below.



This stage focuses on understanding the data through some preliminary statistics. Our data has 30,000 observations with 190 explanatory variables. We split the data into training, testing and validation sets.

A peculiarity of the data is that the explanatory variables carried no labels, which made it difficult to relate them to the response variable. Another important feature of the data is its missingness: only 38 variables contained substantial data, while the rest had almost 90 percent missing values.



In this step, we remove the features (explanatory variables) that have more than 90% missing values with missingness completely at random (MCAR). We then split the data into the two response groups (zeros and ones) and check whether the spread of each explanatory variable differs between the groups. This tells us about the influence of each explanatory variable on the response variable.

For the remaining 38 variables we performed multiple imputation using the MICE package; the imputation method we used is Predictive Mean Matching (PMM). The mice package in R helps impute missing values with plausible data values, drawn from a distribution designed specifically for each missing data point.
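To illustrate the idea behind Predictive Mean Matching (a minimal single-variable Python sketch, not the mice implementation; `pmm_impute` and its toy data are our own):

```python
import numpy as np

def pmm_impute(x, y, k=5, rng=None):
    """Predictive mean matching for one incomplete variable y using one
    complete predictor x. This is a single-variable sketch of what mice's
    'pmm' method does; real MICE cycles such draws over every incomplete
    variable for several iterations and multiple imputed data sets."""
    if rng is None:
        rng = np.random.default_rng(0)
    obs = ~np.isnan(y)
    # fit a simple linear regression on the observed rows
    b1, b0 = np.polyfit(x[obs], y[obs], 1)
    pred = b0 + b1 * x
    y_imp = y.copy()
    for i in np.where(~obs)[0]:
        # donors: observed rows whose predicted mean is closest to this row's
        donors = np.argsort(np.abs(pred[obs] - pred[i]))[:k]
        # impute with an actually observed value drawn from the donor pool
        y_imp[i] = y[obs][rng.choice(donors)]
    return y_imp
```

Because imputed values are drawn from actually observed donor values, PMM never produces implausible values such as negative counts.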



In order to generate insights from the data, we need to build a prediction model based on it. As we have limited resources with medium computing power, we decided to select only the important features to save time and memory. Feature selection is therefore an important criterion for the modelling.

We first compute the correlation matrix and draw a heat map for the 38 features, to select features that are independent of each other. We then perform PCA (Principal Component Analysis) on these variables to select the explanatory variables that explain most of the variance in the data set.
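The two steps can be sketched as follows (an illustrative Python example on synthetic data; the 0.9 correlation cutoff and the toy variables are our own assumptions, not values from the original analysis):

```python
import numpy as np

rng = np.random.default_rng(1)
# toy stand-in for the retained variables; column 1 is made highly correlated with column 0
X = rng.normal(size=(500, 6))
X[:, 1] = X[:, 0] * 0.95 + rng.normal(scale=0.1, size=500)

# 1) correlation matrix: flag the later variable of each highly correlated pair
corr = np.corrcoef(X, rowvar=False)
redundant = {j for i in range(corr.shape[0])
               for j in range(i + 1, corr.shape[0]) if abs(corr[i, j]) > 0.9}
keep = [i for i in range(X.shape[1]) if i not in redundant]

# 2) PCA on the remaining columns: rank components by explained variance
Xk = X[:, keep] - X[:, keep].mean(axis=0)
eigvals = np.linalg.eigvalsh(np.cov(Xk, rowvar=False))[::-1]  # descending
explained = eigvals / eigvals.sum()
```

Features are then kept according to how much variance the leading components attribute to them.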

From the above operations, we obtain 10 features for building our prediction model using Random Forest.



We had already split our training data set into two parts at random. Testing is an iterative process in which we tune the different parameters of our prediction model. A random forest has parameters such as:

  1. mtry: number of variables randomly sampled as candidates at each split
  2. maxnodes: maximum number of terminal nodes a tree can have
  3. nodesize: minimum size of terminal nodes
  4. ntree: number of decision trees to grow

These parameters help us control underfitting and overfitting of the data set.

In the data set, only 7 percent of the responses are 1’s and the rest are 0’s. Moreover, our objective was to predict the 1’s rather than the 0’s. Therefore, we raised the probability of the response lying in the vicinity of 1 to 0.85.

The mean F1 score is used to evaluate the submissions. It considers both precision and recall: F1 is their harmonic mean, F1 = 2PR / (P + R).


Contingency Table:

Precision (P) is the ratio of true positives (TP) to all predicted positives (TP + FP).

Recall (R), or sensitivity, is the ratio of true positives (TP) to all actual positives (TP + FN).
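Precision, recall and the F1 score follow directly from the table’s counts (a small Python helper; the example numbers in the usage note are made up):

```python
def f1_score(tp, fp, fn):
    """F1 from the contingency-table counts."""
    precision = tp / (tp + fp)   # P = TP / (TP + FP)
    recall = tp / (tp + fn)      # R = TP / (TP + FN)
    return 2 * precision * recall / (precision + recall)
```

For example, f1_score(90, 10, 30) corresponds to P = 0.9, R = 0.75 and F1 ≈ 0.818.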

As we can see, what matters most are the true positives, false negatives and false positives, so we need to tilt the partition boundary towards 1 rather than 0. For this, we need to understand the output of the random forest: it gives the probability of the response lying in the vicinity of one or zero. In order to get more true positives, we gave more weight to tuples whose response variable is one.



In the Ideatory ZS Customer Modeling Challenge, we finished in 27th position on the leaderboard with an F1 score of 0.026, the highest being 0.268. Later on, we further improved our model and reached an F1 score of 0.23.



This was our first hands-on experience of solving an analytics problem. We learned various aspects of working on a Big Data problem: the steps involved in solving one, and the common challenges faced along the way. Furthermore, we explored many R packages useful for such data sets, such as caret for data modelling, missForest for data imputation, and randomForest for classification and regression based on a forest of trees built from random inputs.



  6.  ZS Customer Modelling Presentation


Effects of Lifestyle on Aging



Many studies have focused on understanding the physical and cognitive changes that define aging. These studies aim to identify the genetic, physical, behavioral, and environmental factors that primarily affect the aging process, and to understand the interrelationship between aging and various diseases.


We explore the data to uncover insights into the impact of lifestyle on aging. Specifically, we need to predict three conditions: ‘Arthritis’, ‘Angina’ and ‘Chronic Lung’ disease. If you want to know the detailed problem statement, follow the link.

Data Explanation:

As background, 31 unique feature groups such as Economic condition, Personal health, Residence, and Smoking/Alcohol were given for analyzing the dependent variables. There are a total of 269 features in the data set. Of these, 3 are dependent variables (Angina, Lung Cancer, Arthritis) and the remaining 265 are explanatory variables. The total number of observations is around 13K.

Each set of unique features consists of mixed data types; e.g. ‘Physical activity’ contains categorical as well as continuous data. Some of the unique feature sets also contain ordinal data, such as Self care and Memory. Initially, much of the raw data was also filled with NA values.

Levels of the data:




Workflow in steps:

  1. Consulting a domain expert (doctor) to know the dependency among 3 Response variables.
  2. Predicting NA values.
  3. Dimension reduction of category variables (Decision trees).
  4. Dimension reduction of continuous variables.
  5. Segregating Train & validate data.
  6. Implementing the prediction algorithms & testing for Accuracy of various models.

Handling Missing Data (NA) values:

The first hurdle of our data analysis was to predict and fill the missing values in our data. For this, we came up with the strategy of using the missForest package in R to predict the missing values. We also validated the NA predictions by predicting some known values and checking them for correctness.

About the missForest principle:

We used the missForest package to impute the missing data values. It trains a random forest on the observed values of the data matrix to predict the missing values, and it can handle both continuous and categorical variables.

In missForest, the NA values are first replaced with mean values for continuous variables and mode values for categorical variables. After each iteration, the difference between the previous and the new imputed data matrix is assessed separately for the continuous and categorical parts. The stopping criterion is defined such that the imputation process stops as soon as both differences have become larger once.


The imputation error for continuous variables is assessed by the normalized root mean squared error, NRMSE = √( mean((Xtrue − Ximp)²) / var(Xtrue) ), where Xtrue is the complete data matrix and Ximp the imputed data matrix.

Code Snippet:

library("missForest", lib.loc="C:/Program Files/R/R-3.2.2/library")
test_fill <- missForest(test2, ntree=50, mtry=8, maxiter=10)

ntree: number of trees to grow in each forest.

mtry: number of variables randomly selected at each split.

maxiter: maximum number of iterations to perform, given the stopping criterion is not met.

View of Training data after predicting missing Values:


Takeaway from a domain expert (a doctor):

We first checked the interdependency of the response variables. To complement the statistical evidence with domain knowledge, we consulted a doctor.


Lung cancer is totally independent of the other two. Arthritis can be linked with Angina with a probability of 0.05, and only in rare cases. As a result, we can safely assume that all response variables are independent of each other.

Dimension Reduction of categorical variables through Decision Trees:

The data was really complicated, consisting of categorical, continuous and ordinal variables. Also, the data dictionary suggests that one parameter may consist of multiple features; e.g. the parameter ‘Economic conditions’ consists of features V17–V45 (28 features!). Decision trees proved their worth in reducing such groups from 28 features to a few: we applied a decision tree to every set of unique features for dimension reduction.

After using decision trees, we were able to reduce the dimensions to:

  • For Angina : 64
  • For Arthritis : 58
  • For chronic_ lung : 41

Important Categorical features extracted for each disease:

Such reduction increases the overall efficiency, which decreases computation time, and unnecessary parameters are also thrown out. The following figure shows the exact selection of features for a particular disease.


The table shows that:

For the diseases Chronic Lung, Arthritis and Angina, the important features are {V9}, {V2, V6, V11} and {V2, V3, V5, V6, V11} respectively, all of which correspond to the unique feature named ‘General Health’. This is how decision trees helped in feature engineering, the most important task of analytics.

Decision Tree Code Snippet:

  • library(party)  # the ctree() function comes from the party package
  • chest_Arth <- ctree(Angina ~ V258 + V259 + V260 + V261, data = filled_data)
  • plot(chest_Arth, type = "simple")


Dimension Reduction of continuous variables Using Principal component analysis (PCA):

We have about 35 continuous attributes, and we used PCA to reduce the dimension of the data. Using the training data set, we found the principal components and their respective contributions to the variance. The first PC, which contributes around 99.7% of the variance, is selected.

We found loadings of that particular PC and calculated scores for train set and test set.

Scree Plot:


PCA code snippet:

  • cont_data_pca <- princomp(cont_data, cor = FALSE)
  • summary(cont_data_pca)
  • screeplot(cont_data_pca, type = "lines")
  • cont_data_pc <- cont_data_pca$loadings[, 1]
  • cont_mat <- as.matrix(cont_data)
  • cont_train_scores <- cont_mat %*% cont_data_pc
  • cont_test_scores <- as.matrix(filled_test) %*% cont_data_pc


Prediction Methods:

We compared the predictions of the diseases from the explanatory variables using three methods:

  1. Random Forest
  2. Maximum Entropy (maxent)
  3. Naive Bayes

Let us first describe the principle of random forests and their accuracy.

Method 1: Random Forest

Random forests (Breiman, 2001) is a substantial modification of bagging that builds a large collection of de-correlated trees, and then averages them. Bagging or bootstrap aggregation is a technique for reducing the variance of an estimated prediction function. Bagging seems to work especially well for high-variance, low-bias procedures, such as trees.
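The variance-reduction idea behind bagging can be shown with a deliberately tiny base learner (a toy Python illustration; we use 1-nearest-neighbour in place of a full decision tree, and a real random forest additionally subsamples features at each split):

```python
import numpy as np

def bagged_predict(X, y, x_new, n_boot=100, rng=None):
    """Average the predictions of many base learners, each fit on a
    bootstrap resample of the training data (sampling with replacement)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(X)
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # bootstrap sample
        Xb, yb = X[idx], y[idx]
        nearest = np.argmin(np.abs(Xb - x_new))   # trivially simple base learner
        preds.append(yb[nearest])
    return np.mean(preds)                         # averaging reduces variance
```

Each resample gives a slightly different learner; averaging them smooths out the individual learners’ high variance.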

Some of the important parameters are

  1. The size of the forest, i.e., number of trees
  2. The maximum allowed depth for each tree
  3. The amount of randomness between trees

We divided the training data into two parts:

  • 70% of the data kept as training data to train the random forests.
  • 30% of the data used as validation data for checking accuracy.

For the random forest on the training data, we used the following parameters:

mtry = 10

ntree = 300

Code snippet :

lung_fit <- randomForest(Chronic_Lung ~ V9+V11+V17+V19+V34+V36+V38+V46+V47+V13+V14+V51+V67+V68+V69+V263+V265+V70+V102+V106+V114+V124+V256+V88+V89+V90+V92+V99+V100+V261+V258+V155+V160+V239+V242+V246+V249+V50+V115+V116, data = train_lung, ntree = 300, mtry = 10)


lung_pred <- predict(lung_fit, train_dat_validate)

prop.table(table(lung_pred, train_dat_validate$lung))

After training, testing the model on the validation data gave the following prediction accuracies:

Disease        Prediction Accuracy (%)
Chronic_lung   91
Arthritis      93
Angina         90

We can optimize the parameters like ntree, mtry to get the maximum accuracy levels of prediction.

Method 2: Maximum entropy (Maxent)

Features are often added during model development to target errors. Then, for any given set of feature weights, we want to be able to calculate:

  • the conditional likelihood of the data,
  • the derivative of the likelihood with respect to each feature weight,
  • the expectation of each feature according to the model,

and then find the optimum feature weights.
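Maximum entropy classification is equivalent to multinomial logistic regression, and the gradient just described (observed feature counts minus the model’s expected feature counts) can be sketched as follows (an illustrative Python implementation on synthetic data, not the tool used in this project):

```python
import numpy as np

def maxent_fit(X, y, n_classes, lr=0.1, steps=500):
    """Minimal maximum-entropy trainer: the gradient of the conditional
    log-likelihood is the observed feature counts minus the model's
    expected feature counts."""
    n, d = X.shape
    W = np.zeros((n_classes, d))
    Y = np.eye(n_classes)[y]                    # one-hot labels
    for _ in range(steps):
        logits = X @ W.T
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)       # model's class distribution
        grad = (Y - P).T @ X / n                # observed minus expected features
        W += lr * grad                          # gradient ascent on likelihood
    return W

def maxent_predict(W, X):
    return (X @ W.T).argmax(axis=1)
```

On well-separated data this converges quickly; in practice one would also add regularisation on the weights.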

Maxent behaviour:


Accuracy : 89.5 %

Method 3: Naive Bayes

We have a bunch of random variables (data features) which we would like to use to predict another variable (the class).

Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes’ theorem with strong (naive) independence assumptions between the features.

Navlung=naiveBayes(Chronic_Lung~V9+V11+V17+V19+V34+V36+V38+V46+V47+V13+V14+V51+V67+V68+V69+V263+V265+V70+V102+V106+V114+V124+V256+V88+V89+V90+V92+V99+V100+V261+V258+V155+V160+V239+V242+V246+V249+V50+V115+V116, data = train_lung)


Accuracy : 88 %




Akshay Naik (02)

Anirudh Kuruvada (04)

Ramakrishna Ronanki (34)

Rohit Musle (37)















Taxi Travel-time Prediction


A taxi dispatch system is a complicated and challenging task given the traffic dynamics of a modern urban locality. A properly managed dispatch system reduces traffic congestion and improves the local economy. Real-time estimation of taxi travel times is therefore of great importance for existing electronic dispatch systems.

The current electronic dispatch system captures the geo-locations of different taxis at different points in time, which establishes the availability of taxis near the location where a booking is placed. But this collection of geo-locations and timestamps can yield much more: it is a valuable source of information for predicting future trip times, traffic patterns and the transportation demand generated at different locations.

Scenario 1:

A user books a cab near Science City, Kolkata, and the system shows the available cabs nearby, i.e. the ones roaming free. Of these, the best option, which is 15 minutes away from the booking location, is allotted. But the system ignores the fact that a booked cab passing through Science City at that very moment is about to terminate its trip within 5 minutes; this cab would be a better option than any free cab in the vicinity. That is possible only if the system knows the trip time of every ride. Knowing trip times also helps increase fuel efficiency by calculating the optimum path of travel.

Scenario 2:

After finishing a trip, cabs roam freely and wait for the next booking. But demand is quite high in Salt Lake during the evening: because offices close then, a surge of demand is generated in Salt Lake at that particular time. A driver new to the town is unlikely to know all these dynamics of the city, and might miss business opportunities as a result. Real-time traffic conditions and demand should be pushed to every cab to optimize the positions of free cabs, so as to minimize congestion and facilitate traffic movement.

Keeping these shortcomings of the existing system in mind, our project involves developing a model for predicting the travel time for a particular cab along different routes in a rectangular map of Porto. The output is based on several factors derived from floating car data, like –

CALL TYPE: how the trip was booked (online / on the road)
ORIGIN CALL: the identity of the caller
ORIGIN STAND: a unique identifier for the taxi stand
TAXI ID (integer): the unique ID of the taxi enrolled in the system
DAY TYPE: the type of day on which the trip starts (Holiday, Day before Holiday, Weekday)
TIMESTAMP: Unix timestamp (in seconds) identifying the trip’s start time
MISSING DATA: FALSE when the GPS data stream is complete, TRUE when one or more locations are missing
POLYLINE: the array of geo-locations of the trip

Data can be collected in two ways –

  1. Sensors placed across the street
  2. Floating Car Data

FCD consists of the position fixes of vehicles traversing the city throughout the day. Despite its infrequent data stream, FCD is quite effective because of its accuracy and inexpensive infrastructure cost.


The architecture involves detecting the path of a vehicle during the trip, matching the positions to corresponding grid indexes, and then feeding them to several ensemble methods for classification or regression (according to need). At prediction time, we only need to set the explanatory variables, and the model gives us the shortest path and the time required to cover it.

Motion Detection
Since the incoming streams of traffic data come from taxi cabs, delivery vehicles, or other commercial fleet vehicles, there will always be a certain amount of ambiguity between a slowdown in traffic and a commercial stop by the vehicle. (i.e. for a taxi customer, or a delivery vehicle dropping off packages). Therefore, any further processing must clean out all unwanted stops that are in the GPS logs.

While hired, the vehicle transmits its position at 15-second intervals, so a slowdown in traffic produces a clutter of points recurring around the same spot. Cleaned-up data is therefore required to describe the trip as evenly spaced geo-points. Moreover, since predictive analytics requires a set of explanatory variables, the trip must be represented in terms of predefined variables. Here, the variables are the edges between predefined points, known as Standard Points, spread across the city; an edge’s value represents the average velocity on that road segment in one approach, and the approximate travel time along that edge in the other.

# Getting all the location points
all_location = []
for i in range(len(main_data)):
    all_location += main_data.iloc[i].POLYLINE
all_location = [list(x) for x in set(tuple(x) for x in all_location)]

Grid Matching
In order to create a set of Standard Points across the city, we use our training data: 1.7 million observations (trips) of different cars running throughout the day over the period of a month, and the polyline variable of each. The polyline variable is an array of points with latitude, longitude and timestamp {x1, y1, t1}. The maximum and minimum longitudes and latitudes are used to draw the rectangle of interest. This rectangle is then divided by grids placed 100 meters apart. The conversion of longitude and latitude to a distance notion is done using the Haversine distance.

d = 2r · arcsin( √( sin²((∅_2 − ∅_1)/2) + cos ∅_1 · cos ∅_2 · sin²((𝜆_2 − 𝜆_1)/2) ) )

where,

d : the distance between the two points (along a great circle of the sphere),
r : the radius of the sphere,
∅_1, ∅_2: latitude of point 1 and latitude of point 2
𝜆_1,𝜆_2  : longitude of point 1 and longitude of point 2
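The Haversine distance used for the 100 m tiling can be implemented directly (a Python sketch; the `distance1` helper in the project’s snippets presumably does something similar):

```python
import math

def haversine_km(p1, p2, r=6371.0):
    """Great-circle distance between two (lat, lon) points given in degrees;
    r is the Earth's radius in km, so the result is in km."""
    phi1, phi2 = math.radians(p1[0]), math.radians(p2[0])
    dphi = math.radians(p2[0] - p1[0])
    dlmb = math.radians(p2[1] - p1[1])
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))
```

For example, one degree of longitude along the equator comes out to about 111.2 km.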
After that, the locations that the car traversed through are imposed on this grid. (Green – Start, Yellow – End)


Further, the points inside the same grid cell are represented by the first point of traversal, which becomes that grid’s standard point. The other points in the same grid are omitted, and the times along the omitted edges are added to the total count. Thus, the new map becomes –


And at the output we have a set of points with grid index as their location. {x1, y1, t1} => {i1, t1}

# Producing the standard points using square tiling and choosing one of the locations from each tile
lat = [i[0] for i in all_location]
lon = [i[1] for i in all_location]

# we are taking a square of size 100 meter * 100 meter
int1 = int(distance1([max(lat), 0], [min(lat), 0]) / .1) + 1
int2 = int(distance1([0, max(lon)], [0, min(lon)]) / .1) + 1
array1 = np.linspace(min(lat), max(lat), int1)
array2 = np.linspace(min(lon), max(lon), int2)

# defining the position of a tile in the map
def position(i):
    num1, num2 = 0, 0
    for j in range(len(array1) - 1):
        if array1[j] <= i[0] and array1[j + 1] >= i[0]:
            num1 = j
    for j in range(len(array2) - 1):
        if array2[j] <= i[1] and array2[j + 1] >= i[1]:
            num2 = j
    return len(array1) * num2 + num1

# keep the first location seen in each tile as that tile's standard point
standard_points_dict = {i: [0, 0] for i in range(0, len(array1) * len(array2))}
for i in range(len(main_data)):
    x = list(main_data['POLYLINE'].iloc[i])
    for j in range(len(x)):
        if standard_points_dict[position(x[j])] == [0, 0]:
            standard_points_dict[position(x[j])] = x[j]


Directed Graph

Directed Graph (Sample Trip)

From the points imposed on the grid, we get the locations traversed. These become the edges of our directed graph. The edges can be weighted in two ways –

  1. If we are interested in dealing with continuous variables, Regression is the only way. For that, we take average velocity between two nodes as our edges.
    e1,2 = (Haversine distance) / (t2-t1)
  2. In another approach, we can use Clustering techniques, for the sake of which the time between two consecutive nodes, the discrete variable is taken as the edge.
    e1,2 = t2-t1

Thus at the end of this step, we have a sparse matrix.
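The collapsing of same-grid points and the time-based edge weights can be sketched as follows (a toy Python example; the grid indices and timestamps are made up):

```python
# toy trip after grid matching: (grid_index, timestamp_in_seconds) pairs
trip = [(10, 0), (11, 15), (11, 30), (12, 60)]

# collapse consecutive points that fall in the same tile, keeping the first
collapsed = [trip[0]]
for p in trip[1:]:
    if p[0] != collapsed[-1][0]:
        collapsed.append(p)

# discrete edge weights: e_{u,v} = t_v - t_u, the travel time along the edge
edges = {(u, v): tv - tu for (u, tu), (v, tv) in zip(collapsed, collapsed[1:])}
print(edges)  # {(10, 11): 15, (11, 12): 45}
```

Note how the time spent lingering inside tile 11 is absorbed into the (11, 12) edge, as described above.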

Supervised Learning

Now we are left with a set of directed graphs, each for a new trip. The edge weight depends on the time of travel. This data is fed into the predictive model as per our strategy. We used Decision Tree and Random Forest for both categorical and continuous data. Lastly we applied Gradient Boosted Trees and observed the improvement in performance in terms of several measures.

Decision Tree

The response variable Y is categorical, so we can use information theory to measure how much we learn about it from knowing the value of another discrete variable A:


where I[Y | A = a] = H[Y] − H[Y | A = a] is the information gained about Y from observing A = a, with entropy H[Y] = −Σ_y p(y) log p(y) and conditional entropy H[Y | A = a] defined analogously from p(y | A = a).


In the case of continuous data, we apply the regression technique, where we minimize the least-squares criterion: the impurity function becomes the sum of squared distances of all the points from their mean.


# For training the decision trees: one tree per road segment,
# stored as elements of a large array
weights2 = []
for i in groups_roads:
    # groups_roads is an array of all the historical data along each road segment
    x = [[j[2]] for j in i]
    y = [j[3] for j in i]
    clf = tree.DecisionTreeClassifier()
    clf =, y)
    weights2 += [[i[0][0], clf]]
    weights2 += [[(i[0][0][1], i[0][0][0]), clf]]

Random Forest

Here, each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. The final output is the majority vote or the average over the trees. While this averaging reduces the variance, the bias increases slightly, because instead of picking the best split over all features, we take a random sample of the feature set and pick the best split among those.

clf = ensemble.RandomForestRegressor()

Gradient Tree Boosting

In the boosted method of regression, we try to minimise the residuals by modelling them as a function of the explanatory variables.


Thus the goal is to minimise the residuals h(x). This residual is expressed in terms of a step size γ and the loss function L.


Here we’ve used a least square loss function to create the optimum graph for that particular time.

The decision algorithm is run for the prediction of each observation, giving us a real-time update of the traffic condition: it yields the graph of the city for a particular time. All that is left is to find the best route between the given initial and destination nodes.

clf = ensemble.GradientBoostingRegressor()

Predictive Modelling

Dijkstra’s Algorithm

In this section we used Dijkstra’s Algorithm of shortest path calculation.
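A minimal version of the algorithm over the travel-time graph (a Python sketch; the three-node graph in the test is illustrative, while the real graph uses grid indices as nodes and the learned edge times as weights):

```python
import heapq

def dijkstra(graph, src, dst):
    """Shortest travel time on a directed graph given as node -> {neighbour: edge_time}.
    Assumes dst is reachable from src."""
    dist = {src: 0}
    prev = {}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float('inf')):
            continue  # stale queue entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(pq, (nd, v))
    # walk the predecessor chain back from dst
    path, node = [], dst
    while node != src:
        path.append(node)
        node = prev[node]
    return dist[dst], [src] + path[::-1]
```

The priority queue always expands the currently cheapest node, which is what makes the first arrival at the destination optimal for non-negative edge weights.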


  1. Demand Distribution
    The demand distribution of taxis over the day is shown here. The highest peak is between 9 and 10 A.M. (office hours), with a steady distribution thereafter until 6 P.M. This can be used to create a fare/km model over the course of the day and to optimize the resource utilization of the system.
  2. Trip Time & Trip Distance
    Individual trip times during the day are shown here. The trip time is in seconds, so it needs to be divided by 60. We can infer that almost all trips in the morning and at night are short; only a few long trips are made during the day. Most trip lengths are within 10 km, which is in conformity with the plot of travel times, where most trips take within half an hour.
  3. Most Traversed Area
    The heat map of the most traversed nodes is shown above. It shows that the red-marked zones are the most central zones in the city; the centrality test of the map conforms with this result.
  4. Centrality
    Centrality gives us the nodes and edges that are traversed most. According to the sampled data, Rua da Constituição is the busiest street in Porto, and the centrality measures of both nodes and edges come to the same conclusion.



Four kinds of measures are used –


  1. Mean Absolute Percentage Error: The MAPE (Mean Absolute Percent Error) measures the size of the error in percentage terms. It is calculated as the average of the unsigned percentage error.
  2. Mean Absolute Error: The MAE measures the average magnitude of the errors in a set of forecasts, without considering their direction. It measures accuracy for continuous variables.
  3. Mean Percentage Error: MPE is the computed average of percentage errors by which forecasts of a model differ from actual values of the quantity being forecast.
  4. Hypothesis Testing: PE > 0 is tested against the null hypothesis that PE = 0, and the normalised version of the statistic is assumed to follow the t-distribution.
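The first three measures are straightforward to compute (small Python helpers, written directly from the definitions above):

```python
def mae(actual, pred):
    """Mean Absolute Error: average unsigned error."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def mape(actual, pred):
    """Mean Absolute Percentage Error: average unsigned error in percent."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, pred)) / len(actual)

def mpe(actual, pred):
    """Mean Percentage Error: signed percentage errors, so over- and
    under-predictions can cancel and expose a systematic bias."""
    return 100.0 * sum((a - p) / a for a, p in zip(actual, pred)) / len(actual)
```

An MPE near zero with a large MAPE indicates errors that are big but unbiased.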

The results of the different models are shown as follows –


A summary of the different models is depicted in the following picture, where we draw a visual comparison of a typical trip.



  1. Dijkstra’s algorithm is appropriate for a static map. Since the traffic in a city changes constantly, we should try a layered approach in which the shortest path incorporates the real time of arrival at each node.
  2. Apply an incremental ARIMA model to each segment for a specific time of the day (Data needed)
  3. Use Map Matching Technique to position the standard points on the road accurately.
  4. Real Time Traffic Map (travelling route management, traffic signal optimization etc.)
  5. Optimizing of our approach in order to reduce model error.
  6. Implement it locally.
  7. Consolidate all these Features in an Android Application




Studying Twitter Sentiment of Football Superstars


The growth in micro-blogging activity on sites over the last few years has been phenomenal.  Platforms like Twitter offer an easy outlet for people to express their opinions and companies are increasingly getting interested in capturing these insights about customer behaviour and preferences that could help generate more revenues. The staggering amount of data that these sites generate cannot be manually analysed. Enter thus, Sentiment Analysis, the field where we teach machines to understand human sentiment.

Traditionally, sentiment analysis, under the umbrella term ‘text mining’, focuses on larger pieces of text like movie reviews or news articles. On Twitter, however, people post 140-character-long informal messages called tweets. Analysing sentiment from these tiny pieces of text is challenging due to their unstructured nature: internet slang, abbreviations, non-conventional spelling and grammar, hashtags, URLs and emoticons are just some of the complexities that need to be addressed.

Understanding the nuances of any language is another major problem. There is still a lot of research going on to train machines to be able to deal with complex grammatical structures, ambiguity, irony, sarcasm etc. Identifying the target for which we are analysing sentiment is also important. For instance a tweet comparing two players using a qualifier like ‘better’ or ‘worse’ would be labelled positive or negative depending on the target.

Our objectives

  1. Performing sentiment analysis on Twitter data.
  2. Extracting sentiment and gauging popularity of different players of the English Premier League from their Twitter footprint.

The basic question we are asking in this project is whether a given tweet about a football player is positive, negative or neutral. The aggregate sentiment will then be used to gauge a player’s overall popularity online. This information could be useful for companies looking for potential endorsees.

Overview of methodology

We collected tweets using the Twitter API in Python. The tweet data consisted of text and metadata like timezone, the poster’s Twitter handle, retweet count etc. We performed cleaning operations and feature extraction on the noisy data (discussed later) and manually labelled about 4669 unique tweets.

Using this training data, we trained several classifiers (Random Forest, Maximum Entropy, SVM, Naïve Bayes etc.). We selected the best classifier by testing on a data set of about 1399 tweets, of which 234 were negative, 380 neutral and 785 positive.

The next part of the problem is more exploratory in nature. Based on the inferences of our previous analysis we try to determine which players evoke the maximum sentiment (positive/negative) online, how the player performance affects their popularity ratings and how do two players compare against each other.

Performance metric

We rarely observe a balanced frequency distribution of data across the different classes. Accuracy, the ratio of correct classifications to total responses, thus becomes a skewed measure of performance. In order to use accuracy as a performance metric, we should interpret the results relative to the baseline case.
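With the class counts above, that baseline (always predicting the most frequent label) can be computed explicitly; a classifier’s accuracy is only meaningful above this figure:

```python
# class counts from the labelled test set described above
counts = {'negative': 234, 'neutral': 380, 'positive': 785}

# majority-class baseline: always predict the most frequent label ('positive')
baseline_accuracy = max(counts.values()) / sum(counts.values())
print(round(baseline_accuracy, 3))  # 0.561
```

So a classifier reporting, say, 60% accuracy on this set barely improves on guessing ‘positive’ every time.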

An alternative form of performance measurement is the F-score, the harmonic mean of precision and recall: F = 2PR / (P + R).

Predicted \ Actual   Actual True           Actual False
Predicted True       True Positive (TP)    False Positive (FP)
Predicted False      False Negative (FN)   True Negative (TN)

Precision is defined as the percentage of predicted labels that are correct. (TP/(TP+FP))

Recall (or sensitivity) is defined as the percentage of correct items that are selected. (TP/(TP+FN))
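These definitions can be combined into a small base R helper (the counts below are illustrative, not the project's test-set figures):

```r
# Computing precision, recall and F-score from confusion-matrix counts.
f_score <- function(tp, fp, fn) {
  precision <- tp / (tp + fp)                    # predicted labels that are correct
  recall    <- tp / (tp + fn)                    # correct items that are selected
  2 * precision * recall / (precision + recall)  # harmonic mean of the two
}

f_score(tp = 80, fp = 20, fn = 20)   # 0.8
```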

Data Cleaning and feature extraction and reduction

The cleaning, feature extraction and tokenization were done in R using regular expressions and the R packages tm and koRpus. We performed the following cleaning steps on the tweets:

  1. Remove duplicate tweets, tweets with no relation to the EPL, and other junk tweets. We also removed Twitter handles, retweets (RTs) and Unicode for special characters (currency symbols, ellipses, etc.).
  2. Split words in camel case: We found that several tweets (especially ones that contain hashtags) have conjoined words written in camel case, e.g. BringItOn, YouOnlyLiveOnce. We split these into their individual components.
  3. Repeated letters: Some users happily discard all accepted norms of grammar and spelling and prefer their own creative spellings. We replace repeated occurrences of letters with a maximum of two repetitions per character to bring down the number of features. For example, Helllooo would be converted to Helloo; similarly, soooo happppy would be converted to soo happy.
  4. Emoticons: Tweets are rich in emoticons, and users frequently use them to express their emotions. To gain insights from these emoticons, we prepared an emoticon dictionary that maps an emoticon (in Unicode format) to its sentiment.
Emoticon dictionary for mapping unicode to text


  5. Slang: We create a slang/abbreviation dictionary and map the terms to their corresponding full forms. It also contains some commonly misspelt words; for example, alryt and alrite are mapped to the word alright. An excerpt is shown below:
Slang word Mapping
awsm awesome
b4 before
bcuz because
kinda kind of
lol laughing out loud
  6. Tweets with comparison: If a tweet contains qualifiers like ‘better’ or ‘worse’, for example:

“Ozil is way better than Coutinho in terms of Big Chances created per game”

The word better appears after the player Ozil, so this tweet is positive for Ozil but negative for Coutinho. In all tweets about Ozil, we replace better by a term KPOS and in all tweets about Coutinho we replace better by an indicator term KNEG. We follow a similar approach if the tweets contain the qualifier ‘worse’.
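The replacement step above can be sketched in base R; `mark_qualifier` is a hypothetical helper, not the project's actual code, and it uses word position as a crude proxy for which player the qualifier favours.

```r
# Replace the qualifier 'better' with an indicator term, depending on whether
# it appears after the player's name (positive for him) or before (negative).
mark_qualifier <- function(tweet, player) {
  if (grepl("better", tweet) && grepl(player, tweet)) {
    indicator <- if (regexpr("better", tweet) > regexpr(player, tweet)) "KPOS" else "KNEG"
    sub("better", indicator, tweet)
  } else {
    tweet
  }
}

tweet <- "Ozil is way better than Coutinho in terms of Big Chances created per game"
mark_qualifier(tweet, "Ozil")      # 'better' becomes KPOS
mark_qualifier(tweet, "Coutinho")  # 'better' becomes KNEG
```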

The visualization below shows a decision tree of the most important features used in a 2-class classification problem (TRUE denotes negative sentiment and FALSE denotes positive sentiment).

DTree without using the KPOS/KNEG feature
DTree when KPOS/KNEG feature is used
  7. We perform some more filtering operations, such as removing punctuation, numbers and whitespace and applying Porter stemming, and then construct our corpus for tokenization.
  8. Tokenisation: We use the tm package to create our document-term frequency matrix, which we then tokenize using built-in R functions.
  9. Lexicon scoring: We use Bing Liu’s opinion lexicon, which consists of 2006 positive words and 4783 negative words. After tokenization of our corpus, we count the occurrences of positive vs. negative words and create a feature that measures the normalized net-positive sentiment score.
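A minimal sketch of the scoring, with toy word lists standing in for Bing Liu's opinion lexicon:

```r
# Normalised net-positive lexicon score for a tokenised tweet.
pos_words <- c("good", "great", "brilliant", "win")
neg_words <- c("bad", "awful", "lose", "poor")

lexicon_score <- function(tokens) {
  n_pos <- sum(tokens %in% pos_words)
  n_neg <- sum(tokens %in% neg_words)
  (n_pos - n_neg) / length(tokens)   # normalised net-positive sentiment
}

lexicon_score(c("what", "a", "brilliant", "win"))   # 0.5
lexicon_score(c("awful", "game"))                   # -0.5
```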

Experimental findings

We experimented with several classifiers: Naïve Bayes, Decision Trees, Random Forest, SVM, Logit Boost and MaxEnt. The performance of these classifiers on a 3-way sentiment classification problem (positive/neutral/negative) gave us the results shown in the chart below. Naïve Bayes performed the worst; Random Forest did much better but took more computation time; Logit Boost performed slightly better than Random Forest; but MaxEnt performed incredibly well (macro-averaged F-score of 0.96). For our further analysis we therefore restricted ourselves to the MaxEnt classifier.

F-scores for 3-class classification of tweets

A note on the Maximum Entropy Classifier (MaxEnt)

The Maximum Entropy model assumes, in the absence of additional information or constraints, that the distribution of features is uniform (entropy is maximal). For example, consider a completely biased coin (probability of heads = 1): the entropy associated with the outcome of tossing it is zero, as there is no uncertainty about the outcome. For an unbiased coin, however, heads and tails are equally probable, so there is maximum uncertainty, i.e. maximum entropy.

Entropy (H) vs probability of heads appearing

Adding more features or constraints lowers the maximum entropy and brings the distribution closer to the actual one (i.e., it increases the likelihood of the data). In the context of sentiment analysis, the features here are the words as well as our engineered features like the lexicon score. Each feature assigns a certain probability to the possible class (pos/neg/neutral) of the tweet, and the net score is given by a linear combination of all features.

More formally, the conditional probability of a class c given a document d can be expressed as:

P(c|d) = exp(Σᵢ λᵢ fᵢ(c, d)) / Σ_c′ exp(Σᵢ λᵢ fᵢ(c′, d))

Here the λᵢ represent the weights assigned to the different features fᵢ, found by maximizing the log of the conditional probabilities.

The MaxEnt model does not assume conditional independence of features (unlike Naïve Bayes) and thus proves to be much better when dealing with correlated features.

For our implementation we used the maxent package in R which is specially designed to minimize memory consumption on very large datasets like the sparse document-term matrices created in text mining.
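A sketch of how such a model might be trained with the maxent package; `tweets` and `labels` are placeholders for the project's labelled data, so this illustrates the API shape rather than the authors' exact code.

```r
library(tm)
library(maxent)

corpus <- Corpus(VectorSource(tweets))   # 'tweets': character vector (placeholder)
dtm    <- DocumentTermMatrix(corpus)     # sparse document-term matrix
sparse <- as.compressed.matrix(dtm)      # memory-efficient representation
model  <- maxent(sparse, labels)         # 'labels': pos/neg/neutral (placeholder)
preds  <- predict(model, sparse)
```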

In order to understand which features were most important in improving the accuracy of our MaxEnt classifier, we used an elimination approach. The chart below shows the effect of removing a particular engineered feature (e.g. lexicon score, emoticon mapping, slang mapping) on the accuracy of the classifier. We find that tweaking the features goes a long way in improving accuracy: a classifier that uses all these features is significantly more accurate (97.35%) than one that uses none of them (96.9%). Removing the lexicon score leads to the largest drop in accuracy, so we can gauge that it is indeed an important feature for our analysis. We have only considered unigrams here, as we did not obtain any significant change in accuracy when bigrams/trigrams were considered; this could perhaps be due to the constraint on tweet length limiting the usefulness of those features.

Comparison of accuracy for different features

More details about the maximum entropy classifier can be found in Chris Manning’s course on Natural Language Processing.

Determining popularity of football players

Post classification, we create visualisations to get a sense of how tweets for a particular player vary over time using Tableau. The plots below show the overall sentiment (sum of all ratings) for a player over time. Jamie Vardy became a sensation online after his 11-game record-scoring run in the Premier League even trumping Messi in ratings. A comparison chart for the same is shown below.

Time series plot of sentiment ratings of Messi and Vardy
Time series plot of ratings of Messi and Vardy

In order to get a clearer picture of the popularity of players across geographies, we plotted heat maps depicting average sentiment of a player across different countries.  The chart below shows the sentiment heat-map for the player Harry Kane. All tweets that did not contain a geo-location were mapped to Madagascar.

Heatmap showing sentiment for Harry Kane. Ratings vary from -1 (negative) to +1 (positive). NAs mapped to Madagascar.

We discuss the magic quadrant representation for footballers next. The magic quadrant puts the total number (or percentage) of tweets about a player on the x-axis and the percentage of positive tweets on the y-axis, creating four classes into which players can be slotted from a marketing perspective.

Magic Quadrant for some Premier League players

This visualization is a really intuitive way to compare potential investments, both within a domain and across domains. It’s old news now that Tata signed Lionel Messi as their brand icon in India, and if you look at the magic quadrant for the top footballers today this seems perfectly logical. The first quadrant features the stars of this era: personalities talked about highly both in terms of the quantity of content they generate and the overall positivity associated with them. Usual suspects like Messi, Neymar and Ronaldo fall into this quadrant. The second quadrant features critically acclaimed players; they may be talked about less than the stars but show a high proportion of positive responses. The third quadrant is a marketing team’s nightmare: unless they are terrific players, they wouldn’t bring in much revenue. The fourth quadrant represents the players who can be the villains of the world, so to say. They may be hated, but they definitely know how to stay in the headlines.

Both the trend charts and the heat maps for each of these four types of players show obvious differences. Minute-level line charts that show instantaneous shifts in ratings around events like goals, send-offs and controversial statements allow PR and marketing teams to act almost immediately when something alarming is picked up. Another decision this could help drive is whether to buy or sell a player. Decisions like these have a direct impact on revenue in terms of shirt sales, tickets sold and merchandise bought. For example, the ISL team Athletico de Kolkata tries to buy aging foreign players as their icons, as they earn more revenue off the field.

Resources and further reading

The presentation link can be found here- Group 2_presentation. For a primer on Twitter sentiment analysis these resources are useful to get started with-

  1. B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval.
  2. A. Pak and P. Paroubek. Twitter as a Corpus for Sentiment Analysis and Opinion Mining.
  3. A Agrawal et al. Sentiment Analysis of Twitter Data.
  4. Saif Mohammad’s talk at EMNLP-2014.

CDS Group-2

Ritwik Moghe|Pranita Khandelwal|Riju Bhattacharyya|Sushant Rajput

Stock Value Prediction


Introduction and Motivation

Stock price movement has always been a topic of fascination for many people. If one had a hack to know the future prices of stocks, he or she could become a billionaire just by sitting at home, having a cup of coffee and investing money in stocks with a few mouse clicks.

So here’s an attempt to predict the future prices of stocks through the power of mathematical modeling, coding and a combination of the two.



  1. Collect the stock’s past data.
  2. Implement several models using the data.
  3. Test the models on a validation set.
  4. Use the best combination to predict the future price of the stock.
  5. Capture the market sentiment around the stock.
  6. Reach a conclusion.


DATA Collection

The data for a particular stock can be collected from the Yahoo Finance website using the following easy steps:

  1. Go to the “Historical Prices” tab on the left hand side of the web page.
  2. Fill in the “Get Historical Prices for:” tab with the name of the stock for which you want the prediction carried out (e.g. ACC for Accenture).
  3. Set the starting and the ending date of the prices.
  4. Press “Download to Spreadsheet”.

And ta-da, in less than a minute you’ll have a CSV file of the data with you!



The data of ACC stock

We will only concern ourselves with the closing price of the stock and build our models on that basis. We will then divide the dataset into train, validation and test sets and implement the models.



Polynomial Model

  • A polynomial function is one that has the form:

p(x) = aₙxⁿ + aₙ₋₁xⁿ⁻¹ + … + a₁x + a₀

where n is a non-negative integer that defines the degree of the polynomial. A polynomial with a degree of 0 is simply a constant function; with a degree of 1, a line; with a degree of 2, a quadratic; with a degree of 3, a cubic; and so on.

  • Fit an nth-degree polynomial to the data points. Here n is chosen to be 5.
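In base R this amounts to a linear model in polynomial terms (the prices below are synthetic; the project fit the downloaded Close column):

```r
# Degree-5 polynomial fit to closing prices using lm() and poly().
set.seed(42)
t     <- 1:100
close <- 100 + 0.5 * t + 5 * sin(t / 10) + rnorm(100)   # synthetic price series

fit   <- lm(close ~ poly(t, 5))                          # orthogonal degree-5 polynomial
ahead <- predict(fit, newdata = data.frame(t = 101:110)) # extrapolate 10 steps ahead
```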


Holt-Winters Model

  • Breaks the data into level (a), trend (b) and seasonality (s)

α = level smoothing parameter | β = trend smoothing parameter | γ = seasonal smoothing parameter

  1. Y[t+h] = a[t] + h * b[t] + s[t – p + 1 + (h – 1) mod p]
  2. a[t] = α (Y[t] – s[t-p]) + (1-α) (a[t-1] + b[t-1])
  3. b[t] = β (a[t] – a[t-1]) + (1-β) b[t-1]
  4. s[t] = γ (Y[t] – a[t]) + (1-γ) s[t-p]
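In base R this model is available as stats::HoltWinters(); AirPassengers stands in for the price series here, since HoltWinters() needs a series with a seasonal frequency.

```r
# Holt-Winters fit: alpha, beta and gamma are estimated from the data.
hw <- HoltWinters(AirPassengers)
fc <- predict(hw, n.ahead = 12)   # forecasts Y[t+h] for h = 1..12
```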


Feed Forward Model

  • A feed-forward neural network is an artificial neural network where connections between the units do not form a cycle. This is different from recurrent neural networks.
  • The feed-forward neural network was the first and simplest type of artificial neural network devised. In this network, the information moves in only one direction, forward, from the input nodes, through the hidden nodes (if any) and to the output nodes. There are no cycles or loops in the network.



ARIMA Model

  • An autoregressive integrated moving average (ARIMA) model is a generalization of an autoregressive moving average (ARMA) model, fitted to time series data either to better understand the data or to predict future points in the series (forecasting).
  • Non-seasonal ARIMA models are generally denoted ARIMA(p, d, q), where the parameters p, d and q are non-negative integers: p is the order of the autoregressive part, d is the degree of differencing, and q is the order of the moving-average part. Seasonal ARIMA models are usually denoted ARIMA(p, d, q)(P, D, Q)m, where m is the number of periods in each season, and the uppercase P, D, Q refer to the autoregressive, differencing and moving-average terms of the seasonal part of the model.
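Base R's arima() fits such a model; the lynx series and the order (2, 0, 1) below are purely illustrative.

```r
# Fitting a non-seasonal ARIMA(p, d, q) with stats::arima().
fit <- arima(lynx, order = c(2, 0, 1))   # p = 2, d = 0, q = 1
fc  <- predict(fit, n.ahead = 10)        # fc$pred holds the point forecasts
```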


SES Model

ŷ[T+1|T] = α y[T] + α(1−α) y[T−1] + α(1−α)² y[T−2] + …

ŷ[T+1|T] = α y[T] + (1−α) ŷ[T|T−1]

α = smoothing parameter

Weights decrease exponentially with the increasing distance from the present time.
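The recursion can be written directly in base R (toy series; the initial forecast is seeded with the first observation):

```r
# Simple exponential smoothing: yhat[t+1] = alpha * y[t] + (1 - alpha) * yhat[t].
ses <- function(y, alpha) {
  yhat <- numeric(length(y))
  yhat[1] <- y[1]                 # initialise with the first observation
  for (t in 1:(length(y) - 1)) {
    yhat[t + 1] <- alpha * y[t] + (1 - alpha) * yhat[t]
  }
  yhat
}

ses(c(10, 12, 11, 13), alpha = 0.5)   # 10 10 11 11
```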


Market-Stock Correlation

  • A scatter plot between the market index and the stock value is plotted, and a linear regression line is fitted to the data points.


Stock Autocorrelation

  • The autocorrelation of the stock’s price with its own past values is examined, and lag t[n−1] is seen to have the highest correlation.
  • Hence we can predict the stock’s price using t[n−1].
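This can be checked with base R's acf(); the random walk below is a stand-in for the stock's closing prices.

```r
# Lag-1 autocorrelation of a (synthetic) price series.
set.seed(1)
price <- 100 + cumsum(rnorm(200))            # random-walk stand-in for closing prices
a <- acf(price, lag.max = 5, plot = FALSE)
a$acf[2]   # autocorrelation at lag 1; close to 1, so t[n-1] is a strong predictor
```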


Correlation Table

  • The correlation table provides the linear correlation between the predicted values and the real stock values.
  • The higher the correlation, the better the estimated values.



Integrating the Models:

Weighted Average of Predicted Models

  • Apply different sets of weights to the selected models, calculate the chi-square value, and select the most appropriate weights for the prediction.


Isotonic regression

  • The isotonic regression finds a non-decreasing approximation of a function while minimizing the mean squared error on the training data.
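Base R ships isoreg() for exactly this; a toy sequence shows the pooled, non-decreasing fit.

```r
# Isotonic regression: best non-decreasing least-squares fit to y.
y   <- c(1, 3, 2, 4, 5, 4, 6)
fit <- isoreg(y)
fit$yf   # 1.0 2.5 2.5 4.0 4.5 4.5 6.0 -- violating pairs pooled to their mean
```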



Chi-square Values

  • Chi-squared = Σ (observed − expected)² / expected
  • The lower the chi-square value, the higher the chances of success.
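A base R sketch of using the statistic to rank two hypothetical weight sets:

```r
# Chi-square statistic for a candidate set of predictions: lower = better fit.
chi_sq <- function(observed, expected) {
  sum((observed - expected)^2 / expected)
}

actual <- c(100, 102, 101)
pred_a <- c(101, 101, 102)   # hypothetical predictions from weight set A
pred_b <- c(105, 96, 99)     # hypothetical predictions from weight set B
chi_sq(actual, pred_a) < chi_sq(actual, pred_b)   # TRUE: prefer weight set A
```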




News Sentiment Analysis:


Many key factors that influence the stock price of a company or a sector of the economy are affected by incoming news articles and feeds. Incoming news can be of various types, such as latest earnings statements, announcements of dividends, information about new products, and trend analysis and predictions by financial experts. Figure 1 shows some of the factors that affect the stock price. Clearly, this is an area in which text and data mining tools and techniques can be employed to provide summary information by extracting important keywords and action phrases from incoming news stories. More importantly, there is a need for ways to extract emotion and sentiment from this corpus of text. With the advent of online news sites such as Google Finance, Yahoo Finance and MSN Money, financial news is delivered in a real-time, streaming format. There is a critical need for tools that can provide an accurate estimate of sentiment from streaming news items.


Figure1: A snippet from a news article carrying different sentiments for two companies


News Aggregation

The first step is to create an online news-gathering engine using the tm package and its plugins, such as tm.plugin.webmining.


Figure2: R code snippet


Analyzing and Filtering News Items

Our method starts off by scanning the news items for the stock symbols in question. As an example, Figure 1 shows a snippet of a news article from Bloomberg. It talks about Apple (AAPL) and Coinstar (CSTR) in the same article, yet the article conveys different sentiment for each. The proposed method filters the article to only relevant sentences. It would thus take only the first paragraph into consideration for Apple Corporation (AAPL). Some of the key words and phrases that would be tagged for polarity would be “sank 2.8 percent”, “fell”, “losing streak” (negative), and “valuable” (positive).

For breaking a story down to sentences as shown in figure3, we have various Natural Language Processing (NLP) tools available therein, such as the package NLP. This allows us to look at individual sentences and keep only those that contain the stock symbol.


Figure 3: R code snippet
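A minimal base-R sketch of this filtering step, using a regex sentence split in place of the NLP package's annotators (the article text is a toy paraphrase, not the Bloomberg snippet):

```r
# Split an article into sentences and keep only those mentioning the symbol.
article   <- "Apple sank 2.8 percent today. Analysts still call AAPL valuable. Coinstar fell again."
sentences <- unlist(strsplit(article, "(?<=[.!?])\\s+", perl = TRUE))
relevant  <- sentences[grepl("AAPL|Apple", sentences)]
relevant   # keeps only the two sentences about Apple / AAPL
```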


Identifying Sentiment Words

We define an instance to be a unit at one of the granularity levels (sentences, headlines, paragraphs) at which the analysis is to be performed. After extracting the relevant instances, we identify the key words contained in them and match them against available sources of positive and negative sentiment terms. We have used the Harvard General Inquirer.


Scoring Text Corpus

News articles are kept in memory in the form of a document-term matrix with the instances as rows and the terms as columns. We create scores for the instances.

Definition 1: An instance is classified as positive if the count of positive words is greater than or equal to the count of negative words. Similarly, an instance is negative if the count of negative words is greater than the count of positive words.

Definition 2: Score of a corpus is defined as the ratio of positive instances to the total number of instances.
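Definitions 1 and 2 can be sketched in base R on toy per-sentence counts (not the ACC data):

```r
# Sign per instance (Definition 1) and corpus score (Definition 2).
score_corpus <- function(n_pos, n_neg) {
  s <- ifelse(n_pos >= n_neg, 1, -1)   # +1 if positive words >= negative words
  sum(s == 1) / length(s)              # positive instances / total instances
}

n_pos <- c(3, 0, 2, 1)   # positive-word counts per sentence (toy data)
n_neg <- c(1, 2, 2, 0)   # negative-word counts per sentence
score_corpus(n_pos, n_neg)   # 0.75
```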

When we apply this algorithm to news streams gathered on November 14th ’15 for Accenture we get:

  • 23 negative terms occurring in the sentences
  • 107 positive terms occurring in the sentences
  • 89 sentences with sign +1
  • 10 sentences with sign -1
  • A naive sentiment score of 89 / (89 + 10) = 89 / 99 ≈ 89.89%


Figure 4: R code snippet



In this section, we will present the results of our method. The clouds to the left show the positive and negative words in the news streams respectively. They lead to a naive sentiment score of 89.89% representing a positive net sentiment for ACCENTURE.


Figure 5: Positive and Negative clouds



The choice of models depends on the training dataset. For the Accenture stock we selected, the weighted average model turns out to be the best choice; it may vary for different stocks.

The sentiments of the market must also be checked while investing as strong sentiments also tend to have an impact on the stock selection.


For further queries you might contact:

Ankit Sonthalia (15BM6JP06) +91 86000 96336

Loveleen Kaur (15BM6JP21) +91 99153 66173

Siddhant Sanjeev (15BM6JP45) +91 98109 32592


And a last word of advice:


PPT Link

Walmart Triptype Classification


Walmart Headquarter, Bentonville, United States

Will: Hey, David! Good Morning, How was your weekend?

David: Weekend was good boss! How was yours?

Will: Oh! It was horrible. You know what, we had an emergency meeting yesterday. All the guys are worried. Results of this quarter are out and it is not at all looking good. Top management is now asking tough questions. We have a hell of a lot of work to do, so pull up your socks, David; sleepless nights are awaiting.

David: That’s bad news boss! But what went wrong? Our estimates had shown considerable profit this quarter. The footfall has not decreased as per our record. There is nothing wrong with economy too. This seems quite puzzling to me.

Will: It seems to me that we are not able to understand our customers well. Customers are spending too much time looking out for product they want and that’s probably affecting the transaction per visit.

David: Oh! I see. What should we do then boss?

Will: We need to understand our customer: what’s their primary purpose of visit to the store. We need to segment our shelves accordingly, resulting in less confusion for buyers. This would reduce the distance they have to travel to get the product and definitely help to cut down on chaos.

David: Hmm. I wonder how come our guys didn’t see this problem coming.

Will: You need to dig deeper into the data, David. Call our overseas analytics team, give them the data we are sitting on for years and hope to get rescued.

David: Okay Boss! I am on it.


1 Month later

 Dear David,

We hope you’re doing well. We have gone through the data you sent across, and our team has worked on it. We request you to allow us to come over to your headquarters and brief you on the findings.

Yours sincerely,



Dear Group-12,

We’d be happy to have you here. We have scheduled your presentation this weekend. All the top management people will be there along with Will. Best of luck to you guys!

Yours faithfully,



Walmart Headquarters, Debriefing session:


  • Treatment of data: missing values and outliers
  • Exploratory analysis of data: feature importance, correlation between features.
  • Feature engineering: a) Department Description b) FinelineNumber c) a combination of both.
  • Application of supervised learning algorithms: a) XGBoost b) Random Forest c) Gradient Boosting Machine.
  • Obtaining the trip-type classification for the test data.

Treatment of data:

  1. Missing values:
    • Out of a total of 647054 rows, 4129 rows have NULL values (less than 1%)
    • Assuming data is missing at random, ignore rows with NULL values.
  2. ‘Weekday’ field converted to binary (whether the day of visit is a weekend or not):
    • If the day is Friday, Saturday or Sunday, it is considered a weekend (i.e. the field value is 1); otherwise it is a weekday.
  3. Negative values in ‘ScanCount’ indicate a return of the item:
    • Returning an item does not affect the buying pattern, so negative ‘ScanCount’ values are set to 0.
  4. Reshaping the data so that each visit is one observation, with the number of items purchased per department as the features (features described by Department Description).


# removing all NA / NULL values (data-frame names 'train' and 'test' assumed;
# they were lost from the original snippet)
train <- na.omit(train)
length(which($FinelineNumber), arr.ind = TRUE))
length(which(train$DepartmentDescription == "NULL", arr.ind = TRUE))

# making Weekday a binary variable (1 = weekend, 0 = weekday)
train <- transform(train, Weekday = factor(Weekday))
levels(train$Weekday) <- c(1, 0, 1, 1, 0, 0, 0)
train.walmart_df <- train[, c(1, 2, 3, 5, 6)]
test <- transform(test, Weekday = factor(Weekday))
levels(test$Weekday) <- c(1, 0, 1, 1, 0, 0, 0)
test.walmart_df <- test[, c(1, 2, 4, 5)]

# reshaping the data: one row per visit, departments as columns (reshape2)
library(reshape2)
train.walmart_df <- dcast(train.walmart_df, dcast_formula, value.var = "ScanCount")
test.walmart_df  <- dcast(test.walmart_df, dcast_test, value.var = "ScanCount")

Feature Correlation Graph:

Feature correlation

  • Compute correlation matrix of reshaped training data
  • Compute adjacency matrix from the correlation graphs as follows:-
  • If absolute value of correlation is less than a threshold (0.05 in this case), assume there is no correlation between the purchase of items and value in adjacency matrix is 0 i.e. there is no edge between these 2 products in the correlation graph.
  • Otherwise value in adjacency matrix is 1 i.e. there is an edge between the products in the graph.
  • All diagonal elements in the adjacency matrix are made 0 to avoid self loops.
  • The relationships between different departments are pretty consistent with intuition (e.g. customers who buy men’s wear are likely to also buy socks, undergarments and shaving kits).


train$TripType <- NULL
train$VisitNumber <- NULL
train$Weekday <- NULL

# corrplot: library used to visualise the correlation matrix
library(corrplot)

# compute the correlation matrix
corMat <- cor(train)
adjMat <- corMat

# construct the adjacency matrix
for (i in 1:nrow(corMat)) {
  for (j in 1:ncol(corMat)) {
    if (abs(corMat[i, j]) < 0.05 || i == j) {
      adjMat[i, j] <- 0
    } else {
      adjMat[i, j] <- 1
    }
  }
}

Feature Importance graph:



library(xgboost)

nameFirstCol <- names(train)[1]
y <- train[, nameFirstCol]
train$TripType <- NULL
train$VisitNumber <- NULL
trainMatrix <- as.matrix(sapply(train, as.numeric))

numberOfClasses <- max(y) + 1
param <- list("objective" = "multi:softprob",
              "eval_metric" = "mlogloss",
              "num_class" = numberOfClasses)
nround <- 20
bst <- xgboost(params = param, data = trainMatrix, label = y, nrounds = nround)

# get the real feature names
names <- dimnames(trainMatrix)[[2]]
# compute the feature importance matrix
importance_matrix <- xgb.importance(names, model = bst)
# plot the feature importance graph
xgb.plot.importance(importance_matrix)

Classification Algorithm used:


  • XGBoost is short for eXtreme Gradient Boosting. It is an open-source tool, with computation in C++ and an R interface provided.
  • A variant of the gradient boosting machine – Tree based model
  • The winning model for several Kaggle competitions.


numberOfClasses <- max(triptype) + 1
param <- list("objective" = "multi:softprob",
              "eval_metric" = "mlogloss",
              "num_class" = numberOfClasses)
cv.nround <- 200
cv.nfold <- 10

# run cross-validation <- = param, data = train.walmart_mat, label = triptype,
                 nfold = cv.nfold, nrounds = cv.nround)
nround <- which($test.mlogloss.mean == min($test.mlogloss.mean))

# train the model
nround <- 114  # number of trees at which test mlogloss is minimum during cross-validation
bst <- xgboost(params = param, data = train.walmart_mat, label = triptype, nrounds = nround)

# predict on the test set
ypred <- predict(bst, test.walmart_mat)

Cross-validation and model building:

  • Once the data has been reshaped into the required format, we use cross-validation to choose the parameters.
  • numberOfClasses: is equal to 38, since there are 38 classes in total
  • param: parameters of the model with “objective” indicating the task, “eval_metric” indicating the error measurement of the model
  • cv.nround: number of the trees to build. This is the parameter we want to tune
  • cv.nfold: how many parts you want to divide the train data into for the cross-validation
  • run the cross-validation

Performance evaluation Metric:

  • Logloss function: logloss = −(1/N) Σᵢ Σⱼ y[i,j] log(p[i,j])
  • N is the number of visits in the test set.
  • M is the number of trip types (j runs from 1 to M).
  • y[i,j] is 1 if observation i belongs to class j and 0 otherwise.
  • p[i,j] is the predicted probability that observation i belongs to class j.
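A base R sketch of this metric on toy probabilities (clipping guards against log(0)):

```r
# Multi-class logloss: -(1/N) * sum over i, j of y[i, j] * log(p[i, j]).
logloss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)   # clip probabilities away from 0 and 1
  -sum(y * log(p)) / nrow(y)
}

y <- rbind(c(1, 0, 0), c(0, 1, 0))              # true classes for two visits
p <- rbind(c(0.7, 0.2, 0.1), c(0.1, 0.8, 0.1))  # predicted class probabilities
logloss(y, p)   # equals -(log(0.7) + log(0.8)) / 2
```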

Bagging : Random Forest

  • Ensemble of decision trees.
  • Unlike single decision trees, Random Forests use averaging across many trees to find a natural balance between the two extremes of overfitting and underfitting.
  • Random forest uses bootstrapping and averaging.
  • Out of bag error estimate by using department description as features is 44.5%
  • This implies department Description alone is not a good classifier.
library(caret)

fc <- trainControl(method = "repeatedcv",
                   number = 2,
                   repeats = 1)
tGrid <- expand.grid(mtry = RF_MTRY)
model <- train(x = train, y = target, method = "rf", trControl = fc,
               tuneGrid = tGrid, metric = "Accuracy", ntree = RF_TREES)

# predict the second training set and the test set using the random forest
train2Preds <- predict(model, train2, type = "prob")
testPreds <- predict(model, test, type = "prob")

Boosting: Gradient Boosting Machine

  • Fit complex models by iteratively fitting sub-models (decision tree) to residuals.
  • Gradient boosting uses a “pseudo gradient”
  • Pseudo-gradient used is the derivative of a general loss function L().
  • In this case: logloss-function.
  • It shows the deviation of predicted probability of class from original training example.
  • A sub-learner as close as possible to the pseudo-gradient is picked and added to the model.


Minimizing the logloss function

Challenges and Bottlenecks:

  • Memory issues: with limited RAM, handling a big numeric matrix was not feasible.
  • The dcast() function is not practical for reshaping when there are ~5K features.
  • Different numbers of features in the test and train data when features are built from FinelineNumber and DepartmentDescription.
  • Department Description alone is not enough for classification.
  • No improvement even after trying different classification algorithms.



Thank you!

*All the characters and events depicted in this post are entirely fictitious.








Topic Transition Modelling in Online Conversations

The Beginning – In the beginning there was an Idea

The idea of topic modelling developed en route on a bus trip. The motivation was simple: to model the randomness in topic transition and see if it is generalizable. In short, are our ‘random talks’ really that random, and if they are, can they be modeled using available methods?

What, Then is a conversation ?

No generally accepted definition of conversation exists, beyond the fact that a conversation involves at least two people talking together. Consequently, the term is often defined by what it is not. A ritualized exchange such as a mutual greeting is not a conversation, and an interaction that includes a marked status differential (such as a boss giving orders) is also not a conversation. An interaction with a tightly focused topic or purpose is also generally not considered a conversation. Summarizing these properties, one authority writes that “Conversation is the kind of speech that happens informally, symmetrically, and for the purposes of establishing and maintaining social ties.”

Initial Idea – TwatterKeeper Archives

Twatter Keeper was a very interesting tool that mimicked the way a lot of people use Twitter searches. It let you specify a hashtag, searched Twitter for any occurrence of that hashtag, and saved those tweets for you: a very useful tool if you wanted to read all tweets about, say, a conference. Just create a new archive with the relevant hashtag and wait for Twatter Keeper to bring you all the tweets with that tag. Sadly, it was shut down by Twitter around 2011. Had Twatter Keeper been available, a lot of time spent mining and cleaning Twitter data could have been saved. The R community has built many interesting generalizations around Twatter Keeper data using the Gephi/igraph packages.


Our analysis focused on a temporal study of how conversations flowed over time and how  one thing led to another. Formally stated,

1. To identify the latent topics that underlie a conversation, and

2. To study their distribution across the entire timestamp to arrive at stationary probabilities.

The probability distributions obtained could then be modeled to understand the transitions of the conversation from one topic to another. To achieve the objective, certain assumptions were made.

Model Assumptions

Topic transition is a homogeneous Markov chain process. A Markov Chain is a random process that undergoes transitions from one state to another on a state space. It must possess a property that is usually characterized as “memorylessness”: the probability distribution of the next state depends only on the current state and not on the sequence of events that preceded it. This specific kind of “memorylessness” is called the Markov property.

  1. A set of consecutive comments is a document
  2. For topic modeling, order of words in a document is neglected (i.e., exchangeable) and order of documents in a corpus is also neglected.
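Under the homogeneous-Markov-chain assumption, the stationary topic probabilities are the left eigenvector of the transition matrix for eigenvalue 1. The 2-topic matrix below is a toy example, not one estimated from the forum data.

```r
# Stationary distribution of a homogeneous Markov chain over topics.
P <- rbind(c(0.9, 0.1),   # P[i, j] = probability of moving from topic i to topic j
           c(0.3, 0.7))
e <- eigen(t(P))                          # left eigenvectors of P
i <- which.min(abs(e$values - 1))         # pick the eigenvalue-1 eigenvector
stat_dist <- Re(e$vectors[, i])
stat_dist <- stat_dist / sum(stat_dist)   # normalise to a probability vector
stat_dist   # 0.75 0.25: long-run share of time spent on each topic
```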

Data Collection

Data was scraped from a national discussion forum.

  1. Crawlers were built using a Java program.
  2. The data consists of comments from people across the country.
  3. The comments span a period of one year.


Data Cleaning

  1. Removal of Hindi words using a self-prepared dictionary.
  2. Correction of misspelled English words using the aspell function provided by the utils package in R.
  3. Extraction of different parts of speech (POS) using the annotate function provided by the NLP and openNLP packages in R.


Topic Modeling – Latent Dirichlet Allocation (LDA)

We selected Latent Dirichlet Allocation (LDA) as our primary algorithm for modeling the inherent topics in a corpus of comments scraped online. LDA is essentially a hierarchical Bayesian framework for factorizing a document-term matrix into a matrix of topic distributions over words and a matrix of document distributions over topics. From this perspective LDA behaves like a latent factor model where topics act as latent factors. A topic, as modeled in LDA, is a cluster of words that represents a coherent idea based on the words' unique pattern of co-occurrence in the corpus of documents. Our basic reason for choosing LDA is its unique ability to find clusters of words that are connected under a cohesive thematic content in the given corpus. LDA acts on a document-term matrix and needs the number of topics to be specified in advance. The most important outputs from the model are a distribution of words for each topic and a distribution of topics for each document in the corpus. We suspend further discussion of LDA for brevity and direct interested readers to the original paper by David M. Blei, Andrew Y. Ng and Michael I. Jordan.

# running LDA with 10 topics (here via LDA() from the topicmodels package)
dtm_LDA = LDA(dtm2, k = 10)

Topic Labeling : Need for a Dictionary

Topics generated by LDA are just collections of words. For the topics to make sense to a human, they have to be mapped to some real-world topics. This exercise can also be thought of as recognizing the direction in which the collection of words in a topic inherently points.

Methodology : Forming  dictionary

This necessitated forming a dictionary that would serve as a measure for categorizing the topics in terms of real-world topics. To form the dictionary, we first parsed PDFs on the major subjects to form a collection of documents, on which LDA was run to produce topics. The top 1,000 words by probability were then sorted and chosen. Doing this over a variety of subjects gave us a comparative frame of reference.
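The labeling step can be sketched as scoring each LDA topic against every reference dictionary by the total probability mass of the topic's words that appear in that dictionary, then picking the best-scoring label. The topic and dictionaries below are illustrative, not from our corpus:

```python
def label_topic(topic_word_probs, dictionaries):
    """Score a topic (word -> probability) against each reference dictionary by
    the total probability mass of the topic's words found in that dictionary."""
    scores = {name: sum(p for w, p in topic_word_probs.items() if w in words)
              for name, words in dictionaries.items()}
    return max(scores, key=scores.get), scores

# Illustrative topic and reference dictionaries (not from our corpus)
topic = {"minister": 0.12, "election": 0.10, "vote": 0.08, "cricket": 0.02}
dicts = {"politics": {"minister", "election", "vote", "parliament"},
         "sports":   {"cricket", "match", "score", "team"}}

label, scores = label_topic(topic, dicts)
print(label)  # → politics
```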

Forming the dictionary : Pitfalls to avoid

LDA gives its probability outputs in the form of log transforms, which have to be converted to probabilities before sorting them in descending order and using them. Some topic distributions may be close to uniform, making the cut-off threshold difficult to decide (we generally used 1,000 words). It is better to first look at a distribution visually and then select a threshold for the number of representative words in a topic. As a general principle, thick-tailed distributions should be avoided: texts and the number of topics should be selected so as to get a “sharp” word probability distribution.

The process of running LDA over a number of texts and the consequent “cleaning” should be implemented on a system with competitive hardware so as to meet the computationally intensive demands of the process.

Topic Transition Modeling

Markov Property

The next piece of the puzzle was to model the temporal relationship between topics once we had compressed snippets of a conversation into logical topics. We started this task guided by a number of assumptions and the tools available off-hand in R. Our basic assumption was that at any point in the conversation the future state depends solely on the present state, i.e., any given commentator comments only in response to the last comment (the last group of comments, actually); in the field of stochastic modelling this is known as the Markov property. This assumption seemed plausible given that we modeled the state of the conversation using a collection of comments distributed over time. This approach ensured that sporadic transitions were smoothed out and we ultimately modeled only statistically significant patterns in the collection of comments.

The number of comments representing one state of the conversation was chosen heuristically, depending on the size of each comment, the total number of reference topics and the time over which the comments were spread. In the present case, the approximately 30k comments scraped were spread over a year and a half, each about 40-80 words long, with 10 reference topics to compare against. Assuming a uniform distribution of comments, the average number of comments per day was about 55, containing on average 3,300 raw words. This number (55 comments clubbed together) looked decent enough to capture the context of the conversation, so we started with it for creating one time-stamp in the flow of the conversation. Later we experimented with this number in order to understand and neutralize the effect of the heuristic. We found that even 250-300 words were good enough to capture the context of the conversation while still giving us enough time-stamps to be useful in understanding the long-term dynamics. So we clubbed 5 consecutive comments and applied LDA over them to get a probability distribution of topics at each time-stamp.


Discrete Time Markov Chain

Building on our assumption of the Markov property, we initially tried modelling conversations as a continuous-time Markov process. While this allowed us to relax the constraint of knowing exact transition times between states, it brought an implicit and even stronger assumption: that the process is Markov as well as time-homogeneous (constant transition probabilities) over any given period. This assumption was too strong for the real-life data we were trying to model, which showed quite clearly when our simulations failed to converge in fitting the data. To relax this constraint we tried Discrete Time Markov Chain (DTMC) modelling, which models a stochastic process as a temporal sequence of states coming from a countable collection of states called the state space of the process. It requires knowing the exact transition times between states, and time-homogeneity of the Markov chain, which essentially means that the transition probabilities between states in any fixed number of steps remain fixed throughout. One additional assumption was that the sampling times are non-informative apart from being exact with respect to the transitions, i.e., the sampling times have no other significance for the system beyond capturing the exact transition times of the states. On this foundation we used R for the modelling.
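The core DTMC estimate is conceptually simple: count consecutive state pairs in the topic sequence and normalize each row. A minimal sketch on a toy topic sequence (not our data):

```python
from collections import Counter

def transition_matrix(states, n_states):
    """Maximum-likelihood DTMC estimate: count i -> j transitions in the
    observed state sequence and normalize each row to probabilities."""
    counts = Counter(zip(states, states[1:]))
    P = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        row_total = sum(counts[(i, j)] for j in range(n_states))
        if row_total:
            for j in range(n_states):
                P[i][j] = counts[(i, j)] / row_total
    return P

seq = [0, 0, 1, 0, 2, 2, 1, 0, 0, 1]   # toy sequence of topic states
print(transition_matrix(seq, 3))
# → [[0.4, 0.4, 0.2], [1.0, 0.0, 0.0], [0.0, 0.5, 0.5]]
```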

MSM (Multi-State Markov Model)

Initially we experimented with a number of R packages for Markov chain modelling, such as markovchain, DTMCPack, SemiMarkov and msm, but finally chose msm (Multi-State Markov Model) as the most appropriate for our purpose. This package can be used for modelling continuous-time, discrete-time and hidden Markov models given longitudinal data on the states of the system. Since we wanted to model our conversational data as a DTMC, we needed to convert the probability distribution of topics at each timestamp into one unambiguous state. For this purpose we simply chose the state with the maximum probability as the state of the conversation at each timestamp. In a variant of this method, we also created multiple batches of training data where, in each batch, instead of picking the state with the maximum probability we sampled a topic from the probability distribution of topics at that timestamp. Either way, we end up with one state for the conversation at each timestamp, which is simply the most relevant topic in the conversation at that time. The next task was to use this sequence of states, along with a few structural constraints such as a matrix specifying which state transitions are possible, in the msm package to estimate the transition probability matrix. This matrix can be used to predict the probability distribution of topics in the conversation after any specified number of steps, and also to compute the stationary distribution of the Markov chain: the long-term probability distribution of states, which remains constant once the chain achieves stationarity.

topic.msm = msm(formula = state ~ time, subject = id, data = df_msm,
                qmatrix = qmatrix, gen.inits = TRUE, obstype = 2,
                exacttimes = TRUE,
                control = list(fnscale = 2500, reltol = 1e-16,
                               ndeps = rep(1e-6, 90)))


Once we fitted the sequence of topics to a DTMC using msm, we observed that while the fitted model predicted stationarity within 15-20 steps, the observed chain did not achieve a constant probability distribution. This was a clear hint that one or more of our assumptions had failed, specifically the time-homogeneity assumption. To check this assumption, instead of fitting the DTMC over the complete data we fitted the model over windows of 100 timestamps separately and then compared the transition probabilities. As we had suspected, the transition probabilities from the separate chains were different, but interestingly they behaved like white noise: the probability numbers appeared to be generated randomly around a fixed mean with fixed variance. This meant that if we gathered more batches of data and averaged the transition probabilities over all of these chains, we would get an unbiased estimate of the probabilities, which could ultimately be used to estimate the long-term stationary distribution.


Conclusion and Future Scope

At the end of all our analyses, we came up with a methodology to model conversations as a Discrete Time Markov Chain, which can be used to predict the future stochastic movement of a conversation in terms of the relevant topics of discussion, based on its past data. This methodology can be used to estimate the stationary distribution of the conversation over the reference topics, which provides a quantitative benchmark for comparing the long-run relative importance of the various reference topics in the given conversation.
There are a number of simplifying assumptions in this model that need to be further probed and refined. The heuristic of clubbing a group of comments to represent the state of the process can be made more rigorous by automatically detecting clusters of comments in the time-sequenced data that represent one coherent idea. This can also help minimize complications due to the time-inhomogeneous nature of the Markov chain by capturing the actual transitions of states. The list of reference topics needs to be expanded, keeping in mind the broad categorization of information. Moreover, we need to develop a methodology by which the distribution of words associated with each reference topic can be refined online based on their popular usage. We also need to automate the choice of the number of topics used while fitting LDA, as a subset of all available reference topics, based on some pre-computation on the corpus of comments. Finally, more effort is required in evaluating the topics generated by LDA and automating countermeasures to work in an unsupervised online setting.


The following video shows the temporal dynamics of the content of an online conversation thread. The size of the nodes corresponds to the distribution of topics over time. This visualization was created using the ‘igraph’ package in R to generate JPG frames and then the ‘ffmpeg’ multimedia converter to convert the still frames into an MP4 video file.

Note: Time appearing in video doesn’t correspond to the chronological time when the comments were made.

PGDBA, Computing for Data Sciences, Group #11

  • Aashish Jhamtani
  • Pradeep Mooda
  • Somya Ranjan Sahu
  • Shashank Kumar

Project Presentation



Rossmann Sales Prediction

Rossmann operates a chain of drug stores across 7 European countries. Rossmann decided to apply a single solution across all its stores in Germany, one that captures each store's characteristics and sales patterns and can forecast each store's sales 6 weeks in advance.

Impact on Company:
Reliable sales forecasts enable store managers to create effective staff schedules and stay focused on their customers and teams.
They can increase productivity and customer satisfaction by providing better service.

Past Work:
Individual store managers use their own forecasting schemes for their own stores, and the variance among these schemes is very high.

These schemes also take a lot of time out of store managers' schedules, time they could spend on other productive tasks, and a model built for one store cannot be applied to all stores.

Except for competition distance and the time duration of promotions, all explanatory variables are categorical.
There is also no common trend across the stores, so it is very difficult to apply a single time series model to all of them.

But even a separate time series model for each store's sales does not give satisfactory results (RMSPE of 0.28).
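For reference, the root mean square percentage error can be computed as below. We follow the competition's definition as we understand it, where days with zero sales are excluded:

```python
from math import sqrt

def rmspe(actual, predicted):
    """Root Mean Square Percentage Error; observations with zero actual
    sales are ignored (division by zero is undefined for them)."""
    terms = [((a - p) / a) ** 2 for a, p in zip(actual, predicted) if a != 0]
    return sqrt(sum(terms) / len(terms))

# Toy daily sales vs predictions; the zero-sales day is skipped
print(round(rmspe([100, 200, 0, 400], [110, 180, 5, 400]), 4))  # → 0.0816
```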

Missing Values
There are 180 stores that do not have 6 months of sales data, and we have no information on how to fill that gap; nor did we find substantial guidance on the Kaggle competition forum for filling those missing values.
There are some missing values in competition distance and time, but they are very few in number, so we replaced the nulls in those variables with the median of the available data.

Training Data:
The data contains the daily sales of each store from 1st Jan 2013 to July 2015, approximately 10 lakh observations, with features like timestamp (week, date), opening status, promotion status on that day and state/school holiday status. This is in fact a large amount of data to handle.
In addition to the sales data, there is store information (9 features) for each of the 1115 stores.

Methodology Used:
We used different online sources, such as a similar Kaggle competition on weekly sales prediction for Walmart stores and the forum for this competition, to get an idea of how to approach the solution.

First we applied a time series model to each store. It gives an RMSPE of 0.28, which implies that time series alone does not capture the variability in the sales data.
So we require models that capture the effect of time on sales as well as the explanatory variables.

Using exploratory analysis of the datasets, we found that there are comparatively few stores of type ‘b’, but each of them contributes more to sales than the rest. Assortment level ‘b’ is also available only at those stores. For the other store types there is no significant difference in the sales data.
But it is difficult to judge the most impactful variable for each store from visualization alone.

Using random forest, we are able to capture both the effect of the features on sales and the variability of sales with time.

The importance of the different categorical (explanatory) variables according to the random forest is given below:

(Most important) Store > Promo (long-term promotion) > CompetitionOpenSinceYear > CompetitionOpenSinceMonth > DayOfWeek > StoreType > day > month > Promo2SinceWeek > Assortment > PromoInterval > year > Promo2 (short-term/daily promotion) (least important).

Further exploring different algorithms like XGBoost, we found that taking a weighted average of the results of random forest and XGBoost gives better results than either algorithm applied individually.
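The blending step is a simple weighted average of the two models' predictions; a minimal sketch (the weight 0.4 is illustrative, not our tuned value):

```python
def ensemble(rf_preds, xgb_preds, w_rf=0.4):
    """Weighted average of two models' predictions; w_rf is the weight on the
    random forest, tuned on validation data (0.4 here is illustrative)."""
    return [w_rf * r + (1 - w_rf) * x for r, x in zip(rf_preds, xgb_preds)]

# Toy sales predictions from the two models for two store-days
print(ensemble([100.0, 200.0], [120.0, 180.0]))
```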

Tools Used:
R and libraries like randomForest, xgboost, ts (for time series) and visualization libraries like ggplot2.

Using tuned weight parameters for the weighted average of the results from random forest and XGBoost, the Root Mean Square Percentage Error of the combination is 0.11154. Our team ranked 1213 out of 3242 teams.

Rossmann Sales LeaderBoard SnapShot (sethsachin21)

Presentation Link

Competition on Kaggle

Team Members:

Anurag Patel(
Sachin Kumar(
Sanjeev Kumar(
Vivekanand Chadchan(


Movie Recommendation System Using Twitter Data


Twitter is one of the most popular social networking websites in the world. As of the third quarter of 2015, it has around 307 million active users. Every second, on average, around 6,000 tweets are tweeted, which corresponds to over 350,000 tweets per minute, 500 million tweets per day and around 200 billion tweets per year. The large amount of data thus available can be, and has been, used for a number of services, like

  • Speech analysis: using Twitter to generate a word cloud and, from that, determine a person’s style of speech; it can also be used to find topics of interest, etc.
  • Success of a product or campaign: analyzing the popularity and the reviews of a product
  • Detection of earthquakes
  • Prediction of social unrest and mass protests

The project was based on understanding the semantics of the tweets made by the user and then recommending movies to the user on the basis of words used.


The project was divided into four parts starting from extracting Twitter data and building genre dictionaries; these two steps ran in tandem. This was followed by building movie list from which to recommend movies, and the final step being developing the mapping algorithm.

  • Extracting Twitter Data

Tweets from a particular Twitter handle are captured using the function userTimeline. Most of the time, the captured tweets may not contain proper English words, so the text is cleaned and common words such as helping verbs, prepositions, etc. are removed.


Then we created a term-document matrix of all the words, keeping words that appear at least thrice. These words were then reduced to their stems, common stems were merged, and weights were assigned.


For assigning weights, the TF-IDF method was used. Term Frequency (TF) captures the raw frequency (count) of a word in a document, while Inverse Document Frequency (IDF) captures the relative importance of the word across documents. The final weight of a word is the product of its TF score and IDF score.
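A bare-bones sketch of this weighting scheme (one common TF-IDF variant; the documents here are toy word lists, not real tweets):

```python
from math import log

def tf_idf(docs):
    """Weight each word in each document by term frequency x inverse document
    frequency, with IDF = log(N / number of documents containing the word)."""
    n = len(docs)
    df = {}                              # document frequency of each word
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    return [{w: doc.count(w) * log(n / df[w]) for w in set(doc)} for doc in docs]

docs = [["action", "hero", "fight"],
        ["romance", "hero"],
        ["action", "action", "war"]]
weights = tf_idf(docs)
print(round(weights[2]["action"], 3))  # TF=2, IDF=log(3/2) → 0.811
```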

  • Building genre dictionaries

A movie genre dictionary is a list of words that could represent a particular genre of movies. To build the genre dictionaries, we collected data from four different sources. The first two were movie descriptions and movie plots from IMDb and Wikipedia. The third source was key/tag words for every movie. The final step was manual intervention: a list of words was added to every genre that could represent that genre.


Then, weight to every word was assigned in the same manner as we did for Twitter data.


  • Building Movie List

A list of around 1800 movies was developed. Every movie could fall into 2 to 4 of the 11 major genres selected. A sample list of movies with all the genres is provided below. Binary representation was used to indicate the presence of a movie in a particular genre: either a movie falls into a genre or it doesn’t.


  • Finding User’s preference of genre

Every word in the user dictionary was matched with every word in all the 11 genre dictionaries. For every word that matched, corresponding scores were multiplied. This gave us score for every genre.


  • Mapping Algorithm
  1. Take dot product of movie matrix (1814 x 11) with the score vector (11 x 1)
  2. Get movie score vector (1814 x 1) having score for every movie
  3. Recommend movie at the top of the list
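The three steps above can be sketched with toy matrices (3 movies x 3 genres instead of 1814 x 11):

```python
def recommend(movie_matrix, genre_scores, titles, top_n=1):
    """Dot each movie's binary genre row with the user's genre score vector
    and return the highest-scoring titles."""
    movie_scores = [sum(m * s for m, s in zip(row, genre_scores))
                    for row in movie_matrix]
    ranked = sorted(zip(titles, movie_scores), key=lambda t: -t[1])
    return ranked[:top_n]

# Toy data: 3 movies x 3 genres (action, comedy, drama), binary membership
M = [[1, 0, 1],
     [0, 1, 0],
     [1, 1, 0]]
user = [6, 3, 1]                 # user's genre preference scores (illustrative)
print(recommend(M, user, ["A", "B", "C"]))  # → [('C', 9)]
```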


Outcome and Next Steps

The project’s output could be used to identify a user’s personality in terms of his/her likes and dislikes.


Since the choice of movies for a particular user could be associated with many other things, such as books or places he would like to visit, this could be built upon further.

It can serve as a perfect tool for online marketing based on a user’s social media profile. Some possible improvements are as follows:

  • Improving semantics and associations of words in the user profile – Different words can imply the same meaning. Moreover, while creating the term-document matrix, different forms of the same word, or even singular and plural, are considered different words. We tried to minimize this by using the stem of the word instead of the word itself, which improved the accuracy to some extent. The next step would be to find the intensity of a word, group words with the same meaning together, and then determine the user’s personality.
  • Increasing the number of genres – Considering the scope of the project and the paucity of time, we considered only 11 genres, which could broadly classify most movies. Netflix uses around 77,000 micro-genres to classify movies. The more genres there are, the better the classification and recommendation will become.
  • Finding associations between movies – So far we have tried finding associations among various aspects of a user’s profile. The same can be done for movies. Each movie could be tagged with a few keywords and other characteristics such as time of release, cast, actor-director pair, geography, etc. This can then be used to cluster movies and improve the recommendation process.
  • Including other aspects of the Twitter profile – So far we have considered only the tweets made by the user. This could be complemented by other aspects such as hashtags, the handles a person follows, retweets, etc., enabling us to gather more information about the user.


  • Extracting legible data from Twitter – Ironic as it may sound, social media are a storehouse of data, but only a small fraction of that data can be used to extract useful information. Moreover, the use of slang and incorrect or misspelled English words makes it difficult to use the extracted data.
  • Building the genre dictionaries – Many movies fall into several categories, and hence a word that represents a particular genre could also fall into other genres. Estimating the weights of such words so as to distinguish their presence in different genres was a challenge. This was taken care of by manually adding a set of words to every genre.
  • Lack of training and test data – This problem, being exclusive to the project, was the most difficult to handle, since there is no direct literature available related to this project. It was difficult to find out whether our algorithm was working well. We took a sample of a few people and asked their movie preferences, then generated recommendations from their Twitter handles and compared. We found significant overlap between the two lists.

R Packages Used

  • twitteR
  • SnowballC
  • sqldf
  • tm

R Functions Used

  • userTimeline
  • twListToDF
  • regmatches
  • TermDocumentMatrix
  • tm_map
  • Corpus
  • wordStem
  • sqldf
  • setup_twitter_oauth

Thank You

PGDBA Group 08

Bharathi R (

Faichali Basumatary (

Shoorvir Gupta (

Vishal Bagla(

For complete report, please visit Movie Recommendation System.

For presentation, please visit CDS Project – Group 8 – Movie Recommendation System.

In-Hospital Intensive Care Unit Mortality Prediction Model

Introduction to the problem statement

Use of Artificial Intelligence in the hospitals

The healthcare industry by its very nature touches the life of every citizen and contributed 4% of India’s GDP (Source) as of 2013. The mammoth amount of data generated has been used for decision making, and artificial intelligence has a big role to play in the day-to-day functioning of the industry.

Expenditure on healthcare in India – 50 Billion USD

Number of Doctors – 7 lakhs

Nearly 40% of the people admitted to ICU have to borrow money or sell assets (Source)

Can help save critical lives and reduce cost

The power of data can be harnessed to save critical lives. By predicting a potential deterioration in a patient’s condition, appropriate care can be provided in time. Also, providing advanced assistance only to patients who are in a risky situation saves the resources not spent on non-risky patients.

Past efforts

In the past, many researchers have explored the use of patient data to predict criticality. However, given the huge potential and the large volumes of data now available, this remains an active field of research.

Our objective was to develop a predictive model utilising the laboratory measurements of patients in the ICU to predict potential mortality in the coming hours. Naturally, the performance metrics were the accuracy of the prediction, specifically the true positives, and the lead time of the prediction, the more the better. (Problem statement.) We will discuss different aspects of the challenge and the analysis in this blog. We expect readers to have a basic understanding of machine learning. We had a lot to cover but limited time and space, so at certain points we have compromised on detail for the sake of brevity.

Primary challenges

The primary challenge in developing any kind of data model for the human body is its physical and numerical vastness. It is difficult to prototype the exact problem, as high reliability is a must for such models. For the particular case of predicting mortality, there can be many causes of mortality, and a model can possibly cover only some of them.

Missing data

The availability of data is subject to the conduct of laboratory or vital measurements. Moreover, measurements come at a monetary and health cost; for example, a blood test implies collecting blood samples from the patient's body. Hence, much less data is available than would be ideal for the model, and the different measurements are not concurrent in time.

Training the data

Even though mortality is an objective outcome, a worsened state that may later result in mortality is difficult to capture: we are predicting a state for which we have no direct training data. We used a conservative approach, taking the worst-case window for patients with positive ICU mortality and using it to train the state indicative of future mortality.

Different features

The features have no linear relation with good or bad health. Nearly every feature has an optimum value, and the critical zones vary from patient to patient based on their living habits. Features act in combinations, which is also how doctors analyse them.

Large data

We were given details of 5990 patients in total (training and test combined), summing up to more than 6 lakh rows of data, which is a lot of data to handle. We used about 1% of the data for preliminary testing to speed up our verification procedure.

Can use only the patient history data

For any patient, only his/her historic data can be used for prediction.


We referred to various research papers in this domain, published around a similar competition held in 2012, to decide on a strategy for the problem.

Feature engineering 

Thanks to Dr. Priyanka Singh, Dr. Ram Kiran and Dr. Tejaswi for supporting us in developing modified features from the available data. Variables that were strongly correlated and influenced each other were merged. We also divided the values of each lab test/vital into ranges and assigned them weights according to criticality.

Missing values

More than 95% of the values were missing. Following what doctors do, if a value was missing we looked for it in the last 24 hours, then among the ICU values, then since hospital entry; otherwise a normal value was imputed.
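This fallback chain can be sketched as follows (hypothetical readings; each list holds one feature's values in chronological order):

```python
def impute(last_24h, icu_values, since_admission, normal_value):
    """Fallback chain for a missing measurement: most recent value in the
    last 24 hours, else the latest ICU value, else the latest value since
    hospital entry, else a population-normal value."""
    for readings in (last_24h, icu_values, since_admission):
        if readings:
            return readings[-1]        # most recent available reading
    return normal_value

# Heart rate missing now; nothing in the last 24 h, one earlier ICU reading
print(impute([], [88], [92, 88], 75))  # → 88
print(impute([], [], [], 75))          # → 75 (normal value imputed)
```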

Tools Used

Pandas, NumPy and scikit-learn, with R for testing the quality of the variables.

Pandas offers a great variety of tools to subset/group the dataset.

Project Stages


Program Structure

For train data:

    for every patient in training data:
        if patient died:
            extract modified features from the non-ICU data of the current patient
            extract modified features from the ICU data of the current patient
        else:
            extract modified features from the non-ICU data of the current patient
            extract modified features from the ICU data of the current patient

For test data:

    for every patient in test data:
        if patient in ICU:
            create modified features for the current patient using his/her historical data

Note: Code snippets can be found in presentation. Presentation Link

Results obtained


The score had the highest weightage for sensitivity, although there were restrictions of a minimum 0.99 specificity and a 5.5-hour median prediction time.


Specificity – use of specific weight parameters to adjust the specificity to the required level.

Sensitivity – had to compromise on sensitivity so as to maintain the minimum requirement of 0.99 specificity.

Median prediction time – calculated only for true positives.


For the final submission to the challenge, we combined the training and validation dataset for the purpose of training and obtained improved results.



  • Use of appropriate parameter values in the random forest to get the best results
  • The functioning of different classification methodologies (Random forest, SVM and K-NN)
  • Importance of domain knowledge in the healthcare industry

  Contact information:

If you have any questions or want to discuss any aspect of the analysis, please feel free to contact any of the co-authors (Group – 6) : 

Manaswi  Veligatla ( )

Neeti Pokharna      (

Robin Singh       ( 974 888 4997)

Saurabh Rawal      (

Citi Group: Data Scientist Challenge

The Data Scientist Challenge

The Data Scientist Challenge, powered by Citi Digital Acceleration, was launched in October 2015 with the objective of promoting creative new approaches to mining data, discovering actionable customer insights and building impact.

More details can be found at

Credit card industry:

The credit card industry has information-rich data available for millions of customers. As per the contest website, an active customer makes on average 120 transactions annually. This data is very useful and can give incredible insights into customer preferences and lifestyle choices.

However, the fundamental challenge with this data is its sheer size and the computing power required.

Problem statement:

The objective of this challenge is to identify the top 10 merchants that customers have not transacted with in the past 12 months but are likely to transact with in the next 3 months.

We need to personalize these recommendations at the customer level. They can then be used to make customized offers from those merchants to the customers likely to shop with them.

Data Format:

Each record contains the following fields:

  • Customer #
  • Merchant #
  • # of transactions
Number of customers: 374328

Number of transaction:63659708

Number of merchants: 9822

Number of merchant categories: 15

Data size = 690.5 MB (18,532,355 rows; transaction months from 201309 to 201408)

The final output must be a single file, the ‘Merchant Recommendation’ file in .csv format, consisting of 10 merchant recommendations for each customer.

Scoring and Evaluation:

For every customer, the merchants will be scored in the order of likelihood.

If Merchant 1 has been identified correctly for a customer, 10 points are awarded to the score. Correctly identifying Merchant 2 adds 9 points, and so on.

The final score is the weighted sum of the algorithm score (A) and bonus points for code efficiency and approach (B):
Score = Algorithm Score (A, weightage = 80%) + Code efficiency/Approach (B, weightage = 20%)
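As a worked example, the scoring rule can be sketched as a small function (hypothetical helper names; the official evaluation code was Citi's own):

```python
def algorithm_score(recommended, actual):
    """Score one customer's ranked top-10 list: a correct merchant at
    rank 1 earns 10 points, rank 2 earns 9, ..., rank 10 earns 1."""
    return sum(11 - rank
               for rank, merchant in enumerate(recommended, start=1)
               if merchant in actual)

def final_score(a, b):
    # weighted sum: algorithm score (80%) plus code efficiency/approach (20%)
    return 0.8 * a + 0.2 * b
```

For instance, recommending ["m1", "m2", "m3"] when the customer actually transacted with m1 and m3 earns 10 + 8 = 18 points.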

Nature of problem:

First, we identify this as a recommendation problem. Recommendation schemes may be classified into two major types:

i. Content-based filtering, and

ii. Collaborative filtering.

Content-based recommendation systems analyse item descriptions to identify items that are of particular interest to the user; that is, they try to recommend items similar to those a given user has liked in the past.

Collaborative filtering, on the other hand, assumes that if two customers showed similar buying patterns in the past, they have similar buying behaviour, and recommends the products/merchants of one to the other.

Based on the above argument, we model our problem as collaborative filtering, where we recommend merchants who have transacted with similar customers in the given transaction period.

Now collaborative filtering may be of several types:

  1. Memory-based collaborative filtering: The neighbours of a customer are found from the entire database, and recommendations are made from this neighbourhood. Forming the neighbourhood requires a notion of similarity/distance, and there are several choices of distance metric.

  2. Model-based collaborative filtering: This approach allows the system to learn to recognize complex patterns from the training data, and then make intelligent predictions for the collaborative filtering task on test or real-world data, based on the learned models. The model can be a Bayesian model, a regression-based model, a clustering model, or dependency networks.


  • In the dataset provided we observed that each merchant belongs to one category.

  • No missing data has been observed.

  • No effect of seasonal trend has been observed.

  • High skewness in the market share of the merchants has been observed. In fact, more than 85% of the market share is dominated by only around 500 merchants out of nearly 10,000.

Approaches used:

  1. Similarity-based approach:
    In this approach we first find the distance between merchants. Cosine similarity is chosen as the distance metric; its advantage over Jaccard similarity is that multiple occurrences of a transaction cannot be captured by Jaccard similarity. The distances between merchants are stored in an N×N matrix, where N is the number of merchants. The cosine similarity between two merchant vectors m1 and m2 is calculated by the following formula:

    cos(m1, m2) = (m1 · m2) / (||m1|| ||m2||)
Then, we compute score-matrix for each customer-merchant pair by the following algorithm:


Based on this score vector, we then select the top 10 merchants for each customer.
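The score-and-rank step described above can be sketched as follows, assuming the merchant-merchant cosine-similarity matrix has already been computed (NumPy, with illustrative variable names; the actual implementation was in R):

```python
import numpy as np

def top_k_unseen(trans, sim, customer, k=10):
    """trans: customers x merchants transaction counts;
    sim: merchants x merchants cosine-similarity matrix."""
    # score each merchant by the similarity-weighted sum of the
    # customer's past transactions
    scores = sim @ trans[customer]
    # keep only merchants the customer has not transacted with
    scores[trans[customer] > 0] = -np.inf
    return np.argsort(scores)[::-1][:k]
```

For example, a customer who bought only from merchant 0 would be recommended the merchants most similar to merchant 0 that they have not used yet.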

  2. Low-rank approximation collaborative filtering:
    First, we build a transaction matrix whose rows represent merchants and whose columns represent customers; it records how each customer transacted with each merchant. We then compute a k-rank approximation of this matrix. The low-rank approximation is expected to capture the similarity pattern in the customers' buying behaviour: in essence, if the like-minded customers of customer i made large transactions with merchant j, then the corresponding entry of the approximation is expected to be large, and small otherwise. For each customer vector we look for merchants whose entries in the original matrix are zero but have large values in the low-rank approximation.
    Furthermore, we performed the low-rank approximation separately for each category, so that the specific trend of each category is reflected in the computation.
    We start from k=5 and increase k until the results more or less stabilize.
    NOTE: The original matrix is sparse, but the reconstructed one is not, and performing a full SVD on a matrix of this size would be difficult. So we take refuge in the R package IRLBA, which implements the Implicitly Restarted Lanczos Bidiagonalization Algorithm of Jim Baglama and Lothar Reichel. The algorithm starts with one unit-length vector as v1 of the SVD, computes u1, and then keeps appending unit-length vectors vj; but instead of full Gram–Schmidt orthogonalization, where a vector is orthogonalized against all previously chosen vectors, the subtraction here is done only against the projection onto the previous vector.
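A minimal NumPy sketch of this idea, using numpy's exact SVD truncated to rank k as a small-scale stand-in for irlba (variable names are illustrative):

```python
import numpy as np

def rank_k_approx(M, k):
    # truncated SVD: keep only the k largest singular triplets
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def recommend(M, customer, k=5, top=10):
    """M: merchants x customers transaction matrix (as in the text)."""
    approx = rank_k_approx(M, k)
    col = approx[:, customer].copy()
    # consider only merchants this customer has not transacted with
    col[M[:, customer] > 0] = -np.inf
    return np.argsort(col)[::-1][:top]
```

On the real data, irlba computes the same truncated SVD without ever forming the full decomposition, which is what makes the approach feasible at scale.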

  3. K-means clustering:
    In this approach we partition the customers into clusters, and recommendations are made from the most successful merchants of each cluster. We first order the merchants by transactions within a cluster; then, for each customer, we find the merchants with whom no transaction has been made but which rank high in the cluster.
    We keep the number of clusters as a tuning parameter and settle on the value beyond which no further change is observed on increasing the number of clusters.
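Given cluster labels from k-means, the recommendation step above can be sketched as follows (illustrative names, not the project code): rank merchants by total transactions within the customer's cluster and suggest the top-ranked ones the customer has not used.

```python
import numpy as np

def cluster_recommend(trans, labels, customer, top=10):
    """trans: customers x merchants matrix; labels: cluster id per customer."""
    members = labels == labels[customer]
    # merchant popularity within the customer's cluster
    popularity = trans[members].sum(axis=0)
    ranking = np.argsort(popularity)[::-1]        # most popular first
    # keep only merchants the customer has not transacted with
    unseen = [m for m in ranking if trans[customer, m] == 0]
    return unseen[:top]
```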

R packages used.

We have primarily used irlba

What is IRLBA?
• The Implicitly Restarted Lanczos Bidiagonalization Algorithm (IRLBA) of Jim Baglama and Lothar Reichel is a state-of-the-art method for computing a few singular vectors and corresponding singular values of huge matrices.
• With it, computing partial SVDs and principal component analyses of very large-scale data is both time- and space-efficient. The package works well with sparse matrices.



Some code snippets

Code snippet (a cleaned-up sketch, with variable names as in our script) for SVD using irlba, followed by reconstruction of the matrix, using the approach suggested by Dr. Sourav Sen Gupta:

library(irlba)
for (i in 1:no_of_category) {
  # k-rank approximation of this category's transaction matrix via irlba
  s <- irlba(category_matrix[[i]], nv = k)
  approx_mat <- s$u %*% diag(s$d) %*% t(s$v)
  # deviation between the k-rank approximation and the original, for one customer row `cust`
  difference_vector <- approx_mat[cust, ] - category_matrix[[i]][cust, ]
  for (j in 1:100) {
    # column with the maximum deviation between the k-rank and original matrices
    find_i <- which.max(difference_vector)
    merchant_no <- merchant_by_category[[i]][find_i]  # get the merchant_no
    difference_vector[find_i] <- -Inf  # exclude it before taking the next biggest
  }
}

Code snippet for computing the cosine similarity:

getCosine <- function(x, y) {
  sum(x * y) / (sqrt(sum(x * x)) * sqrt(sum(y * y)))
}
similarity_mat <- matrix(nrow = ncol(test_data_matrix), ncol = ncol(test_data_matrix),
                         dimnames = list(colnames(test_data_matrix), colnames(test_data_matrix)))
for (i in 1:ncol(test_data_matrix)) {
  # loop over the upper triangle only; cosine similarity is symmetric
  for (j in i:ncol(test_data_matrix)) {
    similarity_mat[i, j] <- getCosine(test_data_matrix[, i], test_data_matrix[, j])
    similarity_mat[j, i] <- similarity_mat[i, j]  # mirror into the lower triangle
  }
}

Procedure for verifying algorithm:

We built a test dataset from the original data by separating out the first nine months of transactions. We then applied our model to this dataset to predict the next three months, and checked the accuracy against the actual data. Random samples of 10,000 customers were drawn and the results of the samples accumulated.


References:

Badrul M. Sarwar, George Karypis, Joseph A. Konstan, John T. Riedl, “Application of Dimensionality Reduction in Recommender System - A Case Study”, Department of Computer Science and Engineering, University of Minnesota.
Debapriyo Majumdar, “Topics in Data Analytics: Collaborative Filtering, Finding Similar Items in Very High Dimension, Sentiment Analysis”, Indian Statistical Institute Kolkata.

Slides: CDS - Citi Group challenge

Team Members

Gautam Kumar (
Jaideep Karkhanis (
Jayanta Mandi (
Sajal Jain (

‘What’s Cooking?’ – A sneak peek into Classifiers and Web Scraping

“Picture yourself strolling through your local, open-air market… What do you see? What do you smell? What will you make for dinner tonight?”

What’s cooking – The International platter

The motivation for the Kaggle competition named What’s Cooking was to predict the category of a dish’s cuisine given a list of its ingredients. Some of our strongest geographic and cultural associations are tied to a region’s local foods, so a list of ingredients is a clue to the cuisine a dish belongs to. Kaggle arranged the dataset, covering 20 international cuisines, from Yummly. Yummly is a free smartphone app and website that provides recipe recommendations personalized to the individual’s tastes, semantic recipe search, a digital recipe box, shopping lists and one-hour grocery delivery. The train dataset contains 39774 dish-ids along with their ingredients and cuisine labels; the test data contains 9944 dish-ids with their ingredients. Our task is to predict the cuisine of each dish in the test data from its ingredient set. A snippet of the train data:

 "id": 38714,
 "cuisine": "indian",
 "ingredients": [
 "ice cubes",
 "vanilla extract",
 "ground cardamom",
 "ground cinnamon",
 "1% low-fat milk",
 "nonfat yogurt",

Data Preparation

The project starts with data preparation stage:

We have 20 different cuisines in the train and test data, namely Greek, Southern_US, Filipino, Indian, Jamaican, Spanish, Italian, Mexican, Chinese, British, Thai, Vietnamese, Cajun_creole, Brazilian, French, Japanese, Irish, Korean, Moroccan and Russian. Italian cuisine is the most frequent in the train data.

[Histogram: distribution of dishes across the 20 international cuisines in the train data]

To start the data preparation stage, we tried several methodologies to clean the data and get it into shape for analysis.

  • Tokenization: We tokenized each multi-word ingredient as a single word to preserve the pattern present in the data. For example, Red Chili Pepper appears after tokenization as Red_Chili_Pepper. This minimizes the misclassification error that would come from treating Red, Chili and Pepper as three separate tokens. The Python code for tokenization is attached here.
  • Case modification: We deliberately converted all text data to lower case to avoid inconsistency.
  • Sparse term modification: After creating the document term matrix we removed terms that occurred fewer than 3 times. The cut-off was a heuristic choice, arrived at by plugging in different numbers. This method is useful when tokenization hasn’t been used.
  • Porter’s stemming algorithm: This is a process for removing the commoner morphological and inflexional endings from words in English. Example: tomatoes and tomato will be considered the same.
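The first two cleaning steps can be sketched in Python; the plural-stripping rule below is a crude stand-in for Porter's stemmer, for illustration only:

```python
def clean_ingredient(name):
    # lower-case and join a multi-word ingredient into a single token,
    # e.g. "Red Chili Pepper" -> "red_chili_pepper"
    token = "_".join(name.lower().split())
    # crude stand-in for Porter stemming: strip a plural "es"/"s" ending
    for suffix in ("es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token
```

In the real pipeline the full Porter algorithm handles many more morphological endings than this simple suffix rule.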

The next stage is feature engineering. As the number of ingredients per dish varies from (minimum) to 135 (maximum) in the train data, we decided to take the number of ingredients per dish as a feature. Our assumption that this is an important feature can be verified from the importance plot.

Feature Engineering

After data preparation and feature engineering we modelled classifiers on the train data. The first step is creating a Document Term Matrix (DTM) for the ingredient sets. The ‘tm’ package in R creates a DTM from the corpus of ingredients: the entry is ‘1’ when a particular ingredient appears in a dish and ‘0’ otherwise, for all ingredients in the corpus. Sparse terms, whose frequency across all dishes is less than 3, were removed from the DTM. As mentioned, the number of ingredients was added as an extra feature.
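A minimal Python sketch of the DTM construction described above (the project used R's 'tm'; names here are illustrative):

```python
from collections import Counter

def build_dtm(dishes, min_freq=3):
    """dishes: list of ingredient lists. Returns a binary DTM with
    sparse terms removed and the ingredient count as an extra column."""
    freq = Counter(ing for dish in dishes for ing in set(dish))
    vocab = sorted(t for t, c in freq.items() if c >= min_freq)
    rows = []
    for dish in dishes:
        present = set(dish)
        # 1/0 per retained ingredient, plus number of ingredients
        rows.append([1 if t in present else 0 for t in vocab] + [len(dish)])
    return vocab, rows
```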


Based on the nature of classification we have decided to use five models to classify and compare the results produced by each of the classifiers: 1. Decision Tree (CART), 2. Random Forest, 3. Naive Bayes Classifier, 4. Modification of Naive Bayes and 5. Extreme Gradient Boosting – XGBoost.

  • Naive Bayes Classifier:

While applying the Naive Bayes classifier, we assumed that the occurrences of ingredients are uncorrelated: the probability of an ingredient occurring is independent of the other ingredients present in the dish. The second assumption is that the probability of a dish occurring in a cuisine is the product of the probabilities of all the ingredients in the dish, i.e. the ingredients are conditionally independent given the cuisine.

 A simple Bayesian classifier would be

ĉ = argmax_{c ∈ C} P(c) · ∏_i P(x_i | c)

where x = (x_1, …, x_n) is the test ingredient list and C is the set of cuisines with cardinality K. The dish is classified into the cuisine that gives the maximum probability.

Since the cuisine label for each dish is known from the training data, we sorted the data by cuisine and took row sums within each cuisine category. Finally, to convert this count matrix into a probability matrix, we made it column-stochastic.

A sample column stochastic matrix would look like:


  • Note: The unconditional probability of each cuisine can be estimated from the data as the sample proportion of each cuisine, but we need to be careful, since that may not reflect the population proportion.
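A small NumPy sketch of the column-stochastic conversion (each column sums to 1), with made-up counts:

```python
import numpy as np

def column_stochastic(counts):
    """counts[c, i] = how often ingredient i appears in cuisine c."""
    totals = counts.sum(axis=0)
    totals[totals == 0] = 1.0   # leave all-zero columns untouched
    return counts / totals      # each column now sums to 1

# hypothetical counts: 2 cuisines x 3 ingredients
counts = np.array([[2., 0., 1.],
                   [1., 3., 1.]])
P = column_stochastic(counts)
```

Here P[c, i] estimates the probability that ingredient i belongs to cuisine c.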

For complete procedure and derivation, please follow this project article.

  • Prediction code for the Naive Bayes classifier:
for (i in 1:nrow(c_train_ingredientsDTM)) {
  # (reconstructed) ingredients of dish i that also appear in the probability matrix
  ing_considered <- setdiff(ing_cons, setdiff(ing_cons, names(which(c_train_ingredientsDTM[i, ] > 0))))
  Bayes_Predict_Matrix <- as.matrix(Bayes_Prob_Matrix[, ing_considered])
  cuisine_probability <- numeric(nrow(Bayes_Predict_Matrix))
  for (j in 1:nrow(Bayes_Predict_Matrix)) {
    # product of ingredient probabilities times the cuisine prior
    cuisine_probability[j] <- prod(Bayes_Predict_Matrix[j, ]) * Prob_cuisine[j]
  }
}

The complete R code for classic Naive Bayes classifier is here.

  • Modifications of Naïve Bayes:

When classifying through the Naïve Bayes probability matrix, if the training probability of an ingredient appearing in a cuisine is zero, the whole product becomes zero regardless of the other probabilities. A solution is to take the geometric mean of the non-zero entries in a row. However, if a row has only a few non-zero entries, the geometric mean can be misleadingly high, producing wrong classifications. To avoid this, we imposed a threshold: the number of non-zero ingredient probabilities must exceed a particular value before we compute the geometric mean (i.e. we require some minimum number of non-zero terms).

for (j in 1:20) {
  non_zero <- Bayes_Predict_Matrix[j, Bayes_Predict_Matrix[j, ] != 0]
  # geometric mean, used only when enough non-zero entries are present
  if (length(non_zero) >= threshold) cuisine_probability[j] <- exp(mean(log(non_zero)))
}

The complete R code for modified Naive Bayes classifier is here.

Results for Naive Bayes, where different forms of the objective function were used for the experiment:


It can be seen from the above that taking a threshold on the minimum number of ingredients present in a dish before computing the geometric mean actually improves the accuracy. The geometric mean and its modifications are less accurate than probability multiplication, and the modification using uniform probability gives the best accuracy.

  • CART and Random Forest:

Since almost all the features are categorical, a CART model and Random Forest are natural choices of classifier. For space constraints, we do not discuss the basics of the CART model and Random Forest. The presence of more than 4,000 features seriously impaired the predictive ability of the CART model, which tends to overfit on test data; we obtained an accuracy of 41.7%. We then moved to a classifier consisting of a set of weak learners (decision trees, in this case), each randomly assigned a set of features and a subsample to learn from. This model, Random Forest, improved the accuracy on test data to 76.35%. [Figure: structure of the CART decision tree, Kaggle accuracy ≈ 0.42]

  • eXtreme Gradient Boosting- XGBoost:

The philosophy behind random forests encouraged us to use another model, XGBoost, which stands for eXtreme Gradient Boosting. The motivation for boosting was a procedure that combines the outputs of many “weak” classifiers to produce a powerful “committee”. Boosting sequentially applies a weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers. Boosting is thus a way of fitting an additive expansion in a set of elementary “basis” functions, where the basis functions are, in our case, the individual classifiers. The choice of weak learner for boosting is evident from the following diagram. [Figure: comparison of weak learners]

To carry out supervised learning using boosted trees we need to redefine ‘tree’. A regression tree is a function that maps attributes to a score: a tree can be defined by a vector of leaf scores w and a leaf index mapping function q(x). The structure of an individual tree, q(x), guides a sample to a leaf, and the associated score w is assigned as that tree’s prediction for the sample.
The diagram and the equation explain the statements above.



On this basis, the prediction model can be written as the aggregation of the prediction scores of all trees for a sample x. In particular, for the i-th sample,

ŷ_i = Σ_{k=1}^{K} f_k(x_i),  f_k ∈ F

where K is the number of trees, each f_k is a function in the functional space F, and F is the set of all possible trees having a prediction score in each leaf, slightly different from a decision tree, which contains only decision values. With the model defined, we need to optimize an objective function to choose the best model for the training data. We encourage a model to have high predictive power as well as a simple structure (using fewer features). Minimizing the loss function L(Θ) encourages predictive accuracy, while optimizing the regularization term Ω encourages a simpler model with smaller variance in future predictions, making prediction stable. For our tree ensemble model, the objective function to minimize is:

obj(Θ) = Σ_i L(y_i, ŷ_i) + Σ_{k=1}^{K} Ω(f_k)
For multiclass classification, the loss function is the multiclass prediction loss; in the XGBoost model we specifically optimized the softmax objective.

xgb        <- xgboost(xgbmat, max.depth = 25, eta = 0.3, nround = 500, objective = "multi:softmax", num_class = 20)

Since the regression is over a set of functions, the usual stochastic gradient descent would not work here. Instead, an additive strategy is taken: starting from a constant prediction (usually 0), one new tree (function) is added at each iteration.

The model complexity of a tree comprises the number of leaves T and the L2 norm of the leaf scores, which normalizes them:

Ω(f) = γT + (1/2) λ Σ_{j=1}^{T} w_j²
Even intuitively the expression is plausible: minimizing the number of leaves yields a simpler tree, and penalizing the L2 norm of the leaf scores normalizes them and prevents any single leaf from getting an extreme score. The two constants, gamma (γ) and lambda (λ), act as Lagrangian multipliers and can be tuned for accuracy. The optimal score for the objective function can be calculated from the leaf weights and the Taylor coefficients of the loss. In this way, at each iteration we choose an optimized tree that further reduces an objective already partly optimized in the previous iterations, which ensures better accuracy. The optimal score is the best score function for a given tree structure, and the optimal objective reduction measures how good a tree structure is at a particular iteration. Because enumerating all trees in the function space is impossible, a greedy approach is used in practice to find an optimal split. The components of the gain function are the score on the new left leaf, the score on the new right leaf, the score on the original leaf, and the complexity cost γ of introducing an additional leaf. Clearly, if the gain is smaller than γ, it is better not to add that branch; this is nothing but pruning. For the complete mathematical derivation, please follow this complete project article.
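For reference, the standard XGBoost split-gain expression that this paragraph describes in words, where G and H denote the sums of first- and second-order loss gradients over the samples in each candidate leaf, is:

```latex
\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma
```

The three bracketed terms are the scores of the new left leaf, the new right leaf, and the original leaf; γ is the complexity cost of the additional leaf.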

Given the description of XGBoost, it is time to differentiate the model with Random Forest:

  • Random Forest and Boosted Trees are not different in terms of the model; the difference is in how we train them.
  • In terms of the training objective, Boosted Trees add new trees (additive training) that complement the ones already built. This normally gives better accuracy with fewer trees.
  • In a Random Forest the regularization term is missing, so if the gain from a split is greater than some epsilon (an infinitesimally small positive number), the split happens. In Boosted Trees, by contrast, there is explicit control of model complexity, which reduces overfitting.

The complete R code for XGBoost is given here.

In line with our belief in the philosophy behind XGBoost, it performed better than Random Forest on the test data. The accuracy obtained using XGBoost is given below:


Leaderboard rank 186 (Team name: Burn Calories) among 1046 entries (as of 5th Dec, 2015)


Having the result using all the classifiers, we are now in a position to analyse the results.

  • The present data does not satisfy (or cannot be guaranteed to satisfy) the basic assumption of Naïve Bayes, i.e. independence of ingredients.
  • For example, milk and sour ingredients rarely occur together; many such relationships hold between ingredients and cannot be captured by Naïve Bayes.
  • A single decision tree tends to over-fit, and with over two thousand features the expected accuracy of a single tree is low.
  • Random Forest gives accuracy close to XGBoost, since an ensemble of weak learners predicts far better than a single decision tree. But the randomness assumption is not always desirable, and there is no control over the complexity of the weak learners, hence no control over over-fitting. To achieve higher accuracy than random forest, each tree needs to be chosen optimally so that the loss function is minimized.
  • This is where XGBoost comes into play: it builds the tree ensemble by explicit optimization rather than randomness. An added advantage of boosted trees is that the algorithm runs very fast on a distributed system (as the XGBoost package does).
  • The sequential, feedback-driven training of XGBoost is somewhat analogous to reinforcement learning, which is absent in Random Forest classifiers.
  • We tuned the parameter eta, which prevents over-fitting by making the boosting process more conservative. A low eta makes the model more robust to over-fitting but slower to compute. Being cautious about over-fitting, we found eta = 0.1 gave the best result.
  • The optimizing function XGBoost uses for multiclass classification is the softmax objective; this cost function is minimized during training.
  • L1 regularization on leaf weights can perform better than L2 regularization because it encourages low-weight features to be dropped during modelling, making the model simpler.

What’s Cooking – The Indian Platter

The task at the Kaggle What’s Cooking competition inspired us to try the same experiment on Indian regional cuisines.

The Indian dataset consists of a list of 1809 popular Indian regional dishes categorized by state. The dataset was scraped from a website that is an extremely popular resource for Indian expatriates in Europe, Australia and North America.



Scraping the dataset - Scrapy at work:

A web crawler can systematically extract data from multiple web pages by crawling, i.e. accessing, the pages and returning the data you need from each of them. To scrape the data we used Scrapy, an application framework for crawling.

Basics of Scrapy:

  • Scrapy engine is responsible for controlling the data flow between all components of the system
  • The Scheduler receives requests from the engine and enqueues them for feeding them later
  • The Downloader is responsible for fetching web pages and feeding them to the engine which, in turn, feeds them to the spiders.
  • Spiders are custom classes written by Scrapy users to parse responses and extract items (aka scraped items) from them or additional URLs (requests) to follow. Each spider is able to handle a specific domain (or group of domains).
  • The Item Pipeline is responsible for processing the items once they have been extracted (or scraped) by the spiders. Typical tasks include cleansing, validation and persistence (like storing the item in a database).
  • Downloader middlewares are specific hooks that sit between the Engine and the Downloader and process requests when they pass from the Engine to the Downloader, and responses that pass from Downloader to the Engine.

The complete documentation of Scrapy is available here.

We used Scrapy to scrape the list of ingredients for different Indian dishes. We wrote the spider algorithm based on XPath queries to extract required text from HTML content of the page. XPath Helper extension for Google Chrome has been used to record XPath queries. Code snippet for spider:

def parse_fixtures(self,response):
  sel = response.selector
  for tr in sel.css("table.table.table-striped>tbody>tr"):
      item = CraigslistSampleItem()
      item['ingredients'] = tr.xpath('td[@class="name"]/text()').extract()
      item['quantity'] = tr.xpath('td[@class="quantity"]/text()').extract()
      item['unit'] = tr.xpath('td[@class="unit"]/text()').extract()
      yield item

The complete python codes for spider and item are attached.

Indian Dataset at a glance:

The histogram shows the distribution of dishes over Indian cuisines.

The dataset consists of 1809 dishes from 26 states. We divided the dataset in a proportion of 1:5 for test : train.

This is the methodology we used to approach the problem:

Data Preparation

  • The primary problem in the scraped data was inconsistency due to the use of Hindi names for various ingredients.
  • Spelling mistakes were a hindrance to correct feature extraction.
  • Use of plural forms of proper nouns, similar-meaning words and adjectives (different parts of speech) added further noise to the data.
  • To deal with this, we replaced all Hindi words by their English counterparts.
  • Spelling mistakes were corrected.
  • Porter’s stemming algorithm gave a uniform, consistent set of words.
  • Removing sparse terms after creating the DTM removed most of the less frequent adjectives; all other adjectives were deliberately removed from the data.
  • Adjectives such as chopped, cooked and boiled were removed from the ingredient lists.

 Feature Engineering

To improve the effectiveness of the analysis we added two features to our dataset that seemed to help distinguish the cuisines. The details are as follows:

  1. Number of ingredients – We considered the number of ingredients an important feature, as it varies from 2 to 29, just as for the Kaggle data.
  2. Number of spices per dish – We created an Indian spice dictionary containing all commonly used spices, and derived a score: the number of spices per dish. From our subjective knowledge this seems an important feature, since there are regions in India where the use of spices is very high.
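The two engineered features can be sketched as follows; the spice set below is a tiny illustrative subset, not the project's full dictionary:

```python
# illustrative subset of an Indian spice dictionary
SPICES = {"turmeric", "cumin", "coriander", "cardamom", "clove", "cinnamon"}

def engineered_features(ingredients):
    # feature 1: number of ingredients; feature 2: number of spices per dish
    n_ingredients = len(ingredients)
    n_spices = sum(1 for ing in ingredients if ing in SPICES)
    return n_ingredients, n_spices
```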


Trial 1:

  • XGBoost

Parameters: eta = 0.1, no. of rounds = 500, no. of classes = 26, Max depth = 25

Result: Accuracy of prediction = 18.99 % 

Possible reason for failure: the data does not actually contain 26 distinct clusters.

Trial 2:

Based on our subjective knowledge, we redefined the label for each dish, grouping all dishes by their geographic origin.

The dataset was clustered into 5 main categories: 1. East India 2. North Eastern 3. North India  4. South India  5. West India

The histogram shows new distribution of dishes over 5 cuisines. The whole dataset has been uploaded here.


  •  Random Forest

Model parameters: No. of trees = 200, tree depth = 3

Model Results: Accuracy = 47.5 %

  •  XGBoost

Parameters: eta = 0.1, no. of rounds = 500, no. of classes = 5, Max depth = 25

Result: Accuracy of prediction = 48.1 %

The complete R code for Indian data is attached here.


Due to the poor quality of the dataset, we could not achieve an accuracy greater than 50%. Moreover, the dataset was biased, consisting only of popular regional dishes and omitting many local dishes. XGBoost allows building an importance matrix, which contains the features sorted by relative importance.

  • The gain contribution of each feature in each tree is taken into account, then the average gain per feature is calculated to give a view of the entire model.
  • A higher percentage means a more important feature for predicting the training label.

The most important features for the Indian data were the number of ingredients and the commonly used spices.

Importance of features:

[Figure: feature importance for the Indian data]

Excerpt of the prediction matrix:


Challenges Faced

  • Properly structured data was not available
  • Individual knowledge biased the data during labeling, which became crucial for classification
  • Inconsistency was a big threat; repetition of words in ingredient lists made our task really difficult
  • A proper Indian spice dictionary wasn’t available to map the full set of features
  • Our subjective clustering might not be correct; local influences on dishes were ignored
  • Unsupervised learning could have detected the clusters actually present in the data, but an appropriate distance measure between two dishes was unavailable
  • It would be good to weight the DTM, but proper measures (quantity, or some other indicator of importance) for weighting the ingredients of a dish were absent

Future scope

  • A proper distance measure can be studied to measure similarity or distance between two dishes
  • An unsupervised learning or clustering can be done to see number of clusters present in data
  • Two-level clustering could be effective
  • DTM can be weighted based on importance of the ingredients in a dish
  • Classification can be more deep leading specific dish prediction
  • Can be used for recommendation system in big chain retailers
  • Indian Database could be made more properly to have an organized data



We acknowledge our instructor Dr. Sourav Sen Gupta for his continuous support and valuable inputs. This project was the end semester project by CDS Group 04.

The presentation for this project can be found here.



For any query about this project, please contact any of the co-authors:

Bodhisattwa Prasad Majumder
Dattatreya Biswas
Keshav Sehgal
Rachit Tripathi

US Elections 2016 – Sentiment and Network Analysis




Social media has become a new norm of life for billions of people today. With an ever-increasing amount of time spent online sharing information, opinions and emotions, sites like Facebook, Twitter and Google+ have become hot data spots, waiting to be retrieved and analysed.

Elections are no exception to this mantra. US President Barack Obama used Facebook and Twitter during his campaigns. On the Indian scene, PM Modi made Facebook and Twitter his primary tools to spread his message. The BJP and the Congress were even claimed to have hired a ‘Social Media Army’ (source: Firstpost) dedicated to this very purpose.

It comes as no surprise that the twitterati also responded overwhelmingly. Posts and discussions debating the pros and cons became commonplace. ‘Opinion Wars’ were rife. Everybody online had something to say and promote.

Several studies have been done before to tap the sweeping potential of social media data to analyse certain issues. Sentiment Analysis, as the genre is broadly called, refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials (source: Wikipedia).

In this project, our attempt has been to replicate one such study for the upcoming US Elections. We have also attempted to perform a modified network analysis for analysing the tentative affiliations of the masses with regards to the political parties.



Twitter allows collecting data only up to 10 days into the past. We streamed live Twitter feeds for 20 days for a given set of keywords (justification under ‘Assumptions’ below).



The data collected concerns a subset of the actual voting masses, which for practical purposes should not be assumed to be a uniformly random sample. Possible reasons include non-inclusion of the older populations, frequent over-reaction, ephemeral stand on issues and non-sampling fluctuations.

Several users deny permission for location tracking. The network analysis has been done based only on the available ones. Our assumption here is that the locations are missing at random, which can be justified in the light of lack of conclusive study on a possible skew in location restriction pattern amongst users.

Data collected belongs to only certain keywords concerning just the candidates. The rationale here is to exclude possible biases resulting from over-reactions to non-influential topics. Considering the short span of data collection, our focus was on reaction for the major events, which we assumed shall have a much greater impact in the long run, than the short bursts of emotional bubbling.

Tweets have a typical lexical structure, and their analysis poses a special problem. We chose to ignore the ones that were not that frequent, assuming that the majority content of a tweet is contained in its hashtag and words expressing emotions.



With 341 days to go until the 2016 US Presidential Elections, there has been very interesting talk about candidates such as the 15-year-old teen Deez Nuts, the potential woman president Hillary Clinton, the billionaire Republican Donald Trump, and Bernie Sanders.

There are two major parties: the Democratic Party (left wing, blue color) and the Republican Party (right wing, red color). From each party, candidates announce their candidacy for the presidential elections (which they may or may not withdraw during the course of the pre-election year), typically speaking against the other party while trying to outshine their own party's candidates. During the year preceding the election (March 2015 to November 2016), various talks and debates are held to enable each party to decide on one candidate as its nominee for president (the pre-elections). This is the time when the candidates go about addressing issues related to the economy, foreign policy and immigration. Every day is a new opportunity for different candidates to highlight their candidacy and change the perceptions of US citizens in their favour. But who will finally win the elections? Which party currently leads in public support, and which one is planning a strategy against the other?

Our project, sentiment analysis using tweets from Twitter, aims at predicting answers to some of these questions; we do so by evaluating scores for the parties based on the tweet text.

We captured all tweets containing hashtags of any of the candidates and analysed the data by assigning each tweet a sentiment score based on its text. In the process, we perform clustering (on the basis of left and right wings) and further classification to find the two candidates most likely to win the presidency next year, based on current trends.



The data went through several stages, some independent, of analysis. Our attempt shall be to introduce the procedures as smoothly and briefly as possible. We have tabulated our experiences with different packages for reference.

Data Collection: Since we did not have any readily available data at hand, data collection was itself a big step. It was important to collect tweets continuously over a long period of time, along with as many details as possible.


We started with the R twitteR package and collected tweets, but soon faced problems storing them in files. With the limited RAM on our laptops, accumulating tweets in memory would eventually crash R. Moreover, error handling was very complex.

Final code:

Whenever we encounter a tweet mentioning the name or hashtag of any US election candidate, we capture it. For this, we used the Python tweepy package to stream data and stored it in JSON files each day. The data was later converted to CSV to make it more efficient to process in R.
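The JSON-to-CSV conversion step can be sketched as below. This is a minimal, stdlib-only sketch; the field names (`created_at`, `text`, `user_screen_name`, `location`) are illustrative choices and not necessarily the exact columns we kept.

```python
import csv
import json

# Hypothetical columns -- the real schema depends on the Twitter payload.
FIELDS = ["created_at", "text", "user_screen_name", "location"]

def tweet_to_row(tweet):
    """Flatten one raw tweet dict into the columns kept for R."""
    user = tweet.get("user", {})
    return {
        "created_at": tweet.get("created_at", ""),
        "text": tweet.get("text", ""),
        "user_screen_name": user.get("screen_name", ""),
        "location": user.get("location") or "",
    }

def json_lines_to_csv(json_path, csv_path):
    """Convert a file of one-JSON-object-per-line tweets into a CSV file."""
    with open(json_path, encoding="utf-8") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=FIELDS)
        writer.writeheader()
        for line in src:
            line = line.strip()
            if line:
                writer.writerow(tweet_to_row(json.loads(line)))
```

Streaming line by line keeps memory usage flat, which is exactly what the in-memory R approach above could not guarantee.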

We kept doing this for a period of 30 days (all of November 2015), changing the filename as the day changed, ending up with a volume of about 40 GB. That's a fat tweet!


Data Cleaning:

Tweets are filled with abbreviations and evolving lexical structures, which proved to be a major challenge. We extended the existing tm package in R to work around this issue.


Data cleaning was a multi-step process. The first level of cleaning was done while converting to CSV using Python: we removed emojis and other metadata, such as follower information.
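A rough sketch of that first cleaning pass, assuming the usual culprits (URLs, retweet markers, @-mentions, emoji); the exact rules we applied differed in detail:

```python
import re

def clean_tweet(text):
    """First-pass tweet cleanup: drop URLs, retweet markers, @-mentions,
    and non-ASCII characters (which removes emoji), then collapse spaces."""
    text = re.sub(r"http\S+", " ", text)             # URLs
    text = re.sub(r"RT\s+@\w+:?", " ", text)         # retweet markers
    text = re.sub(r"@\w+", " ", text)                # remaining mentions
    text = text.encode("ascii", "ignore").decode()   # emoji / non-ASCII
    return re.sub(r"\s+", " ", text).strip()
```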

In R, we used a standard dictionary of positive and negative words rather than creating a customised one, accepting a slight dip in the accuracy of the word clouds. To counter this loss, we performed classification on the basis of party affiliation and then assigned scores (both positive and negative) to tweets.
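Dictionary-based scoring reduces to counting lexicon hits. A toy sketch with stand-in word lists (the real analysis used a full standard opinion lexicon):

```python
import re

# Toy stand-ins for the standard positive/negative word lists.
POSITIVE = {"great", "win", "support", "best"}
NEGATIVE = {"bad", "lose", "fail", "worst"}

def sentiment_score(text):
    """Score a tweet as (#positive words - #negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
```

So a tweet with two positive hits and no negative hits scores +2, and vice versa.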


We made the following assumptions: first, we would only use scores generated from the tweet text rather than the entire tweet object; second, since only two parties are being monitored, all scores are converted to positive values, that is, a negative score for one party is counted as a positive score for the opposing party.
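The second assumption can be stated precisely as a small mapping; this is our simplifying convention, not an official methodology:

```python
OPPONENT = {"Democratic": "Republican", "Republican": "Democratic"}

def attribute_score(party, score):
    """Map a signed tweet score onto positive support for one of two parties.

    A negative score for one party counts as positive support for the
    opponent, per our two-party simplification.
    """
    if score >= 0:
        return party, score
    return OPPONENT[party], -score
```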



We initially began classifying tweets using various methods to identify the affiliation of an arbitrary tweet. But with increasing complexity and passing time, we had to make certain assumptions to simplify the task, owing to R's limited computational abilities.

Our assumption about long-term-impact tweets was helpful here: we chose to keep only those tweets that contained certain hashtags, including candidate names and their agendas.

Thus, we ended up with the tweets auto-classified, with a negligible error rate.
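The hashtag-based auto-classification amounts to set intersection on a tweet's hashtags. A sketch with hypothetical, abbreviated tag lists (the real lists covered all monitored candidates):

```python
# Hypothetical hashtag lists -- illustrative subsets only.
DEMOCRATIC_TAGS = {"#hillaryclinton", "#berniesanders", "#omalley"}
REPUBLICAN_TAGS = {"#donaldtrump", "#tedcruz", "#bencarson"}

def classify_party(text):
    """Assign a tweet to a party when its hashtags point to exactly one side."""
    tags = {w.lower() for w in text.split() if w.startswith("#")}
    dem, rep = tags & DEMOCRATIC_TAGS, tags & REPUBLICAN_TAGS
    if dem and not rep:
        return "Democratic"
    if rep and not dem:
        return "Republican"
    return None  # ambiguous or unaffiliated
```

Tweets mentioning candidates from both sides are left unclassified rather than guessed, which is one reason the observed error rate stayed low.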


Sentiment Analysis:

We can analyse sentiments on the basis of various available factors: the number of users, their language, their location, or the candidates themselves. We experimented with various packages, namely sentiment and tm.

Due to time constraints, we didn't focus much on the nuances of the packages, and decided to go ahead with the tm sentiment analyser and the tm General Inquirer lexicon.

We then faced the task of choosing a method for analysis and a notion of distance to compare similarities between candidates. We moved from the naive Bayes method to various more complex algorithms, creating linear and nonlinear but consistent distances, and finally formed generalised views using timeline plots, mean plots, day-wise trends, cumulative proportions and stacked box plots. We assembled all our results in a dashboard created with the shiny package in R.



Republican TweetLine

Scoring Proportions of the parties on Day 30

Heat Map of affiliations for both the parties

30-day Score Comparison for both parties

Mean Scores aggregated over each day for both the parties

Word Cloud generated for Donald Trump through supervised learning

Score Cumulative Proportion for the Democratic party

Looking at the generalised plots, timelines and day-wise plots, we could make the following inferences:

  • Hillary Clinton tends to use actions such as praising women to win their support. She also avoids traditional methods to set herself apart from other candidates.
  • The right-wing candidate Donald Trump is using disruptive methods to win support.
  • Currently, Hillary Clinton holds the majority in the Democratic party (followed by Bernie Sanders and then Martin O'Malley).
  • There is no clear majority among the Republicans; however, Donald Trump currently leads the pack.
  • Hence, Hillary Clinton from the Democratic party and Donald Trump from the Republican party seem to be the likely winners of the pre-elections.
  • When pitched against each other, Donald Trump surpasses Hillary Clinton as of 3 December 2015.

However, this cannot be taken as the final verdict of the presidential elections, since these observations are based on a month's data while there are still 11 months to go until the final elections.

We also note some spikes in popularities of the parties. Combined with the overall graph, we notice the following:

  • The popularity of the Democratic party is declining by the day.
  • 5 Nov: Mark Everson withdraws his candidacy for the Republican nomination. The Republicans took a popularity dip and the Democratic party received a boost.
  • 14 Nov: The second Democratic debate, in which Hillary Clinton emerged as the winner of the polls, resulting in a huge spike for the Democratic party.
  • The Paris attacks in November. The Republicans took this event seriously, and Donald Trump spoke to the public about destroying ISIS completely, which proved a turnaround, with the Republicans surpassing the Democratic party in popularity.


Network Analysis:

We used a rather non-traditional network analysis method. We contacted Victor from UCSB, who helped us understand the concept of Social Network Distance (SND). He helped us with the code as well, but we were ultimately unable to obtain a useful notion of distance in the given time frame. SND refers to a distance measure designed for comparing snapshots of a social network containing polar (competing) opinions. (source: UCSB)

We tried to deepen and simplify our understanding of the candidates, but sadly there were no visible clusters among them, even though in reality every candidate belonged to one of the two parties. We then resorted to day-wise and overall generalised analysis of the candidates to form our inferences.

There were many lessons to learn from this. Social media data is heavily skewed whether or not sampling is performed, and classification and clustering should always be approached with the mindset that clusters may simply not exist, for any number of reasons. While it is tempting to expect, and force, candidates into clusters when large amounts of data are available (as in our case), overfitting will always lead to inappropriate results.



We started the project very ambitiously. The initial plans were followed by extensive experimentation with packages, and different hypotheses were tested. What emerged from the commotion was the realisation of various limitations in memory and algorithmic complexity.

The data was very large to begin with, and the initial processing posed many problems. R was a major roadblock for processing big files. Then came the packages: their applications were limited to the purposes they were specifically designed for, leaving very little control over the code implementation for large datasets.

The failure to obtain the network graph was one of the major setbacks to our initial plans, and in hindsight, one reason which could be attributed to this would be the lack of definitive and concrete knowledge regarding the notion of distances.


This study was done purely for academic purposes, within a limited resource and time frame. Any use of its inferences in business decisions is at the reader's own risk. The contributors shall, however, be happy to share the details of the project on request.



Our team ardently supports the Open Source Community, and acknowledges their role in silently making this world a better place. We would like to thank the R-Community, CRAN, and StackOverflow for their immense contributions.

Dr. Sourav Sen Gupta (webpage), Indian Statistical Institute, Kolkata, took us on a roller-coaster ride across various techniques and aspects of data analysis. He has been inspirational throughout the semester and has helped us develop a knack for the field.


We would also like to acknowledge the contributions of the following:


  • Google search
  • Victor from UCSB (link above)



• Chandra Bhanu Jha   • Niten Singh   • Deepu Unnikrishnan   • Madhur Modi


IIM Calcutta, ISI Kolkata, IIT Kharagpur