The Data Scientist Challenge
The Data Scientist Challenge, powered by Citi Digital Acceleration, was launched in October 2015 with the objective of promoting newer, more creative approaches to mining data, discovering actionable customer insights and building impact.
More details can be found at https://www.datascientistchallenge.com/
Credit card industry:
The credit card industry has a wealth of information-rich data, available for millions of customers. According to the contest website, an active customer makes 120 transactions per year on average. This data is very useful and can provide incredible insights into customer preferences and lifestyle choices.
However, the fundamental challenge with this data is its sheer size and the computing power it requires.
Problem statement:
The objective of this challenge is to identify the Top 10 merchants that customers have not transacted with in the past 12 months but are likely to transact with in the next 3 months.
These recommendations need to be personalized at the customer level. They can then be used to extend customized offers from those merchants to the customers who are likely to shop with them.
Data Format:
Customer #   Merchant #   Month   Category   # of transactions
1            1            1       Grocery    3
2            2            1       Dining     2
3            3            1       Grocery    5
4            4            2       Apparel    1
Number of customers: 374328
Number of transactions: 63659708
Number of merchants: 9822
Number of merchant categories: 15
Data size: 690.5 MB
dim(citi_data): 18532355 rows x 5 columns
range(citi_data$TXN_MTH): 201309 to 201408
The final output must be a single file, named 'Merchant Recommendation', in .csv format, containing 10 merchant recommendations for each customer.
Scoring and Evaluation:
For every customer, the recommended merchants are ranked in order of likelihood.
If Merchant 1 has been identified correctly for a customer, 10 points are awarded to the score. Correctly identifying Merchant 2 adds 9 points, and so on.
The final score is the weighted sum of the algorithm score (A) and bonus points for code efficiency and approach (B):
Score = Algorithm Score (A, weightage = 80%) + Code efficiency/Approach (B, weightage = 20%)
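As a quick illustration of this scoring rule (the helper function and merchant names below are hypothetical, not part of the contest code):

```r
# Score one customer's ranked recommendation list against the merchants
# the customer actually transacted with in the next 3 months.
# Rank 1 earns 10 points, rank 2 earns 9, ..., rank 10 earns 1.
score_customer <- function(recommended, actual) {
  points <- 11 - seq_along(recommended)   # 10, 9, ..., 1
  sum(points[recommended %in% actual])
}

recs   <- c("M7", "M3", "M9")   # a shortened ranked list for illustration
actual <- c("M3", "M5")
score_customer(recs, actual)    # M3 at rank 2 earns 9 points
```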
Nature of problem:
Firstly, we identify this as a recommendation problem. Recommendation schemes can be classified into two major types:
i. content-based filtering, and
ii. collaborative filtering.
Content-based recommendation systems analyse item descriptions to identify items that are of particular interest to the user; that is, they try to recommend items similar to those a given user has liked in the past.
Collaborative filtering, on the other hand, assumes that if two customers showed similar buying patterns in the past, then they have similar buying behaviour, and recommends the products/merchants of one to the other.
Based on the above argument, we model our recommendation problem as collaborative filtering, where we recommend to each customer the merchants who have transacted with similar customers in the given period.
Now collaborative filtering may be of several types:

Memory-based collaborative filtering: The neighbours of a customer are found from the entire database, and recommendations are made from this neighbourhood. Forming the neighbourhood requires a notion of similarity/distance, and there are several choices of distance metric.

Model-based collaborative filtering: This approach allows the system to learn to recognize complex patterns from the training data, and then make intelligent predictions for the collaborative filtering task on test or real-world data, based on the learned models. The model can be a Bayesian model, a regression-based model, a clustering model or a dependency network.
Observations:

In the dataset provided, we observed that each merchant belongs to exactly one category.

No missing data has been observed.

No effect of seasonal trend has been observed.

High skewness in the market share of the merchants has been observed. In fact, more than 85% of the market share is captured by 500 merchants out of 9998.
Approaches used:

Similarity-based approach:
In this approach we first find the distance between merchants. Cosine similarity is chosen as the distance metric; its advantage over Jaccard similarity is that Jaccard similarity cannot capture multiple occurrences of a transaction. The distances between merchants are stored in an N x N distance matrix, where N is the number of merchants. The cosine similarity between two merchant vectors (say m1, m2) is calculated by the following formula:

cos(m1, m2) = (m1 . m2) / (||m1|| ||m2||)
Then, we compute a score for each customer-merchant pair by combining the customer's transaction history with the merchant similarity matrix.
Based on this score vector, we select the top 10 merchants for each customer.
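The scoring step can be sketched as a matrix product: each customer's score for a merchant is the similarity-weighted sum of that customer's past transactions. The toy matrix and names below are illustrative, not the contest data:

```r
# Toy customer x merchant transaction matrix
txn <- matrix(c(3, 0, 2,
                0, 1, 1,
                4, 0, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("C", 1:3), paste0("M", 1:3)))

# Column-wise cosine similarity between merchants (N x N matrix)
cosine_sim <- function(m) {
  norms <- sqrt(colSums(m^2))
  crossprod(m) / outer(norms, norms)
}
sim <- cosine_sim(txn)

# Similarity-weighted scores; merchants already transacted with are masked out
scores <- txn %*% sim
scores[txn > 0] <- -Inf
best_for_C3 <- colnames(scores)[which.max(scores["C3", ])]
```

Here customer C3 has only transacted with M1, and M3 is co-purchased with M1 by C1, so M3 gets the highest score among C3's unseen merchants.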

Low-rank approximation collaborative filtering:
At first, we build a transaction matrix whose rows represent merchants and whose columns represent customers; the matrix records how much each customer transacted with each merchant. We then compute a k-rank approximation of this transaction matrix. We expect the low-rank approximation to capture the similarity pattern in customers' buying behaviour. In essence, if the like-minded customers of customer i made a large amount of transactions with merchant j, then the (i,j)-th entry of the approximation is expected to be large, and small otherwise. For each customer vector, we look for merchants whose corresponding entries in the original matrix are zero but which have large values in the low-rank approximation matrix.
Furthermore, we performed the low-rank approximation separately for each category, so that category-specific trends are reflected in the computation.
We start from k = 5 and increase k until the result more or less stabilizes.
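On a toy matrix, the rank-k reconstruction and candidate selection can be sketched with base R's svd() (irlba is only needed at the real data's scale; all names here are illustrative):

```r
set.seed(1)
# Toy merchant x customer matrix with many zero entries
txn <- matrix(rpois(6 * 4, lambda = 0.8), nrow = 6, ncol = 4)

# Rank-k approximation via truncated SVD
k <- 2
s <- svd(txn)
approx_k <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])

# Candidates for customer j: zero in the original matrix,
# ranked by their value in the rank-k approximation
j <- 1
candidates <- which(txn[, j] == 0)
candidates[order(approx_k[candidates, j], decreasing = TRUE)]
```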
NOTE: The original matrix is sparse, but the reconstructed one is not. Also, performing a standard SVD on a matrix of this size would be difficult, so we rely on the built-in R package IRLBA. This package implements the Implicitly Restarted Lanczos Bidiagonalization Algorithm proposed by Jim Baglama and Lothar Reichel. The algorithm starts with one unit-length vector v1 of the SVD, computes u1, and then keeps appending more unit-length vectors vj; but instead of Gram-Schmidt orthogonalization, where the projections on all previously chosen vectors are subtracted, here the subtraction uses only the projection on the previous vector.
K-means clustering:
In this approach we partition the customers into clusters, and recommendations are made based on the most successful merchants of each cluster. We first order the merchants by transaction volume within a cluster; then, for each customer, we find the merchants with whom no transaction has been made but which rank highly in the cluster.
We keep the number of clusters as a tuning parameter and settle on the value beyond which no further change is observed upon increasing the number of clusters.
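A minimal sketch of this clustering scheme with stats::kmeans, on toy data (all names are illustrative):

```r
set.seed(42)
# Toy customer x merchant transaction matrix
txn <- matrix(rpois(20 * 6, lambda = 1), nrow = 20,
              dimnames = list(paste0("C", 1:20), paste0("M", 1:6)))

# Cluster customers by their transaction profiles
km <- kmeans(txn, centers = 3, nstart = 10)

# Rank merchants by total transactions inside the customer's cluster,
# then keep only the merchants this customer has never transacted with
recommend <- function(cust) {
  members <- which(km$cluster == km$cluster[cust])
  popularity <- colSums(txn[members, , drop = FALSE])
  ranked <- names(sort(popularity, decreasing = TRUE))
  ranked[txn[cust, ranked] == 0]
}
recommend("C1")
```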
R packages used:
We have primarily used irlba.
What is IRLBA?
• The Implicitly Restarted Lanczos Bidiagonalization Algorithm (IRLBA) of Jim Baglama and Lothar Reichel is a state-of-the-art method for computing a few singular vectors and corresponding singular values of huge matrices.
• With it, partial SVDs and principal component analyses of very large-scale data are very time- and space-efficient. The package works well with sparse matrices.
Some code snippets
Code snippet for SVD using irlba, followed by reconstruction of the matrix by the approach suggested by Sourav sir:
svd_vector <- vector(mode = "list", length = no_of_category)
reloaded_index <- c()
for (i in 1:no_of_category) {
  index <- merchant_by_category[[i]]
  reloaded_index <- append(reloaded_index, index)
  a <- transaction_matrix[customer_absent, index]
  svd_vector[[i]] <- irlba(a, nv = 1)  # k = 10
  print(i)
}
transaction_matrix <- transaction_matrix[customer_absent, reloaded_index]

submission <- c()
for (j in 1:100) {
  print(j)
  cust_vector <- c()
  for (i in 1:no_of_category) {
    vec <- svd_vector[[i]]$u[j, ] %*% diag(svd_vector[[i]]$d) %*% t(svd_vector[[i]]$v)
    cust_vector <- cbind(cust_vector, vec)
  }
  original_vector <- transaction_matrix[j, ]
  difference_vector <- cust_vector - original_vector
  merchants_picked <- 0
  index_collection <- c()
  suggested <- c()
  while (merchants_picked != 10) {
    index <- match(max(difference_vector), difference_vector)
    difference_vector <- difference_vector[-index]  # remove that index from the row; search for the next biggest
    if (original_vector[index] == 0) {
      index <- index_return(index_collection, index)
      # find the merchant_no of the column for which the maximum deviation
      # is observed between the k-rank approximation and the original matrix
      fi <- find_index(category_length, index)
      if (fi > 1) {
        find_i <- index - category_length[fi - 1]
      } else {
        find_i <- index
      }
      merchant_no <- merchant_by_category[[fi]][find_i]  # get the merchant_no
      suggested <- append(suggested, merchant_no)
      index_collection <- append(index, index_collection)
      merchants_picked <- merchants_picked + 1
    }
  }
  output <- cbind(rep(customer_list[j], times = 10), c(1:10), suggested)
  submission <- rbind(submission, output)
}
Code snippet for computing the cosine similarity
getCosine <- function(x, y) {
  rslt <- sum(x * y) / (sqrt(sum(x * x)) * sqrt(sum(y * y)))
  return(rslt)
}

test_data_matrix <- sparseMatrix(Data_9_months$Cust_map, Data_9_months$Merch_Map_final,
                                 x = Data_9_months$NumTrans,
                                 dimnames = list(c(1:max(test_customer)), c(1:max(test_merchant_list))))
test_data_matrix <- test_data_matrix[test_cust_absent, test_absent_merchant]
test_data_matrix <- test_data_matrix[1:10000, ]

similarity_mat <- matrix(nrow = ncol(test_data_matrix), ncol = ncol(test_data_matrix),
                         dimnames = list(colnames(test_data_matrix), colnames(test_data_matrix)))
for (i in 1:ncol(test_data_matrix)) {
  print(i)
  # fill the upper triangle with cosine similarities
  for (j in i:ncol(test_data_matrix)) {
    similarity_mat[i, j] <- getCosine(test_data_matrix[, i], test_data_matrix[, j])
  }
  # the matrix is symmetric: copy the entries into the lower triangle
  for (j in seq_len(i - 1)) {
    similarity_mat[i, j] <- similarity_mat[j, i]
  }
}
Procedure for verifying algorithm:
We build our test dataset by separating out the first nine months of transactions from the original dataset. We then applied our model to this dataset to predict the next three months of transactions, and checked our accuracy against the original data. Random samples of 10000 customers were taken, and the results of the samples were accumulated.
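The train/holdout split and the prediction target can be sketched as follows (a toy log in the contest's column layout; column and function names are illustrative):

```r
# Toy transaction log: customer, merchant, transaction month (YYYYMM)
log_df <- data.frame(
  cust  = c(1, 1, 1, 2, 2, 2),
  merch = c(10, 20, 30, 10, 40, 10),
  month = c(201309, 201310, 201406, 201309, 201407, 201311)
)

train <- log_df[log_df$month <= 201405, ]   # first nine months
test  <- log_df[log_df$month >  201405, ]   # last three months

# Target set for a customer: merchants seen in the holdout period
# but never in the training period (these are what we try to predict)
new_merchants <- function(cust) {
  setdiff(test$merch[test$cust == cust], train$merch[train$cust == cust])
}
new_merchants(1)   # merchant 30 is new for customer 1 in the last 3 months
```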
References
Badrul M. Sarwar, George Karypis, Joseph A. Konstan, John T. Riedl, "Application of Dimensionality Reduction in Recommender System - A Case Study", Department of Computer Science and Engineering, University of Minnesota.
Debapriyo Majumdar, "Topics in Data Analytics: Collaborative Filtering, Finding Similar Items in Very High Dimension, Sentiment Analysis", Indian Statistical Institute Kolkata.
http://www.math.kent.edu/~reichel/publications/blklbd.pdf
http://qubit-ulm.com/wp-content/uploads/2012/04/Lanczos_Algebra.pdf
Slides: CDS Citi group challenge
Team Members
Gautam Kumar (gautamk2017@email.iimcal.ac.in)
Jaideep Karkhanis (jaideepk2017@email.iimcal.ac.in)
Jayanta Mandi (jayantam2017@email.iimcal.ac.in)
Sajal Jain (sajalj2017@email.iimcal.ac.in)