Citi Group: Data Scientist Challenge

The Data Scientist Challenge

The Data Scientist Challenge, Powered by Citi Digital Acceleration, was launched in October 2015 with the objective to promote creatively newer approaches to mine data, discover actionable customer insights and build an impact.

More details can be found at

Credit card industry:

Credit card industry has lot of information rich data which is available for millions of customers. As per the contest website, on an average there are 120 annual transactions for an active customer. This data is very useful and can be used to get incredible insights in customer preferences and life style choices.

However the fundamental challenge with this data is its sheer size and the computing power required.

Problem statement:

The objective of this challenge is to identify the Top 10 merchants that customers have not transacted within the past 12 months and are likely to
transact with, in the next 3 months.

We need to personalize these recommendations at customer level. This can be then used to give customized offers from those merchants, to customers from whom he is likely to shop.

Data Format:

Customer #

Merchant #



# of transactions





















Number of customers: 374328

Number of transaction:63659708

Number of merchants: 9822

Number of merchant categories: 15

Data size = 690.5 mb
[1] 18532355
[1] 201309 201408

The final output must be composed of one file, namely ‘Merchant Recommendation; file of .csv format consisting of 10 merchants recommendation for each customer.

Scoring and Evaluation:

For every customer, the merchants will be scored in the order of likelihood.

If Merchant 1 has been identified correctly for a customer, 10 points are awarded to the score. Correctly identifying Merchant 2 adds 9 points, and so on.

The final score is simply the weighted sum of final score (A) and additional bonus points for code efficiency and Approach(B)
Score = Algorithm Score (A, weightage = 80%) + Code efficiency/Approach (B, weightage = 20%)

Nature of problem:

Firstly we identify this a recommendation problem. Recommendation scheme may be classified in two major types:

i. content-based filtering and

ii. Collaborative filtering.

Content-based recommendation systems analyse item descriptions to identify items that are of particular interest to the user. So Content-based recommendation systems try to recommend items similar to those a given user has liked in the past.

On the other hand collaborative filtering assumes if two customers showed similar buying pattern in the past, then they have similar buying behaviour and recommend products/merchants of one to another.

Based on the above argument, we model our recommendation problem as Collaborative filtering where we would like to recommend merchants, who have transacted with similar customers in the given period of transaction.

Now collaborative filtering may be of several types:

  1. Memory-based collaborative filtering: For memory based collaborative filtering, neighbours of a customer is found out from the entire database and from this neighbourhood recommendation is made. To form the neighbourhood of the customer the idea of similarity/distance is required and there are several choice of distance.

  2. Model-based collaborative filtering: This approach allows the system to learn to recognize complex patterns based on the training data, and then make intelligent predictions for the collaborative filtering tasks for test data or real-world data, based on the learned models.The modelling can be bayesian-model,regression based model , clustering model or dependency networks.


  • In the dataset provided we observed that each merchant belongs to one category.

  • No missing data has been observed.

  • No effect of seasonal trend has been observed.

  • High skewness in the market-share of the merchants has been observed. In fact it has been found out that more than 85% of the market-share is dominated by 500 customers out of 9998 customers.

Approaches used:

  1. Similarity-based approach:
    In this approach we first find-out the distance between merchants.For distance metric cosine-similarity distance is chosen.The advantage of cosine similarity over jaccard’s similarity is that the multiple occurrence of transaction would not be possible to capture in jaccad’s similarity.The distance between merchants is stored on a NxN distance matrix where N is the number of merchants. The cosine-similarity between two merchant vectors(say m1,m2) is calculated by the following formula:


Then, we compute score-matrix for each customer-merchant pair by the following algorithm:


Now based on this score vector we select the top 10 merchants for each customer

  1. Low-rank approximation collaborative-filtering:
    At first, we build a transaction matrix where the rows represent merchants and the columns represent customers.The matrix provides the details of how each customer transacted with each merchants.Then we compute k-rank approximation of this transaction matrix.The low-rank approximation ,we expect to capture the similarity pattern of the customer’s buying behaviour.In essence, if the like-minded customer of customer-i made big amount of transaction with merchant j then the the (i,j)th entry of the matrix is expected to be large and small otherwise. For one customer vector we look for merchants whose corresponding entry in the original matrix are zero; but have large value in the low-rank approximation matrix.
    Furthermore, we performed the low-rank approximation separately for each category so that the specific trend of the category is reflected on the computation.
    We start from k=5 then we increase the value of k until the result more or less stabilizes.
    NOTE: The original matrix would be a sparse matrix, but reconstructed one would not be a sparse one.Also.performing normal SVD on this matrix would be difficult.So we take refuge in inbuilt R package named IRLBA package.This package utilizes the Implicitly Restarted Lanczos Bidiagonalization Algorithm proposed by Jim Baglama and Lothar Reichel. The algorithm starts with one vector of unit length as v1 of SVD and then compute u1 and then it keeps appending more vectors of unit length vj,but instead of Gram-Schmidt orthogonalization,where it is subtracted form projection on all of the previous chosen vectors,here it subtraction is done only from the projection previous vector.

  2. K-means clustering:
    In this approach we classify the customers in different clusters and recommendation is made based on the most successful merchants of the cluster. We first order the merchants based on transaction for one cluster.Then for each customer we find out merchants with whom no transaction has been made, but are ranked higher in the cluster.
    We’ve keep the number of clusters as the tuning parameter and settle for the one for each no further change can be observed upon increasing the number of clusters.

R packages used.

We have primarily used irlba

What is IRLBA?
• The Implicitly Restarted Lanczos Bidiagonalization Algorithm (IRLBA) of Jim Baglama and Lothar Reichel is a state of the art method for computing a few singular vectors and corresponding singular values of huge matrices.
• With it, computation of partial SVDs and principal component analyses of very large scale data is very time and space -efficient. The package works well with sparse matrices.



Some code snippets

Code snippet for svd using irlba and then reconstruction of matrix by approach suggested by Sourav sir.

for(i in 1:no_of_category){

for(j in 1:100){
for(i in 1:no_of_category)
  difference_vector=difference_vector[-index]#remove that index from that for next biggest index
  #this part is to find out the merchant_no of the column for which maximum deviation is observed between 
  #the k-rank and original
  merchant_no=merchant_by_category[[fi]][find_i]#get the merchant_no


Code snippet for computing the cosine similarity

getCosine <- function(x,y) 
 rslt <- sum(x*y) / (sqrt(sum(x*x)) * sqrt(sum(y*y)))
similarity_mat=matrix( nrow=ncol(test_data_matrix),ncol=ncol(test_data_matrix),dimnames=list(colnames(test_data_matrix),colnames(test_data_matrix)))
for(i in 1:ncol(test_data_matrix)) {
 # Loop through the columns for each column
 for(j in i:ncol(test_data_matrix)) {
 # Fill in placeholder with cosine similarities
 similarity_mat[i,j] <- getCosine(test_data_matrix[,i],test_data_matrix[,j])
 for(j in 1:i-1)

Procedure for verifying algorithm:

We build our test data-set from the original data set by separating out the nine-month transaction data from the original dataset. Then we applied our model in this dataset in order to predict for the next three-months data.Then we check for our accuracy from the original data. Random samples of 10000 customers were aggregated and we accumulated the result of samples.


Application of Dimensionality Reduction in Recommender System - A Case StudyBadrul M. Sarwar, George Karypis, Joseph A. Konstan, John T. Riedl Department of Computer Science and Engineering, University of Minnesota
Topics in Data Analytics: Collaborative Filtering, Finding Similar Items in Very High Dimension, Sentiment Analysis”
 Debapriyo Majumdar, Indian Statistical Institute Kolkata

Slides:- CDS- Citi group challenge

Team Memebers

Gautam Kumar (
Jaideep Karkhanis (
Jayanta Mandi (
Sajal Jain (