Walmart Triptype Classification


Walmart Headquarters, Bentonville, United States

Will: Hey, David! Good Morning, How was your weekend?

David: Weekend was good boss! How was yours?

Will: Oh! It was horrible. You know what, we had an emergency meeting yesterday. All the guys are worried. The results for this quarter are out and they are not looking good at all. Top management is now asking tough questions. We have a hell of a lot of work to do, so pull up your socks, David; sleepless nights await.

David: That’s bad news, boss! But what went wrong? Our estimates had shown considerable profit this quarter. The footfall has not decreased as per our records. There is nothing wrong with the economy either. This seems quite puzzling to me.

Will: It seems to me that we are not able to understand our customers well. Customers are spending too much time looking for the products they want, and that’s probably affecting transactions per visit.

David: Oh! I see. What should we do then boss?

Will: We need to understand our customers: what is the primary purpose of their visit to the store? We need to segment our shelves accordingly, so there is less confusion for buyers. This would reduce the distance they have to travel to get a product and definitely help cut down on the chaos.

David: Hmm. I wonder how our guys didn’t see this problem coming.

Will: You need to dig deeper into the data, David. Call our overseas analytics team, give them the data we have been sitting on for years, and hope to get rescued.

David: Okay Boss! I am on it.


1 Month later

 Dear David,

We hope you’re doing well. We have gone through the data you sent across and our team has worked on it. We request you to allow us to come over to your headquarters and brief you on the findings.

Yours sincerely,



Dear Group-12,

We’d be happy to have you here. We have scheduled your presentation for this weekend. All the top management will be there along with Will. Best of luck to you guys!

Yours faithfully,



Walmart Headquarters, Debriefing session:


  • Treatment of data: missing values and outliers
  • Exploratory analysis of data: feature importance, correlation between features
  • Feature engineering: a) DepartmentDescription b) FinelineNumber c) combination of both
  • Application of supervised learning algorithms: a) XGBoost b) Random Forest c) Gradient Boosting Machine
  • Obtaining trip type classification for the test data

Treatment of data:

  1. Missing values:
    • Out of a total of 647054 rows, 4129 rows have NULL values (less than 1%).
    • Assuming the data is missing at random, rows with NULL values are dropped.
  2. The ‘Weekday’ field is converted to binary (whether the day of visit is a weekend or not):
    • If the day is Friday, Saturday or Sunday, it is considered a weekend (i.e. the value of the field is 1); otherwise it is a weekday.
  3. Negative ‘ScanCount’ values:
    • A negative value in ‘ScanCount’ indicates a return of the item.
    • A return does not affect the buying pattern, so ‘ScanCount’ is set to 0 for negative values.
  4. The data is reshaped so that each visit becomes one observation, with the number of items purchased from each department as the features (features described by DepartmentDescription).


#Removing all rows with NA / NULL values
train.walmart_df <- na.omit(train.walmart_df)
length(which(is.na(train.walmart_df$FinelineNumber)))
length(which(train.walmart_df$DepartmentDescription == "NULL"))

#Making Weekday a binary variable (1 = weekend, 0 = weekday)
train.walmart_df <- transform(train.walmart_df, Weekday = factor(Weekday))
levels(train.walmart_df$Weekday) <- c(1, 0, 1, 1, 0, 0, 0)
train.walmart_df <- train.walmart_df[, c(1, 2, 3, 5, 6)]
test.walmart_df <- transform(test.walmart_df, Weekday = factor(Weekday))
levels(test.walmart_df$Weekday) <- c(1, 0, 1, 1, 0, 0, 0)
test.walmart_df <- test.walmart_df[, c(1, 2, 4, 5)]

#Reshaping the data (dcast from the reshape2 package)
library(reshape2)
train.walmart_df <- dcast(train.walmart_df, dcast_formula, value.var = "ScanCount")
test.walmart_df <- dcast(test.walmart_df, dcast_test, value.var = "ScanCount")

Feature Correlation Graph:


  • Compute the correlation matrix of the reshaped training data.
  • Compute the adjacency matrix from the correlation matrix as follows:
  • If the absolute value of the correlation is less than a threshold (0.05 in this case), assume there is no correlation between the purchases of the two items; the value in the adjacency matrix is 0, i.e. there is no edge between these two products in the correlation graph.
  • Otherwise the value in the adjacency matrix is 1, i.e. there is an edge between the products in the graph.
  • All diagonal elements in the adjacency matrix are made 0 to avoid self-loops.
  • The relationships between different departments are pretty consistent with intuition (e.g. customers who buy men’s wear are likely to also buy socks, undergarments and shaving kits).


train$TripType <- NULL
train$VisitNumber <- NULL
train$Weekday <- NULL
#corrplot: the library used to visualize the correlation matrix
library(corrplot)
#compute the correlation matrix
corMat <- cor(train)
adjMat <- corMat
#Construct the adjacency matrix: 0 if |correlation| < 0.05 or on the diagonal, else 1
for (i in 1:nrow(corMat)) {
  for (j in 1:ncol(corMat)) {
    if (abs(corMat[i, j]) < 0.05 || i == j) {
      adjMat[i, j] <- 0
    } else {
      adjMat[i, j] <- 1
    }
  }
}

Feature Importance graph:



library(xgboost)
nameFirstCol <- names(train)[1]
y <- train[, nameFirstCol]
train$TripType <- NULL
train$VisitNumber <- NULL
trainMatrix <- as.matrix(sapply(train, as.numeric))
numberOfClasses <- max(y) + 1
param <- list("objective" = "multi:softprob",
              "eval_metric" = "mlogloss",
              "num_class" = numberOfClasses)
nround <- 20
bst <- xgboost(param = param, data = trainMatrix, label = y, nrounds = nround)
# Get the real feature names
names <- dimnames(trainMatrix)[[2]]
# Compute the feature importance matrix
importance_matrix <- xgb.importance(names, model = bst)
# Plot the feature importance graph
xgb.plot.importance(importance_matrix)

Classification Algorithm used:


  • XGBoost is short for eXtreme Gradient Boosting. It is an open-source tool – computation in C++, with an R interface provided.
  • A variant of the gradient boosting machine – a tree-based model.
  • The winning model in several Kaggle competitions.


numberOfClasses <- max(triptype) + 1
param <- list("objective" = "multi:softprob",
              "eval_metric" = "mlogloss",
              "num_class" = numberOfClasses)
cv.nround <- 200
cv.nfold <- 10
#run the cross-validation
bst.cv <- xgb.cv(param = param, data = train.walmart_mat, label = triptype,
                 nfold = cv.nfold, nrounds = cv.nround)
nround <- which(bst.cv$test.mlogloss.mean == min(bst.cv$test.mlogloss.mean))
#train the model
nround <- 114 #the number of trees at which test mlogloss was minimum during cross-validation
bst <- xgboost(data = train.walmart_mat, label = triptype, param = param, nrounds = nround)
#predict on the test data
ypred <- predict(bst, test.walmart_mat)

Cross-validation and model building:

  • Once the data has been reshaped into the required format, we use cross-validation to choose the parameters.
  • numberOfClasses: equal to 38, since there are 38 trip types in total.
  • param: parameters of the model, with “objective” indicating the task and “eval_metric” indicating the error measurement of the model.
  • cv.nround: the number of trees to build. This is the parameter we want to tune.
  • cv.nfold: the number of parts to divide the training data into for cross-validation.
  • Run the cross-validation.

Performance evaluation Metric:

  •  Logloss function:

         logloss = -(1/N) * Σ_{i=1..N} Σ_{j=1..M} y_ij * log(p_ij)

  •  N is the number of visits in the test set.
  •  M is the number of trip types.
  •  y_ij is 1 if observation ‘i’ belongs to class ‘j’ and 0 otherwise.
  •  p_ij is the predicted probability that observation ‘i’ belongs to class ‘j’.
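As a concrete check, the metric can be computed directly. The sketch below is in Python for illustration (the analysis itself was done in R), using made-up toy predictions; the clipping of probabilities away from 0 is an implementation detail to keep log() finite.

```python
import math

def multiclass_logloss(y_true, y_prob, eps=1e-15):
    """Multi-class logloss: -(1/N) * sum_i sum_j y_ij * log(p_ij).

    y_true: list of integer class labels, one per observation.
    y_prob: list of per-observation probability lists over the M classes.
    Since y_ij is 1 only for the true class, only p_i,true contributes.
    """
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # clip the predicted probability away from 0 and 1 to keep log() finite
        p = min(max(probs[label], eps), 1 - eps)
        total += math.log(p)
    return -total / len(y_true)

# Two visits, three trip types: the first prediction is confident and correct,
# the second is unsure.
print(multiclass_logloss([0, 2], [[0.9, 0.05, 0.05], [0.2, 0.3, 0.5]]))  # ≈ 0.3993
```

A confident wrong prediction is punished heavily (log of a near-zero probability), which is why clipping matters when submitting probabilities.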

Bagging: Random Forest

  • An ensemble of decision trees.
  • Unlike single decision trees (which overfit when deep and underfit when shallow), Random Forests use averaging to find a natural balance between the two extremes.
  • Random Forest uses bootstrapping and averaging.
  • The out-of-bag error estimate using department description as the features is 44.5%.
  • This implies that DepartmentDescription alone is not a good classifier.
library(caret)
fc <- trainControl(method = "repeatedcv",
                   number = 2,
                   repeats = 1)
tGrid <- expand.grid(mtry = RF_MTRY)
model <- train(x = train, y = target, method = "rf", trControl = fc,
               tuneGrid = tGrid, metric = "Accuracy", ntree = RF_TREES)
#Predict the second training set and the test set using the random forest
train2Preds <- predict(model, train2, type = "prob")
testPreds <- predict(model, test, type = "prob")

Boosting: Gradient Boosting Machine

  • Fits a complex model by iteratively fitting sub-models (decision trees) to the residuals.
  • Gradient boosting uses a “pseudo-gradient”.
  • The pseudo-gradient is the derivative of a general loss function L(); in this case, the logloss function.
  • It measures the deviation of the predicted class probability from the original training example.
  • At each iteration, the sub-learner closest to the pseudo-gradient is picked and added to the model.
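The iterative residual-fitting idea can be sketched in a few lines. The toy example below is in Python for illustration only and uses squared loss, where the pseudo-gradient is simply the residual y − F(x); the actual model used logloss and xgboost's tree learners.

```python
def stump_fit(x, r):
    """Fit a depth-1 tree (stump): the threshold split minimizing squared error against r."""
    best = None
    for t in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((ri - lm) ** 2 for ri in left) + sum((ri - rm) ** 2 for ri in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def gbm_fit(x, y, n_rounds=50, lr=0.1):
    """Gradient boosting sketch: repeatedly fit a stump to the pseudo-gradient.

    For squared loss the pseudo-gradient is the plain residual y - F(x).
    """
    pred = [0.0] * len(x)
    stumps = []
    for _ in range(n_rounds):
        residual = [yi - pi for yi, pi in zip(y, pred)]  # pseudo-gradient
        s = stump_fit(x, residual)
        stumps.append(s)
        pred = [pi + lr * s(xi) for xi, pi in zip(x, pred)]
    return lambda xi: sum(lr * s(xi) for s in stumps)

# Toy 1-D data with a jump between x <= 3 and x > 3
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.2]
f = gbm_fit(x, y)
```

The learning rate lr shrinks each stump's contribution, which is the same role the eta parameter plays in xgboost.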


Minimizing the logloss function

Challenges and Bottlenecks:

  • Memory issues: with limited RAM, handling a big numeric matrix was not feasible.
  • The dcast() function is not practical when reshaping to ~5K features.
  • The train and test data end up with different numbers of features when the features are built from FinelineNumber and DepartmentDescription combined.
  • DepartmentDescription alone is not enough for classification.
  • No improvement even after trying different classification algorithms.
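One common workaround for the dcast() memory bottleneck is to keep the reshaped visit × department table sparse, storing only non-zero cells. A minimal sketch of that idea in Python (hypothetical department names, not the project's actual code):

```python
from collections import defaultdict

def sparse_pivot(rows):
    """Build a sparse visit x department table of summed ScanCounts.

    rows: iterable of (visit_number, department, scan_count) tuples.
    Returns {visit_number: {department: total_count}}, storing only
    non-zero cells instead of a dense ~5K-column matrix.
    """
    table = defaultdict(lambda: defaultdict(int))
    for visit, dept, count in rows:
        table[visit][dept] += max(count, 0)  # returns (negative counts) set to 0
    return {v: dict(d) for v, d in table.items()}

rows = [
    (1, "DAIRY", 2),
    (1, "PRODUCE", 1),
    (1, "DAIRY", -1),   # a return: contributes 0
    (2, "MENSWEAR", 3),
]
print(sparse_pivot(rows))  # → {1: {'DAIRY': 2, 'PRODUCE': 1}, 2: {'MENSWEAR': 3}}
```

In R the analogous route would be a sparse matrix from the Matrix package rather than a dense dcast() result.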



Thank you!

*All the characters and events depicted in this post are entirely fictitious.