Rossmann operates a chain of drug stores across 7 European countries. The company decided to apply a single solution across all of its stores in Germany, one that captures each store's characteristics and sales pattern and can forecast each store's sales 6 weeks in advance.
Impact on Company:
Reliable sales forecasts enable store managers to create effective staff schedules and stay focused on their customers and teams.
Reliable forecasts can also increase productivity and customer satisfaction by enabling better service.
Individual store managers use their own forecasting schemes for their own stores, and the results vary widely from one manager's scheme to another.
Forecasting also takes up a lot of the store managers' time, which they could spend on other productive tasks, and these store-specific models cannot be applied to all stores.
Except for competition distance and the duration of promotions, all explanatory variables are categorical.
There is also no common trend across the stores, so it is very difficult to apply a single time series model to all of them.
Even a separate time series model for each store does not give satisfactory results (RMSPE of 0.28).
There are 180 stores that lack 6 months of sales data, and we have no information on how to fill that gap. We also did not find substantial information on the Kaggle competition forum about filling those missing values.
There are some missing values in competition distance and time, but they are few in number, so we replace the null values in those variables with the median of the available data.
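The median replacement described above can be sketched as follows. This is only an illustration in Python using the standard library (the project itself was done in R); the function name `fill_missing_with_median` and the sample distances are hypothetical and not taken from the competition data.

```python
from statistics import median


def fill_missing_with_median(values):
    """Return a copy of `values` with None entries replaced by the median
    of the entries that are present."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical competition distances; None marks a missing value.
competition_distance = [1270.0, None, 570.0, 14130.0, None, 620.0]
filled = fill_missing_with_median(competition_distance)
```

The median is less sensitive than the mean to a few very distant competitors, which is one reason it is a common choice for this kind of gap.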
The data contains the daily sales of each store from 1 January 2013 to July 2015, which is approximately 1 million (10 lakh) observations, with features such as
timestamp (week, date), opening status, promotion status on that day and state/school holiday status. This is in fact a large dataset to handle.
In addition to the sales data, there is store information (9 features) for each of the 1115 stores.
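To use both sources in one model, the store information has to be attached to every daily sales row of the same store. A minimal Python sketch of that join is given below (the project itself was done in R). The rows are made up, the helper `attach_store_info` is hypothetical, and the column names `Store`, `Date`, `Open`, `Sales` and `CompetitionDistance` are assumptions about how the files are laid out.

```python
# Hypothetical store information, keyed by store id.
stores = {
    1: {"StoreType": "c", "Assortment": "a", "CompetitionDistance": 1270.0},
    2: {"StoreType": "a", "Assortment": "a", "CompetitionDistance": 570.0},
}

# Hypothetical daily sales rows, one per store per day.
daily_sales = [
    {"Store": 1, "Date": "2015-07-30", "Open": 1, "Promo": 1, "Sales": 5020},
    {"Store": 2, "Date": "2015-07-30", "Open": 1, "Promo": 1, "Sales": 5567},
    {"Store": 1, "Date": "2015-07-31", "Open": 1, "Promo": 1, "Sales": 5263},
]


def attach_store_info(sales_rows, store_table):
    """Left-join: copy each sales row and add the columns of its store.
    Rows whose store id is unknown are kept without the extra columns."""
    merged = []
    for row in sales_rows:
        info = store_table.get(row["Store"], {})
        merged.append({**row, **info})
    return merged


training_rows = attach_store_info(daily_sales, stores)
```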
We used different online sources, such as a similar Kaggle competition on weekly sales prediction for Walmart stores and the Kaggle forum for this competition, to get an idea of how to approach the solution.
First we applied a time series model to each store. It gives an RMSPE of 0.28, which implies that a time series model alone does not capture the variability in the sales data.
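For reference, the error measure quoted above can be computed as in the Python sketch below (the project itself was done in R). It assumes the usual definition, the square root of the mean of ((actual - predicted) / actual)^2, and it assumes that observations with zero actual sales are skipped, since the ratio is undefined for them. The sample numbers are made up.

```python
import math


def rmspe(actual, predicted):
    """Root Mean Square Percentage Error over the observations whose
    actual value is non-zero (assumption: zero-sales days are skipped)."""
    terms = [((a - p) / a) ** 2 for a, p in zip(actual, predicted) if a != 0]
    if not terms:
        raise ValueError("no observations with non-zero actual sales")
    return math.sqrt(sum(terms) / len(terms))


# Made-up example: the third observation has zero sales and is skipped.
actual = [100.0, 200.0, 0.0, 400.0]
predicted = [110.0, 180.0, 50.0, 400.0]
score = rmspe(actual, predicted)
```

Because the error is relative, a miss of 10 units counts for more on a low-sales day than on a high-sales day.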
So we require models that capture the effect of time on sales as well as the effect of the explanatory variables.
Using exploratory analysis of the datasets, we find that there are comparatively few stores of type ‘b’, but each of them contributes more to sales than the rest. Assortment level ‘b’ is also available only at those stores. For the remaining store types there is no significant difference in the sales data.
But it is difficult to judge which variable has the most impact on each store from the visualizations alone.
Using a random forest, we are able to capture both the effect of the features on sales and the variability of sales over time.
The importance of the different explanatory variables according to the random forest is given below:
(Most important) Store > Promo (Long Term Promotion) > CompetitionOpenSinceYear > CompetitionOpenSinceMonth > DayOfWeek > StoreType > day > month > Promo2SinceWeek > Assortment > PromoInterval > year > Promo2 (Promotion offered daily/Short term promotion) (Least important).
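The day, month and year entries in the ranking above are calendar parts of each observation's date. A small Python sketch of how such features can be derived is given below (the project itself was done in R). The helper name `date_features`, the ISO date format and the Monday = 1 numbering for DayOfWeek are assumptions, not details taken from the project.

```python
from datetime import date


def date_features(iso_date):
    """Split a 'YYYY-MM-DD' string into calendar features.
    DayOfWeek uses 1 = Monday ... 7 = Sunday (an assumption)."""
    d = date.fromisoformat(iso_date)
    return {
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "DayOfWeek": d.isoweekday(),
    }


features = date_features("2015-07-31")
```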
We further explored other algorithms such as XGBoost. Taking a weighted average of the Random Forest and XGBoost results gives better
results than applying either algorithm alone.
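The weighted average can be sketched in Python as below (the project itself was done in R). This is an illustration under stated assumptions: the weight is chosen by a simple grid search on a held-out set, which may differ from how the weights were actually tuned, and the predictions, the helper names `blend` and `tune_weight`, and the grid of 21 candidate weights are all made up.

```python
import math


def rmspe(actual, predicted):
    """Root Mean Square Percentage Error, skipping zero actual values."""
    terms = [((a - p) / a) ** 2 for a, p in zip(actual, predicted) if a != 0]
    return math.sqrt(sum(terms) / len(terms))


def blend(pred_a, pred_b, weight_a):
    """Weighted average of two prediction lists; weight_a applies to pred_a."""
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(pred_a, pred_b)]


def tune_weight(actual, pred_a, pred_b, steps=20):
    """Try weights 0, 1/steps, ..., 1 and return the one with the lowest RMSPE."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: rmspe(actual, blend(pred_a, pred_b, w)))


# Made-up held-out data: one model over-predicts, the other under-predicts.
actual = [100.0, 200.0, 300.0]
rf_pred = [110.0, 220.0, 330.0]
xgb_pred = [90.0, 180.0, 270.0]

best_w = tune_weight(actual, rf_pred, xgb_pred)
blended = blend(rf_pred, xgb_pred, best_w)
```

Blending helps most when the two models make different kinds of errors, as in this toy case where the errors cancel.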
We used R with libraries such as randomForest, xgboost and ts (for time series), and visualization libraries such as ggplot2.
Using tuned weights for the weighted average of the results obtained from Random Forest and XGBoost, the Root Mean Square
Percentage Error (RMSPE) of the combination is 0.11154. Our team ranked 1213 out of a total of 3242 teams.