**INTRODUCTION**

In an environment where the only constant is innovation, the high-tech and telecommunications sectors are grappling with rapid changes in customer behavior and the competitive landscape. As the industry shifts its focus from products to solutions and emphasizes mobility and cloud-based delivery models, high-tech and telecommunications companies are rethinking all aspects of their go-to-market strategy:

- How do we compete in an environment where customers are demanding outcome-focused, managed services delivered on a subscription basis?

- How can we develop deep customer insights when we often don’t have a direct relationship with the end user?

- What do we do with all the data generated, and how do we leverage it to make smart business decisions?

- How can we anticipate or, better yet, drive changes in the market that will give us a competitive advantage?

Understanding customer behavior and purchase patterns is therefore essential for making better business decisions in this competitive world.

**PROBLEM STATEMENT**

The challenge is to model the customers of a telecom company and predict their propensity to buy an add-on. The telecom sector is trying to leverage the data it generates and, in turn, is rethinking its business strategies to maximize ROI. In today’s world many factors influence a customer’s decision to buy a product or service, so it is essential to reach the masses in an efficient manner.

Our objective is to build a predictive model that determines whether a customer will opt for the add-on, using the data at hand. The model’s performance can be judged by the accuracy of its predictions. We will discuss the various methods and techniques required to build the model, so some familiarity with random forests and a few R packages will help in following along.

**PROBLEM SOLVING APPROACH**

We take a holistic approach to solving the problem, following the steps given below.

**PRELIMINARY ANALYSIS**

This stage focuses on understanding the data through some preliminary statistics. Our data has 30,000 observations with 190 explanatory variables. We split the data into training, testing, and validation sets.
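A split of this kind can be sketched as follows. This is an illustrative Python/NumPy translation (the original work was done in R), and the 60/20/20 ratios are our assumption, not stated in the report:

```python
import numpy as np

def split_indices(n, train=0.6, test=0.2, seed=42):
    """Shuffle row indices and split them into train/test/validation parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(n * train)
    n_test = int(n * test)
    return (idx[:n_train],                   # training rows
            idx[n_train:n_train + n_test],   # testing rows
            idx[n_train + n_test:])          # validation rows

train_idx, test_idx, val_idx = split_indices(30000)
```

Shuffling before splitting guards against any ordering in the raw file leaking into one partition.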

A unique aspect of the data is that there were *no labels for the explanatory variables*, which made it difficult to relate the explanatory variables to the response variable. Another important feature of the data is its *missingness*: only 38 variables contained actual data, and the rest had almost 90 percent missing values.

**DATA CLEANING**

In this step, we remove the features (explanatory variables) that have more than 90% missing values and whose missingness is completely at random (MCAR). We then split the data into the two response groups (zeros and ones) and check whether the spread of each explanatory variable differs between the groups. This tells us about the influence of the explanatory variable concerned on the response variable.
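The missingness filter can be sketched as below. This is an illustrative NumPy version (the original pipeline was in R); the toy matrix is invented for demonstration:

```python
import numpy as np

def drop_sparse_columns(X, threshold=0.9):
    """Keep only columns whose fraction of missing (NaN) values is <= threshold."""
    missing_frac = np.isnan(X).mean(axis=0)
    keep = missing_frac <= threshold
    return X[:, keep], keep

# toy example: the second column is entirely missing and gets dropped
X = np.array([[1.0, np.nan, 3.0],
              [4.0, np.nan, np.nan],
              [7.0, np.nan, 9.0]])
X_kept, mask = drop_sparse_columns(X)
```

Note that this sketch only checks the missing fraction; the MCAR judgment in the report is a separate, statistical decision.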

For the remaining 38 variables we performed multiple imputation using the MICE package, with *Predictive Mean Matching* as the imputation method. The mice package in R helps impute missing values with plausible data values, drawn from a distribution designed specifically for each missing data point.
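The core idea of predictive mean matching can be shown in a minimal sketch. MICE itself is far more elaborate (chained equations over many variables, multiple imputed data sets); this single-variable, single-pass Python version is our own illustration, not the mice algorithm:

```python
import numpy as np

def pmm_impute(x, z):
    """One-variable predictive mean matching sketch:
    regress x on a complete covariate z, predict the missing entries,
    then fill each with the observed x whose *predicted* value is closest."""
    x = x.astype(float)
    obs = ~np.isnan(x)
    # least-squares fit of x ~ a + b*z on the observed rows
    A = np.vstack([np.ones(obs.sum()), z[obs]]).T
    coef, *_ = np.linalg.lstsq(A, x[obs], rcond=None)
    pred = coef[0] + coef[1] * z          # predictions for all rows
    filled = x.copy()
    for i in np.where(~obs)[0]:
        donor = np.argmin(np.abs(pred[obs] - pred[i]))
        filled[i] = x[obs][donor]         # borrow a real observed value
    return filled

x = np.array([1.0, 2.0, np.nan, 4.0])
z = np.array([1.0, 2.0, 3.0, 4.0])
filled = pmm_impute(x, z)
```

Because the fill value is always a real observed value (a "donor"), PMM never produces implausible numbers outside the observed range, which is its main appeal over plain regression imputation.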

**DATA MODELLING**

In order to generate insights from the data, we need to build a prediction model. As we had limited resources with medium computing power, we decided to select only the important features to save time and memory, so feature selection is an important part of the modelling.

We first compute the correlation matrix and draw a heat map for the 38 features, to select features that are independent of each other. Furthermore, we perform PCA (Principal Component Analysis) on these variables to select the explanatory variables that explain most of the variance of the whole data set.
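These two steps can be sketched together. This is an illustrative NumPy version (the report's analysis was in R), with synthetic data invented to show a near-duplicate feature being flagged by the correlation matrix:

```python
import numpy as np

def pca_explained_variance(X):
    """Correlation matrix plus PCA on standardized features; returns the
    correlation matrix and the fraction of variance per principal component."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each column
    corr = np.corrcoef(Z, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)[::-1]   # eigenvalues, largest first
    return corr, eigvals / eigvals.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 4] = X[:, 0] + 0.01 * rng.normal(size=200)  # near-duplicate of column 0
corr, var_ratio = pca_explained_variance(X)
```

A high off-diagonal entry in `corr` (here between columns 0 and 4) marks a redundant feature, while `var_ratio` shows how many components are needed to cover most of the variance.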

From the above operations, we obtained 10 features for building our prediction model using Random Forest.

**TESTING**

We had already split our training data set randomly into two parts. Testing is an iterative process in which we tune the different parameters of our prediction model. A random forest has parameters such as:

- mtry: number of variables randomly sampled as split candidates at each node
- maxnodes: maximum number of terminal nodes a tree can have
- nodesize: minimum size of the terminal nodes
- ntree: number of decision trees to grow

These parameters help us control underfitting and overfitting.

In the data set, only about 7 percent of the response values are 1’s and the rest are 0’s. Moreover, our objective was to predict the 1’s rather than the 0’s. We therefore shifted the classification cutoff in favour of class 1, raising the weight on a response lying in the vicinity of 1 to 0.85.
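The effect of moving the cutoff can be sketched as follows. This is a generic Python illustration of thresholding predicted probabilities, not the report's R code, and the cutoff value of 0.15 is our own example (a low cutoff labels more observations as the rare class 1):

```python
def label_with_cutoff(probs, cutoff=0.15):
    """Label an observation 1 when its predicted probability of the
    positive class exceeds the cutoff; a low cutoff favours the rare 1's."""
    return [1 if p > cutoff else 0 for p in probs]

# with the default 0.5 cutoff only the last two would be 1's
labels = label_with_cutoff([0.05, 0.2, 0.6, 0.9])
```

Lowering the cutoff trades precision for recall, which is the right direction when the evaluation metric is F1 on a 7-percent minority class.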

The mean F1 score is used to evaluate the submissions; it considers both precision and recall.

Contingency Table:

|              | Predicted 1         | Predicted 0         |
|--------------|---------------------|---------------------|
| **Actual 1** | True Positive (TP)  | False Negative (FN) |
| **Actual 0** | False Positive (FP) | True Negative (TN)  |

Precision P is the ratio of true positives (TP) to all predicted positives: P = TP / (TP + FP).

Recall (or sensitivity) R is the ratio of true positives (TP) to all actual positives: R = TP / (TP + FN).

The F1 score is the harmonic mean of the two: F1 = 2PR / (P + R).
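The formulas above can be wired together in a few lines; the counts in the usage example are invented for illustration:

```python
def f1_from_counts(tp, fp, fn):
    """F1 = harmonic mean of precision P = TP/(TP+FP) and recall R = TP/(TP+FN)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

# e.g. 30 true positives, 20 false positives, 70 false negatives
score = f1_from_counts(tp=30, fp=20, fn=70)  # P = 0.6, R = 0.3, F1 = 0.4
```

Note that true negatives never enter the formula, which is why F1 suits heavily imbalanced data like this 7-percent-positive set.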

As we can see, what matters most are the true positives, false negatives, and false positives, so we need to tilt the partition line towards 1 rather than 0. To do this we need to understand the output of a random forest: it gives the probability of the response lying in the vicinity of one or zero. To obtain more true positives, we gave more weight to tuples whose response variable is one.

**EVALUATION**

In the ideatory ZS Customer Modeling Challenge, we finished 27th on the leaderboard with an F1 score of 0.026, the highest being 0.268. Later on, we further improved our model and achieved an F1 score of 0.23.

**LEARNING**

This was our first hands-on experience solving an analytics problem. We learned the various aspects of a big-data problem: the steps involved in solving one and the common challenges faced along the way. Furthermore, we explored several R packages useful for such data sets, such as caret for data modelling, missForest for data imputation, and randomForest for classification and regression based on a forest of trees using random inputs.

**REFERENCES**

- https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
- https://cran.r-project.org/web/packages/mice/mice.pdf
- https://cran.r-project.org/web/packages/missForest/missForest.pdf
- http://datascienceplus.com/imputing-missing-data-with-r-mice-package/
- https://www.ideatory.co/challenges/zs-customer-modeling-challenge-2015/
- ZS Customer Modelling Presentation