Effects of Lifestyle on Aging

Introduction:

Many studies have focused on understanding the physical and cognitive changes that define aging. These studies work to identify genetic, physical, behavioral, and environmental factors, the primary factors that affect the aging process, and to understand the interrelationship between aging and various diseases.

Objective:

Explore the data to uncover insights into the impact of lifestyle on aging. Specifically, we need to predict three conditions: Arthritis, Angina, and Chronic Lung disease. The detailed problem statement is available at the link below.

https://www.crowdanalytix.com/contests/effects-of-lifestyle-on-aging

Data Explanation:

As background, 31 unique feature groups such as Economic condition, Personal health, Residence, and Smoking/Alcohol were given to analyze the dependent variables. The data set contains 269 features in total: 3 of them are dependent variables (Angina, Arthritis, Chronic Lung) and the remaining 265 are explanatory variables. The total number of observations is around 13K.

Each unique feature group consists of mixed data types; for example, 'Physical activity' contains categorical as well as continuous data. Some feature groups also contain ordinal data, such as Self care and Memory. Initially, much of the raw data was filled with NA values.
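
As a minimal sketch of handling these mixed types in R (the file and column names below are illustrative, not from the contest data): categorical and ordinal columns should be coerced to factors so that downstream tools such as missForest treat them correctly.

# Illustrative only: "train.csv", V17 and V18 are placeholder names.
raw <- read.csv("train.csv", stringsAsFactors = FALSE)
raw$V17 <- factor(raw$V17)                   # nominal categorical
raw$V18 <- factor(raw$V18, ordered = TRUE)   # ordinal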

Levels of the data:
[Figure: overview of the levels present in the data]

Workflow in steps:

  1. Consulting a domain expert (a doctor) to understand the dependency among the 3 response variables.
  2. Predicting NA values.
  3. Dimension reduction of categorical variables (decision trees).
  4. Dimension reduction of continuous variables.
  5. Segregating train & validation data.
  6. Implementing the prediction algorithms & testing the accuracy of the various models.

Handling Missing Data (NA) values:

The first hurdle of our data analysis was to predict and fill the missing values in the data. For this we used the missForest package in R to predict the missing values. We also tested the quality of the NA imputation by masking some known values, predicting them, and checking the results for correctness.
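
A minimal sketch of that check using missForest's own helpers: prodNA() introduces NAs at random and mixError() scores the imputation against the known truth (complete_cases is a placeholder for a fully observed subset of the data):

library(missForest)
# Mask 10% of the known values at random, impute, then score the imputation.
masked <- prodNA(complete_cases, noNA = 0.1)
imp <- missForest(masked)
mixError(imp$ximp, masked, complete_cases)  # NRMSE / PFC against the true values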

About the missForest principle:

We used the missForest package to impute the missing data values. missForest trains a random forest on the observed values of the data matrix to predict the missing values, and it can impute both continuous and categorical variables.

In missForest, the NA values are first replaced with mean values for continuous variables and mode values for categorical variables. After each iteration, the difference between the previous and the new imputed data matrix is assessed separately for the continuous and categorical parts. The stopping criterion halts the imputation process as soon as both differences have increased for the first time.

The imputation error for continuous variables is assessed with the normalized root mean squared error:

NRMSE = sqrt( mean((Xtrue - Ximp)^2) / var(Xtrue) )

where Xtrue is the complete data matrix and Ximp is the imputed data matrix.
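
As a sanity check, this quantity is easy to compute directly; a minimal sketch for the continuous part, assuming X_true and X_imp are numeric matrices:

# Normalized RMSE of an imputation, as defined above.
nrmse <- function(X_imp, X_true) {
  sqrt(mean((X_true - X_imp)^2) / var(as.vector(X_true)))
}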

Code Snippet:

install.packages("missForest")
library("missForest", lib.loc = "C:/Program Files/R/R-3.2.2/library")
test_fill <- missForest(test2, ntree = 50, mtry = 8, maxiter = 10)
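
The fitted object is a list; the completed data frame is returned in its ximp component and the estimated out-of-bag imputation error in OOBerror:

test_filled <- test_fill$ximp  # imputed data set
test_fill$OOBerror             # estimated imputation error (NRMSE / PFC)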

ntree: number of trees to grow in each forest.

mtry: number of variables randomly selected at each split.

maxiter: maximum number of iterations to perform if the stopping criterion is not met earlier.

View of the training data after imputing the missing values:

[Figure: training data after imputation]

Takeaway from a domain expert (a doctor):

We first checked the interdependency of the response variables. Alongside the statistical analysis, we consulted a doctor for domain knowledge on this question.

Conclusion:

Chronic lung disease is totally independent of the other two. Arthritis can be linked with Angina with a probability of 0.05, and only in rare cases. As a result, we can safely assume that all response variables are independent of each other.

Dimension Reduction of categorical variables through Decision Trees:

The data was genuinely complicated: it contains categorical, continuous, and ordinal variables. Moreover, the data dictionary indicates that one parameter may consist of multiple features. For example, the parameter 'Economic conditions' spans features V17-V45 (28 features!). Decision trees proved their worth in reducing such a group to just a few features. We applied decision trees to every set of unique features for dimension reduction.

After using decision trees we were able to reduce the dimensions to:

  • For Angina: 64
  • For Arthritis: 58
  • For Chronic Lung: 41

Important Categorical features extracted for each disease:

Because of this reduction, overall efficiency increases: computation time decreases and unnecessary parameters are thrown out. The following table shows the exact selection of features for each disease.

[Table: important categorical features selected per disease]

The table shows that:

For the diseases Chronic Lung, Arthritis, and Angina, the important features are {V9}, {V2, V6, V11}, and {V2, V3, V5, V6, V11} respectively, which correspond to the unique feature named 'General Health'. This is how decision trees helped with feature engineering, the most important task of analytics.

Decision Tree Code Snippet:

library(party)  # ctree() comes from the party package
chest_Arth <- ctree(Angina ~ V258 + V259 + V260 + V261, data = filled_data)
plot(chest_Arth, type = "simple")

[Figure: fitted decision tree]
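
The important features were read off the plotted trees. As an alternative sketch, rpart exposes a variable-importance vector directly; the formula and data names below mirror the ctree call above and are otherwise illustrative:

library(rpart)
fit <- rpart(Angina ~ V258 + V259 + V260 + V261, data = filled_data)
fit$variable.importance  # named vector; larger values mean more important splits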

Dimension Reduction of continuous variables Using Principal component analysis (PCA):

We have about 35 continuous attributes, and we used PCA to reduce the dimension of this part of the data. Using the train data set, we found the principal components and their respective contributions to the variance. Since the first principal component explains around 99.7% of the variance, it alone was selected.

We took the loadings of that principal component and calculated scores for the train and test sets.

Scree Plot:

[Figure: scree plot of the principal components]

PCA code snippet:

cont_data_pca <- princomp(cont_data, cor = FALSE)
summary(cont_data_pca)
screeplot(cont_data_pca, type = "lines")
cont_data_pc <- cont_data_pca$loadings[, 1]
cont_mat <- as.matrix(cont_data)
cont_train_scores <- cont_mat %*% cont_data_pc
cont_test_scores <- as.matrix(filled_test) %*% cont_data_pc
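
One caveat: princomp() centers each variable by its mean, so the raw matrix products above give scores shifted by a constant relative to princomp's own scores. A hedged alternative that applies the learned centering automatically:

# predict() re-uses the centering from the fitted princomp object:
cont_test_scores <- predict(cont_data_pca, newdata = as.data.frame(filled_test))[, 1]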

 

Prediction Methods:

We compared the prediction of the diseases from the explanatory variables using mainly three methods:

  1. Random forest
  2. Maximum entropy (maxent)
  3. Naive Bayes

Let me first describe the principle behind random forests and their accuracy.

Method 1: Random Forest

Random forests (Breiman, 2001) is a substantial modification of bagging that builds a large collection of de-correlated trees, and then averages them. Bagging or bootstrap aggregation is a technique for reducing the variance of an estimated prediction function. Bagging seems to work especially well for high-variance, low-bias procedures, such as trees.

Some of the important parameters are:

  1. The size of the forest, i.e., number of trees
  2. The maximum allowed depth for each tree
  3. The amount of randomness between trees

We divided the training data into two parts:

  • 70% of the data was kept as train data to train the random forests.
  • 30% of the data was used as validation data for checking accuracy.

For the random forest on the training data, we used the following parameters:

mtry = 10

ntree = 300

Code snippet:

library(randomForest)
lung_fit <- randomForest(Chronic_Lung~V9+V11+V17+V19+V34+V36+V38+V46+V47+V13+V14+V51+V67+V68+V69+V263+V265+V70+V102+V106+V114+V124+V256+V88+V89+V90+V92+V99+V100+V261+V258+V155+V160+V239+V242+V246+V249+V50+V115+V116, data = train_lung, ntree = 300, mtry = 10)

lung_pred <- predict(lung_fit, train_dat_validate)

# prop.table() expects a contingency table, not two factors;
# the validation response column is assumed to be Chronic_Lung:
prop.table(table(lung_pred, train_dat_validate$Chronic_Lung))
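
Overall validation accuracy is then the share of matching labels:

mean(lung_pred == train_dat_validate$Chronic_Lung)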

After training, testing this model on the validation data gave the following prediction accuracy levels:

Disease        Prediction accuracy (%)
Chronic Lung   91
Arthritis      93
Angina         90

We can tune parameters like ntree and mtry to maximize the prediction accuracy.
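
A minimal sketch of such tuning with randomForest's built-in helper tuneRF(), which searches over mtry using the out-of-bag error (the response column is again assumed to be Chronic_Lung):

library(randomForest)
x <- train_lung[, setdiff(names(train_lung), "Chronic_Lung")]
# Try mtry values around the default, growing 300 trees per candidate.
tuneRF(x, train_lung$Chronic_Lung, ntreeTry = 300, stepFactor = 1.5, improve = 0.01)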

Method 2: Maximum entropy (Maxent)

Features are often added during model development to target errors. Then, for any given feature weights, we need to be able to calculate:

  • the conditional likelihood of the data,
  • the derivative of the likelihood with respect to each feature weight,
  • the expectation of each feature according to the model,

and then find the optimum feature weights.

Maxent code snippet:

library(maxent)  # maxent() comes from the maxent package
max_lung <- maxent(trainlungmat, lungtrain$Chronic_Lung)

Accuracy: 89.5%
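
For reference, a hedged sketch of how such a validation accuracy could be computed; validlungmat and validlung are hypothetical held-out data, and predict() for maxent models is assumed to expose the predicted labels in its first column:

max_pred <- predict(max_lung, validlungmat)
mean(max_pred[, 1] == validlung$Chronic_Lung)  # share of correct predictions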

Method 3: Naive Bayes

We have a set of random variables (data features) which we would like to use to predict another variable (the class).

Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes’ theorem with strong (naive) independence assumptions between the features.

library(e1071)  # naiveBayes() comes from the e1071 package
navlung <- naiveBayes(Chronic_Lung~V9+V11+V17+V19+V34+V36+V38+V46+V47+V13+V14+V51+V67+V68+V69+V263+V265+V70+V102+V106+V114+V124+V256+V88+V89+V90+V92+V99+V100+V261+V258+V155+V160+V239+V242+V246+V249+V50+V115+V116, data = train_lung)

navpredlung <- predict(navlung, train_lung)
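
The quoted accuracy is the share of matching labels; note that this predicts on the training data itself, so it is an in-sample figure:

mean(navpredlung == train_lung$Chronic_Lung)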

Accuracy: 88%

 

 

TEAM CDS GROUP 5:

Akshay Naik  akshayn2017@email.iimcal.ac.in (02)

Anirudh Kuruvada anirudhk2017@email.iimcal.ac.in (04)

Ramakrishna Ronanki ramakrishnar2017@email.iimcal.ac.in (34)

Rohit Musle rohitm2017@email.iimcal.ac.in (37)