In-Hospital Intensive Care Unit Mortality Prediction Model

Introduction to the problem statement

Use of Artificial Intelligence in the hospitals

The healthcare industry, by its very nature, touches the life of every citizen and contributed 4% of India’s GDP (Source) as of 2013. The mammoth amount of data it generates is already used for decision-making, and artificial intelligence has a big role to play in the day-to-day functioning of the industry.

Expenditure on healthcare in India – 50 Billion USD

Number of Doctors – 7 lakhs

Nearly 40% of the people admitted to ICU have to borrow money or sell assets (Source)

Can help save critical lives and reduce cost

The power of data can be harnessed to save critical lives. By predicting a potential deterioration in a patient's condition, appropriate care can be provided in time. Also, providing advanced assistance only to patients who are at risk saves the resources that would otherwise be spent on patients who are not.

Past efforts

In the past, many researchers have explored the use of patient data to predict criticality. Given the huge potential and the large volumes of data now available, it is still an active field of exploration.

Our objective was to develop a predictive model that uses the laboratory measurements of a patient in the ICU to predict potential mortality in the coming hours. Naturally, the performance metrics were the accuracy of the prediction, specifically the true positives, and the lead time of the prediction: the longer, the better. (Problem statement). In this blog we discuss different aspects of the challenge and of our analysis. We expect readers to have a basic understanding of machine learning. We had a lot to cover but limited time and space, so at certain points we have left out details for the sake of brevity.

Primary challenges

The primary challenge in developing any kind of data model for the human body is its physical and numerical vastness. It is difficult to prototype the exact problems, as high reliability is a must for such models. In the particular case of predicting mortality, there can be many causes of death, and a model can possibly cover only some of them.

Missing data

Data is available only when laboratory or vital measurements are actually taken. Moreover, each measurement comes at a monetary and health cost. For example, a blood test requires collecting blood samples from the patient's body. Hence, far less data is available than would be ideal for the model, and different measurements are not taken at the same time.

Training data

Even though mortality is an objective outcome, a worsened state that may later result in mortality is difficult to capture. We are predicting a state for which we have no direct training data. We took a conservative approach: for patients who died in the ICU, we used their worst-case scenario as the training example of a state indicative of future mortality.
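One possible reading of the worst-case approach is sketched below with pandas. It assumes, purely for illustration, that every feature has already been turned into a criticality weight where a higher number means a worse state, and that the worst case is the per-patient maximum of each weight over the ICU stay. The column names (`patient_id`, `hr_weight`, `bp_weight`) and the toy numbers are hypothetical and are not taken from the challenge data.

```python
import pandas as pd

def worst_case_rows(icu_df, died_ids, id_col="patient_id"):
    """One training row per deceased patient: the maximum weight of each feature.

    Assumption: all feature columns are criticality weights (higher = worse).
    """
    dead = icu_df[icu_df[id_col].isin(died_ids)]
    worst = dead.groupby(id_col).max()
    worst["label"] = 1  # positive class: state indicative of future mortality
    return worst

# Toy ICU data for three hypothetical patients; patients 1 and 2 died.
icu_df = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 3],
    "hr_weight":  [0, 3, 1, 1, 0],
    "bp_weight":  [1, 0, 2, 3, 0],
})
worst = worst_case_rows(icu_df, died_ids=[1, 2])
```

Taking the maximum of each feature independently can combine values that never occurred at the same moment, which is one reason to treat this as a conservative sketch rather than a faithful copy of the state a patient was actually in.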

Different features

Features have no linear relation with how good or bad a patient's health is. Nearly all features have an optimum value, and the critical zones vary from patient to patient based on their living habits. Features act in combination, which is also how doctors analyse them.

Large data

We were given details of 5990 patients in total (training and test combined), adding up to more than 6 lakh rows of data, which is a lot of data to handle! We used almost 1% of the data for preliminary testing to speed up our verification procedure.
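A minimal sketch of how such a small subset might be drawn is shown below. It samples whole patients rather than individual rows, so that each selected patient's time series stays intact; that choice, the column name `patient_id` and the toy data are assumptions made for illustration and do not describe the challenge's actual schema.

```python
import numpy as np
import pandas as pd

def sample_patients(df, frac=0.01, id_col="patient_id", seed=0):
    """Return all rows belonging to a random fraction of the patients."""
    ids = df[id_col].unique()
    n_keep = max(1, int(round(len(ids) * frac)))
    rng = np.random.default_rng(seed)  # fixed seed keeps the subset reproducible
    keep = rng.choice(ids, size=n_keep, replace=False)
    return df[df[id_col].isin(keep)]

# Toy data: 200 hypothetical patients with 3 rows each.
toy = pd.DataFrame({
    "patient_id": np.repeat(np.arange(200), 3),
    "value": np.arange(600),
})
subset = sample_patients(toy, frac=0.01)
```

With 200 toy patients and `frac=0.01`, the subset holds every row of two patients.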

Can use only the patient history data

For any patient, only his/her historical data can be used for prediction.


To decide on a strategy for the problem, we referred to various research papers in this domain that were published after a similar competition held in 2012.

Feature engineering 

Thanks to Dr. Priyanka Singh, Dr. Ram Kiran and Dr. Tejaswi for helping us develop modified features from the available data. Variables that were strongly correlated and acted together were merged. We also divided the values of each lab test/vital into ranges and assigned a weight to each range according to its criticality.
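The range-and-weight idea can be sketched with `pd.cut`, as below. The bin edges and weights for heart rate are placeholders chosen only to show the mechanics; they are not the ranges suggested by the doctors and should not be read as clinical thresholds. The names `HR_BINS`, `HR_WEIGHTS` and `criticality_weight` are hypothetical.

```python
import pandas as pd

# Hypothetical ranges and weights for one vital sign (heart rate, beats/min).
# Placeholder numbers for illustration only, not clinical thresholds.
HR_BINS = [0, 40, 60, 100, 130, float("inf")]
HR_WEIGHTS = [3, 1, 0, 1, 3]  # higher weight = more critical range

def criticality_weight(values, bins, weights):
    """Map each measurement to the weight of the range it falls in."""
    # labels=False returns the position of the matching bin; right=False makes
    # each range include its lower edge, e.g. [60, 100).
    codes = pd.cut(values, bins=bins, labels=False, right=False)
    return codes.map(dict(enumerate(weights)))

hr = pd.Series([35, 72, 110, 150])
hr_weight = criticality_weight(hr, HR_BINS, HR_WEIGHTS)
```

Here the weights are lowest in the middle range and rise on both sides, which is one way to encode a feature that has an optimum value rather than a linear relation with health.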

Missing values

More than 95% of the values were missing. Following what doctors do, if a value was missing we looked for it in the last 24 hours, then among the ICU values, and then among the values since hospital entry; otherwise a normal value was imputed.
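The fallback order described above can be sketched as a small function. It assumes one patient's measurements of a single test are held in a DataFrame with the hypothetical columns `time` (hours since admission), `value` and `in_icu`, and that the most recent measurement is taken at each step; the source does not say which value was picked when several were available.

```python
import pandas as pd

def impute_value(history, now, normal_value, window_hours=24):
    """Pick a value for one lab test at time `now` using only past measurements."""
    # Only measurements taken up to `now` are allowed (patient history only).
    past = history[(history["time"] <= now) & history["value"].notna()]
    past = past.sort_values("time")

    recent = past[past["time"] >= now - window_hours]
    if not recent.empty:
        return recent["value"].iloc[-1]   # 1. value from the last 24 hours

    icu = past[past["in_icu"]]
    if not icu.empty:
        return icu["value"].iloc[-1]      # 2. earlier value measured in the ICU

    if not past.empty:
        return past["value"].iloc[-1]     # 3. any value since hospital entry

    return normal_value                   # 4. fall back to the normal value

# Toy history: measured at hours 2 and 30, a failed reading at hour 10.
history = pd.DataFrame({
    "time":   [2.0, 10.0, 30.0],
    "value":  [4.1, None, 5.2],
    "in_icu": [False, False, True],
})
at_40 = impute_value(history, now=40, normal_value=4.0)  # recent value: 5.2
at_70 = impute_value(history, now=70, normal_value=4.0)  # ICU value: 5.2
at_28 = impute_value(history, now=28, normal_value=4.0)  # since entry: 4.1
at_1 = impute_value(history, now=1, normal_value=4.0)    # normal value: 4.0
```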

Tools Used

Pandas, numpy and sklearn, plus R for testing the quality of the variables.

Pandas offers a great variety of tools to subset/group the dataset.

Project Stages


Program Structure

For train data

For every patient in training data
{
    if patient died
        Extract modified features from non-ICU data of the current patient
        Extract modified features from ICU data of the current patient
    else
        Extract modified features from non-ICU data of the current patient
        Extract modified features from ICU data of the current patient
}

For test data

For every patient in test data
{
    if patient in ICU
        Create modified features for the current patient using his/her historical data
}

Note: Code snippets can be found in presentation. Presentation Link

Results obtained


Sensitivity had the highest weightage in the score, subject to a minimum of 0.99 specificity and 5.5 hours of median prediction time.


Specificity – Specific weight parameters were used to adjust the specificity to the required level

Sensitivity – We had to compromise on sensitivity to maintain the minimum requirement of 0.99 specificity

Median Prediction time – Calculated only for true positives
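The three quantities above can be computed along the following lines. This is a sketch under stated assumptions, not the challenge's official scoring code: each patient is summarised by hypothetical keys `died`, `alarm_hour` (hour of the first positive prediction, or `None`) and `death_hour`, and the prediction time of a true positive is assumed to be the hours between the first alarm and death.

```python
from statistics import median

def challenge_metrics(patients):
    """Return (sensitivity, specificity, median prediction time in hours)."""
    tp = fp = tn = fn = 0
    lead_times = []
    for p in patients:
        alarmed = p["alarm_hour"] is not None
        if p["died"] and alarmed:
            tp += 1
            # Prediction time is recorded only for true positives.
            lead_times.append(p["death_hour"] - p["alarm_hour"])
        elif p["died"]:
            fn += 1
        elif alarmed:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    median_lead = median(lead_times) if lead_times else float("nan")
    return sensitivity, specificity, median_lead

# Toy cohort of six hypothetical patients.
cohort = [
    {"died": True,  "alarm_hour": 10,   "death_hour": 18},
    {"died": True,  "alarm_hour": 20,   "death_hour": 24},
    {"died": True,  "alarm_hour": None, "death_hour": 30},
    {"died": False, "alarm_hour": None, "death_hour": None},
    {"died": False, "alarm_hour": 5,    "death_hour": None},
    {"died": False, "alarm_hour": None, "death_hour": None},
]
sens, spec, lead = challenge_metrics(cohort)
```

In the toy cohort two of the three deaths are flagged and one of the three survivors raises a false alarm, so sensitivity and specificity are both 2/3 and the median prediction time is 6 hours.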


For the final submission to the challenge, we combined the training and validation datasets for training and obtained improved results.



  • Use of appropriate parameter values in the random forest to get the best results
  • The functioning of different classification methodologies (Random forest, SVM and K-NN)
  • Importance of domain knowledge in the healthcare industry

  Contact information:

If you have any questions or want to discuss any aspect of the analysis, please feel free to contact any of the co-authors (Group – 6):

Manaswi Veligatla

Neeti Pokharna

Robin Singh (974 888 4997)

Saurabh Rawal