## “Picture yourself strolling through your local, open-air market… What do you see? What do you smell? What will you make for dinner tonight?”

What’s Cooking – The International Platter

The motivation for the Kaggle competition named **What’s Cooking** was to *predict the category of a dish’s cuisine given a list of its ingredients*. Some of our strongest geographic and cultural associations are tied to a region’s local foods, hence a list of ingredients is a clue to which cuisine a dish belongs. Kaggle arranged the dataset, covering 20 international cuisines, from **Yummly**.

**Yummly** is a free smartphone app and website that provides recipe recommendations personalized to the individual’s tastes, semantic recipe search, a digital recipe box, a shopping list and one-hour grocery delivery. The train dataset contains *39774* dish-ids along with their ingredients and cuisine label. The test data contains *9944* dish-ids along with their ingredients. Our task is to **predict the cuisine of a dish from the test data, given its ingredient set**. A snippet of the train data:

```json
{
  "id": 38714,
  "cuisine": "indian",
  "ingredients": [
    "ice cubes", "vanilla extract", "honey", "ground cardamom",
    "ground cinnamon", "1% low-fat milk", "nonfat yogurt", "mango"
  ]
}
```

Data Preparation

The project starts with the data preparation stage.

We have 20 different cuisines in the train and test data, namely: Greek, Southern_US, Filipino, Indian, Jamaican, Spanish, Italian, Mexican, Chinese, British, Thai, Vietnamese, Cajun_creole, Brazilian, French, Japanese, Irish, Korean, Moroccan and Russian. Italian is the most frequent cuisine in the train data.

To clean the data and get it into shape for analysis, we tried several methodologies:

- **Tokenization:** We tokenized each multi-word ingredient into a single token to preserve the patterns present in the data. For example, Red Chili Pepper appears after tokenization as Red_Chili_Pepper. This minimizes the misclassification error that comes from treating Red, Chili and Pepper as three separate tokens. The Python code for tokenization is attached *here*.
- **Case modification:** We deliberately changed all text data to lower case to avoid inconsistency.
- **Sparse term modification:** After creating the document term matrix we removed terms that occurred fewer than *3* times. This was a *heuristic* choice which we arrived at by plugging in different numbers. This method is useful when tokenization *hasn’t* been used.
- **Porter’s stemming algorithm:** This is a process for removing the commoner *morphological* and *inflexional* endings from words in English. Example: tomatoes and tomato are considered the same.
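The tokenization and lowercasing steps above can be sketched in Python as follows (a minimal illustration, not the attached project code; the sample ingredient list is hypothetical):

```python
def tokenize_ingredient(ingredient):
    """Lowercase a multi-word ingredient and join it into one underscore-separated token."""
    return "_".join(ingredient.lower().split())

ingredients = ["Red Chili Pepper", "Ground Cardamom", "1% Low-Fat Milk"]
tokens = [tokenize_ingredient(ing) for ing in ingredients]
# "Red Chili Pepper" -> "red_chili_pepper"
```

Treating each ingredient as a single token keeps "red_chili_pepper" distinct from dishes that merely contain "pepper".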

The next stage is **feature engineering**: as the number of ingredients per dish varies from *1* (minimum) to *135* (maximum) in the train data, we decided to take the *number of ingredients per dish* as a feature. Our assumption that this is an important feature can be verified from the importance plot.

Feature Engineering

After data preparation and feature engineering we tried to model a classifier based on the train data. The first part of this is creating a Document Term Matrix (DTM) for the ingredient sets. The ‘tm’ package in R creates a DTM from the corpus of ingredients. The method sets ‘*1*’ when a particular ingredient appears in a dish and ‘*0*’ otherwise, for all the ingredients present in the corpus. Sparse terms have been removed from the DTM where the frequency of those terms across all dishes is *less than 3*. As mentioned, the number of ingredients has been added as an extra feature in the DTM.
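The post builds the DTM with R’s ‘tm’ package; the same binary encoding can be sketched in plain Python (the dishes and vocabulary here are hypothetical):

```python
def build_binary_dtm(dishes, vocabulary):
    """One row per dish, one column per ingredient token: 1 if present, else 0."""
    return [[1 if ing in dish else 0 for ing in vocabulary] for dish in dishes]

dishes = [{"honey", "mango"}, {"honey", "soy_sauce"}]
vocabulary = ["honey", "mango", "soy_sauce"]
dtm = build_binary_dtm(dishes, vocabulary)
# dtm == [[1, 1, 0], [1, 0, 1]]
```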

Modelling

Based on the nature of the classification, we decided to use five models and compare the results produced by each classifier: 1. **Decision Tree (CART)**, 2. **Random Forest**, 3. **Naive Bayes Classifier**, 4. **Modification of Naive Bayes** and 5. **Extreme Gradient Boosting – XGBoost**.

**Naive Bayes Classifier:**

While applying the Naive Bayes classifier, we have assumed that the occurrences of ingredients are **not correlated**: the probability of an ingredient occurring is independent of the other ingredients present in the dish. The second assumption is that the probability of a dish occurring in a cuisine is the product of the probabilities of all the ingredients in the dish, i.e. *dishes are independent*.

A simple Bayesian classifier would be

`c* = argmax_{c ∈ C} P(c) · Π_i P(x_i | c)`

where *x* is the test ingredient list and `C` is the set of cuisines, with cardinality `K`. The dish is classified into the cuisine which gives the maximum probability. Since the cuisine label for each dish is known from the training data, we sorted the data by cuisine and took row sums within each cuisine category. Finally, to convert this count matrix into a probability matrix, we make the matrix **column stochastic**.

A sample column stochastic matrix would look like:
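The conversion itself can be sketched in Python (the counts below are hypothetical; the project’s actual code is in R). Each column of per-cuisine ingredient counts is divided by its column sum, so every column sums to 1:

```python
def column_stochastic(matrix):
    """Divide each column by its sum so that every column sums to 1."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    col_sums = [sum(matrix[r][c] for r in range(n_rows)) for c in range(n_cols)]
    return [[matrix[r][c] / col_sums[c] if col_sums[c] else 0.0
             for c in range(n_cols)] for r in range(n_rows)]

counts = [[3, 1], [1, 3]]  # rows: cuisines, columns: ingredients
probs = column_stochastic(counts)
# probs == [[0.75, 0.25], [0.25, 0.75]]; each column sums to 1
```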

**Note**: Though the unconditional probability of each cuisine can be calculated from the data, i.e. the sample proportion of each cuisine, we need to be careful since that may not reflect the *population proportion*.

For the complete procedure and derivation, please follow this *project article*.

- Prediction code for Naive Bayes classifier:

```r
for (i in 1:nrow(c_train_ingredientsDTM)) {
  ing_cons <- train_raw$ingredients[[i]]
  # keep only the ingredients that exist as columns of the probability matrix
  ing_considered <- setdiff(ing_cons,
                            setdiff(ing_cons, names(as.data.frame(Bayes_Prob_Matrix))))
  Bayes_Predict_Matrix <- as.matrix(Bayes_Prob_Matrix[, ing_considered])
  cuisine_probability <- numeric(20)
  for (j in 1:20) {  # one score per cuisine (the original snippet looped 1:5 by mistake)
    cuisine_probability[j] <- prod(Bayes_Predict_Matrix[j, ]) * Prob_cuisine[j]
  }
  t[i] <- which.max(cuisine_probability)
}
```

The complete R code for the classic Naive Bayes classifier is *here*.

**Modifications of Naïve Bayes:**

When classifying through the Naïve Bayes probability matrix, if the training-data probability of an ingredient appearing in a cuisine is zero, then the whole product is zero regardless of the other probabilities. A solution is to take the **geometric mean** of the non-zero entries in a row. However, if there are only a few non-zero entries in a row, the geometric mean becomes disproportionately high and we can get wrong classifications. To avoid this, we maintained a threshold: the number of non-zero ingredient probabilities must be greater than a particular value before we calculate the geometric mean (i.e. we require some minimum number of non-zero terms).

```r
for (j in 1:20) {
  non_zero <- Bayes_Predict_Matrix[j, ] != 0
  # geometric-mean-style score over the non-zero entries, rescaled to the full row length
  cuisine_probability[j] <-
    prod(Bayes_Predict_Matrix[j, non_zero])^(length(Bayes_Predict_Matrix[j, ]) / sum(non_zero)) *
    Prob_cuisine[j]
}
```

The complete R code for the modified Naive Bayes classifier is *here*.

Results for Naive Bayes, where different forms of the objective function were used in the experiment:

It can be seen from the above that taking a threshold on the minimum number of ingredients present in a dish before we take the geometric mean actually improves accuracy. The geometric mean and its modifications are less accurate than probability multiplication. *The modification with uniform probability gives the best accuracy*.

**CART and Random Forest:**

Since almost all the features are categorical, the CART model and Random Forest are natural choices of classifier. For space constraints, we are not discussing the basic CART model and Random Forest here. The presence of more than 4,000 features seriously impaired the predictive ability of the CART model, as it tends to overfit; we got an accuracy of **41.7%** on the test data. We then approached another classifier which consists of a set of weak learners (decision trees, in this case), each of which is randomly assigned a subset of features and a subsample to learn from. This model, Random Forest, improved the accuracy on the test data to **76.35%**. For a CART model, the structure of the tree is:

**eXtreme Gradient Boosting- XGBoost:**

The philosophy behind Random Forest encouraged us to use another model, XGBoost, which stands for eXtreme Gradient Boosting. The motivation for boosting was a procedure that combines the outputs of many “weak” classifiers to produce a powerful “committee”. Boosting sequentially applies the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers. Boosting is a way of fitting an additive expansion in a set of elementary “basis” functions, where the basis functions are the individual classifiers, in our case. The choice of weak learner for boosting is evident from the following diagram:

To carry out supervised learning using boosted trees we need to redefine ‘tree’. A regression tree is a function that maps attributes to a score: a tree can be defined by a vector of leaf scores and a leaf index mapping function. The structure of an individual tree (`q(x)`) guides a sample to a leaf, and the associated score (`w`) is assigned as that tree’s prediction for the sample. The diagram and the equation explain the above statements.

On this basis, the **prediction model** can be written as the aggregation of the prediction scores of all trees for a *sample* *x*. Particularly, for the *i*-th sample,

`ŷ_i = Σ_{k=1}^{K} f_k(x_i), f_k ∈ F`

where **K** is the number of trees, *f* is a function in the functional space and *F* is the set of all possible trees having a prediction score in each leaf (slightly different from a decision tree, which only contains decision values). Once the model is set up, we need to optimize an objective function to choose the best model for the training data. Here, we want a model with high predictive power that is also simple in nature (uses fewer features). Minimizing the loss function L(Θ) encourages predictive power, while optimizing the regularization term Ω encourages a simpler model with smaller variance in future predictions, making prediction stable. For our tree ensemble model, the objective function to minimize is as below:
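In symbols, the objective just described takes the standard form (a reconstruction consistent with the description above, with l the per-sample loss and Ω the per-tree regularization):

```latex
\text{Obj}(\Theta) \;=\; L(\Theta) + \Omega(\Theta)
\;=\; \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right)
\;+\; \sum_{k=1}^{K} \Omega\left(f_k\right)
```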

For multiclass classification, the loss function is the multiclass prediction loss. In the XGBoost model, we specifically optimize the softmax objective.

```r
xgb <- xgboost(xgbmat, max.depth = 25, eta = 0.3, nround = 500,
               objective = "multi:softmax", num_class = 20)
```

Since the optimization is over a set of functions, the usual Stochastic Gradient Descent would not work here. To solve this, an **additive** strategy is taken: starting from a constant prediction (usually 0), a new tree is added at each iteration.

The model complexity of a tree comprises the *number of leaves* in the tree and the *L2 norm of the scores on the leaves*, which ensures normalization of the leaf scores.
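Written out, this complexity term takes the standard XGBoost form (T is the number of leaves and w_j the score on leaf j):

```latex
\Omega(f) \;=\; \gamma T \;+\; \tfrac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^{2}
```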

Even intuitively the expression makes sense: as the number of leaves is minimized, the tree becomes a simpler model, and optimizing the L2 norm of the leaf scores normalizes the score on each leaf and prevents any single leaf from having a very high score. The two constants, gamma (γ) and lambda (λ), are Lagrangian multipliers and can be tuned for accuracy. The optimal score for the objective function can be calculated from the weights on the leaves and the Taylor coefficients. In this way, at each iteration we are able to choose a tree which optimizes the objective function, given what has already been optimized up to the previous iteration, which ensures better accuracy. The optimal score is the best score function for a given tree structure, and the **optimal objective reduction** measures how good a tree structure is at a particular iteration at minimizing the objective function. Since it is impossible to enumerate all trees in the function space, a greedy approach is of practical use, which ensures *an optimal split*. The components of the gain function are the score on the new left leaf, the score on the new right leaf, the score on the original leaf, and the complexity cost of introducing an additional leaf (γ). Obviously, if the gain is smaller than γ, we would do better not to add that branch, which is nothing but *pruning*. For the complete mathematical derivation, please follow the *complete project article*.
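The gain just described has a standard closed form in the XGBoost derivation, where G_L, H_L (resp. G_R, H_R) are the sums of first- and second-order gradient statistics over the samples falling into the left (resp. right) branch:

```latex
\text{Gain} \;=\; \frac{1}{2}\left[
  \frac{G_L^2}{H_L + \lambda}
+ \frac{G_R^2}{H_R + \lambda}
- \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}
\right] - \gamma
```

The three bracketed terms are the scores of the new left leaf, the new right leaf and the original leaf; a negative Gain means the split is not worth its γ cost, which is exactly the pruning rule mentioned above.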

Given the description of XGBoost, it is time to differentiate the model from Random Forest:

- In fact, Random Forest and Boosted Trees are not different in terms of model; the difference is how we train them.
- The major difference is the training objective: Boosted Trees adds new trees (additive training) that complement the ones already built. This normally gives *better accuracy with fewer trees*.
- In Random Forest the regularization factor is missing, hence if the gain from a split is greater than epsilon, where epsilon is an infinitesimally small positive number, the split will happen. In Boosted Trees, by contrast, there is control on model complexity, which reduces overfitting.

The complete R code for XGBoost is given *here*.

In line with our belief in the philosophy behind XGBoost, it performed better than Random Forest on the test data. The accuracy obtained using XGBoost is given below:

**Leaderboard rank 186 (Team name: Burn Calories) among 1046 entries (on 5th Dec, 2015)**

Analysis

Having the results from all the classifiers, we are now in a position to analyse them.

- The present data does not satisfy the basic assumption of Naïve Bayes, i.e. *independence of ingredients* (or at least it cannot be guaranteed).
- Just as milk and sour things cannot occur together, *various relationships* hold between ingredients which cannot be captured by Naïve Bayes.
- A single decision tree tends to *over-fit* the model, and with over two thousand features the expected *accuracy* of a single decision tree is *low*.
- Random Forest gives accuracy close to XGBoost, since an ensemble of weak learners predicts far better than a single decision tree. But the assumption of randomness is not always desired: there is *no control* over the complexity of the weak learners, hence no control over over-fitting. To achieve higher accuracy than Random Forest, each tree needs to be chosen optimally so that the loss function is minimized as far as possible.
- Thus XGBoost comes into play. XGBoost can be thought of as an *optimized random forest*. An advantage of boosted trees is that the algorithm works very fast on a distributed system (as the XGBoost package does).
- The idea of *reinforcement learning* is somewhat analogous in the case of XGBoost, and is not present in Random Forest classifiers.
- We tuned the parameter *eta*, which is used to prevent over-fitting by making the boosting process more conservative. A low eta makes the model more robust to over-fitting but slower to compute. In our case we were especially cautious about over-fitting, and eta = **0.1** gave the best result.
- The optimizing function XGBoost uses for multiclass classification is the *softmax objective*; this cost function is minimised through XGBoost.
- *L1 regularization* on leaf weights performs better than L2 regularization because it encourages the lower-weighted features to be dropped while modelling, making the model *simpler*.

What’s Cooking – The Indian Platter

The task at the Kaggle What’s Cooking competition inspired us to try the same experiment on Indian regional cuisines.

The Indian dataset consists of a list of *1809* popular Indian regional dishes categorized by state. The dataset was scraped from **www.vahrevahchef.com**. The website is an extremely popular resource for Indian expatriates in Europe, Australia and North America.

**Scraping the dataset – Scrapy at work:**

A web crawler systematically extracts data from multiple web pages by crawling, or accessing, them, and returns the data you need from each one. To scrape the data from vahrevahchef.com we used Scrapy, an application framework for crawling.

Basics of Scrapy:

- The *Scrapy engine* is responsible for controlling the data flow between all components of the system.
- The *Scheduler* receives requests from the engine and enqueues them for feeding to it later.
- The *Downloader* is responsible for fetching web pages and feeding them to the engine which, in turn, feeds them to the spiders.
- *Spiders* are custom classes written by Scrapy users to parse responses and extract items (aka scraped items) from them, or additional URLs (requests) to follow. Each spider is able to handle a specific domain (or group of domains).
- The *Item Pipeline* is responsible for processing the items once they have been extracted (or scraped) by the spiders. Typical tasks include cleansing, validation and persistence (like storing the item in a database).
- *Downloader middlewares* are specific hooks that sit between the Engine and the Downloader and process requests when they pass from the Engine to the Downloader, and responses that pass from the Downloader to the Engine.

The complete documentation of Scrapy is available *here*.

We used Scrapy to scrape the list of ingredients for the different Indian dishes. We wrote the spider based on XPath queries to extract the required text from the HTML content of each page. The XPath Helper extension for Google Chrome was used to record the XPath queries. A code snippet for the spider:

```python
def parse_fixtures(self, response):
    sel = response.selector
    for tr in sel.css("table.table.table-striped>tbody>tr"):
        item = CraigslistSampleItem()
        item['ingredients'] = tr.xpath('td[@class="name"]/text()').extract()
        item['quantity'] = tr.xpath('td[@class="unit"]/text()').extract()
        item['unit'] = tr.xpath('td[@class="quantity"]/text()').extract()
        yield item
```

The complete Python code for the *spider* and the *item* is attached.

**Indian Dataset at a glance:**

The histogram shows the distribution of dishes over Indian cuisines.

The dataset consists of 1809 dishes from 26 states. We divided the dataset in a proportion of 1:5 for test : train.
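The 1:5 test : train split can be sketched as follows (a generic seeded random split, not the project’s exact code):

```python
import random

def split_1_to_5(items, seed=42):
    """Shuffle and split so that test:train is roughly 1:5 (test = 1/6 of the data)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = len(shuffled) // 6
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

train, test = split_1_to_5(range(1809))
# 1809 // 6 == 301 test dishes, leaving 1508 train dishes
```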

This is the methodology we used to approach the problem:

Data Preparation

- The primal problem in the scraped data was **inconsistency** due to the use of *Hindi versions* of different ingredient names.
- *Spelling mistakes* were a hindrance to correct feature extraction.
- Use of plural forms of proper nouns, similar-meaning words and adjectives (*different POS*) added further noise to the data.
- To *deal* with the problem, we *replaced* all *Hindi* words by their *English* counterparts.
- Spelling mistakes were corrected.
- Porter’s *stemming* algorithm gave data with a uniform, consistent set of words.
- *Removing sparse terms* after creating the DTM removed most of the less frequent adjectives; all other adjectives were deliberately removed from the data.
- Removal of adjectives such as chopped, cooked and boiled from the ingredients list.

Feature Engineering

To corroborate the effectiveness of the analysis we added two features to our dataset which seemed to help in distinguishing the cuisines:

- **Number of ingredients** – We considered the number of ingredients an important feature, as it varies from 2 to 29, just as for the Kaggle data.
- **Number of spices per dish** – We created an *Indian spice dictionary* which contains all commonly used spices, and from it a *score*, i.e. the number of spices per dish. From our subjective knowledge this seems to be an important feature, as there are regions in India where the use of spices is very high.
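The spice-count feature can be sketched as a simple dictionary lookup (the spice set below is a tiny hypothetical stand-in for the project’s dictionary):

```python
SPICES = {"turmeric", "cumin", "cardamom", "clove", "cinnamon"}  # hypothetical subset

def spice_score(ingredients):
    """Count how many ingredients of a dish appear in the spice dictionary."""
    return sum(1 for ing in ingredients if ing.lower() in SPICES)

spice_score(["Turmeric", "rice", "Cumin", "onion"])  # -> 2
```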

Modelling

**Trial 1: **

**XGBoost**

Parameters: eta = 0.1, no. of rounds = 500, no. of classes = 26, Max depth = 25

Result: Accuracy of prediction = **18.99 % **

Possible reason for failure: *the data don’t even consist of 26 clusters*.

**Trial 2:**

Based on our subjective knowledge, we redefined the label for each dish, dividing all the dishes based on their geographic origin.

The dataset was clustered into 5 main categories: 1. **East India** 2. **North Eastern** 3. **North India** 4. **South India** 5. **West India**

The histogram shows the new distribution of dishes over the 5 cuisines. The whole dataset has been uploaded *here*.

**Random Forest**

Model parameters: No. of trees = 200, tree depth = 3

Model Results: Accuracy = **47.5 %**

**XGBoost**

Parameters: eta = 0.1, no. of rounds = 500, no. of classes = 5, Max depth = 25

Result: Accuracy of prediction = **48.1 %**

The complete R code for the Indian data is attached *here*.

Analysis

Due to the poor condition of the dataset we could not achieve an accuracy greater than 50%. In addition, the dataset was biased, consisting only of popular regional dishes and few truly local ones. XGBoost can produce an **importance matrix** which contains the features sorted by relative importance.

- *Gain*: the contribution of each feature to each tree is taken into account, then the average gain per feature is calculated for a view of the entire model.
- The highest percentage indicates the most *important feature* for predicting the labels used in training.

The most important features for the Indian data were the number of ingredients and the number of commonly used spices.

**Importance of features:**

**Excerpt of the prediction matrix:**

Challenges Faced

- Properly *structured data* was not available.
- Individual *knowledge bias* in the data during labeling became crucial for classification.
- *Inconsistency* was a big threat; repetitions of words in the ingredient lists made our task really difficult.
- A proper *Indian Spice Dictionary* wasn’t available to map the total number of features.
- Our subjective clustering might not be correct; the presence of *local effects* on dishes was ignored.
- Unsupervised learning could have been a way to detect the original clusters present in the data, but *an appropriate distance measure* between two dishes was unavailable.
- It would be good if the DTM could be made *weighted*, but proper measures (quantity, or some other measure which emphasizes importance) to calculate the respective weights of ingredients in a dish were absent.

Future scope

- A *proper distance measure* can be studied to measure similarity or distance between two dishes.
- Unsupervised learning or *clustering* can be done to see the number of clusters present in the data; *two-level clustering* could be effective.
- The *DTM* can be *weighted* based on the importance of the ingredients in a dish.
- Classification can go deeper, leading to specific *dish prediction*.
- It can be used for *recommendation systems* in big-chain retailers.
- The *Indian database* could be built more thoroughly to obtain organized data.

References:

- https://cran.r-project.org/web/packages/xgboost/xgboost.pdf
- http://xgboost.readthedocs.org/en/latest/model.html
- https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
- https://www.kaggle.com/c/whats-cooking/forums
- http://www.isical.ac.in/~debapriyo/teaching/datamining2014/slides/NaiveBayes.pdf
- https://www.kaggle.com/yilihome/whats-cooking/simple-xgboost-in-r

We acknowledge our instructor **Dr. Sourav Sen Gupta** for his continuous support and valuable inputs. This project was the end-semester project of **CDS Group 04**. The presentation for this project can be found *here*.

Contact

For any query about this project, please contact any of the co-authors:

Bodhisattwa Prasad Majumder, email: bodhisattwapm2017@email.iimcal.ac.in

Dattatreya Biswas, email: dattatreyab2017@email.iimcal.ac.in

Keshav Sehgal, email: keshavs2017@email.iimcal.ac.in

Rachit Tripathi, email: rachitt2017@email.iimcal.ac.in