“Picture yourself strolling through your local, open-air market… What do you see? What do you smell? What will you make for dinner tonight?”
What’s cooking – The International platter
The motivation for the Kaggle Competition named What’s Cooking was to predict the category of a dish’s cuisine given a list of its ingredients. Some of our strongest geographic and cultural associations are tied to a region’s local foods hence a list of ingredients would be a clue to which cuisine that dish belongs. Kaggle arranged the dataset for 20 international cuisines from Yummly. Yummly is a free smartphone app and website that provides recipe recommendations personalized to the individual’s tastes, semantic recipe search, a digital recipe box, shopping list and one-hour grocery delivery. The whole train dataset contains 39774 dish-ids along with their ingredients and cuisine label. The test data contains 9944 dish-ids along with their ingredients. Our task is to predict cuisine for a particular dish from test data given its ingredient set. A snippet of train data would be:
"1% low-fat milk",
The project starts with data preparation stage:
We have 20 different cuisine in train and test data namely, Greek, Southern_US, Filipino, Indian, Jamaican, Spanish, Italian, Mexican, Chinese, British, Thai, Vietnamese, Cajun_creole, Brazilian, French, Japanese. Irish, Korean, Moroccan and Russian. The presence of Italian cuisine is maximum in train data.
To start the data preparation stage we have tried several methodologies to clean the data and make it in shape to analyze.
- Tokenization: We have tokenized all the ingredients as a single word to emphasis the pattern present in data. For example: Red Chili Pepper would appear after tokenization as Red_Chili_Pepper. This would minimize misclassification error from taking Red, Chili and Pepper as three separate tokens. The python code for tokenization is attached here.
- Case modification: We deliberately changed all text data in lower case to avoid inconsistency.
- Sparse term modification: After creating document term matrix we removed terms those occurred for less than 3 times. This was a heuristic choice which we achieved plugging different numbers. This method is useful when tokenization hasn’t been used.
- Porter’s stemming algorithm: This is a process for removing the commoner morphological and inflexional endings from words in English. Example: tomatoes, tomato – will be considered same.
Next stage would be feature engineering: As the number of ingredients for dishes varies from 1 (minimum) to 135 (maximum) in train data, we decided to take number of ingredients per dish as a feature. Our assumption of this being an important feature can be verified from importance plot.
After data preparation and feature engineering we tried to model a classifier based train data. The first part of it would be creating a Document Term Matrix (DTM) for the ingredient sets. ‘tm’ package in R creates a DTM from the corpus of ingredients. The method sets ‘1’ when that particular ingredient appears for a dish and ‘0’ otherwise, for all the ingredients present in corpus. Sparse terms from DTM have been removed where frequency of those terms for all dishes are less than 3. As mentioned, number of ingredients has been added as an extra feature in DTM.
Based on the nature of classification we have decided to use five models to classify and compare the results produced by each of the classifiers: 1. Decision Tree (CART), 2. Random Forest, 3. Naive Bayes Classifier, 4. Modification of Naive Bayes and 5. Extreme Gradient Boosting – XGBoost.
While applying the Naive Bayes classifier, we have assumed here that the occurrence of ingredients is not correlated. This means that the probability of occurring of ingredient is independent of other ingredient present in the dish. The second assumption here is probability of occurring of a dish in a cuisine is product of the probabilities of all the ingredients in a dish, i.e. dishes are independent.
A simple Bayesian classifier would be
x is the test ingredient list and
C the set of cuisines with cardinality
The dish is classified in cuisine which gives the maximum probability.
Since the knowledge about the cuisine label for a particular dish is known from the training data we sorted the data according to given cuisine and took row sum in each particular category of cuisine. Finally we converted convert this matrix to probability matrix we need to make the matrix column stochastic.
A sample column stochastic matrix would look like:
- Note: Though the unconditional probability of each cuisine can be calculated from data, i.e. the sample proportion of each cuisine, but we need to be careful since that may not reflect the population proportion.
For complete procedure and derivation, please follow this project article.
- Prediction code for Naive Bayes classifier:
for( i in 1:nrow(c_train_ingredientsDTM))
for( j in 1:5)
The complete R code for classic Naive Bayes classifier is here.
- Modifications of Naïve Bayes:
When we are classifying through Naïve Bayes probability matrix, if in training data probability of an ingredient appearing in the cuisine is zero then the whole probability will be zero regardless of other probabilities. Solution can be taking the Geometric Mean of the entries in a row which are non-zero. In a situation, if there are only few non-zero entries in a row which will make geometric mean higher than above and in these situations, we can get wrong classifications. To avoid previous situation, we maintained a threshold which is defined as the total number of non-zero ingredients should be greater than particular value and then we have calculated the Geometric Mean (with some minimum number of nonzero terms).
for( j in 1:20)
non_zero_num = Bayes_Predict_Matrix[j,]!=0
The complete R code for modified Naive Bayes classifier is here.
Results for Naive Bayes where different form of objective function has been used for experiment:
It can be seen from above that the modifications of taking threshold of minimum number of ingredients present in dish to be classified in given cuisine before we take the geometric mean, actually improves the accuracy. The geometric mean and its modifications are less accurate than probability multiplication. The modification of uniform probability gives the best accuracy.
Since all most all the features are categorical, CART model and Random Forest should be the choices for classifier. For the space constraints, we are not discussing about basic CART model and Random Forest. The presence of more than 4,000 features have seriously impaired the predictive ability of CART model as it tends to overfit for test data. We have got an accuracy of 41.7% on test data. We approached to another classifier which consists of a set of weak learners (decision tree, in this case), each one of them was randomly assigned a set of features with a subsample to learn. This model, named Random Forest, improved the accuracy on test data up to 76.35%. For a CART model, the structure of tree is:
- eXtreme Gradient Boosting- XGBoost:
The philosophy behind random forest encouraged us to use another model, named XGBoost which stands for Extreme Gradient Boosting. The motivation for boosting was a procedure that combines the outputs of many “weak” classifiers to produce a powerful “committee.” The purpose of boosting is to sequentially apply the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers. Boosting is a way of fitting an additive expansion in a set of elementary “basis” functions where the basis functions are the individual classifiers, in our case. The choice of weak learner for boosting is evident from the following diagram:
To carry out the supervised learning using boosted trees we need to redefine ‘tree’. In a way, Regression tree is a function that maps the attributes to the score. A tree can be defined a vector of leaf scores and a leaf index mapping function. The structure of individual tree (
q(x)) guides a sample to a leaf and the associated score (
w) is assigned as the prediction for that sample for that tree.
The diagram and the equation explains above statements.
On the basis, The prediction model can be written as the aggregation of all the prediction score for each tree for a sample (x). Particularly for i-th sample,
where K is the number of trees, f is the function in the functional space and F is the all possible set of trees having prediction score in each leaf, slightly different from decision tree which only contains decision values. As the modelling is done, we need to optimize certain objective function to choose the best model for the training data. Here, we encourage a model to have high predictive power as well as to have a simple in nature (deals with less number of features). As we know minimizing loss function (L(Θ)) encourages predictive models as well as optimizing regularization (Ω) encourages simpler model to have smaller variance in future predictions, making prediction stable. For our tree ensemble model, we have our objective function which is to minimize as below:
For multiclass classification, the loss function is loss for multiclass prediction. In XGBoost model, we specifically optimized the softmax objective.
xgb <- xgboost(xgbmat, max.depth = 25, eta = 0.3, nround = 500, objective = "multi:softmax", num_class = 20)
Since the regression is over a set of functions, usual Stochastic Gradient Descent wouldn’t work here. To solve this, an additive strategy has been taken to add a new tree at each iteration. Starting from constant prediction (usually 0), the task is to add a new function each iteration.
The model complexity for a tree comprises of number of leaves in a tree and L2 norm of the scores on leaves which ensures normalization of leaf scores.
Even intuitively the expression is obvious since as the number of leaves is optimized (minimized) the tree will have a simpler model. For optimizing the L2 norm of the leaf scores would try to normalize the score on each leaf and will prevent having a high score on a particular leaf. Two constants, gamma (γ) and lamba (λ) are the Lagrangian multipliers and can be tuned for accuracy. The optimal score to optimize the objective function can be calculated based on the weights on the leaves and Taylor coefficients. In this way, in each iteration, we are able to choose an optimized tree which optimizes the objective function which has been already optimized partly up to previous iteration, which ensures better accuracy. The optimal score is the best score function for a given structure of tree and optimal objective reduction measures how good is a tree structure for a particular iteration so that it could minimize the objective function which is given below. Due to impossibility of enumerating all the tree from the function space, a greedy approach is of practical use which ensure an optimal split. The components of the gain function are the score on the new left leaf, the score on the new right leaf, the score on the original leaf and the complexity cost by introducing additional leaf (γ). It is obvious that, if gain is smaller than, we would better not to add that branch, which is nothing but pruning. For complete mathematical derivation, please follow this this complete project article.
Given the description of XGBoost, it is time to differentiate the model with Random Forest:
- The fact is Random Forest and Boosted Trees are not different in terms of model, the difference is how we train them.
- The major reason is in terms of training objective, Boosted Trees tries to add new trees (additive training) that compliments the already built ones. This normally gives you better accuracy with less trees.
- In Random Forest the regularization factor is missing, hence if the gain in splitting is greater than epsilon where epsilon is an infinitesimally small positive number, the split will happen. Whereas, in Boosted trees, there is control on model complexity which reduces overfitting.
The complete R code for XGBoost is given here.
As per our belief towards the philosophy behind XGBoost, it performed better than Random Forest on test data. The accuracy obtained using XGBoost is given below:
Leaderboard rank 186 (Team name: Burn Calories) among 1046 entries (on 5th Dec, 2015)
Having the result using all the classifiers, we are now in a position to analyse the results.
- The present data is not in accordance of the basic assumption of Naïve Bayes, i.e. independence of ingredients (or cannot be guaranteed).
- Like milk and sour things cannot occur together, similarly various relationship holds between various ingredients which cannot be explained by Naïve Bayes.
- A single decision tree tends to over-fit the model and having features over two thousands, the expected accuracy for a single decision tree would be low.
- Random Forest nearly gives accuracy like XGBoost since the ensemble of weak learner predict far better than a single decision tree. But the assumption of randomness is not always desired. There is no control for the complexity of the weak learners hence no control on over-fitting. To achieve higher accuracy than random forest each tree needs to be optimally chosen such that the loss function is minimized at its best.
- Thus XGBoost comes into play. XGBoost is an optimized random forest. Advantage of boosted tree is the algorithm works very fast on a distributed system (XGBoost package does).
- The idea of reinforcement learning is somewhat analogical here in case of XGBoost which is not present in Random forest classifiers.
- We tuned parameter eta which is used to prevent over-fitting by making the boosting process more conservative. Low eta value means model more robust to over-fitting but slower to compute. In our case, we are over cautious to over-fitting hence eta 0.1 gave the best result.
- Optimizing function XGBoost used to do multiclass classification is the softmax objective. Softmax objective (cost function) is minimised through XGBoost.
- L1 regularization on leaf weights performs better than L2 regularization because it encourages the lower weighted features to be dropped while modelling, making model simpler.
What’s Cooking – The Indian Platter
The task at the Kaggle What’s Cooking inspired us to try the same experiment on the Indian regional Cuisines.
The Indian Dataset consisted of a list of 1809 popular Indian regional dishes categorized in accordance with the states. the dataset was scraped from the www.vahrevahchef.com. The website is an extremely popular resource for Indian Expatriates in Europe, Australia and North America.
Scarping the dataset- Scrapy at work:
A web crawler can systematically extract data from multiple WebPages by crawling, or accessing, WebPages and returns the data you need for each one of them. To scrape the data from vahrevahchef.com we used Scrapy which is an application framework for crawling.
Basics of Scrapy:
- Scrapy engine is responsible for controlling the data flow between all components of the system
- The Scheduler receives requests from the engine and enqueues them for feeding them later
- The Downloader is responsible for fetching web pages and feeding them to the engine which, in turn, feeds them to the spiders.
- Spiders are custom classes written by Scrapy users to parse responses and extract items (aka scraped items) from them or additional URLs (requests) to follow. Each spider is able to handle a specific domain (or group of domains).
- The Item Pipeline is responsible for processing the items once they have been extracted (or scraped) by the spiders. Typical tasks include cleansing, validation and persistence (like storing the item in a database).
- Downloader middlewares are specific hooks that sit between the Engine and the Downloader and process requests when they pass from the Engine to the Downloader, and responses that pass from Downloader to the Engine.
The complete documentation of Scrapy is available here.
We used Scrapy to scrape the list of ingredients for different Indian dishes. We wrote the spider algorithm based on XPath queries to extract required text from HTML content of the page. XPath Helper extension for Google Chrome has been used to record XPath queries. Code snippet for spider:
sel = response.selector
for tr in sel.css("table.table.table-striped>tbody>tr"):
item = CraigslistSampleItem()
item['ingredients'] = tr.xpath('td[@class="name"]/text()').extract()
item['quantity'] = tr.xpath('td[@class="unit"]/text()').extract()
item['unit'] = tr.xpath('td[@class="quantity"]/text()').extract()
The complete python codes for spider and item are attached.
Indian Dataset at a glance:
The histogram shows the distribution of dishes over Indian cuisines.
The dataset consisted of a dishes from 26 states consisting of 1809 dishes. We divided the dataset in a proportion of 1:5 for test : train.
This is the methodology we used to approach the problem:
- The primal problem in the scraped data was inconsistency due to use of Hindi versions of different ingredients.
- Spelling mistakes was a hindrance for correct feature extraction.
- Use of plural form of proper nouns, similar meaning words, adjectives (different POS) were additional noise in the data.
- To deal with the problem, we replaced all Hindi words by their English counter parts.
- Spelling mistakes were corrected.
- Porter’s stemming algorithm gave data with uniform consistent set of words.
- Removing sparse terms after creating DTM removed most of the adjectives those are less frequent. Otherwise, all other adjectives were deliberately removed from the data.
- Removal of adjectives such as chopped, cooked, boiled and many others in the ingredients list.
To corroborate the effectiveness of the analysis we added two features to our dataset which seemed to help in distinguishing the cuisines. The details are as follow:
- Number of ingredients – We considered number of ingredients as an important feature as it varies from 2 to 29, similarly as we considered for Kaggle data.
- Number of spices per dish – We have created an Indian Spice dictionary which contains all possible (commonly used) spices. We created a score i.e. number of spices per dish. From our subjective knowledge, it seems to be an importance feature as there are regions in India where use of spices is too high.
Parameters: eta = 0.1, no. of rounds = 500, no. of classes = 26, Max depth = 25
Result: Accuracy of prediction = 18.99 %
Possible reason of failure: Data don’t even consist of 26 clusters.
Based on our subjective knowledge, we redefined the label for each dish. Our assumption was dividing all the dishes based on their geographic origin.
The dataset was clustered into 5 main categories: 1. East India 2. North Eastern 3. North India 4. South India 5. West India
The histogram shows new distribution of dishes over 5 cuisines. The whole dataset has been uploaded here.
Model parameters: No. of trees = 200, tree depth = 3
Model Results: Accuracy = 47.5 %
Parameters: eta = 0.1, no. of rounds = 500, no. of classes = 5, Max depth = 25
Result: Accuracy of prediction = 48.1 %
The complete R code for Indian data is attached here.
Due to poor condition of dataset we could not achieve an accuracy greater than 50%. In addition to this the dataset was a biased one consisting of only popular regional dishes not many of the local dishes. XGBoost allows to make an importance matrix which contains sorted features based on relative importance.
- Gain contribution of each feature of each tree is taken into account, then average gain per feature is calculated for a vision of the entire model
- Highest percentage means important feature to predict the label used for the training.
The most important features for the Indian data were No. of Ingredients and commonly used spices.
Importance of features:
Excerpt of the prediction matrix:
- Proper structured data was not available
- Individual knowledge biased data when labeling became crucial for classification
- Inconsistency was a big threat, repetitions of word in ingredient list made our task really difficult
- A proper Indian Spice Dictionary wasn’t available to map total number of features
- Our subjective clustering might not be correct, presence of local effect on dishes ignored
- Unsupervised learning could have been a way to detect original clusters present in data, but an appropriate distance measure between two dishes was unavailable
- It would be good if DTM could be made weighted but proper measures (quantity, or some other measure which emphasize importance) to calculate respective weight of ingredients in a dish were absent to classify for cuisine
- A proper distance measure can be studied to measure similarity or distance between two dishes
- An unsupervised learning or clustering can be done to see number of clusters present in data
- Two-level clustering could be effective
- DTM can be weighted based on importance of the ingredients in a dish
- Classification can be more deep leading specific dish prediction
- Can be used for recommendation system in big chain retailers
- Indian Database could be made more properly to have an organized data
We acknowledge our instructor Dr. Sourav Sen Gupta for his continuous support and valuable inputs. This project was the end semester project by CDS Group 04.
The presentation for this project can be found here.
For any query about this project, please contact any of the co-authors:
Bodhisattwa Prasad Majumder