Can we predict which cuisine a meal is from using its ingredients?
In the What's Cooking? Kaggle challenge, participants are asked to predict the category of a dish's cuisine given a list of ingredients. In this blog post, I'll show you a quick baseline built with scikit-learn. You can try it out at the bottom of the page!
Loading the data
The dataset consists of the ingredient lists of lots of recipes, and the corresponding cuisine:
```python
import json

# Load the training recipes: each record has a 'cuisine' label
# and a list of ingredient strings
with open('train.json') as f:
    recipes = json.load(f)
print('training set size:', len(recipes))

cuisines, ingredients = zip(*[(r['cuisine'], r['ingredients']) for r in recipes])
```
training set size: 39774
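For concreteness, each entry of train.json has the shape sketched below. This is a made-up record in the dataset's format, not an actual row:

```python
# A hypothetical record in the same shape as the entries of train.json
sample = {
    "id": 12345,
    "cuisine": "greek",
    "ingredients": ["romaine lettuce", "black olives", "feta cheese crumbles"],
}

# The fields we use for training
cuisine, ings = sample["cuisine"], sample["ingredients"]
print(cuisine, len(ings))  # greek 3
```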
Which ingredients show up most often? A quick count with `collections.Counter`:

```python
from collections import Counter

ingredient_counter = Counter(i for r in ingredients for i in r)
print('different ingredients:', len(ingredient_counter))
ingredient_counter.most_common(10)
```

different ingredients: 6714

As you can see, most recipes contain salt or olive oil. And since Italian is the most common cuisine in the training set, the easiest baseline is to predict that every recipe is Italian!
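To put the cross-validation numbers later in perspective, that majority-class baseline is just a Counter over the labels. A sketch with toy labels standing in for the real `cuisines` tuple:

```python
from collections import Counter

# Toy stand-in for the cuisines tuple loaded above
toy_cuisines = ['italian', 'mexican', 'italian', 'chinese', 'italian', 'mexican']

# Predict the most common label for every recipe and measure accuracy
majority_label, count = Counter(toy_cuisines).most_common(1)[0]
baseline_accuracy = count / len(toy_cuisines)
print(majority_label, baseline_accuracy)  # italian 0.5
```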
Build a recipe-ingredient matrix
MultiLabelBinarizer converts a list of items into a vector of ones and zeros. For example, if we had three different ingredients and a recipe only contained number 2, the vector of that recipe would be [0, 1, 0].
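Here's what that looks like on a toy ingredient list. Note that MultiLabelBinarizer sorts the classes alphabetically, so the column order follows `classes_`:

```python
from sklearn.preprocessing import MultiLabelBinarizer

toy_recipes = [['salt', 'olive oil'], ['salt', 'flour'], ['flour']]
mlb = MultiLabelBinarizer()
M = mlb.fit_transform(toy_recipes)

print(mlb.classes_)  # ['flour' 'olive oil' 'salt']
print(M)
# [[0 1 1]
#  [1 0 1]
#  [1 0 0]]
```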
```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

binarizer = MultiLabelBinarizer(sparse_output=True)
X = binarizer.fit_transform(ingredients)
y = np.array(cuisines)
```
Cross-validate the model
To see how my model performs, I check the cross-validation score over 5 folds: five times, the model is trained on a different four fifths of the training data and evaluated on the held-out fifth.
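The splitting itself can be sketched with KFold. This is a simplification: for classifiers, scikit-learn's `cross_val_score` actually uses stratified folds, which keep the cuisine proportions roughly equal in each fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features

# 5-fold split: each fold holds out a different fifth of the data
kf = KFold(n_splits=5)
fold_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in kf.split(X_toy)]
print(fold_sizes)  # [(8, 2), (8, 2), (8, 2), (8, 2), (8, 2)]
```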
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in older versions

lr = LogisticRegression()
print('cross validation accuracy:', cross_val_score(lr, X, y, cv=5, n_jobs=-1))
```
cross validation accuracy: [ 0.77354936 0.78461925 0.77991453 0.77663187 0.78482446]
Now I need to fit the binarizer on both the training and the test data, because the test data contains ingredients that never appear in the training data. Those extra ingredients become extra features, so the matrix X gets wider.
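A toy example of why this matters: a binarizer fitted only on the training recipes silently drops ingredients it has never seen (scikit-learn emits a warning about unknown classes), while refitting on the union keeps a column for every ingredient:

```python
from sklearn.preprocessing import MultiLabelBinarizer

train_toy = [['salt', 'flour']]
test_toy = [['salt', 'saffron']]  # 'saffron' never appears in training

fit_on_train = MultiLabelBinarizer().fit(train_toy)
fit_on_both = MultiLabelBinarizer().fit(train_toy + test_toy)

print(fit_on_train.transform(test_toy).shape)  # (1, 2) -- 'saffron' dropped
print(fit_on_both.transform(test_toy).shape)   # (1, 3) -- all ingredients kept
```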
My submission scored 0.78339 on the Kaggle leaderboard, right in line with the cross-validation estimate above.
```python
import csv

with open('test.json') as f:
    test_recipes = json.load(f)
test_ids, test_ingredients = zip(*[(r['id'], r['ingredients']) for r in test_recipes])

# Refit on the combined ingredient lists so test-only ingredients get columns too
# (plain list concatenation; np.hstack chokes on ragged lists of lists)
binarizer.fit(list(ingredients) + list(test_ingredients))
X = binarizer.transform(ingredients)

lr.fit(X, y)
test_cuisines = lr.predict(binarizer.transform(test_ingredients))

with open('submission.csv', 'w') as sub:
    w = csv.writer(sub)
    w.writerow(['id', 'cuisine'])
    w.writerows(zip(test_ids, test_cuisines))
```
Just for fun, here's the model for you to play around with.