What's cooking
Can we predict which cuisine a meal is from using its ingredients?
In the What’s Cooking? Kaggle challenge, participants are asked to predict the category of a dish’s cuisine given a list of ingredients. In this blog post, I’ll show you a quick baseline built with scikit-learn. You can try it out at the bottom of the page!
Loading the data
The training set is a list of recipes, each with its ingredients and the corresponding cuisine:
import json
recipes = json.load(open('train.json'))
print('training set size: %d' % len(recipes))
cuisines, ingredients = zip(*[(r['cuisine'], r['ingredients']) for r in recipes])
cuisines[0], ingredients[0]
training set size: 39774
Counting how often each ingredient occurs shows that salt and olive oil are among the most common. That’s why the easiest baseline is to predict that every recipe is Italian!
from collections import Counter
ingredient_counter = Counter(i for r in ingredients for i in r)
print('different ingredients: %d' % len(ingredient_counter))
ingredient_counter.most_common(10)
different ingredients: 6714
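The "always predict the same cuisine" baseline mentioned above can be sketched in a few lines. This is only an illustration: the helper name `majority_baseline` and the toy labels are made up here, and with the real data you would pass in the `cuisines` tuple loaded earlier.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / float(len(labels))

# Hypothetical toy labels, not taken from the dataset.
toy_cuisines = ['italian', 'mexican', 'italian', 'indian', 'italian']
label, accuracy = majority_baseline(toy_cuisines)
print(label)     # italian
print(accuracy)  # 0.6
```

Any real model should beat the accuracy this returns on the full training set, otherwise it has learned nothing beyond the class frequencies.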
Build a recipe-ingredient matrix
A MultiLabelBinarizer converts a list of items into a vector of ones and zeros. For example, if we had three different ingredients and a recipe only contained number 2, the vector of that recipe would be [0,1,0].
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
binarizer = MultiLabelBinarizer(sparse_output=True)
X = binarizer.fit_transform(ingredients)
y = np.array(cuisines)
X
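To make the encoding concrete, here is a plain-Python sketch of the same idea. It is not scikit-learn's implementation: the `binarize` helper and the toy recipes are assumptions for illustration, and it returns dense lists rather than the sparse matrix you get with `sparse_output=True`.

```python
def binarize(recipes):
    """Encode each recipe as a 0/1 vector over the sorted ingredient vocabulary."""
    vocabulary = sorted(set(i for r in recipes for i in r))
    column = dict((name, j) for j, name in enumerate(vocabulary))
    rows = []
    for r in recipes:
        row = [0] * len(vocabulary)
        for name in r:
            row[column[name]] = 1  # mark the ingredient as present
        rows.append(row)
    return vocabulary, rows

# Hypothetical toy recipes, not taken from the dataset.
toy_recipes = [['olive oil', 'salt'], ['salt'], ['soy sauce', 'salt']]
vocabulary, rows = binarize(toy_recipes)
print(vocabulary)  # ['olive oil', 'salt', 'soy sauce']
print(rows)        # [[1, 1, 0], [0, 1, 0], [0, 1, 1]]
```

The second toy recipe contains only ingredient number 2, so it becomes [0, 1, 0], just like the example in the text.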
Cross-validate the model
To see how my model performs, I check the cross-validation score on 5 folds over the data. That means the training data is split into 5 parts, and the model is trained on 4 of them and evaluated on the remaining one, once for each part.
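from sklearn.linear_model import LogisticRegression
# cross_val_score lives in sklearn.model_selection (the old sklearn.cross_validation module was removed)
from sklearn.model_selection import cross_val_score
lr = LogisticRegression()
print('cross validation accuracy: %s' % cross_val_score(lr, X, y, cv=5, n_jobs=-1))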
cross validation accuracy: [ 0.77354936 0.78461925 0.77991453 0.77663187 0.78482446]
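If the fold mechanics are unclear, the splitting can be sketched in plain Python. Note that this is a simplified, unshuffled k-fold with a made-up helper name (`k_fold_indices`); for a classifier and an integer `cv`, scikit-learn uses a stratified variant that also keeps the class proportions similar in every fold.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // n_folds + (1 if f < n_samples % n_folds else 0)
             for f in range(n_folds)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(12, n_folds=5))
for train, test in folds:
    print(test)
# [0, 1, 2]
# [3, 4, 5]
# [6, 7]
# [8, 9]
# [10, 11]
```

Each of the five accuracy values above corresponds to one such test part, scored by a model that never saw it during training.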
Submit predictions
Now I need to fit the binarizer on both the training and the test data, because the test data contains ingredients that are not in the training data. That means I have more features and the matrix X will be wider.
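A toy illustration of why the matrix gets wider: the number of columns equals the number of distinct ingredients the binarizer has seen. The `vocabulary` helper and the recipes below are hypothetical, and the sketch only counts columns rather than calling scikit-learn.

```python
def vocabulary(recipes):
    """Sorted list of the distinct ingredients in a collection of recipes."""
    return sorted(set(i for r in recipes for i in r))

# Hypothetical toy data, not taken from the dataset.
toy_train = [['salt', 'olive oil'], ['salt', 'soy sauce']]
toy_test = [['salt', 'lemongrass']]

train_only = vocabulary(toy_train)
combined = vocabulary(toy_train + toy_test)
unseen = sorted(set(combined) - set(train_only))

print(len(train_only))  # 3
print(len(combined))    # 4
print(unseen)           # ['lemongrass']
```

The extra columns are all zero for the training recipes, so the model cannot learn anything about those ingredients; fitting on both sets mainly ensures that the test recipes can be encoded at all.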
My submission got a score of 0.78339 on the Kaggle leaderboard, which lies within the range of the five cross-validation scores, so the cross-validation gave a very accurate estimate!
import csv
test_recipes = json.load(open('test.json'))
test_ids, test_ingredients = zip(*[(r['id'], r['ingredients']) for r in test_recipes])
# Fit on the ingredient lists of both sets (both are tuples, so + concatenates them)
binarizer.fit(ingredients + test_ingredients)
X = binarizer.transform(ingredients)
lr.fit(X, y)
test_cuisines = lr.predict(binarizer.transform(test_ingredients))
# Write one (id, cuisine) row per test recipe
with open('submission.csv','w') as sub:
    w = csv.writer(sub)
    w.writerow(['id', 'cuisine'])
    w.writerows(zip(test_ids, test_cuisines))
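How close the cross-validation estimate was to the leaderboard score can be checked with a little arithmetic. The numbers below are copied from this post; the variable names are just for illustration.

```python
# Fold scores and leaderboard score as reported in this post.
cv_scores = [0.77354936, 0.78461925, 0.77991453, 0.77663187, 0.78482446]
leaderboard_score = 0.78339

cv_mean = sum(cv_scores) / len(cv_scores)
gap = leaderboard_score - cv_mean

print(round(cv_mean, 4))  # 0.7799
print(round(gap, 4))      # 0.0035
print(min(cv_scores) < leaderboard_score < max(cv_scores))  # True
```

The leaderboard score is about a third of a percentage point above the mean fold accuracy and inside the spread of the individual folds.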
Interactive cuisine predictor
Just for fun, here’s the model for you to play around with.
from IPython.display import Javascript
js = 'window.ingredients = %s;' % json.dumps(list(binarizer.classes_))
js += 'window.cuisines = %s;' % json.dumps(list(lr.classes_))
js += 'window.coef = %s;' % json.dumps(lr.coef_.T.tolist())
Javascript(js)
// Index of the largest element of an array
var argmax = function(arr) { return arr.indexOf(Math.max.apply(null, arr)); };
function predict_cuisine(ingredient_string) {
    // One score per cuisine, starting at zero
    var x = numeric.rep([window.cuisines.length], 0);
    ingredient_string.split( /,\s*/ ).forEach(function(s) {
        var index = window.ingredients.indexOf(s);
        if (index != -1) {
            // Add the coefficients of each known ingredient
            numeric.addeq(x, window.coef[ index ]);
        }
    });
    if (numeric.all(numeric.eq(0, x))) {
        return '';
    } else {
        return window.cuisines[argmax(x)];
    }
}
var ingredient_list = $('<input size="100">');
var recipe_prediction = $('<h1>');
// Update on keyup, so that this.value already includes the key just typed
ingredient_list.keyup( function(){
    recipe_prediction.text( predict_cuisine(this.value) );
});
element.append('<h1>Cuisine Predictor</h1>');
element.append('<div>Type a comma-separated list of ingredients:</div>');
element.append(ingredient_list);
element.append(recipe_prediction);
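For readers who prefer Python, the scoring rule used by the widget can be sketched as follows: add up the coefficient rows of the ingredients that the model knows and pick the cuisine with the highest total. Like the JavaScript above, this sketch ignores the model's intercept. The function name `predict_cuisine_py` and the tiny toy model are made up for illustration; they are not the fitted coefficients.

```python
import re

def predict_cuisine_py(ingredient_string, ingredients, cuisines, coef):
    """Sum the coefficient rows of known ingredients; return the top cuisine."""
    scores = [0.0] * len(cuisines)
    for name in re.split(r',\s*', ingredient_string):
        if name in ingredients:
            row = coef[ingredients.index(name)]
            scores = [s + c for s, c in zip(scores, row)]
    if all(s == 0 for s in scores):
        return ''  # nothing recognised, so no prediction
    return cuisines[scores.index(max(scores))]

# Hypothetical toy model: 3 ingredients x 2 cuisines.
toy_ingredients = ['olive oil', 'salt', 'soy sauce']
toy_cuisines = ['chinese', 'italian']
toy_coef = [[-1.0, 2.0], [0.1, 0.1], [3.0, -1.5]]

print(predict_cuisine_py('salt, olive oil', toy_ingredients, toy_cuisines, toy_coef))  # italian
print(predict_cuisine_py('soy sauce, salt', toy_ingredients, toy_cuisines, toy_coef))  # chinese
print(repr(predict_cuisine_py('unicorn tears', toy_ingredients, toy_cuisines, toy_coef)))  # ''
```

With the real model, `ingredients`, `cuisines` and `coef` would correspond to `binarizer.classes_`, `lr.classes_` and the transposed `lr.coef_` exported above.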