Can we predict which cuisine a meal is from using its ingredients?

In the What’s Cooking? Kaggle challenge, participants are asked to predict a dish’s cuisine from its list of ingredients. In this blog post, I’ll show you a quick baseline built with scikit-learn. You can try it out at the bottom of the page!

The dataset consists of the ingredient lists of almost 40,000 recipes, each labeled with its cuisine:

``````
import json

# load the Kaggle training data (train.json)
recipes = json.load(open('train.json'))
print 'training set size:', len(recipes)

# split the records into parallel tuples of labels and ingredient lists
cuisines, ingredients = zip(*[(r['cuisine'], r['ingredients']) for r in recipes])
cuisines, ingredients
``````
``````
training set size: 39774
``````
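Each record is a small JSON object with an `id`, a `cuisine` label, and an `ingredients` list. Here's a quick peek at the first one (the exact recipe you see depends on the file, of course):

``````
# each training record carries an id, a cuisine label, and the ingredient list
print json.dumps(recipes[0], indent=2)
``````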

As the ingredient counts below show, salt and olive oil turn up in a huge share of recipes. That’s why the easiest baseline is to predict that every recipe is Italian! (A quick sketch of that baseline follows the counts.)

``````
from collections import Counter

# count how often each ingredient appears across all recipes
ingredient_counter = Counter(i for r in ingredients for i in r)
print 'different ingredients:', len(ingredient_counter)
ingredient_counter.most_common(10)
``````
``````
different ingredients: 6714
``````
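That baseline is cheap to evaluate from the training labels alone. A minimal sketch, reusing the `cuisines` tuple from above:

``````
# sketch: accuracy of always predicting the most common cuisine
cuisine_counter = Counter(cuisines)
top_cuisine, top_count = cuisine_counter.most_common(1)[0]
print 'most common cuisine:', top_cuisine
print 'baseline accuracy: %.3f' % (float(top_count) / len(cuisines))
``````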

# Build a recipe-ingredient matrix

A `MultiLabelBinarizer` converts a list of items into a vector of ones and zeros. For example, if there were three different ingredients in total and a recipe contained only the second one, that recipe’s vector would be `[0,1,0]`.
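To make that concrete, here's the toy case in code. Note that the binarizer sorts the item names, so the columns come out in alphabetical order:

``````
from sklearn.preprocessing import MultiLabelBinarizer

toy = [['salt', 'olive oil'], ['olive oil'], ['garlic', 'salt']]
mlb = MultiLabelBinarizer()
print mlb.fit_transform(toy)  # one row per recipe, one column per ingredient
print mlb.classes_            # columns: ['garlic' 'olive oil' 'salt']
``````

The same idea scales to the real ingredient lists, this time with sparse output: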

``````
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# sparse output keeps the ~40,000 x ~6,700 binary matrix small in memory
binarizer = MultiLabelBinarizer(sparse_output=True)
X = binarizer.fit_transform(ingredients)
y = np.array(cuisines)
X
``````

# Cross-validate the model

To see how my model performs, I check the cross-validation score on 5 folds over the data: the recipes are split into 5 parts, and the model is trained on four of them and evaluated on the held-out fifth, once for each part.

``````
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_score

# a plain logistic regression over the binary ingredient features
lr = LogisticRegression()
print 'cross validation accuracy:', cross_val_score(lr, X, y, cv=5, n_jobs=-1)
``````
``````
cross validation accuracy: [ 0.77354936  0.78461925  0.77991453  0.77663187  0.78482446]
``````
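The five folds agree closely. If you'd rather have a single summary number, the mean and spread are easy to get from the same call:

``````
scores = cross_val_score(lr, X, y, cv=5, n_jobs=-1)
print 'mean accuracy: %.4f (+/- %.4f)' % (scores.mean(), 2 * scores.std())
``````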

# Submit predictions

Now I need to fit the binarizer on both the training and the test ingredients, because the test data contains ingredients that don't occur in the training data. That gives me more features, so the matrix `X` becomes wider; the extra columns are all zero in the training rows, so they don't change what the model learns there, but they're needed to represent the test recipes.
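A tiny illustration of that widening, with made-up ingredient lists:

``````
# fitting on a larger vocabulary yields a wider binary matrix
mlb = MultiLabelBinarizer()
print len(mlb.fit([['salt'], ['pepper']]).classes_)               # 2 features
print len(mlb.fit([['salt'], ['pepper'], ['saffron']]).classes_)  # 3 features
``````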

My submission scored `0.78339` on the Kaggle leaderboard, right in line with the cross-validation estimate above!

``````
import csv

# load the test set (test.json) and keep the ids for the submission file
test_recipes = json.load(open('test.json'))
test_ids, test_ingredients = zip(*[(r['id'], r['ingredients']) for r in test_recipes])

# refit the binarizer on the combined vocabulary, then retrain on the full training set
binarizer.fit(list(ingredients) + list(test_ingredients))
X = binarizer.transform(ingredients)
lr.fit(X, y)
test_cuisines = lr.predict(binarizer.transform(test_ingredients))

with open('submission.csv', 'w') as sub:
    w = csv.writer(sub)
    w.writerow(['id', 'cuisine'])
    w.writerows(zip(test_ids, test_cuisines))
``````

# Interactive cuisine predictor

Just for fun, here’s the model for you to play around with.

``````
from IPython.display import Javascript

# export the vocabulary, the class labels, and the coefficient matrix to the browser
js = 'window.ingredients = %s;' % json.dumps(list(binarizer.classes_))
js += 'window.cuisines = %s;' % json.dumps(list(lr.classes_))
js += 'window.coef = %s;' % json.dumps(map(list, lr.coef_.T))  # one row of cuisine weights per ingredient
Javascript(js)
``````
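For reference, here's roughly the scoring those three arrays support, sketched back in Python. The helper `predict_from_coef` is hypothetical (it mirrors what the widget's JavaScript would do), and it ignores the model's intercepts since they aren't exported above:

``````
# hypothetical helper: sum each cuisine's weights over the chosen
# ingredients and return the best-scoring cuisine (intercepts ignored)
def predict_from_coef(chosen):
    index = {name: i for i, name in enumerate(binarizer.classes_)}
    scores = sum(lr.coef_.T[index[name]] for name in chosen)
    return lr.classes_[np.argmax(scores)]

print predict_from_coef(['soy sauce', 'ginger', 'garlic'])  # assumes these are in the vocabulary
``````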