Can we predict which cuisine a meal is from using its ingredients?

In the What’s Cooking? Kaggle challenge, participants are asked to predict the category of a dish’s cuisine given a list of ingredients. In this blog post, I’ll show you a quick baseline built with scikit-learn. You can try it out at the bottom of the page!

Loading the data

The dataset consists of the ingredient lists of lots of recipes, and the corresponding cuisine:

import json
recipes = json.load(open('train.json'))
print 'training set size:', len(recipes)
cuisines, ingredients = zip(*[(r['cuisine'], r['ingredients']) for r in recipes])
cuisines[0], ingredients[0]
training set size: 39774

As you can see, most recipes contain salt or olive oil. That’s why the easiest baseline is to predict that every recipe is Italian!

from collections import Counter
ingredient_counter = Counter(i for r in ingredients for i in r)
print 'different ingredients:', len(ingredient_counter)
different ingredients: 6714
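
To make the "always predict Italian" baseline concrete, here is a minimal sketch of a majority-class baseline. The list `toy_cuisines` is made up for illustration and is not the Kaggle data; on the real data you would pass `cuisines` to the `Counter` instead.

```python
from collections import Counter

# Hypothetical toy labels, not the real Kaggle data.
toy_cuisines = ['italian', 'mexican', 'italian', 'indian', 'italian']

# The majority-class baseline always predicts the most frequent label.
majority_label, majority_count = Counter(toy_cuisines).most_common(1)[0]

# Its accuracy on this data equals the share of that label.
baseline_accuracy = float(majority_count) / len(toy_cuisines)

print(majority_label)      # italian
print(baseline_accuracy)   # 0.6
```

Any real model should beat this number to be worth the effort.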

Build a recipe-ingredient matrix

A MultiLabelBinarizer converts a list of items into a vector of ones and zeros. For example, if we had three different ingredients and a recipe only contained number 2, the vector of that recipe would be [0,1,0].
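
To illustrate the idea, here is a plain-Python sketch of that conversion. It is not scikit-learn's implementation, and `toy_recipes`, `vocabulary`, `column` and `binarize` are names I made up for this example.

```python
# Hypothetical toy recipes, not taken from the dataset.
toy_recipes = [
    ['salt', 'olive oil'],
    ['soy sauce'],
    ['salt'],
]

# One column per distinct ingredient, in sorted order.
vocabulary = sorted(set(i for r in toy_recipes for i in r))
column = dict((name, idx) for idx, name in enumerate(vocabulary))

def binarize(recipe):
    """Return a 0/1 vector with a 1 in the column of every ingredient present."""
    row = [0] * len(vocabulary)
    for ingredient in recipe:
        row[column[ingredient]] = 1
    return row

matrix = [binarize(r) for r in toy_recipes]
print(vocabulary)  # ['olive oil', 'salt', 'soy sauce']
print(matrix)      # [[1, 1, 0], [0, 0, 1], [0, 1, 0]]
```

With thousands of ingredients almost every entry is zero, which is why a sparse matrix is a sensible choice for the real data.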

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
binarizer = MultiLabelBinarizer(sparse_output=True)
X = binarizer.fit_transform(ingredients)
y = np.array(cuisines)

Cross-validate the model

To see how well my model performs, I check the cross-validation score over 5 folds. That means the training data is split into five parts: the model is trained on four of them and its predictions are checked on the remaining one. This is repeated five times, so that every part is held out exactly once.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20

lr = LogisticRegression()
print 'cross validation accuracy:', cross_val_score(lr, X, y, cv=5, n_jobs=-1)
cross validation accuracy: [ 0.77354936  0.78461925  0.77991453  0.77663187  0.78482446]
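
For readers who want to see what "5 folds" means mechanically, here is a rough plain-Python sketch of how the sample indices can be split. The helper `k_fold_indices` is hypothetical and uses simple contiguous folds; as far as I know, scikit-learn uses stratified folds for classifiers when `cv` is an integer, so the actual splits differ.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, n_folds=5))
for train, test in folds:
    # Each sample lands in the held-out part exactly once.
    print(test)
```

Each of the five accuracies printed above comes from one such train/test pair.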

Submit predictions

Now I need to refit the binarizer on both the training and the test data, because the test data contains ingredients that do not occur in the training data. That gives me more features, so the matrix X becomes wider and the model has to be trained again on it.

My submission got a score of 0.78339 on the Kaggle leaderboard. That is in line with the cross-validation accuracies of roughly 0.77 to 0.78 above, so cross-validation gave a reliable estimate!

import csv
test_recipes = json.load(open('test.json'))
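test_ids, test_ingredients = zip(*[(r['id'], r['ingredients']) for r in test_recipes])
# refit the binarizer on the training and test ingredients together
binarizer.fit(ingredients + test_ingredients)
# retrain the model on the wider training matrix
X = binarizer.transform(ingredients)
lr.fit(X, y)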
test_cuisines = lr.predict(binarizer.transform(test_ingredients))
with open('submission.csv','w') as sub:
    w = csv.writer(sub)
    w.writerow(['id', 'cuisine'])
    w.writerows(zip(test_ids, test_cuisines))
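
To show why including the test ingredients widens the matrix, here is a small sketch with made-up data. `toy_train`, `toy_test` and the helper `vocabulary` are hypothetical names used only for this illustration.

```python
# Hypothetical toy data, not taken from the dataset.
toy_train = [['salt', 'olive oil'], ['soy sauce', 'salt']]
toy_test = [['salt', 'garam masala']]

def vocabulary(recipes):
    """Sorted list of the distinct ingredients in the given recipes."""
    return sorted(set(i for r in recipes for i in r))

train_vocab = vocabulary(toy_train)
full_vocab = vocabulary(toy_train + toy_test)

# Ingredients the model never sees during training.
unseen = sorted(set(full_vocab) - set(train_vocab))

print(len(train_vocab))  # 3
print(len(full_vocab))   # 4
print(unseen)            # ['garam masala']
```

Note that the extra columns are all zero in the training rows, so the model gets no training signal for them. The combined fit mainly ensures that the test recipes can be transformed with the same columns.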

Interactive cuisine predictor

Just for fun, here’s the model for you to play around with.

from IPython.display import Javascript
js = 'window.ingredients = %s;' % json.dumps(list(binarizer.classes_))
js += 'window.cuisines = %s;' % json.dumps(list(lr.classes_))
js += 'window.coef = %s;' % json.dumps(map(list, lr.coef_.T))
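
The browser-side code is not shown here, but a linear model like this can be evaluated with nothing more than the exported lists. The following Python sketch shows one way to do that, assuming the prediction is the cuisine with the highest summed weight. All `toy_*` values are invented stand-ins for `binarizer.classes_`, `lr.classes_` and `lr.coef_.T`, and the intercept is left out because it is not among the exported values above, so results may differ slightly from `lr.predict`.

```python
# Hypothetical toy values standing in for binarizer.classes_, lr.classes_ and lr.coef_.T.
toy_ingredients = ['olive oil', 'salt', 'soy sauce']
toy_cuisines = ['chinese', 'italian']
# One row per ingredient, one weight per cuisine (same layout as lr.coef_.T).
toy_coef = [
    [-1.0, 2.0],   # olive oil
    [0.1, 0.1],    # salt
    [3.0, -1.5],   # soy sauce
]

def predict_cuisine(recipe):
    """Sum the weight rows of the known ingredients and return the top-scoring cuisine."""
    scores = [0.0] * len(toy_cuisines)
    for ingredient in recipe:
        if ingredient in toy_ingredients:  # ignore unknown ingredients
            row = toy_coef[toy_ingredients.index(ingredient)]
            scores = [s + w for s, w in zip(scores, row)]
    best = max(range(len(scores)), key=lambda k: scores[k])
    return toy_cuisines[best]

print(predict_cuisine(['soy sauce', 'salt']))   # chinese
print(predict_cuisine(['olive oil', 'salt']))   # italian
```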