Data Science @ Yummly

In honor of Data Innovation Day 2014, this post highlights how we collect, understand, and leverage food and recipe data at Yummly.

Gathering Recipes

Yummly gathers recipes from across the web. While some recipes are encoded in special formats (such as the hRecipe microformat) that make crawling easier, we identify and extract many of our recipes using machine learning. Our graphical model-based approach enables accurate recipe detection and extraction from arbitrarily structured and formatted webpages. So far we have collected over one million recipes, and we’re just getting started.

recipe extraction

Our machine learning-based approach to extracting recipes correctly identifies the structure of a recipe from Smitten Kitchen.
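
To illustrate why marked-up pages are the easy case, here is a minimal sketch of reading a recipe title and ingredient lines from hRecipe markup with Python's standard library. It is not our extraction system and does not cover the machine learning path. It relies only on the `fn` and `ingredient` class names from the hRecipe microformat; the `HRecipeParser` class and the sample page are hypothetical, and the sketch assumes properly nested markup with no void elements (such as `<br>`) inside a property.

```python
from html.parser import HTMLParser


class HRecipeParser(HTMLParser):
    """Collects the text of hRecipe 'fn' (title) and 'ingredient' elements."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.ingredients = []
        self._field = None   # hRecipe property currently being read, if any
        self._depth = 0      # nesting depth inside that property's element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._field:
            # Nested tag inside a property: track depth so we know when it closes.
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for field in ("fn", "ingredient"):
            if field in classes:
                self._field, self._depth, self._buffer = field, 1, []
                break

    def handle_data(self, data):
        if self._field:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if not self._field:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self._buffer).split())
            if self._field == "fn":
                self.title = text
            else:
                self.ingredients.append(text)
            self._field = None


# Hypothetical page fragment, for illustration only.
SAMPLE_PAGE = (
    '<div class="hrecipe"><h1 class="fn">Buttermilk Pancakes</h1>'
    '<ul><li class="ingredient">2 cups <b>flour</b></li>'
    '<li class="ingredient">1 cup plus 1 tablespoon sugar</li></ul></div>'
)

parser = HRecipeParser()
parser.feed(SAMPLE_PAGE)
print(parser.title, parser.ingredients)
```

Pages without such markup offer no class names to key on, which is where a learned model of page structure becomes necessary.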

Understanding Recipes

Our recipe understanding system then derives a rich, detailed representation of the recipe from the extracted recipe text. Fundamental steps in this process include parsing ingredient lines (e.g., “1 cup plus 1 tablespoon sugar”), normalizing ingredient amounts, and mapping identified ingredients onto nodes in our extensive ingredient graph (~40k ingredient nodes, including a growing number of consumer packaged goods). In addition to metadata about each ingredient, such as compatible diets, the ingredient graph encodes relationships between ingredients: for example, that salmon is a kind of fish, which is a kind of seafood.

ingredient graph

Our ingredient line parser correctly parses a line and maps the ingredient onto the appropriate node in the ingredient graph.
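
The three steps described above (parsing a line, normalizing the amount, and placing the ingredient in a graph) can be sketched in a few lines. This is a toy illustration, not our parser: the regular expression, the `TBSP_PER_UNIT` table, the `IS_A` dictionary, and both function names are hypothetical, and the sketch assumes US volume units (1 cup = 16 tablespoons, 1 tablespoon = 3 teaspoons).

```python
import re
from fractions import Fraction

# US volume units expressed in tablespoons (assumed conversion table).
TBSP_PER_UNIT = {
    "cup": Fraction(16), "cups": Fraction(16),
    "tablespoon": Fraction(1), "tablespoons": Fraction(1),
    "teaspoon": Fraction(1, 3), "teaspoons": Fraction(1, 3),
}

AMOUNT = re.compile(r"(\d+(?:/\d+)?)\s+([a-z]+)")   # e.g. "1 cup", "1/2 teaspoon"
CONNECTOR = re.compile(r"\s+(?:plus|and)\s+")       # joins compound amounts

# Tiny stand-in for an ingredient graph: child -> parent ("is a kind of").
IS_A = {"salmon": "fish", "fish": "seafood"}


def parse_ingredient_line(line):
    """Return (total amount in tablespoons, ingredient name) for a simple line."""
    text = line.lower().strip()
    total = Fraction(0)
    pos = 0
    while True:
        match = AMOUNT.match(text, pos)
        if not match or match.group(2) not in TBSP_PER_UNIT:
            break
        total += Fraction(match.group(1)) * TBSP_PER_UNIT[match.group(2)]
        pos = match.end()
        connector = CONNECTOR.match(text, pos)
        if not connector:
            break
        pos = connector.end()
    return total, text[pos:].strip()


def ancestors(node):
    """Walk the is-a edges upward and return every broader category."""
    chain = []
    while node in IS_A:
        node = IS_A[node]
        chain.append(node)
    return chain


amount, name = parse_ingredient_line("1 cup plus 1 tablespoon sugar")
print(amount, name)           # 17 sugar
print(ancestors("salmon"))    # ['fish', 'seafood']
```

Storing amounts as exact fractions in a single base unit avoids rounding drift when amounts are summed or scaled; a real system would also need weights, counts, and ingredient-specific densities.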

This system also infers various recipe attributes with probabilistic machine learning methods, including the cuisine, course, and relevant occasions. For example, in previous posts we discussed identifying Super Bowl and Thanksgivukkah recipes. Other inferences include the nutrition information, flavor (spicy, sweet, etc.), techniques used, difficulty, and preparation time (if the source does not provide it).
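
As a rough illustration of probabilistic attribute inference, the sketch below guesses a cuisine from a recipe's ingredients using a naive Bayes classifier with add-one smoothing. Naive Bayes is chosen here only because it is compact; it is not necessarily the method we use in production. The training recipes, the `train` and `predict_proba` function names, and the cuisine labels are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled recipes: (cuisine, ingredients).
TOY_RECIPES = [
    ("italian", ["tomato", "basil", "olive oil"]),
    ("italian", ["tomato", "parmesan", "pasta"]),
    ("chinese", ["soy sauce", "ginger", "rice"]),
    ("chinese", ["soy sauce", "garlic", "scallion"]),
]


def train(recipes):
    """Count how often each cuisine and each (cuisine, ingredient) pair occurs."""
    class_counts = Counter()
    ingredient_counts = defaultdict(Counter)
    vocab = set()
    for cuisine, ingredients in recipes:
        class_counts[cuisine] += 1
        for ingredient in set(ingredients):
            ingredient_counts[cuisine][ingredient] += 1
            vocab.add(ingredient)
    return class_counts, ingredient_counts, vocab


def predict_proba(model, ingredients):
    """Return {cuisine: probability} for a list of ingredients."""
    class_counts, ingredient_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for cuisine, n in class_counts.items():
        score = math.log(n / total)  # prior
        denom = sum(ingredient_counts[cuisine].values()) + len(vocab)
        for ingredient in ingredients:
            if ingredient in vocab:  # ignore ingredients never seen in training
                count = ingredient_counts[cuisine][ingredient]
                score += math.log((count + 1) / denom)  # add-one smoothing
        log_scores[cuisine] = score
    # Normalize in log space to avoid underflow.
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(weights.values())
    return {c: w / z for c, w in weights.items()}


model = train(TOY_RECIPES)
print(predict_proba(model, ["soy sauce", "ginger"]))
```

Returning a probability rather than a hard label lets downstream systems apply an attribute only when the model is confident enough.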

Searching and Recommending Recipes

Our search and recommendation algorithms are then able to leverage these comprehensive representations to provide more relevant recipes. Implicit user feedback also plays a critical role in these systems. For example, user interactions in search allow us to infer the relevance of a recipe to a query. This feedback is incorporated into the ranking algorithm using a learning-to-rank approach. User interactions also help us determine the similarity between recipes, which is incorporated into our content-based and item-based recommendation algorithms.
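
One common way to turn user interactions into recipe similarity is cosine similarity over the sets of users who interacted with each recipe. The sketch below shows that idea under simplifying assumptions: interactions are binary, every interaction counts equally, and the `item_similarities` function and sample data are hypothetical rather than a description of our production algorithm.

```python
import math
from collections import defaultdict


def item_similarities(interactions):
    """Cosine similarity between recipes based on which users touched them.

    interactions: iterable of (user_id, recipe_id) pairs.
    Returns {(recipe_a, recipe_b): similarity} for pairs sharing at least one user,
    with recipe_a < recipe_b in sort order.
    """
    users_by_item = defaultdict(set)
    for user, item in interactions:
        users_by_item[item].add(user)

    sims = {}
    items = sorted(users_by_item)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            shared = len(users_by_item[a] & users_by_item[b])
            if shared:
                norm = math.sqrt(len(users_by_item[a]) * len(users_by_item[b]))
                sims[(a, b)] = shared / norm
    return sims


# Hypothetical interaction log, for illustration only.
SAMPLE_INTERACTIONS = [
    ("u1", "pancakes"), ("u1", "waffles"),
    ("u2", "pancakes"), ("u2", "waffles"),
    ("u3", "pancakes"), ("u3", "kale salad"),
]

for pair, score in item_similarities(SAMPLE_INTERACTIONS).items():
    print(pair, round(score, 3))
```

The all-pairs loop is quadratic in the number of recipes, so at scale one would compute similarities only for recipes that actually co-occur in some user's history.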

Food and Recipe Insights

We also study interaction data in order to learn about the food world. For example, in previous posts we discussed regional variations in holiday and Super Bowl searches. Other projects have included identifying emerging food trends in search, studying the way that user behavior changes depending on the time of day, week, or year, and quantifying factors that make a recipe more appealing.

trending vegetables

Vegetables with the largest increases in relative search volume in 2013.

pancakes searches

Relative search volume for “pancakes” at different times of day.
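
A time-of-day curve like the one for “pancakes” can be computed by dividing the number of matching searches in each hour by the total number of searches in that hour. The sketch below assumes that definition of relative search volume, which may differ from the exact normalization behind the chart; the `relative_volume_by_hour` function and the sample log are hypothetical.

```python
from collections import Counter


def relative_volume_by_hour(log, term):
    """Share of each hour's searches whose query contains `term`.

    log: iterable of (hour_of_day, query) pairs.
    Returns {hour: matching searches / all searches in that hour}.
    """
    totals = Counter()
    hits = Counter()
    for hour, query in log:
        totals[hour] += 1
        if term in query.lower():
            hits[hour] += 1
    return {hour: hits[hour] / totals[hour] for hour in sorted(totals)}


# Hypothetical search log, for illustration only.
SAMPLE_LOG = [
    (8, "pancakes"), (8, "Fluffy Pancakes"), (8, "kale salad"),
    (19, "chicken dinner"), (19, "pancakes"), (19, "pasta"), (19, "salmon"),
]

print(relative_volume_by_hour(SAMPLE_LOG, "pancakes"))
```

Normalizing by each hour's total traffic separates genuine interest in a term from the overall rise and fall of site usage during the day.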


What’s next? In the coming months we plan to deepen our understanding of food and recipes by refining and augmenting our algorithms and data, digesting an increasing number of recipes, and expanding into new food cultures (we recently launched a UK site).

Sound interesting? We are hiring!