Knowledge Graph Construction


We combine multiple sources of data into a single cohesive knowledge graph, forming linkages to relate similar concepts.


Components

Recipe data

Recipes can be derived from many sources, such as books, websites, and structured datasets. For the purposes of the publication dataset, we chose to use the collection of recipes that the authors of the Im2Recipe project gathered and used in that project.

Nutrient data

Nutrient information can be found in great quantities for a variety of foods. We chose to source our data from the USDA. To bring the data they provide into the knowledge graph, we took advantage of Semantic Data Dictionaries, an RPI project. The files used in the Semantic Data Dictionary process are available in this folder. The dictionary mapping file specifies all the linkages made to external ontologies, such as FoodOn and the Units Ontology.

We make available a sample of the FoodKG (USDA mappings) created using the Semantic Data Dictionary process.

Food knowledge

Finally, to provide some structure to the ingredients encountered in our recipes, we incorporated FoodOn. The ontology provides detailed information about the origin and preparation of foods.

Construction

Prerequisites

You will need to manually acquire the following:

Building requires Python 3.7, because some of the prerequisite packages depend on packages bundled with Python 3.7.
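As a small illustration of the version requirement, a guard like the following could be run before building. It is only a sketch: the helper name `is_supported` is hypothetical and not part of the repository, and it prints a warning rather than aborting.

```python
import sys

# Major and minor version named in the prerequisites above.
REQUIRED = (3, 7)


def is_supported(version_info=sys.version_info):
    """Return True when the given version tuple is Python 3.7.x.

    `is_supported` is a hypothetical helper used for illustration only.
    """
    return tuple(version_info[:2]) == REQUIRED


if __name__ == "__main__":
    if not is_supported():
        found = ".".join(str(part) for part in sys.version_info[:3])
        print("Warning: the build expects Python 3.7, found " + found)
```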

Execution

After cloning the repository, detailed instructions for reproduction are available under the /src directory. A broad overview follows:

  1. Acquire the data listed above
  2. Use /src/prep-scripts/ to join automatically-acquired data with the manual data
  3. Use /src/recipe-handler/ to generate a knowledge graph from the prepared data
  4. Use /src/verify to generate statistics from the result
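Before working through the steps above, it can help to confirm that the clone contains the three stage directories they mention. The sketch below assumes the directory names exactly as written in the list; the function name `missing_stage_dirs` is hypothetical, and the actual entry points inside each directory are described in the instructions under /src.

```python
from pathlib import Path

# Stage directories named in the overview, relative to the repository root.
STAGE_DIRS = ["src/prep-scripts", "src/recipe-handler", "src/verify"]


def missing_stage_dirs(repo_root):
    """Return the stage directories that are absent under repo_root.

    Hypothetical helper for illustration; it only checks that the
    directories exist and does not run any of the stages.
    """
    root = Path(repo_root)
    return [name for name in STAGE_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_stage_dirs(".")
    if missing:
        print("Missing stage directories:", ", ".join(missing))
    else:
        print("All stage directories are present.")
```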

Inputs

The following two files are required from Recipe1M:

Other public data sources (e.g., USDA, FoodOn) are downloaded automatically by the script.

Outputs

The final output comprises the serialized RDF data files that make up the FoodKG:

  • usda-links.trig (approx 4.1 million triples)
  • foodon-links.trig (approx 30 thousand triples)
  • foodkg-core.trig (approx 63 million triples)

These files can be loaded into a graph database like BlazeGraph for executing the natural language queries.
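One possible way to load an output file is to POST it to the Blazegraph SPARQL endpoint over HTTP. The sketch below uses only the Python standard library and rests on several assumptions that should be checked against your installation: the endpoint URL is the usual default for a local Blazegraph instance with the `kb` namespace, and `application/x-trig` is assumed to be the content type Blazegraph expects for TriG. The function names are hypothetical.

```python
import urllib.request

# Assumption: default endpoint of a local Blazegraph instance, namespace "kb".
DEFAULT_ENDPOINT = "http://localhost:9999/blazegraph/namespace/kb/sparql"
# Assumption: content type that Blazegraph accepts for TriG data.
TRIG_MIME = "application/x-trig"


def build_load_request(trig_bytes, endpoint=DEFAULT_ENDPOINT):
    """Build (but do not send) a POST request carrying TriG data."""
    return urllib.request.Request(
        endpoint,
        data=trig_bytes,
        headers={"Content-Type": TRIG_MIME},
        method="POST",
    )


def load_file(path, endpoint=DEFAULT_ENDPOINT):
    """Read a .trig file into memory and send it to the endpoint.

    Returns the HTTP status code of the response.
    """
    with open(path, "rb") as handle:
        request = build_load_request(handle.read(), endpoint)
    with urllib.request.urlopen(request) as response:
        return response.status
```

With a server running, `load_file("foodon-links.trig")` would attempt the upload. This approach reads the whole file into memory, so it is better suited to the smaller files; for foodkg-core.trig a bulk-loading tool is likely more appropriate.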