Data & Methods

Ford, Morgan edited this page Nov 26, 2021 · 11 revisions

This page describes the data and methods behind the App.

Database structure

The data used for the App is stored in four databases: disease2gene, genedata, studydata, and nutrient_info.

Disease2gene

The disease2gene database consists of the key driver genes chosen for each disease and their goal expression. The data used to create this database was provided to the team by Dr. Crawford in two PDFs: the cancer driver genes can be found here, and the driver genes for other conditions can be found here.

The disease2gene database maps each disease to the genes that affect it. It is organized as follows:

Col name      Data type
Disease       string
Gene          string
Expression    factor("up", "down", "neutral")

Genedata

The genedata database holds the key results of each study: which nutrient was tested and how it changed gene expression. The data was collected from a number of studies gathered by Dr. Crawford, as well as from nutrigenomedb.org.

The studies provided by Dr. Crawford came in many different formats (Word documents, PDFs, and Excel spreadsheets) that had to be cleaned and unified into one database. While most of the data provided either log2 fold change or fold change, some did not. This will be discussed in further detail below.

Further, the gene names had to be unified. This is discussed here.

Log2fc is the log base 2 of the fold change, where the fold change is the factor by which the gene expression changed. For example, a gene whose expression doubled has a fold change of 2 and a Log2fc of 1. Expression is determined by a rule that is described in the studydata database.
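For studies that report a plain fold change, the conversion to Log2fc can be sketched as below. This is a minimal illustration rather than the team's actual cleaning code: the function name `log2_fold_change` is hypothetical, and the fold change is assumed to be a positive ratio of treated to control expression.

```python
import math


def log2_fold_change(fold_change: float) -> float:
    """Convert a fold change (assumed: treated / control ratio) to log2 fold change."""
    if fold_change <= 0:
        # A ratio of expression levels cannot be zero or negative.
        raise ValueError("fold change must be positive")
    return math.log2(fold_change)


print(log2_fold_change(2.0))  # expression doubled -> 1.0
print(log2_fold_change(0.5))  # expression halved  -> -1.0
print(log2_fold_change(1.0))  # no change          -> 0.0
```

On this scale, increases and decreases are symmetric around zero, which is why the up and down thresholds can share the same magnitude.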

The genedata database describes genes and how they were affected in each study.

Col name      Data type
Gene          string
P.value       numeric
Log2fc        numeric
Expression    factor("up", "down", "neutral")
Study         string
Nutrient      string
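One row of the genedata table could be modeled as shown below. This is only a sketch in Python, not the App's storage code: the class name `GeneRecord`, the snake_case field names, and the example values are hypothetical placeholders, and P.value is assumed to always be present.

```python
from dataclasses import dataclass
from typing import Literal

# The three levels of the Expression factor.
Expression = Literal["up", "down", "neutral"]
EXPRESSION_LEVELS = ("up", "down", "neutral")


@dataclass(frozen=True)
class GeneRecord:
    """One genedata row: how a gene responded to a nutrient in one study."""
    gene: str
    p_value: float
    log2fc: float
    expression: Expression
    study: str      # short study name, matching Study in studydata
    nutrient: str

    def __post_init__(self) -> None:
        # Type hints are not enforced at runtime, so check the factor level here.
        if self.expression not in EXPRESSION_LEVELS:
            raise ValueError(f"unknown expression level: {self.expression!r}")


# Placeholder values for illustration only; not taken from any real study.
row = GeneRecord(
    gene="GENE_A",
    p_value=0.01,
    log2fc=0.5,
    expression="up",
    study="Example_2020",
    nutrient="example nutrient",
)
print(row.gene, row.expression)
```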

Studydata

The studydata database contains more details about each study used in the genedata database. It stores information about the study that is shown to the user and used to determine the quality of the data. Ranking represents the quality of a given study. How the ranking was determined can be found here. Study is the short name of the study used in the genedata database, while Study.name is the full title of the study. Description is a short briefing with any important information about the study, while Summary is a longer summary of the study. Geo is the link to the GEO record for a study, if given. This is most common for the nutrigenomedb data. Link is a bit.ly link to the full paper.

UDrule stores the rule that was used to determine if a gene's expression was up, down, or neutral. The main rule used to determine expression is:

  • Up when log2fc >= 0.263
  • Down when log2fc <= -0.263
  • Neutral if P.value > 0.05, or if neither threshold above is met
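The main rule can be sketched as a small function. This is an illustration under stated assumptions, not the App's code: the function name `classify_expression` is hypothetical, the P.value check is assumed to take precedence over the Log2fc thresholds, and a missing P.value is assumed to skip that check.

```python
from typing import Optional

LOG2FC_THRESHOLD = 0.263  # threshold from the main rule
P_VALUE_CUTOFF = 0.05     # P.values above this are treated as neutral


def classify_expression(log2fc: float, p_value: Optional[float] = None) -> str:
    """Return "up", "down", or "neutral" for one gene under the main rule."""
    # Assumption: a non-significant result is neutral regardless of its Log2fc.
    if p_value is not None and p_value > P_VALUE_CUTOFF:
        return "neutral"
    if log2fc >= LOG2FC_THRESHOLD:
        return "up"
    if log2fc <= -LOG2FC_THRESHOLD:
        return "down"
    # Significant, but the change is too small to count as up or down.
    return "neutral"


print(classify_expression(0.5, 0.01))   # up
print(classify_expression(-0.3, 0.01))  # down
print(classify_expression(0.1, 0.01))   # neutral
print(classify_expression(1.0, 0.2))    # neutral
```

For reference, 0.263 is approximately log2(1.2), so the thresholds correspond to roughly a 20% increase or a 17% decrease in expression. The page does not state how the threshold was chosen.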

Hardman_2019 used a different rule:

  • "Ratio of expression between surgery and biopsy"

The studydata database describes each study used, the food it studied and the ranking of the study.

Col name      Data type
Study         string
Nutrient      string
Description   string
Ranking       integer
Study.name    string
Summary       string
Geo           string
Link          string
UDrule        string

Nutrient_info

Nutrient_info is a database that keeps track of information about the nutrient that is shown to the provider on the main page, the "What Should I Eat?" page.

There are six categories:

  • whole foods
  • extracts
  • bacteria
  • organic compounds
  • chemical compounds
  • other

Description is a brief description of the nutrient that also contains any notes or warnings, such as "high in sodium". Link is a link to a webpage with more information about the nutrient. Img.link is a link to a small image, such as an emoji, to use next to the name of the nutrient.

The Nutrient_info database describes the nutrient, categorizes it, and stores a link to more information and an image.

Col name      Data type
Nutrient      string
Category      string
Description   string
Link          string
Img.link      string

Gene Name Cleaning

Ranking system