Data & Methods
This page describes the data and methods behind the App.
Database structure
The data used for the App is stored in four databases: disease2gene, genedata, studydata, and nutrient_info.
Disease2gene
The disease2gene database consists of the key driver genes chosen for each disease and their goal expression. The data used to create this database was provided to the team by Dr. Crawford in two PDFs: the cancer driver genes can be found here, and the driver genes for other conditions can be found here.
The disease2gene database maps each disease to the genes that affect it. It is organized as follows:
Col name: | Disease | Gene | Expression |
---|---|---|---|
Data type: | string | string | factor("up", "down", "neutral") |
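As a rough illustration of this layout, the table could be represented in R as a data frame like the one below. The rows are made-up placeholders, not entries from Dr. Crawford's lists, and the factor levels are assumed to be exactly the three values listed above.

```r
# Hypothetical rows, for illustration only; not real driver-gene data.
disease2gene_example <- data.frame(
  Disease = c("disease_A", "disease_A", "disease_B"),
  Gene = c("GENE1", "GENE2", "GENE1"),
  Expression = factor(
    c("up", "down", "neutral"),
    levels = c("up", "down", "neutral")
  ),
  stringsAsFactors = FALSE
)

# Inspect the column types: two character columns and one factor.
str(disease2gene_example)
```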
Genedata
The genedata database holds the key findings from each study: which nutrient was tested and how the expression of each gene changed. The data was collected from a number of studies gathered by Dr. Crawford, as well as from nutrigenomedb.org.
For the studies provided by Dr. Crawford, the data arrived in many different formats (Word documents, PDFs, and Excel spreadsheets) and had to be cleaned and unified into one database. While most of the studies reported either log2 fold change or fold change, some did not. This will be discussed in further detail below.
Further, the gene names had to be unified. This is discussed here.
Log2fc is the log base 2 of the fold change, where the fold change is the factor by which the gene expression changed. The Expression value is determined by a rule that is stored in the studydata database.
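To make the relationship between fold change and Log2fc concrete, here is a small sketch. It assumes fold change is the ratio of treated to control expression, and the values are invented for illustration.

```r
# Fold change = treated expression / control expression (assumed definition).
fold_change <- c(2, 1, 0.5, 1.2)

# Log2 fold change: positive means increased, negative means decreased,
# and zero means no change.
log2fc <- log2(fold_change)

round(log2fc, 3)
# 1.000  0.000 -1.000  0.263
```

A fold change of 1.2 gives a Log2fc of about 0.263, so the 0.263 threshold in the main expression rule under Studydata appears to correspond to a 20% increase in expression.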
The genedata database describes genes and how they were affected in each study.
Col name: | Gene | P.value | Log2fc | Expression | Study | Nutrient |
---|---|---|---|---|---|---|
Data type: | string | numeric | numeric | factor("up", "down", "neutral") | string | string |
Studydata
The studydata database contains more details about each study used in the genedata database. It stores information about the study that is shown to the user and used to judge the quality of the data. Ranking represents the quality of a given study; how the ranking was determined can be found here. Study is the short name of the study used in the genedata database, while Study.name is the full title of the study. Description is a short briefing with any important information about the study, while Summary is a longer summary of the study. Geo is the link to the GEO entry for a study, if given; this is most common for the nutrigenomedb data. Link is a bit.ly link to the full paper.
UDrule stores the rule that was used to determine if a gene's expression was up, down, or neutral. The main rule used to determine expression is:
- Up when log2fc >= 0.263
- Down when log2fc <= -0.263
- Neutral if P.value > 0.05, or in all other cases
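The main rule can be sketched as a small R function. The order of the checks is an assumption: the list does not say whether the P.value test or the threshold test takes precedence, so this sketch treats any gene with P.value > 0.05 as neutral first. The function name `classify_expression` and its arguments are hypothetical.

```r
# Hypothetical helper; assumes the P.value check takes precedence
# over the log2fc thresholds.
classify_expression <- function(log2fc, p_value,
                                threshold = 0.263, alpha = 0.05) {
  expression <- ifelse(
    p_value > alpha, "neutral",
    ifelse(log2fc >= threshold, "up",
      ifelse(log2fc <= -threshold, "down", "neutral")
    )
  )
  # Use the same three levels as the Expression column.
  factor(expression, levels = c("up", "down", "neutral"))
}

# Invented values: a clear increase, a clear decrease, a change that is
# too small, and a large change that is not significant.
classify_expression(
  log2fc = c(1.0, -0.5, 0.1, 2.0),
  p_value = c(0.01, 0.03, 0.01, 0.20)
)
```

With these inputs the four genes are classified as up, down, neutral, and neutral.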
Hardman_2019 used a different rule:
- "Ratio of expression between surgery and biopsy"
The studydata database describes each study used, the food it studied and the ranking of the study.
Col name: | Study | Nutrient | Description | Ranking | Study.name | Summary | Geo | Link | UDrule |
---|---|---|---|---|---|---|---|---|---|
Data type: | string | string | string | integer | string | string | string | string | string |
Nutrient_info
Nutrient_info is a database that keeps track of information about the nutrient that is shown to the provider on the main page, the "What Should I Eat?" page.
There are six categories:
- whole foods
- extracts
- bacteria
- organic compounds
- chemical compounds
- other
The Description is a brief description of the nutrient that also contains any notes or warnings for the nutrient, such as "high in sodium". The Link is a link to a webpage with more information about the nutrient. The Img.link is a link to a small image, such as an emoji, to use next to the name of the nutrient.
The Nutrient_info database describes the nutrient, categorizes it, and stores a link to more information and an image.
Col name: | Nutrient | Category | Description | Link | Img.link |
---|---|---|---|---|---|
Data type: | string | string | string | string | string |
Gene Name Cleaning
The following will be a description of the method used to clean gene names in the disease2gene database and the genedata database.
Create the geneNames Database
Update Gene Names
Both the disease2gene database and the genedata database are updated by joining them with the geneNames database on the Gene column and replacing each gene name in that column with the corresponding geneNames$approvedName. For example, for the disease2gene database:
newDriverGene <- disease2gene %>% inner_join(geneNames, by=c("Gene"))
newDriverGene$Gene <- newDriverGene$approvedName
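A toy version of this update is sketched below. The `geneNames` columns (`Gene` for the name as it appears in the source data, `approvedName` for the unified name) are inferred from the snippet above, and the gene names are placeholders. One thing worth noting is that `inner_join` drops any row whose `Gene` has no match in `geneNames`, so unmatched genes may need to be checked separately, for example with `anti_join`.

```r
library(dplyr)

# Placeholder data; the real tables come from the App's databases.
disease2gene <- data.frame(
  Disease = c("disease_A", "disease_A", "disease_B"),
  Gene = c("oldname1", "GENE2", "unknown_gene"),
  Expression = c("up", "down", "up"),
  stringsAsFactors = FALSE
)
geneNames <- data.frame(
  Gene = c("oldname1", "GENE2"),
  approvedName = c("GENE1", "GENE2"),
  stringsAsFactors = FALSE
)

# Join on Gene, then overwrite Gene with the approved name.
newDriverGene <- disease2gene %>%
  inner_join(geneNames, by = "Gene") %>%
  mutate(Gene = approvedName) %>%
  select(-approvedName)

# Rows that the inner join drops because their name is not in geneNames.
unmatched <- anti_join(disease2gene, geneNames, by = "Gene")
```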
Ranking system
Methods
The following is a detailed description of the webpages and how they are computed. At the top of the page, the user selects a condition or disease to search for. This is saved as input$disease. The key risk genes that will be searched for are saved in target_genes, a subset of disease2gene. The matches found for the key risk genes in the study pool are saved in matching, which is computed as follows:
matching <- inner_join(target_genes , genedata)
matching is of the same form as genedata.
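As a sketch of this step with toy data: when `by` is not given, `inner_join` joins on every column the two tables share. Assuming `target_genes` keeps only the `Gene` and `Expression` columns (an assumption made here so that the result has the same columns as genedata), a study row matches only when it reports the same gene changing in the same direction as the goal expression.

```r
library(dplyr)

# Placeholder data, not real study results.
target_genes <- data.frame(
  Gene = c("GENE1", "GENE2"),
  Expression = c("up", "down"),
  stringsAsFactors = FALSE
)
genedata <- data.frame(
  Gene = c("GENE1", "GENE1", "GENE2", "GENE3"),
  P.value = c(0.01, 0.02, 0.04, 0.01),
  Log2fc = c(0.8, -0.6, -0.4, 1.1),
  Expression = c("up", "down", "down", "up"),
  Study = c("study_1", "study_2", "study_1", "study_3"),
  Nutrient = c("nutrient_x", "nutrient_y", "nutrient_x", "nutrient_z"),
  stringsAsFactors = FALSE
)

# The explicit `by` spells out the shared columns that the join uses.
matching <- inner_join(target_genes, genedata, by = c("Gene", "Expression"))
```

Here only the two `study_1` rows survive: `study_2` moves GENE1 in the wrong direction, and GENE3 is not a target gene.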
What Should I Eat?
At the top of the page, the user is shown the unique list of nutrients found for them in matching, sorted by category. They can then choose between Bubble View and Table View to view their recommendations.
Table View
Table view is where the user can view their recommendations in text form. It returns a table of the form:
Nutrient | Ranking | Category | Description | Link |
---|---|---|---|---|
In order to add the information about the nutrients, foods_complete is formed, which is computed as:
foods_complete <- inner_join(foods_info, nutrient_info)
foods_complete_link is what is returned in the table view; it mutates the Link column to turn the plain-text URL into a clickable link.
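One way this mutation could look is sketched below. The exact HTML used by the App is not shown on this page, so the anchor markup and the `target="_blank"` attribute are assumptions, and the data is a placeholder.

```r
library(dplyr)

# Placeholder rows standing in for foods_complete.
foods_complete <- data.frame(
  Nutrient = c("nutrient_x", "nutrient_y"),
  Link = c("https://example.org/x", "https://example.org/y"),
  stringsAsFactors = FALSE
)

# Wrap each URL in an anchor tag so it renders as a clickable link.
foods_complete_link <- foods_complete %>%
  mutate(Link = paste0('<a href="', Link, '" target="_blank">', Link, "</a>"))
```

For the links to be clickable, the table output has to render the column as HTML rather than escaping it.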
Bubble View
Bubble view is where the user can view their results as a categorized packed bubble chart. The highcharter package is used, which makes Highcharts plotting objects available in R.
The tooltip for each object contains the name of the nutrient followed by a small image such as an emoji, the ranking of the nutrient, and a brief description of the nutrient. It is added to foods_complete as a column called text, and this new object is stored as foods_info_text.
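A minimal sketch of building the text column follows. The HTML layout (bold name, image height, line breaks) is an assumption, since only the tooltip's contents are described here, and the row is a placeholder.

```r
library(dplyr)

# Placeholder row standing in for foods_complete.
foods_complete <- data.frame(
  Nutrient = "nutrient_x",
  Ranking = 3,
  Description = "A short description.",
  Img.link = "https://example.org/x.png",
  stringsAsFactors = FALSE
)

# Build one HTML string per nutrient: name, small image, ranking, description.
foods_info_text <- foods_complete %>%
  mutate(text = paste0(
    "<b>", Nutrient, "</b> ",
    '<img src="', Img.link, '" height="16"><br>',
    "Ranking: ", Ranking, "<br>",
    Description
  ))
```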
Highchart plots are plotted using series, and empty series cannot be plotted. As such, foods_info_text is sorted into the six categories as separate objects. For each category, a new subset is created, as well as a Boolean object that stores whether that subset is non-empty. For example, for whole foods:
foods_info_wf <- subset(foods_info_text, Category == "whole food")
check_wf <- nrow(foods_info_wf) != 0
This is also used at the top of the page to only print categories of nutrients that are not empty.
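The same pattern for all six categories could be written more compactly with `split()`. This is an alternative sketch rather than the App's code, and the category labels other than "whole food" are guesses at the stored values.

```r
# Category labels are assumed; only "whole food" appears in the snippet above.
categories <- c("whole food", "extract", "bacteria",
                "organic compound", "chemical compound", "other")

# Placeholder rows standing in for foods_info_text.
foods_info_text <- data.frame(
  Nutrient = c("nutrient_x", "nutrient_y", "nutrient_z"),
  Category = c("whole food", "extract", "whole food"),
  stringsAsFactors = FALSE
)

# Fixing the factor levels keeps empty categories in the split result
# as zero-row data frames.
by_category <- split(
  foods_info_text,
  factor(foods_info_text$Category, levels = categories)
)

# TRUE for categories with at least one nutrient; only these become series.
has_rows <- vapply(by_category, nrow, integer(1)) > 0
non_empty <- names(has_rows)[has_rows]
```

Looping over `non_empty` would then add one series per populated category and print only those categories at the top of the page.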
The packed bubble highchart object is created, and then each category of food is added as a series if it is not empty. The tooltip is added as described above, as well as the click event leading to a new window being opened with more information about the nutrient. In other words, when the bubble is hovered over it will show the tooltip described and when it is clicked it will lead to a new window with information about the food.
The bubbles are sorted by category and the size of each individual bubble depends on the ranking of the nutrient. The user may choose to hide or show a category by clicking on the legend.
Why Should I Eat It?
This page shows more details about the genes selected for the chosen condition and the matches found in the pool of studies. At the top of the page, the target_genes are shown to the user. matching_genes is the subset of matching that shows the Gene, Expression, and Nutrient columns. matching_genes_icons turns the Expression into either an up or down arrow HTML object depending on the Expression. matching_genes_icons is given to the user as a table of the form:
Gene | Expression | Nutrient |
---|---|---|
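The conversion to arrows could be sketched as follows. The HTML entities used for the arrows are an assumption, as is leaving any other value (such as "neutral") as plain text; the rows are placeholders.

```r
library(dplyr)

# Placeholder rows standing in for matching_genes.
matching_genes <- data.frame(
  Gene = c("GENE1", "GENE2"),
  Expression = c("up", "down"),
  Nutrient = c("nutrient_x", "nutrient_x"),
  stringsAsFactors = FALSE
)

# Replace the direction with an HTML arrow entity; other values stay as text.
matching_genes_icons <- matching_genes %>%
  mutate(Expression = case_when(
    Expression == "up" ~ "&uarr;",
    Expression == "down" ~ "&darr;",
    TRUE ~ as.character(Expression)
  ))
```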
Which Studies Say I Should Eat It?
This table shows the user more information about the studies referenced to create the recommendations from the Eat4Genes database. study_info is the information about the studies that the matches were found in. It is computed as:
study_info <- inner_join(matching , studydata)
From there, the links are made into clickable links and the columns are selected, using the same methods as already described.