
Data & Methods

Ford, Morgan edited this page May 11, 2022 · 11 revisions

This page describes the data and methods behind the App.

Database structure

The data used for the App is stored in four databases: disease2gene, genedata, studydata, and nutrient_info.

Disease2gene

The disease2gene database consists of the key driver genes chosen for each disease and their goal expression. The data used to create this database was provided to the team by Dr. Crawford in two PDFs: the cancer driver genes can be found here, and the driver genes for other conditions can be found here.

The disease2gene database maps each disease to the genes that affect it. It is organized as follows:

Col name: Category | Disease | Gene | Expression
Data type: string | string | string | factor("up", "down", "neutral")

Genedata

The genedata database holds the key findings from each study: which nutrient was tested and how it changed gene expression. The data for this database was collected from a number of studies gathered by Dr. Crawford, as well as from nutrigenomedb.org.

The studies provided by Dr. Crawford came in many different formats (Word documents, PDFs, and Excel spreadsheets) that had to be cleaned and unified into one database. While most of the data provided either log2 fold change or fold change, some did not. This will be discussed in further detail below.

Further, the gene names had to be unified. This is discussed here.

The Log2fc is the base-2 logarithm of the fold change, where the fold change is the factor by which the gene expression changed. The expression is determined based on a rule that is described in the studydata database.
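As a minimal sketch of the relationship between fold change and Log2fc, the conversion can be written in one line of R. The helper name fold_change_to_log2fc is hypothetical and is not taken from the App's code.

```r
# Hypothetical helper (not from the App): convert a fold change,
# i.e. treated expression divided by control expression, to log2 fold change.
fold_change_to_log2fc <- function(fold_change) {
  log2(fold_change)
}

fold_change_to_log2fc(2)    # expression doubled: 1
fold_change_to_log2fc(0.5)  # expression halved: -1
fold_change_to_log2fc(1)    # no change: 0
```

On this scale, increases and decreases of the same factor are symmetric around zero. A Log2fc of 0.263 corresponds to a fold change of roughly 1.2.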

The genedata database describes genes and how they were affected in each study.

Col name: Gene | P.value | Log2fc | Expression | Study | Nutrient | Nutrigenome
Data type: string | numeric | numeric | factor("up", "down", "neutral") | string | string | string

Studydata

The studydata database contains more details about each study used in the genedata database. It stores information about the study that is shown to the user and used to determine the quality of the data. The ranking represents the quality of a given study; how the ranking was determined can be found here. Study is the short name of the study used in the genedata database, while Study.name is the full title of the study. The description is a short briefing with any important information about the study, while Summary is a longer summary of the study. Geo is the link to the Geo record for a study, if given; this is most common for the nutrigenomedb data. Link is a bit.ly link to the full paper.

UDrule stores the rule that was used to determine if a gene's expression was up, down, or neutral. The main rule used to determine expression is:

  • Up when log2fc >= 0.263
  • Down when log2fc <= -0.263
  • Neutral when P.value > 0.05, or in any other case
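The main rule can be sketched as a small vectorized R function. This is an illustration, not the App's code: the function name classify_expression is hypothetical, and it assumes that the P.value check takes precedence over the Log2fc thresholds, which the rule above implies but does not state outright.

```r
# Hypothetical sketch of the main UDrule (names are not from the App).
# Assumption: a P.value above 0.05 makes a gene neutral regardless of Log2fc.
classify_expression <- function(log2fc, p_value, threshold = 0.263, alpha = 0.05) {
  ifelse(p_value > alpha, "neutral",
    ifelse(log2fc >= threshold, "up",
      ifelse(log2fc <= -threshold, "down", "neutral")))
}

# Four example genes: clear up, clear down, too small a change, not significant
expression <- classify_expression(
  log2fc  = c(0.5, -0.5, 0.1, 0.5),
  p_value = c(0.01, 0.01, 0.01, 0.20)
)
factor(expression, levels = c("up", "down", "neutral"))
```

Wrapping the result in a factor with the three levels matches the data type of the Expression column described for genedata.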

Hardman_2019 used a different rule:

  • "Ratio of expression between surgery and biopsy"

The studydata database describes each study used, the food it studied and the ranking of the study.

Col name: Study | Nutrient | Study Name | Summary | Geo | Link | VitViv | Consum. | Type | Subject | Conc. | Sample Size | Ranking
Data type: factor | factor | character | character | character | character | factor | factor | factor | factor | factor | factor | numeric

Nutrient_info

Nutrient_info is a database that keeps track of information about the nutrient that is shown to the provider on the main page, the "What Should I Eat?" page.

There are five categories:

  • whole foods
  • whole food extracts
  • phytonutrients

The Description is a brief summary of the nutrient that also contains any notes or warnings, such as "high in sodium". The Link is a link to a webpage with more information about the nutrient. The Img.link is a link to a small image, such as an emoji, to use next to the name of the nutrient.

The Nutrient_info database describes the nutrient, categorizes it, and stores a link to more information and an image.

Col name: Nutrient | Category | Description | Link | Img.link
Data type: string | string | string | string | string

Gene Name Cleaning

This section describes the method used to clean gene names in the disease2gene database and the genedata database.

Create the geneNames Database

The geneNames database was created because the studies our app uses follow inconsistent naming conventions for gene symbols. For example, GPR175, FLJ32197, and TPRA1 are all symbols for the same gene, so different papers may refer to the same gene by different symbols.

To create this database, all unique gene symbols from our data pool were passed through the HGNC multi-symbol checker tool to find their approved name and full gene name. The tool identifies whether a gene symbol is an approved name, an alias name, a previously approved name, or unknown. If a symbol was an approved name, its name and full name were recorded. If a symbol was a previous name or an alias name, its approved symbol was recorded. Unknown names were left as they were, with the full name recorded as "unknown", on the assumption that, because the disease risk genes are known names, the unknown names will not match them.

The difficulty came when a symbol was listed both as an approved name and as an alias name, or as an alias name for several different genes. When a symbol was listed as the approved name for one gene, that name took precedence over the others. When a symbol was listed as an alias for multiple genes, or as a previous symbol of one gene and an alias of another, the conflict was either examined manually on a case-by-case basis or the symbol was kept with the full name listed as a conflict.

Disease risk genes were also added to the geneNames database. This process was later automated, and a database of over 8000 names was created to form a consistent naming convention. The data frame was built into the App so that gene symbols referring to the same gene could be connected and matched with their disease risk genes. As a result, we were able to find more papers for a good number of the disease risk genes.
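The precedence rules described above can be sketched in R as follows. This is an illustration under stated assumptions, not the team's automated script: the function resolve_symbols, the input table checker, its columns (Input, MatchType, ApprovedSymbol), the match-type labels, and the symbols ABC, GENEA, GENEB, and XYZ1 are all hypothetical.

```r
# Hypothetical sketch: pick one approved name per input symbol.
# `checker` holds one row per (input symbol, match) pair, as a symbol
# checker export might; the column names and labels are assumptions.
resolve_symbols <- function(checker) {
  resolve_one <- function(rows) {
    symbol   <- rows$Input[1]
    approved <- rows$ApprovedSymbol[rows$MatchType == "Approved symbol"]
    other    <- unique(rows$ApprovedSymbol[
      rows$MatchType %in% c("Previous symbol", "Alias symbol")])
    if (length(approved) > 0) {          # an approved name takes precedence
      name <- approved[1]; status <- "approved"
    } else if (length(other) == 1) {     # a single alias or previous match
      name <- other; status <- "mapped"
    } else if (length(other) > 1) {      # ambiguous: keep symbol, flag it
      name <- symbol; status <- "conflict"
    } else {                             # unmatched: keep symbol as it is
      name <- symbol; status <- "unknown"
    }
    data.frame(Gene = symbol, approvedName = name, status = status,
               stringsAsFactors = FALSE)
  }
  out <- do.call(rbind, lapply(split(checker, checker$Input), resolve_one))
  rownames(out) <- NULL
  out
}

# Toy input; the match types assigned here are illustrative only
checker <- data.frame(
  Input          = c("TPRA1", "GPR175", "FLJ32197", "ABC", "ABC", "XYZ1"),
  MatchType      = c("Approved symbol", "Alias symbol", "Alias symbol",
                     "Alias symbol", "Alias symbol", "Unmatched"),
  ApprovedSymbol = c("TPRA1", "TPRA1", "TPRA1", "GENEA", "GENEB", NA),
  stringsAsFactors = FALSE
)
resolve_symbols(checker)
```

In this sketch every conflict is flagged rather than resolved, which leaves room for the manual case-by-case review mentioned above.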

Update Gene Names

Both the disease2gene database and the genedata database are updated by joining them with the geneNames database on the Gene column and replacing each gene name in that column with the corresponding geneNames$approvedName. For example, for the disease2gene database:

library(dplyr)

# Attach the approved symbol to each driver gene, then overwrite the Gene column
newDriverGene <- disease2gene %>% inner_join(geneNames, by = "Gene")

newDriverGene$Gene <- newDriverGene$approvedName

Ranking system

The goal behind the rankings is to have a standardized measure of quality that we can use to evaluate all the studies. This helps us convey to the users of the app how confident we are in the data that we found in each scientific paper. Two formulas were created: one to rank the quality of data found in an individual study, and one to rank the quality of a combination of studies that all look at the same thing.

A ranking for an individual study is based on 4 variables about the study, 1 variable that is a sub-variable of 1 of those variables, and a variable "preference", which tells the function which variable should be considered the most important factor in how strong a study is. The variables taken into consideration are the method by which the food was consumed (in vivo or in vitro), the concentration (this applies only if the food was consumed in vivo), whether a whole food or an extract was consumed, the sample size of the study, the x-factor (our gut feeling about what the study should be ranked), and preference.

Foods that were consumed in vivo got higher rankings than foods that were consumed in vitro. Foods that were consumed in vitro got slightly higher rankings if the concentration used was low. Whole foods got higher ranks than extracts. Studies with larger sample sizes got higher scores than studies with smaller sample sizes. The x-factor is an opinion-based variable based solely on the Eat4Genes creator's gut feeling about how strong and useful that study's data is. The preference variable gets a weight of 50% in the final ranking (note that the x-factor cannot be the preference variable), the other 2 variables get weights of 20% each, and the x-factor gets a weight of 10%.
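The weighting scheme can be sketched as a weighted sum in R. The function name rank_study and its argument names are hypothetical. The sketch assumes each variable has already been scored on a 0 to 1 scale and that the concentration adjustment is folded into the method score by the caller; the page does not specify either detail.

```r
# Hypothetical sketch of the individual study ranking (not the App's code).
# Assumptions: every input is already a score between 0 and 1, and the
# concentration sub-variable is already reflected in `method`.
rank_study <- function(method, food_type, sample_size, x_factor,
                       preference = c("method", "food_type", "sample_size")) {
  preference <- match.arg(preference)  # the x-factor cannot be the preference
  scores  <- c(method = method, food_type = food_type, sample_size = sample_size)
  weights <- c(method = 0.2, food_type = 0.2, sample_size = 0.2)
  weights[preference] <- 0.5           # preferred variable gets 50%
  sum(weights * scores) + 0.1 * x_factor
}

# Same study scored with two different preferences
rank_study(1, 0.5, 0.8, x_factor = 0.6, preference = "method")       # 0.82
rank_study(1, 0.5, 0.8, x_factor = 0.6, preference = "sample_size")  # 0.76
```

The weights always sum to 100%: 50% for the preferred variable, 20% for each of the other two, and 10% for the x-factor.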

The combination ranking formula takes in the number of studies that all look at the same nutrient, the mean of the scores for each individual study, and the x-factor variable. Nutrients that had more studies supporting them got higher ranks than nutrients that had fewer studies. The mean of the scores was important to take into consideration because even if a nutrient had a lot of studies written about it, that does not necessarily mean all of the data was high quality. The mean is a good measure of how strong the studies are as a whole. The x-factor, just like in the individual study ranking, is a gut feeling about what the combination of studies should be ranked. The number of studies gets a weight of 20% in the final rank, the mean of the individual study rankings gets 70%, and the remaining 10% goes to the x-factor.
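A minimal sketch of the combination formula in R follows. The function name rank_combination is hypothetical, and the way the number of studies is turned into a 0 to 1 score (capped at an assumed max_studies of 5) is an assumption, since the page gives the weights but not the scaling.

```r
# Hypothetical sketch of the combination ranking (not the App's code).
# Assumption: the study count is scaled to 0..1 by capping at `max_studies`.
rank_combination <- function(study_scores, x_factor, max_studies = 5) {
  n_score <- min(length(study_scores), max_studies) / max_studies
  0.2 * n_score + 0.7 * mean(study_scores) + 0.1 * x_factor
}

# Three studies of the same nutrient with individual rankings on a 0..1 scale
rank_combination(c(0.82, 0.60, 0.70), x_factor = 0.5)
```

With the mean held fixed, adding studies raises the rank until the cap is reached, which reflects the statement that nutrients with more supporting studies rank higher.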

Methods

This section describes the webpages in detail and how they are computed. At the top of the page, the user selects a condition or disease to search for. This is saved as input$disease. The key risk genes that will be searched for are saved in target_genes, a subset of disease2gene. The matches found for the key risk genes in the study pool are saved in matching, which is computed as follows:

matching <- inner_join(target_genes, genedata)

matching is of the same form as genedata.
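The toy example below illustrates what this join does. When inner_join is called without a by argument, dplyr joins on every column the two tables share; base R's merge does the same by default and is used here so the sketch runs without extra packages. It assumes target_genes keeps the Gene and Expression columns of disease2gene, so a study row matches only when both the gene and the direction of change agree with the goal. All gene, study, and nutrient names are made up.

```r
# Toy data with made-up names; column layout follows the tables on this page
target_genes <- data.frame(
  Category = "cancer", Disease = "example disease",
  Gene = c("GENE1", "GENE2"), Expression = c("up", "down"),
  stringsAsFactors = FALSE
)
genedata <- data.frame(
  Gene       = c("GENE1", "GENE1", "GENE2", "GENE3"),
  P.value    = c(0.01, 0.02, 0.03, 0.01),
  Log2fc     = c(0.9, -0.7, -0.4, 1.2),
  Expression = c("up", "down", "down", "up"),
  Study      = c("StudyA", "StudyB", "StudyA", "StudyC"),
  Nutrient   = c("nutrient1", "nutrient2", "nutrient1", "nutrient3"),
  stringsAsFactors = FALSE
)

# merge() joins on the shared columns (here Gene and Expression),
# like inner_join() without `by`
matching <- merge(target_genes, genedata)
matching
```

Only the GENE1 "up" row and the GENE2 "down" row survive. The GENE1 "down" result from StudyB is dropped because it moves the gene away from its goal expression, and GENE3 is dropped because it is not a target gene.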

What Should I Eat?

At the top of the page, the user is shown the unique list of nutrients found for them in matching, sorted by category. They can then choose between Bubble View and Table View to view their recommendations.

Table View

Table view is where the user can view their recommendations in text form. It returns a table of the form:

Nutrient Ranking Category Description Link

In order to add the information about the nutrients, foods_complete is formed, which is computed as:

foods_complete <- inner_join(foods_info, nutrient_info)

The table view returns foods_complete_link, in which the Link column has been mutated from a plain-text URL into a clickable hot link.
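One common way to build such a hot link is to wrap the URL in an HTML anchor tag. The sketch below is an assumption about the approach, not the App's code: the helper make_hot_link is hypothetical, the example URL is a placeholder, and base R's transform stands in for dplyr's mutate.

```r
# Hypothetical helper: wrap a URL in an anchor tag that opens in a new tab
make_hot_link <- function(url, label = url) {
  sprintf('<a href="%s" target="_blank">%s</a>', url, label)
}

# Toy row with a placeholder URL
foods_complete <- data.frame(
  Nutrient = "nutrient1",
  Link     = "https://example.org/nutrient1",
  stringsAsFactors = FALSE
)

# Replace the plain-text Link column with its HTML version
foods_complete_link <- transform(foods_complete, Link = make_hot_link(Link))
foods_complete_link$Link
```

For the anchor to be clickable, the table widget must render the column as HTML rather than escaping it.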

Bubble View

Bubble view is where the user can view their results as a categorized packed bubble chart. The highcharter package is used, which provides an R interface to the Highcharts plotting library.

The tooltip for each object contains the name of the nutrient followed by a small image such as an emoji, the ranking of the nutrient, and a brief description of the nutrient. It is added to foods_complete as a column called text, and this new object is stored as foods_info_text.
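A sketch of how such a text column could be assembled is shown below. The exact HTML the App uses is not given on this page, so the markup, the image height, and the toy values are assumptions.

```r
# Toy row with placeholder values; column names follow nutrient_info
foods_complete <- data.frame(
  Nutrient    = "nutrient1",
  Ranking     = 0.82,
  Description = "Example description. High in sodium.",
  Img.link    = "https://example.org/img/nutrient1.png",
  stringsAsFactors = FALSE
)

# Build the tooltip: name, small image, ranking, then the description
foods_info_text <- transform(
  foods_complete,
  text = sprintf('<b>%s</b> <img src="%s" height="16"><br>Ranking: %s<br>%s',
                 Nutrient, Img.link, Ranking, Description)
)
foods_info_text$text
```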

Highchart plots are plotted using series, and empty series cannot be plotted. As such, foods_info_text is sorted into the six categories as separate objects. For each category, a new subset is created, as well as a Boolean object that stores whether that subset is non-empty. For example, for whole foods:

# Rows of foods_info_text in the "whole food" category
foods_info_wf <- subset(foods_info_text, Category == "whole food")

# TRUE when at least one whole food was found
check_wf <- nrow(foods_info_wf) != 0

This is also used at the top of the page to only print categories of nutrients that are not empty.
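The same non-empty check can be expressed for all categories at once. The sketch below is an alternative formulation, not the App's code: it uses base R's split and Filter instead of one object per category, and the category levels and nutrient names are placeholders.

```r
# Toy data; the factor levels stand in for the App's category names
foods_info_text <- data.frame(
  Nutrient = c("nutrient1", "nutrient2", "nutrient3"),
  Category = factor(
    c("whole food", "whole food", "phytonutrient"),
    levels = c("whole food", "whole food extract", "phytonutrient")
  ),
  stringsAsFactors = FALSE
)

# split() returns one data frame per level, including empty ones
by_category <- split(foods_info_text, foods_info_text$Category)

# Keep only the categories that can be plotted as a series
non_empty <- Filter(function(d) nrow(d) != 0, by_category)
names(non_empty)
```

Each element of non_empty could then be added to the chart as its own series, and names(non_empty) gives the categories to print at the top of the page.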

The packed bubble highchart object is created, and then each category of food is added as a series if it is not empty. The tooltip is added as described above, as well as the click event leading to a new window being opened with more information about the nutrient. In other words, when the bubble is hovered over it will show the tooltip described and when it is clicked it will lead to a new window with information about the food.

The bubbles are sorted by category and the size of each individual bubble depends on the ranking of the nutrient. The user may choose to hide or show a category by clicking on the legend.

Why Should I Eat It?

This page shows more details about the genes selected for the chosen condition and the matches found in the pool of studies. At the top of the page, the target_genes are shown to the user. matching_genes is the subset of matching that shows the Gene, Expression, and Nutrient columns. matching_genes_icons turns the Expression into either an up or down arrow HTML object depending on the Expression. matching_genes_icons is given to the user as a table of the form:

Gene Expression Nutrient
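A sketch of the Expression-to-arrow step is shown below. The App's actual HTML objects are not described on this page, so the HTML arrow entities used here are stand-ins, and the gene and nutrient names are made up.

```r
# Assumed icon lookup: HTML entities for an up arrow and a down arrow
icons <- c(up = "&uarr;", down = "&darr;")

# Toy subset of matching with the three displayed columns
matching_genes <- data.frame(
  Gene       = c("GENE1", "GENE2"),
  Expression = c("up", "down"),
  Nutrient   = c("nutrient1", "nutrient1"),
  stringsAsFactors = FALSE
)

# as.character() guards against factor columns being indexed by level number
matching_genes_icons <- transform(
  matching_genes,
  Expression = unname(icons[as.character(Expression)])
)
matching_genes_icons
```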

Which Studies Say I Should Eat It?

This table shows the user more information about the studies referenced to create the recommendations from the Eat4Genes database. study_info is the information about the studies that the matches were found in. It is computed as:

study_info <- inner_join(matching, studydata)

From there, the links are made into hot links and the columns are selected, using the same methods already described.

More Details