Notebook5-Biological-Connections.Rmd

---
title: 'Drawing Biological Connections with Gene Ontology Enrichment Analysis'
subtitle: IDEA Alzheimer's Data Analysis Bootcamp - Summer 2022
author: "Insert your name here"
output:
  pdf_document:
    latex_engine: xelatex
  html_document:
    df_print: paged
---

```{r setup, include=FALSE}
# Required R package installation:
# These will install packages if they are not already installed

# Set the correct default repository
r = getOption("repos")
r["CRAN"] = "http://cran.rstudio.com"
options(repos = r)


if (!require("knitr")) {
  install.packages("knitr")
  library(knitr)
}
if (!require("tibble")) {
  install.packages("tibble")
  library(tibble)
}

if (!require("dplyr")) {
  install.packages("dplyr")
  library(dplyr)
}

if (!require("tidyr")) {
  install.packages("tidyr")
  library(tidyr)
}
if (!require("ggplot2")) {
  install.packages("ggplot2")
  library(ggplot2)
}

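if (!require('gprofiler2')) {
  install.packages("gprofiler2")
  library(gprofiler2)
}

# rrvgo and the human annotation database org.Hs.eg.db are Bioconductor packages
if (!require("BiocManager")) {
  install.packages("BiocManager")
}
for (pkg in c("rrvgo", "org.Hs.eg.db")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    BiocManager::install(pkg)
  }
}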

knitr::opts_chunk$set(echo = TRUE)
```

# Overview

This notebook performs functional enrichment analysis on an input list of genes. Specifically, we run Gene Ontology enrichment on the input gene list, followed by Revigo summarization of the enriched GO terms.

This type of analysis is a common way to characterize the functions of a set of genes of interest, both to summarize the biological role of the set as a whole and to identify the larger-scale cellular processes in which the genes may participate.


*  Part I: Basic Gene Ontology analysis with `g:profiler`
*  Part II: `Revigo` for GO semantic summarization

## Data

For this analysis, we will use the set of genes found to be enriched in the cell type `Astrocytes` in FTD organoid models.

```{r}
Ast.markers.top <- readRDS("/data/AlzheimersDSData/celltype_marker_genes_Ast.rds")
```


## Part I: Run Gene Ontology Enrichment Analysis on genes of interest

Gene Ontology (GO) term enrichment is a technique for interpreting the biological roles of sets of genes that makes use of the Gene Ontology system of classification. In this ontology, genes are assigned to a set of predefined bins depending on their functional characteristics. For example, a gene may be categorized by its role in encoding a cell signaling receptor, or by having a protein product involved in a cellular repair process.

The `gprofiler2` package, the R interface to the `g:profiler` toolset, enables functional profiling of custom gene lists using Gene Ontology terms. Specifically, the [`gost`](https://www.rdocumentation.org/packages/gprofiler2/versions/0.2.1/topics/gost) function performs statistical enrichment analysis to find over-representation of biological functions among the input genes. Significantly enriched functions are identified by means of a hypergeometric test followed by correction for multiple testing.
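
To make the statistics concrete, here is a minimal sketch of a hypergeometric over-representation test for a single term, followed by a multiple-testing adjustment, using only base R. All counts and p-values are invented for illustration. Note that `gost` applies its own default correction method (g:SCS, per the `gprofiler2` documentation), so the Benjamini-Hochberg adjustment below is only a stand-in.

```{r}
# Toy numbers (invented for illustration, not taken from our data)
background_size <- 20000  # genes in the background
term_size       <- 200    # background genes annotated to the GO term
query_size      <- 100    # genes in our input list
overlap         <- 8      # input genes annotated to the GO term

# Probability of seeing at least `overlap` annotated genes by chance:
# P(X >= overlap) under the hypergeometric distribution
p_enrich <- phyper(overlap - 1, term_size, background_size - term_size,
                   query_size, lower.tail = FALSE)

# Adjust a small vector of raw p-values for multiple testing
# (termB and termC are invented values)
raw_p <- c(termA = p_enrich, termB = 0.03, termC = 0.40)
adj_p <- p.adjust(raw_p, method = "BH")
adj_p
```

With these toy counts we would expect about one annotated gene in the input list by chance, so observing eight yields a very small p-value.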

Using `g:profiler`, we can identify and show the top biological functions enriched among our genes of interest. The output of the GO enrichment analysis from `g:profiler` is a ranked list of GO terms, and their corresponding p-values (indicating term enrichment significance).

Read more on the background and usage of Gene Ontology Enrichment Analysis [here](http://geneontology.org/docs/introduction-to-go-resource/).


```{r}
## Run GO analysis on gene list
# Results from the query are stored in GOres$result

library(gprofiler2)

genes <- rownames(Ast.markers.top)

# ordered_query = TRUE assumes the genes are sorted by decreasing importance
GOres <- gost(query = genes,
              organism = "hsapiens", ordered_query = TRUE,
              multi_query = FALSE, significant = TRUE, user_threshold = 0.05,
              sources = c("GO:BP"))

# Make a data frame of the Gene Ontology terms returned and their p-values
# Note we order the results by increasing p-value (most significant come first)
GOres.df <- GOres$result %>%
  dplyr::select(term_name, term_id, p_value) %>%
  arrange(p_value)

# Show results
head(GOres.df)
terms <- GOres.df$term_id
```
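
One way to show the top enriched functions is a simple bar chart of `-log10(p-value)` per term. The sketch below uses a small made-up table, `toy_go`, with the same three columns as `GOres.df`. The term IDs are illustrative and the p-values are invented; to plot real results, substitute `GOres.df` for `toy_go`.

```{r}
library(ggplot2)

# Hypothetical stand-in for GOres.df (p-values invented for illustration)
toy_go <- data.frame(
  term_name = c("synaptic signaling", "gliogenesis",
                "cell adhesion", "ion transport"),
  term_id   = c("GO:0099536", "GO:0042063", "GO:0007155", "GO:0006811"),
  p_value   = c(1e-8, 3e-6, 2e-4, 1e-2)
)

# Higher values mean stronger enrichment
toy_go$neg_log10_p <- -log10(toy_go$p_value)

# Keep at most the 10 most significant terms
toy_top <- head(toy_go[order(toy_go$p_value), ], 10)

# Horizontal bars, most significant term at the top
top_plot <- ggplot(toy_top,
                   aes(x = reorder(term_name, neg_log10_p), y = neg_log10_p)) +
  geom_col() +
  coord_flip() +
  labs(x = "GO term", y = "-log10(p-value)")
top_plot
```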


## Part II: Revigo Summarization of Gene Ontology Terms

Revigo is a web-based tool for summarizing lists of GO terms (such as our output from Gene Ontology Enrichment Analysis) by finding a representative subset of the terms based on semantic similarity.

Once summarized, the resulting non-redundant GO term set can be visualized as a `treemap` by calling the `treemapPlot()` function from the `rrvgo` package on the reduced terms, as shown at the end of this section.

Learn more about Revigo summarization [here](http://revigo.irb.hr/FAQ.aspx#q01).

To implement Revigo summarization, we will use the `R` package `rrvgo` - usage reference [here](https://bioconductor.org/packages/release/bioc/vignettes/rrvgo/inst/doc/rrvgo.html#using-rrvgo). The following section highlights the core functionality of this package.

### Step 1: Similarity Calculation

First we use the R package `rrvgo` to calculate the similarity matrix between terms. The function `calculateSimMatrix` takes i) a vector of GO term IDs for which the semantic similarity is to be calculated, ii) an `orgdb` object for the organism database to reference, iii) the GO ontology category of interest (one of Biological Process, Molecular Function, or Cellular Component), and iv) the method used to calculate the similarity scores.

```{r}
library(rrvgo)

# calculate similarity matrix between GO terms
# function takes GO term ids as input;
# these are accessible in GOres.df$term_id
simMatrix <- calculateSimMatrix(GOres.df$term_id,
                                # human reference database
                                orgdb="org.Hs.eg.db",
                                # Gene ontology to use: Biological Process
                                ont="BP",
                                method="Rel")

```

### Step 2: Term Reduction

From the similarity matrix, we can next group the GO terms based on similarity. `rrvgo` provides the `reduceSimMatrix` function for this purpose. It takes as arguments i) the similarity matrix, ii) an optional named vector of scores associated with each GO term, iii) a similarity threshold used for grouping terms, and iv) an `orgdb` organism database reference object.

For component ii), the scores vector is optional information that helps `rrvgo` assign importance to input GO terms for summarization. Higher scores are interpreted as better, so we take the negative log10 of the p-values from GO enrichment to use as our scores.

```{r}
scores <- setNames(-log10(GOres.df$p_value), GOres.df$term_id)
reducedTerms <- reduceSimMatrix(simMatrix,
                                scores,
                                threshold=0.7,
                                # human reference database
                                orgdb="org.Hs.eg.db")


```
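
To build intuition for what term reduction does, the toy example below groups four placeholder GO terms by hierarchical clustering on `1 - similarity` and keeps the highest-scoring term in each group as its representative. This is a minimal sketch of the general idea using only base R. The IDs, similarities, and p-values are all invented, and the actual `reduceSimMatrix` implementation may differ in its details.

```{r}
# Toy similarity matrix for four placeholder GO terms (values invented)
toy_ids <- c("GO:0000001", "GO:0000002", "GO:0000003", "GO:0000004")
toy_sim <- matrix(c(1.0, 0.9, 0.2, 0.1,
                    0.9, 1.0, 0.3, 0.2,
                    0.2, 0.3, 1.0, 0.8,
                    0.1, 0.2, 0.8, 1.0),
                  nrow = 4, dimnames = list(toy_ids, toy_ids))

# Invented p-values, transformed so that higher scores mean stronger enrichment
toy_scores <- setNames(-log10(c(1e-6, 1e-3, 1e-4, 1e-8)), toy_ids)

# Cluster on dissimilarity (1 - similarity) and cut the tree at height 0.7
toy_clusters <- cutree(hclust(as.dist(1 - toy_sim)), h = 0.7)

# Within each cluster, keep the highest-scoring term as the representative
toy_parents <- tapply(names(toy_scores), toy_clusters,
                      function(ids) ids[which.max(toy_scores[ids])])
toy_parents
```

Here the first two terms are highly similar to each other, as are the last two, so the four terms collapse into two groups, each represented by its most significant member.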

### Step 3: Treemap Visualization

Treemaps are spatial visualizations of hierarchical structures. Terms are grouped (and colored) by their parent term, and the area occupied by each term is proportional to its score. Treemaps can help with the interpretation of the summarized Revigo results.

```{r}
treemapPlot(reducedTerms)
```
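
To see how scores translate into treemap area, the sketch below sums the scores within each parent group of a made-up table, `toy_reduced`. The column names `term`, `parentTerm`, and `score` are assumed to mirror those in `reducedTerms`, and all values are invented.

```{r}
# Hypothetical stand-in for a reduced-terms table (column names assumed)
toy_reduced <- data.frame(
  term       = c("term A1", "term A2", "term B1", "term B2", "term B3"),
  parentTerm = c("group A", "group A", "group B", "group B", "group B"),
  score      = c(6, 3, 8, 4, 2)
)

# Total score per parent group
group_area <- aggregate(score ~ parentTerm, data = toy_reduced, FUN = sum)

# Fraction of the total area each group would occupy in the treemap
group_area$share <- group_area$score / sum(group_area$score)
group_area
```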