Notebook1-Intro-To-The-Data.Rmd

---
title: 'Notebook Day 1: Introduction to scRNAseq Analysis of FTD Organoids'
subtitle: IDEA Alzheimer's Data Analysis Bootcamp - Summer 2022
author: "Insert your name here"
output:
  pdf_document:
    latex_engine: xelatex
  html_document: default
---

```{r setup, include=FALSE}
# Required R package installation:
# These will install packages if they are not already installed

# Set the correct default repository
r = getOption("repos")
r["CRAN"] = "http://cran.rstudio.com"
options(repos = r)


if (!require("ggplot2")) {
  install.packages("ggplot2")
  library(ggplot2)
}
if (!require("knitr")) {
  install.packages("knitr")
  library(knitr)
}
if (!require("tibble")) {
  install.packages("tibble")
  library(tibble)
}

if (!require("dplyr")) {
  install.packages("dplyr")
  library(dplyr)
}

if (!require("tidyr")) {
  install.packages("tidyr")
  library(tidyr)
}
if (!require("ggridges")) {
  install.packages("ggridges")
  library(ggridges)
}

if (!require('gprofiler2')) {
  install.packages("gprofiler2")
  library(gprofiler2)
}
if (!require('rrvgo')) {
  install.packages("rrvgo")
  library(rrvgo)
}
if (!require('viridis')) {
  install.packages("viridis")
  library(viridis)
}

# Utility function to convert cell type codes to their real names for plotting
setCellTypeNames <- function(vec) {
    vec <- recode(as.factor(vec),
                          "Ast"="Astrocytes",
                          "ExDp1"="Excitatory deeplayer\n1",
                          "ExDp2"="Excitatory deeplayer\n2",
                          "ExM"="Maturing excitatory",
                          "ExM-U"="Maturing excitatory\nupper enriched",
                          "ExN"="Newborn excitatory",
                          "Glia"="Unspecified glia/\nnon-neuronal cells",
                          "InCGE"="Interneurons caudal\nganglionic eminence",
                          "InMGE"="Interneurons medial\nganglionic eminence",
                          "IP"="Intermediate\nprogenitors",
                          "OPC"="Oligodendrocyte\nprecursors",
                          "oRG"="Outer radial glia",
                          "PgG2M"="Cycling progenitors\n(G2/M phase)",
                          "PgS"="Cycling progenitors\n(S phase)",
                          "UN"="Unspecified neurons",
                          "vRG"="Ventricular radial \nglia")
    return(vec)}

knitr::opts_chunk$set(echo = TRUE)
```

##  Overview

-   This notebook aims to give initial exposure to our dataset of scRNAseq expression data from organoid cells containing the _MAPT_ mutation linked to frontotemporal dementia, and to give a sense of the immense variety of research questions that can be explored with such high-dimensional scRNAseq data.

-   Data source: 2021 Bowles et al. paper "ELAVL4, splicing, and glutamatergic dysfunction precede neuron loss in MAPT mutation cerebral organoids."

## Outline

* Brief background on single-cell RNA sequencing data analysis

* Intro to our NSCI Dataset

* How the `R` programming language affords the researcher enormous flexibility and control over data representation and visualizations

* Introductory exercises in exploring a specific celltype through the lens of the NSCI dataset


## Background on our Dataset: Why single-cell RNA-seq?

Across human tissues there is an incredible diversity of cell types, each with its own defining states and interactions. Importantly among these  key tissue-distinguishing states, is the fact that different cell types (tissues) tend to have different characteristic gene expression profiles. Moreover, these expression profiles are not static but dynamic- they can change over the developmental timeline of the tissue.

Because of this distinctiveness of different cell types, it is often of interest to molecular biology and disease researchers to understand the specific expression profile of a selected cell type or tissue of interest, in order to better understand and study their model system. For example, much insight in the domain of cancer research has been dependent on cell-type-specific experiments, as tumors often manifest in specific tissues in the body.

Single-cell RNA-seq (scRNA-seq) offers a glimpse into what genes are being expressed, as well as their expression levels, at the level of individual cells. Based on their expression profiles, individual cells can be grouped by both expert and automated scrutiny into their predicted celltype, and once this is done, gene expression trends of those cells can be analyzed at the cell type level.

This idea of celltype-specific analysis based on single-cell RNAseq data is a core focus of this summer's Data-Driven Alzheimer's Research Challenge.

## Required Knowledge

By Day 2 of the bootcamp (May 26, 2022) if you have not done so already and if you are not already familiar with scRNAseq as a technique and the data output it generates, we require you read this [_Nature Protocols_ tutorial](https://www.nature.com/articles/s41596-020-00409-w) on the computational analysis of scRNAseq data. While we will not be performing the entirety of the workflow described, it is necessary context to know how the data has already been preprocessed.

Optionally, for a short big picture overview of the advantages and challenges of scRNAseq, this brief [primer](https://hbctraining.github.io/scRNA-seq/lessons/01_intro_to_scRNA-seq.html) is a great resource.


## Our Summer Focus

This summer, we will be exploring and visualizing a large corpus of scRNAseq data from organoid models of frontotemporal dementia and isogenic controls. We will introduce `R` and [`Seurat`](https://satijalab.org/seurat/data analysis)-based data analysis workflows that will guide you through this process, with the aim of providing you with the tools for developing your own creative and publishable analyses of omics-scale scRNAseq datasets.

Our scRNAseq data set of focus was originally generated in the experimental suite described in [Bowles et al. (2021)](https://doi.org/10.1016/j.cell.2021.07.003), and is stored here in `FinalMergedData-downsampled.rds`. This data represents gene expression levels and cell attributes for a large body of 370,000 individual cortical organoid cells, encompassing a set of mutant cells with a mutation in the _MAPT_ gene that is implicated in [frontotemporal dementia](https://www.hopkinsmedicine.org/health/conditions-and-diseases/dementia/frontotemporal-dementia) (FTD), as well as the equivalent CRISPR-corrected control cells.  These cells were subjected to single-cell RNA sequencing at 2, 4, and 6 months of development. Data from all 3 timepoints is represented in the full  dataset.

We are interested in analyzing subsets of this data based on various combinations of cell attributes, in order to further characterize genetic perturbations associated with frontotemporal dementia in mutant brain cells versus controls, how these changes manifest in each of 16 identified neuronal cell types, and how these changes evolve over time.

## A few important metadata definitions:*

### **Cell lines represented**

Single-cell data was originally collected from organoids derived from 3 main cell lines (`ND`, `GIH6`, `GIH7`) from the [Tau Consortium](https://tauconsortium.org/) in conjunction with the [Neural Stem Cell Institute](https://www.neuralsci.org/tau). Each line encompasses an FTD mutant version and the corresponding CRISPR-corrected isogenic control. These iPSC lines were differentiated into cerebral organoids and subjected to single-cell RNA sequencing at 2, 4, and 6 months.

**Cell Line Key**

| Line     | Wildtype/mutant         |
|----------|-------------------------|
| ND-B06   | Wildtype                |
| ND-B09   | Mutant                  |
| GIH6-E11 | Wildtype                |
| GIH6-A02 | Mutant                  |
| GIH7-F02 | Wildtype                |
| GIH7-B12 | Wildtype                |
| GIH7-A01 | Mutant                  |


**_In the dataset, "wildtype" and "mutant" are denoted by "V337V" and "V337M", respectively._**

### **Cell types represented (all neuronal cells)**

Bowles et al. clustered the single cells of this data set (all cell lines) into 16 primary neuronal cell type clusters. Cell types were identified by both automated and expert cell type annotation. The identified categories are represented as follows:

**Cell Type Key**

-   Ast - astrocytes
-   ExDp1- excitatory deep layer 1
-   ExDp2- excitatory deep layer 2
-   ExM- maturing excitatory
-   ExM-U- maturing excitatory upper enriched
-   ExN- newborn excitatory
-   Glia- unspecified glia/non-neuronal cells
-   InCGE- interneurons caudal ganglionic eminence
-   InMGE- interneurons medial ganglionic eminence
-   IP- intermediate progenitors
-   OPC- oligodendrocyte precursor cells
-   oRG- outer radial glia
-   PgG2M- cycling progenitors (G2/M phase)
-   PgS- cycling progenitors (S phase)
-   UN- unspecified neurons
-   vRG- ventricular radial glia.


## Intro to the Data

**Important:** As you read through the example exercises in this notebook, note that at this point we *don't* ask that you focus on or fully understand how the code works. The aim of these exercises is to help you begin to appreciate the variety of options and modes that `R` offers you, the researcher, for exploring transcriptional data from different perspectives.

Exploratory data analysis is an important first step when getting oriented with any new data set. For example, we might want to know how many (or what proportion) of cells are in each cell type category, and whether or not these proportions change over time.

To calculate these statistics, we can use the `table()` function in base `R`, which returns the counts at each level of a category (or at each combination of two specified categories).

```{r echo=TRUE, message=FALSE, warning=FALSE}

# Fetch overall cell-level statistics (for all cell lines)
FDS_CT_stats <- readRDS("/data/AlzheimersDSData/overall_celltype_statistics.rds")
FDS_CT_stats$CellType <- setCellTypeNames(FDS_CT_stats$CellType)
kable(head(FDS_CT_stats))
```
Let's inspect the dimensions of this dataset using the base `R`  `dim()` function:

```{r}
dim(FDS_CT_stats)
```
We can see that we have 379142 rows and 3 columns.


In this data, each row is an individual single cell. The columns are cell attributes:


-  `CellType`: The cell type category to which this cell belongs
-  `Age`: The developmental time point at which this cell was sequenced
-  `Mt`: The status of the cell as bearing the frontotemporal dementia disease mutation (V337M) or wildtype (V337V)


Get counts of cells in each cell type category:

```{r}
t1 <- table(FDS_CT_stats$CellType)
kable(t1)
```

Get counts of cells analyzed at each sequencing timepoint:

```{r}
t2 <- table(FDS_CT_stats$Age)
kable(t2)
```

Get counts at each combination of time point + CellType:

```{r}
t3 <- table(FDS_CT_stats$CellType,FDS_CT_stats$Age)
kable(t3)

```

As a visual alternative, we can plot the proportions of cells in each cell type, across the full data set:

```{r}
# Calculate overall cell type proportions for plotting
FDS_CT_props <- FDS_CT_stats %>%
  group_by(CellType,Age) %>%
   summarise(n = n()) %>%
   ungroup() %>%
   group_by(Age) %>%
   mutate(freq = n / sum(n))

ggplot(FDS_CT_props, aes(x = Age, y=freq, fill = CellType)) +
  geom_col() +
      labs(title = "Cell Type Proportions at Each Sequencing\nTimepoint",
        x = "Developmental Timepoint of Organoid Cells",
        y = "Proportion of Cells", fill="Cell Type" ) +
   theme_minimal()

```

For each cell type, we can also inspect the distributions of cells with the diseased phenotype vs. in the control group.

(Recall: `V337M` = mutant, `V337V` = wildtype)

```{r echo=FALSE, message=FALSE, warning=FALSE}

# Calculate overall proportions for plotting
FDS_CT_props_variant <- FDS_CT_stats %>%
  group_by(CellType,Age,Mt) %>%
   summarise(n = n()) %>%
   ungroup() %>%
   group_by(Age) %>%
   mutate(freq = n / sum(n))

FDS_CT_props_variant$CellType <- setCellTypeNames(FDS_CT_props_variant$CellType)
ggplot(FDS_CT_props_variant, aes(x = Age, y=freq, fill = Mt)) +
  geom_col(position = "fill") +
   facet_wrap(~CellType) +
        labs(title = "Proportions of Variants at Each Sequencing Timepoint (For All Cell Types)",
        x = "Developmental Timepoint of Organoid Cells",
        y = "Proportion of Cells", fill="Variant" ) +

   theme_minimal()

```

While the overall proportions of individual cell types can be observed to increase or decrease over time, we can see that the relative experimental group sizes remain approximately constant over time, for all cell types. This allows us to draw valid comparisons between groups at each time point.

## Introductory Exercise: Analyzing CellType-Specific Gene Expression Trends

This section will walk you through a few examples of how you could stage a celltype-specific analysis of gene expression trends in our NSCI scRNAseq data.

### Cell Type Designation

Replace the code `Ast` (for "astrocytes") below with the code for your assigned cell type. Make sure to keep it in double quotes ("").

_(see list above for cell type codes)_

```{r}
# Important:
# The text you enter must exactly match
# the code for your cell type as given above- case-sensitive!

my_celltype <- "Ast"

```

### Cell Type Marker Genes

Replace the gene IDs below with the IDs of the marker genes for your assigned cell type. Refer to this [spreadsheet](https://docs.google.com/spreadsheets/d/1oV5GrMy6DsFvzfwBZip2PDOjtRJAYkH-KGKO8WlVwFI/edit?usp=sharing) for known marker genes associated with your celltype. Make sure to type the IDs exactly, keeping each one individually in double quotes ("").


```{r}
# Important:
# The IDs you enter must exactly match the gene IDs for your cell type
# as given in the spreadsheet- case-sensitive!

my_celltype_marker_genes <- c("GFAP","ALDH1L1","SOX9","AQP4")

```


Read in the gene expression data and cell-level metadata for your cell type:

```{r}
my_celltype_data <- readRDS(sprintf("/data/AlzheimersDSData/celltype_data_%s.rds",my_celltype))
```

The function `readRDS()` reads the data from a specified file in your R environment into an R object.
After read-in, the object can then be accessed for the remainder of the R session under the same variable name.
You can see all currently-loaded objects under the **Environment** panel in the upper right corner of your RStudio.

```{r}
# gather data into long format for plotting,
# with one column "Gene" containing all 3000 of the most-variable genes
# identified in the overall dataset
my_celltype_data.df <- my_celltype_data %>% tidyr::gather(key = "Gene",
                                                          value="ExpressionLevel",
                                                          -Age,-Mt,-CellType)
# filter the resulting data frame
# so that it contains only the rows corresponding to the
# cell type marker genes
my_ct_genes.df <- my_celltype_data.df %>%
                        filter(Gene %in% my_celltype_marker_genes)
```


```{r}
# Use `ggplot2` to plot expression distributions of marker genes for this cell type,
# by means of a "violin plot" (ggplot2::geom_violin())

plot <- ggplot(data = my_ct_genes.df,
               aes(x = Gene, y = ExpressionLevel, fill=Gene))

plot <- plot + geom_violin()
plot

```

*Note: if your graph appears blank or data for only one gene shows up, it may just mean that your marker genes are not among the top most-variable genes in the dataset. Let us know if this occurs, so we can find alternate genes for you to analyze. Alternatively, you can inspect the genes available in `my_celltype_data.df` and pick a few that have expression values greater than 0.*

This visual is missing several key aesthetics that add value for data interpretation- most notably a title. We can use the `ggplot2` library to add an informative title, and also change the plot theme to make it a bit more visually pleasing, simply by adding additional geoms (or "layers") to our existing plot.

```{r}

plot <- plot + ggtitle(sprintf("Expression Distributions of Key Marker Genes for CellType %s",my_celltype)) +
  theme_minimal()
plot
```

We'll dive further into these and many other `ggplot` functionalities later in the bootcamp.

Now, let's choose one of these genes and explore how we can visualize its expression trends with respect to other attributes within the data. For example, we might be interested in exploring how this gene is expressed in the mutant (diseased) cells versus in the isogenic controls, or how its expression levels change over time.

```{r}
# select one marker gene for plotting
marker_gene1 <- my_celltype_marker_genes[1]

# filter the previous celltype data
# so that it contains only the rows corresponding to the
# cell type marker gene of interest
gene.df <- my_celltype_data.df %>%
                        filter(Gene == marker_gene1)

```

### Plot expression distributions with respect to variant (attribute "Mt")

```{r}

var_to_plot <- "Mt"

# Use `ggplot` to plot the expression distribution of
# this cell type marker gene with respect to variant,
# by means of a "violin plot" (ggplot2::geom_violin())

plot2 <- ggplot(data = gene.df,
               aes(x = .data[[var_to_plot]], y = ExpressionLevel, fill=.data[[var_to_plot]]))

plot2 <- plot2 + geom_violin() +
                ggtitle(sprintf("%s Expression Distribution within CellType %s, Diseased vs. Normal Cells",marker_gene1, my_celltype)) +
                theme_minimal()
plot2


```
Are there noticeable expression changes between variants for this marker gene?

### **CHALLENGE 1:** Plot expression distributions with respect to time (attribute "Age")

Using the previous example and provided code as reference, use `ggplot2` to plot the expression distribution of this cell type marker gene with respect to **time**.

*Hint: use "Age" as the plotting variable.*


```{r}

# Insert your code here

```

```{r}
# provided code
plot3 <- ggplot(data = gene.df,
               aes(x = .data[[var_to_plot]], y = ExpressionLevel, fill=.data[[var_to_plot]]))

plot3 <- plot3 + geom_violin() +
                ggtitle(sprintf("%s Expression Distribution within CellType %s,\nby Developmental Time Point",marker_gene1, my_celltype)) +
                theme_minimal()
plot3


```

Are there any time trends evident?


*Note: it is also possible to recreate these graphs with respect to multiple genes. Here we give examples with one gene for ease of illustration.*


### Exploring alternative visualization modes

Another cool feature of data visualization in `R` is the modularity of visualization types available. There are often multiple different graph types that would be reasonable choices for illustrating the same expression distribution- for example, a `violin plot` versus a `box plot` versus a `ridge plot`. `R`'s `ggplot2` library makes it easy to switch between these options.

For example, using virtually the same code pipeline, we can render our expression distributions for the set of cell type marker genes as boxplots:

```{r}
# Use `ggplot` to plot expression distributions of marker genes for this cell type,
# this time by box plot (ggplot2::geom_boxplot())

plot4 <- ggplot(data = my_ct_genes.df,
               aes(x = Gene, y = ExpressionLevel, fill=Gene))

plot4 <- plot4 + geom_boxplot() +
                ggtitle(sprintf("Expression Distributions of Key Marker Genes for CellType %s",my_celltype)) +
                theme_minimal()
plot4

```
We can even combine the aesthetics if desired!

```{r}
plot + geom_boxplot(width=0.05, color="grey", alpha=0.6)
```

An alternative option for studying the distribution of a numeric variable across several groups is a Ridgeplot, which we can similarly achieve by simply specifying a different `ggplot` geometry:

```{r message=FALSE, warning=FALSE}
# This graph is made using the ggridges library,
# which is a ggplot2 extension

plot6 <- ggplot(data = my_ct_genes.df,
               aes(x = ExpressionLevel, y = Gene, fill=Gene))

plot6 <- plot6 + geom_density_ridges() +
  theme_ridges() +
  theme(legend.position = "none") + ggtitle(sprintf("Expression Distributions of Key Marker Genes for CellType %s",my_celltype)) + theme_minimal()

plot6

```

### **CHALLENGE 2:** Geometry Juggling with GGplot2

Didn't mean for that to rhyme...

For this challenge exercise, start with the data in `my_ct_genes.df` and come up with an alternative visualization geometry for rendering the expression distributions of your set of celltype marker genes. *Hint*: See this  [cheatsheet](https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf) for all of the cool options available for different visualization tasks in `ggplot2`. Specifically, we want a graph type falling under `discrete x, continuous y`.

```{r}
# Insert your code here
```


## Drawing Biological Connections

In the previous exercises, we explored gene expression for your assigned celltype through the lense of a few major marker genes. There are actually many more "differentally expressed" genes associated with each cell type, as we will discover. Once the genes associated with a celltype of interest have been identified, we can use `R` with the help of external libraries to draw conclusions about their biological significance.

There are many types of enrichment analyses available that are useful for drawing functional insight about a set of differentially-expressed genes identified through scRNAseq analysis.

For example, as a preview, we can take the top 30 genes identified as being enriched in the celltype `Astrocytes`, and identify the *top 10 biological functions* associated with this group of genes (as characterized by [Gene Ontology enrichment](http://geneontology.org/docs/go-enrichment-analysis/)).

### Read in genes differentially expressed in astrocytes:

```{r}

genes_of_interest <- readRDS("/data/AlzheimersDSData/celltype_marker_genes_Ast.rds")
head(genes_of_interest)

genes <- rownames(genes_of_interest)[1:30]

```

### Run Gene Ontology enrichment analysis on genes and view output

*Note: Code included as a preview; we will spend more time understanding how GO enrichment analysis in R works in a later bootcamp session!*

```{r echo=FALSE}
require(gprofiler2)

GOres_human <- gost(query = genes,
organism = "hsapiens", ordered_query = FALSE,
multi_query = FALSE, significant = TRUE, user_threshold = 0.05)

# Results from the query are stored in GOres_human$result

# Make a dataframe of major Gene Ontology terms returned and their p-values
# order the results by increasing p-value (most significant first)
GOres_human.df <- GOres_human$result %>%
dplyr::select(term_name,term_id,p_value) %>%
arrange(p_value)

# Show results
kable(head(GOres_human.df))

```

We see that R libraries make it easy to retrieve the same information you could obtain by manually querying these genes at the Gene Ontology API webpage. The advantage of using R to do so is in the automation; it allows GO enrichment (and many similar types of gene enrichment analyses!) to be integrated directly into R workflows, for more efficient analysis of large numbers of gene sets.

As a further preview of a type of analysis that may be used to draw out biological significance,  the `rrvgo` package can be employed after GO analysis to semantically summarize the output GO terms:


```{r}
# Code included will be discussed more in depth
# in a later bootcamp session
terms<- GOres_human.df$term_id

require(rrvgo)
simMatrix <- calculateSimMatrix(GOres_human.df$term_id,
                                # human reference database
                                orgdb="org.Hs.eg.db",
                                # Gene ontology to use: Biological Process
                                ont="BP",
                                method="Rel")

scores <- setNames(-log10(GOres_human.df$p_value), GOres_human.df$term_id)
reducedTerms <- reduceSimMatrix(simMatrix,
                                scores,
                                threshold=0.7,
                                # human reference database
                                orgdb="org.Hs.eg.db")

treemapPlot(reducedTerms)


```

Treemaps are space-filling representations of hierarchies- for instance, the heirarchy of enrichment significance amongst the output GO terms. The terms are grouped (colored) based on their parent GO category, and the space used by the term is proportional to the score (significance). Treemaps can help with the interpretation of GO results because they show the large number of enriched terms as a reduced set of broader categories.

## **Take-Home Challenge:** Exploring the full NSCI dataset through the lens of your "Pet Gene" and assigned Cell Type

As a biology researcher, you will undoubtedly be coming into this bootcamp with a few "pet genes" of interest at the top of your mind, from your literature readings or previous or current research focuses. Let's find out how your favorite gene is expressed in our NSCI dataset, to find out if it has any notable connections to the frontotemporal dementia disease phenotype.

For this exercise, you will be using our newly-deployed online interface to the NSCI dataset- the IDEA "AlzApp". Navigate to https://inciteprojects.idea.rpi.edu/apps/AlzApp/ and plug in your gene of interest using the drop down menu of the `Gene Explorer` panel. Explore its expression levels in the feature plots and across each represented cell type. Report on any interesting trends you find.

*Insert your findings here.*

Now, do an analysis of the genes associated with ***your assigned celltype***. Refer to this [spreadsheet](https://docs.google.com/spreadsheets/d/1oV5GrMy6DsFvzfwBZip2PDOjtRJAYkH-KGKO8WlVwFI/edit?usp=sharing) for known marker genes associated with your celltype. Query the genes in the web app linked above, and write a brief report here on the results (e.g. is the gene differentially expressed between the diseased versus normal phenotype? How does its expression change over time? Feel free to be creative in your analysis and use any external literature to aid in your hypothesis development. This will be useful brainstorming for later exercises.)

*Insert your findings here.*

In your free time, we encourage you to play around with the app and its features, and feel free to bring up any questions or critiques about the tool during the bootcamp or over Slack. Your work this summer will contribute a lot to what the app can do!

## References and Further Reading:

- [_Nature Protocols_ tutorial](https://www.nature.com/articles/s41596-020-00409-w) on the computational analysis of scRNAseq data
- [List of known marker genes for neuronal cell types](https://docs.google.com/spreadsheets/d/1oV5GrMy6DsFvzfwBZip2PDOjtRJAYkH-KGKO8WlVwFI/edit?usp=sharing)
- [Gene Ontology Enrichment Analysis](http://geneontology.org/docs/go-enrichment-analysis/)
- [About REVIGO GO term summarization](http://revigo.irb.hr/FAQ.aspx#q01)
- [Cheatsheet](https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf), preview of all the cool stuff you can do with `ggplot2`!