Notebook2-Intro-to-Seurat.Rmd

---
title: "Examples of Applying Seurat Functions for Single-Cell RNAseq Data Analysis"
author: "Insert your name here"
subtitle: IDEA Alzheimer's Data Analysis Bootcamp - Summer 2022
output:
  pdf_document:
    latex_engine: xelatex
  html_document:
    df_print: paged
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

# Required R package installation (run this cell before knitting)
# These will install packages if they are not already installed

if (!require("Seurat")) {
   install.packages("Seurat")
   library(Seurat)
}
if (!require("tidyverse")) {
   install.packages("tidyverse")
   library(tidyverse)
}
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}

if (!require("plotly")) {
   install.packages("plotly")
   library(plotly)
}
if (!require("patchwork")) {
  install.packages("patchwork")
  library(patchwork)
}
# only print plotly code when valid
out_type<-NULL
out_type <- knitr::opts_knit$get("rmarkdown.pandoc.to")
if (is.null(out_type)) {out_type <- "none"}

```

# Overview

This notebook is written to be an introduction to the capabilities afforded by the `Seurat` package for performing publication-grade single cell RNAseq data analysis. It draws heavily on examples from the tutorial `Vignettes` provided by the [Seurat official documentation](https://satijalab.org/seurat/index.html), especially the introductory [PBMC 3K Guided Clustering Tutorial](https://satijalab.org/seurat/articles/pbmc3k_tutorial.html).

- ***Part I: Structure of the Seurat Object Data Container***
- ***Part II: Intro to the Seurat Standard Data Analysis Workflow***


# Part I: Seurat Object Structure and Data Access

We'll start by reading in a downsampled version of our full NSCI FTD single-cell dataset (all cells and timepoints) with `readRDS()`. This dataset contains all the same attributes as the original, just with fewer cells per cell type, for the sake of rendering speed in the tutorial examples.

```{r}
# Read in FTD data from .RDS format from data location on the cluster
seurat.obj <- readRDS("/data/AlzheimersDSData/FinalMergedData-downsampled.rds")

```


The RNAseq data we're mainly interested in working with for studying FTD-related transcriptomic trends is in the form of an *expression matrix* or *count matrix*. The values in this matrix represent the number (or count) of molecules for each gene (row) that are detected in each individual cell (column), as derived from one laboratory assay.

The expression matrix is contained inside a `Seurat` object, which we initialize directly with `readRDS()`. This data structure serves as a container for housing both data (like the count matrix) and associated analyses (like PCA, or clustering results) for one single-cell dataset. In this sense, the `Seurat` object affords us the ability to look at the same data at each stored "snapshot" of its processing.

At the top level, you could think of the Seurat object as a big filing cabinet of smaller individual storage containers (drawers); these are typically a collection of `Assay` and `DimReduc` objects, representing expression data and dimensionality reductions of the expression data, respectively. The `Assay` objects are designed to hold expression data of a single type, such as RNA-seq gene expression, CITE-seq ADTs, cell hashtags, or imputed gene values. `DimReduc` objects represent transformations of the data contained within the `Assay` object(s) via various dimensional reduction techniques such as PCA.

## Exploring the Seurat Container

The overall structure of the large `Seurat` containing our complete NSCI dataset of scRNAseq data is outlined here:

![seurat_obj](seurat_obj_structure.jpg){#id .class width=70% height=70%}


Summary information about `Seurat` objects can be had quickly and easily using standard R functions.

For example, we can get gene (column) and sample (row) counts, and a report of any major analyses done by calling the object itself:

```{r}
seurat.obj
```

The `Seurat` object stored in `"FinalMergedData-downsampled"` contains 5 data objects total- 2 `Assays`, and 3 `DimReducs`. `Assays` are gene expression information collected in the original sequencing experiment, and `DimReducs` contain the output of several different types of dimensional reduction analyses performed by the researchers on this dataset. These elements can be accessed by the names shown above, bracketed with `[[` - for example, `seurat.obj[["RNA"]]` returns the `"RNA"` assay object. The `features` refer to genes, and `samples` refer to individual cells.

Important notes for our data:

* From the readout above, we can see that the `active assay` is set to `SCT`. This is the correct behavior. This means that `Seurat` functions for plotting and analysis will draw on the `SCT` assay by default- as opposed to others within the structure, such as the raw data `RNA` assay.

* **What does `SCT` refer to?** `seurat.obj[["SCT"]]` is the version of the data following subjection to [SCTransform](https://satijalab.org/seurat/articles/sctransform_vignette.html), an advanced Seurat normalization/standardization routine. This algorithm generates normalized residuals for gene expression and deposits them in slot `seurat.obj[["SCT"]]@scale.data`. Residuals can be positive or negative. Positive residuals for a given gene in a given cell indicate that we observed more UMIs than expected given the gene’s average expression in the population and cellular sequencing depth, while negative residuals indicate the converse.

* `SCTransform` also generates "corrected counts", similar to log-normalized expression values, and deposits them in slot `seurat.obj[["SCT"]]@data`.

* The expression data of interest for the majority of our analysis purposes are these ***corrected counts***, stored in the sparse matrix at `seurat.obj[["SCT"]]@data`.


Summary information about internal data objects can be had quickly and easily using standard `R` functions. Object shape/dimensions can be found using the `dim`, `ncol`, and `nrow` functions; cell and feature names can be found using the `colnames` and `rownames` functions, respectively.

### Accessing Gene Expression Data

We can access this data frame of gene expression values for each gene and cell in our NSCI `Seurat` object either *manually*, as shown:


```{r}
my_rna <- seurat.obj[["SCT"]]@data
head(rownames(my_rna))
```

... or with the `Seurat` built-in accessor `GetAssayData`:

```{r}
my_rna <- GetAssayData(seurat.obj)
head(rownames(my_rna))
```
```{r}
dim(my_rna)
```
40096 total genes detected across 3400 cells.

So that's how to access gene expression data.. what if we want to get information on the actual samples themselves, the individual cells?

### Accessing Cell-Level Metadata

A typical interest for RNAseq data analyst might be to scope out what attributes were collected on the cells used in the original scRNAseq experiment being studied - for example, their age, the line they were drawn from, the experimental batch they were included in, etc.  `Seurat` stores all of this information for the entire corpus of single cells in a different `data frame`, accessible within the object metadata.

We can access this data frame of cell-level metadata in our NSCI `Seurat` object as shown:

```{r}
cell_attrs <- seurat.obj[["SCT"]]@misc$vst.out$cell_attr
# alternatively seurat.obj@meta.data
```

```{r}
# Peek at the available fields
str(cell_attrs)
```

We can access specific cells with `base R`  accessors:

```{r}
# retrieve a vector of all cell names
cells <- colnames(seurat.obj)
head(cells)
str(cells)

# get total number
ncol(seurat.obj)
```

... Or `Seurat` built-in accessors:

```{r}
# retrieve a vector of all cell names
cells <- Cells(seurat.obj)
head(cells)
str(cells)
```

### Managing the `Seurat` identity class of cells

```{r}
# View current "active" cell identity classes
# Used for plotting functions, grouping, etc.

head(Idents(seurat.obj))
```

```{r}
# Set "active" cell identity classes to a different attribute in
# the cell-level metadata
Idents(seurat.obj) <- "CellType"

head(Idents(seurat.obj))

```


*There are many additional convenience functions you cna use with `Seurat` objects beyond those discussed here. For more info on the `Seurat` object structure and data access, please check out the dedicated [GitHub Wiki](https://github.com/satijalab/seurat/wiki).*

---

# Part II: Intro to the Seurat Standard Data Analysis Workflow***

Now that we have a primer on how to work with a `Seurat` data container, let's dive in to the actual core data analysis functionalities available with the `Seurat` library.

This section of the notebook is set up in a "Call and Response" format- we will show a step by step walkthrough of the standard pre-processing and visualization workflow for scRNA-seq data in `Seurat`, using an example dataset. In conjunction, with each step, we will also show you how to access the resulting output in the large `Seurat` object containing the NSCI data from  [Bowles et al. (2020)](https://doi.org/10.1016/j.cell.2021.07.003).

The standard workflow steps encompass the selection and filtration of cells based on QC metrics, data normalization and scaling, the detection of highly variable features (differential expression testing), clustering, and visualization. Reminder that this [Nature Protocols tutorial](https://www.nature.com/articles/s41596-020-00409-w) is a great resource for a more in-depth discussion of the standard single-cell data processing workflow.

## Data

For this tutorial, we will be analyzing a dataset of gene expression in Peripheral Blood Mononuclear Cells (PBMC), a resource made publicly available from the [10X Genomics](https://www.10xgenomics.com/products/single-cell-gene-expression) single-cell sequencing platform. The dataset samples consist of 2,700 single cells that were sequenced on the Illumina NextSeq 500. *We won't need this for the tutorial, but the raw data can be downloaded [here](https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz)*.

We start by reading in the data. The `Seurat::Read10X()` function reads in the output of the 10X Genomics [cellranger pipeline](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger), returning a unique molecular identifier (UMI) count matrix. The values in this matrix represent the number of molecules for each feature (i.e. gene; row) that are detected in each cell (column) in the assay.

Load the PBMC example expression data from its remote location:

```{r}
pbmc.data <- Read10X(data.dir = "~/AlzheimersDS/pbmc/filtered_gene_bc_matrices/hg19/")
```

We next use the count matrix to create a `Seurat` object. This data structure serves as a container for housing both data (like the count matrix) and downstream analyses performed on the data (like PCA, or clustering results) for a given single-cell dataset. After initialization of the `Seurat` object data structure with `CreateSeuratObject()`, the original raw count matrix used for initialization will be accessible within the structure at `pbmc[["RNA"]]@counts`.

Initialize the Seurat object with the raw (non-normalized) example data:

```{r warning=FALSE}
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc

```
Note that the resulting data summary includes a note about there being "0 variable features (genes)" identified within the assay. This is because variable feature identiifcation is a step performed on the raw data as part of the standard data analysis workflow, and the object will be updated accordingly. We'll see that  each major step of the `Seurat` workflow results in an updated version of the original `Seurat` object.

The `CreateSeuratObject()` actually does a good bit of work to calculate dataset statistics automatically, and deposits the stats in a specific location in the `Seurat` object for later access. For example, the number of unique genes and total molecules for each cell are automatically calculated. You can access them in the object meta data here:

```{r}
# Show cell-level attributes for the first 5 cells
head(pbmc@meta.data, 5)
```

## Cell Quality Control Measures and Filtering

Seurat allows you to easily explore QC metrics and filter cells based on any user-defined criteria. A few QC metrics commonly employed in single-cell research to assess cell quality include:

* The number of unique genes detected in each cell
  * Low-quality cells or empty droplets will often have very few genes
  * Cell doublets or multiplets may exhibit an aberrantly high gene count
* Similarly, the total number of molecules detected within a cell (correlates strongly with unique genes)
* The percentage of reads that map to the mitochondrial genome
  * Low-quality / dying cells often exhibit extensive mitochondrial contamination
  * `Seurat` provides the `PercentageFeatureSet()` function to calculate the percentage of counts originating from a set of genes; it can be used to calculate the percentage of genes expressed in each cell originating from Mitochondria as shown:

```{r}
# calculate for each cell
# percentage of reads that map to the mitochondrial genome,
# producing a vector of one percentage value for each cell
percent.mt <- PercentageFeatureSet(pbmc, pattern = "^MT-")

# The [[ operator can add columns to object metadata.
# This is a great place to stash QC stats, because
# rows in the metadata are individual cells
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

kable(head(pbmc@meta.data, 5))
```

You can do QC by visualizing the available cell-level QC metrics, and using the distributions (or external knowledge) to determine filter criteria:

```{r}
# Visualize few selected QC metrics as violin plots
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
```

Based on these plots and our biological knowledge, we might choose to filter out cells that have unique gene counts over 2,500 or less than 200, and to filter out cells that have >5% mitochondrial counts. This can be done using the `subset()` fucntion:

```{r}
# filter
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
# re-render
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

```
## Data Normalization

After removing unwanted cells from the dataset, the next step is to normalize the data. To accomplish this,`Seurat` offers a global scaling normalization method `NormalizeData()` that normalizes the gene expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result.

After calling `NormalizeData()` on the `pbmc` dataset, Normalized values are stored in `pbmc[["RNA"]]@data.`

```{r}
pbmc <- NormalizeData(pbmc)

```

There are many additional arguments to this function which give the user more control over normlaization methods; for details, see the documentation [here](https://satijalab.org/seurat/reference/normalizedata).

### Access results in the NSCI Data

Our dataset is a little different in that the raw counts were subjected to the specialized normalization procedure `SCTransform`, which both normalizes and scales the data in one step.

As mentioned, `SCTransform` generates "corrected counts", similar to log-normalized expression values, and deposits them in slot `seurat.obj[["SCT"]]@data`.

For example:

```{r}
nsci_normalized_expression <- seurat.obj[["SCT"]]@data
```


## Finding Highly Variable Genes

Transcriptomic studies have found that identifying genes with high cell-to-cell expression variation, and focusing on these genes in downstream analysis, helps to highlight biological signal in single-cell datasets. In the standard `Seurat` data analysis workflow, the next logical step after data normalization is the calculation of this subset of genes that exhibit high cell-to-cell variation (i.e, they are highly expressed in some cells, and lowly expressed in others).

The procedure for this differential expression testing implemented in `Seurat` is described in detail [here](https://doi.org/10.1016/j.cell.2019.05.031). This algorithm is implemented in the `FindVariableFeatures()` function. By default, `Seurat` returns 2,000 features per dataset.

```{r}
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

```

The most variable genes are thereafter accessible in the object through the `VariableFeatures()` function.

We can access the ***overall** most-variable genes identified by differential expression testing (across all cells and sequencing timepoints) as follows:

```{r}
# Access the top 3000 genes with most-variable expression
# as previously identified by DE testing
# stored in this `Seurat` object
top_genes <- VariableFeatures(pbmc)
```


To check out only the top 10 most-variable genes:

```{r}
# Identify the 10 most highly variable genes
# Specifying the first n rows to show
# inside the head() function
top10 <- head(top_genes, 10)
top10
```

### Access results in the NSCI Data

```{r}
top_genes <- VariableFeatures(seurat.obj)
top10 <- head(top_genes, 10)
top10

```

By default, the `Seurat` function `FindVariableFeatures()` returns the top 2000 most-variable genes, but Bowles et al. recorded the top 3000 for good measure.

*We note *APOE* is 3rd among the top 10 genes showing the most variation for frontotemporal dementia. Mutant versions of *APOE* are known risk factors for late-onset Alzheimer's disease^[1](https://medlineplus.gov/genetics/gene/apoe/#conditions)^.*


`Seurat` provides a convenient method for plotting variable genes with `VariableFeaturePlot()`. We can label the top 10 overall most-variable genes with their gene symbols:

```{r echo=FALSE}
plot1 <- VariableFeaturePlot(seurat.obj)

plot_label_top10 <- LabelPoints(plot = plot1, points = VariableFeatures(seurat.obj)[1:10], repel = TRUE, xnudge=0,ynudge=0)
plot_label_top10
```


## Data Scaling

The next step in the Standard Workflow is to apply a linear transformation (‘scaling’) as a pre-processing step for dimensional reduction techniques like PCA. The `Seurat::ScaleData()` function:

* Shifts the expression of each gene, so that the mean expression across cells is 0
* Scales the expression of each gene, so that the variance across cells is 1
* This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate

The results of the data scaling are stored in `pbmc[["RNA"]]@scale.data`.

```{r}
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)

```

### Access results in the NSCI Data

Recall that the output of `SCTransform` is the counts data already normalized and scaled.


```{r}
nsci_normalized_expression <- seurat.obj[["SCT"]]@data
```

## Dimensional Reduction

Next the researcher would typically want to perform PCA on the scaled data. This technique is a way to simplify a high-dimensional dataset into a set of more meaningful components that retain the overall structure of the data. By default, only the previously determined 2000-3000 variable genes are used as input to PCA in `Seurat`.

```{r}
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))

```

Seurat provides several useful ways of visualizing both cells and features that define the PCA, including VizDimReduction(), DimPlot(), and DimHeatmap().

```{r}
# Examine and visualize PCA results a few different ways
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
## PC_ 1

```

```{r}
DimPlot(pbmc, reduction = "pca")
```

### Access results in the NSCI Data

As dimensional reduction has already been performed on the NSCI dataset, all of the standard `Seurat` plotting functions for visualizing and exploring the results of dimensional reduction will work accordingly when called on `seurat.obj`.

For example:

```{r}
DimPlot(seurat.obj)
```
***Challenge:*** Experiment with calling `DimPlot` on `seurat.obj` after changing the `active identities` to a different metadata attribute, such as `Age`.


```{r}
# Insert code here
```


***Challenge:*** Go into the documentation for `DimPlot` ([here](https://satijalab.org/seurat/reference/dimplot)) and see if you can figure out a way to split up the plot into 3 separate graphs- one for each time point- only by changing a function parameter.

```{r}
# Insert code here
```

### Alternative to PCA: Run Nonlinear Dimensional Reduction

Seurat offers several non-linear dimensional reduction techniques, such as tSNE and UMAP, to visualize and explore these datasets. The goal of these algorithms is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space.

Our NSCI dataset was subjected to UMAP in the original study, and as a result, `Seurat` plotting functions actually draw on this `DimReduc` output by default.

***Example; No need to run this code***

```{r}
# pbmc <- RunUMAP(pbmc, dims = 1:10)
```

## Cell Clustering

`Seurat` provides a built-in graph-based clustering approach to group cells with similar attribute profiles and gene expression patterns.
This clustering step is performed using the FindClusters() function, and takes as input the previously defined dimensionality of the dataset (specifically the first 10 PCs). The clusters can thereafter be found using the `Idents()` function, provided the object `active identity` is set to `seurat_clusters`.

```{r}
# Cluster cells
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
```

***What's the difference between the `seurat` clusters and our known `"CellType"` categories?***
* Cluster identification on the NSCI dataset was performed through the automated routine in `Seurat` described above. The identified clusters were then used to *inform* cluster annotation with specific cell type names; the final cell type grouping step was mainly accomplished through expert annotation (with the help of automated tools like `SingleR`). It's worth noting that the mapping between seurat clusters and our cell type identities is not one-to-one.

### Access results in the NSCI Data

```{r}
Idents(seurat.obj) <- "seurat_clusters"
levels(as.factor(Idents(seurat.obj)))
```
```{r}
Idents(seurat.obj) <- "CellType"
levels(as.factor(Idents(seurat.obj)))
```

## Testing for CellType-Specific Differential Gene Expression

As previewed in your Day 2 Take-Home Challenge, we can perform cell type biomarker identification for a cell type of interest using the `Seurat::FindMarkers()` function. The function call below uses the same parameters used in the Bowles 2021 paper to the `MAST` model to run differential expression testing for the celltype `"Ast"`.

***Note: Just a reminder of this step; Don't need to run this code!***

```{r}
# Perform differential expression testing to find top
# distinguishing genes for this cluster of cells
# We match the same parameters used in the Bowles 2021 paper

# # Find celltype markers for celltype = "Ast" (astrocytes)
# Ast.markers <- FindMarkers(seurat.all.large, ident.1 = "Ast", assay = "RNA",
#                # Limit testing to genes which show, on average,
#                # at least X-fold difference (log-scale) between the two groups of cells.
#                # Default is 0.25.
#                logfc.threshold = 0.25,
#                # `test.use` param denotes which type of differential
#                #expression test to use.  We choose "MAST", which identifies differentially expressed genes
#                # between groups of cells using a hurdle model ~tailored to scRNA-seq data~.
#                # Utilizes the `MAST` package to run the DE testing.
#                test.use= "MAST",
#                    # Required data slot to use with 'MAST' model
#                    slot = "counts",
#                    # Covariates to regress out in 'MAST' model. We adopt the protocol used in Bowles 2021,
#                    # which regressed out percentage of mitochondrial genes (percent.mt),
#                    # the number of cells collected per parent organoid (QCcells),
#                    # the number of genes per cell (nFeature_RNA) and the number of reads per cell (nCount_RNA).
#                    latent.vars = c("percent.mt","QCcells","nFeature_RNA", "nCount_RNA"),
#                # only test genes that are detected in a minimum fraction of min.pct cells
#                # in either of the two populations, for performance. Default is 0.1
#                min.pct = 0.5,
#                # Only return positive markers (FALSE by default)
#                only.pos = F)
#
```

You can save the resulting `Seurat` object at this point, so that it can easily be loaded back in without having to rerun the computationally intensive steps performed above, and all of your data will be safely retained within the data structure.

```{r}
#saveRDS(pbmc, file = "my_seurat_object_file.rds")
```

## Resources linked within this notebook:

* [`Seurat` introductory vignette- PBMC tutorial](https://satijalab.org/seurat/articles/pbmc3k_tutorial.html)
* [`Seurat` object GitHub wiki](https://github.com/satijalab/seurat/wiki)
* [`Seurat` object interaction tips](https://satijalab.org/seurat/articles/interaction_vignette.html)
* [QC metrics commonly used for classification of low quality cells](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4758103/)
* [`Seurat` variable feature identification paper](https://www.cell.com/cell/fulltext/S0092-8674(19)30559-8?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS0092867419305598%3Fshowall%3Dtrue)