Skip to content
Permalink
main
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Go to file
 
 
Cannot retrieve contributors at this time
---
title: "Analysis of scRNAseq Data from Organoid Models of Frontotemporal Dementia"
subtitle: IDEA Alzheimer's Data Analysis Bootcamp - Summer 2022
author: "Rachael White"
output:
pdf_document:
latex_engine: xelatex
html_document:
df_print: paged
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# Required R package installation (run this cell before knitting)
# These will install packages if they are not already installed
if (!require("ggplot2")) {
install.packages("ggplot2")
library(ggplot2)
}
if (!require("knitr")) {
install.packages("knitr")
library(knitr)
}
if (!require("dplyr")) {
install.packages("dplyr")
library(dplyr)
}
if (!require("mosaic")) {
install.packages("mosaic")
library(mosaic)
}
if (!require("tidyverse")) {
install.packages("tidyverse")
library(tidyverse)
}
if (!require("Seurat")) {
install.packages("Seurat")
library(Seurat)
}
```
# Overview
This notebook contains examples of how to perform a few of the core data manipulation, aggregation, and `ggplot` visualization tasks commonly used for exploring scRNAseq data from FTD organoid models. Much of the included code is the same as that used to achieve the visualizations featured in the `AlzApp`. This includes slightly more advanced data manipulation exercises than you have seen thus far in the bootccamp, but we expect that you will be well-equipped at this stage to have a general sense of the data manipulation procedures and output.
# Data
We read in and analyze matricized subsets of `FinalMergedData.rds`.
```{r}
# Matrix subset of the data- G6 line only
G6cells <- readRDS("/data/AlzheimersDSData/FinalMergedData_linesG6A02-G6E11_all_timepts.rds")
```
```{r}
# Also read in full dataset for fetching data subsets with Seurat::FetchData()
# (for convenience of data access in some exercises)
seurat.all <- readRDS("/data/AlzheimersDSData/FinalMergedData-downsampled.rds")
```
```{r}
# peek data
G6cells[1:10,1:30]
```
```{r}
# check dimensions
dim(G6cells)
```
```{r}
# Identify first and last gene represented in the G6 data
# to be used later
first_gene <- colnames(G6cells)[25][[1]]
last_gene <- colnames(G6cells)[3024][[1]]
```
# Data Aggregation and Visualization Examples
## Summarize gene expression levels by cell type, age and mutant status
To begin, this example illustrates how we would aggregate the overall data with respect to all of our experimental group categories- age, mutant vs. wildtype, and celltype.
```{r message=FALSE, warning=FALSE}
# Summarize expression by `CellType`, `Age`, `Mt`
# keeping all individual genes
# Here we use dplyr::summarize_at()
# to summarize ~only~ at the columns that contain gene expression values
expression_by_age_celltype <- G6cells %>% group_by(CellType,Age,Mt) %>% summarize_at(vars(all_of(first_gene):all_of(last_gene)), mean, na.rm=TRUE)
# peek result
kable(expression_by_age_celltype[1:5,1:10])
```
```{r}
# As a sanity check,
# we calculate the expected number of rows of the resulting dataframe
# given our known experimental group sizes,
# and compare to the actual value:
no_ages = 3
no_celltypes = 21
no_variants = 2
expected_dim <- no_ages*no_celltypes*no_variants
expected_dim
```
```{r}
nrow(expression_by_age_celltype)
```
We can see that the dimensions do not match...
From this finding, we hit on a key feature of this dataset that is revealed by the analytical advantages of single-cell RNA sequencing data: _not all cell types are present at all time points_. For example, `Ast` cells (astrocytes) were only detected at 2 months. Thus, single-cell RNA seq allows us to find important time trends in gene expression data from a tissue-specific perspective.
We will explore this perspective further later in this notebook by visualizing the expression trends of each celltype over time.
## How to summarize expression level of _one_ gene over time, grouped by **Cell type**
A researcher might choose to perform a bulk-type analysis to better analyze expression trends for a given gene over time. To implement this, we can take the cumulative average of expression levels of the gene of interest at each of the 3 experimental time points, and split the visualization with respect to each cell type.
```{r}
gene <- "ELAVL4"
```
```{r}
# create the new data frame
expression_means_for_this_gene <- G6cells %>%
dplyr::select(all_of(gene),Mt,Age,CellType) %>%
group_by(Age,CellType) %>%
summarise(mean_expr = mean(.data[[gene]]))
# convert all character columns to factors for plotting
expression_means_for_this_gene <- as.data.frame(unclass(expression_means_for_this_gene), stringsAsFactors = TRUE)
# convert Age to an integer
expression_means_for_this_gene$Age <- recode(as.factor(expression_means_for_this_gene$Age), "2mo"= 2, "4mo" = 4,"6mo" = 6)
# show result
expression_means_for_this_gene
```
```{r message=FALSE, warning=FALSE}
#Plot the average expressions over time for each celltype
ggplot(expression_means_for_this_gene,aes(x=Age, y=mean_expr, col=CellType)) +
geom_line() + scale_x_continuous(breaks=c(2,4,6)) +
labs(title= sprintf("Gene expression averages over time for %s",gene),
x="Developmental timepoint (months)",
y = "Average Expression") +
# Use facet_wrap to make a separate plot for each cluster
facet_wrap(CellType ~.) +
theme_minimal() +
theme(legend.position="none")
```
## How to summarize expression level of _one_ gene over time, grouped by **variant status**
Similarly, we can take the cumulative average of expression levels of the gene of interest at each of the 3 experimental time points, and split the visualization with respect to mutant vs wildtype gene expression:
```{r}
# summarize expression level of one gene, grouped by variant
expression_means_for_this_gene <- G6cells%>%
dplyr::select(all_of(gene),Mt,Age,CellType) %>%
group_by(Age, Mt) %>%
summarise(mean_expr = mean(.data[[gene]]))
# convert all character columns to factors for plotting
expression_means_for_this_gene <- as.data.frame(unclass(expression_means_for_this_gene), stringsAsFactors = TRUE)
# convert Age to an integer
expression_means_for_this_gene$Age <- recode(as.factor(expression_means_for_this_gene$Age), "2mo"= 2, "4mo" = 4,"6mo" = 6)
# show result
expression_means_for_this_gene
```
```{r}
#Plot the average expressions over time for mt vs wt
ggplot(expression_means_for_this_gene,aes(x=Age, y=mean_expr, col=Mt)) +
geom_line() +
scale_x_continuous(breaks=c(2,4,6)) +
labs(title=sprintf("Gene expression averages over time for %s",gene),
x="Developmental timepoint (months)",
y = "Average expression") +
theme_minimal()
```
# Visualize expression distribution of a gene of interest, in mutant vs. normal cells
## Baseline Violin Plots with `Seurat`
A violin plot is a common way to visualize expression distributions for a specific gene (or genes) of interest.
We can visualize trends across celltypes:
```{r}
VlnPlot(seurat.all, features=c("ELAVL4"),group.by="CellType",split.by="Mt",split.plot=TRUE)
```
Or time:
```{r}
VlnPlot(seurat.all, features=c("ELAVL4"),group.by="Age",split.by="Mt",split.plot=TRUE)
```
## Violin Plots with `ggplot2::geom_violin()`
You may find in your research that you want to exert greater control over the default `Seurat` rendering of your expression distributions. You can alternatively use `ggplot2` to achieve the equivalent or more advanced visuals.
This section uses a `Seurat` utility function `FetchData()` to extract specific subsets of our overall dataset, based on desired cell attributes, for convenience in plotting.
Fetch the data to be plotted:
```{r}
plotdata1 <- FetchData(seurat.all, c('UMAP_1', 'UMAP_2', 'CellType','Age','Mt','ELAVL4'))
head(plotdata1)
```
```{r}
dim(plotdata1)
```
Apply `ggplot2::geom_violin()` to render violin plot geometry, specifying the "Age" on the `x` axis and the expression level of the specific gene on the `y`:
```{r}
ggplot(plotdata1, aes(x=Age, y=ELAVL4, fill=Mt)) +
geom_violin(trim=FALSE) + scale_fill_brewer(palette="Paired",name="Variant") +
scale_y_continuous(breaks=c(0,1,2,3),name="Expression Level")+
ylab("Expression Level") +
xlab("Developmental Timepoint") +
theme_minimal()
```
As an example of the greater control you can achieve with `ggplot2` versus `Seurat` (or in conjunction with it), we can also use `facet_wrap()` on `CellType` to split up the overall plot into individual plots, one for each "CellType". You can think of `facet_wrap()` as away to arrange a list of plots into into rows and columns, so that the layout is the best fit for the number of plots.
```{r}
ggplot(plotdata1, aes(x=Age, y=ELAVL4, fill=Mt)) +
geom_violin(trim=FALSE) + scale_fill_brewer(palette="Paired",name="Variant") +
scale_y_continuous(breaks=c(0,1,2,3),name="Expression Level")+
facet_wrap(~CellType) +
ylab("Expression Level") +
xlab("Developmental Timepoint") +
theme_minimal()
```
# Generate heatmaps of gene expression for specific genes of interest
Heatmaps can be rendered easily using `Seurat` built-in function `DoHeatmap()`. Note that these visualizations of gene expression levels by default draw on the normalized residuals for gene expression in the `seurat` object slot `seurat.obj[["SCT"]]@scale.data`. The residuals can be positive or negative. Positive residuals for a given gene in a given cell indicate that we observed more UMIs than expected given the gene’s average expression in the population and cellular sequencing depth, while negative residuals indicate the converse. [(Read more about these values at the SCTransform docs).](https://satijalab.org/seurat/reference/sctransform)
Heatmaps can be plotted with annotation of the cell type grouping:
```{r}
Seurat::DoHeatmap(seurat.all, features = c("ELAVL4","APOE","TTR"), slot="scale.data", size = 3) +
# remove redundant celltype legend
guides(color = "none")
```
... Or with respect to a specific celltype:
```{r}
my_celltype <- "Ast"
# DoHeatmap now shows a grouping bar, splitting the heatmap into groups or clusters. This can
# be changed with the `group.by` parameter
DoHeatmap(seurat.all, features = VariableFeatures(seurat.all)[1:10], cells = WhichCells(seurat.all, ident=my_celltype), slot="scale.data", size = 4, group.bar = FALSE, angle = 90) +
# remove redundant celltype legend
guides(color = "none") +
# add title
ggtitle(sprintf("Expression levels of the top 10 most-variable genes in the data, \nas expressed in %s cells",my_celltype))
```
See the DoHeatmap() documentation [here](https://satijalab.org/seurat/reference/doheatmap) for details on the function parameters and plot variations you can achieve.
# Feature Plots - Visualize Dimesional Reduction Cell Scatter Plots
As dimensional reduction has already been performed on the NSCI dataset, all of the standard `Seurat` plotting functions for visualizing and exploring the results of dimensional reduction will work accordingly when called on the data in a `Seurat` object format. `Seurat` provides several useful ways of visualizing both cells and features from dimensional reduction output, including `DimPlot()` and `FeaturePlot()`.
We can alternatively visualize these previously-calculated dimensional reductions (projections of the expression data of all cells into a lower-dimensional space) as ordinary scatter plots. You may find this useful if you're interested in exerting greater control over the aesthetics of your feature plots for a project or paper.
In these plots, the principal components and axes `UMAP1` and `UMAP2` capture the majority of variability in gene expression across the entire single-cell dataset. Data points plotted are individual cells.
## Feature plots colored by gene expression level
```{r}
# Use Seurat utility function Seurat::FetchData() to extract a dataframe for plotting
# the result has rows for individual cells, and
# columns containing UMAP1, UMAP2, and the expression data for this gene of interestg
plotdata1 <- FetchData(seurat.all, c('UMAP_1', 'UMAP_2', gene))
# Feed the data into ggplot pipeline
# We use ggplot with geom_point() to render scatterplot
# of the data points (individual cells) after dimensionality reduction,
# with coloring to denote expression level of this gene
plotdata1 %>% ggplot(aes_string(x = 'UMAP_1', y = 'UMAP_2', color=gene)) +
# render scatterplot, optionally adjust point size
geom_point(size = 0.1) +
# adjust plot aspect ratio
coord_fixed(ratio = 1) +
# specify color gradient and a name for the associated legend
scale_colour_gradient(low = "lightgrey", high = "blue", name = "Expression Level") +
# set plot labels
xlab("UMAP 1") + ylab("UMAP 2")
```
## Feature plots colored by cell type class
```{r}
# Use FetchData() to extract a different dataframe for plotting,
# This time with columns containing UMAP1, UMAP2,
# and the expression data averaged by cell type
FetchData(seurat.all, c('UMAP_1', 'UMAP_2', 'CellType')) %>%
ggplot(aes_string(x = 'UMAP_1', y = 'UMAP_2', color = 'CellType')) +
geom_point(size = 0.1) +
coord_fixed(ratio = 1) +
scale_colour_discrete(name = "Cell Type") +
xlab("UMAP 1") + ylab("UMAP 2")
```
## Recommended Summer Challenge
An outstanding analysis type of interest that has not yet been implemented at the celltype-specific level for the Bowles et al. NSCI dataset is **pseudotime analysis** of gene expression trends.
In development, disease, and throughout life, cells transition from one state to another. Pseudotime analysis is a time series perspective of expression data that allows the robust tracking of cellular changes over (pseudo) time. Pseudotime can be simply thought of as a representation of how far a cell has moved through biological progress.
The R library `Monocle3` serves as the standard implementation for pseudotime analysis. Monocle introduced the strategy of using RNA-Seq for single-cell trajectory analysis. Rather than purifying cells into discrete states experimentally, Monocle uses an algorithm to learn the sequence of gene expression changes each cell must go through as part of a dynamic biological process.
A great idea for your summer challenge could be to implement a pseudotime analyis of trasncriptomic trends for your celltype.
More info on the `Monocle3` package and pseudotime implementation can be found [here](https://cole-trapnell-lab.github.io/monocle3/docs/trajectories/).