Permalink
Cannot retrieve contributors at this time
Name already in use
A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
AlzheimersDS/Notebook3-Data-Analysis.Rmd
Go to fileThis commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
356 lines (253 sloc)
14.2 KB
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "Analysis of scRNAseq Data from Organoid Models of Frontotemporal Dementia" | |
subtitle: IDEA Alzheimer's Data Analysis Bootcamp - Summer 2022 | |
author: "Rachael White" | |
output: | |
pdf_document: | |
latex_engine: xelatex | |
html_document: | |
df_print: paged | |
--- | |
```{r setup, include=FALSE} | |
knitr::opts_chunk$set(echo = TRUE) | |
# Required R package installation (run this cell before knitting) | |
# These will install packages if they are not already installed | |
if (!require("ggplot2")) { | |
install.packages("ggplot2") | |
library(ggplot2) | |
} | |
if (!require("knitr")) { | |
install.packages("knitr") | |
library(knitr) | |
} | |
if (!require("dplyr")) { | |
install.packages("dplyr") | |
library(dplyr) | |
} | |
if (!require("mosaic")) { | |
install.packages("mosaic") | |
library(mosaic) | |
} | |
if (!require("tidyverse")) { | |
install.packages("tidyverse") | |
library(tidyverse) | |
} | |
if (!require("Seurat")) { | |
install.packages("Seurat") | |
library(Seurat) | |
} | |
``` | |
# Overview | |
This notebook contains examples of how to perform a few of the core data manipulation, aggregation, and `ggplot` visualization tasks commonly used for exploring scRNAseq data from FTD organoid models. Much of the included code is the same as that used to achieve the visualizations featured in the `AlzApp`. This includes slightly more advanced data manipulation exercises than you have seen thus far in the bootccamp, but we expect that you will be well-equipped at this stage to have a general sense of the data manipulation procedures and output. | |
# Data | |
We read in and analyze matricized subsets of `FinalMergedData.rds`. | |
```{r} | |
# Matrix subset of the data- G6 line only | |
G6cells <- readRDS("/data/AlzheimersDSData/FinalMergedData_linesG6A02-G6E11_all_timepts.rds") | |
``` | |
```{r} | |
# Also read in full dataset for fetching data subsets with Seurat::FetchData() | |
# (for convenience of data access in some exercises) | |
seurat.all <- readRDS("/data/AlzheimersDSData/FinalMergedData-downsampled.rds") | |
``` | |
```{r} | |
# peek data | |
G6cells[1:10,1:30] | |
``` | |
```{r} | |
# check dimensions | |
dim(G6cells) | |
``` | |
```{r} | |
# Identify first and last gene represented in the G6 data | |
# to be used later | |
first_gene <- colnames(G6cells)[25][[1]] | |
last_gene <- colnames(G6cells)[3024][[1]] | |
``` | |
# Data Aggregation and Visualization Examples | |
## Summarize gene expression levels by cell type, age and mutant status | |
To begin, this example illustrates how we would aggregate the overall data with respect to all of our experimental group categories- age, mutant vs. wildtype, and celltype. | |
```{r message=FALSE, warning=FALSE} | |
# Summarize expression by `CellType`, `Age`, `Mt` | |
# keeping all individual genes | |
# Here we use dplyr::summarize_at() | |
# to summarize ~only~ at the columns that contain gene expression values | |
expression_by_age_celltype <- G6cells %>% group_by(CellType,Age,Mt) %>% summarize_at(vars(all_of(first_gene):all_of(last_gene)), mean, na.rm=TRUE) | |
# peek result | |
kable(expression_by_age_celltype[1:5,1:10]) | |
``` | |
```{r} | |
# As a sanity check, | |
# we calculate the expected number of rows of the resulting dataframe | |
# given our known experimental group sizes, | |
# and compare to the actual value: | |
no_ages = 3 | |
no_celltypes = 21 | |
no_variants = 2 | |
expected_dim <- no_ages*no_celltypes*no_variants | |
expected_dim | |
``` | |
```{r} | |
nrow(expression_by_age_celltype) | |
``` | |
We can see that the dimensions do not match... | |
From this finding, we hit on a key feature of this dataset that is revealed by the analytical advantages of single-cell RNA sequencing data: _not all cell types are present at all time points_. For example, `Ast` cells (astrocytes) were only detected at 2 months. Thus, single-cell RNA seq allows us to find important time trends in gene expression data from a tissue-specific perspective. | |
We will explore this perspective further later in this notebook by visualizing the expression trends of each celltype over time. | |
## How to summarize expression level of _one_ gene over time, grouped by **Cell type** | |
A researcher might choose to perform a bulk-type analysis to better analyze expression trends for a given gene over time. To implement this, we can take the cumulative average of expression levels of the gene of interest at each of the 3 experimental time points, and split the visualization with respect to each cell type. | |
```{r} | |
gene <- "ELAVL4" | |
``` | |
```{r} | |
# create the new data frame | |
expression_means_for_this_gene <- G6cells %>% | |
dplyr::select(all_of(gene),Mt,Age,CellType) %>% | |
group_by(Age,CellType) %>% | |
summarise(mean_expr = mean(.data[[gene]])) | |
# convert all character columns to factors for plotting | |
expression_means_for_this_gene <- as.data.frame(unclass(expression_means_for_this_gene), stringsAsFactors = TRUE) | |
# convert Age to an integer | |
expression_means_for_this_gene$Age <- recode(as.factor(expression_means_for_this_gene$Age), "2mo"= 2, "4mo" = 4,"6mo" = 6) | |
# show result | |
expression_means_for_this_gene | |
``` | |
```{r message=FALSE, warning=FALSE} | |
#Plot the average expressions over time for each celltype | |
ggplot(expression_means_for_this_gene,aes(x=Age, y=mean_expr, col=CellType)) + | |
geom_line() + scale_x_continuous(breaks=c(2,4,6)) + | |
labs(title= sprintf("Gene expression averages over time for %s",gene), | |
x="Developmental timepoint (months)", | |
y = "Average Expression") + | |
# Use facet_wrap to make a separate plot for each cluster | |
facet_wrap(CellType ~.) + | |
theme_minimal() + | |
theme(legend.position="none") | |
``` | |
## How to summarize expression level of _one_ gene over time, grouped by **variant status** | |
Similarly, we can take the cumulative average of expression levels of the gene of interest at each of the 3 experimental time points, and split the visualization with respect to mutant vs wildtype gene expression: | |
```{r} | |
# summarize expression level of one gene, grouped by variant | |
expression_means_for_this_gene <- G6cells%>% | |
dplyr::select(all_of(gene),Mt,Age,CellType) %>% | |
group_by(Age, Mt) %>% | |
summarise(mean_expr = mean(.data[[gene]])) | |
# convert all character columns to factors for plotting | |
expression_means_for_this_gene <- as.data.frame(unclass(expression_means_for_this_gene), stringsAsFactors = TRUE) | |
# convert Age to an integer | |
expression_means_for_this_gene$Age <- recode(as.factor(expression_means_for_this_gene$Age), "2mo"= 2, "4mo" = 4,"6mo" = 6) | |
# show result | |
expression_means_for_this_gene | |
``` | |
```{r} | |
#Plot the average expressions over time for mt vs wt | |
ggplot(expression_means_for_this_gene,aes(x=Age, y=mean_expr, col=Mt)) + | |
geom_line() + | |
scale_x_continuous(breaks=c(2,4,6)) + | |
labs(title=sprintf("Gene expression averages over time for %s",gene), | |
x="Developmental timepoint (months)", | |
y = "Average expression") + | |
theme_minimal() | |
``` | |
# Visualize expression distribution of a gene of interest, in mutant vs. normal cells | |
## Baseline Violin Plots with `Seurat` | |
A violin plot is a common way to visualize expression distributions for a specific gene (or genes) of interest. | |
We can visualize trends across celltypes: | |
```{r} | |
VlnPlot(seurat.all, features=c("ELAVL4"),group.by="CellType",split.by="Mt",split.plot=TRUE) | |
``` | |
Or time: | |
```{r} | |
VlnPlot(seurat.all, features=c("ELAVL4"),group.by="Age",split.by="Mt",split.plot=TRUE) | |
``` | |
## Violin Plots with `ggplot2::geom_violin()` | |
You may find in your research that you want to exert greater control over the default `Seurat` rendering of your expression distributions. You can alternatively use `ggplot2` to achieve the equivalent or more advanced visuals. | |
This section uses a `Seurat` utility function `FetchData()` to extract specific subsets of our overall dataset, based on desired cell attributes, for convenience in plotting. | |
Fetch the data to be plotted: | |
```{r} | |
plotdata1 <- FetchData(seurat.all, c('UMAP_1', 'UMAP_2', 'CellType','Age','Mt','ELAVL4')) | |
head(plotdata1) | |
``` | |
```{r} | |
dim(plotdata1) | |
``` | |
Apply `ggplot2::geom_violin()` to render violin plot geometry, specifying the "Age" on the `x` axis and the expression level of the specific gene on the `y`: | |
```{r} | |
ggplot(plotdata1, aes(x=Age, y=ELAVL4, fill=Mt)) + | |
geom_violin(trim=FALSE) + scale_fill_brewer(palette="Paired",name="Variant") + | |
scale_y_continuous(breaks=c(0,1,2,3),name="Expression Level")+ | |
ylab("Expression Level") + | |
xlab("Developmental Timepoint") + | |
theme_minimal() | |
``` | |
As an example of the greater control you can achieve with `ggplot2` versus `Seurat` (or in conjunction with it), we can also use `facet_wrap()` on `CellType` to split up the overall plot into individual plots, one for each "CellType". You can think of `facet_wrap()` as away to arrange a list of plots into into rows and columns, so that the layout is the best fit for the number of plots. | |
```{r} | |
ggplot(plotdata1, aes(x=Age, y=ELAVL4, fill=Mt)) + | |
geom_violin(trim=FALSE) + scale_fill_brewer(palette="Paired",name="Variant") + | |
scale_y_continuous(breaks=c(0,1,2,3),name="Expression Level")+ | |
facet_wrap(~CellType) + | |
ylab("Expression Level") + | |
xlab("Developmental Timepoint") + | |
theme_minimal() | |
``` | |
# Generate heatmaps of gene expression for specific genes of interest | |
Heatmaps can be rendered easily using `Seurat` built-in function `DoHeatmap()`. Note that these visualizations of gene expression levels by default draw on the normalized residuals for gene expression in the `seurat` object slot `seurat.obj[["SCT"]]@scale.data`. The residuals can be positive or negative. Positive residuals for a given gene in a given cell indicate that we observed more UMIs than expected given the gene’s average expression in the population and cellular sequencing depth, while negative residuals indicate the converse. [(Read more about these values at the SCTransform docs).](https://satijalab.org/seurat/reference/sctransform) | |
Heatmaps can be plotted with annotation of the cell type grouping: | |
```{r} | |
Seurat::DoHeatmap(seurat.all, features = c("ELAVL4","APOE","TTR"), slot="scale.data", size = 3) + | |
# remove redundant celltype legend | |
guides(color = "none") | |
``` | |
... Or with respect to a specific celltype: | |
```{r} | |
my_celltype <- "Ast" | |
# DoHeatmap now shows a grouping bar, splitting the heatmap into groups or clusters. This can | |
# be changed with the `group.by` parameter | |
DoHeatmap(seurat.all, features = VariableFeatures(seurat.all)[1:10], cells = WhichCells(seurat.all, ident=my_celltype), slot="scale.data", size = 4, group.bar = FALSE, angle = 90) + | |
# remove redundant celltype legend | |
guides(color = "none") + | |
# add title | |
ggtitle(sprintf("Expression levels of the top 10 most-variable genes in the data, \nas expressed in %s cells",my_celltype)) | |
``` | |
See the DoHeatmap() documentation [here](https://satijalab.org/seurat/reference/doheatmap) for details on the function parameters and plot variations you can achieve. | |
# Feature Plots - Visualize Dimesional Reduction Cell Scatter Plots | |
As dimensional reduction has already been performed on the NSCI dataset, all of the standard `Seurat` plotting functions for visualizing and exploring the results of dimensional reduction will work accordingly when called on the data in a `Seurat` object format. `Seurat` provides several useful ways of visualizing both cells and features from dimensional reduction output, including `DimPlot()` and `FeaturePlot()`. | |
We can alternatively visualize these previously-calculated dimensional reductions (projections of the expression data of all cells into a lower-dimensional space) as ordinary scatter plots. You may find this useful if you're interested in exerting greater control over the aesthetics of your feature plots for a project or paper. | |
In these plots, the principal components and axes `UMAP1` and `UMAP2` capture the majority of variability in gene expression across the entire single-cell dataset. Data points plotted are individual cells. | |
## Feature plots colored by gene expression level | |
```{r} | |
# Use Seurat utility function Seurat::FetchData() to extract a dataframe for plotting | |
# the result has rows for individual cells, and | |
# columns containing UMAP1, UMAP2, and the expression data for this gene of interestg | |
plotdata1 <- FetchData(seurat.all, c('UMAP_1', 'UMAP_2', gene)) | |
# Feed the data into ggplot pipeline | |
# We use ggplot with geom_point() to render scatterplot | |
# of the data points (individual cells) after dimensionality reduction, | |
# with coloring to denote expression level of this gene | |
plotdata1 %>% ggplot(aes_string(x = 'UMAP_1', y = 'UMAP_2', color=gene)) + | |
# render scatterplot, optionally adjust point size | |
geom_point(size = 0.1) + | |
# adjust plot aspect ratio | |
coord_fixed(ratio = 1) + | |
# specify color gradient and a name for the associated legend | |
scale_colour_gradient(low = "lightgrey", high = "blue", name = "Expression Level") + | |
# set plot labels | |
xlab("UMAP 1") + ylab("UMAP 2") | |
``` | |
## Feature plots colored by cell type class | |
```{r} | |
# Use FetchData() to extract a different dataframe for plotting, | |
# This time with columns containing UMAP1, UMAP2, | |
# and the expression data averaged by cell type | |
FetchData(seurat.all, c('UMAP_1', 'UMAP_2', 'CellType')) %>% | |
ggplot(aes_string(x = 'UMAP_1', y = 'UMAP_2', color = 'CellType')) + | |
geom_point(size = 0.1) + | |
coord_fixed(ratio = 1) + | |
scale_colour_discrete(name = "Cell Type") + | |
xlab("UMAP 1") + ylab("UMAP 2") | |
``` | |
## Recommended Summer Challenge | |
An outstanding analysis type of interest that has not yet been implemented at the celltype-specific level for the Bowles et al. NSCI dataset is **pseudotime analysis** of gene expression trends. | |
In development, disease, and throughout life, cells transition from one state to another. Pseudotime analysis is a time series perspective of expression data that allows the robust tracking of cellular changes over (pseudo) time. Pseudotime can be simply thought of as a representation of how far a cell has moved through biological progress. | |
The R library `Monocle3` serves as the standard implementation for pseudotime analysis. Monocle introduced the strategy of using RNA-Seq for single-cell trajectory analysis. Rather than purifying cells into discrete states experimentally, Monocle uses an algorithm to learn the sequence of gene expression changes each cell must go through as part of a dynamic biological process. | |
A great idea for your summer challenge could be to implement a pseudotime analyis of trasncriptomic trends for your celltype. | |
More info on the `Monocle3` package and pseudotime implementation can be found [here](https://cole-trapnell-lab.github.io/monocle3/docs/trajectories/). | |