From a509161b6d23b38baa0e24c69843a6f8084b3fcd Mon Sep 17 00:00:00 2001 From: Alec Nabb Date: Fri, 17 Jun 2022 11:32:02 -0400 Subject: [PATCH] Changes to Rmd notebooks. --- .../Notebook1-Intro-To-The-Data-nabba.Rmd | 8 +- ...book4-Data-Access-and-Extraction-nabba.Rmd | 511 ++++++++++++++++++ ...Notebook5-Biological-Connections-nabba.Rmd | 159 ++++++ 3 files changed, 677 insertions(+), 1 deletion(-) create mode 100644 student-notebooks/Notebook4-Data-Access-and-Extraction-nabba.Rmd create mode 100644 student-notebooks/Notebook5-Biological-Connections-nabba.Rmd diff --git a/student-notebooks/Notebook1-Intro-To-The-Data-nabba.Rmd b/student-notebooks/Notebook1-Intro-To-The-Data-nabba.Rmd index caa950f..77c7d58 100644 --- a/student-notebooks/Notebook1-Intro-To-The-Data-nabba.Rmd +++ b/student-notebooks/Notebook1-Intro-To-The-Data-nabba.Rmd @@ -478,13 +478,19 @@ plot6 ``` ### **CHALLENGE 2:** Geometry Juggling with GGplot2 +##^lol Didn't mean for that to rhyme... For this challenge exercise, start with the data in `my_ct_genes.df` and come up with an alternative visualization geometry for rendering the expression distributions of your set of celltype marker genes. *Hint*: See this [cheatsheet](https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf) for all of the cool options available for different visualization tasks in `ggplot2`. Specifically, we want a graph type falling under `discrete x, continuous y`. ```{r} -# Insert your code here +genes_dot_plot<-ggplot(my_ct_genes.df, aes(x=Gene, y=ExpressionLevel)) + + facet_wrap(~ Mt) + + geom_dotplot(binwidth=0.0000002, binaxis="y", stackdir="center") + +genes_dot_plot + ``` diff --git a/student-notebooks/Notebook4-Data-Access-and-Extraction-nabba.Rmd b/student-notebooks/Notebook4-Data-Access-and-Extraction-nabba.Rmd new file mode 100644 index 0000000..d4d912e --- /dev/null +++ b/student-notebooks/Notebook4-Data-Access-and-Extraction-nabba.Rmd @@ -0,0 +1,511 @@ +--- +title: "Data Manipulation and Pre-Processing for Single-Cell RNAseq Data Analysis" +author: "Alec Nabb" +output: + pdf_document: default + html_document: + df_print: paged +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) + +# Required R package installation (run this cell before knitting) +# These will install packages if they are not already installed + +if (!require("ggplot2")) { + install.packages("ggplot2") + library(ggplot2) +} +if (!require("knitr")) { + install.packages("knitr") + library(knitr) +} +if (!require("tidyverse")) { + install.packages("tidyverse") + library(tidyverse) +} +if (!require("Seurat")) { + install.packages("Seurat") + library(Seurat) + + } + +options(tinytex.verbose = TRUE) + +``` + + +## Change the code below to your celltype code + +```{r} +celltype <- "vRG" +``` + +## Overview + +This notebook contains a data processing workflow for extracting single-cell RNA Sequencing data directly from `Seurat` objects (from the R package `Seurat`) into R matrices. + +The workflow documented here demonstrates how to access subsets of the complete [Bowles et al. (2020)](https://doi.org/10.1016/j.cell.2021.07.003) NSCI dataset, as stored in `FinalMergedData.rds`, based on a combination of cell attributes specified by the user, and extract the resulting subsets into R matrices. Later, we show examples of how to perform similar extraction tasks using automated built-in `Seurat` functions. + +We'll start by reading in a downsampled version of our full NSCI FTD single-cell dataset (all cells and timepoints) with `readRDS()`. This dataset contains all the same attributes as the original, just with fewer cells per cell type, for the sake of rendering speed in the tutorial examples. + +## Data + +```{r} +# Read in FTD data from .RDS format from data location on the cluster +seurat.obj <- readRDS("/data/AlzheimersDSData/FinalMergedData-downsampled.rds") + +``` + + +### *Reminder of important metadata from this dataset to be familiar with:* + +Cell Line Key + +| Line | Wildtype/mutant | +|----------|-------------------------| +| ND-B06 | Wildtype | +| ND-B09 | Mutant | +| GIH6-E11 | Wildtype | +| GIH6-A02 | Mutant | +| GIH7-F02 | Wildtype | +| GIH7-B12 | Wildtype | +| GIH7-A01 | Mutant | + +**_In the data, "wildtype" and "mutant" are denoted by "V337V" and "V337M", respectively._** + +Previous research conducted on this dataset clustered the cells into 16 primary neuronal cell type categories. These categories are represented as shown below. + +Cell Type Key + +- Ast - astrocytes + +- ExDp1- excitatory deep layer 1 + +- ExDp2- excitatory deep layer 2 + +- ExM- maturing excitatory + +- ExM-U- maturing excitatory upper enriched + +- ExN- newborn excitatory + +- Glia- unspecified glia/non-neuronal cells + +- InCGE- interneurons caudal ganglionic eminence + +- InMGE- interneurons medial ganglionic eminence + +- IP- intermediate progenitors + +- OPC- oligodendrocyte precursor cells + +- oRG- outer radial glia + +- PgG2M- cycling progenitors (G2/M phase) + +- PgS- cycling progenitors (S phase) + +- UN- unspecified neurons + +- vRG- ventricular radial glia. + + +# Setup: Scoping out the cell attributes + +A typical interest for RNAseq data analyst might be to scope out what attributes were collected on the cells used in the original scRNAseq experiment being studied - for example, their age, the line they were drawn from, the experimental batch they were included in, etc. `Seurat` stores all of this information for the entire corpus of single cells in `data frame`, accessible within the object metadata. + +We can access this data frame of cell-level metadata in our NSCI dataset as shown: + +```{r} +cell_attrs <- seurat.obj[["SCT"]]@misc$vst.out$cell_attr +``` + +```{r} +# Peek at the available fields +str(cell_attrs) +``` + +The 3 sample timepoints at which expression data were recorded can be accessed at `cell_attrs$Age`: + +```{r} +levels(as.factor(cell_attrs$Age)) + +``` + +Note that the data type of this feature is not numeric, as might be expected. + +```{r} +class(levels(as.factor(cell_attrs$Age))) + +``` + +We can similarly check how the 2 variants are represented: + +```{r} +levels(as.factor(cell_attrs$Mt)) + +``` + +And our cell types: + +```{r} +levels(as.factor(cell_attrs$CellType)) + +``` + + +# **Workflow: Extracting a subset of cells and the corresponding genes from a `Seurat` object** + +This section extracts the subset of N cells having attributes: + +1. Belonging to **ND cell line** (both mutant and wildtype) + +2. Collected at timepoint **2 months**. + +### **Step 1.** Identify cell IDs of desired subset of cells, based on the attribute columns in the cell_attrs dataframe + +Let's say for example we are interested in analyzing the data for just one cell line, line **ND**, sampled at **2 months**. Then we want to extract the cells with IDs `ND-B06` _or_ `ND-B09`, the mutant and wildtype equivalents of the same cell line. We also need to require `Age == "2mo"`. + +```{r} + +# dplyr pipeline to extract desired cell subset +ND.cells.df <- cell_attrs %>% filter((ID == "B06" | ID == "B09") & Age == "2mo") + +# View result +head(ND.cells.df) +str(ND.cells.df) + +# From this dataframe, we can get the list of cell IDs for our extracted subset: +ND_cells <- rownames(ND.cells.df) + +``` + + +### **Step 2.** Use the identified list or vector of cell IDs to extract the gene expression data for the corresponding cells. We will extract the expression values for the 3000 top most-variable genes. + + +```{r} + +# Access the gene expression matrix: +expression.matrix <- seurat.obj[["SCT"]]@data +str(expression.matrix) + +# convert to data frame for easier data manipulation +exp.df <- as.data.frame(expression.matrix) +dim(exp.df) + +``` + +Note the difference in formatting from the cell attributes data frame: + +* `cell_attrs`: cells are rows, attributes are columns +* `exp.df`: genes are rows, cells are columns + +```{r} +# Fetch gene expression data for the ND line cells only +ND.exp.df <- exp.df %>% dplyr::select(any_of(ND_cells)) + +``` + +### Step 3. Align and join cell attributes with gene expression data + +```{r} + +# Transpose gene expression df to match format of cell attributes df +# We want individual cells as rows and genes as columns + +#t() transposes a matrix-like object (swaps the columns and rows) +ND.exp.df.T <- t(ND.exp.df) + +# show swapped dimensions +# before: +dim(ND.exp.df) +# after: +dim(ND.exp.df.T) +``` + +```{r} +# R coerced it to a matrix for the operation, +# so we switch it back to a dataframe +ND.exp.df.T <- as.data.frame(ND.exp.df.T) + +``` + +```{r} +# next we want to align and join the two dataframes side-by-side +# to do so, we first need to create a common column of features to join them on +# we use the cell IDs, because we know both structures have the same set of cells. + +# convert rownames (cell IDs) into a column to join the two dataframes on +ND.cells.df <- rownames_to_column(ND.cells.df, var = "Cell_ID") +ND.exp.df.T <- rownames_to_column(ND.exp.df.T, var = "Cell_ID") + +``` + +```{r} +# Do the merge, using an inner_join to preserve only common elements +# (in this case we know they match, so the join type doesn't really matter) +merged.df <- inner_join(x=ND.cells.df, y=ND.exp.df.T, by="Cell_ID") + +dim(merged.df) + +``` + +```{r} +# We can use `dplyr` functions +# if we want to select only a few specific features to keep: +# adding `-` before a column name drops that column, and keeps all others +merged.df.clean <- merged.df %>% + dplyr::select(-Batch,-HTO,-Run,-percent.mt,-QCcells,-integrated_snn_res.0.5, -CellTypeRaw) +``` + +```{r} +# Optionally export the data +#saveRDS(merged.df,"~/Alzheimers_DS/GlutNeurons_NDline_2months_bothvariants_expression_data.rds") +``` + + +# Functionalized version of the data subsetting workflow + +We provide a convenience function `subsetCells()` for you to use in your summer work, if desired. This function can be used to extract a subset of the overall NSCI dataset containing only a specific Tau Consortium cell line (from one of the three main lines used in Bowles et al 2021). + +It takes as input a `Seurat` object such as the one containing our overall dataset (`/data/AlzheimersDSData/FinalMergedData-downsampled.rds`), the specification of a desired cell line, and additional parameters described below, and outputs a .rds file containing the specified subset. + +```{r} + +# Written 2/22/2022 +# Last updated 3/28/2022 +# Function to subset cell data stored in Seurat object by cell line +# and join with corresponding gene expression data (SCTransformed) +# from same object. +# author: Rachael White (whiter9) +# +# Inputs: +# - Seurat object containing an "SCT" assay +# - IDs: character vector of size 2 containing an ID pair +# - age: character, one of "2mo", "4mo", "6mo", or "all" +# - slot: one of "data" or "scale.data"; defaults to "data" +# +# Returns: Dataframe with *only* cell IDs from the input cell line as rows, +# with cell attributes and genes expressed as columns. +# Saves data subset as .rds. +# +# Note: Assumes "ID" and not "orig.idents" is the correct term +# to subset cells on. +################################################################### +subsetCells <- function(seurat.obj,IDs,age, slot="data") { + + require(tidyverse) + + if (slot=="scale.data") { + expression.matrix <- seurat.obj[["SCT"]]@scale.data + } else { + expression.matrix <- seurat.obj[["SCT"]]@data + } + + cell_attrs <- seurat.obj[["SCT"]]@misc$vst.out$cell_attr + + #Step 1: Identify cell IDs of desired subset of cells (rows), + #based on the attribute columns in cell_attrs df, + # as well as the gene IDs of the 3000 top most-variable genes + if (age=="all") { + cells.df <- cell_attrs %>% filter((ID == IDs[1] | ID == IDs[2])) + } else { + cells.df <- cell_attrs %>% filter((ID == IDs[1] | ID == IDs[2]) & Age == age) + } + # get list of desired cell IDs + # (use intersect to ensure they are present in both objects): + cells <- intersect(rownames(cells.df),colnames(expression.matrix)) + # get list of the 3000 most-variable genes + top_3000_genes <- seurat.obj[["SCT"]]@var.features + + #Step 2: Use the identified cell and gene IDs to extract the gene expression data + #for the corresponding cells and genes + exp.M <- as.matrix(expression.matrix[top_3000_genes,cells]) + + #Step 3: Align and join cell attributes with gene expression data + # Transpose gene expression df to match format of cell attributes df + # We want individual cells as rows and genes as columns + exp.M.rotated <- t(exp.M) + exp.df <- as.data.frame(exp.M.rotated) + # Convert rownames (cell IDs) into a column to join on + cells.df <- tibble::rownames_to_column(cells.df, var = "Cell_ID") + exp.df <- tibble::rownames_to_column(exp.df, var = "Cell_ID") + # Do the merge + merged.df <- right_join(x=cells.df, y=exp.df, by="Cell_ID") + + #Step 4: Export data + if (age=="all") { + savefile = sprintf("FinalMergedData_lines%s-%s_%s_timepts.rds", IDs[1], IDs[2], age) + } else { + savefile = sprintf("FinalMergedData_lines%s-%s_%snths.rds", IDs[1], IDs[2], age) + } + + saveRDS(merged.df,savefile) + + return(merged.df) +} + +################################################################### +``` + + + +# Example data subset generation + +**NOTE:** The following section is included as a usage example; doesn't need to be run. + +### Read in full dataset (all cell types, total cell count 379142) + +```{r} +#seurat.complete <- readRDS("/data/AlzheimersDSData/FinalMergedData.rds") +``` + +### Create subsets of complete dataset compiled by specified cell line and time period + +```{r} +# test <- subsetCells(seurat.complete,IDs=c("B06","B09"),age="2mo") +# dim(test) +# test[1,1:10] + +``` + +```{r} +# ND2mo <- subsetCells(seurat.complete,IDs=c("B06","B09"),age="2mo") +``` + +### Subset complete dataset by cell line only, include all sequencing time points + +```{r} +# G6cells_all <- subsetCells(seurat.complete,IDs=c("G6A02","G6E11"),age="all") +``` + + +# Data Access and Subsetting the `Seurat` Way + +`Seurat` provides built-in functions for selecting particular cells from your dataset and subsetting the `Seurat` object on those cells. While we've shown how you could create data subsets in base `R`, the `Seurat` methods are probably the easiest and safest way to create data subsets for your own work for most purposes. The following examples will illustrate this functionality. + +First make sure the active identities of cells is set to `CellType`: + +```{r} +Idents(object = seurat.obj) <- "CellType" +levels(Idents(seurat.obj)) + +``` + +## Selecting particular cells + +With active identities set to "CellType", +we can pull out subsets of cells by their cell type identity class: + +```{r} +# access only astrocyte cells +Astrocyte_cells <- CellsByIdentities(seurat.obj, idents = c("Ast")) + +# alternate function that does the same thing +Astrocyte_cells <- WhichCells(seurat.obj, idents = "Ast") + +str(Astrocyte_cells) +``` + +What if you want to extract an expression matrix containing data for only one identity group (such as "Ast")? + +```{r} +# Retrieve expression data for specific CellType +Astrocyte_cells <- WhichCells(seurat.obj, idents = "Ast") +counts.data <- GetAssayData(seurat.obj, slot = "counts")[, Astrocyte_cells] + +# convert sparse matrix to workable R matrix object +counts.data <- as.matrix(counts.data) + +``` + +## Creating subsets using specific criteria + + +`Seurat` objects are compatible with the `subset()` function for easy subset generation based on user-specified criteria. + +*Note: The following are usage examples, we do not require that you run this code* + +### Examples + + +Can I create a Seurat object based on expression of a feature or value in object metadata? + +```{r} +#my_seurat <- subset(seurat.obj, subset = MS4A1 > 1) +``` + +Subset Seurat object based on identity class (also see `?SubsetData`) + +```{r} +# one group +#my_seurat <- subset(x = seurat.obj, idents = "Ast") + +# multiple groups +#my_seurat <- subset(x = seurat.obj, idents = c("Ast", "vRG")) +``` + +Subset on the expression level of a specific gene + +```{r} +# Only cells with APOE expression > 3 +#subset(x = seurat.obj, subset = APOE > 3) + +``` + +Subset on a combination of criteria + +```{r} + +#subset(x = seurat.obj, subset = APOE > 3 & ELAVL4 > 5) +#subset(x = seurat.obj, subset = APOE > 3, idents = "Ast") +``` + +Subset on a value in the object meta data (a cell-level attribute, for example "CellType") + +```{r} +#subset(x = seurat.obj, subset = Age == "2mo") +``` + +Create a downsampled version of a `Seurat` object (containing fewer cells per identity class) + +```{r} +#subset(x = seurat.obj, downsample = 100) + +``` + +# Day 3 Take-Home Challenge + +For this challenge, you will be creating a subset of the full NSCI dataset containing only the data specific to your `CellType`. + +- Change the celltype code specified below to your celltype + + +```{r} +celltype <- "vRG" +``` + +- Uncomment and run the code below. This will create a single subset of the full dataset `Seurat` object, containing only the data for your pet `CellType`, and save the subset to an rds file. + +Uncomment this code and run it: + +```{r} +# # Read in the FULL dataset from .rds format from data location on the cluster +seurat.obj <- readRDS("/data/AlzheimersDSData/FinalMergedData-downsampled.rds") +# +# # First make sure the active identities of cells is set to `"CellType"`: +Idents(object = seurat.obj) <- "CellType" +# +# # Create a subset of the full data +my_seurat <- subset(x = seurat.obj, subset = CellType == celltype) +# +# # Save the output +saveRDS(my_seurat, sprintf("~/AlzheimersDS/student-notebooks/FinalMergedData-%s.rds",celltype )) + +``` + + + diff --git a/student-notebooks/Notebook5-Biological-Connections-nabba.Rmd b/student-notebooks/Notebook5-Biological-Connections-nabba.Rmd new file mode 100644 index 0000000..ec67255 --- /dev/null +++ b/student-notebooks/Notebook5-Biological-Connections-nabba.Rmd @@ -0,0 +1,159 @@ +--- +title: 'Drawing Biological Connections with Gene Ontology Enrichment Analysis' +subtitle: IDEA Alzheimer's Data Analysis Bootcamp - Summer 2022 +author: "Alec Nabb" +output: + pdf_document: + latex_engine: xelatex + html_document: + df_print: paged +--- + +```{r setup, include=FALSE} +# Required R package installation: +# These will install packages if they are not already installed + +# Set the correct default repository +r = getOption("repos") +r["CRAN"] = "http://cran.rstudio.com" +options(repos = r) + + +if (!require("knitr")) { + install.packages("knitr") + library(knitr) +} +if (!require("tibble")) { + install.packages("tibble") + library(tibble) +} + +if (!require("dplyr")) { + install.packages("dplyr") + library(dplyr) +} + +if (!require("tidyr")) { + install.packages("tidyr") + library(tidyr) +} +if (!require("ggplot2")) { + install.packages("ggplot2") + library(ggplot2) +} + +if (!require('gprofiler2')) { + install.packages("gprofiler2") + library(gprofiler2) +} + +knitr::opts_chunk$set(echo = TRUE) +``` + +# Overview + +This notebook performs functional enrichment analysis on an input list of genes. Specifically, we run Gene Ontology Enrichment followed by Revigo summarization on the input gene list. + +This type of analysis is a common approach to characterizing the molecular functions of a set of genes of interest, in order to better understand and summarize the biological role of the set of genes as a whole, as well as the larger-scale cellular processes the genes may play a role in. + + +* Part I: Basic Gene Ontology analysis with `g:profiler` +* Part II: `Revigo` for GO semantic summarization + +## Data + +For this analysis, we will use the set of genes found to be enriched in the cell type `Astrocytes` in FTD organoid models. + +```{r} +vRG.markers.top <- readRDS("celltype_marker_genes_vRG.rds") +``` + + +## Part I: Run Gene Ontology Enrichment Analysis on genes of interest + +Gene Ontology (GO) term enrichment is a technique for interpreting the biological roles of sets of genes that makes use of the Gene Ontology system of classification. In this ontology, genes are assigned to a set of predefined bins depending on their functional characteristics. For example, a gene may be categorized for its role of encoding a cell signaling receptor, or for having a protein product involved in a cellular repair process. + +The `g:profiler` package enables functional profiling of custom gene lists using Gene Ontology terms. Specifically, the [`gost`](https://www.rdocumentation.org/packages/gprofiler2/versions/0.2.1/topics/gost) function performs statistical enrichment analysis to find over-representation of biological functions among the input genes. Significantly enriched functions are identified by means of a hypergeometric test followed by +correction for multiple testing. + +Using `g:profiler`, we can identify and show the top biological functions enriched among our genes of interest. The output of the GO enrichment analysis from `g:profiler` is a ranked list of GO terms, and their corresponding p-values (indicating term enrichment significance). + +Read more on the background and usage of Gene Ontology Enrichment Analysis [here](http://geneontology.org/docs/introduction-to-go-resource/). + + +```{r} + +## Run GO analysis on gene list +# Results from the query are stored in GOres$result + +library(gprofiler2) + +genes <- rownames(vRG.markers.top) + +GOres <- gost(query = genes, +organism = "hsapiens", ordered_query = TRUE, +multi_query = FALSE, significant = TRUE, user_threshold = 0.05, +sources = c("GO:BP")) + +# Make a dataframe of major Gene Ontology terms returned and their p-values +# Note we order the results by increasing p-value (most signifcant come first) +GOres.df <- GOres$result %>% +dplyr::select(term_name,term_id,p_value) %>% +arrange(p_value) + +# Show results +head(GOres.df) +terms<- GOres.df$term_id +``` + + +## Revigo Summarization of Gene Ontology Terms + +Revigo is a web-based tool for summarizing lists of GO terms (such as our output from Gene Ontology Enrichment Analysis) by finding a representative subset of the terms based on semantic similarity. + +Once summarized, the resulting non-redundant GO term set can be visualized by means of a `treemap` by calling Revigo's `treemapPlot()` function on the similarity matrix, as shown. + +Learn more about Revigo summarization [here](http://revigo.irb.hr/FAQ.aspx#q01). + +To implement Revigo summarization, we will use the `R` package `rrvgo` - usage reference [here](https://bioconductor.org/packages/release/bioc/vignettes/rrvgo/inst/doc/rrvgo.html#using-rrvgo). The following section highlights the core functionality of this package. + +### Step 1: Similarity Calculation +First we use the R library `rrvgo` to calculate the similarity matrix between terms. The function `calculateSimMatrix` takes i) a list of GO terms for which the semantic simlarity is to be calculated, ii) an `orgdb` object for the organism database to reference, iii) the GO ontology category of interest (one of Biological Process, Molecular Function, or Cellular Component), and iv) the method to calculate the similarity scores. + +```{r} +library(rrvgo) + +# calculate similarity matrix between GO terms +# function takes GO term ids as input; +# these are accessible in GOres.df$term_id +simMatrix <- calculateSimMatrix(GOres.df$term_id, + # human reference database + orgdb="org.Hs.eg.db", + # Gene ontology to use: Biological Process + ont="BP", + method="Rel") + +``` + +### Step 2: Term Reduction + +From the similarity matrix, we can next group the GO terms based on similarity. `rrvgo` provides the `reduceSimMatrix` function for that. It takes as arguments i) the similarity matrix, ii) an optional named vector of scores associated to each GO term, iii) a similarity threshold used for grouping terms, and iv) an `orgdb` organism database reference object. + +For component ii), the scores vector is optional information that helps `rrvgo` assign importance to input GO terms for summarization. Scores are interpreted in the direction that higher are better, so we can minus log-transform the p-values from GO enrichment to use as our scores. + +```{r} +scores <- setNames(-log10(GOres.df$p_value), GOres.df$term_id) +reducedTerms <- reduceSimMatrix(simMatrix, + scores, + threshold=0.7, + # human reference database + orgdb="org.Hs.eg.db") + + +``` + +Treemaps are spatial visualizations of hierarchical structures. The terms are grouped (colored) based on their parent, and the space used by the term is proportional to the score. Treemaps can help with the interpretation of the summarized Revigo results. + +```{r} +treemapPlot(reducedTerms) +```