Notebook4-Data-Access-and-Extraction.Rmd

---
title: "Data Manipulation and Pre-Processing for Single-Cell RNAseq Data Analysis"
author: "Insert your name here"
output:
  pdf_document: default
  html_document:
    df_print: paged
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

# Required R package installation (run this cell before knitting)
# These will install packages if they are not already installed

if (!require("ggplot2")) {
   install.packages("ggplot2")
   library(ggplot2)
}
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("tidyverse")) {
 install.packages("tidyverse")
 library(tidyverse)
}
if (!require("Seurat")) {
   install.packages("Seurat")
   library(Seurat)

  }

options(tinytex.verbose = TRUE)

```


## Change the code below to your celltype code

```{r}
celltype <- "Ast"
```

## Overview

This notebook contains a data processing workflow for extracting single-cell RNA Sequencing data directly from `Seurat` objects (from the R package `Seurat`) into R matrices.

The workflow documented here demonstrates how to access subsets of the complete [Bowles et al. (2020)](https://doi.org/10.1016/j.cell.2021.07.003) NSCI dataset, as stored in `FinalMergedData.rds`, based on a combination of cell attributes specified by the user, and extract the resulting subsets into R matrices. Later, we show examples of how to perform similar extraction tasks using automated built-in `Seurat` functions.

We'll start by reading in a downsampled version of our full NSCI FTD single-cell dataset (all cells and timepoints) with `readRDS()`. This dataset contains all the same attributes as the original, just with fewer cells per cell type, for the sake of rendering speed in the tutorial examples.

## Data

```{r}
# Read in FTD data from .RDS format from data location on the cluster
seurat.obj <- readRDS("/data/AlzheimersDSData/FinalMergedData-downsampled.rds")

```


### *Reminder of important metadata from this dataset to be familiar with:*

Cell Line Key

| Line     | Wildtype/mutant         |
|----------|-------------------------|
| ND-B06   | Wildtype                |
| ND-B09   | Mutant                  |
| GIH6-E11 | Wildtype                |
| GIH6-A02 | Mutant                  |
| GIH7-F02 | Wildtype                |
| GIH7-B12 | Wildtype                |
| GIH7-A01 | Mutant                  |

**_In the data, "wildtype" and "mutant" are denoted by "V337V" and "V337M", respectively._**

Previous research conducted on this dataset clustered the cells into 16 primary neuronal cell type categories. These categories are represented as shown below.

Cell Type Key

-   Ast - astrocytes

-   ExDp1- excitatory deep layer 1

-   ExDp2- excitatory deep layer 2

-   ExM- maturing excitatory

-   ExM-U- maturing excitatory upper enriched

-   ExN- newborn excitatory

-   Glia- unspecified glia/non-neuronal cells

-   InCGE- interneurons caudal ganglionic eminence

-   InMGE- interneurons medial ganglionic eminence

-   IP- intermediate progenitors

-   OPC- oligodendrocyte precursor cells

-   oRG- outer radial glia

-   PgG2M- cycling progenitors (G2/M phase)

-   PgS- cycling progenitors (S phase)

-   UN- unspecified neurons

-   vRG- ventricular radial glia.


# Setup: Scoping out the cell attributes

A typical interest for RNAseq data analyst might be to scope out what attributes were collected on the cells used in the original scRNAseq experiment being studied - for example, their age, the line they were drawn from, the experimental batch they were included in, etc.  `Seurat` stores all of this information for the entire corpus of single cells in `data frame`, accessible within the object metadata.

We can access this data frame of cell-level metadata in our NSCI dataset as shown:

```{r}
cell_attrs <- seurat.obj[["SCT"]]@misc$vst.out$cell_attr
```

```{r}
# Peek at the available fields
str(cell_attrs)
```

The 3 sample timepoints at which expression data were recorded can be accessed at `cell_attrs$Age`:

```{r}
levels(as.factor(cell_attrs$Age))

```

Note that the data type of this feature is not numeric, as might be expected.

```{r}
class(levels(as.factor(cell_attrs$Age)))

```

We can similarly check how the 2 variants are represented:

```{r}
levels(as.factor(cell_attrs$Mt))

```

And our cell types:

```{r}
levels(as.factor(cell_attrs$CellType))

```


# **Workflow: Extracting a subset of cells and the corresponding genes from a `Seurat` object**

This section extracts the subset of N cells having attributes:

1.  Belonging to **ND cell line** (both mutant and wildtype)

2.  Collected at timepoint **2 months**.

### **Step 1.** Identify cell IDs of desired subset of cells, based on the attribute columns in the cell_attrs dataframe

Let's say for example we are interested in analyzing the data for just one cell line, line **ND**, sampled at **2 months**. Then we want to extract the cells with IDs `ND-B06` _or_ `ND-B09`, the mutant and wildtype equivalents of the same cell line. We also need to require `Age == "2mo"`.

```{r}

# dplyr pipeline to extract desired cell subset
ND.cells.df <- cell_attrs %>% filter((ID == "B06" | ID == "B09") & Age == "2mo")

# View result
head(ND.cells.df)
str(ND.cells.df)

# From this dataframe, we can get the list of cell IDs for our extracted subset:
ND_cells <- rownames(ND.cells.df)

```


### **Step 2.** Use the identified list or vector of cell IDs to extract the gene expression data for the corresponding cells. We will extract the expression values for the 3000 top most-variable genes.


```{r}

# Access the gene expression matrix:
expression.matrix <- seurat.obj[["SCT"]]@data
str(expression.matrix)

# convert to data frame for easier data manipulation
exp.df <- as.data.frame(expression.matrix)
dim(exp.df)

```

Note the difference in formatting from the cell attributes data frame:

*  `cell_attrs`: cells are rows, attributes are columns
*  `exp.df`: genes are rows, cells are columns

```{r}
# Fetch gene expression data for the ND line cells only
ND.exp.df <- exp.df %>% dplyr::select(any_of(ND_cells))

```

### Step 3. Align and join cell attributes with gene expression data

```{r}

# Transpose gene expression df to match format of cell attributes df
# We want individual cells as rows and genes as columns

#t() transposes a matrix-like object (swaps the columns and rows)
ND.exp.df.T <- t(ND.exp.df)

# show swapped dimensions
# before:
dim(ND.exp.df)
# after:
dim(ND.exp.df.T)
```

```{r}
# R coerced it to a matrix for the operation,
# so we switch it back to a dataframe
ND.exp.df.T <- as.data.frame(ND.exp.df.T)

```

```{r}
# next we want to align and join the two dataframes side-by-side
# to do so, we first need to create a common column of features to join them on
# we use the cell IDs, because we know both structures have the same set of cells.

# convert rownames (cell IDs) into a column to join the two dataframes on
ND.cells.df <- rownames_to_column(ND.cells.df, var = "Cell_ID")
ND.exp.df.T <- rownames_to_column(ND.exp.df.T, var = "Cell_ID")

```

```{r}
# Do the merge, using an inner_join to preserve only common elements
# (in this case we know they match, so the join type doesn't really matter)
merged.df <- inner_join(x=ND.cells.df, y=ND.exp.df.T, by="Cell_ID")

dim(merged.df)

```

```{r}
# We can use `dplyr` functions
# if we want to select only a few specific features to keep:
# adding `-` before a column name drops that column, and keeps all others
merged.df.clean <- merged.df %>%
   dplyr::select(-Batch,-HTO,-Run,-percent.mt,-QCcells,-integrated_snn_res.0.5, -CellTypeRaw)
```

```{r}
# Optionally export the data
#saveRDS(merged.df,"~/Alzheimers_DS/GlutNeurons_NDline_2months_bothvariants_expression_data.rds")
```


# Functionalized version of the data subsetting workflow

We provide a convenience function `subsetCells()` for you to use in your summer work, if desired. This function can be used to extract a subset of the overall NSCI dataset containing only a specific Tau Consortium cell line (from one of the three main lines used in Bowles et al 2021).

It takes as input a `Seurat` object such as the one containing our overall dataset (`/data/AlzheimersDSData/FinalMergedData-downsampled.rds`), the specification of a desired cell line, and additional parameters described below, and outputs a .rds file containing the specified subset.

```{r}

# Written 2/22/2022
# Last updated 3/28/2022
# Function to subset cell data stored in Seurat object by cell line
# and join with corresponding gene expression data (SCTransformed)
# from same object.
# author: Rachael White (whiter9)
#
# Inputs:
#     - Seurat object containing an "SCT" assay
#     - IDs: character vector of size 2 containing an ID pair
#     - age: character, one of "2mo", "4mo", "6mo", or "all"
#     - slot: one of "data" or "scale.data"; defaults to "data"
#
# Returns: Dataframe with *only* cell IDs from the input cell line as rows,
#          with cell attributes and genes expressed as columns.
#          Saves data subset as .rds.
#
# Note: Assumes "ID" and not "orig.idents" is the correct term
# to subset cells on.
###################################################################
subsetCells <- function(seurat.obj,IDs,age, slot="data") {

   require(tidyverse)

   if (slot=="scale.data") {
      expression.matrix <- seurat.obj[["SCT"]]@scale.data
   } else {
      expression.matrix <- seurat.obj[["SCT"]]@data
   }

   cell_attrs <- seurat.obj[["SCT"]]@misc$vst.out$cell_attr

   #Step 1: Identify cell IDs of desired subset of cells (rows),
   #based on the attribute columns in cell_attrs df,
   # as well as the gene IDs of the 3000 top most-variable genes
   if (age=="all") {
      cells.df <- cell_attrs %>% filter((ID == IDs[1] | ID == IDs[2]))
   } else {
      cells.df <- cell_attrs %>% filter((ID == IDs[1] | ID == IDs[2]) & Age == age)
   }
   # get list of desired cell IDs
   # (use intersect to ensure they are present in both objects):
   cells <- intersect(rownames(cells.df),colnames(expression.matrix))
   # get list of the 3000 most-variable genes
   top_3000_genes <- seurat.obj[["SCT"]]@var.features

   #Step 2: Use the identified cell and gene IDs to extract the gene expression data
   #for the corresponding cells and genes
   exp.M <- as.matrix(expression.matrix[top_3000_genes,cells])

   #Step 3: Align and join cell attributes with gene expression data
   # Transpose gene expression df to match format of cell attributes df
   # We want individual cells as rows and genes as columns
   exp.M.rotated <- t(exp.M)
   exp.df <- as.data.frame(exp.M.rotated)
   # Convert rownames (cell IDs) into a column to join on
   cells.df <- tibble::rownames_to_column(cells.df, var = "Cell_ID")
   exp.df <- tibble::rownames_to_column(exp.df, var = "Cell_ID")
   # Do the merge
   merged.df <- right_join(x=cells.df, y=exp.df, by="Cell_ID")

   #Step 4: Export data
   if (age=="all") {
      savefile = sprintf("FinalMergedData_lines%s-%s_%s_timepts.rds", IDs[1], IDs[2], age)
   } else {
      savefile = sprintf("FinalMergedData_lines%s-%s_%snths.rds", IDs[1], IDs[2], age)
   }

   saveRDS(merged.df,savefile)

   return(merged.df)
}

###################################################################
```


# Example data subset generation

**NOTE:** The following section is included as a usage example; doesn't need to be run.

### Read in full dataset (all cell types, total cell count 379142)

```{r}
#seurat.complete <- readRDS("/data/AlzheimersDSData/FinalMergedData.rds")
```

### Create subsets of complete dataset compiled by specified cell line and time period

```{r}
# test <- subsetCells(seurat.complete,IDs=c("B06","B09"),age="2mo")
# dim(test)
# test[1,1:10]

```

```{r}
# ND2mo <- subsetCells(seurat.complete,IDs=c("B06","B09"),age="2mo")
```

### Subset complete dataset by cell line only, include all sequencing time points

```{r}
# G6cells_all <- subsetCells(seurat.complete,IDs=c("G6A02","G6E11"),age="all")
```


# Data Access and Subsetting the `Seurat` Way

`Seurat` provides built-in functions for selecting particular cells from your dataset and subsetting the `Seurat` object on those cells. While we've shown how you could create data subsets in base `R`, the `Seurat` methods are probably the easiest and safest way to create data subsets for your own work for most purposes. The following examples will illustrate this functionality.

First make sure the active identities of cells is set to `CellType`:

```{r}
Idents(object = seurat.obj) <- "CellType"
levels(Idents(seurat.obj))

```

## Selecting particular cells

With active identities set to "CellType",
we can pull out subsets of cells by their cell type identity class:

```{r}
# access only astrocyte cells
Astrocyte_cells <- CellsByIdentities(seurat.obj, idents = c("Ast"))

# alternate function that does the same thing
Astrocyte_cells <- WhichCells(seurat.obj, idents = "Ast")

str(Astrocyte_cells)
```

What if you want to extract an expression matrix containing data for only one identity group (such as "Ast")?

```{r}
# Retrieve expression data for specific CellType
Astrocyte_cells <- WhichCells(seurat.obj, idents = "Ast")
counts.data <- GetAssayData(seurat.obj, slot = "counts")[, Astrocyte_cells]

# convert sparse matrix to workable R matrix object
counts.data <- as.matrix(counts.data)

```

## Creating subsets using specific criteria


`Seurat` objects are compatible with the `subset()` function for easy subset generation based on user-specified criteria.

*Note: The following are usage examples, we do not require that you run this code*

### Examples


Can I create a Seurat object based on expression of a feature or value in object metadata?

```{r}
#my_seurat <- subset(seurat.obj, subset = MS4A1 > 1)
```

Subset Seurat object based on identity class (also see `?SubsetData`)

```{r}
# one group
#my_seurat <- subset(x = seurat.obj, idents = "Ast")

# multiple groups
#my_seurat <- subset(x = seurat.obj, idents = c("Ast", "vRG"))
```

Subset on the expression level of a specific gene

```{r}
# Only cells with APOE expression > 3
#subset(x = seurat.obj, subset = APOE > 3)

```

Subset on a combination of criteria

```{r}

#subset(x = seurat.obj, subset = APOE > 3 & ELAVL4 > 5)
#subset(x = seurat.obj, subset = APOE > 3, idents = "Ast")
```

Subset on a value in the object meta data (a cell-level attribute, for example "CellType")

```{r}
#subset(x = seurat.obj, subset = Age == "2mo")
```

Create a downsampled version of a `Seurat` object (containing fewer cells per identity class)

```{r}
#subset(x = seurat.obj, downsample = 100)

```

# Day 3 Take-Home Challenge

For this challenge, you will be creating a subset of the full NSCI dataset containing only the data specific to your `CellType`.

- Change the celltype code specified below to your celltype


```{r}
celltype <- "Ast"
```

- Uncomment and run the code below. This will create a single subset of the full dataset `Seurat` object, containing only the data for your pet `CellType`, and save the subset to an rds file.

Uncomment this code and run it:

```{r}
# # Read in the FULL dataset from .rds format from data location on the cluster
# seurat.obj <- readRDS("/data/AlzheimersDSData/FinalMergedData-downsampled.rds")
#
# # First make sure the active identities of cells is set to `"CellType"`:
# Idents(object = seurat.obj) <- "CellType"
#
# # Create a subset of the full data
# my_seurat <- subset(x = seurat.obj, subset = CellType == celltype)
#
# # Save the output
# saveRDS(my_seurat, sprintf("~/AlzheimersDS/student-notebooks/FinalMergedData-%s.rds",celltype ))

```
	---
	title: "Data Manipulation and Pre-Processing for Single-Cell RNAseq Data Analysis"
	author: "Insert your name here"
	output:
	pdf_document: default
	html_document:
	df_print: paged
	---

	```{r setup, include=FALSE}
	knitr::opts_chunk$set(echo = TRUE)

	# Required R package installation (run this cell before knitting)
	# These will install packages if they are not already installed

	if (!require("ggplot2")) {
	install.packages("ggplot2")
	library(ggplot2)
	}
	if (!require("knitr")) {
	install.packages("knitr")
	library(knitr)
	}
	if (!require("tidyverse")) {
	install.packages("tidyverse")
	library(tidyverse)
	}
	if (!require("Seurat")) {
	install.packages("Seurat")
	library(Seurat)

	}

	options(tinytex.verbose = TRUE)

	```


	## Change the code below to your celltype code

	```{r}
	celltype <- "Ast"
	```

	## Overview

	This notebook contains a data processing workflow for extracting single-cell RNA Sequencing data directly from `Seurat` objects (from the R package `Seurat`) into R matrices.

	The workflow documented here demonstrates how to access subsets of the complete [Bowles et al. (2020)](https://doi.org/10.1016/j.cell.2021.07.003) NSCI dataset, as stored in `FinalMergedData.rds`, based on a combination of cell attributes specified by the user, and extract the resulting subsets into R matrices. Later, we show examples of how to perform similar extraction tasks using automated built-in `Seurat` functions.

	We'll start by reading in a downsampled version of our full NSCI FTD single-cell dataset (all cells and timepoints) with `readRDS()`. This dataset contains all the same attributes as the original, just with fewer cells per cell type, for the sake of rendering speed in the tutorial examples.

	## Data

	```{r}
	# Read in FTD data from .RDS format from data location on the cluster
	seurat.obj <- readRDS("/data/AlzheimersDSData/FinalMergedData-downsampled.rds")

	```


	### Reminder of important metadata from this dataset to be familiar with:

	Cell Line Key

	\| Line \| Wildtype/mutant \|
	\|----------\|-------------------------\|
	\| ND-B06 \| Wildtype \|
	\| ND-B09 \| Mutant \|
	\| GIH6-E11 \| Wildtype \|
	\| GIH6-A02 \| Mutant \|
	\| GIH7-F02 \| Wildtype \|
	\| GIH7-B12 \| Wildtype \|
	\| GIH7-A01 \| Mutant \|

	_In the data, "wildtype" and "mutant" are denoted by "V337V" and "V337M", respectively._

	Previous research conducted on this dataset clustered the cells into 16 primary neuronal cell type categories. These categories are represented as shown below.

	Cell Type Key

	- Ast - astrocytes

	- ExDp1- excitatory deep layer 1

	- ExDp2- excitatory deep layer 2

	- ExM- maturing excitatory

	- ExM-U- maturing excitatory upper enriched

	- ExN- newborn excitatory

	- Glia- unspecified glia/non-neuronal cells

	- InCGE- interneurons caudal ganglionic eminence

	- InMGE- interneurons medial ganglionic eminence

	- IP- intermediate progenitors

	- OPC- oligodendrocyte precursor cells

	- oRG- outer radial glia

	- PgG2M- cycling progenitors (G2/M phase)

	- PgS- cycling progenitors (S phase)

	- UN- unspecified neurons

	- vRG- ventricular radial glia.


	# Setup: Scoping out the cell attributes

	A typical interest for RNAseq data analyst might be to scope out what attributes were collected on the cells used in the original scRNAseq experiment being studied - for example, their age, the line they were drawn from, the experimental batch they were included in, etc. `Seurat` stores all of this information for the entire corpus of single cells in `data frame`, accessible within the object metadata.

	We can access this data frame of cell-level metadata in our NSCI dataset as shown:

	```{r}
	cell_attrs <- seurat.obj[["SCT"]]@misc$vst.out$cell_attr
	```

	```{r}
	# Peek at the available fields
	str(cell_attrs)
	```

	The 3 sample timepoints at which expression data were recorded can be accessed at `cell_attrs$Age`:

	```{r}
	levels(as.factor(cell_attrs$Age))

	```

	Note that the data type of this feature is not numeric, as might be expected.

	```{r}
	class(levels(as.factor(cell_attrs$Age)))

	```

	We can similarly check how the 2 variants are represented:

	```{r}
	levels(as.factor(cell_attrs$Mt))

	```

	And our cell types:

	```{r}
	levels(as.factor(cell_attrs$CellType))

	```


	# Workflow: Extracting a subset of cells and the corresponding genes from a `Seurat` object

	This section extracts the subset of N cells having attributes:

	1. Belonging to ND cell line (both mutant and wildtype)

	2. Collected at timepoint 2 months.

	### Step 1. Identify cell IDs of desired subset of cells, based on the attribute columns in the cell_attrs dataframe

	Let's say for example we are interested in analyzing the data for just one cell line, line ND, sampled at 2 months. Then we want to extract the cells with IDs `ND-B06` _or_ `ND-B09`, the mutant and wildtype equivalents of the same cell line. We also need to require `Age == "2mo"`.

	```{r}

	# dplyr pipeline to extract desired cell subset
	ND.cells.df <- cell_attrs %>% filter((ID == "B06" \| ID == "B09") & Age == "2mo")

	# View result
	head(ND.cells.df)
	str(ND.cells.df)

	# From this dataframe, we can get the list of cell IDs for our extracted subset:
	ND_cells <- rownames(ND.cells.df)

	```


	### Step 2. Use the identified list or vector of cell IDs to extract the gene expression data for the corresponding cells. We will extract the expression values for the 3000 top most-variable genes.


	```{r}

	# Access the gene expression matrix:
	expression.matrix <- seurat.obj[["SCT"]]@data
	str(expression.matrix)

	# convert to data frame for easier data manipulation
	exp.df <- as.data.frame(expression.matrix)
	dim(exp.df)

	```

	Note the difference in formatting from the cell attributes data frame:

	* `cell_attrs`: cells are rows, attributes are columns
	* `exp.df`: genes are rows, cells are columns

	```{r}
	# Fetch gene expression data for the ND line cells only
	ND.exp.df <- exp.df %>% dplyr::select(any_of(ND_cells))

	```

	### Step 3. Align and join cell attributes with gene expression data

	```{r}

	# Transpose gene expression df to match format of cell attributes df
	# We want individual cells as rows and genes as columns

	#t() transposes a matrix-like object (swaps the columns and rows)
	ND.exp.df.T <- t(ND.exp.df)

	# show swapped dimensions
	# before:
	dim(ND.exp.df)
	# after:
	dim(ND.exp.df.T)
	```

	```{r}
	# R coerced it to a matrix for the operation,
	# so we switch it back to a dataframe
	ND.exp.df.T <- as.data.frame(ND.exp.df.T)

	```

	```{r}
	# next we want to align and join the two dataframes side-by-side
	# to do so, we first need to create a common column of features to join them on
	# we use the cell IDs, because we know both structures have the same set of cells.

	# convert rownames (cell IDs) into a column to join the two dataframes on
	ND.cells.df <- rownames_to_column(ND.cells.df, var = "Cell_ID")
	ND.exp.df.T <- rownames_to_column(ND.exp.df.T, var = "Cell_ID")

	```

	```{r}
	# Do the merge, using an inner_join to preserve only common elements
	# (in this case we know they match, so the join type doesn't really matter)
	merged.df <- inner_join(x=ND.cells.df, y=ND.exp.df.T, by="Cell_ID")

	dim(merged.df)

	```

	```{r}
	# We can use `dplyr` functions
	# if we want to select only a few specific features to keep:
	# adding `-` before a column name drops that column, and keeps all others
	merged.df.clean <- merged.df %>%
	dplyr::select(-Batch,-HTO,-Run,-percent.mt,-QCcells,-integrated_snn_res.0.5, -CellTypeRaw)
	```

	```{r}
	# Optionally export the data
	#saveRDS(merged.df,"~/Alzheimers_DS/GlutNeurons_NDline_2months_bothvariants_expression_data.rds")
	```


	# Functionalized version of the data subsetting workflow

	We provide a convenience function `subsetCells()` for you to use in your summer work, if desired. This function can be used to extract a subset of the overall NSCI dataset containing only a specific Tau Consortium cell line (from one of the three main lines used in Bowles et al 2021).

	It takes as input a `Seurat` object such as the one containing our overall dataset (`/data/AlzheimersDSData/FinalMergedData-downsampled.rds`), the specification of a desired cell line, and additional parameters described below, and outputs a .rds file containing the specified subset.

	```{r}

	# Written 2/22/2022
	# Last updated 3/28/2022
	# Function to subset cell data stored in Seurat object by cell line
	# and join with corresponding gene expression data (SCTransformed)
	# from same object.
	# author: Rachael White (whiter9)
	#
	# Inputs:
	# - Seurat object containing an "SCT" assay
	# - IDs: character vector of size 2 containing an ID pair
	# - age: character, one of "2mo", "4mo", "6mo", or "all"
	# - slot: one of "data" or "scale.data"; defaults to "data"
	#
	# Returns: Dataframe with only cell IDs from the input cell line as rows,
	# with cell attributes and genes expressed as columns.
	# Saves data subset as .rds.
	#
	# Note: Assumes "ID" and not "orig.idents" is the correct term
	# to subset cells on.
	###################################################################
	subsetCells <- function(seurat.obj,IDs,age, slot="data") {

	require(tidyverse)

	if (slot=="scale.data") {
	expression.matrix <- seurat.obj[["SCT"]]@scale.data
	} else {
	expression.matrix <- seurat.obj[["SCT"]]@data
	}

	cell_attrs <- seurat.obj[["SCT"]]@misc$vst.out$cell_attr

	#Step 1: Identify cell IDs of desired subset of cells (rows),
	#based on the attribute columns in cell_attrs df,
	# as well as the gene IDs of the 3000 top most-variable genes
	if (age=="all") {
	cells.df <- cell_attrs %>% filter((ID == IDs[1] \| ID == IDs[2]))
	} else {
	cells.df <- cell_attrs %>% filter((ID == IDs[1] \| ID == IDs[2]) & Age == age)
	}
	# get list of desired cell IDs
	# (use intersect to ensure they are present in both objects):
	cells <- intersect(rownames(cells.df),colnames(expression.matrix))
	# get list of the 3000 most-variable genes
	top_3000_genes <- seurat.obj[["SCT"]]@var.features

	#Step 2: Use the identified cell and gene IDs to extract the gene expression data
	#for the corresponding cells and genes
	exp.M <- as.matrix(expression.matrix[top_3000_genes,cells])

	#Step 3: Align and join cell attributes with gene expression data
	# Transpose gene expression df to match format of cell attributes df
	# We want individual cells as rows and genes as columns
	exp.M.rotated <- t(exp.M)
	exp.df <- as.data.frame(exp.M.rotated)
	# Convert rownames (cell IDs) into a column to join on
	cells.df <- tibble::rownames_to_column(cells.df, var = "Cell_ID")
	exp.df <- tibble::rownames_to_column(exp.df, var = "Cell_ID")
	# Do the merge
	merged.df <- right_join(x=cells.df, y=exp.df, by="Cell_ID")

	#Step 4: Export data
	if (age=="all") {
	savefile = sprintf("FinalMergedData_lines%s-%s_%s_timepts.rds", IDs[1], IDs[2], age)
	} else {
	savefile = sprintf("FinalMergedData_lines%s-%s_%snths.rds", IDs[1], IDs[2], age)
	}

	saveRDS(merged.df,savefile)

	return(merged.df)
	}

	###################################################################
	```



	# Example data subset generation

	NOTE: The following section is included as a usage example; doesn't need to be run.

	### Read in full dataset (all cell types, total cell count 379142)

	```{r}
	#seurat.complete <- readRDS("/data/AlzheimersDSData/FinalMergedData.rds")
	```

	### Create subsets of complete dataset compiled by specified cell line and time period

	```{r}
	# test <- subsetCells(seurat.complete,IDs=c("B06","B09"),age="2mo")
	# dim(test)
	# test[1,1:10]

	```

	```{r}
	# ND2mo <- subsetCells(seurat.complete,IDs=c("B06","B09"),age="2mo")
	```

	### Subset complete dataset by cell line only, include all sequencing time points

	```{r}
	# G6cells_all <- subsetCells(seurat.complete,IDs=c("G6A02","G6E11"),age="all")
	```


	# Data Access and Subsetting the `Seurat` Way

	`Seurat` provides built-in functions for selecting particular cells from your dataset and subsetting the `Seurat` object on those cells. While we've shown how you could create data subsets in base `R`, the `Seurat` methods are probably the easiest and safest way to create data subsets for your own work for most purposes. The following examples will illustrate this functionality.

	First make sure the active identities of cells is set to `CellType`:

	```{r}
	Idents(object = seurat.obj) <- "CellType"
	levels(Idents(seurat.obj))

	```

	## Selecting particular cells

	With active identities set to "CellType",
	we can pull out subsets of cells by their cell type identity class:

	```{r}
	# access only astrocyte cells
	Astrocyte_cells <- CellsByIdentities(seurat.obj, idents = c("Ast"))

	# alternate function that does the same thing
	Astrocyte_cells <- WhichCells(seurat.obj, idents = "Ast")

	str(Astrocyte_cells)
	```

	What if you want to extract an expression matrix containing data for only one identity group (such as "Ast")?

	```{r}
	# Retrieve expression data for specific CellType
	Astrocyte_cells <- WhichCells(seurat.obj, idents = "Ast")
	counts.data <- GetAssayData(seurat.obj, slot = "counts")[, Astrocyte_cells]

	# convert sparse matrix to workable R matrix object
	counts.data <- as.matrix(counts.data)

	```

	## Creating subsets using specific criteria


	`Seurat` objects are compatible with the `subset()` function for easy subset generation based on user-specified criteria.

	Note: The following are usage examples, we do not require that you run this code

	### Examples


	Can I create a Seurat object based on expression of a feature or value in object metadata?

	```{r}
	#my_seurat <- subset(seurat.obj, subset = MS4A1 > 1)
	```

	Subset Seurat object based on identity class (also see `?SubsetData`)

	```{r}
	# one group
	#my_seurat <- subset(x = seurat.obj, idents = "Ast")

	# multiple groups
	#my_seurat <- subset(x = seurat.obj, idents = c("Ast", "vRG"))
	```

	Subset on the expression level of a specific gene

	```{r}
	# Only cells with APOE expression > 3
	#subset(x = seurat.obj, subset = APOE > 3)

	```

	Subset on a combination of criteria

	```{r}

	#subset(x = seurat.obj, subset = APOE > 3 & ELAVL4 > 5)
	#subset(x = seurat.obj, subset = APOE > 3, idents = "Ast")
	```

	Subset on a value in the object meta data (a cell-level attribute, for example "CellType")

	```{r}
	#subset(x = seurat.obj, subset = Age == "2mo")
	```

	Create a downsampled version of a `Seurat` object (containing fewer cells per identity class)

	```{r}
	#subset(x = seurat.obj, downsample = 100)

	```

	# Day 3 Take-Home Challenge

	For this challenge, you will be creating a subset of the full NSCI dataset containing only the data specific to your `CellType`.

	- Change the celltype code specified below to your celltype


	```{r}
	celltype <- "Ast"
	```

	- Uncomment and run the code below. This will create a single subset of the full dataset `Seurat` object, containing only the data for your pet `CellType`, and save the subset to an rds file.

	Uncomment this code and run it:

	```{r}
	# # Read in the FULL dataset from .rds format from data location on the cluster
	# seurat.obj <- readRDS("/data/AlzheimersDSData/FinalMergedData-downsampled.rds")
	#
	# # First make sure the active identities of cells is set to `"CellType"`:
	# Idents(object = seurat.obj) <- "CellType"
	#
	# # Create a subset of the full data
	# my_seurat <- subset(x = seurat.obj, subset = CellType == celltype)
	#
	# # Save the output
	# saveRDS(my_seurat, sprintf("~/AlzheimersDS/student-notebooks/FinalMergedData-%s.rds",celltype ))

	```