13th time's a charm (assignment 1 submission) #44

Merged
merged 1 commit into from Sep 4, 2024
397 changes: 397 additions & 0 deletions StudentNotebooks/Assignment01/currac4-dar-f24-assignment1.Rmd
@@ -0,0 +1,397 @@
---
title: "Assignment 1"
subtitle: "First Notebook"
author: "Corey Curran"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  html_document:
    toc: true
    number_sections: true
    df_print: paged
  pdf_document: default
---
```{r setup, include=FALSE}
# REQUIRE R PACKAGE INSTALLATIONS
# This section installs packages if they are not already installed.
# This block will not be shown in the knitted file.
# RUN THIS BLOCK BEFORE ATTEMPTING TO KNIT THIS NOTEBOOK!!!
# Set the default CRAN repository
local({
  r <- getOption("repos")
  r["CRAN"] <- "http://cran.r-project.org"
  options(repos = r)
})
if (!require("pandoc")) {
  install.packages("pandoc")
  library(pandoc)
}
if (!require("knitr")) {
  install.packages("knitr")
  library(knitr)
}
# Required packages for M20 PIXL analysis
if (!require("rmarkdown")) {
  install.packages("rmarkdown")
  library(rmarkdown)
}
if (!require("tidyverse")) {
  install.packages("tidyverse")
  library(tidyverse)
}
if (!require("stringr")) {
  install.packages("stringr")
  library(stringr)
}
if (!require("ggbiplot")) {
  install.packages("ggbiplot")
  library(ggbiplot)
}
if (!require("pheatmap")) {
  install.packages("pheatmap")
  library(pheatmap)
}
if (!require("ggrepel")) {
  install.packages("ggrepel")
  library(ggrepel)
}
if (!require("farver")) {
  install.packages("farver")
  library(farver)
}
if (!require("labeling")) {
  install.packages("labeling")
  library(labeling)
}
knitr::opts_chunk$set(echo = TRUE)
```

# Introductory Data Analytics Research Notebook

This notebook is broken into two main parts:

* Part 1: A basic introduction to github and RStudio Server
* Part 2: An introduction to the Mars 2020 PIXL dataset

The RPI github repository for all the code and data required for this notebook may be found at:

* https://github.rpi.edu/DataINCITE/DAR-Mars-F24


## BEFORE YOU BEGIN: github account setup

To contribute to any RPI github repository or read private repos you _must_ validate your RPI github.com ID and send a confirmation email to John Erickson at `erickj4@rpi.edu`. Please do the following **now**:

**Enabling 2FA on the RPI github and saving personal access tokens, etc.**

* Browse to http://github.rpi.edu
* Log in using your RPI credentials
* Enable github two-factor authentication (2FA)
    * Under "Settings" -> "Password and authentication"
    * Select "Authenticator app" (Duo or Google authenticator are recommended)
    * Follow the steps to set up the authenticator app (this may involve scanning a QR code)
    * See directions for 2FA at https://itssc.rpi.edu/hc/en-us/articles/360004801811-GitHub-Enterprise-Overview#2fa
    * **CRITICAL:** Make sure to save your **recovery codes** in a safe place! Recovery codes can be used to access your account in the event you lose access to your device and cannot receive two-factor authentication codes.
* Create and save a *personal access token*
    * Under "Settings" -> "Developer settings"
    * Select "Personal access tokens"
    * Click on "Generate new token (classic)"
    * Set an expiration period for the end of the Fall 2024 term
    * Enable everything (check the left-most boxes)
    * Generate (green button)
    * SAVE THE RESULT! You won't be able to see it again...
    * _Use this token when command-line git asks you for a password_
* **PLEASE DO THIS IMMEDIATELY BEFORE READING ANY FURTHER!!**

# DAR ASSIGNMENT 1 (Part 1): CLONING A NOTEBOOK AND UPDATING THE REPOSITORY

In this assignment we're asking you to

* clone the `DAR-Mars-F24` github repository,
* create a personal branch using git,
* create a new notebook that includes your answers to questions in this notebook,
* make additions to the repository by adding your notebook to the repository.

_The instructions which follow explain how to accomplish this._

**For DAR Fall 2024** you *must* be using RStudio Server on the IDEA Cluster. Instructions for accessing "The Cluster" appear at the end of this notebook. Don't forget to validate your RPI github ID as above and email `erickj4@rpi.edu`

### Cloning an RPI github repository

The recommended procedure for cloning and using this repository is as follows:

* Access the RPI network via VPN
    * See https://itssc.rpi.edu/hc/en-us/articles/360008783172-VPN-Connection-and-Installation for information
* Access RStudio Server on the IDEA Cluster at http://lp01.idea.rpi.edu/rstudio-ose/
    * You must be on the RPI VPN!!
* Access the Linux shell on the IDEA Cluster by clicking the **Terminal** tab of RStudio Server (lower left panel).
    * You now see the Linux shell on the IDEA Cluster
    * `cd` (change directory) to enter your home directory using: `cd ~`
    * Type `pwd` to confirm
    * NOTE: Advanced users may use `ssh` to directly access the Linux shell from a macOS or Linux command line
* Type `git clone https://github.rpi.edu/DataINCITE/DAR-Mars-F24` from within your `home` directory
    * Enter your RCS ID and your saved personal access token when asked
    * This will create a new directory `DAR-Mars-F24`
* In the Linux shell, `cd` to `DAR-Mars-F24/StudentNotebooks/Assignment01`
    * Type `ls -al` to list the current contents
    * Don't be surprised if you see many files!
* In the Linux shell, type `git checkout -b dar-yourrcs` where `yourrcs` is your RCS id
    * For example, if your RCS is `erickj4`, your new branch should be `dar-erickj4`
    * It is _critical_ that you include your RCS id in your branch id!
* Back in the RStudio Server UI, navigate to the `DAR-Mars-F24/StudentNotebooks/Assignment01` directory via the **Files** panel (lower right panel)
    * Under the **More** menu, set this to be your R working directory
    * Setting the correct working directory is essential for interactive R use!
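
The working directory can also be set from the R console instead of the **More** menu. The following is a minimal sketch, assuming the repository was cloned into your home directory as described above; the variable name `nb_dir` is ours, chosen only for illustration.

```r
# Assumed location of the cloned repository; adjust if you cloned elsewhere
nb_dir <- path.expand("~/DAR-Mars-F24/StudentNotebooks/Assignment01")

if (dir.exists(nb_dir)) {
  setwd(nb_dir)   # make this the R working directory
} else {
  message("Directory not found: ", nb_dir)
}

# Confirm where R is currently working
getwd()
```

If the message appears, the clone most likely landed somewhere other than your home directory; `pwd` in the Terminal tab shows where you were when you ran `git clone`.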

## REQUIRED FOR ASSIGNMENT 1

1. In RStudio, make a **copy** of the `dar-f24-assignment1-template.Rmd` file using a *new, original, descriptive* filename that **includes your RCS ID!**
    * Open `dar-f24-assignment1-template.Rmd`
    * **Save As...** using a new filename that includes your RCS ID
    * Example filename for user `erickj4`: `erickj4-assignment1-f24.Rmd`
    * POINTS OFF IF:
        * You don't create a new filename!
        * You don't include your RCS ID!
        * You include `template` in your new filename!
2. Edit your new notebook using RStudio and save
    * Change the `title:` and `subtitle:` headers (at the top of the file)
    * Change the `author:`
    * Don't bother changing the `date:`; it should update automagically...
    * **Save** your changes
3. Use the RStudio `Knit` command to create an HTML file; repeat as necessary
    * Use the down arrow next to the word `Knit` and select **Knit to HTML**
    * You may also knit to PDF...
4. In the Linux terminal, use `git add` to add each new file you want to add to the repository
    * Type: `git add yourfilename.Rmd`
    * Type: `git add yourfilename.html` (created when you knitted)
    * Add your PDF if you also created one...
5. Continue making changes to your personal notebook
    * Add code where specified
    * Answer questions where indicated.
6. When you're ready, in Linux commit your changes:
    * Type: `git commit -m "some comment"` where "some comment" is a useful comment describing your changes
    * This commits your changes to your local repo, and sets the stage for your next operation.
7. Finally, push your commits to the RPI github repo
    * Type: `git push origin dar-yourrcs` (where `dar-yourrcs` is the branch you've been working in)
    * Enter your RCS ID and personal access token (as a password) when asked.
    * Your changes are now safely on the RPI github.
8. **REQUIRED:** On the RPI github, submit a pull request.
    * In a web browser, navigate to https://github.rpi.edu/DataINCITE/DAR-Mars-F24.git and log in using 2FA
    * In the branch selector drop-down (by default says **main**), select your branch
    * **Submit a pull request for your branch**
    * One of the DAR instructors will merge your branch, and your new files will be added to the main branch of the repo.
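
Step 3 can also be run from the R console, which is handy when you want to re-knit without clicking through the menu. The sketch below wraps `rmarkdown::render()` in a small helper; `knit_to_html` is a hypothetical name of ours, not part of the course materials, and `yourfilename.Rmd` is the same placeholder used in the steps above.

```r
# Hypothetical helper (not part of the course materials): knit a notebook to HTML
knit_to_html <- function(path) {
  if (!file.exists(path)) {
    message("File not found: ", path)
    return(invisible(NULL))
  }
  # rmarkdown::render() converts the Rmd file to the requested output format
  rmarkdown::render(path, output_format = "html_document")
}

# Example call; replace the placeholder with your own filename
knit_to_html("yourfilename.Rmd")
```

The file check only guards against a mistyped filename or a wrong working directory; the rendering itself is done entirely by `rmarkdown::render()`.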

Please also see these handy github "cheatsheets":

* https://education.github.com/git-cheat-sheet-education.pdf

# DAR ASSIGNMENT 1 (Part 2): Exploring the Mars 2020 (M20) PIXL Dataset

This part of the notebook demonstrates some basic analysis of data from the M20 PIXL (Planetary Instrument for X-ray Lithochemistry) experiment.

PIXL (Planetary Instrument for X-ray Lithochemistry) is a microfocus X-ray fluorescence instrument that measures elemental chemistry at sub-millimeter scales. This is achieved by focusing an X-ray beam to a small spot ~ 150 µm, scanning the surface with this beam, and then measuring the induced X-ray fluorescence. PIXL observations consist of a suite of X-ray fluorescence measurements, context images, and metadata. The XRF measurements can be executed in a variety of geometries depending on target type and available observation time, and are accompanied by a set of images documenting the target and its position relative to the instrument.

In this notebook we will be looking at pre-processed PIXL data that is ready for your next steps.

* More about the PIXL instrument: https://an.rsl.wustl.edu/help/Content/About%20the%20mission/M20/Instruments/M20%20PIXL.htm
* Raw PIXL data bundle: https://pds-geosciences.wustl.edu/m2020/urn-nasa-pds-mars2020_pixl/

## Load the PIXL Data and display summary

Here is the Mars PIXL data. Take note of the variables, their types, and distributions.

```{r}
# Saved PIXL data with locations added
# NOTE: Use course directory version during the semester
#pixl.df<- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/samples_pixl_wide.Rds")
# Use this version to use downloaded data from github
pixl.df <- readRDS("~/DAR-Mars-F24/Data/samples_pixl_wide.Rds")
# convert location to a number
pixl.df$location <- as.numeric(pixl.df$location )
# Automatically converts all strings to factors
pixl.df[sapply(pixl.df, is.character)] <-
  lapply(pixl.df[sapply(pixl.df, is.character)], as.factor)
# Show summary of the data
summary(pixl.df)
```
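
One caveat about the `as.numeric()` conversion above: if `location` had been stored as a factor rather than as a character column, `as.numeric()` would return the internal level codes, not the printed values. We have not checked how `location` is stored in the Rds file, so the following is only a toy illustration with made-up values.

```r
# Made-up values, stored once as character and once as factor
loc_chr <- c("10", "25", "7")
loc_fct <- factor(loc_chr)

as.numeric(loc_chr)                 # character input: parses the numbers
as.numeric(loc_fct)                 # factor input: returns the level codes
as.numeric(as.character(loc_fct))   # converts correctly in both cases
```

Going through `as.character()` first is the safe form when the storage type is uncertain.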


Create a matrix containing the measurements without any metadata to prepare for clustering. Here we deliberately do not scale the data to get preliminary results.

```{r}
# Prepare dataset for clustering selecting specific columns of interest and putting in a matrix
pixl_trim.mat <- pixl.df %>%
  dplyr::select(c("Na20","Mgo","Al203","Si02",
                  "P205","S03","Cl","K20","Cao","Ti02",
                  "Cr203","Mno","FeO-T")) %>%
  as.matrix()
summary(pixl_trim.mat)
```
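
For comparison, here is a sketch of what scaling would do if we chose to apply it. It uses base R's `scale()` on a toy matrix with invented values (two columns of very different magnitude), not on the PIXL data.

```r
# Toy matrix with invented values; column names borrowed from the PIXL data
toy.mat <- cbind(Si02 = c(40, 45, 50, 55),
                 Ti02 = c(0.1, 0.3, 0.2, 0.4))

# Center each column to mean 0 and rescale it to standard deviation 1
toy_scaled.mat <- scale(toy.mat)

round(colMeans(toy_scaled.mat), 10)   # column means after scaling
apply(toy_scaled.mat, 2, sd)          # column standard deviations after scaling
```

Without scaling, a high-magnitude column dominates the Euclidean distances that k-means relies on; after scaling, every column contributes on the same footing.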

# Clustering

Our first analysis goal is to cluster the mineralogy data using K-means and pick the appropriate number of clusters.

Here we recall the function `wssplot` we created in MATP-4400 (IDM) to examine cluster sizes in order to perform the "elbow" test. The function takes as its arguments a matrix, the maximum number of clusters and a random seed. It creates clusters for each possible value of k and plots the k-means objective function.

NOTE: The basic syntax for creating a user-defined function in R is:

`output <- function(arguments){ do stuff }`
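
As a concrete instance of that syntax, here is a small hypothetical function (the name `col_range` is ours) that returns the range, maximum minus minimum, of each column of a matrix:

```r
# Hypothetical example: range (max minus min) of each column of a matrix
col_range <- function(m){
  apply(m, 2, function(x) max(x) - min(x))
}

# Try it on a small matrix with invented values
col_range(cbind(a = c(1, 4, 2), b = c(10, 10, 30)))
```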

The following plot shows the K-Means objective value for up to eight clusters.

```{r}
# A user-defined function to examine clusters and plot the results
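wssplot <- function(data, nc=15, seed=10){
  # One row per candidate k; `quality` holds the total within-cluster sum of squares
  wss <- data.frame(cluster=1:nc, quality=c(0))
  for (i in 1:nc){
    # Reset the seed so each run starts from a reproducible state
    set.seed(seed)
    wss[i,2] <- kmeans(data, centers=i)$tot.withinss}
  ggplot(data=wss,aes(x=cluster,y=quality)) +
    geom_line() +
    ggtitle("Quality of k-means by Cluster")
}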
# Apply `wssplot()` to our PIXL data
wssplot(pixl_trim.mat, nc=8, seed=2)
```


Based on where the "elbow" occurs, it looks like `3` might be a good `k` choice for k-means clustering.
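
Reading an elbow off a plot is somewhat subjective, so it can help to look at the numbers behind it. The sketch below is one possible check, not part of the original analysis: it uses invented toy data with three well-separated groups and computes how much of the objective each additional cluster removes.

```r
set.seed(1)
# Toy data: three well-separated groups in two dimensions (invented values)
toy.mat <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),
                 matrix(rnorm(40, mean = 10), ncol = 2),
                 matrix(rnorm(40, mean = 20), ncol = 2))

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
  set.seed(2)
  kmeans(toy.mat, centers = k)$tot.withinss
})

# Fraction of the previous objective value removed by adding one more cluster
rel_drop <- -diff(wss) / head(wss, -1)
round(rel_drop, 2)
```

A sharp fall in `rel_drop` after some `k` is the numeric counterpart of the elbow. The same two steps could be applied to `pixl_trim.mat`.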

## k-means Clustering

We create the final clustering with 3 clusters.

```{r}
# Use our chosen 'k' to perform k-means clustering
set.seed(2)
k <- 3
km <- kmeans(pixl_trim.mat,k)
```
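
k-means depends on its random starting centers, so a single run can settle in a poor local optimum. A common safeguard is the `nstart` argument of `kmeans()`, which tries several random starts and keeps the best one. The notebook itself uses a single start; this sketch on invented toy data only illustrates the option.

```r
set.seed(2)
# Toy data: three well-separated groups of 20 points each (invented values)
toy.mat <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),
                 matrix(rnorm(40, mean = 10), ncol = 2),
                 matrix(rnorm(40, mean = 20), ncol = 2))

km_single <- kmeans(toy.mat, centers = 3)               # one random start
km_multi  <- kmeans(toy.mat, centers = 3, nstart = 25)  # best of 25 starts

# Compare the objective values of the two fits
c(single = km_single$tot.withinss, multi = km_multi$tot.withinss)
```

If the two values differ noticeably, the single-start solution was a local optimum.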

## Examine cluster means

Below is a heat map of the cluster centers with rows and columns clustered. We keep the scale the same as in the original data.

```{r}
pheatmap(km$centers,scale="none")
```

Notice how the means of the clusters vary.

## Perform PCA on PIXL Data

We're now ready to perform PCA. Note that we deliberately did not scale the data above, so we set `scale=FALSE` here to keep the PCA consistent with the k-means results.

We first show a [Scree plot](https://en.wikipedia.org/wiki/Scree_plot) to understand the explained variance by principal component. Note the elbow in the Scree plot should roughly match the one you saw in k-means.

```{r}
# Perform the PCA on the matrix `pixl_trim.mat` we created earlier
pixl_trim.mat.pca <- prcomp(pixl_trim.mat, scale=FALSE)
# generate the Scree plot
ggscreeplot(pixl_trim.mat.pca)
```
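
The quantity behind a scree plot can also be computed directly from the `prcomp` object: the variance of each component is the square of its `sdev` entry. A minimal sketch on invented toy data (three columns with different spreads):

```r
set.seed(3)
# Toy data with invented values; columns have deliberately different spreads
toy.mat <- cbind(x = rnorm(50, sd = 5),
                 y = rnorm(50, sd = 2),
                 z = rnorm(50, sd = 0.5))

toy.pca <- prcomp(toy.mat, scale. = FALSE)

# Proportion of total variance explained by each principal component
var_explained <- toy.pca$sdev^2 / sum(toy.pca$sdev^2)
round(var_explained, 3)

# Cumulative proportion, useful for deciding how many components to keep
cumsum(var_explained)
```

Applying the same two lines to `pixl_trim.mat.pca` would give the values plotted above as numbers.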

Make a table indicating how many samples are in each cluster.

```{r}
# cluster sizes are in the km object produced by kmeans
cluster.df<-data.frame(cluster= 1:k, size=km$size)
kable(cluster.df,caption="Samples per cluster")
```
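
As a cross-check, the same counts can be recovered by tabulating the cluster assignment vector. A small sketch on invented toy data (`toy_km` is our name, for illustration):

```r
set.seed(4)
# Toy data: two well-separated groups (invented values)
toy.mat <- rbind(matrix(rnorm(20, mean = 0),  ncol = 2),
                 matrix(rnorm(30, mean = 10), ncol = 2))
toy_km <- kmeans(toy.mat, centers = 2)

table(toy_km$cluster)   # counts derived from the assignment vector
toy_km$size             # counts stored in the kmeans object
```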


## Create a PCA Biplot using ggbiplot

Now we'll create a biplot of the data colored by cluster and label by rock type.

```{r message=FALSE, warning=FALSE}
# For this lab we'll create a PCA biplot the easy way using ggbiplot!
ggbiplot::ggbiplot(pixl_trim.mat.pca,
                   labels = pixl.df$type,
                   groups = as.factor(km$cluster)) +
  xlim(-2,2) + ylim(-2,2)
```
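
The biplot labels can be hard to count by eye. A cross-tabulation of cluster against rock type gives the same information as a table. The sketch below uses invented toy vectors; on the real data the equivalent call would be `table(km$cluster, pixl.df$type)`.

```r
# Invented toy assignments and rock-type labels, for illustration only
toy_cluster <- c(1, 1, 2, 2, 2, 3, 3, 1)
toy_type <- c("Igneous", "Igneous", "Sedimentary", "Sedimentary",
              "Sedimentary", "Igneous", "Unknown", "Igneous")

# Rows are clusters, columns are rock types, cells are sample counts
type_by_cluster <- table(cluster = toy_cluster, type = toy_type)
type_by_cluster
```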

## ANSWER THESE QUESTIONS!

Add a description of each cluster here in your own words.

Cluster 1 Description: Cluster 1 appears to consist of Igneous rock samples with relatively high concentrations of Silicon Dioxide and relatively low concentrations of Manganese (II) Oxide.

Cluster 2 Description: Cluster 2 appears to consist of Sedimentary rock samples with relatively high concentrations of Sulfur Trioxide and Manganese (II) Oxide, as well as relatively low concentrations of Silicon Dioxide, Sodium Oxide, and Calcium Oxide.

Cluster 3 Description: Cluster 3 appears to consist of Igneous and unclassified rock samples with relatively high concentrations of Iron (II) Oxide and relatively moderate concentrations of all other compounds.

It is worth noting that, due to the lack of scaling, the heatmap is fairly difficult to discern for several of the compounds since they generally have much lower concentrations than other compounds.

What do the clustering and PCA results tell us about the data detected by the M20 PIXL experiment? _Feel free to add graphs or analyses to support your conclusions._

The following bar chart conveys similar information to the preceding heatmap except that it does not show the clustering on the rows and columns. However, it does, in my opinion, make it easier to see which cluster has relatively high and low concentrations of each compound, especially for those compounds which have low concentrations across all 3 clusters.

```{r}
# Student's code for graphs and analysis here!
cluster<-as.factor(rep(1:k,ncol(pixl_trim.mat)))
compound<-rep(colnames(pixl_trim.mat),each=k)
concentration<-as.vector(km$centers)

data<-data.frame(compound,cluster,concentration)

ggplot(data,aes(fill=cluster,y=concentration,x=compound))+geom_bar(position="dodge",stat="identity")

```
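
The long-format data frame built above by hand with `rep()` can also be obtained by treating the centers matrix as a table. This is an alternative sketch, shown on an invented toy matrix rather than on `km$centers`; the names `toy_centers` and `centers_long` are ours.

```r
# Toy centers matrix with invented values: rows are clusters, columns are compounds
toy_centers <- matrix(c(50, 40, 45, 0.2, 0.1, 0.9), nrow = 3,
                      dimnames = list(c("1", "2", "3"), c("Si02", "Ti02")))

# as.table() followed by as.data.frame() yields one row per (cluster, compound) pair
centers_long <- as.data.frame(as.table(toy_centers),
                              responseName = "concentration")
names(centers_long)[1:2] <- c("cluster", "compound")
centers_long
```

Because the row and column labels travel with the values, this form avoids having to line up the `rep()` calls with the column-major order of `as.vector()`.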

For instance, we can now see more clearly that cluster 3 has relatively high concentrations of Titanium (IV) Oxide, whereas these concentrations were almost indistinguishable on the heatmap. As mentioned previously, this is largely due to the lack of scaling on the data.

In general, clustering helps us identify patterns in the data; however, in this case, the PCA biplot shows that the samples within each cluster are not actually very close to one another, as exemplified by cluster 2 having samples 'deep' in both quadrants 1 and 4, visually speaking. Thus, this particular clustering is not necessarily very helpful.

## SAVE, COMMIT and PUSH YOUR CHANGES!

When you are satisfied with your edits and your notebook knits successfully, remember to push your changes to the repo using **steps 4-8** in **Section 2.2**, summarized here:

**In the Linux terminal:**

* `git branch`
* To double-check that you are in your working branch
* `git add <your changed files>`
* Your Rmd and knitted HTML (and PDF, if you created one)
* `git commit -m "Some useful comments"`
* `git push origin <your branch name>`

**On github:**

* Log in at https://github.rpi.edu/DataINCITE/DAR-Mars-F24
* Select your branch from drop-down (default is **main**)
* Submit a "pull request" for your branch
* DO NOT MERGE!!!

# APPENDIX: Accessing RStudio Server on the IDEA Cluster

The IDEA Cluster provides seven compute nodes (4x 48 cores, 3x 80 cores, 1x storage server)

* The Cluster requires RCS credentials, enabled via registration in class
* email John Erickson for problems `erickj4@rpi.edu`
* RStudio, Jupyter, MATLAB, GPUs (on two nodes); lots of storage and compute
* Access via RPI physical network or VPN only

# More info about RStudio on our Cluster

## RStudio GUI Access:

* Use:
    * http://lp01.idea.rpi.edu/rstudio-ose/
    * http://lp01.idea.rpi.edu/rstudio-ose-3/
    * http://lp01.idea.rpi.edu/rstudio-ose-6/
    * http://lp01.idea.rpi.edu/rstudio-ose-7/
* Linux terminal accessible from within RStudio "Terminal" or via ssh (below)

## Shared Data on Cluster:

* Users enrolled in DAR have access to `/academics/MATP-4910-F24`
    * Usually DAR users will see a symbolic ("soft") link in their home directories
    * If you do not, type the following in the **Terminal** via RStudio: `ln -s /academics/MATP-4910-F24/ MATP-4910-F24`
* All idea_users have access to shared storage via `/data` ("data" in your home directories)
    * You might wish to use this for data sharing in team projects...
    * ...but we recommend using github for shared code development
* Shell access to nodes: You must access the "landing pad" first, then the compute node:
    * `ssh your_rcs@lp01.idea.rpi.edu` For example: `ssh erickj4@lp01.idea.rpi.edu`
    * Then, `ssh` to the desired compute node, e.g.: `ssh idea-node-02`