diff --git a/StudentNotebooks/Assignment01/vanesm-f24-assignment1.Rmd b/StudentNotebooks/Assignment01/vanesm-f24-assignment1.Rmd new file mode 100644 index 0000000..10eb3a7 --- /dev/null +++ b/StudentNotebooks/Assignment01/vanesm-f24-assignment1.Rmd @@ -0,0 +1,419 @@ +--- +title: "RPI github and Mars 2020 PIXL Notebook:" +subtitle: "DAR Assignment 1" +author: "Margo VanEsselstyn" +date: "`r format(Sys.time(), '%d %B %Y')`" +output: + pdf_document: default + html_document: + toc: true + number_sections: true + df_print: paged +--- +```{r setup, include=FALSE} +# REQUIRE R PACKAGE INSTALLATIONS +# This section installs packages if they are not already installed. +# This block will not be shown in the knitted file. + +# RUN THIS BLOCK BEFORE ATTEMPTING TO KNIT THIS NOTEBOOK!!! + +# Set the default CRAN repository +local({r <- getOption("repos") + r["CRAN"] <- "http://cran.r-project.org" + options(repos=r) +}) + +if (!require("pandoc")) { + install.packages("pandoc") + library(pandoc) +} + +if (!require("knitr")) { + install.packages("knitr") + library(knitr) +} + +# Required packages for M20 LIBS analysis +if (!require("rmarkdown")) { + install.packages("rmarkdown") + library(rmarkdown) +} + +if (!require("tidyverse")) { + install.packages("tidyverse") + library(tidyverse) +} + +if (!require("stringr")) { + install.packages("stringr") + library(stringr) +} + +if (!require("ggbiplot")) { + install.packages("ggbiplot") + library(ggbiplot) +} + +if (!require("pheatmap")) { + install.packages("pheatmap") + library(pheatmap) +} + +if (!require("ggrepel")) { + install.packages("ggrepel") + library(ggrepel) +} + +if (!require("farver")) { + install.packages("farver") + library(farver) +} + +if (!require("labeling")) { + install.packages("labeling") + library(labeling) +} + +knitr::opts_chunk$set(echo = TRUE) + +``` + +# Introductory Data Analytics Research Notebook + +This notebook is broken into two main parts: + +* Part 1: A basic introduction to github and RStudio Server 
+* Part 2: An introduction to the Mars 2020 PIXL dataset + +The RPI github repository for all the code and data required for this notebook may be found at: + +* https://github.rpi.edu/DataINCITE/DAR-Mars-F24 + + +## BEFORE YOU BEGIN: github account setup + +To contribute to any RPI github repository or read private repos you _must_ validate your RPI github.com ID and send a confirmation email to John Erickson at `erickj4@rpi.edu`. Please do the following **now**: + +**Enabling 2FA on the RPI github and saving personal access tokens, et.al.** + +* Browse to http://github.rpi.edu +* Login using your RPI credentials +* Enable github two-factor authentication (2FA) +* Under "Settings" -> "Password and authentication" +* Select "Authenticator app" (Duo or Google authenticator are recommended) + * Follow steps to set up authenticator app; may involve scanning a QR Code) + * See directions for 2FA at https://itssc.rpi.edu/hc/en-us/articles/360004801811-GitHub-Enterprise-Overview#2fa + * **CRITICAL:** Make sure to save your **recovery codes** in a safe place! Recovery codes can be used to access your account in the event you lose access to your device and cannot receive two-factor authentication codes. +* Create and save a *personal access token* + * Under "Settings" -> "Developer settings" + * Select "Personal access tokens" + * Click on "Generate new token (classic)" + * Set an expiration period for the end of the Fall 2024 term + * Enable everything (check the left-most boxes) + * Generate (green button) + * SAVE THE RESULT! You won't be able to see it again... 
+* _Use this token when command-line git asks you for a password_ +* **PLEASE DO THIS IMMEDIATELY BEFORE READING ANY FURTHER!!** + +# DAR ASSIGNMENT 1 (Part 1): CLONING A NOTEBOOK AND UPDATING THE REPOSITORY + +In this assignment we're asking you to + +* clone the `DAR-Mars-F24` github repository, +* create a personal branch using git, +* create a new notebook that includes your answers to questions in this notebook, +* make additions to the repository by adding your notebook to the repository. + +_The instructions which follow explain how to accomplish this._ + +**For DAR Fall 2024** you *must* be using RStudio Server on the IDEA Cluster. Instructions for accessing "The Cluster" appear at the end of this notebook. Don't forget to validate your RPI github ID as above and email `erickj4@rpi.edu` + +### Cloning an RPI github repository + +The recommended procedure for cloning and using this repository is as follows: + +* Access the RPI network via VPN + * See https://itssc.rpi.edu/hc/en-us/articles/360008783172-VPN-Connection-and-Installation for information + +* Access RStudio Server on the IDEA Cluster at http://lp01.idea.rpi.edu/rstudio-ose/ + * You must be on the RPI VPN!! +* Access the Linux shell on the IDEA Cluster by clicking the **Terminal** tab of RStudio Server (lower left panel). + * You now see the Linux shell on the IDEA Cluster + * `cd` (change directory) to enter your home directory using: `cd ~` + * Type `pwd` to confirm + * NOTE: Advanced users may use `ssh` to directly access the Linux shell from a macOS or Linux command line +* Type `git clone https://github.rpi.edu/DataINCITE/DAR-Mars-F24` from within your `home` directory + * Enter your RCS ID and your saved personal access token when asked + * This will create a new directory `DAR-Mars-F24` +* In the Linux shell, `cd` to `DAR-Mars-F24/StudentNotebooks/Assignment01` + * Type `ls -al` to list the current contents + * Don't be surprised if you see many files! 
+* In the Linux shell, type `git checkout -b dar-yourrcs` where `yourrcs` is your RCS id + * For example, if your RCS is `erickj4`, your new branch should be `dar-erickj4` + * It is _critical_ that you include your RCS id in your branch id! +* Back in the RStudio Server UI, navigate to the `DAR-Mars-F24/StudentNotebooks/Assignment01` directory via the **Files** panel (lower right panel) + * Under the **More** menu, set this to be your R working directory + * Setting the correct working directory is essential for interactive R use! + +## REQUIRED FOR ASSIGNMENT 1 + +1. In RStudio, make a **copy** of `dar-f24-assignment1-template.Rmd` file using a *new, original, descriptive* filename that **includes your RCS ID!** + * Open `darf24-assignment1-template.Rmd` + * **Save As...** using a new filename that includes your RCS ID + * Example filename for user `erickj4`: `erickj4-assignment1-f24.Rmd` + * POINTS OFF IF: + * You don't create a new filename! + * You don't include your RCS ID! + * You include `template` in your new filename! +2. Edit your new notebook using RStudio and save + * Change the `title:` and `subtitle:` headers (at the top of the file) + * Change the `author:` + * Don't bother changing the `date:`; it should update automagically... + * **Save** your changes +3. Use the RStudio `Knit` command to create an HTML file; repeat as necessary + * Use the down arrow next to the word `Knit` and select **Knit to HTML** + * You may also knit to PDF... +4. In the Linux terminal, use `git add` to add each new file you want to add to the repository + * Type: `git add yourfilename.Rmd` + * Type: `git add yourfilename.html` (created when you knitted) + * Add your PDF if you also created one... +5. Continue making changes to your personal notebook + * Add code where specified + * Answer questions were indicated. +6. 
When you're ready, in Linux commit your changes: + * Type: `git commit -m "some comment"` where "some comment" is a useful comment describing your changes + * This commits your changes to your local repo, and sets the stage for your next operation. +7. Finally, push your commits to the RPI github repo + * Type: `git push origin dar-yourrcs` (where `dar-yourrcs` is the branch you've been working in) + * Enter your RCS ID and personal access token (as a password) when asked. + * Your changes are now safely on the RPI github. +8. **REQUIRED:** On the RPI github, submit a pull request. + * In a web browser, navigate to https://github.rpi.edu/DataINCITE/DAR-Mars-F24.git + and log in using 2FA + * In the branch selector drop-down (by default says **main**), select your branch + * **Submit a pull request for your branch** + * One of the DAR instructors will merge your branch, and your new files will be added to the master branch of the repo. + +Please also see these handy github "cheatsheets": + + * https://education.github.com/git-cheat-sheet-education.pdf + +# DAR ASSIGNMENT 1 (Part 2): Exploring the Mars 2020 (M20) PIXL Dataset + +This part of the notebook demonstrates some basic analysis of data from the M20 PIXL (Planetary Instrument for X-ray Lithochemistry) experiment. + +PIXL (Planetary Instrument for X-ray Lithochemistry) is a microfocus X-ray fluorescence instrument that measures elemental chemistry at sub-millimeter scales. This is achieved by focusing an X-ray beam to a small spot ~ 150 µm, scanning the surface with this beam, and then measuring the induced X-ray fluorescence. PIXL observations consist of a suite of X-ray fluorescence measurements, context images, and metadata. The XRF measurements can be executed in a variety of geometries depending on target type and available observation time, and are accompanied by a set of images documenting the target and its position relative to the instrument. 
+ +In this notebook we will be looking at pre-processed PIXL data that is ready for your next steps. + +* More about the PIXL instrument: https://an.rsl.wustl.edu/help/Content/About%20the%20mission/M20/Instruments/M20%20PIXL.htm +* Raw PIXL data bundle: https://pds-geosciences.wustl.edu/m2020/urn-nasa-pds-mars2020_pixl/ + +## Load the PIXL Data and display summary + +Here is the Mars PIXL data. Take note of the variables, their types, and distributions. + +```{r} +# Saved PIXL data with locations added + +# NOTE: Use course directory version during the semester +pixl.df<- readRDS("~/DAR-Mars-F24/Data/samples_pixl_wide.Rds") +# Use this version to use downloaded data from github +# pixl.df <- readRDS("~/DAR-Mars-F24/Data/samples_pixl_wide.Rds") +#/academics/MATP-4910-F24/DAR-Mars-F24/Data/samples_pixl_wide.Rds + +# convert location to a number +pixl.df$location <- as.numeric(pixl.df$location ) + +# Automatically converts all strings to factors +pixl.df[sapply(pixl.df, is.character)] <- + lapply(pixl.df[sapply(pixl.df, + is.character)], as.factor) + +# Show summary of the data +summary(pixl.df) + +``` + + +Create a matrix containing the measurements without any metadata to prepare for clustering. Here we deliberately do not scale the data to get preliminary results. + +```{r} +# Prepare dataset for clustering selecting specific columns of interest and putting in a matrix +pixl_trim.mat <- pixl.df %>% + dplyr::select(c("Na20","Mgo","Al203","Si02", + "P205","S03","Cl","K20","Cao","Ti02", + "Cr203","Mno","FeO-T")) %>% as.matrix() +summary(pixl_trim.mat) +``` + +# Clustering + +Our first analysis goal is to cluster the mineralogy data using K-means and pick the appropriate number of clusters. + +Here we recall the function `wssplot` we created in MATP-4400 (IDM) to compare different numbers of clusters in order to perform the "elbow" test. The function takes as its arguments a matrix, the maximum number of clusters and a random seed. 
It creates clusters for each possible value of k and plots the k-means objective function. + +NOTE: The basic syntax for creating a user-defined function in R is: + +`output <- function(arguments){ do stuff }` + +The following plot shows the K-Means objective value for up to eight clusters. + +```{r} +# A user-defined function to examine clusters and plot the results +wssplot <- function(data, nc=15, seed=10){ + wss <- data.frame(cluster=1:nc, quality=c(0)) + for (i in 1:nc){ + set.seed(seed) + wss[i,2] <- kmeans(data, centers=i)$tot.withinss} + ggplot(data=wss,aes(x=cluster,y=quality)) + + geom_line() + + ggtitle("Quality of k-means by Cluster") +} + +# Apply `wssplot()` to our PIXL data +wssplot(pixl_trim.mat, nc=8, seed=2) +``` + + +Based on where the "elbow" occurs, it looks like `3` might be a good `k` choice for k-means clustering. + +## k-means Clustering + +We create the final clustering with 3 clusters. + +```{r} +# Use our chosen 'k' to perform k-means clustering +set.seed(2) +k <- 3 +km <- kmeans(pixl_trim.mat,k) + +``` + +## Examine cluster means + +Below is a heat map of the cluster centers with rows and columns clustered. We keep the scale the same as in the original data. + +```{r} + +pheatmap(km$centers,scale="none") + +``` + +Notice how the means of the clusters vary. + +## Perform PCA on PIXL Data + +We're now ready to perform PCA. Note that we deliberately did not scale the data above, so we set `scale=FALSE` to keep the PCA consistent with the unscaled k-means results. + +We first show a [Scree plot](https://en.wikipedia.org/wiki/Scree_plot) to understand the explained variance by principal component. Note the elbow in the Scree plot should roughly match the one you saw in k-means. + +```{r} +# Perform the PCA on the matrix `pixl_trim.mat` we created earlier + +pixl_trim.mat.pca <- prcomp(pixl_trim.mat, scale=FALSE) + +# generate the Scree plot +ggscreeplot(pixl_trim.mat.pca) +``` + +Make a table indicating how many samples are in each cluster. 
+ +```{r} +# clusters sizes are in the km object produced by kmeans +cluster.df<-data.frame(cluster= 1:3, size=km$size) +kable(cluster.df,caption="Samples per cluster") +``` + + +## Create a PCA Biplot using ggbiplot + +Now we'll create a biplot of the data colored by cluster and label by rock type. + +```{r message=FALSE, warning=FALSE} +# For this lab we'll create a PCA biplot the easy way using ggbiplot! +ggbiplot::ggbiplot(pixl_trim.mat.pca, + labels = pixl.df$type, + groups = as.factor(km$cluster)) + + xlim(-2,2) + ylim(-2,2) + +``` + +## ANSWER THESE QUESTIONS! + +Add a description of each cluster here in your own words. + +Describe Cluster 1: Cluster 1 is made up of igneous rocks with high silicon dioxide + +Describe Cluster 2: Cluster 2 is made up of only sedimentary rocks and contains all the sedimentary samples + +Describe Cluster 3: Cluster 3 is made up of igneous rocks and other samples + + +What do the clustering and PCA results tell us about the data detected by the M20 PIXL experiment? _Feel free to add graphs or analyses to support your conclusions._ + +The clustering and PCA results show us that the amount of silicon dioxide is heavily correlated with whether a rock is igneous or sedimentary, and that the amount of silicon dioxide differs very heavily in the clusters. Maybe if the data was scaled differently this wouldn't necessarily be true though. + +```{r} +# Student's code for graphs and analysis here! + +pheatmap(km$centers, + scale = "none", + main = "Kmeans Cluster Centers Unscaled", + angle_col=0, + cluster_rows = FALSE, + cluster_cols = FALSE + ) +``` + +## SAVE, COMMIT and PUSH YOUR CHANGES! 
+ +When you are satisfied with your edits and your notebook knits successfully, remember to push your changes to the repo using **steps 4-8** in **Section 2.2**, summarized here: + +**In the Linux terminal:** + +* `git branch` + * To double-check that you are in your working branch +* `git add ` + * Your Rmd and knitted PDF +* `git commit -m "Some useful comments"` +* `git push origin ` + +**On github:** + +* Log in at https://github.rpi.edu/DataINCITE/DAR-Mars-F24 +* Select your branch from drop-down (default is **main**) +* Submit a "pull request" for your branch +* DO NOT MERGE!!! + +# APPENDIX: Accessing RStudio Server on the IDEA Cluster + +The IDEA Cluster provides seven compute nodes (4x 48 cores, 3x 80 cores, 1x storage server) + +* The Cluster requires RCS credentials, enabled via registration in class + * email John Erickson for problems `erickj4@rpi.edu` +* RStudio, Jupyter, MATLAB, GPUs (on two nodes); lots of storage and computes +* Access via RPI physical network or VPN only + +# More info about Rstudio on our Cluster + +## RStudio GUI Access: + +* Use: + * http://lp01.idea.rpi.edu/rstudio-ose/ + * http://lp01.idea.rpi.edu/rstudio-ose-3/ + * http://lp01.idea.rpi.edu/rstudio-ose-6/ + * http://lp01.idea.rpi.edu/rstudio-ose-7/ +* Linux terminal accessible from within RStudio "Terminal" or via ssh (below) + +## Shared Data on Cluster: + +* Users enrolled in DAR have access to `/academics/MATP-4910-F24` + * Usually DAR users will see a symbolic ("soft") link in their home directories + * If you do not, type the following in the **Terminal** via RStudio: `ln -s /academics/MATP-4910-F23/ MATP-4910-F24` +* All idea_users have access to shared storage via `/data` ("data" in your home directories) + * You might wish to use this for data sharing in team projects... 
+ * ...but we recommend using github for shared code development +* Shell access to nodes: You must access "landing pad" first, then compute node: +* `ssh your_rcs@lp01.idea.rpi.edu` For example: `ssh erickj4@lp01.idea.rpi.edu` +* Then, `ssh` to the desired compute node, e.g.: `ssh idea-node-02` \ No newline at end of file diff --git a/StudentNotebooks/Assignment01/vanesm-f24-assignment1.html b/StudentNotebooks/Assignment01/vanesm-f24-assignment1.html new file mode 100644 index 0000000..5697c55 --- /dev/null +++ b/StudentNotebooks/Assignment01/vanesm-f24-assignment1.html @@ -0,0 +1,683 @@ + + + + + + + + + + + + + + + +RPI github and Mars 2020 PIXL Example Notebook: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + + +
+

1 Introductory Data Analytics Research Notebook

+

This notebook is broken into two main parts:

+
    +
  • Part 1: A basic introduction to github and RStudio Server
  • +
  • Part 2: An introduction to the Mars 2020 PIXL dataset
  • +
+

The RPI github repository for all the code and data required for this notebook may be found at:

+ +
+

1.1 BEFORE YOU BEGIN: github account setup

+

To contribute to any RPI github repository or read private repos you must validate your RPI github.com ID and send a confirmation email to John Erickson at erickj4@rpi.edu. Please do the following now:

+

Enabling 2FA on the RPI github and saving personal access tokens, et al.

+
    +
  • Browse to http://github.rpi.edu
  • +
  • Log in using your RPI credentials
  • +
  • Enable github two-factor authentication (2FA)
  • +
  • Under “Settings” -> “Password and authentication”
  • +
  • Select “Authenticator app” (Duo or Google authenticator are recommended) +
  • +
  • Create and save a personal access token +
      +
    • Under “Settings” -> “Developer settings”
    • +
    • Select “Personal access tokens”
    • +
    • Click on “Generate new token (classic)”
    • +
    • Set an expiration period for the end of the Fall 2024 term
    • +
    • Enable everything (check the left-most boxes)
    • +
    • Generate (green button)
    • +
    • SAVE THE RESULT! You won’t be able to see it again…
    • +
  • +
  • Use this token when command-line git asks you for a password
  • +
  • PLEASE DO THIS IMMEDIATELY BEFORE READING ANY FURTHER!!
  • +
+
+
+
+

2 DAR ASSIGNMENT 1 (Part 1): CLONING A NOTEBOOK AND UPDATING THE REPOSITORY

+

In this assignment we’re asking you to

+
    +
  • clone the DAR-Mars-F24 github repository,
  • +
  • create a personal branch using git,
  • +
  • create a new notebook that includes your answers to questions in this notebook,
  • +
  • make additions to the repository by adding your notebook to the repository.
  • +
+

The instructions which follow explain how to accomplish this.

+

For DAR Fall 2024 you must be using RStudio Server on the IDEA Cluster. Instructions for accessing “The Cluster” appear at the end of this notebook. Don’t forget to validate your RPI github ID as above and email erickj4@rpi.edu

+
+

2.0.1 Cloning an RPI github repository

+

The recommended procedure for cloning and using this repository is as follows:

+
    +
  • Access the RPI network via VPN +
  • +
  • Access RStudio Server on the IDEA Cluster at http://lp01.idea.rpi.edu/rstudio-ose/ +
      +
    • You must be on the RPI VPN!!
    • +
  • +
  • Access the Linux shell on the IDEA Cluster by clicking the Terminal tab of RStudio Server (lower left panel). +
      +
    • You now see the Linux shell on the IDEA Cluster
    • +
    • cd (change directory) to enter your home directory using: cd ~
    • +
    • Type pwd to confirm
    • +
    • NOTE: Advanced users may use ssh to directly access the Linux shell from a macOS or Linux command line
    • +
  • +
  • Type git clone https://github.rpi.edu/DataINCITE/DAR-Mars-F24 from within your home directory +
      +
    • Enter your RCS ID and your saved personal access token when asked
    • +
    • This will create a new directory DAR-Mars-F24
    • +
  • +
  • In the Linux shell, cd to DAR-Mars-F24/StudentNotebooks/Assignment01 +
      +
    • Type ls -al to list the current contents
    • +
    • Don’t be surprised if you see many files!
    • +
  • +
  • In the Linux shell, type git checkout -b dar-yourrcs where yourrcs is your RCS id +
      +
    • For example, if your RCS is erickj4, your new branch should be dar-erickj4
    • +
    • It is critical that you include your RCS id in your branch id!
    • +
  • +
  • Back in the RStudio Server UI, navigate to the DAR-Mars-F24/StudentNotebooks/Assignment01 directory via the Files panel (lower right panel) +
      +
    • Under the More menu, set this to be your R working directory
    • +
    • Setting the correct working directory is essential for interactive R use!
    • +
  • +
+
+
+

2.1 REQUIRED FOR ASSIGNMENT 1

+
    +
  1. In RStudio, make a copy of dar-f24-assignment1-template.Rmd file using a new, original, descriptive filename that includes your RCS ID! +
      +
    • Open darf24-assignment1-template.Rmd
    • +
    • Save As… using a new filename that includes your RCS ID
    • +
    • Example filename for user erickj4: erickj4-assignment1-f24.Rmd
    • +
    • POINTS OFF IF: +
        +
      • You don’t create a new filename!
      • +
      • You don’t include your RCS ID!
      • +
      • You include template in your new filename!
      • +
    • +
  2. +
  3. Edit your new notebook using RStudio and save +
      +
    • Change the title: and subtitle: headers (at the top of the file)
    • +
    • Change the author:
    • +
    • Don’t bother changing the date:; it should update automagically…
    • +
    • Save your changes
    • +
  4. +
  5. Use the RStudio Knit command to create an HTML file; repeat as necessary +
      +
    • Use the down arrow next to the word Knit and select Knit to HTML
    • +
    • You may also knit to PDF…
    • +
  6. +
  7. In the Linux terminal, use git add to add each new file you want to add to the repository +
      +
    • Type: git add yourfilename.Rmd
    • +
    • Type: git add yourfilename.html (created when you knitted)
    • +
    • Add your PDF if you also created one…
    • +
  8. +
  9. Continue making changes to your personal notebook +
      +
    • Add code where specified
    • +
    • Answer questions where indicated.
    • +
  10. +
  11. When you’re ready, in Linux commit your changes: +
      +
    • Type: git commit -m "some comment" where “some comment” is a useful comment describing your changes
    • +
    • This commits your changes to your local repo, and sets the stage for your next operation.
    • +
  12. +
  13. Finally, push your commits to the RPI github repo +
      +
    • Type: git push origin dar-yourrcs (where dar-yourrcs is the branch you’ve been working in)
    • +
    • Enter your RCS ID and personal access token (as a password) when asked.
    • +
    • Your changes are now safely on the RPI github.
    • +
  14. +
  15. REQUIRED: On the RPI github, submit a pull request. +
      +
    • In a web browser, navigate to https://github.rpi.edu/DataINCITE/DAR-Mars-F24.git and log in using 2FA
    • +
    • In the branch selector drop-down (by default says main), select your branch
    • +
    • Submit a pull request for your branch
    • +
    • One of the DAR instructors will merge your branch, and your new files will be added to the main branch of the repo.
    • +
  16. +
+

Please also see these handy github “cheatsheets”:

+ +
+
+
+

3 DAR ASSIGNMENT 1 (Part 2): Exploring the Mars 2020 (M20) PIXL Dataset

+

This part of the notebook demonstrates some basic analysis of data from the M20 PIXL (Planetary Instrument for X-ray Lithochemistry) experiment.

+

PIXL (Planetary Instrument for X-ray Lithochemistry) is a microfocus X-ray fluorescence instrument that measures elemental chemistry at sub-millimeter scales. This is achieved by focusing an X-ray beam to a small spot ~ 150 µm, scanning the surface with this beam, and then measuring the induced X-ray fluorescence. PIXL observations consist of a suite of X-ray fluorescence measurements, context images, and metadata. The XRF measurements can be executed in a variety of geometries depending on target type and available observation time, and are accompanied by a set of images documenting the target and its position relative to the instrument.

+

In this notebook we will be looking at pre-processed PIXL data that is ready for your next steps.

+ +
+

3.1 Load the PIXL Data and display summary

+

Here is the Mars PIXL data. Take note of the variables, their types, and distributions.

+
# Saved PIXL data with locations added
+
+# NOTE: Use course directory version during the semester
+pixl.df<- readRDS("~/DAR-Mars-F24/Data/samples_pixl_wide.Rds")
+# Use this version to  use downloaded data from github
+# pixl.df <- readRDS("~/DAR-Mars-F24/Data/samples_pixl_wide.Rds")
+#/academics/MATP-4910-F24/DAR-Mars-F24/Data/samples_pixl_wide.Rds
+
+# convert location to a number
+pixl.df$location <- as.numeric(pixl.df$location )
+
+# Automatically converts all strings to factors
+pixl.df[sapply(pixl.df, is.character)] <-
+  lapply(pixl.df[sapply(pixl.df, 
+                                  is.character)], as.factor)
+
+# Show summary of the data 
+summary(pixl.df)
+
##      sample           Na20            Mgo             Al203       
+##  Min.   : 1.00   Min.   :1.000   Min.   : 0.730   Min.   : 1.700  
+##  1st Qu.: 4.75   1st Qu.:1.853   1st Qu.: 2.533   1st Qu.: 2.220  
+##  Median : 8.50   Median :1.900   Median :12.800   Median : 3.710  
+##  Mean   : 8.50   Mean   :2.672   Mean   :11.682   Mean   : 5.072  
+##  3rd Qu.:12.25   3rd Qu.:4.500   3rd Qu.:19.100   3rd Qu.: 7.117  
+##  Max.   :16.00   Max.   :5.550   Max.   :22.700   Max.   :11.600  
+##                                                                   
+##       Si02            P205             S03               Cl       
+##  Min.   :22.60   Min.   :0.1000   Min.   : 0.780   Min.   :0.400  
+##  1st Qu.:31.22   1st Qu.:0.2350   1st Qu.: 1.495   1st Qu.:0.940  
+##  Median :38.85   Median :0.5250   Median : 2.600   Median :1.740  
+##  Mean   :38.55   Mean   :0.6512   Mean   : 5.562   Mean   :1.846  
+##  3rd Qu.:41.17   3rd Qu.:0.8400   3rd Qu.: 3.800   3rd Qu.:2.080  
+##  Max.   :57.10   Max.   :2.7600   Max.   :21.530   Max.   :4.500  
+##                                                                   
+##       K20              Cao             Ti02            Cr203      
+##  Min.   :0.0000   Min.   :1.500   Min.   :0.2000   Min.   :0.000  
+##  1st Qu.:0.1600   1st Qu.:2.655   1st Qu.:0.5900   1st Qu.:0.025  
+##  Median :0.2000   Median :3.120   Median :0.7000   Median :0.155  
+##  Mean   :0.5800   Mean   :3.688   Mean   :0.8194   Mean   :0.355  
+##  3rd Qu.:0.8275   3rd Qu.:4.310   3rd Qu.:0.9900   3rd Qu.:0.290  
+##  Max.   :1.9000   Max.   :7.770   Max.   :2.4900   Max.   :1.900  
+##                                                                   
+##       Mno             FeO-T               name             type  
+##  Min.   :0.1000   Min.   :13.24   Atsah     : 1   Igneous    :8  
+##  1st Qu.:0.2800   1st Qu.:16.71   Bearwallow: 1   N/A        :1  
+##  Median :0.4000   Median :23.86   Coulettes : 1   Sedimentary:7  
+##  Mean   :0.3812   Mean   :21.45   Hahonih   : 1                  
+##  3rd Qu.:0.4900   3rd Qu.:25.70   Hazeltop  : 1                  
+##  Max.   :0.6900   Max.   :30.05   Kukaklek  : 1                  
+##                                   (Other)   :10                  
+##          campaign    location             abrasion
+##  Crater Floor:9   Min.   : 1.00   Alfalfa     :2  
+##  Delta Front :7   1st Qu.: 4.75   Bellegrade  :2  
+##                   Median : 8.50   Berry Hollow:2  
+##                   Mean   : 8.50   Dourbes     :2  
+##                   3rd Qu.:12.25   Novarupta   :2  
+##                   Max.   :16.00   Quartier    :2  
+##                                   (Other)     :4
+
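The string-to-factor conversion in the chunk above can be tried in isolation. The following is a minimal sketch on a made-up three-row data frame (`toy.df` and all of its values are hypothetical, not PIXL data). It computes the character-column mask once with `vapply()` rather than calling `sapply()` twice, which is one possible tidy-up and not the notebook's own code.

```r
# Hypothetical toy data standing in for pixl.df (values are made up)
toy.df <- data.frame(
  name = c("A", "B", "C"),
  type = c("Igneous", "Sedimentary", "Igneous"),
  Si02 = c(40.1, 35.2, 50.3),
  stringsAsFactors = FALSE
)

# Logical mask marking which columns hold character strings
is_chr <- vapply(toy.df, is.character, logical(1))

# Convert only those columns to factors; numeric columns are left untouched
toy.df[is_chr] <- lapply(toy.df[is_chr], as.factor)

# Inspect the resulting column types
str(toy.df)
```

After the conversion, `summary()` reports level counts for the factor columns, which is why `type`, `campaign`, and `abrasion` appear as counts in the output above.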

Create a matrix containing the measurements without any metadata to prepare for clustering. Here we deliberately do not scale the data to get preliminary results.

+
# Prepare dataset for clustering selecting specific columns of interest and putting in a matrix
+pixl_trim.mat <- pixl.df %>% 
+  dplyr::select(c("Na20","Mgo","Al203","Si02",
+           "P205","S03","Cl","K20","Cao","Ti02",
+           "Cr203","Mno","FeO-T")) %>% as.matrix() 
+summary(pixl_trim.mat)
+
##       Na20            Mgo             Al203             Si02      
+##  Min.   :1.000   Min.   : 0.730   Min.   : 1.700   Min.   :22.60  
+##  1st Qu.:1.853   1st Qu.: 2.533   1st Qu.: 2.220   1st Qu.:31.22  
+##  Median :1.900   Median :12.800   Median : 3.710   Median :38.85  
+##  Mean   :2.672   Mean   :11.682   Mean   : 5.072   Mean   :38.55  
+##  3rd Qu.:4.500   3rd Qu.:19.100   3rd Qu.: 7.117   3rd Qu.:41.17  
+##  Max.   :5.550   Max.   :22.700   Max.   :11.600   Max.   :57.10  
+##       P205             S03               Cl             K20        
+##  Min.   :0.1000   Min.   : 0.780   Min.   :0.400   Min.   :0.0000  
+##  1st Qu.:0.2350   1st Qu.: 1.495   1st Qu.:0.940   1st Qu.:0.1600  
+##  Median :0.5250   Median : 2.600   Median :1.740   Median :0.2000  
+##  Mean   :0.6512   Mean   : 5.562   Mean   :1.846   Mean   :0.5800  
+##  3rd Qu.:0.8400   3rd Qu.: 3.800   3rd Qu.:2.080   3rd Qu.:0.8275  
+##  Max.   :2.7600   Max.   :21.530   Max.   :4.500   Max.   :1.9000  
+##       Cao             Ti02            Cr203            Mno        
+##  Min.   :1.500   Min.   :0.2000   Min.   :0.000   Min.   :0.1000  
+##  1st Qu.:2.655   1st Qu.:0.5900   1st Qu.:0.025   1st Qu.:0.2800  
+##  Median :3.120   Median :0.7000   Median :0.155   Median :0.4000  
+##  Mean   :3.688   Mean   :0.8194   Mean   :0.355   Mean   :0.3812  
+##  3rd Qu.:4.310   3rd Qu.:0.9900   3rd Qu.:0.290   3rd Qu.:0.4900  
+##  Max.   :7.770   Max.   :2.4900   Max.   :1.900   Max.   :0.6900  
+##      FeO-T      
+##  Min.   :13.24  
+##  1st Qu.:16.71  
+##  Median :23.86  
+##  Mean   :21.45  
+##  3rd Qu.:25.70  
+##  Max.   :30.05
+
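Because the matrix is left unscaled, oxides with large ranges (such as `Si02` or `FeO-T` in the summary above) contribute far more to Euclidean distances than trace oxides such as `Cr203`. The sketch below illustrates what standardizing would change, using base R's `scale()` on made-up numbers. `toy.mat` is hypothetical and only mimics the rough magnitudes seen above; it is not the PIXL matrix.

```r
# Hypothetical toy matrix: one large-range column and one small-range column
set.seed(1)
toy.mat <- cbind(
  Si02  = rnorm(16, mean = 38,   sd = 9),
  Cr203 = rnorm(16, mean = 0.35, sd = 0.5)
)

# Variance of each column before scaling
raw_var <- apply(toy.mat, 2, var)

# scale() centers each column to mean 0 and rescales it to unit variance
toy_scaled.mat <- scale(toy.mat)
scaled_var <- apply(toy_scaled.mat, 2, var)

# Compare the two: after scaling, every column has variance 1
print(round(rbind(raw = raw_var, scaled = scaled_var), 3))
```

Re-running the clustering on a scaled copy of `pixl_trim.mat` would be one way to check how sensitive the clusters are to this choice.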
+
+
+

4 Clustering

+

Our first analysis goal is to cluster the mineralogy data using K-means and pick the appropriate number of clusters.

+

Here we recall the function wssplot we created in MATP-4400 (IDM) to compare different numbers of clusters in order to perform the “elbow” test. The function takes as its arguments a matrix, the maximum number of clusters, and a random seed. It creates clusters for each value of k up to that maximum and plots the k-means objective function.

+

NOTE: The basic syntax for creating a user-defined function in R is:

+

output <- function(arguments){ do stuff }

+

The following plot shows the K-Means objective value for up to eight clusters.

+
# A user-defined function to examine clusters and plot the results
+wssplot <- function(data, nc=15, seed=10){
+  wss <- data.frame(cluster=1:nc, quality=c(0))
+  for (i in 1:nc){
+    set.seed(seed)
+    wss[i,2] <- kmeans(data, centers=i)$tot.withinss}
+  ggplot(data=wss,aes(x=cluster,y=quality)) + 
+    geom_line() + 
+    ggtitle("Quality of k-means by Cluster")
+}
+
+# Apply `wssplot()` to our PIXL data
+wssplot(pixl_trim.mat, nc=8, seed=2) 
+

+

Based on where the “elbow” occurs, it looks like 3 might be a good k choice for k-means clustering.
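The elbow can also be read off numerically instead of by eye. The following sketch computes the total within-cluster sum of squares for k = 1 to 8 on hypothetical toy data (three well-separated groups, not the PIXL matrix) and prints the fractional drop at each step. The `nstart = 10` argument is an addition assumed here to make the result less sensitive to the random starting centers; `wssplot` above does not use it.

```r
# Hypothetical toy data: three well-separated 2-D groups of 20 points each
set.seed(2)
toy.mat <- rbind(
  matrix(rnorm(40, mean = 0),  ncol = 2),
  matrix(rnorm(40, mean = 5),  ncol = 2),
  matrix(rnorm(40, mean = 10), ncol = 2)
)

# Total within-cluster sum of squares (the k-means objective) for k = 1..8
wss <- sapply(1:8, function(k) {
  set.seed(2)
  kmeans(toy.mat, centers = k, nstart = 10)$tot.withinss
})

# Fractional reduction in the objective when going from k-1 to k clusters
rel_drop <- -diff(wss) / head(wss, -1)

print(round(data.frame(k = 2:8, rel_drop = rel_drop), 3))
```

A reasonable reading is to pick the last k before the drops become small; on real data this is a judgment call, not a hard rule.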

+
+

4.1 k-means Clustering

+

We create the final clustering with 3 clusters.

+
# Use our chosen 'k' to perform k-means clustering
+set.seed(2)
+k <- 3
+km <- kmeans(pixl_trim.mat,k)
+
+
+

4.2 Examine cluster means

+

Below is a heat map of the cluster centers with rows and columns clustered. We keep the scale the same as in the original data.

+
pheatmap(km$centers,scale="none")
+

+

Notice how the means of the clusters vary.
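To put numbers on how the cluster means vary, one option is to compute, for each feature, the range of its center values across clusters. The sketch below uses synthetic data; the feature names A, B, C and the object km_toy are hypothetical stand-ins.

```r
set.seed(2)
# Synthetic stand-in data: 20 samples, 3 features with hypothetical names
toy <- matrix(rnorm(60), ncol = 3,
              dimnames = list(NULL, c("A", "B", "C")))
# Shift the first 10 samples on feature A so that A separates two groups
toy[1:10, "A"] <- toy[1:10, "A"] + 6

km_toy <- kmeans(toy, centers = 2, nstart = 10)

# For each feature, the spread (max - min) of its center across clusters
spread <- apply(km_toy$centers, 2, function(x) diff(range(x)))
print(sort(spread, decreasing = TRUE))
```

Features with the largest spread are the ones that differ most between cluster centers, which is the same information the heat map shows by color.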

+
+
+

4.3 Perform PCA on PIXL Data

+

We’re now ready to perform PCA. Note that we have already scaled the data, so we set scale=FALSE.

+

We first show a Scree plot to understand the explained variance by principal component. Note the elbow in the Scree plot should roughly match the one you saw in k-means.

+
# Perform the PCA on the matrix `pixl_trim.mat` we created earlier
+
+pixl_trim.mat.pca <- prcomp(pixl_trim.mat, scale=FALSE)
+
+# generate the Scree plot
+ggscreeplot(pixl_trim.mat.pca)
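The values behind a scree plot can also be computed directly from the prcomp result: the variance of each component is the square of its sdev entry. The sketch below shows this on synthetic data; toy and pca are illustrative names, not objects from this notebook.

```r
set.seed(1)
# Synthetic stand-in data: 50 samples, 4 features
toy <- matrix(rnorm(200), ncol = 4)
# Make two columns strongly correlated so the first component dominates
toy[, 2] <- toy[, 1] + 0.1 * toy[, 2]

pca <- prcomp(toy, scale. = FALSE)

# Proportion of total variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Cumulative proportion, useful for deciding how many components to keep
print(round(cumsum(var_explained), 3))
```

summary(pca) reports the same proportions in its importance table.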
+

+

Make a table indicating how many samples are in each cluster.

+
# cluster sizes are in the km object produced by kmeans
+cluster.df<-data.frame(cluster= 1:3, size=km$size)
+kable(cluster.df,caption="Samples per cluster")
Table: Samples per cluster

| cluster| size|
|-------:|----:|
|       1|    3|
|       2|    7|
|       3|    6|
+
+
+

4.4 Create a PCA Biplot using ggbiplot

+

Now we’ll create a biplot of the data, colored by cluster and labeled by rock type.

+
# For this lab we'll create a PCA biplot the easy way using ggbiplot!
+ggbiplot::ggbiplot(pixl_trim.mat.pca,
+                   labels = pixl.df$type,
+                   groups = as.factor(km$cluster)) +
+  xlim(-2,2) + ylim(-2,2) 
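When describing the clusters, a cross-tabulation of cluster membership against rock type can complement the biplot. The sketch below uses synthetic data and made-up type labels ("igneous", "sedimentary"); with the real data you would presumably use km$cluster and pixl.df$type instead.

```r
set.seed(2)
# Synthetic stand-ins: 16 samples, 3 features, and a made-up rock type label
toy <- matrix(rnorm(48), ncol = 3)
type <- factor(rep(c("igneous", "sedimentary"), each = 8))

km_toy <- kmeans(toy, centers = 3, nstart = 10)

# Rows are clusters, columns are rock types
xtab <- table(cluster = km_toy$cluster, type = type)
print(xtab)

# The row totals reproduce the cluster sizes reported by kmeans()
stopifnot(all(rowSums(xtab) == km_toy$size))
```

A cluster whose row is concentrated in one column is dominated by a single rock type, which gives a starting point for its description.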
+

+
+
+

4.5 ANSWER THESE QUESTIONS!

+

Add a description of each cluster here in your own words.

+

Describe Cluster 1: Your description here

+

Describe Cluster 2: Your description here

+

Describe Cluster 3: Your description here

+

What do the clustering and PCA results tell us about the data detected by the M20 PIXL experiment? Feel free to add graphs or analyses to support your conclusions.

+
# Student's code for graphs and analysis here!
+
+
+

4.6 SAVE, COMMIT and PUSH YOUR CHANGES!

+

When you are satisfied with your edits and your notebook knits successfully, remember to push your changes to the repo using steps 4-8 in Section 2.2, summarized here:

+

In the Linux terminal:

+
  • git branch
    • To double-check that you are in your working branch
  • git add <your changed files>
    • Your Rmd and knitted PDF
  • git commit -m "Some useful comments"
  • git push origin <your branch name>

On github:

+ +
+
+
+

5 APPENDIX: Accessing RStudio Server on the IDEA Cluster

+

The IDEA Cluster provides seven compute nodes (4x 48 cores, 3x 80 cores, 1x storage server)

+
  • The Cluster requires RCS credentials, enabled via registration in class
    • For problems, email John Erickson: erickj4@rpi.edu
  • RStudio, Jupyter, MATLAB, GPUs (on two nodes); lots of storage and compute
  • Access via RPI physical network or VPN only
+
+

6 More info about RStudio on our Cluster

+
+

6.1 RStudio GUI Access:

+ +
+
+

6.2 Shared Data on Cluster:

+
  • Users enrolled in DAR have access to /academics/MATP-4910-F24
    • Usually DAR users will see a symbolic (“soft”) link in their home directories
    • If you do not, type the following in the Terminal via RStudio: ln -s /academics/MATP-4910-F24/ MATP-4910-F24
  • All idea_users have access to shared storage via /data (“data” in your home directories)
    • You might wish to use this for data sharing in team projects…
    • …but we recommend using github for shared code development
  • Shell access to nodes: you must access the “landing pad” first, then the compute node:
    • ssh your_rcs@lp01.idea.rpi.edu (for example: ssh erickj4@lp01.idea.rpi.edu)
    • Then, ssh to the desired compute node, e.g.: ssh idea-node-02
+
+ + + + +
+ + + + + + + + + + + + + + + diff --git a/StudentNotebooks/Assignment01/vanesm-f24-assignment1.pdf b/StudentNotebooks/Assignment01/vanesm-f24-assignment1.pdf new file mode 100644 index 0000000..d9eea0e Binary files /dev/null and b/StudentNotebooks/Assignment01/vanesm-f24-assignment1.pdf differ diff --git a/StudentNotebooks/Assignment02/vanesm-f24-assignment2.Rmd b/StudentNotebooks/Assignment02/vanesm-f24-assignment2.Rmd new file mode 100644 index 0000000..fd98081 --- /dev/null +++ b/StudentNotebooks/Assignment02/vanesm-f24-assignment2.Rmd @@ -0,0 +1,404 @@ +--- +title: "Mars 2020 Mission Data Notebook: LIBS Data" +subtitle: "DAR Assignment 2" +author: "Margo VanEsselstyn" +date: "`r format(Sys.time(), '%d %B %Y')`" +output: + pdf_document: default + html_document: + toc: true + number_sections: true + df_print: paged +--- +```{r setup, include=FALSE} + +# Required R package installation; RUN THIS BLOCK BEFORE ATTEMPTING TO KNIT THIS NOTEBOOK!!! +# This section install packages if they are not already installed. +# This block will not be shown in the knit file. 
+knitr::opts_chunk$set(echo = TRUE) + +# Set the default CRAN repository +local({r <- getOption("repos") + r["CRAN"] <- "http://cran.r-project.org" + options(repos=r) +}) + +if (!require("pandoc")) { + install.packages("pandoc") + library(pandoc) +} + +# Required packages for M20 LIBS analysis +if (!require("rmarkdown")) { + install.packages("rmarkdown") + library(rmarkdown) +} +if (!require("tidyverse")) { + install.packages("tidyverse") + library(tidyverse) +} +if (!require("stringr")) { + install.packages("stringr") + library(stringr) +} + +if (!require("ggbiplot")) { + install.packages("ggbiplot") + library(ggbiplot) +} + +if (!require("pheatmap")) { + install.packages("pheatmap") + library(pheatmap) +} + +if(!require("vegan")) { + install.packages("vegan") + library(vegan) +} + +if(!require("knitr")){ + install.packages("knitr") + library(knitr) +} + +``` + +# DAR ASSIGNMENT 2 (Introduction): Introductory DAR Notebook + +This notebook is broken into two main parts: + +* **Part 1:** Preparing your local repo for **DAR Assignment 2** +* **Part 2:** Loading and some analysis of the Mars 2020 (M20) Datasets + * Lithology: _Summarizes the mineral characteristics of samples collected at certain sample locations._ + * PIXL: Planetary Instrument for X-ray Lithochemistry. _Measures elemental chemistry of samples at sub-millimeter scales of samples._ + * SHERLOC: Scanning Habitable Environments with Raman and Luminescence for Organics and Chemicals. _Uses cameras, a spectrometer, and a laser of samples to search for organic compounds and minerals that have been altered in watery environments and may be signs of past microbial life._ + * LIBS: Laser-induced breakdown spectroscopy. 
_Uses a laser beam to help identify minerals in samples and other areas that are beyond the reach of the rover's robotic arm or in areas too steep for the rover to travel._ + +* **Part 3:** Individual analysis of your team's dataset + +* **Part 4:** Preparation of Team Presentation + + +**NOTE:** The RPI github repository for all the code and data required for this notebook may be found at: + +* https://github.rpi.edu/DataINCITE/DAR-Mars-F24 + + +# DAR ASSIGNMENT 2 (Part 1): Preparing your local repo for Assignment 2 + +In this assignment you'll start by making a copy of the Assignment 2 template notebook, then you'll add to your copy with your original work. The instructions which follow explain how to accomplish this. + +**NOTE:** You already cloned the `DAR-Mars-F24` repository for Assignment 1; you **do not** need to make another clone of the repo, but you must begin by updating your copy as instructed below: + +## Updating your local clone of the `DAR-Mars-F24` repository + +* Access RStudio Server on the IDEA Cluster at http://lp01.idea.rpi.edu/rstudio-ose/ + * REMINDER: You must be on the RPI VPN!! +* Access the Linux shell on the IDEA Cluster by clicking the **Terminal** tab of RStudio Server (lower left panel). + * You now see the Linux shell on the IDEA Cluster + * `cd` (change directory) to enter your home directory using: `cd ~` + * Type `pwd` to confirm where you are +* In the Linux shell, `cd` to `DAR-Mars-F24` + * Type `git pull origin main` to pull any updates + * Always do this when you being work; we might have added or changed something! +* In the Linux shell, `cd` into `Assignment02` + * Type `ls -al` to list the current contents + * Don't be surprised if you see many files! 
+* In the Linux shell, type `git branch` to verify your current working branch + * If it is not `dar-yourrcs`, type `git checkout dar-yourrcs` (where `yourrcs` is your RCS id) + * Re-type `git branch` to confirm +* Now in the RStudio Server UI, navigate to the `DAR-Mars-F24/StudentNotebooks/Assignment02` directory via the **Files** panel (lower right panel) + * Under the **More** menu, set this to be your R working directory + * Setting the correct working directory is essential for interactive R use! + +You're now ready to start coding Assignment 2! + +## Creating your copy of the Assignment 2 notebook + +1. In RStudio, make a **copy** of `dar-f24-assignment2-template.Rmd` file using a *new, original, descriptive* filename that **includes your RCS ID!** + * Open `dar-f24-assignment2-template.Rmd` + * **Save As...** using a new filename that includes your RCS ID + * Example filename for user `erickj4`: `erickj4-assignment2-f24.Rmd` + * POINTS OFF IF: + * You don't create a new filename! + * You don't include your RCS ID! + * You include `template` in your new filename! +2. Edit your new notebook using RStudio and save + * Change the `title:` and `subtitle:` headers (at the top of the file) + * Change the `author:` + * Don't bother changing the `date:`; it should update automagically... + * **Save** your changes +3. Use the RStudio `Knit` command to create an PDF file; repeat as necessary + * Use the down arrow next to the word `Knit` and select **Knit to PDF** + * You may also knit to HTML... +4. In the Linux terminal, use `git add` to add each new file you want to add to the repository + * Type: `git add yourfilename.Rmd` + * Type: `git add yourfilename.pdf` (created when you knitted) + * Add your HTML if you also created one... +5. 
When you're ready, in Linux commit your changes: + * Type: `git commit -m "some comment"` where "some comment" is a useful comment describing your changes + * This commits your changes to your local repo, and sets the stage for your next operation. +6. Finally, push your commits to the RPI github repo + * Type: `git push origin dar-yourrcs` (where `dar-yourrcs` is the branch you've been working in) + * Your changes are now safely on the RPI github. +7. **REQUIRED:** On the RPI github, **submit a pull request.** + * In a web browser, navigate to https://github.rpi.edu/DataINCITE/DAR-Mars-F24 + * In the branch selector drop-down (by default says **master**), select your branch + * **Submit a pull request for your branch** + * One of the DAR instructors will merge your branch, and your new files will be added to the master branch of the repo. _Do not merge your branch yourself!_ + +# DAR ASSIGNMENT 2 (Part 2): Loading the Mars 2020 (M20) Datasets + +In this assignment there are four datasets from separate instruments on the Mars Perserverance rover available for analysis: + +* **Lithology:** Summarizes the mineral characteristics of samples collected at certain sample locations +* **PIXL:** Planetary Instrument for X-ray Lithochemistry of collected samples +* **SHERLOC:** Scanning Habitable Environments with Raman and Luminescence for Organics and Chemicals for collected samples +* **LIBS:** Laser-induced breakdown spectroscopy which are measured in many areas (not just samples) + +Each dataset provides data about the mineralogy of the surface of Mars. Based on the purpose and nature of the instrument, the data is collected at different intervals along the path of Perseverance as it makes it way across the Jezero crater. Some of the data (esp. LIBS) is collected almost every Martian day, or _sol_. 
Some of the data (PIXL and SHERLOC) is only collected at certain sample locations of interest + +Your objective is to perform an analysis of the your team's assigned dataset in order to learn all you can about these Mars samples. + +NOTES: + + * All of these datasets can be found in `/academics/MATP-4910-F24/DAR-Mars-F24/Data` + * We have included a comprehensive `samples.Rds` dataset that includes useful details about the sample locations, including Martian latitude and longitude and the sol that individual samples were collected. + * Also included is `rover.waypoints.Rds` that provides detailed location information (lat/lon) for the Perseverance rover throughout its journey, up to the present. This can be updated when necessary using the included `roverStatus-f24.R` script. + * A general guide to the available Mars 2020 data is available here: https://pds-geosciences.wustl.edu/missions/mars2020/index.htm + * Other useful MARS 2020 sites + https://science.nasa.gov/mission/mars-2020-perseverance/mars-rock-samples/ and https://an.rsl.wustl.edu/m20/AN/an3.aspx?AspxAutoDetectCookieSupport=1 + * Note that PIXL, SHERLOC, and Lithology describe 16 sample that were physically collected. There will eventually be 38 samples. These datasets can be merged by sample. The LIBS data includes observations collected at many more locations so how to combine the LIBS data with the other datasets is an open research question. + +## Data Set A: Load the Lithology Data + +The first five features of the dataset describe twenty-four (24) rover sample locations. + +The remaining features provides a simple binary (`1` or `0`) summary of presence or absence of 35 minerals at the 24 rover sample locations. + +Only the first sixteen (16) samples are maintained, as the remaining are missing the mineral descriptors. + +The following code "cleans" the dataset to prepare for analysis. 
It first creates a dataframe with metadata and measurements for samples, and then creates a matrix containing only numeric measurements for later analysis. + +## Data Set B: Load the PIXL Data + +The PIXL data provides summaries of the mineral compositions measured at selected sample sites by the PIXL instrument. Note that here we scale pixl.mat so features have mean 0 and standard deviation so results will be different than in Assignment 1. + +## Data Set C: Load the LIBS Data + +The LIBS data provides summaries of the mineral compositions measured at selected sample sites by the LIBS instrument, part of the Perseverance SuperCam. + +```{r} +# Load the saved LIBS data with locations added +libs.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/supercam_libs_moc_loc.Rds") + +#Drop features that are not to be used in the analysis for this notebook +libs.df <- libs.df %>% + select(!(c(distance_mm,Tot.Em.,SiO2_stdev,TiO2_stdev,Al2O3_stdev,FeOT_stdev, + MgO_stdev,Na2O_stdev,CaO_stdev,K2O_stdev,Total))) + +# Convert the points to numeric +libs.df$point <- as.numeric(libs.df$point) + +# Review what we have +summary(libs.df) + +# Make the a matrix contain only the libs measurements for each mineral +libs.matrix <- as.matrix(libs.df[,6:13]) + +# Check to see scaling +str(libs.matrix) +``` + + +## Dataset D: Load the SHERLOC Data + +The SHERLOC data you will be using for this lab is the result of scientists' interpretations of extensive spectral analysis of abrasion samples provided by the SHERLOC instrument. + +**NOTE:** This dataset presents minerals as rows and sample sites as columns. You'll probably want to rotate the dataset for easier analysis.... + +## Data Set E: PIXL + Sherloc + +## Data Set F: PIXL + Lithography + +Create data and matrix from prior datasets + +## Data Set G: Sherloc + Lithology + +Create Data and matrix from prior datasets by taking on appropriate combinations. 
+ +## Data Set H: Sherloc + Lithology + PIXL + +Create data frame and matrix from prior datasets by making on appropriate combinations. + +# Analysis of Data (Part 3) + +Each team has been assigned one of six datasets: + +1. Dataset B: PIXL: The PIXL team's goal is to understand and explain how scaling changes results from Assignment 1. The matrix version was scaled above but not in Assignment 1. + +2. Dataset C: LIBS (with appropriate scaling as necessary. Not scaled yet.) + +3. Dataset D: Sherloc (with appropriate scaling as necessary. Not scaled yet.) + +4. Dataset E: PIXL + Sherloc (with appropriate scaling as necessary. Not scaled yet.) + +5. Dataset F: PIXL + Lithography (with appropriate scaling as necessary. Not scaled yet.) + +6. Dataset G: Sherloc + Lithograpy (with appropriate scaling as necessary. Not scaled yet.) + +7. Dataset H: PIXL + Sherloc + Lithograpy (with appropriate scaling as necessary. Not scaled yet.) + +**For the data set assigned to your team, perform the following steps.** Feel free to use the methods/code from Assignment 1 as desired. Communicate with your teammates. Make sure that you are doing different variations of below analysis so that no team member does the exact same analysis. If you want to use the same clustering for your team (which is okay but then vary rest), make sure you use the same random seeds. + +1. _Describe the data set contained in the data frame and matrix:_ How many rows does it have and how many features? Which features are measurements and which features are metadata about the samples? (3 pts) + +The LIBS dataframe has 13 features and 1932 rows of samples. The first 5 features are metadata like location and sol/day. The other 8 features are all chemicals found in the samples. + +2. _Scale this data appropriately (you can choose the scaling method or decide to not scale data):_ Explain why you chose a scaling method or to not scale. 
(3 pts) + +```{r} +libs.matrix.scaled <- libs.matrix %>% scale(center=TRUE,scale=TRUE) +``` +The data is z-score centered and scaled. I chose to scale the data this way because each feature has a vastly different range, and analyzing the data without scaling it beforehand would lead to certain features like SiO2 being over-weighted. + +3. _Cluster the data using k-means or your favorite clustering method (like hierarchical clustering):_ Describe how you picked the best number of clusters. Indicate the number of points in each clusters. Coordinate with your team so you try different approaches. If you want to share results with your team mates, make sure to use the same random seeds. (6 pts) + +```{r, echo=FALSE} +wssplot <- function(data, nc=15, seed=10){ + wss <- data.frame(cluster=1:nc, quality=c(0)) + for (i in 1:nc){ + set.seed(seed) + wss[i,2] <- kmeans(data, centers=i)$tot.withinss} + ggplot(data=wss,aes(x=cluster,y=quality)) + + geom_line() + + ggtitle("Quality of k-means by Cluster") +} +``` + +```{r} +set.seed(100) +wssplot(libs.matrix.scaled,seed=100) + +km<-kmeans(libs.matrix.scaled,5) + +cluster.df<-cbind(c(1,2,3,4,5),km$size) + +colnames(cluster.df) <- c("Cluster","Size") + +kable(cluster.df) +``` +Based on the elbow test, I picked 5 clusters. + +4. _Perform a **creative analysis** that provides insights into what one or more of the clusters are and what they tell you about the MARS data: Alternatively do another creative analysis of your datasets that leads to one of more findings. Make sure to explain what your analysis and discuss your the results. 
+ +```{r} +pheatmap(km$centers,scale="none",cluster_rows=F,main="Heatmap of K-Means Cluster Centers") + +libs.matrix.scaled.pca <- prcomp(libs.matrix.scaled, scale=FALSE) + +ggbiplot::ggbiplot(libs.matrix.scaled.pca, + groups = as.factor(km$cluster),alpha=0.2) + + xlim(-5,3) + ylim(-3,7) + + ggtitle("PCA Analysis of LIBS data by k-means cluster") +``` + +For my creative analysis I wanted to do a similarity analysis of the different sampling groups, or the different "targets" in the LIBS dataframe. I used the anosim (analysis of similarity) function in R. Its default method of measuring distance between groups is the Bray-Curtis Dissimilarity method. This was designed by biologists for comparing populations in different sampling groups. Because of this, it doesn't respond well to negative values, as it was designed for population counts, which are always positive. I rescaled my data because of this, and did not center it at zero as I previously had. +```{r} +locations<-as.factor(libs.df$target) + +libs.matrix.scaled.uncentered<-libs.matrix %>% scale(center=F,scale=T) +libs.matrix.scaled.uncentered.loc<-cbind(locations,libs.matrix.scaled.uncentered) +``` + +```{r} +anosim(libs.matrix.scaled.uncentered.loc,locations,permutations=500,distance="bray") +``` +This similarity analysis tells us that there is very high dissimilarity between the sampling groups, and that this is statistically significant. Our R-value is 0.955, which is very close to 1, and our significance is 0.001, which is lower than a typical threshold of 0.05. This tells us that there is an uneven distribution of chemical compositions in the different samples. This makes sense, there are 201 different sample groups that are spread across many different locations. + +Because I looked at so many different groups, I used a lower number of permutations than the default 999 to save time. 
Also, I would like to compare a smaller group of sampling groups to see if the similarity changes when we are comparing a smaller number of groups. + +```{r} +libs.cluster4<-as.data.frame(cbind(libs.matrix.scaled.uncentered.loc,km$cluster)) +libs.cluster4<-libs.cluster4[libs.cluster4$V10==4,] + +libs.targets4<-as.data.frame(cbind(libs.matrix.scaled.uncentered.loc,km$cluster)) +libs.targets4<-libs.targets4[libs.targets4$locations %in% c(89,153,155,172,192),] +``` + +Looking at just cluster 4, our smallest cluster, there are 5 targets, hardscrabble_creek__, scct_lanke0101______, scct_tapag0206______, thunderbolt_peak_896, scct_lca530106______, however some of these target groups are incomplete in cluster 4, so we are going to look at the complete set of points from each of these target groups. + +```{r} +#cluster4targets<-as.numeric(as.factor(libs.targets4$V9)) + +libs.targets4<-libs.targets4[,c(1:9)] +#libs.targets4<-cbind(libs.targets4,cluster4targets) +``` + +```{r} +anosim(libs.targets4,libs.targets4$locations,permutations=999,distance="bray") +``` +Here, we can see that when looking at the targets that are found in cluster 4, the R-value is still very close to 1 (0.989) and it is also statistically significant. +# Preparation of Team Presentation (Part 4) + +Prepare a presentation of your teams result to present in class on **September 11** starting at 9am in AE217 (20 pts) +The presentation should include the following elements + +0.Your teams names and members +1. A **Description** of the data set that you analyzed including how many observations and how many features. (<= 1.5 mins) +2. Each team member gets **three minutes** to explain their analysis: + * what analysis they performed + * the results of that analysis + * a brief discussion of their interpretation of these results + * <= 18 mins _total!_ +3. A **Conclusion** slide indicating major findings of the teams (<= 1.5 mins) +4. 
Thoughts on **potential next steps** for the MARS team (<= 1.5 mins) + +* A template for your team presentation is included here: https://bit.ly/dar-template-f24 + +* The rubric for the presentation is here: + +https://docs.google.com/document/d/1-4o1O4h2r8aMjAplmE-ItblQnyDAKZwNs5XCnmwacjs/pub + + +* Post a link to your teams presentation in the MARS webex chat before class. You can continue to edit until the last minute. + + + + +# When you're done: SAVE, COMMIT and PUSH YOUR CHANGES! + +When you are satisfied with your edits and your notebook knits successfully, remember to push your changes to the repo using the following steps: + +* `git branch` + * To double-check that you are in your working branch +* `git add ` +* `git commit -m "Some useful comments"` +* `git push origin ` +* do a pull request + + + + + +# APPENDIX: Accessing RStudio Server on the IDEA Cluster + +The IDEA Cluster provides seven compute nodes (4x 48 cores, 3x 80 cores, 1x storage server) + +* The Cluster requires RCS credentials, enabled via registration in class + * email John Erickson for problems `erickj4@rpi.edu` +* RStudio, Jupyter, MATLAB, GPUs (on two nodes); lots of storage and computes +* Access via RPI physical network or VPN only + +# More info about Rstudio on our Cluster + +## RStudio GUI Access: + +* Use: + * http://lp01.idea.rpi.edu/rstudio-ose/ + * http://lp01.idea.rpi.edu/rstudio-ose-3/ + * http://lp01.idea.rpi.edu/rstudio-ose-6/ + * http://lp01.idea.rpi.edu/rstudio-ose-7/ +* Linux terminal accessible from within RStudio "Terminal" or via ssh (below) + diff --git a/StudentNotebooks/Assignment02/vanesm-f24-assignment2.pdf b/StudentNotebooks/Assignment02/vanesm-f24-assignment2.pdf new file mode 100644 index 0000000..7324b2c Binary files /dev/null and b/StudentNotebooks/Assignment02/vanesm-f24-assignment2.pdf differ