---
title: "Mars 2020 Mission Data Notebook qinh2:"
subtitle: "MATP4910 Assignment 2 (Fall 2024)"
author: "Hanzhen Qin(qinh2)"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  pdf_document: default
  html_document:
    toc: true
    number_sections: true
    df_print: paged
---
```{r setup, include=FALSE}
# Required R package installation; RUN THIS BLOCK BEFORE ATTEMPTING TO KNIT THIS NOTEBOOK!!!
# This section installs packages if they are not already installed.
# This block will not be shown in the knit file.
knitr::opts_chunk$set(echo = TRUE)
# Set the default CRAN repository
local({
  r <- getOption("repos")
  r["CRAN"] <- "http://cran.r-project.org"
  options(repos = r)
})
if (!require("pandoc")) {
  install.packages("pandoc")
  library(pandoc)
}
# Required packages for M20 LIBS analysis
if (!require("rmarkdown")) {
  install.packages("rmarkdown")
  library(rmarkdown)
}
if (!require("tidyverse")) {
  install.packages("tidyverse")
  library(tidyverse)
}
if (!require("stringr")) {
  install.packages("stringr")
  library(stringr)
}
if (!require("ggbiplot")) {
  install.packages("ggbiplot")
  library(ggbiplot)
}
if (!require("pheatmap")) {
  install.packages("pheatmap")
  library(pheatmap)
}
```
# DAR ASSIGNMENT 2 (Introduction): Introductory DAR Notebook
This notebook is broken into four main parts:
* **Part 1:** Preparing your local repo for **DAR Assignment 2**
* **Part 2:** Loading and some analysis of the Mars 2020 (M20) Datasets
    * Lithology: _Summarizes the mineral characteristics of samples collected at certain sample locations._
    * PIXL: Planetary Instrument for X-ray Lithochemistry. _Measures the elemental chemistry of samples at sub-millimeter scales._
    * SHERLOC: Scanning Habitable Environments with Raman and Luminescence for Organics and Chemicals. _Uses cameras, a spectrometer, and a laser to search for organic compounds and minerals that have been altered in watery environments and may be signs of past microbial life._
    * LIBS: Laser-induced breakdown spectroscopy. _Uses a laser beam to help identify minerals in samples and in areas beyond the reach of the rover's robotic arm or too steep for the rover to travel._
* **Part 3:** Individual analysis of your team's dataset
* **Part 4:** Preparation of Team Presentation
**NOTE:** The RPI github repository for all the code and data required for this notebook may be found at:
* https://github.rpi.edu/DataINCITE/DAR-Mars-F24
# DAR ASSIGNMENT 2 (Part 1): Preparing your local repo for Assignment 2
In this assignment you'll start by making a copy of the Assignment 2 template notebook, then you'll add to your copy with your original work. The instructions which follow explain how to accomplish this.
**NOTE:** You already cloned the `DAR-Mars-F24` repository for Assignment 1; you **do not** need to make another clone of the repo, but you must begin by updating your copy as instructed below:
## Updating your local clone of the `DAR-Mars-F24` repository
* Access RStudio Server on the IDEA Cluster at http://lp01.idea.rpi.edu/rstudio-ose/
* REMINDER: You must be on the RPI VPN!!
* Access the Linux shell on the IDEA Cluster by clicking the **Terminal** tab of RStudio Server (lower left panel).
* You now see the Linux shell on the IDEA Cluster
* `cd` (change directory) to enter your home directory using: `cd ~`
* Type `pwd` to confirm where you are
* In the Linux shell, `cd` to `DAR-Mars-F24`
* Type `git pull origin main` to pull any updates
* Always do this when you begin work; we might have added or changed something!
* In the Linux shell, `cd` into `Assignment02`
* Type `ls -al` to list the current contents
* Don't be surprised if you see many files!
* In the Linux shell, type `git branch` to verify your current working branch
* If it is not `dar-yourrcs`, type `git checkout dar-yourrcs` (where `yourrcs` is your RCS id)
* Re-type `git branch` to confirm
* Now in the RStudio Server UI, navigate to the `DAR-Mars-F24/StudentNotebooks/Assignment02` directory via the **Files** panel (lower right panel)
* Under the **More** menu, set this to be your R working directory
* Setting the correct working directory is essential for interactive R use!
You're now ready to start coding Assignment 2!
## Creating your copy of the Assignment 2 notebook
1. In RStudio, make a **copy** of the `dar-f24-assignment2-template.Rmd` file using a *new, original, descriptive* filename that **includes your RCS ID!**
* Open `dar-f24-assignment2-template.Rmd`
* **Save As...** using a new filename that includes your RCS ID
* Example filename for user `erickj4`: `erickj4-assignment2-f24.Rmd`
* POINTS OFF IF:
* You don't create a new filename!
* You don't include your RCS ID!
* You include `template` in your new filename!
2. Edit your new notebook using RStudio and save
* Change the `title:` and `subtitle:` headers (at the top of the file)
* Change the `author:`
* Don't bother changing the `date:`; it should update automagically...
* **Save** your changes
3. Use the RStudio `Knit` command to create a PDF file; repeat as necessary
* Use the down arrow next to the word `Knit` and select **Knit to PDF**
* You may also knit to HTML...
4. In the Linux terminal, use `git add` to add each new file you want to add to the repository
* Type: `git add yourfilename.Rmd`
* Type: `git add yourfilename.pdf` (created when you knitted)
* Add your HTML if you also created one...
5. When you're ready, in Linux commit your changes:
* Type: `git commit -m "some comment"` where "some comment" is a useful comment describing your changes
* This commits your changes to your local repo, and sets the stage for your next operation.
6. Finally, push your commits to the RPI github repo
* Type: `git push origin dar-yourrcs` (where `dar-yourrcs` is the branch you've been working in)
* Your changes are now safely on the RPI github.
7. **REQUIRED:** On the RPI github, **submit a pull request.**
* In a web browser, navigate to https://github.rpi.edu/DataINCITE/DAR-Mars-F24
* In the branch selector drop-down (by default says **master**), select your branch
* **Submit a pull request for your branch**
* One of the DAR instructors will merge your branch, and your new files will be added to the master branch of the repo. _Do not merge your branch yourself!_
# DAR ASSIGNMENT 2 (Part 2): Loading the Mars 2020 (M20) Datasets
In this assignment there are four datasets from separate instruments on the Mars Perseverance rover available for analysis:
* **Lithology:** Summarizes the mineral characteristics of samples collected at certain sample locations
* **PIXL:** Planetary Instrument for X-ray Lithochemistry of collected samples
* **SHERLOC:** Scanning Habitable Environments with Raman and Luminescence for Organics and Chemicals for collected samples
* **LIBS:** Laser-induced breakdown spectroscopy which are measured in many areas (not just samples)
Each dataset provides data about the mineralogy of the surface of Mars. Based on the purpose and nature of the instrument, the data is collected at different intervals along the path of Perseverance as it makes its way across the Jezero crater. Some of the data (esp. LIBS) is collected almost every Martian day, or _sol_. Some of the data (PIXL and SHERLOC) is only collected at certain sample locations of interest.
Your objective is to perform an analysis of your team's assigned dataset in order to learn all you can about these Mars samples.
NOTES:
* All of these datasets can be found in `/academics/MATP-4910-F24/DAR-Mars-F24/Data`
* We have included a comprehensive `samples.Rds` dataset that includes useful details about the sample locations, including Martian latitude and longitude and the sol on which individual samples were collected.
* Also included is `rover.waypoints.Rds` that provides detailed location information (lat/lon) for the Perseverance rover throughout its journey, up to the present. This can be updated when necessary using the included `roverStatus-f24.R` script.
* A general guide to the available Mars 2020 data is available here: https://pds-geosciences.wustl.edu/missions/mars2020/index.htm
* Other useful MARS 2020 sites
https://science.nasa.gov/mission/mars-2020-perseverance/mars-rock-samples/ and https://an.rsl.wustl.edu/m20/AN/an3.aspx?AspxAutoDetectCookieSupport=1
* Note that PIXL, SHERLOC, and Lithology describe 16 samples that were physically collected. There will eventually be 38 samples. These datasets can be merged by sample. The LIBS data includes observations collected at many more locations, so how to combine the LIBS data with the other datasets is an open research question.
## Data Set A: Load the Lithology Data
The first five features of the dataset describe twenty-four (24) rover sample locations.
The remaining features provide a simple binary (`1` or `0`) summary of the presence or absence of 35 minerals at the 24 rover sample locations.
Only the first sixteen (16) samples are kept, as the remaining samples are missing the mineral descriptors.
The following code "cleans" the dataset to prepare for analysis. It first creates a dataframe with metadata and measurements for samples, and then creates a matrix containing only numeric measurements for later analysis.
```{r}
# Load the saved lithology data with locations added
lithology.df<- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/mineral_data_static.Rds")
# Cast samples as numbers
lithology.df$sample <- as.numeric(lithology.df$sample)
# Convert rest into factors
lithology.df[sapply(lithology.df, is.character)] <-
lapply(lithology.df[sapply(lithology.df, is.character)],
as.factor)
# Keep only first 16 samples because the data for the rest of the samples is not available yet
lithology.df<-lithology.df[1:16,]
# Look at summary of cleaned data frame
summary(lithology.df)
# Create a matrix containing only the numeric measurements. The remaining features are metadata about the sample.
lithology.matrix <- sapply(lithology.df[,6:40],as.numeric)-1
# Review the structure of our matrix
str(lithology.matrix)
```
## Data Set B: Load the PIXL Data
The PIXL data provides summaries of the mineral compositions measured at selected sample sites by the PIXL instrument. Note that here we scale `pixl.matrix` so features have mean 0 and standard deviation 1, so results will differ from Assignment 1.
```{r}
# Load the saved PIXL data with locations added
pixl.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/samples_pixl_wide.Rds")
# Convert to factors
pixl.df[sapply(pixl.df, is.character)] <- lapply(pixl.df[sapply(pixl.df, is.character)],
as.factor)
# Review our dataframe
summary(pixl.df)
# Make the matrix of just mineral percentage measurements
pixl.matrix <- pixl.df[,2:14] %>% scale()
# Review the structure
str(pixl.matrix)
```
### Description for data set B:
The dataset contains 16 sample points, and 13 mineral components were measured at each sample point: sodium oxide (Na2O), magnesium oxide (MgO), aluminum oxide (Al2O3), silicon dioxide (SiO2), phosphorus pentoxide (P2O5), sulfur trioxide (SO3), chlorine (Cl), potassium oxide (K2O), calcium oxide (CaO), titanium dioxide (TiO2), chromium trioxide (Cr2O3), manganese oxide (MnO), and total iron oxide (FeO-T). There are three metadata features: name (sample name), type (sample type, such as igneous or sedimentary rock), and campaign (geographic area where the sample was taken).
The sample types are mainly divided into igneous and sedimentary rocks, with 8 igneous rock samples, 7 sedimentary rock samples, and 1 sample of undefined type. The sample locations are also classified in detail, covering different areas such as the crater floor and the delta front.
I will then scale the measurements to ensure that the clustering algorithm is not affected by the differing scales of the features. I will use a common scaling method, Z-score normalization, which sets the mean of each feature to 0 and its standard deviation to 1. I chose Z-score normalization because it puts the features on a common scale, so that differences in units or magnitudes do not dominate the clustering results.
### Data Scaling
```{r}
# apply Z-score normalization; note that pixl.matrix was already scaled above,
# so re-scaling here leaves the values unchanged
scaled_matrix <- scale(pixl.matrix)
# display the first few rows of scaled data
head(scaled_matrix)
# view the entire dataset
# View(scaled_matrix)
```
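As a quick sanity check on the Z-score normalization described above, the short sketch below (assuming `scaled_matrix` from the previous chunk) verifies that every column has mean approximately 0 and standard deviation approximately 1.
```{r}
# Verify the Z-score normalization (assumes scaled_matrix from the chunk above):
# every column should have mean ~0 and standard deviation ~1.
round(colMeans(scaled_matrix), 10)
round(apply(scaled_matrix, 2, sd), 10)
```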
Next, I will use hierarchical clustering as the clustering method. The advantage of hierarchical clustering is that it does not require the number of clusters to be specified in advance, and the hierarchical structure of the clusters can be inspected visually in a dendrogram. To determine the optimal number of clusters, I examined the dendrogram and chose a cut point based on the node heights: when the distance between successive merges increases sharply, that larger gap usually marks a more natural number of clusters.
```{r}
# check and remove duplicate sample rows
duplicated_rows <- duplicated(scaled_matrix)
scaled_matrix_unique <- scaled_matrix[!duplicated_rows, ]
# assume the original data frame pixl.df has a column named 'name' representing the sample names
# create a label vector, replacing the names in the original data
labels <- pixl.df$name[!duplicated_rows] # make sure the labels are aligned with the deduplicated data
# calculate the distance matrix of the samples
dist_matrix <- dist(scaled_matrix_unique)
# use hierarchical clustering method
hclust_result <- hclust(dist_matrix, method="ward.D")
# plot a hierarchical clustering dendrogram and use sample names as labels
# replace the default index labels with sample names also ensure that the branches are aligned to the bottom
plot(hclust_result, main="Hierarchical Clustering Dendrogram", xlab="Samples", ylab="Height", sub="", labels=labels, hang=-1)
```
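To make the cut-point reasoning above more concrete, one optional check (a sketch assuming the `hclust_result` object from the previous chunk) is to look at the merge heights directly; a large jump between successive heights suggests cutting the tree just below that jump.
```{r}
# Inspect the largest merge heights from the hierarchical clustering
# (assumes hclust_result from the chunk above).
merge.heights <- sort(hclust_result$height, decreasing = TRUE)
head(merge.heights, 6)
# Gaps between successive heights; the largest gap hints at a natural number of clusters.
head(-diff(merge.heights), 5)
```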
```{r}
# cut the dendrogram and choose the appropriate number of clusters
k <- 3 # select the number of clusters, because there are three groups of clusters
clusters_hierarchical <- cutree(hclust_result, k=k)
# view the number of samples in each cluster (hierarchical clustering results)
cluster_counts_hierarchical <- table(clusters_hierarchical)
# print(cluster_counts_hierarchical)
# perform principal component analysis (PCA)
pca_result <- prcomp(scaled_matrix_unique)
# draw a PCA graph of hierarchical clustering
plot(pca_result$x[,1], pca_result$x[,2], col=clusters_hierarchical,
pch=19, xlab="Principal Component 1", ylab="Principal Component 2",
main="PCA of Mineral Data with Hierarchical Clusters")
# add cluster numbers to points
text(pca_result$x[,1], pca_result$x[,2], labels=clusters_hierarchical, pos=4)
```
In this analysis, I used the Ward.D method to minimize the variance within each cluster. In the dendrogram, the length of the vertical lines represents the distance (dissimilarity) between the groups being merged. Cutting across long vertical branches means the groups joined at that point are quite different from each other, so by choosing an appropriate height at which to cut these long branches, I can determine a reasonable number of clusters.
Based on the hierarchical clustering results, I divide the samples into 3 clusters. The samples in each cluster are:
- Cluster 1 includes: Coulettes, Roubion, Montdenier
- Cluster 2 includes: Montagnac, Salette, Swift Run, Shuyak
- Cluster 3 includes: Hazeltop, Kukaklek
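As an optional check on the choice of k = 3, the sketch below (assuming `clusters_hierarchical` and `dist_matrix` from the chunks above) computes silhouette widths using the `cluster` package, which ships with R; an average silhouette width closer to 1 indicates better-separated clusters.
```{r}
# Silhouette check for the k = 3 hierarchical solution
# (assumes clusters_hierarchical and dist_matrix already exist).
library(cluster)
sil <- silhouette(clusters_hierarchical, dist_matrix)
summary(sil)
# Average silhouette width across all samples
mean(sil[, "sil_width"])
```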
### Creative analysis
Next, I want to compare the mineralogy of different types of samples to find the difference between igneous and sedimentary rocks, and use correlation analysis and principal component analysis (PCA) to explore the correlation between mineralogy and its relationship with the geological background of the samples.
### Comparison of mineralogical composition between igneous and sedimentary rock samples
```{r}
# split data by sample type
igneous <- pixl.df[pixl.df$type == "Igneous", ]
sedimentary <- pixl.df[pixl.df$type == "Sedimentary", ]
# name of mineral composition column
mineral_columns <- c("Na2O", "MgO", "Al2O3", "SiO2", "P2O5", "SO3",
                     "Cl", "K2O", "CaO", "TiO2", "Cr2O3", "MnO", "FeO-T")
# set a larger plotting area to ensure both datasets can be displayed
par(mfrow=c(1, 1)) # set as single graphs
# make a box plot of each mineral composition, showing igneous and sedimentary rocks separately
for (mineral in mineral_columns) {
# this line of code checks whether the current mineral column (mineral) exists in both the igneous and sedimentary datasets
if (mineral %in% colnames(igneous) & mineral %in% colnames(sedimentary)) {
boxplot(igneous[[mineral]], sedimentary[[mineral]],
names=c("Igneous", "Sedimentary"), # set the x_axis information
main=paste(mineral, "Comparison"),
ylab=mineral,
col=c("lightblue", "lightgreen"), # set the background color of the graph
border=c("pink", "purple")) # set the border color
}
# if one of the mineral missing in igneous or sedimentary, it will be printed out
else {
print(paste("Mineral column", mineral, "is missing in one of the datasets."))
}
}
```
I used box plots to compare the distribution of mineral contents in the two types of samples: igneous and sedimentary. The median of each group is shown by the thick line in the middle of each box.
Focus on MgO and SiO2. The box plot of MgO shows that the median and spread of MgO in the igneous rock samples are lower than in the sedimentary rocks. This means the magnesium oxide content in igneous rocks is relatively low, while the MgO content in sedimentary rocks is spread more widely, which suggests that the sedimentary rock samples in Dataset B may contain more magnesium-rich minerals. The box plot of SiO2 shows that the SiO2 content in sedimentary rocks is lower than the median for igneous rocks, indicating that the silica content in igneous rocks is relatively high and that igneous rocks may contain more silicate minerals.
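To put a rough number on the visual differences noted above, the hedged sketch below applies a Wilcoxon rank-sum test to MgO and SiO2; it assumes the `pixl.df` columns are named `MgO` and `SiO2` (per the dataset description) and reuses the `igneous` and `sedimentary` subsets from the previous chunk. With so few samples the p-values are only indicative.
```{r}
# Non-parametric comparison of MgO and SiO2 between igneous and sedimentary samples.
# Assumes the column names "MgO" and "SiO2" and the igneous/sedimentary subsets above.
for (mineral in c("MgO", "SiO2")) {
  test <- wilcox.test(igneous[[mineral]], sedimentary[[mineral]], exact = FALSE)
  cat(mineral, ": W =", test$statistic, ", p-value =", round(test$p.value, 3), "\n")
}
```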
### Correlation analysis and principal component analysis of mineral components
```{r}
library(fields)
# create a data frame containing only the mineral composition columns
# (a new name is used so the scaled pixl.matrix created earlier is not overwritten)
pixl.minerals <- pixl.df[, 2:14]
# calculate the correlation matrix of the mineral compositions
cor_matrix <- cor(pixl.minerals, use="complete.obs")
# set the labels for minerals
mineral_labels <- colnames(pixl.minerals)
# plot the correlation matrix with added scale and labels
image(1:ncol(cor_matrix), 1:ncol(cor_matrix), cor_matrix,
main="Mineral Correlation Matrix",
xlab="Minerals", ylab="Minerals",
col=heat.colors(20), axes=FALSE)
# add x and y axis labels (mineral names)
axis(1, at=1:ncol(cor_matrix), labels=mineral_labels, las=2, cex.axis=0.8)
axis(2, at=1:ncol(cor_matrix), labels=mineral_labels, las=2, cex.axis=0.8)
# add a color strip to indicate the correlation scale
# (colorbar.plot() from the fields package has no legend-label argument)
colorbar.plot(1, 0.5, strip=seq(-1, 1, length.out=20),
              col=heat.colors(20), horizontal=TRUE)
```
In this heat map, bright yellow represents a highly positively correlated mineral pair, which means that the content of these two mineral components in the sample changes in the same trend.
- The most obvious example in the figure is Al2O3 (aluminum oxide) and SiO2 (silicon dioxide), which show a high positive correlation (brighter color); this may be related to their coexistence in the composition of sedimentary or igneous rocks.
- Also another example is P2O5 (Phosphorus pentoxide) and TiO2 (titanium dioxide). The positive correlation of TiO2 and P2O5 may indicate that they co-occur in certain types of rocks. In igneous rocks, this co-occurrence may be related to magmatic differentiation, with apatite and ilmenite crystallizing and precipitating simultaneously in neutral or basic magmas. In sedimentary rocks, the two may accumulate together through weathering or sedimentation.
Meanwhile, red represents pairs of mineral components with low or even negative correlation, which means that when one mineral component increases, the other mineral component may decrease.
- For instance, there is a relatively low correlation between Na2O (sodium oxide) and MgO (magnesium oxide), reflected in the color being close to red. This may be because they occur in different proportions in different rock types.
- Similarly, Na2O and FeO-T (total iron oxide) show a low correlation, indicating that they behave differently across rock types.
When considering the geological significance of mineral components, SiO2 (silicon dioxide), as a primary silicate, often shows a positive correlation with Al2O3 and K2O, reflecting their common presence in igneous rocks. FeO-T (total iron oxide) exhibits a strong positive correlation with MgO, suggesting a connection to volcanic rocks or other iron- and magnesium-rich minerals.
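Since `pheatmap` is already loaded in the setup chunk, an optional alternative view (a sketch assuming the `cor_matrix` computed above) redraws the same correlation matrix with the values printed in each cell, which can make the pairs discussed above easier to read.
```{r}
# Alternative view of the same correlation matrix using pheatmap (loaded in setup).
pheatmap(cor_matrix,
         display_numbers = TRUE,
         main = "Mineral Correlation Matrix (pheatmap)")
```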
## Data Set C: Load the LIBS Data
The LIBS data provides summaries of the mineral compositions measured at selected sample sites by the LIBS instrument, part of the Perseverance SuperCam.
```{r}
# Load the saved LIBS data with locations added
libs.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/supercam_libs_moc_loc.Rds")
#Drop features that are not to be used in the analysis for this notebook
libs.df <- libs.df %>%
select(!(c(distance_mm,Tot.Em.,SiO2_stdev,TiO2_stdev,Al2O3_stdev,FeOT_stdev,
MgO_stdev,Na2O_stdev,CaO_stdev,K2O_stdev,Total)))
# Convert the points to numeric
libs.df$point <- as.numeric(libs.df$point)
# Review what we have
summary(libs.df)
# Make a matrix containing only the LIBS measurements for each mineral
libs.matrix <- as.matrix(libs.df[,6:13])
# Check to see scaling
str(libs.matrix)
```
## Dataset D: Load the SHERLOC Data
The SHERLOC data you will be using for this lab is the result of scientists' interpretations of extensive spectral analysis of abrasion samples provided by the SHERLOC instrument.
**NOTE:** This dataset presents minerals as rows and sample sites as columns. You'll probably want to rotate the dataset for easier analysis....
```{r}
# Read in data as provided.
sherloc_abrasion_raw <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/abrasions_sherloc_samples.Rds")
# Clean up data types
sherloc_abrasion_raw$Mineral<-as.factor(sherloc_abrasion_raw$Mineral)
sherloc_abrasion_raw[sapply(sherloc_abrasion_raw, is.character)] <- lapply(sherloc_abrasion_raw[sapply(sherloc_abrasion_raw, is.character)],
as.numeric)
# Transform NA's to 0
sherloc_abrasion_raw <- sherloc_abrasion_raw %>% replace(is.na(.), 0)
# Reformat data so that rows are "abrasions" and columns list the presence of minerals.
# Do this by "pivoting" to a long format, and then back to the desired wide format.
sherloc_long <- sherloc_abrasion_raw %>%
pivot_longer(!Mineral, names_to = "Name", values_to = "Presence")
# Make abrasion a factor
sherloc_long$Name <- as.factor(sherloc_long$Name)
# Make it a matrix
sherloc.matrix <- sherloc_long %>%
pivot_wider(names_from = Mineral, values_from = Presence)
# Get sample information from PIXL and add to measurements -- assumes order is the same
sherloc.df <- cbind(pixl.df[,c("sample","type","campaign","abrasion")],sherloc.matrix)
# Review what we have
summary(sherloc.df)
# Measurements are everything except first column
sherloc.matrix<-as.matrix(sherloc.matrix[,-1])
# Sherlock measurement matrix
# Review the structure
str(sherloc.matrix)
```
## Data Set E: PIXL + Sherloc
```{r}
# Combine PIXL and SHERLOC dataframes
pixl_sherloc.df <- cbind(pixl.df,sherloc.df )
# Review what we have
summary(pixl_sherloc.df)
# Combine PIXL and SHERLOC matrices
pixl_sherloc.matrix<-cbind(pixl.matrix,sherloc.matrix)
# Review the structure of our matrix
str(pixl_sherloc.matrix)
```
## Data Set F: PIXL + Lithology
Create data and matrix from prior datasets
```{r}
# Combine our PIXL and Lithology dataframes
pixl_lithology.df <- cbind(pixl.df,lithology.df )
# Review what we have
summary(pixl_lithology.df)
# Combine PIXL and Lithology matrices
pixl_lithology.matrix<-cbind(pixl.matrix,lithology.matrix)
# Review the structure
str(pixl_lithology.matrix)
```
## Data Set G: Sherloc + Lithology
Create a data frame and matrix from the prior datasets by taking appropriate combinations.
```{r}
# Combine the Lithology and SHERLOC dataframes
sherloc_lithology.df <- cbind(sherloc.df,lithology.df )
# Review what we have
summary(sherloc_lithology.df)
# Combine the Lithology and SHERLOC matrices
sherloc_lithology.matrix<-cbind(sherloc.matrix,lithology.matrix)
# Review the resulting matrix
str(sherloc_lithology.matrix)
```
## Data Set H: Sherloc + Lithology + PIXL
Create a data frame and matrix from the prior datasets by making appropriate combinations.
```{r}
# Combine the SHERLOC, Lithology, and PIXL dataframes
sherloc_lithology_pixl.df <- cbind(sherloc.df,lithology.df, pixl.df )
# Review what we have
summary(sherloc_lithology_pixl.df)
# Combine the SHERLOC, Lithology, and PIXL matrices
sherloc_lithology_pixl.matrix<-cbind(sherloc.matrix,lithology.matrix,pixl.matrix)
# Review the resulting matrix
str(sherloc_lithology_pixl.matrix)
```
# Analysis of Data (Part 3)
Each team has been assigned one of the following datasets:
1. Dataset B: PIXL: The PIXL team's goal is to understand and explain how scaling changes the results from Assignment 1. The matrix version was scaled above but not in Assignment 1.
2. Dataset C: LIBS (with appropriate scaling as necessary; not scaled yet.)
3. Dataset D: Sherloc (with appropriate scaling as necessary; not scaled yet.)
4. Dataset E: PIXL + Sherloc (with appropriate scaling as necessary; not scaled yet.)
5. Dataset F: PIXL + Lithology (with appropriate scaling as necessary; not scaled yet.)
6. Dataset G: Sherloc + Lithology (with appropriate scaling as necessary; not scaled yet.)
7. Dataset H: PIXL + Sherloc + Lithology (with appropriate scaling as necessary; not scaled yet.)
**For the data set assigned to your team, perform the following steps.** Feel free to use the methods/code from Assignment 1 as desired. Communicate with your teammates, and make sure you each do a different variation of the analysis below so that no team member does the exact same analysis. If you want to use the same clustering across your team (which is okay, but then vary the rest), make sure you use the same random seeds.
1. _Describe the data set contained in the data frame and matrix:_ How many rows does it have and how many features? Which features are measurements and which are metadata about the samples? (3 pts)
2. _Scale this data appropriately (you can choose the scaling method or decide not to scale the data):_ Explain why you chose a scaling method or chose not to scale. (3 pts)
3. _Cluster the data using k-means or your favorite clustering method (such as hierarchical clustering):_ Describe how you picked the best number of clusters, and indicate the number of points in each cluster. Coordinate with your team so you try different approaches. If you want to share results with your teammates, make sure to use the same random seeds. (6 pts) A minimal k-means sketch is provided after this list.
4. _Perform a **creative analysis** that provides insights into what one or more of the clusters are and what they tell you about the Mars data._ Alternatively, do another creative analysis of your dataset that leads to one or more findings. Make sure to explain your analysis and discuss the results.
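As a starting point for items 2 and 3, here is a minimal, hedged k-means sketch; it uses the scaled PIXL matrix purely as an example (substitute your team's matrix), and the random seed, the candidate range of k, and the elbow (total within-cluster sum of squares) criterion are illustrative choices, not requirements.
```{r}
# Minimal k-means sketch for Part 3 (replace pixl.matrix with your team's matrix).
# The seed and the 1:8 range of k are illustrative assumptions only.
set.seed(100)
wss <- sapply(1:8, function(k) kmeans(pixl.matrix, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares", main = "Elbow plot for k-means")
# After choosing k from the elbow, refit and report cluster sizes (k = 3 shown as an example).
km <- kmeans(pixl.matrix, centers = 3, nstart = 25)
table(km$cluster)
```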
# Preparation of Team Presentation (Part 4)
Prepare a presentation of your team's results to present in class on **September 11** starting at 9am in AE217 (20 pts)
The presentation should include the following elements:
0. Your team's name and members
1. A **Description** of the data set that you analyzed including how many observations and how many features. (<= 1.5 mins)
2. Each team member gets **three minutes** to explain their analysis:
* what analysis they performed
* the results of that analysis
* a brief discussion of their interpretation of these results
* <= 18 mins _total!_
3. A **Conclusion** slide indicating the major findings of the team (<= 1.5 mins)
4. Thoughts on **potential next steps** for the MARS team (<= 1.5 mins)
* A template for your team presentation is included here: https://bit.ly/dar-template-f24
* The rubric for the presentation is here:
https://docs.google.com/document/d/1-4o1O4h2r8aMjAplmE-ItblQnyDAKZwNs5XCnmwacjs/pub
* Post a link to your team's presentation in the Mars Webex chat before class. You can continue to edit until the last minute.
# When you're done: SAVE, COMMIT and PUSH YOUR CHANGES!
When you are satisfied with your edits and your notebook knits successfully, remember to push your changes to the repo using the following steps:
* `git branch`
* To double-check that you are in your working branch
* `git add <your changed files>`
* `git commit -m "Some useful comments"`
* `git push origin <your branch name>`
* Submit a pull request
# APPENDIX: Accessing RStudio Server on the IDEA Cluster
The IDEA Cluster provides seven compute nodes (4x 48 cores, 3x 80 cores, 1x storage server)
* The Cluster requires RCS credentials, enabled via registration in class
* email John Erickson for problems `erickj4@rpi.edu`
* RStudio, Jupyter, MATLAB, GPUs (on two nodes); lots of storage and computes
* Access via RPI physical network or VPN only
# More info about RStudio on our Cluster
## RStudio GUI Access:
* Use:
* http://lp01.idea.rpi.edu/rstudio-ose/
* http://lp01.idea.rpi.edu/rstudio-ose-3/
* http://lp01.idea.rpi.edu/rstudio-ose-6/
* http://lp01.idea.rpi.edu/rstudio-ose-7/
* Linux terminal accessible from within RStudio "Terminal" or via ssh (below)