compta-assignment05_f24.Rmd

---
title: "DAR F24 Assignment 5 Notebook"
author: "Ashton Compton"
date: "`r Sys.Date()`"
output:
  pdf_document:
    toc: yes
  html_document:
    toc: yes
subtitle: "DAR Project Name: Mars"
---

![Dr. Roger's Mars Question on 9-4-24](../../Resources/MarsQuestions2024-09-04.png)

These are Dr. Bennett's Mars questions from Day 1 lecture plus so more


1) Whatever Dr. Roger’s wants.
2) Develop new analyses/visualizations with an emphasis on integrative analysis of the different types of data (e.g. PIXL, SHERLOC, SuperCam/LIBS, and Lithology)
3) Incorporate these analysis into improved Campfire demo. What would Dr. Rogers consider to be an improved Campfire demo?
4) Develop standalone "2D" app incorporating Campfire and enhanced analysis/visualization capabilities.
5) Understand LIBS and develop an insightful standalone Libs analysis.
6) Develop a deeper understanding of each dataset by looking at NASA sources and published papers. Are we missing data on campaigns etc?  Are we correctly integrating data?   How does LIBS data correspond to the 16 Samples?

Please put valuable resources (like websites and papers) on github https://github.rpi.edu/DataINCITE/DAR-Mars-F24/wiki   Dr. Bennett has started this process by creating a wiki on DAR-MARS-F24 with short description and links to resources.  Files (like letures are in DAR-Mars-F24/Resources.  You can add more to the wiki and add files to the the Resources directory on github.  You can edit the wiki too.

## BiWeekly Work Summary

**NOTE:** Follow an outline format; use bullets to express individual points.

* RCS ID: compta
* Project Name: Mars, DAR 2024
* Summary of work since last week

    * Describe the important aspects of what you worked on and accomplished


    Created bar plot showing feature count of each feature in the Lithology dataframe, grouped by campaign.
    Reveals interesting differences between the two campaigns.

    Beginning work on scaling Pixl, going to make similar chart as the one described above but for pixl.


* Summary of github commits

    * include branch name(s)
    * include browsable links to all external files on github
    * Include links to shared Shiny apps


    Commiting an html and rmd knit

* List of presentations,  papers, or other outputs

    * Include browsable links

* List of references (if necessary)
* Indicate any use of group shared code base
* Indicate which parts of your described work were done by you or as part of joint efforts

* **Required:** Provide illustrating figures and/or tables

![Lithology Feature Count by Campaign](../../StudentNotebooks/Assignment04/LithologyFeatCountbyCampaign.png)
![Lithology Cluster Distribution ](../../StudentNotebooks/Assignment05/lithologyFeatureDistributionHeatmap.png)
![pixlDistributionbyCampaign](../../StudentNotebooks/Assignment05/pixlDistributionbyCampaign.png)


## Personal Contribution

* Clearly defined, unique contribution(s) done by you: code, ideas, writing...
* Include github issues you've addressed if any

Load libaries
Set up dataframes/matrices
```{r setup, include=FALSE}

knitr::opts_chunk$set(echo = TRUE)

# Set the default CRAN repository
local({r <- getOption("repos")
       r["CRAN"] <- "http://cran.r-project.org"
       options(repos=r)
})

if (!require("pandoc")) {
  install.packages("pandoc")
  library(pandoc)
}

# Required packages for M20 LIBS analysis
if (!require("rmarkdown")) {
  install.packages("rmarkdown")
  library(rmarkdown)
}
if (!require("tidyverse")) {
  install.packages("tidyverse")
  library(tidyverse)
}
if (!require("stringr")) {
  install.packages("stringr")
  library(stringr)
}

if (!require("ggbiplot")) {
  install.packages("ggbiplot")
  library(ggbiplot)
}

if (!require("pheatmap")) {
  install.packages("pheatmap")
  library(pheatmap)
}

if (!require("knitr")) {
  install.packages("knitr")
  library(knitr)
}

if (!require("paletteer")) {
  install.packages("paletteer")
  library(paletteer)
}

if (!require("plotly")) {
  install.packages("plotly")
  library(plotly)
}

if (!require("GGally")) {
  install.packages("GGally")
  library(GGally)
}
```


```{R}
#Load in data
###
# Load the saved lithology data with locations added
lithology.df<- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/mineral_data_static.Rds")

# Cast samples as numbers
lithology.df$sample <- as.numeric(lithology.df$sample)

# Convert rest into factors
lithology.df[sapply(lithology.df, is.character)] <-
  lapply(lithology.df[sapply(lithology.df, is.character)],
                                       as.factor)

# Keep only first 16 samples because the data for the rest of the samples is not available yet
#Also i'm getting rid of the atmospheric sample for now
lithology.df<-lithology.df[2:16,]
# Create a matrix containing only the numeric measurements.  The remaining features are metadata about the sample.
lithology.matrix <- sapply(lithology.df[,6:40],as.numeric)-1

###
# Load the saved PIXL data with locations added
pixl.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/samples_pixl_wide.Rds")

# Convert to factors
pixl.df[sapply(pixl.df, is.character)] <- lapply(pixl.df[sapply(pixl.df, is.character)],
                                       as.factor)

#Get rid of atmospheric sample
pixl.df <- pixl.df[2:16,]

# Make the matrix of just mineral percentage measurements
pixl.matrix <- pixl.df[,2:14]

###
# Load the saved LIBS data with locations added
libs.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/supercam_libs_moc_loc.Rds")

#Drop  features that are not to be used in the analysis for this notebook
libs.df <- libs.df %>%
  select(!(c(distance_mm,Tot.Em.,SiO2_stdev,TiO2_stdev,Al2O3_stdev,FeOT_stdev,
             MgO_stdev,Na2O_stdev,CaO_stdev,K2O_stdev,Total)))

# Convert the points to numeric
libs.df$point <- as.numeric(libs.df$point)

# Make the a matrix contain only the libs measurements for each mineral
libs.matrix <- as.matrix(libs.df[,6:13])

###
# Read in data as provided.
sherloc_abrasion_raw <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/abrasions_sherloc_samples.Rds")

# Clean up data types
sherloc_abrasion_raw$Mineral<-as.factor(sherloc_abrasion_raw$Mineral)
sherloc_abrasion_raw[sapply(sherloc_abrasion_raw, is.character)] <- lapply(sherloc_abrasion_raw[sapply(sherloc_abrasion_raw, is.character)],
                                       as.numeric)
# Transform NA's to 0
sherloc_abrasion_raw <- sherloc_abrasion_raw %>% replace(is.na(.), 0)

# Reformat data so that rows are "abrasions" and columns list the presence of minerals.
# Do this by "pivoting" to a long format, and then back to the desired wide format.

sherloc_long <- sherloc_abrasion_raw %>%
  pivot_longer(!Mineral, names_to = "Name", values_to = "Presence")

# Make abrasion a factor
sherloc_long$Name <- as.factor(sherloc_long$Name)

# Make it a matrix
sherloc.matrix <- sherloc_long %>%
  pivot_wider(names_from = Mineral, values_from = Presence)

#Remove atmospheric sample
sherloc.matrix <- sherloc.matrix[2:16,]

# Get sample information from PIXL and add to measurements -- assumes order is the same

sherloc.df <- cbind(pixl.df[,c("sample","type","campaign","abrasion")],sherloc.matrix)

# Measurements are everything except first column
sherloc.matrix<-as.matrix(sherloc.matrix[,-1])

###
#Add in wss plot for elbow method clustering
wssplot <- function(data, nc = 15, seed =10, title="Quality of k-means by Cluster") {
  wss <- data.frame(cluster=1:nc, quality=c(0))
  for (i in 1:nc){
    set.seed(seed)
    wss[i,2] <- kmeans(data, centers=i)$tot.withinss}
  ggplot(data=wss,aes(x=cluster,y=quality)) +
    geom_line() +
    ggtitle(title)
}

#Make a scaled version of pixl using a log scale
#Make log scale function
#Applies log function to each column in given table, the z score of the column is the base of the log function applied to the column
#Not going to be used, for the moment
# logScale <- function(frame) {
#   #Try converting frame to matrix
#   try(frame <- as.matrix(frame), TRUE)
#   #Center and absolute value frame
#   frame <- frame %>% scale(center=TRUE,scale=FALSE) %>% abs()
#   #Prepare data frame to take scaled columns of frame
#   #Scaling goes through each column in frame, finds z score of each column and applies log base z to each respective column
#   frame.scaled <- data.frame()
#   for (i in 1:ncol(frame)) {
#     frame.scaled[1:nrow(frame),colnames(frame)[i]] <- log(x=frame[,i],base = sd(frame[,i]))
#   }
#   #Produce scaled frame
#   frame.scaled
# }
#Just do log10 on a matrix

seed <- 14
set.seed
```

##Before Questions, important notes
-Lithology and Sherloc measure the exact same features, and a point in lithology is 1 if the same point in sherloc is non zero. So effectively, sherloc and lithology are the same, but sherloc provides more detail than lithology.
  -The extra detail from sherloc is not very reliable, since it was derived from text descriptions of each measurement
-The atmospheric sample is not being regarded alongside the other samples because it is fundamentally different and will confuse analysis of the other 15 samples
-Samples 17 and 18 have been released
-I'm not using Sherloc for simplicity for the moment
## Analysis: Question 1 (Clustering and Campaign)

### Question being asked

_Provide in natural language a statement of what question you're trying to answer_
What does clustering reveal about Lithology and Pixl? Do certain clusters correlate to certain campaigns?

### Data Preparation

_Provide in natural language a description of the data you are using for this analysis_

_Include a step-by-step description of how you prepare your data for analysis_

_If you're re-using dataframes prepared in another section, simply re-state what data you're using_

Perform elbow test on lithology and pixl to pick # of clusters

```{r, result01_data}
# Include all data processing code (if necessary), clearly commented
#Do elbow method on each data set preparing for clustering
wssplot(lithology.matrix, nc=8, seed=14)
#4 clusters

wssplot(pixl.matrix, nc=8, seed=14)
#3 clusters
```

So cluster Lithology to 4 clusters and pixl to 3
### Analysis: Methods and results

_Describe in natural language a statement of the analysis you're trying to do_

_Provide clearly commented analysis code; include code for tables and figures!_
Perform kmeans on lithology and pixl, display results with table

```{r, result01_analysis}
# Include all analysis code, clearly commented
#Data is binary, no need for scaling
lith.kmeans <- kmeans(lithology.matrix, 4)
#Add cluster # to litho matrix
lithology.df["Cluster"] <- lith.kmeans[["cluster"]]
lithology.df[c("Cluster","campaign")]
#Litho Results
table(lithology.df[c("campaign","Cluster")])

#Cluster pixl.scaled
#pixl.kmeans <- kmeans(pixl.matrix, 4)
pixl.kmeans <- kmeans(pixl.matrix, 3)
#Add cluster # to pixl matrix
pixl.df["Cluster"] <- pixl.kmeans[["cluster"]]
pixl.df[c("Cluster","campaign")]
#Litho Results
table(pixl.df[c("campaign","Cluster")])
#Note I tried using kable, however couldn't find a way for it to display the total counts, instead it showed a longformat table
```

### Discussion of results

_Provide in natural language a clear discussion of your observations._

Lithology:
Crater Floor contains clusters 1,2, & 4.
Delta Front contains clusters 2,3, & 4.

Pixl.scaled:
Crater Floor contains clusters 1, 2, & 3.
Delta Front contains clusters 2 & 3

Across Lithology & Pixl, there are clusters present in Crater Floor but not in Delta Front!

Additionally, I will make heat maps to show the distribution of features across each cluster

```{R}
#Heat map for Lithology
rownames(lith.kmeans$centers) <- c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4")
pheatmap(lith.kmeans$centers, scale="none", main="Lithology Feature Distribution by Cluster", fontsize = 12)

#Heat map for Pixl
rownames(pixl.kmeans$centers) <- c("Cluster 1", "Cluster 2", "Cluster 3")
pheatmap(pixl.kmeans$centers, scale="none", main="Pixl Feature Distribution by Cluster", fontsize =12)
```
From these we can conclude
Lithology:
Cluster 1
  -Uniquely high in Amorphous Silicate, Phosphate, Hydrated Ca Sulfate, Plagioclase, and FeTi Oxides
Cluster 2
  -Uniquely midlevel for Spinels, Zircon, Ilmenite, Chromite, apatite, and Hydrated Sulfates
Cluster 3
  -Uniquely high in Kaolinite, Hydrated MgFe Sulfate, FeMg Clay, and Mg Sulfate
Cluster 4
  -Uniquely high in Other Hydrated Phases & Phyllosilicates
Note some features are high across multiple clusters, which is significant as well

Tying into Campaign, this means Crater Floor samples are uniquely high in the features described above for cluster 1,
  and Delta Front is uniquely high in features described above for cluster 3.

Pixl:
Cluster 1
  -Uniquely low in Cr2O3
Cluster 2
  -High in SO3
Cluster 3
  -Not much stands out

Tying into Campaign, this means Crater Floor is uniquely low in Cr2O3 compared to Delta Front

## Analysis: Question 2 (Provide short name)

### Question being asked

_Provide in natural language a statement of what question you're trying to answer_
Compare feature distribution across campaigns via graphs

### Data Preparation

_Provide in natural language a description of the data you are using for this analysis_
Lithology, pixl, dividing by campaign and plotting feature distribution by campaign

_Include a step-by-step description of how you prepare your data for analysis_

_If you're re-using dataframes prepared in another section, simply re-state what data you're using_

```{r, result02_data}
# Include all data processing code (if necessary), clearly commented
#Start with lithology
#Group by campaign & remove metadata
lithology.df.sorted <- lithology.df %>% group_by(campaign) %>% select(-c(sample,name,SampleType,abrasion,Cluster))

#Turn into long form and only keep positive cases
lithology.df.sorted <- lithology.df.sorted %>% pivot_longer(2:ncol(lithology.df.sorted),names_to = "Feature", values_to="Factor") %>% filter(Factor == 1)

#Count # of identical cases
lithology.df.sorted <- lithology.df.sorted %>% count(Feature)

#Sort, Crater Floor is High to low & Delta Front is added back in low to high
lithology.df.sorted <- lithology.df.sorted %>% filter(campaign == "Crater Floor") %>% arrange(desc(n)) %>% ungroup() %>% add_row(lithology.df.sorted %>% filter(campaign == "Delta Front") %>% arrange(n))
```

### Analysis: Methods and Results

_Describe in natural language a statement of the analysis you're trying to do_

_Provide clearly commented analysis code; include code for tables and figures!_

```{r, result02_analysis}
p <- ggplot(lithology.df.sorted, aes(x=factor(Feature, levels = (Feature %>% unique())), y = n, fill = campaign)) +
  geom_col(position=position_dodge(preserve="total"), width=0.6) +
  theme(panel.grid.major.x=element_blank(), axis.text.x = element_text(angle = 60, vjust = 1.0, hjust=1, size = 12)) +
  labs(x="", y="Count") +
  ggtitle("Lithology Features Count by Campaign") +
  scale_fill_paletteer_d(palette = "fishualize::Cephalopholis_argus")

#ggplotly(p, tooltip = c("campaign",'x', "n"))
#Commented out to knit to pdf, picture at top of report
```

```{R}
#Make box plots
pixl.lf <- pixl.df %>% select(-c(sample, name, type, location, abrasion, Cluster)) %>% pivot_longer(1:13)
colnames(pixl.lf)<- c("campaign", "feature", "value")
ggplot(data = pixl.lf, aes(x=feature, y=value, color = campaign)) +
  geom_boxplot() +
  scale_y_log10() +
  ggtitle("pixl distribution by campaign") +
  labs(x="", y="log10 scale from percent composition")
```
### Discussion of results

_Provide in natural language a clear discussion of your observations._

Lithology:
Certain minerals are abundant in both campaigns, especially Crater Floor.
  -Carbonate is common in both campaigns
  -Organic Matter is also common in both campaigns
  -Sulfate and Olivine are also common in both

High in Crater Floor:
  -Pyroxene and amorphous silicate are abundant in Crater Floor but sparse in Delta Front

Fe_Mg_Clay, Hydrated_Mg_Fe_sulfate, Kaolinite, and Mg_sulfates are in 3 samples in Delta Front, but not at all in Crater Floor.

There are 20 minerals that are exclusively in either Crater Floor or Delta Front.

4 minerals have a count of zero, meaning they weren't detected in any campaign (Perchlorates, Na_Perchlorate, Hydrated_Carbonates, & Hydrated_Iron_Oxide). These minerals are present in the atmospheric sample, which is absent in this analysis.

The pixl graph reveals some big differences between Crater Floor and Delta Front. Namely in Al2O3, CaO, Cr2O3, MgO, P2O5, SO3, & SiO2.

During our presentation, Dr Roger noted that a predictor for Organic Matter would be very valuable, and also concluded Delta Front has some igneous components to it, contradicting the rock type on all Delta Front samples which says they are sedimentary.


## Analysis: Question 3 (Provide short name)

### Question being asked

_Provide in natural language a statement of what question you're trying to answer_

The data in pixl is represented by percentages. Is log scaling pixl better for clustering and PCA?


### Data Preparation

_Provide in natural language a description of the data you are using for this analysis_

_Include a step-by-step description of how you prepare your data for analysis_

_If you're re-using dataframes prepared in another section, re-state what data you're using_

```{r, result03_data}
# Include all data processing code (if necessary), clearly commented
#First replace 0.0 entries with 0.00001 so they don't scale to inf
pixl.matrix[pixl.matrix == 0] <- 0.00001
#Apply log10 to every entry in pixl.matrix & get new scaled df
pixl.scaled <- log10(pixl.matrix)
```

### Analysis methods used

_Describe in natural language a statement of the analysis you're trying to do_

First, how does clustering differ between pixl.matrix and pixl.scaled?

_Provide clearly commented analysis code; include code for tables and figures!_

```{r, result03_analysis}
# Include all analysis code, clearly commented
# If not possible, screen shots are acceptable.
# If your contributions included things that are not done in an R-notebook,
#   (e.g. researching, writing, and coding in Python), you still need to do
#   this status notebook in R.  Describe what you did here and put any products
#   that you created in github. If you are writing online documents (e.g. overleaf
#   or google docs), you can include links to the documents in this notebook
#   instead of actual text.
#Create an elbow plot for both pixl.matrix & pixl.scaled
wssplot(pixl.matrix, nc=8, seed=14, 'Unscaled')
wssplot(pixl.scaled, nc=8, seed=14, "Scaled")

#Do kmeans for both matrices
unscaled.kmeans <- kmeans(pixl.matrix, 3)
scaled.kmeans <- kmeans(pixl.scaled, 3)

#Produce heatmaps for both
pheatmap(unscaled.kmeans$centers, scale="none", main="Unscaled Pixl")
pheatmap(scaled.kmeans$centers, scale="none", main="Scaled Pixl")

#Do pca for both matrices
unscaled.pca <- prcomp(pixl.matrix)
scaled.pca <- prcomp(pixl.scaled)

#Make biplots
unscaled.plot <- ggbiplot::ggbiplot(unscaled.pca,
                   labels = pixl.df$type,
                   groups = as.factor(unscaled.kmeans$cluster)) +
                   ggtitle("Unscaled Pixl")

scaled.plot <- ggbiplot::ggbiplot(scaled.pca,
                   labels = pixl.df$type,
                   groups = as.factor(scaled.kmeans$cluster)) +
                   ggtitle("Scaled Pixl")

#ggplotly(unscaled.plot)
#ggplotly(scaled.plot)

pheatmap(pixl.scaled, scale="none")
```

### Discussion of results

_Provide in natural language a clear discussion of your observations._
Both elbow plots suggest 3 clusters as the best choice, however the "quality" value for the unscaled data is much higher than with the scaled data. - update: quality matters relatively, not absolutely. Thus this point is unimportant

Looking at the two biplots, the most influential features are totally different. For unscaled, the samples appear more spread out and the features appear more balanced than for the scaled biplot.

My suggestion is to not cluster using a log10 scaled pixl matrix from the above observations.


## Summary and next steps

_Provide in natural language a clear summary and your proposed next steps._

I scaled a copy of the pixl matrix, and then compared the two through a series of analysis. My conclusion is the scaled copy is not as good for clustering and PCA.
Next steps involve looking at other solutions for scaling, including scale() and the logscale function I made.
We concluded pixl should not be scaled.

Potential organic matter predictor.

I will continue exploring the differences between campaigns and implementing these features into the 2d app.
	---
	title: "DAR F24 Assignment 5 Notebook"
	author: "Ashton Compton"
	date: "`r Sys.Date()`"
	output:
	pdf_document:
	toc: yes
	html_document:
	toc: yes
	subtitle: "DAR Project Name: Mars"
	---

	![Dr. Roger's Mars Question on 9-4-24](../../Resources/MarsQuestions2024-09-04.png)

	These are Dr. Bennett's Mars questions from Day 1 lecture plus so more


	1) Whatever Dr. Roger’s wants.
	2) Develop new analyses/visualizations with an emphasis on integrative analysis of the different types of data (e.g. PIXL, SHERLOC, SuperCam/LIBS, and Lithology)
	3) Incorporate these analysis into improved Campfire demo. What would Dr. Rogers consider to be an improved Campfire demo?
	4) Develop standalone "2D" app incorporating Campfire and enhanced analysis/visualization capabilities.
	5) Understand LIBS and develop an insightful standalone Libs analysis.
	6) Develop a deeper understanding of each dataset by looking at NASA sources and published papers. Are we missing data on campaigns etc? Are we correctly integrating data? How does LIBS data correspond to the 16 Samples?

	Please put valuable resources (like websites and papers) on github https://github.rpi.edu/DataINCITE/DAR-Mars-F24/wiki Dr. Bennett has started this process by creating a wiki on DAR-MARS-F24 with short description and links to resources. Files (like letures are in DAR-Mars-F24/Resources. You can add more to the wiki and add files to the the Resources directory on github. You can edit the wiki too.

	## BiWeekly Work Summary

	NOTE: Follow an outline format; use bullets to express individual points.

	* RCS ID: compta
	* Project Name: Mars, DAR 2024
	* Summary of work since last week

	* Describe the important aspects of what you worked on and accomplished


	Created bar plot showing feature count of each feature in the Lithology dataframe, grouped by campaign.
	Reveals interesting differences between the two campaigns.

	Beginning work on scaling Pixl, going to make similar chart as the one described above but for pixl.


	* Summary of github commits

	* include branch name(s)
	* include browsable links to all external files on github
	* Include links to shared Shiny apps


	Commiting an html and rmd knit

	* List of presentations, papers, or other outputs

	* Include browsable links

	* List of references (if necessary)
	* Indicate any use of group shared code base
	* Indicate which parts of your described work were done by you or as part of joint efforts

	* Required: Provide illustrating figures and/or tables

	![Lithology Feature Count by Campaign](../../StudentNotebooks/Assignment04/LithologyFeatCountbyCampaign.png)
	![Lithology Cluster Distribution ](../../StudentNotebooks/Assignment05/lithologyFeatureDistributionHeatmap.png)
	![pixlDistributionbyCampaign](../../StudentNotebooks/Assignment05/pixlDistributionbyCampaign.png)


	## Personal Contribution

	* Clearly defined, unique contribution(s) done by you: code, ideas, writing...
	* Include github issues you've addressed if any

	Load libaries
	Set up dataframes/matrices
	```{r setup, include=FALSE}

	knitr::opts_chunk$set(echo = TRUE)

	# Set the default CRAN repository
	local({r <- getOption("repos")
	r["CRAN"] <- "http://cran.r-project.org"
	options(repos=r)
	})

	if (!require("pandoc")) {
	install.packages("pandoc")
	library(pandoc)
	}

	# Required packages for M20 LIBS analysis
	if (!require("rmarkdown")) {
	install.packages("rmarkdown")
	library(rmarkdown)
	}
	if (!require("tidyverse")) {
	install.packages("tidyverse")
	library(tidyverse)
	}
	if (!require("stringr")) {
	install.packages("stringr")
	library(stringr)
	}

	if (!require("ggbiplot")) {
	install.packages("ggbiplot")
	library(ggbiplot)
	}

	if (!require("pheatmap")) {
	install.packages("pheatmap")
	library(pheatmap)
	}

	if (!require("knitr")) {
	install.packages("knitr")
	library(knitr)
	}

	if (!require("paletteer")) {
	install.packages("paletteer")
	library(paletteer)
	}

	if (!require("plotly")) {
	install.packages("plotly")
	library(plotly)
	}

	if (!require("GGally")) {
	install.packages("GGally")
	library(GGally)
	}
	```


	```{R}
	#Load in data
	###
	# Load the saved lithology data with locations added
	lithology.df<- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/mineral_data_static.Rds")

	# Cast samples as numbers
	lithology.df$sample <- as.numeric(lithology.df$sample)

	# Convert rest into factors
	lithology.df[sapply(lithology.df, is.character)] <-
	lapply(lithology.df[sapply(lithology.df, is.character)],
	as.factor)

	# Keep only first 16 samples because the data for the rest of the samples is not available yet
	#Also i'm getting rid of the atmospheric sample for now
	lithology.df<-lithology.df[2:16,]
	# Create a matrix containing only the numeric measurements. The remaining features are metadata about the sample.
	lithology.matrix <- sapply(lithology.df[,6:40],as.numeric)-1

	###
	# Load the saved PIXL data with locations added
	pixl.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/samples_pixl_wide.Rds")

	# Convert to factors
	pixl.df[sapply(pixl.df, is.character)] <- lapply(pixl.df[sapply(pixl.df, is.character)],
	as.factor)

	#Get rid of atmospheric sample
	pixl.df <- pixl.df[2:16,]

	# Make the matrix of just mineral percentage measurements
	pixl.matrix <- pixl.df[,2:14]

	###
	# Load the saved LIBS data with locations added
	libs.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/supercam_libs_moc_loc.Rds")

	#Drop features that are not to be used in the analysis for this notebook
	libs.df <- libs.df %>%
	select(!(c(distance_mm,Tot.Em.,SiO2_stdev,TiO2_stdev,Al2O3_stdev,FeOT_stdev,
	MgO_stdev,Na2O_stdev,CaO_stdev,K2O_stdev,Total)))

	# Convert the points to numeric
	libs.df$point <- as.numeric(libs.df$point)

	# Make the a matrix contain only the libs measurements for each mineral
	libs.matrix <- as.matrix(libs.df[,6:13])

	###
	# Read in data as provided.
	sherloc_abrasion_raw <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/abrasions_sherloc_samples.Rds")

	# Clean up data types
	sherloc_abrasion_raw$Mineral<-as.factor(sherloc_abrasion_raw$Mineral)
	sherloc_abrasion_raw[sapply(sherloc_abrasion_raw, is.character)] <- lapply(sherloc_abrasion_raw[sapply(sherloc_abrasion_raw, is.character)],
	as.numeric)
	# Transform NA's to 0
	sherloc_abrasion_raw <- sherloc_abrasion_raw %>% replace(is.na(.), 0)

	# Reformat data so that rows are "abrasions" and columns list the presence of minerals.
	# Do this by "pivoting" to a long format, and then back to the desired wide format.

	sherloc_long <- sherloc_abrasion_raw %>%
	pivot_longer(!Mineral, names_to = "Name", values_to = "Presence")

	# Make abrasion a factor
	sherloc_long$Name <- as.factor(sherloc_long$Name)

	# Make it a matrix
	sherloc.matrix <- sherloc_long %>%
	pivot_wider(names_from = Mineral, values_from = Presence)

	#Remove atmospheric sample
	sherloc.matrix <- sherloc.matrix[2:16,]

	# Get sample information from PIXL and add to measurements -- assumes order is the same

	sherloc.df <- cbind(pixl.df[,c("sample","type","campaign","abrasion")],sherloc.matrix)

	# Measurements are everything except first column
	sherloc.matrix<-as.matrix(sherloc.matrix[,-1])

	###
	#Add in wss plot for elbow method clustering
	wssplot <- function(data, nc = 15, seed =10, title="Quality of k-means by Cluster") {
	wss <- data.frame(cluster=1:nc, quality=c(0))
	for (i in 1:nc){
	set.seed(seed)
	wss[i,2] <- kmeans(data, centers=i)$tot.withinss}
	ggplot(data=wss,aes(x=cluster,y=quality)) +
	geom_line() +
	ggtitle(title)
	}

	#Make a scaled version of pixl using a log scale
	#Make log scale function
	#Applies log function to each column in given table, the z score of the column is the base of the log function applied to the column
	#Not going to be used, for the moment
	# logScale <- function(frame) {
	# #Try converting frame to matrix
	# try(frame <- as.matrix(frame), TRUE)
	# #Center and absolute value frame
	# frame <- frame %>% scale(center=TRUE,scale=FALSE) %>% abs()
	# #Prepare data frame to take scaled columns of frame
	# #Scaling goes through each column in frame, finds z score of each column and applies log base z to each respective column
	# frame.scaled <- data.frame()
	# for (i in 1:ncol(frame)) {
	# frame.scaled[1:nrow(frame),colnames(frame)[i]] <- log(x=frame[,i],base = sd(frame[,i]))
	# }
	# #Produce scaled frame
	# frame.scaled
	# }
	#Just do log10 on a matrix

	seed <- 14
	set.seed
	```

	##Before Questions, important notes
	-Lithology and Sherloc measure the exact same features, and a point in lithology is 1 if the same point in sherloc is non zero. So effectively, sherloc and lithology are the same, but sherloc provides more detail than lithology.
	-The extra detail from sherloc is not very reliable, since it was derived from text descriptions of each measurement
	-The atmospheric sample is not being regarded alongside the other samples because it is fundamentally different and will confuse analysis of the other 15 samples
	-Samples 17 and 18 have been released
	-I'm not using Sherloc for simplicity for the moment
	## Analysis: Question 1 (Clustering and Campaign)

	### Question being asked

	_Provide in natural language a statement of what question you're trying to answer_
	What does clustering reveal about Lithology and Pixl? Do certain clusters correlate to certain campaigns?

	### Data Preparation

	_Provide in natural language a description of the data you are using for this analysis_

	_Include a step-by-step description of how you prepare your data for analysis_

	_If you're re-using dataframes prepared in another section, simply re-state what data you're using_

	Perform elbow test on lithology and pixl to pick # of clusters

	```{r, result01_data}
	# Include all data processing code (if necessary), clearly commented
	#Do elbow method on each data set preparing for clustering
	wssplot(lithology.matrix, nc=8, seed=14)
	#4 clusters

	wssplot(pixl.matrix, nc=8, seed=14)
	#3 clusters
	```

	So cluster Lithology to 4 clusters and pixl to 3
	### Analysis: Methods and results

	_Describe in natural language a statement of the analysis you're trying to do_

	_Provide clearly commented analysis code; include code for tables and figures!_
	Perform kmeans on lithology and pixl, display results with table

	```{r, result01_analysis}
	# Include all analysis code, clearly commented
	#Data is binary, no need for scaling
	lith.kmeans <- kmeans(lithology.matrix, 4)
	#Add cluster # to litho matrix
	lithology.df["Cluster"] <- lith.kmeans[["cluster"]]
	lithology.df[c("Cluster","campaign")]
	#Litho Results
	table(lithology.df[c("campaign","Cluster")])

	#Cluster pixl.scaled
	#pixl.kmeans <- kmeans(pixl.matrix, 4)
	pixl.kmeans <- kmeans(pixl.matrix, 3)
	#Add cluster # to pixl matrix
	pixl.df["Cluster"] <- pixl.kmeans[["cluster"]]
	pixl.df[c("Cluster","campaign")]
	#Litho Results
	table(pixl.df[c("campaign","Cluster")])
	#Note I tried using kable, however couldn't find a way for it to display the total counts, instead it showed a longformat table
	```

	### Discussion of results

	_Provide in natural language a clear discussion of your observations._

	Lithology:
	Crater Floor contains clusters 1,2, & 4.
	Delta Front contains clusters 2,3, & 4.

	Pixl.scaled:
	Crater Floor contains clusters 1, 2, & 3.
	Delta Front contains clusters 2 & 3

	Across Lithology & Pixl, there are clusters present in Crater Floor but not in Delta Front!

	Additionally, I will make heat maps to show the distribution of features across each cluster

	```{R}
	#Heat map for Lithology
	rownames(lith.kmeans$centers) <- c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4")
	pheatmap(lith.kmeans$centers, scale="none", main="Lithology Feature Distribution by Cluster", fontsize = 12)

	#Heat map for Pixl
	rownames(pixl.kmeans$centers) <- c("Cluster 1", "Cluster 2", "Cluster 3")
	pheatmap(pixl.kmeans$centers, scale="none", main="Pixl Feature Distribution by Cluster", fontsize =12)
	```
	From these we can conclude
	Lithology:
	Cluster 1
	-Uniquely high in Amorphous Silicate, Phosphate, Hydrated Ca Sulfate, Plagioclase, and FeTi Oxides
	Cluster 2
	-Uniquely midlevel for Spinels, Zircon, Ilmenite, Chromite, apatite, and Hydrated Sulfates
	Cluster 3
	-Uniquely high in Kaolinite, Hydrated MgFe Sulfate, FeMg Clay, and Mg Sulfate
	Cluster 4
	-Uniquely high in Other Hydrated Phases & Phyllosilicates
	Note some features are high across multiple clusters, which is significant as well

	Tying into Campaign, this means Crater Floor samples are uniquely high in the features described above for cluster 1,
	and Delta Front is uniquely high in features described above for cluster 3.

	Pixl:
	Cluster 1
	-Uniquely low in Cr2O3
	Cluster 2
	-High in SO3
	Cluster 3
	-Not much stands out

	Tying into Campaign, this means Crater Floor is uniquely low in Cr2O3 compared to Delta Front

	## Analysis: Question 2 (Provide short name)

	### Question being asked

	_Provide in natural language a statement of what question you're trying to answer_
	Compare feature distribution across campaigns via graphs

	### Data Preparation

	_Provide in natural language a description of the data you are using for this analysis_
	Lithology, pixl, dividing by campaign and plotting feature distribution by campaign

	_Include a step-by-step description of how you prepare your data for analysis_

	_If you're re-using dataframes prepared in another section, simply re-state what data you're using_

	```{r, result02_data}
	# Include all data processing code (if necessary), clearly commented
	#Start with lithology
	#Group by campaign & remove metadata
	lithology.df.sorted <- lithology.df %>% group_by(campaign) %>% select(-c(sample,name,SampleType,abrasion,Cluster))

	#Turn into long form and only keep positive cases
	lithology.df.sorted <- lithology.df.sorted %>% pivot_longer(2:ncol(lithology.df.sorted),names_to = "Feature", values_to="Factor") %>% filter(Factor == 1)

	#Count # of identical cases
	lithology.df.sorted <- lithology.df.sorted %>% count(Feature)

	#Sort, Crater Floor is High to low & Delta Front is added back in low to high
	lithology.df.sorted <- lithology.df.sorted %>% filter(campaign == "Crater Floor") %>% arrange(desc(n)) %>% ungroup() %>% add_row(lithology.df.sorted %>% filter(campaign == "Delta Front") %>% arrange(n))
	```

	### Analysis: Methods and Results

	_Describe in natural language a statement of the analysis you're trying to do_

	_Provide clearly commented analysis code; include code for tables and figures!_

	```{r, result02_analysis}
	p <- ggplot(lithology.df.sorted, aes(x=factor(Feature, levels = (Feature %>% unique())), y = n, fill = campaign)) +
	geom_col(position=position_dodge(preserve="total"), width=0.6) +
	theme(panel.grid.major.x=element_blank(), axis.text.x = element_text(angle = 60, vjust = 1.0, hjust=1, size = 12)) +
	labs(x="", y="Count") +
	ggtitle("Lithology Features Count by Campaign") +
	scale_fill_paletteer_d(palette = "fishualize::Cephalopholis_argus")

	#ggplotly(p, tooltip = c("campaign",'x', "n"))
	#Commented out to knit to pdf, picture at top of report
	```

	```{R}
	#Make box plots
	pixl.lf <- pixl.df %>% select(-c(sample, name, type, location, abrasion, Cluster)) %>% pivot_longer(1:13)
	colnames(pixl.lf)<- c("campaign", "feature", "value")
	ggplot(data = pixl.lf, aes(x=feature, y=value, color = campaign)) +
	geom_boxplot() +
	scale_y_log10() +
	ggtitle("pixl distribution by campaign") +
	labs(x="", y="log10 scale from percent composition")
	```
	### Discussion of results

	_Provide in natural language a clear discussion of your observations._

	Lithology:
	Certain minerals are abundant in both campaigns, especially Crater Floor.
	-Carbonate is common in both campaigns
	-Organic Matter is also common in both campaigns
	-Sulfate and Olivine are also common in both

	High in Crater Floor:
	-Pyroxene and amorphous silicate are abundant in Crater Floor but sparse in Delta Front

	Fe_Mg_Clay, Hydrated_Mg_Fe_sulfate, Kaolinite, and Mg_sulfates are in 3 samples in Delta Front, but not at all in Crater Floor.

	There are 20 minerals that are exclusively in either Crater Floor or Delta Front.

	4 minerals have a count of zero, meaning they weren't detected in any campaign (Perchlorates, Na_Perchlorate, Hydrated_Carbonates, & Hydrated_Iron_Oxide). These minerals are present in the atmospheric sample, which is absent in this analysis.

	The pixl graph reveals some big differences between Crater Floor and Delta Front. Namely in Al2O3, CaO, Cr2O3, MgO, P2O5, SO3, & SiO2.

	During our presentation, Dr Roger noted that a predictor for Organic Matter would be very valuable, and also concluded Delta Front has some igneous components to it, contradicting the rock type on all Delta Front samples which says they are sedimentary.



	## Analysis: Question 3 (Provide short name)

	### Question being asked

	_Provide in natural language a statement of what question you're trying to answer_

	The data in pixl is represented by percentages. Is log scaling pixl better for clustering and PCA?



	### Data Preparation

	_Provide in natural language a description of the data you are using for this analysis_

	_Include a step-by-step description of how you prepare your data for analysis_

	_If you're re-using dataframes prepared in another section, re-state what data you're using_

	```{r, result03_data}
	# Include all data processing code (if necessary), clearly commented
	#First replace 0.0 entries with 0.00001 so they don't scale to inf
	pixl.matrix[pixl.matrix == 0] <- 0.00001
	#Apply log10 to every entry in pixl.matrix & get new scaled df
	pixl.scaled <- log10(pixl.matrix)
	```

	### Analysis methods used

	_Describe in natural language a statement of the analysis you're trying to do_

	First, how does clustering differ between pixl.matrix and pixl.scaled?

	_Provide clearly commented analysis code; include code for tables and figures!_

	```{r, result03_analysis}
	# Include all analysis code, clearly commented
	# If not possible, screen shots are acceptable.
	# If your contributions included things that are not done in an R-notebook,
	# (e.g. researching, writing, and coding in Python), you still need to do
	# this status notebook in R. Describe what you did here and put any products
	# that you created in github. If you are writing online documents (e.g. overleaf
	# or google docs), you can include links to the documents in this notebook
	# instead of actual text.
	#Create an elbow plot for both pixl.matrix & pixl.scaled
	wssplot(pixl.matrix, nc=8, seed=14, 'Unscaled')
	wssplot(pixl.scaled, nc=8, seed=14, "Scaled")

	#Do kmeans for both matrices
	unscaled.kmeans <- kmeans(pixl.matrix, 3)
	scaled.kmeans <- kmeans(pixl.scaled, 3)

	#Produce heatmaps for both
	pheatmap(unscaled.kmeans$centers, scale="none", main="Unscaled Pixl")
	pheatmap(scaled.kmeans$centers, scale="none", main="Scaled Pixl")

	#Do pca for both matrices
	unscaled.pca <- prcomp(pixl.matrix)
	scaled.pca <- prcomp(pixl.scaled)

	#Make biplots
	unscaled.plot <- ggbiplot::ggbiplot(unscaled.pca,
	labels = pixl.df$type,
	groups = as.factor(unscaled.kmeans$cluster)) +
	ggtitle("Unscaled Pixl")

	scaled.plot <- ggbiplot::ggbiplot(scaled.pca,
	labels = pixl.df$type,
	groups = as.factor(scaled.kmeans$cluster)) +
	ggtitle("Scaled Pixl")

	#ggplotly(unscaled.plot)
	#ggplotly(scaled.plot)

	pheatmap(pixl.scaled, scale="none")
	```

	### Discussion of results

	_Provide in natural language a clear discussion of your observations._
	Both elbow plots suggest 3 clusters as the best choice, however the "quality" value for the unscaled data is much higher than with the scaled data. - update: quality matters relatively, not absolutely. Thus this point is unimportant

	Looking at the two biplots, the most influential features are totally different. For unscaled, the samples appear more spread out and the features appear more balanced than for the scaled biplot.

	My suggestion is to not cluster using a log10 scaled pixl matrix from the above observations.


	## Summary and next steps

	_Provide in natural language a clear summary and your proposed next steps._

	I scaled a copy of the pixl matrix, and then compared the two through a series of analysis. My conclusion is the scaled copy is not as good for clustering and PCA.
	Next steps involve looking at other solutions for scaling, including scale() and the logscale function I made.
	We concluded pixl should not be scaled.

	Potential organic matter predictor.

	I will continue exploring the differences between campaigns and implementing these features into the 2d app.