diff --git a/StudentNotebooks/Assignment05/compta-assignment05_f24.Rmd b/StudentNotebooks/Assignment05/compta-assignment05_f24.Rmd new file mode 100644 index 0000000..4275b0e --- /dev/null +++ b/StudentNotebooks/Assignment05/compta-assignment05_f24.Rmd @@ -0,0 +1,551 @@ +--- +title: "DAR F24 Assignment 5 Notebook" +author: "Ashton Compton" +date: "`r Sys.Date()`" +output: + pdf_document: + toc: yes + html_document: + toc: yes +subtitle: "DAR Project Name: Mars" +--- + +![Dr. Roger's Mars Question on 9-4-24](../../Resources/MarsQuestions2024-09-04.png) + +These are Dr. Bennett's Mars questions from Day 1 lecture plus so more + + +1) Whatever Dr. Roger’s wants. +2) Develop new analyses/visualizations with an emphasis on integrative analysis of the different types of data (e.g. PIXL, SHERLOC, SuperCam/LIBS, and Lithology) +3) Incorporate these analysis into improved Campfire demo. What would Dr. Rogers consider to be an improved Campfire demo? +4) Develop standalone "2D" app incorporating Campfire and enhanced analysis/visualization capabilities. +5) Understand LIBS and develop an insightful standalone Libs analysis. +6) Develop a deeper understanding of each dataset by looking at NASA sources and published papers. Are we missing data on campaigns etc? Are we correctly integrating data? How does LIBS data correspond to the 16 Samples? + +Please put valuable resources (like websites and papers) on github https://github.rpi.edu/DataINCITE/DAR-Mars-F24/wiki Dr. Bennett has started this process by creating a wiki on DAR-MARS-F24 with short description and links to resources. Files (like letures are in DAR-Mars-F24/Resources. You can add more to the wiki and add files to the the Resources directory on github. You can edit the wiki too. + +## BiWeekly Work Summary + +**NOTE:** Follow an outline format; use bullets to express individual points. + +* RCS ID: compta +* Project Name: Mars, DAR 2024 +* Summary of work since last week + + * Describe the important aspects of what you worked on and accomplished + + + Created bar plot showing feature count of each feature in the Lithology dataframe, grouped by campaign. + Reveals interesting differences between the two campaigns. + + Beginning work on scaling Pixl, going to make similar chart as the one described above but for pixl. + + +* Summary of github commits + + * include branch name(s) + * include browsable links to all external files on github + * Include links to shared Shiny apps + + + Commiting an html and rmd knit + +* List of presentations, papers, or other outputs + + * Include browsable links + +* List of references (if necessary) +* Indicate any use of group shared code base +* Indicate which parts of your described work were done by you or as part of joint efforts + +* **Required:** Provide illustrating figures and/or tables + +![Lithology Feature Count by Campaign](../../StudentNotebooks/Assignment04/LithologyFeatCountbyCampaign.png) +![Lithology Cluster Distribution ](../../StudentNotebooks/Assignment05/lithologyFeatureDistributionHeatmap.png) +![pixlDistributionbyCampaign](../../StudentNotebooks/Assignment05/pixlDistributionbyCampaign.png) + + +## Personal Contribution + +* Clearly defined, unique contribution(s) done by you: code, ideas, writing... 
+* Include github issues you've addressed if any + +Load libaries +Set up dataframes/matrices +```{r setup, include=FALSE} + +knitr::opts_chunk$set(echo = TRUE) + +# Set the default CRAN repository +local({r <- getOption("repos") + r["CRAN"] <- "http://cran.r-project.org" + options(repos=r) +}) + +if (!require("pandoc")) { + install.packages("pandoc") + library(pandoc) +} + +# Required packages for M20 LIBS analysis +if (!require("rmarkdown")) { + install.packages("rmarkdown") + library(rmarkdown) +} +if (!require("tidyverse")) { + install.packages("tidyverse") + library(tidyverse) +} +if (!require("stringr")) { + install.packages("stringr") + library(stringr) +} + +if (!require("ggbiplot")) { + install.packages("ggbiplot") + library(ggbiplot) +} + +if (!require("pheatmap")) { + install.packages("pheatmap") + library(pheatmap) +} + +if (!require("knitr")) { + install.packages("knitr") + library(knitr) +} + +if (!require("paletteer")) { + install.packages("paletteer") + library(paletteer) +} + +if (!require("plotly")) { + install.packages("plotly") + library(plotly) +} + +if (!require("GGally")) { + install.packages("GGally") + library(GGally) +} +``` + + +```{R} +#Load in data +### +# Load the saved lithology data with locations added +lithology.df<- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/mineral_data_static.Rds") + +# Cast samples as numbers +lithology.df$sample <- as.numeric(lithology.df$sample) + +# Convert rest into factors +lithology.df[sapply(lithology.df, is.character)] <- + lapply(lithology.df[sapply(lithology.df, is.character)], + as.factor) + +# Keep only first 16 samples because the data for the rest of the samples is not available yet +#Also i'm getting rid of the atmospheric sample for now +lithology.df<-lithology.df[2:16,] +# Create a matrix containing only the numeric measurements. The remaining features are metadata about the sample. +lithology.matrix <- sapply(lithology.df[,6:40],as.numeric)-1 + +### +# Load the saved PIXL data with locations added +pixl.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/samples_pixl_wide.Rds") + +# Convert to factors +pixl.df[sapply(pixl.df, is.character)] <- lapply(pixl.df[sapply(pixl.df, is.character)], + as.factor) + +#Get rid of atmospheric sample +pixl.df <- pixl.df[2:16,] + +# Make the matrix of just mineral percentage measurements +pixl.matrix <- pixl.df[,2:14] + +### +# Load the saved LIBS data with locations added +libs.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/supercam_libs_moc_loc.Rds") + +#Drop features that are not to be used in the analysis for this notebook +libs.df <- libs.df %>% + select(!(c(distance_mm,Tot.Em.,SiO2_stdev,TiO2_stdev,Al2O3_stdev,FeOT_stdev, + MgO_stdev,Na2O_stdev,CaO_stdev,K2O_stdev,Total))) + +# Convert the points to numeric +libs.df$point <- as.numeric(libs.df$point) + +# Make the a matrix contain only the libs measurements for each mineral +libs.matrix <- as.matrix(libs.df[,6:13]) + +### +# Read in data as provided. +sherloc_abrasion_raw <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/abrasions_sherloc_samples.Rds") + +# Clean up data types +sherloc_abrasion_raw$Mineral<-as.factor(sherloc_abrasion_raw$Mineral) +sherloc_abrasion_raw[sapply(sherloc_abrasion_raw, is.character)] <- lapply(sherloc_abrasion_raw[sapply(sherloc_abrasion_raw, is.character)], + as.numeric) +# Transform NA's to 0 +sherloc_abrasion_raw <- sherloc_abrasion_raw %>% replace(is.na(.), 0) + +# Reformat data so that rows are "abrasions" and columns list the presence of minerals. 
+# Do this by "pivoting" to a long format, and then back to the desired wide format. + +sherloc_long <- sherloc_abrasion_raw %>% + pivot_longer(!Mineral, names_to = "Name", values_to = "Presence") + +# Make abrasion a factor +sherloc_long$Name <- as.factor(sherloc_long$Name) + +# Make it a matrix +sherloc.matrix <- sherloc_long %>% + pivot_wider(names_from = Mineral, values_from = Presence) + +#Remove atmospheric sample +sherloc.matrix <- sherloc.matrix[2:16,] + +# Get sample information from PIXL and add to measurements -- assumes order is the same + +sherloc.df <- cbind(pixl.df[,c("sample","type","campaign","abrasion")],sherloc.matrix) + +# Measurements are everything except first column +sherloc.matrix<-as.matrix(sherloc.matrix[,-1]) + +### +#Add in wss plot for elbow method clustering +wssplot <- function(data, nc = 15, seed =10, title="Quality of k-means by Cluster") { + wss <- data.frame(cluster=1:nc, quality=c(0)) + for (i in 1:nc){ + set.seed(seed) + wss[i,2] <- kmeans(data, centers=i)$tot.withinss} + ggplot(data=wss,aes(x=cluster,y=quality)) + + geom_line() + + ggtitle(title) +} + +#Make a scaled version of pixl using a log scale +#Make log scale function +#Applies log function to each column in given table, the z score of the column is the base of the log function applied to the column +#Not going to be used, for the moment +# logScale <- function(frame) { +# #Try converting frame to matrix +# try(frame <- as.matrix(frame), TRUE) +# #Center and absolute value frame +# frame <- frame %>% scale(center=TRUE,scale=FALSE) %>% abs() +# #Prepare data frame to take scaled columns of frame +# #Scaling goes through each column in frame, finds z score of each column and applies log base z to each respective column +# frame.scaled <- data.frame() +# for (i in 1:ncol(frame)) { +# frame.scaled[1:nrow(frame),colnames(frame)[i]] <- log(x=frame[,i],base = sd(frame[,i])) +# } +# #Produce scaled frame +# frame.scaled +# } +#Just do log10 on a matrix + +seed <- 14 +set.seed +``` + +##Before Questions, important notes +-Lithology and Sherloc measure the exact same features, and a point in lithology is 1 if the same point in sherloc is non zero. So effectively, sherloc and lithology are the same, but sherloc provides more detail than lithology. + -The extra detail from sherloc is not very reliable, since it was derived from text descriptions of each measurement +-The atmospheric sample is not being regarded alongside the other samples because it is fundamentally different and will confuse analysis of the other 15 samples +-Samples 17 and 18 have been released +-I'm not using Sherloc for simplicity for the moment +## Analysis: Question 1 (Clustering and Campaign) + +### Question being asked + +_Provide in natural language a statement of what question you're trying to answer_ +What does clustering reveal about Lithology and Pixl? Do certain clusters correlate to certain campaigns? 
+ +### Data Preparation + +_Provide in natural language a description of the data you are using for this analysis_ + +_Include a step-by-step description of how you prepare your data for analysis_ + +_If you're re-using dataframes prepared in another section, simply re-state what data you're using_ + +Perform elbow test on lithology and pixl to pick # of clusters + +```{r, result01_data} +# Include all data processing code (if necessary), clearly commented +#Do elbow method on each data set preparing for clustering +wssplot(lithology.matrix, nc=8, seed=14) +#4 clusters + +wssplot(pixl.matrix, nc=8, seed=14) +#3 clusters +``` + +So cluster Lithology to 4 clusters and pixl to 3 +### Analysis: Methods and results + +_Describe in natural language a statement of the analysis you're trying to do_ + +_Provide clearly commented analysis code; include code for tables and figures!_ +Perform kmeans on lithology and pixl, display results with table + +```{r, result01_analysis} +# Include all analysis code, clearly commented +#Data is binary, no need for scaling +lith.kmeans <- kmeans(lithology.matrix, 4) +#Add cluster # to litho matrix +lithology.df["Cluster"] <- lith.kmeans[["cluster"]] +lithology.df[c("Cluster","campaign")] +#Litho Results +table(lithology.df[c("campaign","Cluster")]) + +#Cluster pixl.scaled +#pixl.kmeans <- kmeans(pixl.matrix, 4) +pixl.kmeans <- kmeans(pixl.matrix, 3) +#Add cluster # to pixl matrix +pixl.df["Cluster"] <- pixl.kmeans[["cluster"]] +pixl.df[c("Cluster","campaign")] +#Litho Results +table(pixl.df[c("campaign","Cluster")]) +#Note I tried using kable, however couldn't find a way for it to display the total counts, instead it showed a longformat table +``` + +### Discussion of results + +_Provide in natural language a clear discussion of your observations._ + +Lithology: +Crater Floor contains clusters 1,2, & 4. +Delta Front contains clusters 2,3, & 4. + +Pixl.scaled: +Crater Floor contains clusters 1, 2, & 3. +Delta Front contains clusters 2 & 3 + +Across Lithology & Pixl, there are clusters present in Crater Floor but not in Delta Front! + +Additionally, I will make heat maps to show the distribution of features across each cluster + +```{R} +#Heat map for Lithology +rownames(lith.kmeans$centers) <- c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4") +pheatmap(lith.kmeans$centers, scale="none", main="Lithology Feature Distribution by Cluster", fontsize = 12) + +#Heat map for Pixl +rownames(pixl.kmeans$centers) <- c("Cluster 1", "Cluster 2", "Cluster 3") +pheatmap(pixl.kmeans$centers, scale="none", main="Pixl Feature Distribution by Cluster", fontsize =12) +``` +From these we can conclude +Lithology: +Cluster 1 + -Uniquely high in Amorphous Silicate, Phosphate, Hydrated Ca Sulfate, Plagioclase, and FeTi Oxides +Cluster 2 + -Uniquely midlevel for Spinels, Zircon, Ilmenite, Chromite, apatite, and Hydrated Sulfates +Cluster 3 + -Uniquely high in Kaolinite, Hydrated MgFe Sulfate, FeMg Clay, and Mg Sulfate +Cluster 4 + -Uniquely high in Other Hydrated Phases & Phyllosilicates +Note some features are high across multiple clusters, which is significant as well + +Tying into Campaign, this means Crater Floor samples are uniquely high in the features described above for cluster 1, + and Delta Front is uniquely high in features described above for cluster 3. 
+ +Pixl: +Cluster 1 + -Uniquely low in Cr2O3 +Cluster 2 + -High in SO3 +Cluster 3 + -Not much stands out + +Tying into Campaign, this means Crater Floor is uniquely low in Cr2O3 compared to Delta Front + +## Analysis: Question 2 (Provide short name) + +### Question being asked + +_Provide in natural language a statement of what question you're trying to answer_ +Compare feature distribution across campaigns via graphs + +### Data Preparation + +_Provide in natural language a description of the data you are using for this analysis_ +Lithology, pixl, dividing by campaign and plotting feature distribution by campaign + +_Include a step-by-step description of how you prepare your data for analysis_ + +_If you're re-using dataframes prepared in another section, simply re-state what data you're using_ + +```{r, result02_data} +# Include all data processing code (if necessary), clearly commented +#Start with lithology +#Group by campaign & remove metadata +lithology.df.sorted <- lithology.df %>% group_by(campaign) %>% select(-c(sample,name,SampleType,abrasion,Cluster)) + +#Turn into long form and only keep positive cases +lithology.df.sorted <- lithology.df.sorted %>% pivot_longer(2:ncol(lithology.df.sorted),names_to = "Feature", values_to="Factor") %>% filter(Factor == 1) + +#Count # of identical cases +lithology.df.sorted <- lithology.df.sorted %>% count(Feature) + +#Sort, Crater Floor is High to low & Delta Front is added back in low to high +lithology.df.sorted <- lithology.df.sorted %>% filter(campaign == "Crater Floor") %>% arrange(desc(n)) %>% ungroup() %>% add_row(lithology.df.sorted %>% filter(campaign == "Delta Front") %>% arrange(n)) +``` + +### Analysis: Methods and Results + +_Describe in natural language a statement of the analysis you're trying to do_ + +_Provide clearly commented analysis code; include code for tables and figures!_ + +```{r, result02_analysis} +p <- ggplot(lithology.df.sorted, aes(x=factor(Feature, levels = (Feature %>% unique())), y = n, fill = campaign)) + + geom_col(position=position_dodge(preserve="total"), width=0.6) + + theme(panel.grid.major.x=element_blank(), axis.text.x = element_text(angle = 60, vjust = 1.0, hjust=1, size = 12)) + + labs(x="", y="Count") + + ggtitle("Lithology Features Count by Campaign") + + scale_fill_paletteer_d(palette = "fishualize::Cephalopholis_argus") + +#ggplotly(p, tooltip = c("campaign",'x', "n")) +#Commented out to knit to pdf, picture at top of report +``` + +```{R} +#Make box plots +pixl.lf <- pixl.df %>% select(-c(sample, name, type, location, abrasion, Cluster)) %>% pivot_longer(1:13) +colnames(pixl.lf)<- c("campaign", "feature", "value") +ggplot(data = pixl.lf, aes(x=feature, y=value, color = campaign)) + + geom_boxplot() + + scale_y_log10() + + ggtitle("pixl distribution by campaign") + + labs(x="", y="log10 scale from percent composition") +``` +### Discussion of results + +_Provide in natural language a clear discussion of your observations._ + +Lithology: +Certain minerals are abundant in both campaigns, especially Crater Floor. + -Carbonate is common in both campaigns + -Organic Matter is also common in both campaigns + -Sulfate and Olivine are also common in both + +High in Crater Floor: + -Pyroxene and amorphous silicate are abundant in Crater Floor but sparse in Delta Front + +Fe_Mg_Clay, Hydrated_Mg_Fe_sulfate, Kaolinite, and Mg_sulfates are in 3 samples in Delta Front, but not at all in Crater Floor. + +There are 20 minerals that are exclusively in either Crater Floor or Delta Front. 
+ +4 minerals have a count of zero, meaning they weren't detected in any campaign (Perchlorates, Na_Perchlorate, Hydrated_Carbonates, & Hydrated_Iron_Oxide). These minerals are present in the atmospheric sample, which is absent in this analysis. + +The pixl graph reveals some big differences between Crater Floor and Delta Front. Namely in Al2O3, CaO, Cr2O3, MgO, P2O5, SO3, & SiO2. + +During our presentation, Dr Roger noted that a predictor for Organic Matter would be very valuable, and also concluded Delta Front has some igneous components to it, contradicting the rock type on all Delta Front samples which says they are sedimentary. + + + +## Analysis: Question 3 (Provide short name) + +### Question being asked + +_Provide in natural language a statement of what question you're trying to answer_ + +The data in pixl is represented by percentages. Is log scaling pixl better for clustering and PCA? + + + +### Data Preparation + +_Provide in natural language a description of the data you are using for this analysis_ + +_Include a step-by-step description of how you prepare your data for analysis_ + +_If you're re-using dataframes prepared in another section, re-state what data you're using_ + +```{r, result03_data} +# Include all data processing code (if necessary), clearly commented +#First replace 0.0 entries with 0.00001 so they don't scale to inf +pixl.matrix[pixl.matrix == 0] <- 0.00001 +#Apply log10 to every entry in pixl.matrix & get new scaled df +pixl.scaled <- log10(pixl.matrix) +``` + +### Analysis methods used + +_Describe in natural language a statement of the analysis you're trying to do_ + +First, how does clustering differ between pixl.matrix and pixl.scaled? + +_Provide clearly commented analysis code; include code for tables and figures!_ + +```{r, result03_analysis} +# Include all analysis code, clearly commented +# If not possible, screen shots are acceptable. +# If your contributions included things that are not done in an R-notebook, +# (e.g. researching, writing, and coding in Python), you still need to do +# this status notebook in R. Describe what you did here and put any products +# that you created in github. If you are writing online documents (e.g. overleaf +# or google docs), you can include links to the documents in this notebook +# instead of actual text. +#Create an elbow plot for both pixl.matrix & pixl.scaled +wssplot(pixl.matrix, nc=8, seed=14, 'Unscaled') +wssplot(pixl.scaled, nc=8, seed=14, "Scaled") + +#Do kmeans for both matrices +unscaled.kmeans <- kmeans(pixl.matrix, 3) +scaled.kmeans <- kmeans(pixl.scaled, 3) + +#Produce heatmaps for both +pheatmap(unscaled.kmeans$centers, scale="none", main="Unscaled Pixl") +pheatmap(scaled.kmeans$centers, scale="none", main="Scaled Pixl") + +#Do pca for both matrices +unscaled.pca <- prcomp(pixl.matrix) +scaled.pca <- prcomp(pixl.scaled) + +#Make biplots +unscaled.plot <- ggbiplot::ggbiplot(unscaled.pca, + labels = pixl.df$type, + groups = as.factor(unscaled.kmeans$cluster)) + + ggtitle("Unscaled Pixl") + +scaled.plot <- ggbiplot::ggbiplot(scaled.pca, + labels = pixl.df$type, + groups = as.factor(scaled.kmeans$cluster)) + + ggtitle("Scaled Pixl") + +#ggplotly(unscaled.plot) +#ggplotly(scaled.plot) + +pheatmap(pixl.scaled, scale="none") +``` + +### Discussion of results + +_Provide in natural language a clear discussion of your observations._ +Both elbow plots suggest 3 clusters as the best choice, however the "quality" value for the unscaled data is much higher than with the scaled data. 
- update: quality matters relatively, not absolutely. Thus this point is unimportant + +Looking at the two biplots, the most influential features are totally different. For unscaled, the samples appear more spread out and the features appear more balanced than for the scaled biplot. + +My suggestion is to not cluster using a log10 scaled pixl matrix from the above observations. + + +## Summary and next steps + +_Provide in natural language a clear summary and your proposed next steps._ + +I scaled a copy of the pixl matrix, and then compared the two through a series of analysis. My conclusion is the scaled copy is not as good for clustering and PCA. +Next steps involve looking at other solutions for scaling, including scale() and the logscale function I made. +We concluded pixl should not be scaled. + +Potential organic matter predictor. + +I will continue exploring the differences between campaigns and implementing these features into the 2d app. + + + diff --git a/StudentNotebooks/Assignment05/compta-assignment05_f24.html b/StudentNotebooks/Assignment05/compta-assignment05_f24.html new file mode 100644 index 0000000..a10f085 --- /dev/null +++ b/StudentNotebooks/Assignment05/compta-assignment05_f24.html @@ -0,0 +1,2951 @@ + + + + +
+Please put valuable resources (like websites and papers) on github https://github.rpi.edu/DataINCITE/DAR-Mars-F24/wiki +Dr. Bennett has started this process by creating a wiki on DAR-MARS-F24 +with short description and links to resources. Files (like letures are +in DAR-Mars-F24/Resources. You can add more to the wiki and add files to +the the Resources directory on github. You can edit the wiki too.
+NOTE: Follow an outline format; use bullets to +express individual points.
+RCS ID: compta
Project Name: Mars, DAR 2024
Summary of work since last week
+Created bar plot showing feature count of each feature in the +Lithology dataframe, grouped by campaign. Reveals interesting +differences between the two campaigns.
+Beginning work on scaling Pixl, going to make similar chart as the +one described above but for pixl.
Summary of github commits
+Commiting an html and rmd knit
List of presentations, papers, or other outputs
+List of references (if necessary)
Indicate any use of group shared code base
Indicate which parts of your described work were done by you or +as part of joint efforts
Required: Provide illustrating figures and/or +tables
+
Load libraries and set up dataframes/matrices
+#Load in data
+###
+# Load the saved lithology data with locations added
+lithology.df<- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/mineral_data_static.Rds")
+
+# Cast samples as numbers
+lithology.df$sample <- as.numeric(lithology.df$sample)
+
+# Convert rest into factors
+lithology.df[sapply(lithology.df, is.character)] <-
+ lapply(lithology.df[sapply(lithology.df, is.character)],
+ as.factor)
+
+# Keep only the first 16 samples, because data for the remaining samples is not available yet
+# Also, I'm dropping the atmospheric sample (row 1) for now
+lithology.df<-lithology.df[2:16,]
+# Create a matrix containing only the numeric measurements. The remaining features are metadata about the sample.
+lithology.matrix <- sapply(lithology.df[,6:40],as.numeric)-1
+
+###
+# Load the saved PIXL data with locations added
+pixl.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/samples_pixl_wide.Rds")
+
+# Convert to factors
+pixl.df[sapply(pixl.df, is.character)] <- lapply(pixl.df[sapply(pixl.df, is.character)],
+ as.factor)
+
+#Get rid of atmospheric sample
+pixl.df <- pixl.df[2:16,]
+
+# Make the matrix of just mineral percentage measurements
+pixl.matrix <- pixl.df[,2:14]
+
+###
+# Load the saved LIBS data with locations added
+libs.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/supercam_libs_moc_loc.Rds")
+
+#Drop features that are not to be used in the analysis for this notebook
+libs.df <- libs.df %>%
+ select(!(c(distance_mm,Tot.Em.,SiO2_stdev,TiO2_stdev,Al2O3_stdev,FeOT_stdev,
+ MgO_stdev,Na2O_stdev,CaO_stdev,K2O_stdev,Total)))
+
+# Convert the points to numeric
+libs.df$point <- as.numeric(libs.df$point)
+
+# Make a matrix containing only the LIBS measurements for each mineral
+libs.matrix <- as.matrix(libs.df[,6:13])
+
+###
+# Read in data as provided.
+sherloc_abrasion_raw <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/abrasions_sherloc_samples.Rds")
+
+# Clean up data types
+sherloc_abrasion_raw$Mineral<-as.factor(sherloc_abrasion_raw$Mineral)
+sherloc_abrasion_raw[sapply(sherloc_abrasion_raw, is.character)] <- lapply(sherloc_abrasion_raw[sapply(sherloc_abrasion_raw, is.character)],
+ as.numeric)
+# Transform NA's to 0
+sherloc_abrasion_raw <- sherloc_abrasion_raw %>% replace(is.na(.), 0)
+
+# Reformat data so that rows are "abrasions" and columns list the presence of minerals.
+# Do this by "pivoting" to a long format, and then back to the desired wide format.
+
+sherloc_long <- sherloc_abrasion_raw %>%
+ pivot_longer(!Mineral, names_to = "Name", values_to = "Presence")
+
+# Make abrasion a factor
+sherloc_long$Name <- as.factor(sherloc_long$Name)
+
+# Make it a matrix
+sherloc.matrix <- sherloc_long %>%
+ pivot_wider(names_from = Mineral, values_from = Presence)
+
+#Remove atmospheric sample
+sherloc.matrix <- sherloc.matrix[2:16,]
+
+# Get sample information from PIXL and add to measurements -- assumes order is the same
+
+sherloc.df <- cbind(pixl.df[,c("sample","type","campaign","abrasion")],sherloc.matrix)
+
+# Measurements are everything except first column
+sherloc.matrix<-as.matrix(sherloc.matrix[,-1])
+
+###
+#Add in wss plot for elbow method clustering
+wssplot <- function(data, nc = 15, seed =10, title="Quality of k-means by Cluster") {
+ wss <- data.frame(cluster=1:nc, quality=c(0))
+ for (i in 1:nc){
+ set.seed(seed)
+ wss[i,2] <- kmeans(data, centers=i)$tot.withinss}
+ ggplot(data=wss,aes(x=cluster,y=quality)) +
+ geom_line() +
+ ggtitle(title)
+}
+
+#Make a scaled version of pixl using a log scale
+#Make log scale function
+#Applies log function to each column in given table, the z score of the column is the base of the log function applied to the column
+#Not going to be used, for the moment
+# logScale <- function(frame) {
+# #Try converting frame to matrix
+# try(frame <- as.matrix(frame), TRUE)
+# #Center and absolute value frame
+# frame <- frame %>% scale(center=TRUE,scale=FALSE) %>% abs()
+# #Prepare data frame to take scaled columns of frame
+# #Scaling goes through each column in frame, finds z score of each column and applies log base z to each respective column
+# frame.scaled <- data.frame()
+# for (i in 1:ncol(frame)) {
+# frame.scaled[1:nrow(frame),colnames(frame)[i]] <- log(x=frame[,i],base = sd(frame[,i]))
+# }
+# #Produce scaled frame
+# frame.scaled
+# }
+#Just do log10 on a matrix
+
+seed <- 14
+set.seed(seed)
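One sanity check worth considering at the end of this setup, given the note below that Lithology is effectively a binarized version of the SHERLOC data: a minimal sketch, assuming the two matrices cover the same 15 samples in the same order and share mineral column names (any renamed columns would need to be mapped first).

```r
# Hypothetical cross-check (not in the original notebook): every Lithology
# flag should be 1 exactly where the corresponding SHERLOC value is non-zero.
common <- intersect(colnames(lithology.matrix), colnames(sherloc.matrix))
all((sherloc.matrix[, common] > 0) == (lithology.matrix[, common] == 1))
```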
## Before Questions: important notes

* Lithology and SHERLOC measure the exact same features, and a point in Lithology is 1 exactly when the corresponding SHERLOC value is non-zero. So effectively SHERLOC and Lithology are the same, but SHERLOC provides more detail than Lithology.
  * The extra detail from SHERLOC is not very reliable, since it was derived from text descriptions of each measurement.
* The atmospheric sample is not considered alongside the other samples because it is fundamentally different and would confuse analysis of the other 15 samples.
* Samples 17 and 18 have been released.
* I'm not using SHERLOC for the moment, for simplicity.

## Analysis: Question 1 (Clustering and Campaign)
+Provide in natural language a statement of what question you’re +trying to answer What does clustering reveal about Lithology and +Pixl? Do certain clusters correlate to certain campaigns?
+Provide in natural language a description of the data you are +using for this analysis
+Include a step-by-step description of how you prepare your data +for analysis
+If you’re re-using dataframes prepared in another section, simply +re-state what data you’re using
+Perform elbow test on lithology and pixl to pick # of clusters
+# Include all data processing code (if necessary), clearly commented
+#Do elbow method on each data set preparing for clustering
+wssplot(lithology.matrix, nc=8, seed=14)
+
+#4 clusters
+
+wssplot(pixl.matrix, nc=8, seed=14)
+
+#3 clusters
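As a possible cross-check on reading the elbow off these plots, an average-silhouette comparison could be added here; a sketch, not part of the original analysis, assuming the `cluster` package is available:

```r
# Average silhouette width for k = 2..8 on the PIXL measurements; higher means
# better-separated clusters. The same loop could be run on lithology.matrix.
library(cluster)
sil <- sapply(2:8, function(k) {
  set.seed(14)
  km <- kmeans(pixl.matrix, centers = k)
  mean(silhouette(km$cluster, dist(pixl.matrix))[, "sil_width"])
})
data.frame(k = 2:8, avg_silhouette = sil)
```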
So cluster Lithology into 4 clusters and PIXL into 3.

### Analysis: Methods and results
+Describe in natural language a statement of the analysis you’re +trying to do
Provide clearly commented analysis code; include code for tables and figures!

Perform k-means on Lithology and PIXL, and display the results with tables.
+# Include all analysis code, clearly commented
+#Data is binary, no need for scaling
+lith.kmeans <- kmeans(lithology.matrix, 4)
+#Add cluster # to litho matrix
+lithology.df["Cluster"] <- lith.kmeans[["cluster"]]
+lithology.df[c("Cluster","campaign")]
+## # A tibble: 15 × 2
+## Cluster campaign
+## <int> <fct>
+## 1 1 Crater Floor
+## 2 1 Crater Floor
+## 3 2 Crater Floor
+## 4 2 Crater Floor
+## 5 2 Crater Floor
+## 6 2 Crater Floor
+## 7 4 Crater Floor
+## 8 4 Crater Floor
+## 9 4 Delta Front
+## 10 4 Delta Front
+## 11 3 Delta Front
+## 12 3 Delta Front
+## 13 2 Delta Front
+## 14 2 Delta Front
+## 15 3 Delta Front
+#Litho Results
+table(lithology.df[c("campaign","Cluster")])
+## Cluster
+## campaign 1 2 3 4
+## Crater Floor 2 4 0 2
+## Delta Front 0 2 3 2
+## Margin Unit 0 0 0 0
+#Cluster pixl.scaled
+#pixl.kmeans <- kmeans(pixl.matrix, 4)
+pixl.kmeans <- kmeans(pixl.matrix, 3)
+#Add cluster # to pixl matrix
+pixl.df["Cluster"] <- pixl.kmeans[["cluster"]]
+pixl.df[c("Cluster","campaign")]
+## # A tibble: 15 × 2
+## Cluster campaign
+## <int> <fct>
+## 1 1 Crater Floor
+## 2 2 Crater Floor
+## 3 2 Crater Floor
+## 4 1 Crater Floor
+## 5 2 Crater Floor
+## 6 2 Crater Floor
+## 7 1 Crater Floor
+## 8 1 Crater Floor
+## 9 2 Delta Front
+## 10 2 Delta Front
+## 11 3 Delta Front
+## 12 3 Delta Front
+## 13 2 Delta Front
+## 14 2 Delta Front
+## 15 3 Delta Front
+#Litho Results
+table(pixl.df[c("campaign","Cluster")])
+## Cluster
+## campaign 1 2 3
+## Crater Floor 4 4 0
+## Delta Front 0 4 3
+#Note: I tried using kable, but couldn't find a way for it to display the total counts; it showed a long-format table instead
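Regarding the kable note above: one way to get the totals into a kable is to add margins to the contingency table first. A sketch using base R's addmargins() (knitr is already loaded in the setup chunk):

```r
# Contingency table of campaign vs. cluster with row/column totals,
# rendered with kable instead of the plain table() print-out.
kable(addmargins(table(pixl.df$campaign, pixl.df$Cluster)),
      caption = "PIXL k-means clusters by campaign (with totals)")
```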
+Provide in natural language a clear discussion of your +observations.
+Lithology: Crater Floor contains clusters 1,2, & 4. Delta Front +contains clusters 2,3, & 4.
PIXL: Crater Floor contains clusters 1 & 2. Delta Front contains clusters 2 & 3.
+Across Lithology & Pixl, there are clusters present in Crater +Floor but not in Delta Front!
+Additionally, I will make heat maps to show the distribution of +features across each cluster
+#Heat map for Lithology
+rownames(lith.kmeans$centers) <- c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4")
+pheatmap(lith.kmeans$centers, scale="none", main="Lithology Feature Distribution by Cluster", fontsize = 12)
+
+#Heat map for Pixl
+rownames(pixl.kmeans$centers) <- c("Cluster 1", "Cluster 2", "Cluster 3")
+pheatmap(pixl.kmeans$centers, scale="none", main="Pixl Feature Distribution by Cluster", fontsize =12)
From these heatmaps we can conclude, for Lithology:

* Cluster 1: uniquely high in Amorphous Silicate, Phosphate, Hydrated Ca Sulfate, Plagioclase, and FeTi Oxides
* Cluster 2: uniquely mid-level for Spinels, Zircon, Ilmenite, Chromite, Apatite, and Hydrated Sulfates
* Cluster 3: uniquely high in Kaolinite, Hydrated MgFe Sulfate, FeMg Clay, and Mg Sulfate
* Cluster 4: uniquely high in Other Hydrated Phases and Phyllosilicates

Note that some features are high across multiple clusters, which is significant as well.
+Tying into Campaign, this means Crater Floor samples are uniquely +high in the features described above for cluster 1, and Delta Front is +uniquely high in features described above for cluster 3.
PIXL:

* Cluster 1: uniquely low in Cr2O3
* Cluster 2: high in SO3
* Cluster 3: not much stands out
+Tying into Campaign, this means Crater Floor is uniquely low in Cr2O3 +compared to Delta Front
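Rather than reading the heatmaps by eye, the "uniquely high" features could also be ranked numerically. A hypothetical helper, not in the original notebook, that simply compares each cluster's center to the mean of the other centers:

```r
# For each cluster, list the top features whose center value most exceeds the
# average of the other clusters' centers.
distinguishing <- function(centers, top_n = 5) {
  lapply(seq_len(nrow(centers)), function(i) {
    others <- colMeans(centers[-i, , drop = FALSE])
    sort(centers[i, ] - others, decreasing = TRUE)[1:top_n]
  })
}
distinguishing(lith.kmeans$centers)   # and likewise pixl.kmeans$centers
```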
+Provide in natural language a statement of what question you’re +trying to answer Compare feature distribution across campaigns via +graphs
+Provide in natural language a description of the data you are +using for this analysis Lithology, pixl, dividing by campaign and +plotting feature distribution by campaign
+Include a step-by-step description of how you prepare your data +for analysis
+If you’re re-using dataframes prepared in another section, simply +re-state what data you’re using
+# Include all data processing code (if necessary), clearly commented
+#Start with lithology
+#Group by campaign & remove metadata
+lithology.df.sorted <- lithology.df %>% group_by(campaign) %>% select(-c(sample,name,SampleType,abrasion,Cluster))
+
+#Turn into long form and only keep positive cases
+lithology.df.sorted <- lithology.df.sorted %>% pivot_longer(2:ncol(lithology.df.sorted),names_to = "Feature", values_to="Factor") %>% filter(Factor == 1)
+
+#Count # of identical cases
+lithology.df.sorted <- lithology.df.sorted %>% count(Feature)
+
+#Sort, Crater Floor is High to low & Delta Front is added back in low to high
+lithology.df.sorted <- lithology.df.sorted %>% filter(campaign == "Crater Floor") %>% arrange(desc(n)) %>% ungroup() %>% add_row(lithology.df.sorted %>% filter(campaign == "Delta Front") %>% arrange(n))
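An alternative to the manual arrange()/add_row() construction above could be a single grouped count, with the bar order set afterwards via factor levels; a sketch, assuming the same metadata columns as in the chunk above:

```r
# Count positive cases per campaign and feature in one pass; ordering for the
# plot can then be applied with factor(levels = ...) as in the ggplot below.
lith.counts <- lithology.df %>%
  select(-c(sample, name, SampleType, abrasion, Cluster)) %>%
  pivot_longer(-campaign, names_to = "Feature", values_to = "Present") %>%
  filter(Present == 1) %>%
  count(campaign, Feature)
head(lith.counts)
```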
+Describe in natural language a statement of the analysis you’re +trying to do
+Provide clearly commented analysis code; include code for tables +and figures!
+p <- ggplot(lithology.df.sorted, aes(x=factor(Feature, levels = (Feature %>% unique())), y = n, fill = campaign)) +
+ geom_col(position=position_dodge(preserve="total"), width=0.6) +
+ theme(panel.grid.major.x=element_blank(), axis.text.x = element_text(angle = 60, vjust = 1.0, hjust=1, size = 12)) +
+ labs(x="", y="Count") +
+ ggtitle("Lithology Features Count by Campaign") +
+ scale_fill_paletteer_d(palette = "fishualize::Cephalopholis_argus")
+
+ggplotly(p, tooltip = c("campaign",'x', "n"))
+
+
+#Commented out to knit to pdf, picture at top of report
+#Make box plots
+pixl.lf <- pixl.df %>% select(-c(sample, name, type, location, abrasion, Cluster)) %>% pivot_longer(1:13)
+colnames(pixl.lf)<- c("campaign", "feature", "value")
+ggplot(data = pixl.lf, aes(x=feature, y=value, color = campaign)) +
+ geom_boxplot() +
+ scale_y_log10() +
+ ggtitle("pixl distribution by campaign") +
+ labs(x="", y="log10 scale from percent composition")
+## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
+## Warning: Removed 5 rows containing non-finite outside the scale range
+## (`stat_boxplot()`).
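The two warnings above come from zero percentages hitting the log axis. A sketch of one way to silence them, by dropping the zero measurements before plotting (an additive offset would be another option):

```r
# Same box plot, but with zero measurements removed so log10 is defined.
pixl.lf %>%
  filter(value > 0) %>%
  ggplot(aes(x = feature, y = value, color = campaign)) +
  geom_boxplot() +
  scale_y_log10() +
  ggtitle("PIXL distribution by campaign (zeros removed)") +
  labs(x = "", y = "Percent composition (log10 scale)")
```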
### Discussion of results
+Provide in natural language a clear discussion of your +observations.
Lithology: Certain minerals are abundant in both campaigns, especially Crater Floor.

* Carbonate is common in both campaigns
* Organic Matter is also common in both campaigns
* Sulfate and Olivine are also common in both

High in Crater Floor:

* Pyroxene and amorphous silicate are abundant in Crater Floor but sparse in Delta Front
+Fe_Mg_Clay, Hydrated_Mg_Fe_sulfate, Kaolinite, and Mg_sulfates are in +3 samples in Delta Front, but not at all in Crater Floor.
+There are 20 minerals that are exclusively in either Crater Floor or +Delta Front.
+4 minerals have a count of zero, meaning they weren’t detected in any +campaign (Perchlorates, Na_Perchlorate, Hydrated_Carbonates, & +Hydrated_Iron_Oxide). These minerals are present in the atmospheric +sample, which is absent in this analysis.
The PIXL graph reveals some big differences between Crater Floor and Delta Front, namely in Al2O3, CaO, Cr2O3, MgO, P2O5, SO3, and SiO2.
During our presentation, Dr. Rogers noted that a predictor for Organic Matter would be very valuable, and also concluded that Delta Front has some igneous components, which contradicts the recorded rock type of the Delta Front samples (all listed as sedimentary).
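On the organic-matter predictor idea: a very rough sketch of what a first pass might look like, not something fitted in this notebook. It assumes the Lithology column is named Organic_Matter, that the PIXL columns include SiO2 and SO3 (as in the box plot above), and that both tables describe the same 15 samples in the same order; with so few samples this is illustrative only, not a validated model.

```r
# Hypothetical first pass: logistic regression of the Organic_Matter flag on
# two PIXL oxide percentages. Far too few samples for a real model.
om.df <- data.frame(organic = lithology.matrix[, "Organic_Matter"],
                    pixl.matrix[, c("SiO2", "SO3")])
om.fit <- glm(organic ~ SiO2 + SO3, data = om.df, family = binomial)
summary(om.fit)
```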
+Provide in natural language a statement of what question you’re +trying to answer
+The data in pixl is represented by percentages. Is log scaling pixl +better for clustering and PCA?
+Provide in natural language a description of the data you are +using for this analysis
+Include a step-by-step description of how you prepare your data +for analysis
+If you’re re-using dataframes prepared in another section, +re-state what data you’re using
+# Include all data processing code (if necessary), clearly commented
+#First replace 0.0 entries with 0.00001 so they don't scale to inf
+pixl.matrix[pixl.matrix == 0] <- 0.00001
+#Apply log10 to every entry in pixl.matrix & get new scaled df
+pixl.scaled <- log10(pixl.matrix)
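Two common alternatives to the 0.00001-offset-plus-log10 approach could be compared here as well; a sketch of the "other scaling options" mentioned in the summary at the end, not used in the analysis below:

```r
# Alternative scalings of the PIXL percentages:
pixl.log1p  <- log10(pixl.matrix + 1)  # +1 offset avoids -Inf without a magic constant
pixl.zscore <- scale(pixl.matrix)      # column-wise z-scores (mean 0, sd 1)
```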
+Describe in natural language a statement of the analysis you’re +trying to do
+First, how does clustering differ between pixl.matrix and +pixl.scaled?
+Provide clearly commented analysis code; include code for tables +and figures!
+# Include all analysis code, clearly commented
+# If not possible, screen shots are acceptable.
+# If your contributions included things that are not done in an R-notebook,
+# (e.g. researching, writing, and coding in Python), you still need to do
+# this status notebook in R. Describe what you did here and put any products
+# that you created in github. If you are writing online documents (e.g. overleaf
+# or google docs), you can include links to the documents in this notebook
+# instead of actual text.
+#Create an elbow plot for both pixl.matrix & pixl.scaled
+wssplot(pixl.matrix, nc=8, seed=14, 'Unscaled')
+
+wssplot(pixl.scaled, nc=8, seed=14, "Scaled")
+
+#Do kmeans for both matrices
+unscaled.kmeans <- kmeans(pixl.matrix, 3)
+scaled.kmeans <- kmeans(pixl.scaled, 3)
+
+#Produce heatmaps for both
+pheatmap(unscaled.kmeans$centers, scale="none", main="Unscaled Pixl")
+
+pheatmap(scaled.kmeans$centers, scale="none", main="Scaled Pixl")
+
+#Do pca for both matrices
+unscaled.pca <- prcomp(pixl.matrix)
+scaled.pca <- prcomp(pixl.scaled)
+
+#Make biplots
+unscaled.plot <- ggbiplot::ggbiplot(unscaled.pca,
+ labels = pixl.df$type,
+ groups = as.factor(unscaled.kmeans$cluster)) +
+ ggtitle("Unscaled Pixl")
+
+scaled.plot <- ggbiplot::ggbiplot(scaled.pca,
+ labels = pixl.df$type,
+ groups = as.factor(scaled.kmeans$cluster)) +
+ ggtitle("Scaled Pixl")
+
+ggplotly(unscaled.plot)
+
+
+ggplotly(scaled.plot)
+
+
+pheatmap(pixl.scaled, scale="none")
+
Provide in natural language a clear discussion of your observations.

Both elbow plots suggest 3 clusters as the best choice; however, the "quality" value (total within-cluster sum of squares) for the unscaled data is much higher than for the scaled data. Update: this quantity is only meaningful relative to the same data, not across different scalings, so that comparison is not informative.
Looking at the two biplots, the most influential features are completely different. In the unscaled biplot the samples appear more spread out and the feature loadings more balanced than in the scaled biplot.

Based on these observations, my suggestion is not to cluster using a log10-scaled PIXL matrix.
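To make "how different are the two clusterings" concrete, a cross-tabulation of the assignments could be added; a small sketch (cluster labels are arbitrary, so read this as agreement structure rather than matching numbers):

```r
# How the unscaled and log10-scaled k-means assignments line up, sample by sample.
table(unscaled = unscaled.kmeans$cluster, scaled = scaled.kmeans$cluster)
```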
+Provide in natural language a clear summary and your proposed +next steps.
I scaled a copy of the PIXL matrix and then compared the two through a series of analyses. My conclusion is that the scaled copy is not as good for clustering and PCA. Next steps involve looking at other scaling options, including scale() and the log-scale function I wrote. We concluded that PIXL should not be scaled.

Another open direction is the potential organic-matter predictor.

I will continue exploring the differences between campaigns and implementing these features in the 2D app.
+