Skip to content
Permalink
dc39e228be
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Go to file
 
 
Cannot retrieve contributors at this time
551 lines (400 sloc) 20.2 KB
---
title: "DAR F24 Assignment 5 Notebook"
author: "Ashton Compton"
date: "`r Sys.Date()`"
output:
pdf_document:
toc: yes
html_document:
toc: yes
subtitle: "DAR Project Name: Mars"
---
![Dr. Roger's Mars Question on 9-4-24](../../Resources/MarsQuestions2024-09-04.png)
These are Dr. Bennett's Mars questions from Day 1 lecture plus so more
1) Whatever Dr. Roger’s wants.
2) Develop new analyses/visualizations with an emphasis on integrative analysis of the different types of data (e.g. PIXL, SHERLOC, SuperCam/LIBS, and Lithology)
3) Incorporate these analysis into improved Campfire demo. What would Dr. Rogers consider to be an improved Campfire demo?
4) Develop standalone "2D" app incorporating Campfire and enhanced analysis/visualization capabilities.
5) Understand LIBS and develop an insightful standalone Libs analysis.
6) Develop a deeper understanding of each dataset by looking at NASA sources and published papers. Are we missing data on campaigns etc? Are we correctly integrating data? How does LIBS data correspond to the 16 Samples?
Please put valuable resources (like websites and papers) on github https://github.rpi.edu/DataINCITE/DAR-Mars-F24/wiki Dr. Bennett has started this process by creating a wiki on DAR-MARS-F24 with short description and links to resources. Files (like letures are in DAR-Mars-F24/Resources. You can add more to the wiki and add files to the the Resources directory on github. You can edit the wiki too.
## BiWeekly Work Summary
**NOTE:** Follow an outline format; use bullets to express individual points.
* RCS ID: compta
* Project Name: Mars, DAR 2024
* Summary of work since last week
* Describe the important aspects of what you worked on and accomplished
Created bar plot showing feature count of each feature in the Lithology dataframe, grouped by campaign.
Reveals interesting differences between the two campaigns.
Beginning work on scaling Pixl, going to make similar chart as the one described above but for pixl.
* Summary of github commits
* include branch name(s)
* include browsable links to all external files on github
* Include links to shared Shiny apps
Commiting an html and rmd knit
* List of presentations, papers, or other outputs
* Include browsable links
* List of references (if necessary)
* Indicate any use of group shared code base
* Indicate which parts of your described work were done by you or as part of joint efforts
* **Required:** Provide illustrating figures and/or tables
![Lithology Feature Count by Campaign](../../StudentNotebooks/Assignment04/LithologyFeatCountbyCampaign.png)
![Lithology Cluster Distribution ](../../StudentNotebooks/Assignment05/lithologyFeatureDistributionHeatmap.png)
![pixlDistributionbyCampaign](../../StudentNotebooks/Assignment05/pixlDistributionbyCampaign.png)
## Personal Contribution
* Clearly defined, unique contribution(s) done by you: code, ideas, writing...
* Include github issues you've addressed if any
Load libaries
Set up dataframes/matrices
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# Set the default CRAN repository
local({r <- getOption("repos")
r["CRAN"] <- "http://cran.r-project.org"
options(repos=r)
})
if (!require("pandoc")) {
install.packages("pandoc")
library(pandoc)
}
# Required packages for M20 LIBS analysis
if (!require("rmarkdown")) {
install.packages("rmarkdown")
library(rmarkdown)
}
if (!require("tidyverse")) {
install.packages("tidyverse")
library(tidyverse)
}
if (!require("stringr")) {
install.packages("stringr")
library(stringr)
}
if (!require("ggbiplot")) {
install.packages("ggbiplot")
library(ggbiplot)
}
if (!require("pheatmap")) {
install.packages("pheatmap")
library(pheatmap)
}
if (!require("knitr")) {
install.packages("knitr")
library(knitr)
}
if (!require("paletteer")) {
install.packages("paletteer")
library(paletteer)
}
if (!require("plotly")) {
install.packages("plotly")
library(plotly)
}
if (!require("GGally")) {
install.packages("GGally")
library(GGally)
}
```
```{R}
#Load in data
###
# Load the saved lithology data with locations added
lithology.df<- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/mineral_data_static.Rds")
# Cast samples as numbers
lithology.df$sample <- as.numeric(lithology.df$sample)
# Convert rest into factors
lithology.df[sapply(lithology.df, is.character)] <-
lapply(lithology.df[sapply(lithology.df, is.character)],
as.factor)
# Keep only first 16 samples because the data for the rest of the samples is not available yet
#Also i'm getting rid of the atmospheric sample for now
lithology.df<-lithology.df[2:16,]
# Create a matrix containing only the numeric measurements. The remaining features are metadata about the sample.
lithology.matrix <- sapply(lithology.df[,6:40],as.numeric)-1
###
# Load the saved PIXL data with locations added
pixl.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/samples_pixl_wide.Rds")
# Convert to factors
pixl.df[sapply(pixl.df, is.character)] <- lapply(pixl.df[sapply(pixl.df, is.character)],
as.factor)
#Get rid of atmospheric sample
pixl.df <- pixl.df[2:16,]
# Make the matrix of just mineral percentage measurements
pixl.matrix <- pixl.df[,2:14]
###
# Load the saved LIBS data with locations added
libs.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/supercam_libs_moc_loc.Rds")
#Drop features that are not to be used in the analysis for this notebook
libs.df <- libs.df %>%
select(!(c(distance_mm,Tot.Em.,SiO2_stdev,TiO2_stdev,Al2O3_stdev,FeOT_stdev,
MgO_stdev,Na2O_stdev,CaO_stdev,K2O_stdev,Total)))
# Convert the points to numeric
libs.df$point <- as.numeric(libs.df$point)
# Make the a matrix contain only the libs measurements for each mineral
libs.matrix <- as.matrix(libs.df[,6:13])
###
# Read in data as provided.
sherloc_abrasion_raw <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/abrasions_sherloc_samples.Rds")
# Clean up data types
sherloc_abrasion_raw$Mineral<-as.factor(sherloc_abrasion_raw$Mineral)
sherloc_abrasion_raw[sapply(sherloc_abrasion_raw, is.character)] <- lapply(sherloc_abrasion_raw[sapply(sherloc_abrasion_raw, is.character)],
as.numeric)
# Transform NA's to 0
sherloc_abrasion_raw <- sherloc_abrasion_raw %>% replace(is.na(.), 0)
# Reformat data so that rows are "abrasions" and columns list the presence of minerals.
# Do this by "pivoting" to a long format, and then back to the desired wide format.
sherloc_long <- sherloc_abrasion_raw %>%
pivot_longer(!Mineral, names_to = "Name", values_to = "Presence")
# Make abrasion a factor
sherloc_long$Name <- as.factor(sherloc_long$Name)
# Make it a matrix
sherloc.matrix <- sherloc_long %>%
pivot_wider(names_from = Mineral, values_from = Presence)
#Remove atmospheric sample
sherloc.matrix <- sherloc.matrix[2:16,]
# Get sample information from PIXL and add to measurements -- assumes order is the same
sherloc.df <- cbind(pixl.df[,c("sample","type","campaign","abrasion")],sherloc.matrix)
# Measurements are everything except first column
sherloc.matrix<-as.matrix(sherloc.matrix[,-1])
###
#Add in wss plot for elbow method clustering
wssplot <- function(data, nc = 15, seed =10, title="Quality of k-means by Cluster") {
wss <- data.frame(cluster=1:nc, quality=c(0))
for (i in 1:nc){
set.seed(seed)
wss[i,2] <- kmeans(data, centers=i)$tot.withinss}
ggplot(data=wss,aes(x=cluster,y=quality)) +
geom_line() +
ggtitle(title)
}
#Make a scaled version of pixl using a log scale
#Make log scale function
#Applies log function to each column in given table, the z score of the column is the base of the log function applied to the column
#Not going to be used, for the moment
# logScale <- function(frame) {
# #Try converting frame to matrix
# try(frame <- as.matrix(frame), TRUE)
# #Center and absolute value frame
# frame <- frame %>% scale(center=TRUE,scale=FALSE) %>% abs()
# #Prepare data frame to take scaled columns of frame
# #Scaling goes through each column in frame, finds z score of each column and applies log base z to each respective column
# frame.scaled <- data.frame()
# for (i in 1:ncol(frame)) {
# frame.scaled[1:nrow(frame),colnames(frame)[i]] <- log(x=frame[,i],base = sd(frame[,i]))
# }
# #Produce scaled frame
# frame.scaled
# }
#Just do log10 on a matrix
seed <- 14
set.seed
```
##Before Questions, important notes
-Lithology and Sherloc measure the exact same features, and a point in lithology is 1 if the same point in sherloc is non zero. So effectively, sherloc and lithology are the same, but sherloc provides more detail than lithology.
-The extra detail from sherloc is not very reliable, since it was derived from text descriptions of each measurement
-The atmospheric sample is not being regarded alongside the other samples because it is fundamentally different and will confuse analysis of the other 15 samples
-Samples 17 and 18 have been released
-I'm not using Sherloc for simplicity for the moment
## Analysis: Question 1 (Clustering and Campaign)
### Question being asked
_Provide in natural language a statement of what question you're trying to answer_
What does clustering reveal about Lithology and Pixl? Do certain clusters correlate to certain campaigns?
### Data Preparation
_Provide in natural language a description of the data you are using for this analysis_
_Include a step-by-step description of how you prepare your data for analysis_
_If you're re-using dataframes prepared in another section, simply re-state what data you're using_
Perform elbow test on lithology and pixl to pick # of clusters
```{r, result01_data}
# Include all data processing code (if necessary), clearly commented
#Do elbow method on each data set preparing for clustering
wssplot(lithology.matrix, nc=8, seed=14)
#4 clusters
wssplot(pixl.matrix, nc=8, seed=14)
#3 clusters
```
So cluster Lithology to 4 clusters and pixl to 3
### Analysis: Methods and results
_Describe in natural language a statement of the analysis you're trying to do_
_Provide clearly commented analysis code; include code for tables and figures!_
Perform kmeans on lithology and pixl, display results with table
```{r, result01_analysis}
# Include all analysis code, clearly commented
#Data is binary, no need for scaling
lith.kmeans <- kmeans(lithology.matrix, 4)
#Add cluster # to litho matrix
lithology.df["Cluster"] <- lith.kmeans[["cluster"]]
lithology.df[c("Cluster","campaign")]
#Litho Results
table(lithology.df[c("campaign","Cluster")])
#Cluster pixl.scaled
#pixl.kmeans <- kmeans(pixl.matrix, 4)
pixl.kmeans <- kmeans(pixl.matrix, 3)
#Add cluster # to pixl matrix
pixl.df["Cluster"] <- pixl.kmeans[["cluster"]]
pixl.df[c("Cluster","campaign")]
#Litho Results
table(pixl.df[c("campaign","Cluster")])
#Note I tried using kable, however couldn't find a way for it to display the total counts, instead it showed a longformat table
```
### Discussion of results
_Provide in natural language a clear discussion of your observations._
Lithology:
Crater Floor contains clusters 1,2, & 4.
Delta Front contains clusters 2,3, & 4.
Pixl.scaled:
Crater Floor contains clusters 1, 2, & 3.
Delta Front contains clusters 2 & 3
Across Lithology & Pixl, there are clusters present in Crater Floor but not in Delta Front!
Additionally, I will make heat maps to show the distribution of features across each cluster
```{R}
#Heat map for Lithology
rownames(lith.kmeans$centers) <- c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4")
pheatmap(lith.kmeans$centers, scale="none", main="Lithology Feature Distribution by Cluster", fontsize = 12)
#Heat map for Pixl
rownames(pixl.kmeans$centers) <- c("Cluster 1", "Cluster 2", "Cluster 3")
pheatmap(pixl.kmeans$centers, scale="none", main="Pixl Feature Distribution by Cluster", fontsize =12)
```
From these we can conclude
Lithology:
Cluster 1
-Uniquely high in Amorphous Silicate, Phosphate, Hydrated Ca Sulfate, Plagioclase, and FeTi Oxides
Cluster 2
-Uniquely midlevel for Spinels, Zircon, Ilmenite, Chromite, apatite, and Hydrated Sulfates
Cluster 3
-Uniquely high in Kaolinite, Hydrated MgFe Sulfate, FeMg Clay, and Mg Sulfate
Cluster 4
-Uniquely high in Other Hydrated Phases & Phyllosilicates
Note some features are high across multiple clusters, which is significant as well
Tying into Campaign, this means Crater Floor samples are uniquely high in the features described above for cluster 1,
and Delta Front is uniquely high in features described above for cluster 3.
Pixl:
Cluster 1
-Uniquely low in Cr2O3
Cluster 2
-High in SO3
Cluster 3
-Not much stands out
Tying into Campaign, this means Crater Floor is uniquely low in Cr2O3 compared to Delta Front
## Analysis: Question 2 (Provide short name)
### Question being asked
_Provide in natural language a statement of what question you're trying to answer_
Compare feature distribution across campaigns via graphs
### Data Preparation
_Provide in natural language a description of the data you are using for this analysis_
Lithology, pixl, dividing by campaign and plotting feature distribution by campaign
_Include a step-by-step description of how you prepare your data for analysis_
_If you're re-using dataframes prepared in another section, simply re-state what data you're using_
```{r, result02_data}
# Include all data processing code (if necessary), clearly commented
#Start with lithology
#Group by campaign & remove metadata
lithology.df.sorted <- lithology.df %>% group_by(campaign) %>% select(-c(sample,name,SampleType,abrasion,Cluster))
#Turn into long form and only keep positive cases
lithology.df.sorted <- lithology.df.sorted %>% pivot_longer(2:ncol(lithology.df.sorted),names_to = "Feature", values_to="Factor") %>% filter(Factor == 1)
#Count # of identical cases
lithology.df.sorted <- lithology.df.sorted %>% count(Feature)
#Sort, Crater Floor is High to low & Delta Front is added back in low to high
lithology.df.sorted <- lithology.df.sorted %>% filter(campaign == "Crater Floor") %>% arrange(desc(n)) %>% ungroup() %>% add_row(lithology.df.sorted %>% filter(campaign == "Delta Front") %>% arrange(n))
```
### Analysis: Methods and Results
_Describe in natural language a statement of the analysis you're trying to do_
_Provide clearly commented analysis code; include code for tables and figures!_
```{r, result02_analysis}
p <- ggplot(lithology.df.sorted, aes(x=factor(Feature, levels = (Feature %>% unique())), y = n, fill = campaign)) +
geom_col(position=position_dodge(preserve="total"), width=0.6) +
theme(panel.grid.major.x=element_blank(), axis.text.x = element_text(angle = 60, vjust = 1.0, hjust=1, size = 12)) +
labs(x="", y="Count") +
ggtitle("Lithology Features Count by Campaign") +
scale_fill_paletteer_d(palette = "fishualize::Cephalopholis_argus")
#ggplotly(p, tooltip = c("campaign",'x', "n"))
#Commented out to knit to pdf, picture at top of report
```
```{R}
#Make box plots
pixl.lf <- pixl.df %>% select(-c(sample, name, type, location, abrasion, Cluster)) %>% pivot_longer(1:13)
colnames(pixl.lf)<- c("campaign", "feature", "value")
ggplot(data = pixl.lf, aes(x=feature, y=value, color = campaign)) +
geom_boxplot() +
scale_y_log10() +
ggtitle("pixl distribution by campaign") +
labs(x="", y="log10 scale from percent composition")
```
### Discussion of results
_Provide in natural language a clear discussion of your observations._
Lithology:
Certain minerals are abundant in both campaigns, especially Crater Floor.
-Carbonate is common in both campaigns
-Organic Matter is also common in both campaigns
-Sulfate and Olivine are also common in both
High in Crater Floor:
-Pyroxene and amorphous silicate are abundant in Crater Floor but sparse in Delta Front
Fe_Mg_Clay, Hydrated_Mg_Fe_sulfate, Kaolinite, and Mg_sulfates are in 3 samples in Delta Front, but not at all in Crater Floor.
There are 20 minerals that are exclusively in either Crater Floor or Delta Front.
4 minerals have a count of zero, meaning they weren't detected in any campaign (Perchlorates, Na_Perchlorate, Hydrated_Carbonates, & Hydrated_Iron_Oxide). These minerals are present in the atmospheric sample, which is absent in this analysis.
The pixl graph reveals some big differences between Crater Floor and Delta Front. Namely in Al2O3, CaO, Cr2O3, MgO, P2O5, SO3, & SiO2.
During our presentation, Dr Roger noted that a predictor for Organic Matter would be very valuable, and also concluded Delta Front has some igneous components to it, contradicting the rock type on all Delta Front samples which says they are sedimentary.
## Analysis: Question 3 (Provide short name)
### Question being asked
_Provide in natural language a statement of what question you're trying to answer_
The data in pixl is represented by percentages. Is log scaling pixl better for clustering and PCA?
### Data Preparation
_Provide in natural language a description of the data you are using for this analysis_
_Include a step-by-step description of how you prepare your data for analysis_
_If you're re-using dataframes prepared in another section, re-state what data you're using_
```{r, result03_data}
# Include all data processing code (if necessary), clearly commented
#First replace 0.0 entries with 0.00001 so they don't scale to inf
pixl.matrix[pixl.matrix == 0] <- 0.00001
#Apply log10 to every entry in pixl.matrix & get new scaled df
pixl.scaled <- log10(pixl.matrix)
```
### Analysis methods used
_Describe in natural language a statement of the analysis you're trying to do_
First, how does clustering differ between pixl.matrix and pixl.scaled?
_Provide clearly commented analysis code; include code for tables and figures!_
```{r, result03_analysis}
# Include all analysis code, clearly commented
# If not possible, screen shots are acceptable.
# If your contributions included things that are not done in an R-notebook,
# (e.g. researching, writing, and coding in Python), you still need to do
# this status notebook in R. Describe what you did here and put any products
# that you created in github. If you are writing online documents (e.g. overleaf
# or google docs), you can include links to the documents in this notebook
# instead of actual text.
#Create an elbow plot for both pixl.matrix & pixl.scaled
wssplot(pixl.matrix, nc=8, seed=14, 'Unscaled')
wssplot(pixl.scaled, nc=8, seed=14, "Scaled")
#Do kmeans for both matrices
unscaled.kmeans <- kmeans(pixl.matrix, 3)
scaled.kmeans <- kmeans(pixl.scaled, 3)
#Produce heatmaps for both
pheatmap(unscaled.kmeans$centers, scale="none", main="Unscaled Pixl")
pheatmap(scaled.kmeans$centers, scale="none", main="Scaled Pixl")
#Do pca for both matrices
unscaled.pca <- prcomp(pixl.matrix)
scaled.pca <- prcomp(pixl.scaled)
#Make biplots
unscaled.plot <- ggbiplot::ggbiplot(unscaled.pca,
labels = pixl.df$type,
groups = as.factor(unscaled.kmeans$cluster)) +
ggtitle("Unscaled Pixl")
scaled.plot <- ggbiplot::ggbiplot(scaled.pca,
labels = pixl.df$type,
groups = as.factor(scaled.kmeans$cluster)) +
ggtitle("Scaled Pixl")
#ggplotly(unscaled.plot)
#ggplotly(scaled.plot)
pheatmap(pixl.scaled, scale="none")
```
### Discussion of results
_Provide in natural language a clear discussion of your observations._
Both elbow plots suggest 3 clusters as the best choice, however the "quality" value for the unscaled data is much higher than with the scaled data. - update: quality matters relatively, not absolutely. Thus this point is unimportant
Looking at the two biplots, the most influential features are totally different. For unscaled, the samples appear more spread out and the features appear more balanced than for the scaled biplot.
My suggestion is to not cluster using a log10 scaled pixl matrix from the above observations.
## Summary and next steps
_Provide in natural language a clear summary and your proposed next steps._
I scaled a copy of the pixl matrix, and then compared the two through a series of analysis. My conclusion is the scaled copy is not as good for clustering and PCA.
Next steps involve looking at other solutions for scaling, including scale() and the logscale function I made.
We concluded pixl should not be scaled.
Potential organic matter predictor.
I will continue exploring the differences between campaigns and implementing these features into the 2d app.