diff --git a/AppRelated/CampaignQuestionFolder/CampaignQuestionNotebookAshton.Rmd b/AppRelated/CampaignQuestionFolder/CampaignQuestionNotebookAshton.Rmd index 19fecc6..379a6e0 100644 --- a/AppRelated/CampaignQuestionFolder/CampaignQuestionNotebookAshton.Rmd +++ b/AppRelated/CampaignQuestionFolder/CampaignQuestionNotebookAshton.Rmd @@ -16,23 +16,24 @@ Question - "What is the difference between samples from the two campaigns Delta Every sample from the Perserverance Rover is assigned a campaign. In this analysis, the two campaigns were Delta Front and Crater Floor. Title: Samples on map highlighted by campaign -![Mars Map Draft](marsmap.png) +![Mars Map](~/DAR-Mars-F24/AppRelated/CampaignQuestionFolder/Pics/marsSamples.png) +Description: A map of the Martian surface showing each sample site colored by campaign. Samples colored orange are part of the Crater Floor campaign, while samples colored blue are part of the Delta Front campaign. Title: Feature Count in Sherloc seperated by Campaign -![Sherloc Count](sherlocCount.png) +![Sherloc Count](~/DAR-Mars-F24/AppRelated/CampaignQuestionFolder/Pics/sherlocCount.png) Description: Sherloc was grouped by campaign and the total value of each feature across all samples was summed up and displayed in the above plot. If a feature had a value of zero for every sample, it was not included in the plot. Title: Sherloc Feature Distribution by Campaign -![Sherloc Boxplot](sherlocBox.png) +![Sherloc Boxplot](~/DAR-Mars-F24/AppRelated/CampaignQuestionFolder/Pics/sherlocBox.png) Description: Sherloc Features distribution shown with a series a boxplots, seperated by campaign. Title: Pixl Feature Distribution by Campaign -![pixl distribution](pixlDistributionbyCampaign.png) +![pixl distribution](~/DAR-Mars-F24/AppRelated/CampaignQuestionFolder/Pics/pixlDistributionbyCampaign.png) Description: Box plots showing the distribution of each feature in Pixl, seperated by Campaign. 
Title: Libs clustering with Pixl data seperated by campaign on Ternairy Diagram -![LibsandPixl](LibsandPixlTern.png) -Description: Libs data was clustered and ternairy plotted along with Pixl data seperated by campaign. The Pixl data points are labelled by name as well. +![LibsandPixl](~/DAR-Mars-F24/AppRelated/CampaignQuestionFolder/Pics/LibsandPixlTern.png) +Description: Libs data was clustered and ternairy plotted along with Pixl data separated by campaign. The Pixl data points are labelled by name as well. CODE: @@ -103,13 +104,17 @@ if(!require("ggtern")){ } ``` - +Prepare data ```{R} #Load in data ### # Load the saved lithology data with locations added lithology.df<- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/mineral_data_static.Rds") + #lithology.df<- readRDS("~/DAR-Mars-F24/StudentData/v1_lithology.Rds") +pixl_pos.df<- readRDS("~/DAR-Mars-F24/StudentData/pixl_sol_coordinates.Rds") +#Remove atmospheric sample +pixl_pos.df <- pixl_pos.df[2:16,] # Cast samples as numbers lithology.df$sample <- as.numeric(lithology.df$sample) @@ -200,163 +205,23 @@ wssplot <- function(data, nc = 15, seed =10, title="Quality of k-means by Cluste ggtitle(title) } -#Make a scaled version of pixl using a log scale -#Make log scale function -#Applies log function to each column in given table, the z score of the column is the base of the log function applied to the column -#Not going to be used, for the moment -# logScale <- function(frame) { -# #Try converting frame to matrix -# try(frame <- as.matrix(frame), TRUE) -# #Center and absolute value frame -# frame <- frame %>% scale(center=TRUE,scale=FALSE) %>% abs() -# #Prepare data frame to take scaled columns of frame -# #Scaling goes through each column in frame, finds z score of each column and applies log base z to each respective column -# frame.scaled <- data.frame() -# for (i in 1:ncol(frame)) { -# frame.scaled[1:nrow(frame),colnames(frame)[i]] <- log(x=frame[,i],base = sd(frame[,i])) -# } -# #Produce scaled frame -# 
frame.scaled -# } -#Just do log10 on a matrix - seed <- 14 set.seed(seed) ``` - -##Before Questions, important notes --Lithology and Sherloc measure the exact same features, and a point in lithology is 1 if the same point in sherloc is non zero. So effectively, sherloc and lithology are the same, but sherloc provides more detail than lithology. - -The extra detail from sherloc is not very reliable, since it was derived from text descriptions of each measurement --The atmospheric sample is not being regarded alongside the other samples because it is fundamentally different and will confuse analysis of the other 15 samples --Samples 17 and 18 have been released --I'm not using Sherloc for simplicity for the moment -## Analysis: Question 1 (Clustering and Campaign) - -### Question being asked - -_Provide in natural language a statement of what question you're trying to answer_ -What does clustering reveal about Lithology and Pixl? Do certain clusters correlate to certain campaigns? - -### Data Preparation - -_Provide in natural language a description of the data you are using for this analysis_ - -_Include a step-by-step description of how you prepare your data for analysis_ - -_If you're re-using dataframes prepared in another section, simply re-state what data you're using_ - -Perform elbow test on lithology and pixl to pick # of clusters - -```{r, result01_data} -# Include all data processing code (if necessary), clearly commented -#Do elbow method on each data set preparing for clustering -wssplot(lithology.matrix, nc=8, seed=14) -#4 clusters - -wssplot(pixl.matrix, nc=8, seed=14) -#3 clusters -``` - -So cluster Lithology to 4 clusters and pixl to 3 -### Analysis: Methods and results - -_Describe in natural language a statement of the analysis you're trying to do_ - -_Provide clearly commented analysis code; include code for tables and figures!_ -Perform kmeans on lithology and pixl, display results with table - -```{r, result01_analysis} -# Include all analysis code, 
clearly commented -#Data is binary, no need for scaling -lith.kmeans <- kmeans(lithology.matrix, 4) -#Add cluster # to litho matrix -lithology.df["Cluster"] <- lith.kmeans[["cluster"]] -lithology.df[c("Cluster","campaign")] -#Litho Results -table(lithology.df[c("campaign","Cluster")]) - -#Cluster pixl.scaled -#pixl.kmeans <- kmeans(pixl.matrix, 4) -pixl.kmeans <- kmeans(pixl.matrix, 3) -#Add cluster # to pixl matrix -pixl.df["Cluster"] <- pixl.kmeans[["cluster"]] -pixl.df[c("Cluster","campaign")] -#Litho Results -table(pixl.df[c("campaign","Cluster")]) -#Note I tried using kable, however couldn't find a way for it to display the total counts, instead it showed a longformat table -``` - -### Discussion of results - -_Provide in natural language a clear discussion of your observations._ - -Lithology: -Crater Floor contains clusters 1,2, & 4. -Delta Front contains clusters 2,3, & 4. - -Pixl.scaled: -Crater Floor contains clusters 1, 2, & 3. -Delta Front contains clusters 2 & 3 - -Across Lithology & Pixl, there are clusters present in Crater Floor but not in Delta Front! 
- -Additionally, I will make heat maps to show the distribution of features across each cluster - -```{R} -#Heat map for Lithology -rownames(lith.kmeans$centers) <- c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4") -pheatmap(lith.kmeans$centers, scale="none", main="Lithology Feature Distribution by Cluster", fontsize = 12) - -#Heat map for Pixl -rownames(pixl.kmeans$centers) <- c("Cluster 1", "Cluster 2", "Cluster 3") -pheatmap(pixl.kmeans$centers, scale="none", main="Pixl Feature Distribution by Cluster", fontsize =12) +Used as base for map picture +```{r} +#Produce sample map with campaign differences +pixl_pos.df$campaign <- pixl.df$campaign +ggplot(pixl_pos.df, aes(x= Lat, y= Long, color=campaign, label=sample, size=1)) + + geom_point() + + theme_classic() ``` -From these we can conclude -Lithology: -Cluster 1 - -Uniquely high in Amorphous Silicate, Phosphate, Hydrated Ca Sulfate, Plagioclase, and FeTi Oxides -Cluster 2 - -Uniquely midlevel for Spinels, Zircon, Ilmenite, Chromite, apatite, and Hydrated Sulfates -Cluster 3 - -Uniquely high in Kaolinite, Hydrated MgFe Sulfate, FeMg Clay, and Mg Sulfate -Cluster 4 - -Uniquely high in Other Hydrated Phases & Phyllosilicates -Note some features are high across multiple clusters, which is significant as well - -Tying into Campaign, this means Crater Floor samples are uniquely high in the features described above for cluster 1, - and Delta Front is uniquely high in features described above for cluster 3. 
- -Pixl: -Cluster 1 - -Uniquely low in Cr2O3 -Cluster 2 - -High in SO3 -Cluster 3 - -Not much stands out - -Tying into Campaign, this means Crater Floor is uniquely low in Cr2O3 compared to Delta Front - -## Analysis: Question 2 (Provide short name) - -### Question being asked - -_Provide in natural language a statement of what question you're trying to answer_ -Compare feature distribution across campaigns via graphs - -### Data Preparation - -_Provide in natural language a description of the data you are using for this analysis_ -Lithology, pixl, dividing by campaign and plotting feature distribution by campaign - -_Include a step-by-step description of how you prepare your data for analysis_ - -_If you're re-using dataframes prepared in another section, simply re-state what data you're using_ - +Make interactive plotly plot for Lithology ```{r, result02_data} # Include all data processing code (if necessary), clearly commented #Start with lithology #Group by campaign & remove metadata -lithology.df.sorted <- lithology.df %>% group_by(campaign) %>% select(-c(sample,name,SampleType,abrasion,Cluster)) +lithology.df.sorted <- lithology.df %>% group_by(campaign) %>% select(-c(sample,name,SampleType,abrasion)) #Turn into long form and only keep positive cases lithology.df.sorted <- lithology.df.sorted %>% pivot_longer(2:ncol(lithology.df.sorted),names_to = "Feature", values_to="Factor") %>% filter(Factor == 1) @@ -366,15 +231,9 @@ lithology.df.sorted <- lithology.df.sorted %>% count(Feature) #Sort, Crater Floor is High to low & Delta Front is added back in low to high lithology.df.sorted <- lithology.df.sorted %>% filter(campaign == "Crater Floor") %>% arrange(desc(n)) %>% ungroup() %>% add_row(lithology.df.sorted %>% filter(campaign == "Delta Front") %>% arrange(n)) -``` - -### Analysis: Methods and Results - -_Describe in natural language a statement of the analysis you're trying to do_ -_Provide clearly commented analysis code; include code for tables and figures!_ 
-```{r, result02_analysis} +#Make interactive plot for lithology p <- ggplot(lithology.df.sorted, aes(x=factor(Feature, levels = (Feature %>% unique())), y = n, fill = campaign)) + geom_col(position=position_dodge(preserve="total"), width=0.6) + theme(panel.grid.major.x=element_blank(), axis.text.x = element_text(angle = 60, vjust = 1.0, hjust=1, size = 12)) + @@ -385,6 +244,7 @@ p <- ggplot(lithology.df.sorted, aes(x=factor(Feature, levels = (Feature %>% uni ggplotly(p, tooltip = c("campaign",'x', "n")) #Commented out to knit to pdf, picture at top of report ``` +Create similar plot for Sherloc ```{r} #Repeat for sherloc #Group by campaign & remove metadata @@ -409,10 +269,10 @@ p <- ggplot(sherloc.df.sorted, aes(x=factor(Feature, levels = (Feature %>% uniqu ggplotly(p, tooltip = c("campaign",'x', "n")) #Commented out to knit to pdf, picture at top of report ``` - +Make box plots for pixl and sherloc ```{R} #Make box plots -pixl.lf <- pixl.df %>% select(-c(sample, name, type, location, abrasion, Cluster)) %>% pivot_longer(1:13) +pixl.lf <- pixl.df %>% select(-c(sample, name, type, location, abrasion)) %>% pivot_longer(1:13) colnames(pixl.lf)<- c("campaign", "feature", "value") ggplot(data = pixl.lf, aes(x=feature, y=value, color = campaign)) + geom_boxplot() + @@ -429,169 +289,7 @@ ggplot(data = sherloc.lf, aes(x=feature, y=value, color = campaign)) + labs(x="", y="log10 scale from percent composition") + theme(panel.grid.major.x=element_blank(), axis.text.x = element_text(angle = 60, vjust = 1.0, hjust=1, size = 10)) ``` -### Discussion of results - -_Provide in natural language a clear discussion of your observations._ - -Lithology: -Certain minerals are abundant in both campaigns, especially Crater Floor. 
- -Carbonate is common in both campaigns - -Organic Matter is also common in both campaigns - -Sulfate and Olivine are also common in both - -High in Crater Floor: - -Pyroxene and amorphous silicate are abundant in Crater Floor but sparse in Delta Front - -Fe_Mg_Clay, Hydrated_Mg_Fe_sulfate, Kaolinite, and Mg_sulfates are in 3 samples in Delta Front, but not at all in Crater Floor. - -There are 20 minerals that are exclusively in either Crater Floor or Delta Front. - -4 minerals have a count of zero, meaning they weren't detected in any campaign (Perchlorates, Na_Perchlorate, Hydrated_Carbonates, & Hydrated_Iron_Oxide). These minerals are present in the atmospheric sample, which is absent in this analysis. - -The pixl graph reveals some big differences between Crater Floor and Delta Front. Namely in Al2O3, CaO, Cr2O3, MgO, P2O5, SO3, & SiO2. - -During our presentation, Dr Roger noted that a predictor for Organic Matter would be very valuable, and also concluded Delta Front has some igneous components to it, contradicting the rock type on all Delta Front samples which says they are sedimentary. - - - -## Analysis: Question 3 (Provide short name) - -### Question being asked - -_Provide in natural language a statement of what question you're trying to answer_ - -The data in pixl is represented by percentages. Is log scaling pixl better for clustering and PCA? 
- - - -### Data Preparation - -_Provide in natural language a description of the data you are using for this analysis_ - -_Include a step-by-step description of how you prepare your data for analysis_ - -_If you're re-using dataframes prepared in another section, re-state what data you're using_ - -```{r, result03_data} -# Include all data processing code (if necessary), clearly commented -#First replace 0.0 entries with 0.00001 so they don't scale to inf -pixl.matrix[pixl.matrix == 0] <- 0.00001 -#Apply log10 to every entry in pixl.matrix & get new scaled df -pixl.scaled <- log10(pixl.matrix) -``` - -### Analysis methods used - -_Describe in natural language a statement of the analysis you're trying to do_ - -First, how does clustering differ between pixl.matrix and pixl.scaled? - -_Provide clearly commented analysis code; include code for tables and figures!_ - -```{r, result03_analysis} -# Include all analysis code, clearly commented -# If not possible, screen shots are acceptable. -# If your contributions included things that are not done in an R-notebook, -# (e.g. researching, writing, and coding in Python), you still need to do -# this status notebook in R. Describe what you did here and put any products -# that you created in github. If you are writing online documents (e.g. overleaf -# or google docs), you can include links to the documents in this notebook -# instead of actual text. 
-#Create an elbow plot for both pixl.matrix & pixl.scaled -wssplot(pixl.matrix, nc=8, seed=14, 'Unscaled') -wssplot(pixl.scaled, nc=8, seed=14, "Scaled") - -#Do kmeans for both matrices -unscaled.kmeans <- kmeans(pixl.matrix, 3) -scaled.kmeans <- kmeans(pixl.scaled, 3) - -#Produce heatmaps for both -pheatmap(unscaled.kmeans$centers, scale="none", main="Unscaled Pixl") -pheatmap(scaled.kmeans$centers, scale="none", main="Scaled Pixl") - -#Do pca for both matrices -unscaled.pca <- prcomp(pixl.matrix) -scaled.pca <- prcomp(pixl.scaled) - -#Make biplots -unscaled.plot <- ggbiplot::ggbiplot(unscaled.pca, - labels = pixl.df$type, - groups = as.factor(unscaled.kmeans$cluster)) + - ggtitle("Unscaled Pixl") - -scaled.plot <- ggbiplot::ggbiplot(scaled.pca, - labels = pixl.df$type, - groups = as.factor(scaled.kmeans$cluster)) + - ggtitle("Scaled Pixl") - -#ggplotly(unscaled.plot) -#ggplotly(scaled.plot) - -pheatmap(pixl.scaled, scale="none") -``` - -```{r} - -``` - -### Discussion of results - -_Provide in natural language a clear discussion of your observations._ -Both elbow plots suggest 3 clusters as the best choice, however the "quality" value for the unscaled data is much higher than with the scaled data. - update: quality matters relatively, not absolutely. Thus this point is unimportant - -Looking at the two biplots, the most influential features are totally different. For unscaled, the samples appear more spread out and the features appear more balanced than for the scaled biplot. - -My suggestion is to not cluster using a log10 scaled pixl matrix from the above observations. - - -## Summary and next steps - -_Provide in natural language a clear summary and your proposed next steps._ - -I scaled a copy of the pixl matrix, and then compared the two through a series of analysis. My conclusion is the scaled copy is not as good for clustering and PCA. -Next steps involve looking at other solutions for scaling, including scale() and the logscale function I made. 
-We concluded pixl should not be scaled. - -Potential organic matter predictor. - -I will continue exploring the differences between campaigns and implementing these features into the 2d app. - -############################################################################################################# -New start, working on essential features for 2d app - -Will include everything campaign we have worked on, but let's start with the most important results first. -#1 -Include a map highlighting which points are in what campaign - -Create interactive map with code from somewhere - or - -Include labelled image - ![pixl](Home/CampaignQuestionFolder/marsmap.JPG) - -#2 -Display lithology and pixl graphs showing exact differences in data between campaigns - -Code for both exists above -![Lithology Feature Count by Campaign](../../StudentNotebooks/Assignment04/LithologyFeatCountbyCampaign.png) - -![pixlDistributionbyCampaign](../../StudentNotebooks/Assignment05/pixlDistributionbyCampaign.png) - -Added sherloc bar plot and box plot with code above -#3 -Include statistical tests (p-value test on feature distribution differences) - -Get code from Evangeline - -From Evangeline: - "Si_Al: TheANOVAtestforSi_Alshowed a significant difference across campaigns (𝑝 = 0.0014), - indicating that the Si_Al composition varies meaningfully between campaigns. - • Fe_Mg: The Fe_Mg composition did not show significant variation across campaigns (𝑝 = - 0.0791), suggesting similar levels of Fe and Mg in the different campaigns. - • Ca_Na_K: For Ca_Na_K, a significant difference was found across campaigns (𝑝 = 0.0136), - indicating some compositional variance based on campaign location." 
- -Do tests for each feature in sherloc/ pixl - -#4 -Include ternairy plot like below: - -![pixlTernairyPlot](../../StudentNotebooks/CampaignQuestionFolder/ternairyplot.JPG) - -Get code from Aadi, can copy and paste from his notebook #cite Nicholas - +Code for ternairy plot ```{r} pixl.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/samples_pixl_wide.Rds") pixl.df[sapply(pixl.df, is.character)] <- lapply(pixl.df[sapply(pixl.df, is.character)], @@ -671,10 +369,3 @@ ggtern(libs_ternplot2, ggtern::aes(x=x, y=y, z=z,cluster=cluster)) + size=3)) ``` -#5 -Briefly address whether or not data clusters coincide with campaign - -The answer is no, both me and Dana checked and there is only a weak correlation - -#6 -Conclude results - diff --git a/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/compta_finalDraft.Rmd b/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/compta_finalDraft.Rmd new file mode 100755 index 0000000..dc126b4 --- /dev/null +++ b/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/compta_finalDraft.Rmd @@ -0,0 +1,523 @@ +--- +title: "Data Analytics Research Individual Final Project Report Mars" +author: "Ashton Compton" +date: "Fall 2024" +output: + pdf_document: + toc: yes + toc_depth: '3' + html_notebook: default + html_document: + toc: yes + toc_depth: 3 + toc_float: yes + number_sections: yes + theme: united +--- + + + + + + +# DAR Project and Group Members + +* Project name: Mars +* Project team members: + - **Ashton Compton** + - Aadi Lahiri + - CJ Marino + - Nicolas Morawski + - Dante Mwatibo + - Charlotte Peterson + - Doña Roberts + - Margo VanEsselstyn + - David Walczyk + +# Instructions (DELETE BEFORE SUBMISSION) + +* The first goal of this notebook is to document your _major findings_ to convey them to your client (Dr. Rogers, Dr. Senveratne, or Mr. Neehal) and to preserve them for future use. + +* The second goal of this notebook is to document your _major findings_ with full scientific reproducibility. 
_Ideally someone should be able to go back years later and understand exactly what you did and reproduce your results._
+
+* You can use the appendix to include additional results to improve the readability (for example extra plots) of your notebook or to show your work even if not really a major finding.
+
+* This is a scientific report written in complete sentences (i.e. not bullets) using good rules of grammar. It should be readable as a paper even if all the code is not shown, and if only the results of running your code are shown.
+
+* You should have sufficient details for scientific reproducibility including documentation of the code. You will need to describe the analysis methods that can be used together with the code to reproduce your work. This is especially important if you use several R files.
+
+* The rubric for grading is here [Rubric](https://docs.google.com/spreadsheets/d/e/2PACX-1vSeo5QZbboWwKnEZodmPQLnhr3hf5FrlzAqy4LydnOAsCw6V-YLWnAU8BzkLdmb9TP0zCpufAzI20XJ/pubhtml)
+
+* A suggested report structure is given below, but you can customize this to meet the needs of your project. For your draft notebook, you will design the structure of your notebook and outline the contents.
+
+* Every student's final project notebook should be written individually or, in rare cases, as a small group. In many cases, you will be discussing joint work located in other notebooks/locations. Talk with the professor if you want to do a joint notebook.
+
+* As noted above, your final notebook serves as a written presentation of your work this semester so it must read like a written document. You should include code but feel free to use proper R Markdown code chunk syntax to hide code chunks that don't need to be shown. You must describe what you are doing and the results outside of the code chunks.
**Your report should be readable and understandable by the readers without reading any code.**
+
+* The R code that executes the results should be embedded in this notebook if possible.
+    + It's also okay to "source" external scripts from within your notebook.
+    + You can also describe functionality code and results that are in other locations (like apps).
+    + PLEASE make sure all source code is in the appropriate repository.
+* Fall 2024 students may have work that is not appropriate to be embedded in your final notebook
+    + You should describe the work in the notebook and provide figures generated elsewhere (e.g. screen shots, graphs).
+    + Indicate if that work has been committed to github. If necessary put details in Appendix including the names of the committed files.
+* Your writing style should be suitable for sharing with external partners/mentors and useful to future contributors. Do not assume that your reader is familiar with the technical details of your implementation and code. Again, write as if this is a research paper.
+* Focus on results; please don't summarize everything you did this semester!
+    + Discuss only the *most important* aspects of your work.
+    + Ask yourself *what really matters?*
+* **IMPORTANT:** Discuss any insights you found regarding your research.
+* If there are limitations to your work, discuss them in detail.
+* Include any **background** or **supporting evidence** for your work.
+    + For example, mention any relevant research articles you found -- and be sure to include references!
+
+## Things to check before you submit (DELETE BEFORE SUBMITTING) ##
+* Have you done all the required components of the notebook in the format required?
+
+* Is your document readable as a research paper even if all the code is suppressed?
+    + Try suppressing all the code using the hint below and see if this is true.
+* Did you proofread your document? Does it use complete sentences and good grammar?
+* Is every figure/table clearly labeled and titled?
+* Does every figure serve a purpose?
+    + Does the figure/table have a useful title? **Hint:** What _question_ does the figure answer?
+    + You can put extra (non-essential) figures/tables in your **Appendix**.
+    + Are the figures/tables captioned?
+    + Are the figures/tables and their associated findings discussed in the text?
+    + Is it clear which figure/table is being discussed? **Hint:** use captions!
+* **CRITICAL:** Have you given enough information for someone to reproduce, understand and extend your results?
+    + Where can they *find* the data and code that you used?
+    + Have you *described* the data that you used?
+    + Have you *documented* your code?
+    + Have you stated where the code is located?
+    + Are your figures/tables *clearly labeled*?
+    + Did you *discuss each figure and your findings*?
+    + Did you use good grammar and *proofread* your results?
+    + Finally, have you *committed* your work to github and made a *pull request*?
+
+* Summarize ALL of your work that is worthy of being preserved in this notebook; feel free to include work in the appendix at the end. It will not be judged as being part of the research document but rather as additional information to be preserved. **If you don't show and/or link to your work here, it doesn't exist for us!**
+
+
+* You **MUST** include figures and/or tables to illustrate your work. *Screen shots or pngs are okay for work generated outside the notebook*.
+
+* You **MUST** include links to other important resources (knitted HTML files, Shiny apps). See the guide below for help.
+
+* Commit the source (`.Rmd`), pdf (`.pdf`) and knitted (`.html`) versions of your notebook and push to github. Turn in the pdf version to LMS.
+
+
+See LMS for guidance on how the contents of this notebook will be graded.
+
+**DELETE THE SECTIONS ABOVE!**
+
+
+# 0.0 Preliminaries.
+
+*R Notebooks are meant to be dynamic documents. Provide any relevant technical guidance for users of your notebook.
Also take care of any preliminaries, such as required packages. Sample text:*
+
+This report is generated from an R Markdown file that includes all the R code necessary to produce the results described and embedded in the report. Code blocks can be suppressed from output for readability using `{R, echo=show}` in the code block header. If `show <- FALSE` the code block will be suppressed; if `show <- TRUE` then the code will be shown.
+
+```{r}
+# Set to TRUE to expand R code blocks; set to FALSE to collapse R code blocks
+show <- TRUE
+```
+
+
+Executing this R notebook requires some subset of the following packages:
+
+* `pandoc`
+* `rmarkdown`
+* `tidyverse`
+* `stringr`
+* `ggbiplot`
+* `pheatmap`
+* `knitr`
+* `paletteer`
+* `plotly`
+* `GGally`
+
+These will be installed and loaded as necessary (code suppressed).
+
+
+```{r, include=FALSE}
+# This code will install required packages if they are not already installed
+# ALWAYS INSTALL YOUR PACKAGES LIKE THIS!
+if (!require("pandoc")) {
+  install.packages("pandoc")
+  library(pandoc)
+}
+
+# Required packages for M20 LIBS analysis
+if (!require("rmarkdown")) {
+  install.packages("rmarkdown")
+  library(rmarkdown)
+}
+if (!require("tidyverse")) {
+  install.packages("tidyverse")
+  library(tidyverse)
+}
+if (!require("stringr")) {
+  install.packages("stringr")
+  library(stringr)
+}
+
+if (!require("ggbiplot")) {
+  install.packages("ggbiplot")
+  library(ggbiplot)
+}
+
+if (!require("pheatmap")) {
+  install.packages("pheatmap")
+  library(pheatmap)
+}
+
+if (!require("knitr")) {
+  install.packages("knitr")
+  library(knitr)
+}
+
+if (!require("paletteer")) {
+  install.packages("paletteer")
+  library(paletteer)
+}
+
+if (!require("plotly")) {
+  install.packages("plotly")
+  library(plotly)
+}
+
+if (!require("GGally")) {
+  install.packages("GGally")
+  library(GGally)
+}
+```
+
+# 1.0 Project Introduction
+
+_Describe your project and your approaches at a high level.
Give enough information that a researcher examining your notebook can understand what this notebook is about._
+
+Our team had access to data from the first 16 samples from the Mars Perseverance rover. Each sample was assigned a campaign, either Crater Floor or Delta Front. I took on the task of finding differences between the two campaigns in the data. Subsetting the data and visualizing it with graphs proved effective for finding significant differences between the campaigns.
+
+```{r }
+# Code
+
+```
+
+# 2.0 Organization of Report
+
+_Give report organization including list of major findings. Sample is provided. Please be sure to edit appropriately and remove this statement._
+
+This report is organized as follows:
+
+
+* Section 3.0. Finding 1: Provide short name and give brief description. We performed a comparison of ying versus yang items using three different approaches: blah1, blah2, and blah3.
+
+* Section 4.0: Finding 2: Short name and brief description.
+
+Repeat as necessary
+
+* Section (X).0 Finding X-2: Short name and brief description.
+
+* Section (X+1).0 Overall conclusions and suggestions
+
+* Section (X+2).0 Appendix This section describes the following additional work that may be helpful in future work: *list subjects*.
+
+
+# 3.0 Finding 1: Sedimentary versus Igneous
+
+_Give a high-level overview of the major finding. What questions were you trying to address, what approaches did you employ, and what happened?_
+
+Originally, data from NASA implied that samples from Crater Floor were all igneous and samples from Delta Front were all sedimentary; however, this was found to be false. Within Delta Front there are signs of igneous rock mixed with sedimentary rock. This is evident from mineral distributions within Delta Front samples. Likewise, sedimentary rock is present in Crater Floor.
+
+## 3.1 Data, Code, and Resources
+
+Here is a list of the data sets and code used in this work.
Each is listed with a brief description and the URL where it is located.
+
+_Some examples you can replace. Note all these links must be clickable and live when the document is submitted. So make sure to do your commits and pull requests._
+
+1. compta-assignment05_f24.Rmd contains all the code I used for analysis
+[https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentNotebooks/Assignment05/compta-assignment05_f24.Rmd](https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentNotebooks/Assignment05/compta-assignment05_f24.Rmd)
+
+2. compta-assignment05_f24.html
+[https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentNotebooks/Assignment05/compta-assignment05_f24.html](https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentNotebooks/Assignment05/compta-assignment05_f24.html)
+
+3. compta-assignment05_f24.pdf
+[https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentNotebooks/Assignment05/compta-assignment05_f24.pdf](https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentNotebooks/Assignment05/compta-assignment05_f24.pdf)
+
+4. mineral_data_static.Rds contains Lithology Data
+[https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/Data/mineral_data_static.Rds](https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/Data/mineral_data_static.Rds)
+
+5. samples_pixl_wide.Rds contains Pixl Data
+[https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/Data/samples_pixl_wide.Rds](https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/Data/samples_pixl_wide.Rds)
+
+6. abrasions_sherloc_samples.Rds contains Sherloc Data
+[https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/Data/abrasions_sherloc_samples.Rds](https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/Data/abrasions_sherloc_samples.Rds)
+
+
+
+*Describe the dataset and preparation and/or preprocessing techniques ("data munging") you use. Put code here if not external.*
+
+
+
+```{r }
+# Code to read in data if appropriate.
+#Load in data +### +# Load the saved lithology data with locations added +lithology.df<- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/mineral_data_static.Rds") + +# Cast samples as numbers +lithology.df$sample <- as.numeric(lithology.df$sample) + +# Convert rest into factors +lithology.df[sapply(lithology.df, is.character)] <- + lapply(lithology.df[sapply(lithology.df, is.character)], + as.factor) + +# Keep only first 16 samples because the data for the rest of the samples is not available yet +#Also i'm getting rid of the atmospheric sample for now +lithology.df<-lithology.df[2:16,] +# Create a matrix containing only the numeric measurements. The remaining features are metadata about the sample. +lithology.matrix <- sapply(lithology.df[,6:40],as.numeric)-1 + +### +# Load the saved PIXL data with locations added +pixl.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/samples_pixl_wide.Rds") + +# Convert to factors +pixl.df[sapply(pixl.df, is.character)] <- lapply(pixl.df[sapply(pixl.df, is.character)], + as.factor) + +#Get rid of atmospheric sample +pixl.df <- pixl.df[2:16,] + +# Make the matrix of just mineral percentage measurements +pixl.matrix <- pixl.df[,2:14] + +### +# Load the saved LIBS data with locations added +libs.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/supercam_libs_moc_loc.Rds") + +#Drop features that are not to be used in the analysis for this notebook +libs.df <- libs.df %>% + select(!(c(distance_mm,Tot.Em.,SiO2_stdev,TiO2_stdev,Al2O3_stdev,FeOT_stdev, + MgO_stdev,Na2O_stdev,CaO_stdev,K2O_stdev,Total))) + +# Convert the points to numeric +libs.df$point <- as.numeric(libs.df$point) + +# Make the a matrix contain only the libs measurements for each mineral +libs.matrix <- as.matrix(libs.df[,6:13]) + +### +# Read in data as provided. 
+sherloc_abrasion_raw <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/abrasions_sherloc_samples.Rds") + +# Clean up data types +sherloc_abrasion_raw$Mineral<-as.factor(sherloc_abrasion_raw$Mineral) +sherloc_abrasion_raw[sapply(sherloc_abrasion_raw, is.character)] <- lapply(sherloc_abrasion_raw[sapply(sherloc_abrasion_raw, is.character)], + as.numeric) +# Transform NA's to 0 +sherloc_abrasion_raw <- sherloc_abrasion_raw %>% replace(is.na(.), 0) + +# Reformat data so that rows are "abrasions" and columns list the presence of minerals. +# Do this by "pivoting" to a long format, and then back to the desired wide format. + +sherloc_long <- sherloc_abrasion_raw %>% + pivot_longer(!Mineral, names_to = "Name", values_to = "Presence") + +# Make abrasion a factor +sherloc_long$Name <- as.factor(sherloc_long$Name) + +# Make it a matrix +sherloc.matrix <- sherloc_long %>% + pivot_wider(names_from = Mineral, values_from = Presence) + +#Remove atmospheric sample +sherloc.matrix <- sherloc.matrix[2:16,] + +# Get sample information from PIXL and add to measurements -- assumes order is the same + +sherloc.df <- cbind(pixl.df[,c("sample","type","campaign","abrasion")],sherloc.matrix) + +# Measurements are everything except first column +sherloc.matrix<-as.matrix(sherloc.matrix[,-1]) +``` + + +## 3.2 Contribution + +_State if this section is sole work or joint work. If joint work describe who you worked with and your contribution. You can also indicate any work by others that you reused._ + +Sole work + +I copy and pasted most of the above code from one of the first assignments in this course. + + + + +## 3.3 Methods Description + + +_Describe the data analytics methods you used and why you chose them. +Discuss your data analytics "pipeline" from *data preparation* and *experimental design*, to *methods*, to *results*. Were you able to use pre-existing implementations? 
If the techniques required user-specified parameters, how did you choose what parameter values to use?_ + +I created dataframes for Lithology, Pixl, Sherloc from the data. I removed the atmospheric sample because it's very flawed. For Lithology, I created a bar graph showing the distribution of each feature, with the samples separated by their campaign. + +## 3.4 Result and Discussion + + + +_For each result, state the method used. Run the code to perform it here (or state how it was run if run elsewhere) +Provide relvant visual illustrations of findings such as tables and graphs. +Then discuss the result. Repeat as necessary. Remember that readers will only read text and results and not code._ + +It can be determined that sedimentary and igneous minerals appear across both campaigns. The below graph shows the difference between mineral distributions across campaigns. + +```{r } +# Code +#Group by campaign & remove metadata +lithology.df.sorted <- lithology.df %>% group_by(campaign) %>% select(-c(sample,name,SampleType,abrasion)) + +#Turn into long form and only keep positive cases +lithology.df.sorted <- lithology.df.sorted %>% pivot_longer(2:ncol(lithology.df.sorted),names_to = "Feature", values_to="Factor") %>% filter(Factor == 1) + +#Count # of identical cases +lithology.df.sorted <- lithology.df.sorted %>% count(Feature) + +#Sort, Crater Floor is High to low & Delta Front is added back in low to high +lithology.df.sorted <- lithology.df.sorted %>% filter(campaign == "Crater Floor") %>% arrange(desc(n)) %>% ungroup() %>% add_row(lithology.df.sorted %>% filter(campaign == "Delta Front") %>% arrange(n)) + +#Make graph +p <- ggplot(lithology.df.sorted, aes(x=factor(Feature, levels = (Feature %>% unique())), y = n, fill = campaign)) + + geom_col(position=position_dodge(preserve="total"), width=0.6) + + theme(panel.grid.major.x=element_blank(), axis.text.x = element_text(angle = 60, vjust = 1.0, hjust=1, size = 12)) + + labs(x="", y="Count") + + ggtitle("Lithology 
Features Count by Campaign") + + scale_fill_paletteer_d(palette = "fishualize::Cephalopholis_argus") + +ggplotly(p, tooltip = c("campaign",'x', "n")) +``` +This plot counts all the occurances of each feature in Lithology, with the samples grouped by campaign. Light blue is from Delta Front samples, and darker blue is samples from Crater Floor. +**Make sure all figures/tables are clearly labelled; always use meaningful titles (please) and provide captions! Provide legends as necessary.** + + +## 3.5 Conclusions, Limitations, and Future Work. + +**Discuss the significance of your finding. Discuss any limitations that should be addressed in the future. Give suggestions for future work.** + +We can say that minerals in Delta Front are predominantly sedimentary rock and minerals in Crater Floor are predominantly igneous, studying the graph above. This supports the theory that the delta fan was formed by moving water, since on Earth delta fans are formed by moving bodies of water depositing sediments on an ocean floor. +My knowledge of geology and especially planetary geology are undeveloped, so I can't interpret the findings of my graph nearly as well as a geologist could. +It follows that a geologist could possibly find more meaning out of my figures than I could. + + +# 4.0 Finding 1: Pixl should not be log scaled + +_These sections can be duplicated for each finding as needed._ + +## 4.1 Data, Code, and Resources + +1. compta-assignment05_f24.Rmd Contains all the code I used for analysis +[https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentNotebooks/Assignment05/compta-assignment05_f24.Rmd](https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentNotebooks/Assignment05/compta-assignment05_f24.Rmd) + +2. 
compta-assignment05_f24.html +[https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentNotebooks/Assignment05/compta-assignment05_f24.html](https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentNotebooks/Assignment05/compta-assignment05_f24.html) + +3. compta-assignment05_f24.pdf +[https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentNotebooks/Assignment05/compta-assignment05_f24.pdf](https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentNotebooks/Assignment05/compta-assignment05_f24.pdf) + +4. samples_pixl_wide.Rds contains Pixl Data +[https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/Data/samples_pixl_wide.Rds](https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/Data/samples_pixl_wide.Rds) + +## 4.2 Contribution + +Solo + +## 4.3 Methods Description + +The pixl dataframe was scaled by the log10 function. This was done in an attempt to produce a better dataframe for machine learning work for Pixl. Clustering and PCA were performed on the new dataframe and the original Pixl dataframe for comparison. + +## 4.4 Result and Discussion + +No more work was done in trying to scale pixl, but ultimately that doesn't prevent machine learning work from being done on pixl, the results just might be less profitable than with other types of data distributions. + +## 4.5 Conclusions and Future Work. + +It was found the scaled frame did not produce better results for pixl than the original dataframe for pixl. Thus it was concluded log scaling pixl doesn't yield better results. +Another attempt at scaling pixl was done by Aadi, trying to use earth reference data to scale pixl. +At the moment, no other ideas for scaling pixl are worth exploring. + +# Bibliography +Provide a listing of references and other sources. + +* Citations from literature. Give each reference a unique name combining first author last name, year, and additional letter if required. e.g.[Bennett22a]. If there is no known author, make something reasonable up. 
+* Significant R packages used + + + + + +# Appendix + +*Include here whatever you think is relevant to support the main content of your notebook. For example, you may have only include example figures above in your main text but include additional ones here. Or you may have done a more extensive investigation, and want to put more results here to document your work in the semester. Be sure to divide appendix into appropriate sections and make the contents clear to the reader using approaches discussed above. * + +Below is a box plot for pixl, showing the feature distributions, with samples separated by campaign. +```{R} +#Make box plots +pixl.lf <- pixl.df %>% select(-c(sample, name, type, location, abrasion)) %>% pivot_longer(1:13) +colnames(pixl.lf)<- c("campaign", "feature", "value") +ggplot(data = pixl.lf, aes(x=feature, y=value, color = campaign)) + + geom_boxplot() + + scale_y_log10() + + ggtitle("pixl distribution by campaign") + + labs(x="", y="log10 scale from percent composition") +``` + +Below is code for log scaling pixl +```{r, result03_data} +#Add in wss plot for elbow method clustering +wssplot <- function(data, nc = 15, seed =10, title="Quality of k-means by Cluster") { + wss <- data.frame(cluster=1:nc, quality=c(0)) + for (i in 1:nc){ + set.seed(seed) + wss[i,2] <- kmeans(data, centers=i)$tot.withinss} + ggplot(data=wss,aes(x=cluster,y=quality)) + + geom_line() + + ggtitle(title) +} +# Include all data processing code (if necessary), clearly commented +#First replace 0.0 entries with 0.00001 so they don't scale to inf +pixl.matrix[pixl.matrix == 0] <- 0.00001 +#Apply log10 to every entry in pixl.matrix & get new scaled df +pixl.scaled <- log10(pixl.matrix) + +#Create an elbow plot for both pixl.matrix & pixl.scaled +wssplot(pixl.matrix, nc=8, seed=14, 'Unscaled') +wssplot(pixl.scaled, nc=8, seed=14, "Scaled") + +#Do kmeans for both matrices +unscaled.kmeans <- kmeans(pixl.matrix, 3) +scaled.kmeans <- kmeans(pixl.scaled, 3) + +#Produce heatmaps for 
both +pheatmap(unscaled.kmeans$centers, scale="none", main="Unscaled Pixl") +pheatmap(scaled.kmeans$centers, scale="none", main="Scaled Pixl") + +#Do pca for both matrices +unscaled.pca <- prcomp(pixl.matrix) +scaled.pca <- prcomp(pixl.scaled) + +#Make biplots +unscaled.plot <- ggbiplot::ggbiplot(unscaled.pca, + labels = pixl.df$type, + groups = as.factor(unscaled.kmeans$cluster)) + + ggtitle("Unscaled Pixl") + +scaled.plot <- ggbiplot::ggbiplot(scaled.pca, + labels = pixl.df$type, + groups = as.factor(scaled.kmeans$cluster)) + + ggtitle("Scaled Pixl") + +#ggplotly(unscaled.plot) +#ggplotly(scaled.plot) + +pheatmap(pixl.scaled, scale="none") +``` +``` \ No newline at end of file diff --git a/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/compta_finalDraft.html b/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/compta_finalDraft.html new file mode 100644 index 0000000..a8d549e --- /dev/null +++ b/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/compta_finalDraft.html @@ -0,0 +1,4082 @@ + + + + + + + + + + + + + + +Data Analytics Research Individual Final Project Report Mars + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+
+
+
+
+ +
+ + + + + + + + + + + +
+

1 DAR Project and Group +Members

+
    +
  • Project name: Mars
  • +
  • Project team members: +
      +
    • Ashton Compton
    • +
    • Aadi Lahiri
    • +
    • CJ Marino
    • +
    • Nicolas Morawski
    • +
    • Dante Mwatibo
    • +
    • Charlotte Peterson
    • +
    • Doña Roberts
    • +
    • Margo VanEsselstyn
    • +
    • David Walczyk
    • +
  • +
+
+
+

2 Instructions (DELETE +BEFORE SUBMISSION)

+
    +
  • The first goal of this notebook is to document your major +findings to convey them to your client (Dr. Rogers, Dr. Senveratne, +or Mr. Neehal) and to preserve them for future use.

  • +
  • The second goal of this notebook is to document your major +findings with full scientific reproducibility. Ideally someone +should be able to go back years later and understand exactly what you +did and reproduce your results.

  • +
  • You can use the appendix to include additional results to improve +the readability (for example extra plots) of your notebook or to show +your work even if not really a major finding.

  • +
  • This is a scientific report written in complete sentences +(i.e. not bullets) using good rules of grammar. It should be readable as +a paper even if all the code is not shown, and if only the results of +running your code are shown.

  • +
  • You should have sufficient details for scientific reproducibility +including documentation of the code. You will need to describe the +analysis methods that can be used together with the code to reproduce +your work. This is especially important if you use several R +files.

  • +
  • The rubric for grading is here Rubric

  • +
  • A suggested report structure is given below, but you can +customize this to meet the needs of your project. For your draft +notebook, you will design the structure of your notebook and outline the +contents.

  • +
  • Every student’s final project notebook should be written +individually or, in rare cases, as a small group. In many cases, you +will be discussing joint work located in other notebooks/locations. Talk +with the professor if you want to do a joint notebook.

  • +
  • As noted above, your final notebook serves as a written +presentation of your work this semester, so it must be written like a +written document. You should include code, but feel free to use +proper R Markdown code chunk syntax to hide code chunks that don’t need +to be shown. You must describe what you are doing and the results +outside of the code chunks. Your report should be readable and +understandable by the readers without reading any +code.

  • +
  • The R code that executes the results should be embedded in this +notebook if possible.

    +
      +
    • It’s also okay to “source” external scripts from within your +notebook.
    • +
    • You can also describe functionality code and results that are in +other locations (like apps).
      +
    • +
    • PLEASE make sure all source code is in appropriate repository.
    • +
  • +
  • Fall 2024 students may have work that is not appropriate to be +embedded on your final notebook

    +
      +
    • You should describe the work in the notebook and provide figures +generated elsewhere (e.g. screen shots, graphs).
    • +
    • Indicate if that work has been committed to github. If necessary put +details in Appendix including the names of the committed files.
      +
    • +
  • +
  • Your writing style should be suitable for sharing with external +partners/mentors and useful to future contributors. Do not assume that +your reader is familiar with the technical details of your +implementation and code. Again, write as if this is a research +paper.

  • +
  • Focus on results; please don’t summarize everything you did this +semester!

    +
      +
    • Discuss only the most important aspects of your work.
    • +
    • Ask yourself what really matters?
    • +
  • +
  • IMPORTANT: Discuss any insights you found +regarding your research.

  • +
  • If there are limitations to your work, discuss, in +detail.

  • +
  • Include any background or supporting +evidence for your work.

    +
      +
    • For example, mention any relevant research articles you found – and +be sure to include references!
    • +
  • +
+
+

2.1 Things to check +before you submit (DELETE BEFORE SUBMITTING)

+
    +
  • Have you done all the required components of the notebook in the +format required?

  • +
  • Is your document readable as a research paper even if all the +code is suppressed?

    +
      +
    • Try suppressing all the code using hint below and see if this is +true.
    • +
  • +
  • Did you proofread your document? Does it use complete sentences +and good grammar?

  • +
  • Is every figure/table clearly labeled and titled?

  • +
  • Does every figure serve a purpose?

    +
      +
    • Does the figure/table have a useful title? Hint: +What question does the figure answer?
    • +
    • You can put extra (non-essential) figures/tables in your +Appendix.
    • +
    • Are the figures/tables captioned?
    • +
    • Are the figures/tables and their associated findings discussed in the +text?
    • +
    • Is it clear which figure/table is being discussed? +Hint: use captions!
    • +
  • +
  • CRITICAL: Have you given enough information for +someone to reproduce, understand and extend your results?

    +
      +
    • Where can they find the data and code that you used?
    • +
    • Have you described the data that you used?
    • +
    • Have you documented your code?
    • +
    • Have you stated where code is located?
    • +
    • Are your figures/tables clearly labeled?
    • +
    • Did you discuss each figure and your findings?
    • +
    • Did you use good grammar and proofread your results?
    • +
    • Finally, have you committed your work to github and made a +pull request?
    • +
  • +
  • Summarize ALL of your work that is worthy of being preserved in +this notebook. Feel free to include work in the appendix at the end. It will +not be judged as being part of the research document but rather as +additional information to be preserved. If you don’t show and/or +link to your work here, it doesn’t exist for us!

  • +
  • You MUST include figures and/or tables to +illustrate your work. Screen shots or pngs are okay for work +generated outside the notebook.

  • +
  • You MUST include links to other important +resources (knitted HTML files, Shiny apps). See the guide below for +help.

  • +
+
    +
  1. Commit the source (.Rmd), pdf (.pdf) and +knitted (.html) versions of your notebook and push to +github. Turn in the pdf version to lms.
  2. +
+

See LMS for guidance on how the contents of this notebook will be +graded.

+

DELETE THE SECTIONS ABOVE!

+
+
+
+

3 0.0 Preliminaries.

+

R Notebooks are meant to be dynamic documents. Provide any +relevant technical guidance for users of your notebook. Also take care +of any preliminaries, such as required packages. Sample text:

+

This report is generated from an R Markdown file that includes all +the R code necessary to produce the results described and embedded in +the report. Code blocks can be suppressed from output for readability +by writing {R, echo=show} in the code block +header. If show <- FALSE the code block will be +suppressed; if show <- TRUE then the code will be +shown.

+
# Set to TRUE to expand R code blocks; set to FALSE to collapse R code blocks 
+show <- TRUE
+ +

Executing this R notebook requires some subset of the following +packages:

+
    +
  • pandoc
  • +
  • rmarkdown
  • +
  • tidyverse
  • +
  • stringr
  • +
  • ggbiplot
  • +
  • pheatmap
  • +
  • knitr
  • +
  • paletteer
  • +
  • plotly
  • +
  • GGally
  • +
+

These will be installed and loaded as necessary (code +suppressed).
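
As a minimal sketch of the install-and-load step mentioned above, assuming a helper of my own naming (`load_or_install` is hypothetical and is not the notebook's suppressed code), each package can be checked, installed only if missing, and then attached. The CRAN mirror URL is also an assumption.

```r
# Hypothetical helper (not from the notebook): install a package only when it
# is missing, then attach it. The default CRAN mirror is an assumption.
load_or_install <- function(pkgs, repos = "https://cloud.r-project.org") {
  for (pkg in pkgs) {
    # requireNamespace() checks availability without attaching the package
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg, repos = repos)
    }
    # character.only = TRUE lets library() accept the name as a string
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# Example with packages that ship with R, so nothing is downloaded
loaded <- load_or_install(c("stats", "utils"))
```

In the notebook itself the same call would presumably take the package names from the list above, for example `load_or_install(c("tidyverse", "pheatmap"))`.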

+ +
+
+

4 1.0 Project +Introduction

+

Describe your project and your approaches at a high level. Give +enough information that a researcher examining your notebook can +understand what this notebook is about.

+

Our team had access to data from the first 16 samples from the Mars +Perseverance rover. Each sample was assigned a campaign, either Crater +Floor or Delta Front. I took on the task of finding differences between +the two campaigns in the data. Selecting subsets of the data and +visualizing them with graphs worked well for finding notable +differences between the campaigns.

+
# Code 
+
+
+

5 2.0 Organization of +Report

+

Give report organization including list of major findings. Sample +is provided. Please be sure to edit appropriately and remove this +statement.

+

This report is organized as follows:

+
    +
  • Section 3.0. Finding 1: Sedimentary versus Igneous. Lithology +feature counts were compared between the Crater Floor and Delta Front +campaigns to check whether each campaign contains only one rock type.

  • +
  • Section 4.0: Finding 2: Pixl should not be log scaled. Clustering +and PCA were run on log10-scaled and unscaled Pixl data for comparison.

  • +
+

Repeat as necessary

+
    +
  • Section (X).0 Finding X-2: Short name and brief +description.

  • +
  • Section (X+1).0 Overall conclusions and suggestions

  • +
  • Section (X+2).0 Appendix. This section describes the following +additional work that may be helpful in future work: list +subjects.

  • +
+
+
+

6 3.0 Finding 1: +Sedimentary versus Igneous

+

Give a high-level overview of the major finding. What questions +were you trying to address, what approaches did you employ, and what +happened?

+

Originally, data from NASA implied that samples from Crater Floor +were all igneous and samples from Delta Front were all sedimentary; +however, this was found to be false. Within Delta Front there are signs +of igneous rock mixed with sedimentary rock. This is evident from +mineral distributions within Delta Front samples. Likewise, sedimentary +rock is present in Crater Floor.

+
+

6.1 3.1 Data, Code, and +Resources

+

Here is a list of the data sets and code used in your work, along +with a brief description and the URL where each is located.

+

Some examples you can replace. Note all these links must be +clickable and live when document submitted. So make sure to do your +commits and pull requests.

+
    +
  1. compta-assignment05_f24.Rmd Contains all the code I used for +analysis https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentNotebooks/Assignment05/compta-assignment05_f24.Rmd

  2. +
  3. compta-assignment05_f24.html https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentNotebooks/Assignment05/compta-assignment05_f24.html

  4. +
  5. compta-assignment05_f24.pdf https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentNotebooks/Assignment05/compta-assignment05_f24.pdf

  6. +
  7. mineral_data_static.Rds contains Lithology Data https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/Data/mineral_data_static.Rds

  8. +
  9. samples_pixl_wide.Rds contains Pixl Data https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/Data/samples_pixl_wide.Rds

  10. +
  11. abrasions_sherloc_samples.Rds contains Sherloc Data https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/Data/abrasions_sherloc_samples.Rds

  12. +
+

Describe the dataset and the preparation and/or preprocessing techniques +(“data munging”) you used. Put code here if not external.

+
# Code to read in data if appropriate.
+#Load in data
+###
+# Load the saved lithology data with locations added
+lithology.df<- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/mineral_data_static.Rds")
+
+# Cast samples as numbers
+lithology.df$sample <- as.numeric(lithology.df$sample)
+
+# Convert rest into factors
+lithology.df[sapply(lithology.df, is.character)] <-
+  lapply(lithology.df[sapply(lithology.df, is.character)], 
+                                       as.factor)
+
+# Keep only first 16 samples because the data for the rest of the samples is not available yet
+#Also i'm getting rid of the atmospheric sample for now
+lithology.df<-lithology.df[2:16,]
+# Create a matrix containing only the numeric measurements.  The remaining features are metadata about the sample. 
+lithology.matrix <- sapply(lithology.df[,6:40],as.numeric)-1
+
+###
+# Load the saved PIXL data with locations added
+pixl.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/samples_pixl_wide.Rds")
+
+# Convert to factors
+pixl.df[sapply(pixl.df, is.character)] <- lapply(pixl.df[sapply(pixl.df, is.character)], 
+                                       as.factor)
+
+#Get rid of atmospheric sample
+pixl.df <- pixl.df[2:16,]
+
+# Make the matrix of just mineral percentage measurements
+pixl.matrix <- pixl.df[,2:14]
+
+###
+# Load the saved LIBS data with locations added
+libs.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/supercam_libs_moc_loc.Rds")
+
+#Drop  features that are not to be used in the analysis for this notebook
+libs.df <- libs.df %>% 
+  select(!(c(distance_mm,Tot.Em.,SiO2_stdev,TiO2_stdev,Al2O3_stdev,FeOT_stdev,
+             MgO_stdev,Na2O_stdev,CaO_stdev,K2O_stdev,Total)))
+
+# Convert the points to numeric
+libs.df$point <- as.numeric(libs.df$point)
+
+# Make a matrix containing only the LIBS measurements for each mineral
+libs.matrix <- as.matrix(libs.df[,6:13]) 
+
+###
+# Read in data as provided.  
+sherloc_abrasion_raw <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/abrasions_sherloc_samples.Rds")
+
+# Clean up data types
+sherloc_abrasion_raw$Mineral<-as.factor(sherloc_abrasion_raw$Mineral)
+sherloc_abrasion_raw[sapply(sherloc_abrasion_raw, is.character)] <- lapply(sherloc_abrasion_raw[sapply(sherloc_abrasion_raw, is.character)], 
+                                       as.numeric)
+# Transform NA's to 0
+sherloc_abrasion_raw <- sherloc_abrasion_raw %>% replace(is.na(.), 0)
+
+# Reformat data so that rows are "abrasions" and columns list the presence of minerals. 
+# Do this by "pivoting" to a long format, and then back to the desired wide format.  
+
+sherloc_long <- sherloc_abrasion_raw %>%
+  pivot_longer(!Mineral, names_to = "Name", values_to = "Presence")
+
+# Make abrasion a factor 
+sherloc_long$Name <- as.factor(sherloc_long$Name)
+
+# Make it a matrix
+sherloc.matrix <- sherloc_long %>%
+  pivot_wider(names_from = Mineral, values_from = Presence)
+
+#Remove atmospheric sample
+sherloc.matrix <- sherloc.matrix[2:16,]
+
+# Get sample information from PIXL and add to measurements -- assumes order is the same
+
+sherloc.df <- cbind(pixl.df[,c("sample","type","campaign","abrasion")],sherloc.matrix)
+
+# Measurements are everything except first column
+sherloc.matrix<-as.matrix(sherloc.matrix[,-1])
+
+
+

6.2 3.2 Contribution

+

State if this section is sole work or joint work. If joint work +describe who you worked with and your contribution. You can also +indicate any work by others that you reused.

+

Sole work

+

I copied and pasted most of the above code from one of the first +assignments in this course.

+
+
+

6.3 3.3 Methods +Description

+

Describe the data analytics methods you used and why you chose +them. Discuss your data analytics “pipeline” from data +preparation and experimental design, to methods, +to results. Were you able to use pre-existing implementations? +If the techniques required user-specified parameters, how did you choose +what parameter values to use?

+

I created dataframes for Lithology, Pixl, and Sherloc from the data. I +removed the atmospheric sample because it’s very flawed. For Lithology, +I created a bar graph showing the count of each feature, with the +samples separated by their campaign.
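
The counting step behind that bar graph can be sketched in base R on made-up data. The column names and 0/1 values below are invented for illustration and are not taken from the lithology file; the notebook itself does the equivalent with tidyverse verbs.

```r
# Toy stand-in for the lithology table: one row per sample, 0/1 feature flags.
# All names and values here are invented for illustration only.
toy <- data.frame(
  campaign = c("Crater Floor", "Crater Floor", "Delta Front", "Delta Front"),
  olivine  = c(1, 1, 0, 1),
  sulfate  = c(0, 0, 1, 1),
  clay     = c(0, 1, 1, 1)
)

# Sum each 0/1 feature column within each campaign, giving one count
# per feature per campaign (the heights of the bars in a dodged bar graph)
counts <- aggregate(. ~ campaign, data = toy, FUN = sum)
print(counts)
```

Each row of `counts` corresponds to one campaign, so a feature that is nonzero in both rows appears in both campaigns.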

+
+
+

6.4 3.4 Result and +Discussion

+ +

For each result, state the method used. Run the code to perform +it here (or state how it was run if run elsewhere). Provide relevant +visual illustrations of findings such as tables and graphs. Then discuss +the result. Repeat as necessary. Remember that readers will only read +text and results and not code.

+

It can be determined that sedimentary and igneous minerals appear +across both campaigns. The graph below shows the difference between +mineral distributions across campaigns.

+
# Code 
+#Group by campaign & remove metadata
+lithology.df.sorted <- lithology.df %>% group_by(campaign) %>% select(-c(sample,name,SampleType,abrasion))
+
+#Turn into long form and only keep positive cases
+lithology.df.sorted <- lithology.df.sorted %>% pivot_longer(2:ncol(lithology.df.sorted),names_to = "Feature", values_to="Factor") %>% filter(Factor == 1)
+
+#Count # of identical cases
+lithology.df.sorted <- lithology.df.sorted %>% count(Feature)
+
+#Sort, Crater Floor is High to low & Delta Front is added back in low to high
+lithology.df.sorted <- lithology.df.sorted %>% filter(campaign == "Crater Floor") %>% arrange(desc(n)) %>% ungroup() %>% add_row(lithology.df.sorted %>% filter(campaign == "Delta Front") %>% arrange(n))
+
+#Make graph
+p <- ggplot(lithology.df.sorted, aes(x=factor(Feature, levels = (Feature %>% unique())), y = n, fill = campaign)) + 
+  geom_col(position=position_dodge(preserve="total"), width=0.6) +
+  theme(panel.grid.major.x=element_blank(), axis.text.x = element_text(angle = 60, vjust = 1.0, hjust=1, size = 12)) +
+  labs(x="", y="Count") +
+  ggtitle("Lithology Features Count by Campaign") +
+  scale_fill_paletteer_d(palette = "fishualize::Cephalopholis_argus")
+
+ggplotly(p, tooltip = c("campaign",'x', "n"))
+
+ +

This plot counts all the occurrences of each feature in Lithology, +with the samples grouped by campaign. Light blue is from Delta Front +samples, and darker blue is from Crater Floor samples. Make sure +all figures/tables are clearly labelled; always use meaningful titles +(please) and provide captions! Provide legends as +necessary.

+
+
+

6.5 3.5 Conclusions, +Limitations, and Future Work.

+

Discuss the significance of your finding. Discuss any +limitations that should be addressed in the future. Give suggestions for +future work.

+

From the graph above, we can say that minerals in Delta Front are +predominantly sedimentary and minerals in Crater Floor are predominantly +igneous. This supports the theory that the delta fan was formed +by moving water, since on Earth delta fans are formed by moving bodies +of water depositing sediments on an ocean floor. My knowledge of geology, +and especially planetary geology, is limited, so I can’t interpret +the findings of my graph nearly as well as a geologist could. A +geologist could likely find more meaning in my figures than +I could.

+
+
+
+

7 4.0 Finding 2: Pixl +should not be log scaled

+

These sections can be duplicated for each finding as +needed.

+ +
+

7.2 4.2 Contribution

+

Solo

+
+
+

7.3 4.3 Methods +Description

+

The Pixl dataframe was scaled with the log10 function. This was done in +an attempt to produce a better dataframe for machine learning work on +Pixl. Clustering and PCA were performed on both the scaled dataframe and the +original Pixl dataframe for comparison.
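
The comparison described here can be sketched on synthetic data. The matrix below is random and is not Pixl data; the zero replacement value 0.00001 and the choice of 3 clusters follow the appendix code, while `nstart = 5` and the variable names are my own assumptions.

```r
set.seed(14)

# Synthetic percent-composition matrix (15 samples x 4 features); not real data
toy <- matrix(runif(60, min = 0, max = 40), nrow = 15,
              dimnames = list(NULL, c("A", "B", "C", "D")))
toy[1, 1] <- 0  # force one zero so the replacement step matters

# Replace zeros with a small positive value so log10 stays finite
toy_pos <- toy
toy_pos[toy_pos == 0] <- 0.00001
toy_log <- log10(toy_pos)

# k-means with the same number of clusters on both versions
km_raw <- kmeans(toy, centers = 3, nstart = 5)
km_log <- kmeans(toy_log, centers = 3, nstart = 5)

# PCA on both versions, plus the proportion of variance per component
pca_raw <- prcomp(toy)
pca_log <- prcomp(toy_log)
var_raw <- pca_raw$sdev^2 / sum(pca_raw$sdev^2)
var_log <- pca_log$sdev^2 / sum(pca_log$sdev^2)

# Cross-tabulate the two cluster assignments to see how much they agree
print(table(raw = km_raw$cluster, log = km_log$cluster))
```

One thing to watch in this kind of comparison: a zero replaced by 0.00001 becomes -5 after log10, far from the other values, so those entries can dominate both the k-means distances and the leading principal component.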

+
+
+

7.4 4.4 Result and +Discussion

+

No further work was done on scaling Pixl. Ultimately, that +doesn’t prevent machine learning work from being done on Pixl; the +results just might be less informative than with other types of data +distributions.

+
+
+

7.5 4.5 Conclusions and +Future Work.

+

The scaled dataframe did not produce better results for Pixl +than the original dataframe, so I concluded that log scaling +Pixl doesn’t yield better results. Another attempt at scaling Pixl was +made by Aadi, who tried to use Earth reference data to scale Pixl. At the +moment, no other ideas for scaling Pixl are worth exploring.

+
+
+
+

8 Bibliography

+

Provide a listing of references and other sources.

+
    +
  • Citations from literature. Give each reference a unique name combining first author last name, year, and an additional letter if required, e.g. [Bennett22a]. If there is no known author, make something reasonable up.
  • Significant R packages used
+
+
+

9 Appendix

+


+

Below is a box plot for pixl, showing the feature distributions, with +samples separated by campaign.

+
#Make box plots
+pixl.lf <- pixl.df %>% select(-c(sample, name, type, location, abrasion)) %>% pivot_longer(1:13)
+colnames(pixl.lf)<- c("campaign", "feature", "value")
+ggplot(data = pixl.lf, aes(x=feature, y=value, color = campaign)) +
+  geom_boxplot() +
+  scale_y_log10() +
+  ggtitle("pixl distribution by campaign") +
+  labs(x="", y="log10 scale from percent composition")
+
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
+
## Warning: Removed 5 rows containing non-finite outside the scale range
+## (`stat_boxplot()`).
+

+

Below is the code for log scaling Pixl and for comparing k-means clustering and PCA on the scaled and unscaled data.

+
#Add in wss plot for elbow method clustering
+wssplot <- function(data, nc = 15, seed =10, title="Quality of k-means by Cluster") {
+  wss <- data.frame(cluster=1:nc, quality=c(0))
+  for (i in 1:nc){
+    set.seed(seed)
+    wss[i,2] <- kmeans(data, centers=i)$tot.withinss}
+  ggplot(data=wss,aes(x=cluster,y=quality)) + 
+    geom_line() + 
+    ggtitle(title)
+}
+#First replace 0.0 entries with 0.00001 so that log10 does not map them to -Inf
+pixl.matrix[pixl.matrix == 0] <- 0.00001
+#Apply log10 to every entry in pixl.matrix & get new scaled df
+pixl.scaled <- log10(pixl.matrix)
+
+#Create an elbow plot for both pixl.matrix & pixl.scaled
+wssplot(pixl.matrix, nc=8, seed=14, 'Unscaled')
+

+
wssplot(pixl.scaled, nc=8, seed=14, "Scaled")
+

+
#Do kmeans for both matrices
+unscaled.kmeans <- kmeans(pixl.matrix, 3)
+scaled.kmeans <- kmeans(pixl.scaled, 3)
+
+#Produce heatmaps for both
+pheatmap(unscaled.kmeans$centers, scale="none", main="Unscaled Pixl")
+

+
pheatmap(scaled.kmeans$centers, scale="none", main="Scaled Pixl")
+

+
#Do pca for both matrices
+unscaled.pca <- prcomp(pixl.matrix)
+scaled.pca <- prcomp(pixl.scaled)
+
+#Make biplots
+unscaled.plot <- ggbiplot::ggbiplot(unscaled.pca,
+                   labels = pixl.df$type,
+                   groups = as.factor(unscaled.kmeans$cluster)) +
+                   ggtitle("Unscaled Pixl")
+
+scaled.plot <- ggbiplot::ggbiplot(scaled.pca,
+                   labels = pixl.df$type,
+                   groups = as.factor(scaled.kmeans$cluster)) +
+                   ggtitle("Scaled Pixl")
+
+#ggplotly(unscaled.plot)
+#ggplotly(scaled.plot)
+
+pheatmap(pixl.scaled, scale="none")
+

+```

+
diff --git a/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/compta_finalDraft.pdf b/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/compta_finalDraft.pdf
new file mode 100644
index 0000000..d947cfe
Binary files /dev/null and b/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/compta_finalDraft.pdf differ