---
title: "Data Analytics Research Individual Final Project Report - Mars"
author: "Charlotte Peterson"
date: "Fall 2024"
output:
  html_document:
    toc: yes
    toc_depth: 3
    toc_float: yes
    number_sections: yes
    theme: united
  html_notebook: default
  pdf_document:
    toc: yes
    toc_depth: '3'
---


# DAR Project and Group Members

* Project name: Mars
* GitHub ID: dar-peterc
* Project team members: Dante Mwatibo, Doña Roberts, David Walcyzk, Xuanting Wang, Ashton Compton, Margo VanEsselstyn, Nicolas Morawski, CJ Marino, Aadi Lahiri

# 0.0 Preliminaries.

This report is generated from an R Markdown file that includes all the R code necessary to produce the results described and embedded in the report. Code blocks can be suppressed from the output for readability by putting `echo=show` in the chunk header, as in `{r, echo=show}`. If `show <- FALSE`, the code is hidden; if `show <- TRUE`, the code is shown.

```{r}
# Set to TRUE to expand R code blocks; set to FALSE to collapse R code blocks
show <- TRUE
```

<!-- Expand this list as necessary for your notebook -->
Executing this R notebook requires some subset of the following packages:

* `ggplot2`
* `tidyverse`
* `pandoc`
* `rmarkdown`
* `stringr`
* `ggbiplot`
* `knitr`
* `rpart`
* `rpart.plot`
* `caret`
* `ggrepel`
* `ggtern`
* `geosphere`


These will be installed and loaded as necessary (code suppressed).

<!-- The `include=FALSE` option prevents your code from being shown at all -->
```{r, include=FALSE}
# This code will install required packages if they are not already installed
# ALWAYS INSTALL YOUR PACKAGES LIKE THIS!
if (!require("ggplot2")) {
   install.packages("ggplot2")
   library(ggplot2)
}
if (!require("tidyverse")) {
   install.packages("tidyverse")
   library(tidyverse)
}

if (!require("pandoc")) {
  install.packages("pandoc")
  library(pandoc)
}

# Required packages for M20 LIBS analysis
if (!require("rmarkdown")) {
  install.packages("rmarkdown")
  library(rmarkdown)
}

if (!require("stringr")) {
  install.packages("stringr")
  library(stringr)
}

if (!require("ggbiplot")) {
  install.packages("ggbiplot")
  library(ggbiplot)
}

if (!require("knitr")) {
  install.packages("knitr")
  library(knitr)
}

if (!require("rpart")) {
  install.packages("rpart")
  library(rpart)
}

if (!require("rpart.plot")) {
  install.packages("rpart.plot")
  library(rpart.plot)
}

if (!require("caret")) {
  install.packages("caret")
  library(caret)
}

if (!require("ggrepel")) {
  install.packages("ggrepel")
  library(ggrepel)
}

# geosphere provides distHaversine(), used for the PIXL-to-LIBS distances
if (!require("geosphere")) {
  install.packages("geosphere")
  library(geosphere)
}

if (!require("ggtern")) {
  install.packages("ggtern")
  library(ggtern)
}

```
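The repeated `if (!require(...))` pattern above could also be written as one small helper that loops over package names. The sketch below is one possible way to do that, not the notebook's actual setup code; `load_or_install` is a hypothetical name, and the example call uses packages that ship with R so nothing is downloaded.

```r
# Hypothetical helper (not part of the original notebook): install each
# package only if it is missing, then attach it.
load_or_install <- function(pkgs) {
  for (pkg in pkgs) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)
    }
    # character.only = TRUE lets library() take the name from a variable
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# Example call with packages that ship with R, so no installation happens
loaded <- load_or_install(c("stats", "utils"))
```

Passing the notebook's own package names as a character vector would keep the package list and the loading code in one place.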

# 1.0 Project Introduction

The Mars Project is focused on data from the Mars 2020 Perseverance rover. The goal of the mission is to look for ancient microbial life or forms of water on Mars (things that could suggest life). Perseverance carries multiple instruments, including PIXL (Planetary Instrument for X-ray Lithochemistry), SHERLOC (Scanning Habitable Environments with Raman and Luminescence for Organics and Chemicals), and SuperCam. SuperCam combines several spectroscopic techniques for measuring the properties of materials on Mars, including LIBS (laser-induced breakdown spectroscopy). This notebook focuses primarily on the PIXL and LIBS data we have been given.

# 2.0 Organization of Report

This report is organized as follows:

* Section 3.0 Finding 1: PIXL and LIBS Matching - We combined the LIBS and PIXL data sets by choosing a maximum distance from a PIXL abrasion and matching the LIBS samples that fall within that distance.

* Section 4.0 Finding 2: Soil Composition Analysis - Using the combined LIBS and PIXL data set, I created a plot of the composition percentages of chemical compounds such as SiO2 and K2O, using log scaling, to compare the composition of a PIXL abrasion with the compositions of the corresponding LIBS samples (the LIBS samples within a chosen distance of the PIXL abrasion).

* Section 5.0 Finding 3: Analyzing Cation Combinations using LIBS and PIXL matched data - Using the combined LIBS and PIXL data set, we created a ternary plot to show the distribution of LIBS samples, grouped by the PIXL abrasion they are closest to (based on a chosen distance variable).

* Section 6.0 Overall conclusions and suggestions

* Section 7.0 Appendix - This section describes the following additional work that may be helpful in future work: *list subjects*.


# 3.0 Finding 1: PIXL and LIBS Matching

First, we look at how PIXL and LIBS correspond. Our group found very early in our research that the two data sets share no feature that can be used to match them directly. For example, PIXL is organized by latitude and longitude as well as sample number (1-16), sample name, and abrasion name. LIBS, unfortunately, is not sorted the same way: it is organized by the sol on which each sample was taken. LIBS also contains many different types of samples, including Earth reference data that the rover carries for comparison with different sample sites. To match PIXL targets to their corresponding LIBS samples, Margo and I created a new data set that added another metadata feature to PIXL (latitude and longitude coordinates), which we obtained from the Analyst's Notebook. Once this was added, we realized that the longitudes and latitudes of the two data sets did not match exactly. Margo therefore created a distance function that matches LIBS samples to PIXL targets within whatever distance a person specifies. Originally, we had rounded to the nearest thousandth and matched on that.

This helped answer the question of how to relate the LIBS and PIXL data sets so that they can be plotted on the same axes. I was also curious to see how close PIXL targets are to LIBS sample sites, and how many LIBS samples would be associated with a PIXL target within a radius of, say, 7 or 10 meters.
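To make the idea of distance-based matching concrete, here is a minimal base R sketch of a great-circle (haversine) distance. It is not Margo's function: `haversine_m` is a hypothetical name, the coordinates are made up for illustration, and the default radius simply reuses the `r = 3393169` value that appears in the distance loop later in this notebook.

```r
# Haversine distance in meters between points given in decimal degrees.
# Hypothetical helper; the default radius is the value used later in this
# notebook (r = 3393169), not an independently verified constant.
haversine_m <- function(lat1, lon1, lat2, lon2, r = 3393169) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * r * asin(pmin(1, sqrt(a)))
}

# Made-up coordinates: two points 0.001 degrees of latitude apart
d_example <- haversine_m(18.444, 77.451, 18.445, 77.451)
```

With this radius, 0.001 degrees of latitude works out to about 59 meters, which gives a feel for why matching on coordinates rounded to the nearest thousandth was coarse compared with an explicit distance threshold.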

## 3.1 Data, Code, and Resources

The data sets and code used in this work are listed below, with a brief description and the URL where each is located.

1. peterc_finalProjectF24.Rmd (with knit pdf and html) is this notebook.
[https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/peterc_finalProjectF24_roughdraft.Rmd](https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/peterc_finalProjectF24_roughdraft.Rmd)

2. v1_libs_to_sample.Rds is the combined data set of PIXL and LIBS that includes the distance from a PIXL abrasion to a LIBS sample.
[https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentData/v1_libs_to_sample.Rds](https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentData/v1_libs_to_sample.Rds).

First, we set the distance threshold, in meters, between a PIXL abrasion and a LIBS sample. The v1_libs_to_sample.Rds data set, which Margo and I collaborated on, includes a `Distance` column computed by a function Margo wrote that measures the distance between a PIXL abrasion and a LIBS sample from their latitude and longitude coordinates.
```{r}
meters <- 100
```

To prepare the data, we load v1_libs_to_sample.Rds, group by the LIBS latitude and longitude, keep one row per LIBS location (the one with the highest point number), and drop every LIBS sample that is farther from its corresponding PIXL abrasion than the chosen distance (`meters`). To make a scatter plot of the LIBS and PIXL points, we also build `unique_pixl`, a data frame with one row per PIXL abrasion and its coordinates, which is used to plot the PIXL abrasions.

```{r }
libs_to_sample <- readRDS("~/DAR-Mars-F24/StudentData/v1_libs_to_sample.Rds")
#make a filtered data frame that picks the max point out of all libs samples at a certain target
# for simplicity
df_filtered <- libs_to_sample %>%
  group_by(Lat.libs, Lon.libs) %>%
  filter(Point.libs == max(Point.libs)) %>%
  ungroup()
df_distance_filter <- df_filtered[df_filtered$Distance <= meters,]

#make a data frame with the unique pixl coordinates since they are in pairs of identical lat/lon
unique_pixl <- df_filtered %>%
  select(Lat.pixl, Lon.pixl, Abrasion.pixl) %>% distinct()
```
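The two filtering steps above can be traced on a toy data frame. The sketch below uses base R rather than dplyr so it runs without extra packages; the column names mirror the ones in the chunk above, but the values and the `max_distance` name are made up for illustration.

```r
# Toy stand-in for libs_to_sample; all values are invented
toy <- data.frame(
  Lat.libs   = c(18.44, 18.44, 18.45, 18.46),
  Lon.libs   = c(77.45, 77.45, 77.46, 77.47),
  Point.libs = c(1, 2, 1, 1),
  Distance   = c(12, 12, 80, 250)
)
max_distance <- 100  # plays the role of `meters`

# Step 1: within each LIBS location, keep the row with the highest point number.
# ave() returns the group maximum repeated for every row of the group.
location_key <- paste(toy$Lat.libs, toy$Lon.libs)
group_max <- ave(toy$Point.libs, location_key, FUN = max)
toy_filtered <- toy[toy$Point.libs == group_max, ]

# Step 2: keep only rows within the chosen distance of their PIXL abrasion
toy_within <- toy_filtered[toy_filtered$Distance <= max_distance, ]
```

Here the first location keeps only its point 2, and the sample 250 meters away is dropped by the distance threshold, leaving two rows.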


## 3.2 Contribution

The logistics of filtering the original data set are my work. Previously, I had to do much more filtering to choose the distance and get unique LIBS points, so that the scatterplot would not be overcrowded. Margo and I worked together to create the data set used in this section (v1_libs_to_sample.Rds) by deciding how to match LIBS samples to PIXL abrasions. Margo created the distance function that finds the distance between PIXL abrasions and LIBS samples and added that column to the data set. Doña then fixed the naming conventions in the data set so that they are consistent and make it easy to tell which data set each variable originally came from (e.g., Name.pixl, Target.libs). I then used the data to create and analyze plots. The scatterplot in Section 3.4 shows the distribution of PIXL abrasions and their corresponding LIBS samples for the specified maximum distance between them.


## 3.3 Methods Description

I chose ggplot to display the LIBS and PIXL data because it makes it easy to see how many LIBS samples align with each PIXL abrasion. It was very interesting to change the maximum distance and see which samples aligned with which abrasion. In terms of execution, it took some time to organize all the ideas Margo and I had about how to create and manage this data set. Originally, we rounded the distances to the nearest thousandth to match them and plotted on that basis. However, that left a lot of room for error and was not very accurate. A distance function lets the scientist or other person using the Mars Mission Minder App choose whatever distance they would like, which allows for much more functionality. Fixing the data set up front (for example, correcting variable types that were not what they were supposed to be) also turned out to be more efficient than making small edits while building each plot. In the end, I learned a lot about data organization: consistency and staying organized are key and save a lot of time later on.


## 3.4 Result and Discussion

To create a plot of the LIBS and PIXL data organized by which LIBS samples align with which abrasions, I first plotted the LIBS samples colored by the PIXL abrasion they were closest to, and then plotted the PIXL abrasions as red stars to show where they sit relative to the LIBS samples.

```{r }
# Plot LIBS samples and PIXL abrasions by latitude/longitude
ggplot(data = df_distance_filter) +
  # LIBS samples, colored by their closest PIXL abrasion
  geom_point(mapping = aes(x = Lon.libs, y = Lat.libs, color = Abrasion.pixl)) +
  # PIXL abrasions as red stars (shape 8). Mapping shape inside aes() is what
  # gives the PIXL layer its own legend entry; a fixed shape has no legend.
  geom_point(mapping = aes(x = Lon.pixl, y = Lat.pixl, shape = "PIXL abrasion"),
             data = unique_pixl, color = "red", size = 3) +
  scale_shape_manual(name = NULL, values = c("PIXL abrasion" = 8)) +
  geom_text_repel(mapping = aes(x = Lon.pixl, y = Lat.pixl, label = Abrasion.pixl), data = unique_pixl,
                  vjust = 2, color = "red") +
  labs(title = paste("LIBS Samples and PIXL Abrasions within", meters, "meters"),
       x = "Longitude",
       y = "Latitude",
       color = "PIXL Abrasion",  # label for the color legend
       caption = "Data collected using LIBS and PIXL instruments on Perseverance rover.\n Shows PIXL abrasions plotted as red stars,\n and the corresponding LIBS samples colored by their closest PIXL abrasion.") +
  # Left-align the caption
  theme(
    plot.caption = element_text(hjust = 0)
  )
```


## 3.5 Conclusions, Limitations, and Future Work.

I believe these findings make it easy for researchers and scientists to visualize the PIXL and LIBS samples they are interested in, based on the maximum distance they choose when examining PIXL and LIBS together. As NASA releases more coordinates and data for the LIBS and PIXL data sets, this work can continue to be built upon. Although the plot is not complicated, it provides necessary context for visualizing PIXL and LIBS together.

- Add more about limitations?

# 4.0 Finding 2: Soil Composition Analysis

Using the combined LIBS and PIXL data set, I created a plot of the composition percentages of chemical compounds such as SiO2 and K2O, using log scaling, to compare the composition of a PIXL abrasion with the compositions of the corresponding LIBS samples (the LIBS samples within a chosen distance of the PIXL abrasion). The question I was trying to answer was: how does the LIBS data for a certain area compare to the PIXL data for that area? By looking at the composition of the soil in certain locations, we can compare the PIXL abrasion and the related LIBS samples for a certain area utilizing the same data set (v1_libs_to_sample.Rds). In order to accomplish this, I started from the original LIBS and PIXL data sets, matched each LIBS target to its nearest PIXL abrasion, and pivoted the merged data into a long format for plotting.


## 4.1 Data, Code, and Resources
The data sets and code used in this work are listed below, with a brief description and the URL where each is located.

1. peterc_finalProjectF24.Rmd (with knit pdf and html) is this notebook.
[https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/peterc_finalProjectF24_roughdraft.Rmd](https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/peterc_finalProjectF24_roughdraft.Rmd)

2. peterc_assignment05.Rmd (with knit pdf and html), which is my previous notebook.
[https://github.rpi.edu/DataINCITE/DAR-Mars-F24/tree/main/StudentNotebooks/Assignment05/peterc_assignment05.Rmd](https://github.rpi.edu/DataINCITE/DAR-Mars-F24/tree/main/StudentNotebooks/Assignment05/peterc_assignment05.Rmd)

3. supercam_libs_moc_loc.Rds which is the original LIBS data given to our research group.
[https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/Data/supercam_libs_moc_loc.Rds](https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/Data/supercam_libs_moc_loc.Rds)

4. pixl_sol_coordinates.Rds, which is the data set containing the PIXL data, sol, and coordinates.
[https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentData/pixl_sol_coordinates.Rds](https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentData/pixl_sol_coordinates.Rds)

5. LIBS_training_set_quartiles.Rds is the data with Earth quartile reference data.
[https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/Data/LIBS_training_set_quartiles.Rds](https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/Data/LIBS_training_set_quartiles.Rds).

To prepare the data, we start by loading the Earth quartile reference data and the LIBS data. We then drop the standard deviation columns and the sum-of-percentages column from LIBS, leaving just the weighted compositions as numerical data. We also remove the scct targets, which are the Earth reference samples that Perseverance carries with it. They are not relevant when plotting the LIBS data, since we are focused on the Mars soil compositions.

```{r}
#Earth quartiles
earthquartiles.df<-readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/LIBS_training_set_quartiles.Rds")
#Load in LIBS data
libs.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/supercam_libs_moc_loc.Rds")
#Drop the standard deviation features, the sum of the percentages,
#the distance, and the total frequencies
libs.df <- libs.df %>%
  select(!(c(distance_mm,Tot.Em.,SiO2_stdev,TiO2_stdev,Al2O3_stdev,FeOT_stdev,
             MgO_stdev,Na2O_stdev,CaO_stdev,K2O_stdev,Total)))
# Convert the points to numeric
libs.df$point <- as.numeric(libs.df$point)
libs.df[,6:13] <- sapply(libs.df[,6:13],as.numeric)
#remove the scct/reference samples
libs.df<-libs.df%>%
  filter(!(grepl("scct", target)))
#add a column to indicate the nearest pixl
libs.df<-cbind(nearestpixl=0,libs.df)
#make a dataframe of just the LIBS Lat/Long and target name and remove duplicates
libstargets.df<-libs.df[,c(1,3,4,5)]
libstargets.df<-distinct(libstargets.df)
```

Next, we set the maximum distance (`meters`) and the chosen abrasion (`abrasion_name`). These two variables stand in for the slider controls in the 2D app.
```{r}
#Choose max distance variable between PIXL and LIBS data
meters = 100
#Choose PIXL abrasion you want to look at
abrasion_name = "Berry Hollow"
```

Next, we load in the PIXL data. We remove the atmospheric sample and only select one PIXL sample of each abrasion.
```{r, data02}
#read in pixl data with lat/long
pixl.df<-readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/StudentData/pixl_sol_coordinates.Rds")
#include only pixl metadata
pixl.df<-pixl.df %>%
  select(c(1,2,19,20,22))
#convert Lat/Long to numeric
pixl.df$Lat <- as.numeric(pixl.df$Lat)
pixl.df$Long <- as.numeric(pixl.df$Long)
#remove rows so we only have one sample per abrasion and remove atmospheric sample
pixl.df<-pixl.df[c(2,4,6,8,10,12,14,16),]
```

Next, we initialize a `Distance` column (the distance between a LIBS target and its nearest PIXL abrasion) and one column per PIXL abrasion. The loop in the next chunk fills each per-abrasion column with the distance from the LIBS target to that abrasion, which is then used to mark which PIXL abrasion the LIBS target is closest to.
```{r}
libstargets.df<-cbind(libstargets.df,"Distance"=0,"Bellegrade"=0,"Dourbes"=0,"Quartier"=0,"Alfalfa"=0,"ThorntonGap"=0,"Berry Hollow"=0,"Novarupta"=0,"Uganik Island"=0)
```

The loop below calculates the distance between each LIBS target and every PIXL abrasion, then takes the smallest of those distances to pick the closest PIXL abrasion for that LIBS target.
```{r}
for(i in 1:nrow(libstargets.df)) {
    libstargets.df[i,c(6:13)]<-c(distHaversine(pixl.df[1,c(1,2)],libstargets.df[i,c(2,3)],r=3393169),
                                 distHaversine(pixl.df[2,c(1,2)],libstargets.df[i,c(2,3)],r=3393169),
                                 distHaversine(pixl.df[3,c(1,2)],libstargets.df[i,c(2,3)],r=3393169),
                                 distHaversine(pixl.df[4,c(1,2)],libstargets.df[i,c(2,3)],r=3393169),
                                 distHaversine(pixl.df[5,c(1,2)],libstargets.df[i,c(2,3)],r=3393169),
                                 distHaversine(pixl.df[6,c(1,2)],libstargets.df[i,c(2,3)],r=3393169),
                                 distHaversine(pixl.df[7,c(1,2)],libstargets.df[i,c(2,3)],r=3393169),
                                 distHaversine(pixl.df[8,c(1,2)],libstargets.df[i,c(2,3)],r=3393169))

    libstargets.df[i,1]<-which.min(libstargets.df[i,c(6:13)])
    libstargets.df[i,5]<-min(libstargets.df[i,c(6:13)])
}
libstargets.df$nearestpixl<-as.factor(libstargets.df$nearestpixl)
levels(libstargets.df$nearestpixl)<-(c("Bellegrade","Dourbes","Quartier","Alfalfa","ThorntonGap","Berry Hollow","Novarupta","Uganik Island"))
```
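The row-by-row loop above can also be expressed as one distance matrix followed by a row-wise minimum. The sketch below shows that idea in base R on made-up coordinates; `haversine_m`, `abrasions`, and `targets` are hypothetical names, and the radius reuses the `r = 3393169` value from the loop. One thing worth double-checking in the real code: `geosphere::distHaversine` documents its points as longitude first and latitude second, so the order of the columns passed to it matters.

```r
# Hypothetical helper: haversine distance in meters, inputs in decimal degrees
haversine_m <- function(lat1, lon1, lat2, lon2, r = 3393169) {
  to_rad <- pi / 180
  a <- sin((lat2 - lat1) * to_rad / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin((lon2 - lon1) * to_rad / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Made-up coordinates for illustration only
abrasions <- data.frame(name = c("A", "B"),
                        lat = c(18.440, 18.450), lon = c(77.450, 77.460),
                        stringsAsFactors = FALSE)
targets <- data.frame(target = c("t1", "t2", "t3"),
                      lat = c(18.4401, 18.4499, 18.4420),
                      lon = c(77.4501, 77.4599, 77.4520),
                      stringsAsFactors = FALSE)

# Distance matrix: one row per LIBS target, one column per PIXL abrasion
dist_matrix <- outer(
  seq_len(nrow(targets)), seq_len(nrow(abrasions)),
  function(i, j) haversine_m(targets$lat[i], targets$lon[i],
                             abrasions$lat[j], abrasions$lon[j])
)

# Column index of the smallest distance in each row, then name and distance
nearest_index <- apply(dist_matrix, 1, which.min)
targets$nearest <- abrasions$name[nearest_index]
targets$distance_m <- dist_matrix[cbind(seq_len(nrow(targets)), nearest_index)]
```

Because the abrasion names come straight from the `abrasions` data frame, this form also avoids typing the list of names a second time when labeling the result.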

The chunk below builds, for each PIXL abrasion, a vector of the names of the LIBS targets that are closest to that abrasion.
```{r}
Bellegrade<-libstargets.df[libstargets.df$nearestpixl=="Bellegrade",]$target
Dourbes<-libstargets.df[libstargets.df$nearestpixl=="Dourbes",]$target
Quartier<-libstargets.df[libstargets.df$nearestpixl=="Quartier",]$target
Alfalfa<-libstargets.df[libstargets.df$nearestpixl=="Alfalfa",]$target
ThorntonGap<-libstargets.df[libstargets.df$nearestpixl=="ThorntonGap",]$target
BerryHollow<-libstargets.df[libstargets.df$nearestpixl=="Berry Hollow",]$target
Novarupta<-libstargets.df[libstargets.df$nearestpixl=="Novarupta",]$target
UganikIsland<-libstargets.df[libstargets.df$nearestpixl=="Uganik Island",]$target
```

Next, we drop the LIBS targets that are not within the specified distance. We then attach the matching PIXL abrasion to the LIBS data by adding an `Abrasion` column that holds the name of the abrasion closest to each LIBS target. We also add a `libsorpixl` column, which denotes whether a row comes from the LIBS data (1) or the PIXL data (0).
```{r}
included.libs<-(libstargets.df%>%
  filter(Distance<meters))$target
libs.matrix <-libs.df %>%
  filter(target %in% included.libs)
libs.matrix <- libs.matrix[,c(5,7:14)]
libs.matrix<-libs.matrix[,c(1:2,4:9,3)]
libs.matrix<-cbind("Abrasion"=0,libs.matrix)
libs.matrix<-libs.matrix%>%
  mutate(Abrasion = ifelse(target%in%Alfalfa,"Alfalfa",
                    ifelse(target %in% Bellegrade, "Bellegrade",
                    ifelse(target %in% BerryHollow, "Berry Hollow",
                    ifelse(target %in% Dourbes, "Dourbes",
                    ifelse(target %in% Novarupta, "Novarupta",
                    ifelse(target %in% Quartier, "Quartier",
                    ifelse(target %in% ThorntonGap, "ThorntonGap",
                    ifelse(target %in% UganikIsland, "Uganik Island",Abrasion)))))))))
libs.matrix<-cbind(libsorpixl=1,libs.matrix)
```
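The nested `ifelse()` chain above is effectively a lookup from LIBS target name to nearest abrasion. A lookup with `match()` is one alternative way to express the same step; the sketch below uses made-up target names and values, and `nearest_lookup` and `libs_rows` are hypothetical stand-ins for `libstargets.df` and `libs.matrix`.

```r
# Hypothetical lookup table: one row per LIBS target with its nearest abrasion
nearest_lookup <- data.frame(
  target      = c("t1", "t2", "t3"),
  nearestpixl = c("Alfalfa", "Dourbes", "Alfalfa"),
  stringsAsFactors = FALSE
)

# Toy LIBS rows; a target can appear several times (one row per laser point)
libs_rows <- data.frame(
  target = c("t1", "t1", "t3", "t2"),
  SiO2   = c(45.1, 44.8, 50.2, 47.3),
  stringsAsFactors = FALSE
)

# match() finds each row's target in the lookup table; indexing with the
# result copies the abrasion name onto every row of that target
libs_rows$Abrasion <- nearest_lookup$nearestpixl[
  match(libs_rows$target, nearest_lookup$target)
]
```

With a lookup, each abrasion name is written once in the table rather than retyped as a string inside the chain, which leaves less room for spelling differences between labels.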

Next, we read in the PIXL data again, this time keeping the composition columns. We remove the atmospheric sample (the first row) and reorder the columns so that they match the layout of the LIBS data.

```{r, data03}
#read in pixl data with lat/long
pixl.df<-readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/StudentData/pixl_sol_coordinates.Rds")
pixl.df<-pixl.df %>%
  select(c(5:8,12:14,17,19,18,22))
#reorder pixl columns so that it matches libs data organization
pixl.df<-pixl.df[,c(11,10,4,3,8,2,6,1,5,7)]
#remove atmospheric sample
pixl.df<-pixl.df[2:16,]
pixl.df<-cbind(libsorpixl=0,pixl.df)
```

Finally, we merge the LIBS and PIXL data sets we have modified thus far for a combined LIBS and PIXL data frame suitable for a soil composition line plot.
```{r}
colnames(pixl.df)<-colnames(libs.matrix)
pixllibs.df<-rbind(pixl.df,libs.matrix)
```

## 4.2 Contribution

Some of the data manipulation, such as the distance function, is Margo's work. Pivoting the data frame and the other preprocessing steps are my own work, as is the manipulation below used to produce the soil composition line plots.

## 4.3 Methods Description

When deciding how to build soil composition plots of each PIXL abrasion and the corresponding LIBS targets within a maximum distance, I decided the best way was to start with the original data sets and modify them as needed. For the plot itself, the data needs to be pivoted: the x axis should be the current column names of the data frame (SiO2 and the other compounds) and the y axis should be the weighted composition values. We also need an indicator of whether each row comes from PIXL or LIBS, which is used to color the lines.

Users set the distance variable to choose the maximum distance between PIXL abrasions and LIBS targets. This can greatly change the number of lines on the plot, which helps prevent overcrowding. Users can also set a variable to choose a specific PIXL abrasion and its corresponding LIBS targets. A single abrasion is easier to interpret, because plotting all of the LIBS and PIXL compositions at once produces very dense graphs that are hard to read.

## 4.4 Result and Discussion

First, we pivot the Earth quartile information into a long data frame, so that each compound column becomes a set of rows with one compound name and one percentage value per row.
```{r}
# Earth quartiles
earthquartiles_long <- earthquartiles.df %>%
  pivot_longer(cols = starts_with("SiO2"):last_col(), names_to = "Compound", values_to = "Percentage")

earthquartiles_long <- earthquartiles_long %>% rename(Quartiles = `Training set Quartiles`)
```

Then, the data will be filtered to only include the data from a specific PIXL abrasion chosen by the user. The data is pivoted into a long format, and the columns are reordered to mimic similar plots from NASA papers.
```{r}
# Filter for the specific abrasion sample, e.g., "Alfalfa"
pixllibs_filtered <- pixllibs.df %>%
  filter(Abrasion == abrasion_name)

# Pivot the data to longer format for ggplot
pixllibs_long <- pixllibs_filtered %>%
  pivot_longer(cols = starts_with("SiO2"):last_col(), names_to = "Compound", values_to = "Percentage")

desired_order <- c("SiO2", "Al2O3", "FeOT", "MgO", "CaO", "Na2O", "K2O", "TiO2")  # Specify your custom order here
pixllibs_long$Compound <- factor(pixllibs_long$Compound, levels = desired_order)
```
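As a small illustration of what the pivot does, the sketch below reshapes a two-row toy data frame from wide to long. It is written in base R so that it runs without the tidyverse, which means it is an equivalent of the `pivot_longer()` call rather than the call itself; the data values are made up.

```r
# Toy wide data: one row per target, one column per compound (invented values)
wide <- data.frame(
  target = c("t1", "t2"),
  SiO2   = c(45, 50),
  MgO    = c(8, 6),
  stringsAsFactors = FALSE
)
compounds <- c("SiO2", "MgO")

# Long form: one row per (target, compound) pair
long <- data.frame(
  target     = rep(wide$target, times = length(compounds)),
  Compound   = rep(compounds, each = nrow(wide)),
  Percentage = unlist(wide[compounds], use.names = FALSE),
  stringsAsFactors = FALSE
)

# A factor with explicit levels fixes the left-to-right order on the x axis
long$Compound <- factor(long$Compound, levels = compounds)
```

The long form has four rows here (two targets times two compounds), which is the shape ggplot needs to draw one line per target across the compounds.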

For the plot, we use ggplot to plot the pixllibs_long data frame we created. The plot is colored by if the line is a PIXL abrasion's composition or a LIBS target's composition. We also add a layer with the earth quartile information, which is the dotted lines.
```{r}
# Map the PIXL/LIBS column to color and use target_name to differentiate lines
ggplot(pixllibs_long, aes(x = Compound, y = Percentage, color = as.factor(libsorpixl), group = target)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(trans='log10') +
  # Add Earth quartile lines using earthquartiles_long
  geom_line(data = earthquartiles_long, aes(x = Compound, y = Percentage, linetype = Quartiles, group = Quartiles),
            color = "black", linetype = "dotted") +
  labs(title = paste("Soil Composition for PIXL abrasion",abrasion_name,"and LIBS within", meters, "meters", sep = " "),
       x = "Chemical Compound",
       y = "Weight Percentage",
       color = "Measurement Type",
       linetype = "Earth Quartiles") +
  scale_color_manual(values = c("0" = "blue", "1" = "red"), labels = c("PIXL", "LIBS")) +
  theme_minimal()
```
I still plan to try the ggplotlay feature to put all the abrasions and data onto one grid of line plots. I also plan to add plots that take the mean of all the LIBS targets corresponding to a PIXL abrasion, so that each plot has only one LIBS line and one PIXL line (along with the references); the current version is just for the draft. I also need to include only certain quartiles and label them, since the quartile layer is a placeholder from a previous plot. I may also add the SCCT values as references, although I am not sure how relevant they are or how to sort them.
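One possible way to get the planned single LIBS line per abrasion is to average the percentages within each measurement type and compound before plotting. The sketch below shows that aggregation in base R on made-up numbers; the column names follow `pixllibs_long`, but `toy_long` and `means` are hypothetical and this is not an implemented part of the notebook.

```r
# Toy long-format data: libsorpixl is 1 for LIBS rows and 0 for PIXL rows
toy_long <- data.frame(
  libsorpixl = c(1, 1, 1, 1, 0, 0),
  Compound   = c("SiO2", "SiO2", "MgO", "MgO", "SiO2", "MgO"),
  Percentage = c(44, 46, 7, 9, 48, 6),
  stringsAsFactors = FALSE
)

# Mean percentage for every (measurement type, compound) combination
means <- aggregate(Percentage ~ libsorpixl + Compound, data = toy_long, FUN = mean)
```

Plotting `means` with `group = libsorpixl` would then give one averaged LIBS line and one PIXL line.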

## 4.5 Conclusions and Future Work
This finding can be used by geologists to analyze what the different soil compositions around different PIXL abrasions could mean for life on Mars. Oxide presence does not necessarily indicate life, but it could point to biological or chemical processes. For example, CaO can indicate the presence of old biological material such as shells or fossils.

- Add more about future work, not sure what else to include

# 5.0 Finding 3: Analyzing Cation Combinations using LIBS and PIXL matched data
Using the combined LIBS and PIXL data set, we created a ternary plot to show the distribution of LIBS samples, grouped by the PIXL abrasion they are closest to (based on a chosen distance variable). Much of the data preprocessing is the same as in Finding 2 and is repeated here. The goal, however, is to analyze how different groups of LIBS samples (colored by matching PIXL abrasion) differ in cation composition.
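A ternary plot places each sample according to three components. The sketch below computes the three cation groupings used later in this section (SiO2 + Al2O3, FeOT + MgO, and CaO + Na2O + K2O, each divided by 100) for two made-up samples, and then rescales them so that the three parts of each sample sum to 1. The rescaling step is an assumption added for illustration; whether the plotting package rescales internally is not assumed here.

```r
# Made-up oxide weight percentages for two samples
oxides <- data.frame(
  SiO2 = c(45, 50), Al2O3 = c(10, 12), FeOT = c(18, 15), MgO = c(8, 6),
  CaO  = c(7, 5),   Na2O  = c(3, 4),   K2O  = c(1, 2)
)

# The three cation groupings, as fractions of the total sample
x <- (oxides$SiO2 + oxides$Al2O3) / 100
y <- (oxides$FeOT + oxides$MgO) / 100
z <- (oxides$CaO + oxides$Na2O + oxides$K2O) / 100

# Rescale so each sample's three parts sum to 1 (illustrative assumption)
total <- x + y + z
tern_closed <- data.frame(x = x / total, y = y / total, z = z / total)
```

After rescaling, each row is a point inside the triangle, and differences between groups of LIBS samples show up as shifts toward one of the three corners.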

## 5.1 Data, Code, and Resources
The data sets and code used in this work are listed below, with a brief description and the URL where each is located.

1. peterc_finalProjectF24.Rmd (with knit pdf and html) is this notebook.
[https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/peterc_finalProjectF24_roughdraft.Rmd](https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/peterc_finalProjectF24_roughdraft.Rmd)

2. supercam_libs_moc_loc.Rds which is the original LIBS data given to our research group.
[https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/Data/supercam_libs_moc_loc.Rds](https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/Data/supercam_libs_moc_loc.Rds)


First, we set a distance variable which can be used as a slider bar in the app. Changing this variable sets the maximum distance between a PIXL target and LIBS sample for them to be classified together.
```{r}
#set distance variable which can be used as a toggle tool
distance=50
```

To prepare the data, we start by loading the LIBS data. We then drop the standard deviation columns and the sum-of-percentages column, leaving just the weighted compositions as numerical data. We also remove the scct targets, which are the Earth reference samples that Perseverance carries with it. They are not relevant here, since we are focused on the cation combinations and therefore only need the weighted compositions of the Mars samples.
```{r}
#Load in LIBS data
libs.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/supercam_libs_moc_loc.Rds")
#Drop the standard deviation features, the sum of the percentages,
#the distance, and the total frequencies
libs.df <- libs.df %>%
  select(!(c(distance_mm,Tot.Em.,SiO2_stdev,TiO2_stdev,Al2O3_stdev,FeOT_stdev,
             MgO_stdev,Na2O_stdev,CaO_stdev,K2O_stdev,Total)))
# Convert the points to numeric
libs.df$point <- as.numeric(libs.df$point)
libs.df[,6:13] <- sapply(libs.df[,6:13],as.numeric)
#remove the scct/reference samples
libs.df<-libs.df%>%
  filter(!(grepl("scct", target)))
#add a column to indicate the nearest pixl
libs.df<-cbind(nearestpixl=0,libs.df)
#make a dataframe of just the LIBS Lat/Long and target name and remove duplicates
libstargets.df<-libs.df[,c(1,3,4,5)]
libstargets.df<-distinct(libstargets.df)
```

Next, we load in the PIXL data. We remove the atmospheric sample and keep only one PIXL sample per abrasion.
```{r}
#read in pixl data with lat/long
pixl.df<-readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/StudentData/pixl_sol_coordinates.Rds")
#include only pixl metadata
pixl.df<-pixl.df %>%
  select(c(1,2,19,20,22))
#convert Lat/Long to numeric
pixl.df$Lat <- as.numeric(pixl.df$Lat)
pixl.df$Long <- as.numeric(pixl.df$Long)
#remove rows so we only have one sample per abrasion and remove atmospheric sample
pixl.df<-pixl.df[c(2,4,6,8,10,12,14,16),]
```

Next, we initialize a `Distance` column (the distance between a LIBS target and its nearest PIXL abrasion) and one column per PIXL abrasion. The loop in the next chunk fills each per-abrasion column with the distance from the LIBS target to that abrasion, which is then used to mark which PIXL abrasion the LIBS target is closest to.
```{r}
#LIBS target data frame with distance variable as well
libstargets.df<-cbind(libstargets.df,"Distance"=0,"Bellegrade"=0,"Dourbes"=0,"Quartier"=0,"Alfalfa"=0,"ThorntonGap"=0,"BerryHollow"=0,"Novarupta"=0,"UganikIsland"=0)
```

The loop below calculates the distance between each LIBS target and every PIXL abrasion, then takes the smallest of those distances to pick the closest PIXL abrasion for that LIBS target.
```{r}
#Distance function
for(i in 1:nrow(libstargets.df)) {
    libstargets.df[i,c(6:13)]<-c(distHaversine(pixl.df[1,c(1,2)],libstargets.df[i,c(2,3)],r=3393169),
                                 distHaversine(pixl.df[2,c(1,2)],libstargets.df[i,c(2,3)],r=3393169),
                                 distHaversine(pixl.df[3,c(1,2)],libstargets.df[i,c(2,3)],r=3393169),
                                 distHaversine(pixl.df[4,c(1,2)],libstargets.df[i,c(2,3)],r=3393169),
                                 distHaversine(pixl.df[5,c(1,2)],libstargets.df[i,c(2,3)],r=3393169),
                                 distHaversine(pixl.df[6,c(1,2)],libstargets.df[i,c(2,3)],r=3393169),
                                 distHaversine(pixl.df[7,c(1,2)],libstargets.df[i,c(2,3)],r=3393169),
                                 distHaversine(pixl.df[8,c(1,2)],libstargets.df[i,c(2,3)],r=3393169))

    libstargets.df[i,1]<-which.min(libstargets.df[i,c(6:13)])
    libstargets.df[i,5]<-min(libstargets.df[i,c(6:13)])
}
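#label the nearest-abrasion index (1-8) by name; explicit levels keep the labels aligned even if an abrasion is never the nearest
libstargets.df$nearestpixl<-factor(libstargets.df$nearestpixl,levels=1:8,
                                   labels=c("Bellegrade","Dourbes","Quartier","Alfalfa","ThorntonGap","BerryHollow","Novarupta","UganikIsland"))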
```
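
For reference, the haversine distance that `distHaversine` computes can be written out by hand. The sketch below is a base-R reimplementation on made-up coordinates, not the notebook's code: the helper `haversine_m`, the toy data frames, and the column names `lon` and `lat` are all assumptions for illustration. Note that `distHaversine` expects points in (longitude, latitude) order, so it is worth confirming that the columns passed in the loop above are in that order.

```r
# Haversine great-circle distance in meters (base R sketch).
# lon/lat are in degrees; r is the sphere radius in meters.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 3393169) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Made-up abrasion and target coordinates (not real PIXL/LIBS values)
toy.pixl <- data.frame(name = c("A", "B", "C"),
                       lon = c(77.40, 77.41, 77.45),
                       lat = c(18.43, 18.44, 18.46),
                       stringsAsFactors = FALSE)
toy.libs <- data.frame(target = c("t1", "t2"),
                       lon = c(77.401, 77.449),
                       lat = c(18.431, 18.459),
                       stringsAsFactors = FALSE)

# Distance matrix: rows = targets, columns = abrasions
toy.dist <- outer(seq_len(nrow(toy.libs)), seq_len(nrow(toy.pixl)),
                  function(i, j) haversine_m(toy.libs$lon[i], toy.libs$lat[i],
                                             toy.pixl$lon[j], toy.pixl$lat[j]))
colnames(toy.dist) <- toy.pixl$name

# Nearest abrasion and its distance for every target, without a row loop
toy.libs$nearest <- toy.pixl$name[apply(toy.dist, 1, which.min)]
toy.libs$Distance <- apply(toy.dist, 1, min)
```

Building the full distance matrix first keeps the nearest-abrasion step independent of how many abrasions there are, so nothing needs to change if more abrasions are added to the data set.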

Next, for each PIXL abrasion we store the names of the LIBS targets that are closest to it. These vectors are used later to label each LIBS point with its abrasion.
```{r}
#Sets each nearest PIXL variable for future use in deciding which target is closest to a LIBS sample
Bellegrade<-libstargets.df[libstargets.df$nearestpixl=="Bellegrade",]$target
Dourbes<-libstargets.df[libstargets.df$nearestpixl=="Dourbes",]$target
Quartier<-libstargets.df[libstargets.df$nearestpixl=="Quartier",]$target
Alfalfa<-libstargets.df[libstargets.df$nearestpixl=="Alfalfa",]$target
ThorntonGap<-libstargets.df[libstargets.df$nearestpixl=="ThorntonGap",]$target
BerryHollow<-libstargets.df[libstargets.df$nearestpixl=="BerryHollow",]$target
Novarupta<-libstargets.df[libstargets.df$nearestpixl=="Novarupta",]$target
UganikIsland<-libstargets.df[libstargets.df$nearestpixl=="UganikIsland",]$target
```
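
The eight assignments above follow one pattern, so they could also be produced in a single step. A minimal sketch on a made-up data frame (the `toy.targets` object and its values are hypothetical; only the column names `target` and `nearestpixl` mirror the notebook):

```r
# Made-up targets with their nearest abrasion (not real LIBS data)
toy.targets <- data.frame(
  target = c("t1", "t2", "t3", "t4"),
  nearestpixl = factor(c("Dourbes", "Alfalfa", "Dourbes", "Quartier"),
                       levels = c("Dourbes", "Quartier", "Alfalfa", "Novarupta")),
  stringsAsFactors = FALSE
)

# One character vector of target names per abrasion. Abrasions with no
# nearby targets come back as empty vectors because split() keeps all
# factor levels by default.
targets.by.abrasion <- split(toy.targets$target, toy.targets$nearestpixl)

targets.by.abrasion$Dourbes
targets.by.abrasion[["Novarupta"]]
```

A named list like this can be indexed by abrasion name, which avoids keeping eight separate variables in sync.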

Next, we keep only the LIBS targets whose distance to the nearest PIXL abrasion is below the chosen cutoff, `meters`. We then build `libs.tern`, the data frame used for the ternary plot, by grouping the oxides into three components (Si+Al, Fe+Mg, and Ca+Na+K), and add an `Abrasion` column that holds the name of the abrasion closest to each LIBS target.
```{r}
included.libs<-(libstargets.df%>%
  filter(Distance<meters))$target
libs.matrix <-libs.df %>%
  filter(target %in% included.libs)
#set LIBS matrix and ternary plot by adding in cation components and mutating
libs.matrix <- libs.matrix[,c(5,7:14)]
libs.tern <- as.data.frame(libs.matrix) %>%
  mutate(x=(SiO2+Al2O3)/100,y=(FeOT+MgO)/100,z=(CaO+Na2O+K2O)/100) %>%
  select(-c(SiO2,Al2O3,FeOT,MgO,CaO,Na2O,K2O,TiO2))
libs.tern<-cbind("Abrasion"=0,libs.tern)
#Set what abrasion goes with the respective LIBS sample it matches with
libs.tern<-libs.tern%>%
  mutate(Abrasion = ifelse(target%in%Alfalfa,"Alfalfa",
                    ifelse(target %in% Bellegrade, "Bellegrade",
                    ifelse(target %in% BerryHollow, "BerryHollow",
                    ifelse(target %in% Dourbes, "Dourbes",
                    ifelse(target %in% Novarupta, "Novarupta",
                    ifelse(target %in% Quartier, "Quartier",
                    ifelse(target %in% ThorntonGap, "ThorntonGap",
                    ifelse(target %in% UganikIsland, "UganikIsland",Abrasion)))))))))
#summary of LIBS data including distance parameter, number of LIBS targets, and number of LIBS points
kabledf<-rbind("Distance (m)"=meters,"Targets"=length(included.libs),"Points"=nrow(libs.tern))
kable(kabledf, caption ="LIBS # of Targets and Points within Specified Distance")
```
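
A ternary plot only uses the relative proportions of the three components, so `x`, `y` and `z` do not need to sum to 1 before plotting. The sketch below makes that normalization explicit on made-up oxide values. The `toy.oxides` data frame is hypothetical and its numbers are not real LIBS measurements; only the oxide column names and the three groupings follow the chunk above.

```r
# Made-up oxide values for two targets (not real LIBS measurements)
toy.oxides <- data.frame(
  target = c("t1", "t2"),
  SiO2 = c(45, 50), Al2O3 = c(10, 5), FeOT = c(20, 15), MgO = c(10, 20),
  CaO = c(5, 4), Na2O = c(3, 2), K2O = c(1, 0.5),
  stringsAsFactors = FALSE
)

# Same three groupings as in the notebook
toy.tern <- with(toy.oxides, data.frame(
  target = target,
  x = (SiO2 + Al2O3) / 100,
  y = (FeOT + MgO) / 100,
  z = (CaO + Na2O + K2O) / 100,
  stringsAsFactors = FALSE
))

# Rescale each row so the three components sum to 1
toy.total <- rowSums(toy.tern[, c("x", "y", "z")])
toy.tern[, c("x", "y", "z")] <- toy.tern[, c("x", "y", "z")] / toy.total
```

Checking the row totals before rescaling is also a quick way to spot points where the listed oxides account for much less than 100 percent of the composition.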


## 5.2 Contribution

This work was a joint effort between Margo and me. We developed the approach together, and it relies on the PIXL latitude and longitude data set that we created.

- Add more here
- As in the other two sections, Margo and I worked on most of this together; details to be added later

## 5.3 Methods Description

- First, we filtered the original data to remove the SCCT values.
- Next, we built the ternary data frame by grouping the cation compositions into three components.
- Finally, we labeled each LIBS target with its closest PIXL abrasion, so that coloring by abrasion produces a form of clustering.
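
The labeling step in the last bullet amounts to a lookup from target name to abrasion name. A minimal sketch of that idea using `match()`, on made-up tables (`toy.lookup`, `toy.points`, and their values are hypothetical and are not the notebook's data):

```r
# Target-level table: one row per LIBS target with its nearest abrasion
toy.lookup <- data.frame(target = c("t1", "t2", "t3"),
                         nearestpixl = c("Alfalfa", "Dourbes", "Alfalfa"),
                         stringsAsFactors = FALSE)

# Point-level table: several LIBS points per target
toy.points <- data.frame(target = c("t1", "t1", "t2", "t3", "t3"),
                         x = c(0.60, 0.58, 0.50, 0.62, 0.61),
                         stringsAsFactors = FALSE)

# match() finds each point's row in the lookup table, so every point
# inherits the abrasion assigned to its target
toy.points$Abrasion <- toy.lookup$nearestpixl[match(toy.points$target, toy.lookup$target)]
```

A lookup of this kind gives the same labels as a chain of nested `ifelse()` calls, but the abrasion names are written only once, which reduces the chance of a spelling mismatch.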

## 5.4 Result and Discussion
With the data prepared, we draw the ternary plot with `ggtern`, coloring the points by abrasion to show how composition is distributed across abrasions. This should help us see how the abrasions do or do not relate to one another. The maximum allowed distance between a PIXL abrasion and a LIBS target can be changed by adjusting `meters`.
```{r}
#ternary plot of LIBS points, colored by nearest PIXL abrasion
ggtern(libs.tern, ggtern::aes(x=x,y=y,z=z)) +
  #alpha is a fixed value, so it is set outside aes() and needs no legend
  geom_point(aes(color=Abrasion),alpha=0.5) +
  theme_rgbw() +
  labs(title=paste("Mars LIBS Data Within",meters,"meters of PIXL",sep=" "),
       x="Si+Al",
       y="Fe+Mg",
       z="Ca+Na+K") +
  theme(legend.position="right")
```
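
A numeric summary can complement the visual reading of the plot. The sketch below computes the mean ternary coordinates per abrasion with base R's `aggregate`. The `toy.tern.pts` data frame and its values are made up for illustration; only the column names follow `libs.tern`.

```r
# Made-up ternary coordinates for two abrasions (not real results)
toy.tern.pts <- data.frame(
  Abrasion = c("Alfalfa", "Alfalfa", "Dourbes", "Dourbes"),
  x = c(0.60, 0.64, 0.50, 0.54),
  y = c(0.30, 0.26, 0.40, 0.38),
  z = c(0.10, 0.10, 0.10, 0.08),
  stringsAsFactors = FALSE
)

# Mean of each ternary component within each abrasion
toy.means <- aggregate(cbind(x, y, z) ~ Abrasion, data = toy.tern.pts, FUN = mean)
toy.means
```

Applied to the real data, a table like this would show whether differences that appear in the plot also appear in the per-abrasion averages.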

## 5.5 Conclusions and Future Work
Based on this ternary plot, Alfalfa and Bellegrade are higher in Si+Al, and UganikIsland is an outlier. UganikIsland was the last PIXL abrasion in the data set and has only one sample, whereas every other abrasion has two. It is included here, but until the data set is updated there is not enough context to explain why it is so different. My assumption is that the rover's path placed the UganikIsland abrasion in a location very different from the other seven abrasions.

- Future work would include more data to gain broader context.

# Bibliography
Provide a listing of references and other sources.

* Analyst's Notebook
* Not sure what else is relevant, most of what I used was similar to Analyst's Notebook

# Appendix

*Include here whatever you think is relevant to support the main content of your notebook. For example, you may have only included example figures above in your main text but include additional ones here. Or you may have done a more extensive investigation, and want to put more results here to document your work in the semester. Be sure to divide the appendix into appropriate sections and make the contents clear to the reader using approaches discussed above.*
Should I add more examples here of soil composition plots of different abrasions? I also could add heat map analysis to this notebook as a finding but wasn't sure if it was really relevant.