diff --git a/StudentNotebooks/Assignment05/wangx53-assignment5.Rmd b/StudentNotebooks/Assignment05/wangx53-assignment5.Rmd new file mode 100644 index 0000000..b0c6acb --- /dev/null +++ b/StudentNotebooks/Assignment05/wangx53-assignment5.Rmd @@ -0,0 +1,976 @@ +--- +title: "DAR F24 Assignment 5 Notebook" +author: "Xuanting Wang RIN :662016667" +date: "`r Sys.Date()`" +output: + pdf_document: + toc: true + latex_engine: xelatex + word_document: + toc: true + html_document: + toc: true +subtitle: 'DAR Project Name: Mars' + +--- + + +## PIXL Data Analysis + +1. **Classification by Cation Groups**: + - **Cation Composition Counts**: Most PIXL samples were classified as "Si-Al rich" (11 samples), followed by "Fe-Mg rich" (5 samples). Si and Al therefore predominate in most samples. + +2. **ANOVA Results for Cation Groups by Campaign**: + - **Si_Al**: The ANOVA test for `Si_Al` showed a significant difference across campaigns (\(p = 0.0014\)), indicating that the `Si_Al` composition varies meaningfully between campaigns. + - **Fe_Mg**: The `Fe_Mg` composition did not show significant variation across campaigns (\(p = 0.0791\)), suggesting similar levels of Fe and Mg in the different campaigns. + - **Ca_Na_K**: For `Ca_Na_K`, a significant difference was found across campaigns (\(p = 0.0136\)), indicating some compositional variance based on campaign location. + +3. **Density Plots of Cation Groups**: + - The density plots of `Si_Al`, `Fe_Mg`, and `Ca_Na_K` highlighted the distribution patterns within campaigns for PIXL data, showing that `Si_Al` tends to dominate, followed by `Fe_Mg`, with the lowest concentrations in `Ca_Na_K`. + +4. **Dunn's Post-Hoc Test**: + - **Si_Al**: Significant differences were found between Crater Floor and Delta Front (\(p = 0.0004\)), consistent with the ANOVA finding that `Si_Al` composition varies across campaigns. 
+ - **Fe_Mg**: A marginally significant difference (\(p = 0.0491\)) between Crater Floor and Delta Front suggests a moderate difference in `Fe_Mg`. + - **Ca_Na_K**: `Ca_Na_K` showed a clearer difference (\(p = 0.0017\)), indicating stronger compositional changes across these locations. + +5. **Single-Sample t-Test**: + - **Si_Al**: The mean concentration of `Si_Al` was significantly greater than the hypothetical value of 10 (\(p < 0.001\)), with a mean around 43.63. + - **Fe_Mg**: Similarly, `Fe_Mg` was significantly higher than the benchmark of 10 (\(p < 0.001\)), averaging 33.13. + - **Ca_Na_K**: The mean `Ca_Na_K` concentration was significantly lower than 10 (\(p = 0.0049\)), at an average of 6.94. + +6. **Logistic Regression**: + - Logistic regression with `Si_Al`, `Fe_Mg`, and `Ca_Na_K` as predictors did not yield statistically significant results, indicating limited predictive power in differentiating campaigns based on these cation groups in PIXL data. + +--- + +## LIBS Data Analysis + +1. **Classification by Cation Groups**: + - **Cation Composition Counts**: For LIBS, `Si-Al rich` samples were predominant (1257 samples), followed by `Fe-Mg rich` (645 samples) and a smaller number of `Ca-Na-K rich` samples (30), demonstrating a general trend toward higher Si and Al compositions. + +2. **ANOVA Results for Cation Groups by Campaign**: + - **Si_Al**: The ANOVA for `Si_Al` by campaign was highly significant (\(p < 0.0001\)), indicating substantial variation between campaigns, especially between Campaign 3 and the others. + - **Fe_Mg**: This group also showed significant differences (\(p < 0.0001\)), suggesting that Fe and Mg levels vary notably by campaign. + - **Ca_Na_K**: Similar to the other cation groups, `Ca_Na_K` showed significant differences across campaigns (\(p < 0.0001\)), with Campaign 3 distinctly lower than Campaigns 1 and 2. + +3. 
**Density Plots of Cation Groups**: + - Density plots of `Si_Al`, `Fe_Mg`, and `Ca_Na_K` for LIBS data revealed high densities for `Si_Al` and moderate densities for `Fe_Mg`, with lower densities in `Ca_Na_K` across campaigns, aligning with the observed classification trends. + +4. **Dunn's Post-Hoc Test**: + - **Si_Al**: Dunn’s test indicated significant differences between Campaign 3 and both Campaigns 1 and 2 (\(p < 0.001\)), corroborating the ANOVA results. + - **Fe_Mg**: The test also highlighted significant differences between all campaigns for `Fe_Mg` (\(p < 0.001\)), supporting variability across locations. + - **Ca_Na_K**: Similar patterns were observed with significant differences across all campaign comparisons (\(p < 0.001\)), indicating that `Ca_Na_K` concentrations are not consistent across campaigns in the LIBS data. + +5. **Single-Sample t-Test**: + - **Si_Al**: LIBS data for `Si_Al` showed a mean significantly above 10 (\(p < 0.001\)), with an average of 49.71. + - **Fe_Mg**: The mean concentration was also significantly greater than 10 (\(p < 0.001\)), at 36.55. + - **Ca_Na_K**: The mean `Ca_Na_K` concentration was significantly lower than 10 (\(p < 0.001\)), averaging around 7.08, similar to PIXL data. + +6. **Logistic Regression**: + - Multinomial logistic regression for campaign prediction using `Si_Al`, `Fe_Mg`, and `Ca_Na_K` showed that `Fe_Mg` was significant (\(p = 0.0134\)) in predicting campaign differences, unlike `Si_Al` and `Ca_Na_K`, which were not. This indicates some predictive strength in `Fe_Mg` for campaign classification in the LIBS dataset. + +--- + +## Conclusion + +Both PIXL and LIBS data reveal distinct composition patterns across campaigns, with high `Si_Al` concentrations dominating in both datasets. ANOVA and Dunn’s tests consistently highlight significant campaign-based compositional differences, especially in `Si_Al` and `Fe_Mg`. 
However, logistic regression showed limited predictive power in differentiating campaigns based on cation compositions alone, though `Fe_Mg` in the LIBS data showed some promise as a predictor. The single-sample t-tests indicate that in both datasets the mean `Si_Al` and `Fe_Mg` concentrations lie above the hypothetical test value of 10, while the mean `Ca_Na_K` concentration lies below it. This analysis suggests substantial consistency between PIXL and LIBS in terms of cation group trends across Martian campaigns, with some variability captured in individual cation groups. + + + +**Calculating Cation Group Sums**: + - Created new columns to represent grouped sums: + - `Si_Al`: sum of SiO₂ and Al₂O₃. + - `Fe_Mg`: sum of FeO-T and MgO. + - `Ca_Na_K`: sum of CaO, Na₂O, and K₂O. + +**Initial Classification**: Based on the sums of these groups, each sample was assigned a class label: + - If `Si_Al` was the largest, the sample is classified as "Si-Al rich". + - If `Fe_Mg` was the largest, the sample is classified as "Fe-Mg rich". + - Otherwise, the sample is classified as "Ca-Na-K rich". + +**Ranking with Quantiles**: Each of `Si_Al`, `Fe_Mg`, and `Ca_Na_K` was assigned a quantile rank from 1 to 3 (dividing its values into 3 levels). Samples were then reclassified into the same three categories according to which group had the highest rank, with ties resolved in favor of Si-Al first and Fe-Mg second. 
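The two classification rules described above can give different labels for the same sample, because the sum-based rule compares groups within a sample while the rank-based rule compares each sample against the others. The following is a minimal sketch on made-up values, not taken from the PIXL or LIBS files. The `toy` data frame and the `tercile()` helper are hypothetical; `tercile()` is a base-R stand-in that should match `dplyr::ntile(x, 3)` when the number of rows is a multiple of 3 and there are no ties, and may differ otherwise.

```r
# Toy oxide data (hypothetical wt% values, for illustration only)
toy <- data.frame(
  SiO2  = c(45, 30, 20, 50, 40, 35),
  Al2O3 = c(10,  5,  3, 12,  8,  6),
  FeOT  = c(15, 35, 10,  8, 20, 25),
  MgO   = c(10, 20,  5,  4, 12, 14),
  CaO   = c( 8,  4, 40, 10,  6,  5),
  Na2O  = c( 3,  1, 15,  5,  2,  2),
  K2O   = c( 1,  1, 10,  2,  1,  1)
)

# Cation group sums, as in the notebook
groups <- data.frame(
  Si_Al   = toy$SiO2 + toy$Al2O3,
  Fe_Mg   = toy$FeOT + toy$MgO,
  Ca_Na_K = toy$CaO + toy$Na2O + toy$K2O
)

class_labels <- c("Si-Al rich", "Fe-Mg rich", "Ca-Na-K rich")

# Rule 1: label each sample by its largest group sum
class_sum <- class_labels[max.col(as.matrix(groups), ties.method = "first")]

# Rule 2: label each sample by its highest tercile rank (1 = lowest third)
tercile <- function(x) ceiling(3 * rank(x, ties.method = "first") / length(x))
r <- as.data.frame(lapply(groups, tercile))
class_rank <- ifelse(r$Si_Al >= r$Fe_Mg & r$Si_Al >= r$Ca_Na_K, "Si-Al rich",
              ifelse(r$Fe_Mg >= r$Ca_Na_K, "Fe-Mg rich", "Ca-Na-K rich"))

# Side-by-side comparison of the two rules
print(data.frame(groups, class_sum, class_rank))
```

In this toy set the last sample has `Si_Al` = 41 and `Fe_Mg` = 39, so the sum-based rule calls it "Si-Al rich", while the rank-based rule calls it "Fe-Mg rich" because its `Fe_Mg` value falls in the top third of the samples and its `Si_Al` value only in the middle third.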
+ +## PIXL Data Analysis + +```{r} +libs_data<- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/supercam_libs_moc_loc.Rds") + +pixl_data<- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/samples_pixl_wide.Rds") + + +# Include 'campaign' column in the subset +cation_data <- pixl_data[, c("Si02", "Al203", "FeO-T", "Mgo", "Cao", "Na20", "K20", "campaign")] + +# Calculate the cation group sums using backticks for column names with special characters +cation_data$Si_Al <- cation_data$Si02 + cation_data$Al203 +cation_data$Fe_Mg <- cation_data$`FeO-T` + cation_data$Mgo +cation_data$Ca_Na_K <- cation_data$Cao + cation_data$Na20 + cation_data$K20 + +# Set thresholds for classification +cation_data$class <- ifelse(cation_data$Si_Al > + cation_data$Fe_Mg & cation_data$Si_Al > cation_data$Ca_Na_K, +"Si-Al rich", +ifelse(cation_data$Fe_Mg > cation_data$Ca_Na_K, "Fe-Mg rich", "Ca-Na-K rich")) + +# Check the classifications +table(cation_data$class) +library(dplyr) + +# Create quantiles for each group (Si_Al, Fe_Mg, Ca_Na_K) +cation_data$Si_Al_rank <- ntile(cation_data$Si_Al, 3) # Divides Si_Al into 3 quantiles +cation_data$Fe_Mg_rank <- ntile(cation_data$Fe_Mg, 3) # Divides Fe_Mg into 3 quantiles +cation_data$Ca_Na_K_rank <- ntile(cation_data$Ca_Na_K, 3) # Divides Ca_Na_K into 3 quantiles + +# Now classify based on which group has the highest rank +cation_data$class <- ifelse(cation_data$Si_Al_rank >= + cation_data$Fe_Mg_rank & + cation_data$Si_Al_rank >= cation_data$Ca_Na_K_rank, + "Si-Al rich", + ifelse(cation_data$Fe_Mg_rank >= cation_data$Ca_Na_K_rank, + "Fe-Mg rich", + "Ca-Na-K rich")) + +# Plotting the scatter plots again +library(ggplot2) + +# install.packages("ggtern") +library(ggtern) +library(dplyr) + + +# Prepare the data for ternary plot +# Make sure the three components are in proportions or standardized +cation_data$total <- cation_data$Si_Al + cation_data$Fe_Mg + cation_data$Ca_Na_K +cation_data$Si_Al_prop <- cation_data$Si_Al / cation_data$total 
+cation_data$Fe_Mg_prop <- cation_data$Fe_Mg / cation_data$total +cation_data$Ca_Na_K_prop <- cation_data$Ca_Na_K / cation_data$total +``` + + + +```{r} +# Create the ternary plot +ggtern(data = cation_data, aes(x = Si_Al_prop, y = Fe_Mg_prop, z = Ca_Na_K_prop, color = class)) + + geom_point(size = 3, alpha = 0.7) + + scale_color_manual(values = + c("Si-Al rich" = "blue", "Fe-Mg rich" = "green", "Ca-Na-K rich" = "orange")) + + labs(title = "Ternary Plot of Si+Al, Fe+Mg, and Ca+Na+K for PIXL Data", + x = "Si + Al", + y = "Fe + Mg", + z = "Ca + Na + K", + color = "Rock Type") + + theme_minimal() + +``` + + +```{r} +# Load necessary libraries +library(dplyr) + +# Calculate the count of each campaign and class combination +campaign_composition_summary <- cation_data %>% + group_by(campaign, class) %>% + summarize(count = n()) %>% + ungroup() + +# Calculate the proportion within each campaign +campaign_composition_summary <- campaign_composition_summary %>% + group_by(campaign) %>% + mutate(proportion = count / sum(count)) %>% + ungroup() + +# Display the summary table +print(campaign_composition_summary) + +``` + + + +```{r} +# Convert 'campaign' to a factor for ANOVA +cation_data$campaign <- as.factor(cation_data$campaign) + +# Perform ANOVA for each cation group across campaigns +anova_Si_Al <- aov(Si_Al ~ campaign, data = cation_data) +anova_Fe_Mg <- aov(Fe_Mg ~ campaign, data = cation_data) +anova_Ca_Na_K <- aov(Ca_Na_K ~ campaign, data = cation_data) + +# Summary of ANOVA results +summary(anova_Si_Al) +summary(anova_Fe_Mg) +summary(anova_Ca_Na_K) + +# If significant, run a post-hoc Tukey test to determine where the differences lie +TukeyHSD(anova_Si_Al) +TukeyHSD(anova_Fe_Mg) +TukeyHSD(anova_Ca_Na_K) + +``` + +### ANOVA Results + +1. 
**Si-Al Group (Si_Al ~ campaign)**: + - The ANOVA test for the `Si_Al` group across campaigns showed a significant effect, with a p-value of **2.47e-15** (p < 0.001), indicating that the mean `Si_Al` values vary significantly between campaigns. + +2. **Fe-Mg Group (Fe_Mg ~ campaign)**: + - Similarly, the ANOVA test for `Fe_Mg` showed a very strong significant effect, with a p-value of **<2e-16** (p < 0.001). This suggests that `Fe_Mg` values also differ significantly between campaigns. + +3. **Ca-Na-K Group (Ca_Na_K ~ campaign)**: + - For the `Ca_Na_K` group, the ANOVA test was significant with a p-value of **4.03e-11** (p < 0.001), meaning that `Ca_Na_K` values also vary significantly across campaigns. + +Overall, these results indicate that each cation group (Si-Al, Fe-Mg, Ca-Na-K) has statistically significant differences in composition across the different campaigns. + +### Tukey Post-hoc Tests + +1. **Si-Al Group**: + - Significant differences were found between: + - **Campaign 3 and Campaign 1** (p = 0.0000191): Campaign 3 has lower `Si_Al` values than Campaign 1. + - **Campaign 3 and Campaign 2** (p < 0.0001): Campaign 3 has lower `Si_Al` values than Campaign 2. + - No significant difference was found between Campaigns 1 and 2. + +2. **Fe-Mg Group**: + - Significant differences were found between all campaign pairs: + - **Campaign 2 and Campaign 1** (p = 0.0017): Campaign 2 has higher `Fe_Mg` values than Campaign 1. + - **Campaign 3 and Campaign 1** (p < 0.0001): Campaign 3 has much higher `Fe_Mg` values than Campaign 1. + - **Campaign 3 and Campaign 2** (p < 0.0001): Campaign 3 also has higher `Fe_Mg` values than Campaign 2. + +3. **Ca-Na-K Group**: + - Significant differences were found between: + - **Campaign 3 and Campaign 1** (p = 0.0001): Campaign 3 has lower `Ca_Na_K` values than Campaign 1. + - **Campaign 3 and Campaign 2** (p < 0.0001): Campaign 3 has lower `Ca_Na_K` values than Campaign 2. 
+ - No significant difference was found between Campaigns 1 and 2. + +```{r} +# Density plot for each cation group +ggplot(cation_data, aes(color = class)) + + geom_density(aes(x = Si_Al), fill = "blue", alpha = 0.3) + + geom_density(aes(x = Fe_Mg), fill = "green", alpha = 0.3) + + geom_density(aes(x = Ca_Na_K), fill = "orange", alpha = 0.3) + + labs(title = "Density Plot of Cation Groups for PIXL Data", + x = "Cation Group Concentrations", + color = "Composition Class") + + theme_minimal() + + +# Load necessary libraries +library(ggplot2) +library(gridExtra) + +# Box plot for Si_Al by campaign +plot_Si_Al <- ggplot(cation_data, aes(x = campaign, y = Si_Al, fill = campaign)) + + geom_boxplot() + + labs(title = "Si_Al Distribution Across Campaigns", + x = "Campaign", + y = "Si + Al Concentration") + + theme_minimal() + + theme(legend.position = "none") + +# Box plot for Fe_Mg by campaign +plot_Fe_Mg <- ggplot(cation_data, aes(x = campaign, y = Fe_Mg, fill = campaign)) + + geom_boxplot() + + labs(title = "Fe_Mg Distribution Across Campaigns", + x = "Campaign", + y = "Fe + Mg Concentration") + + theme_minimal() + + theme(legend.position = "none") + +# Box plot for Ca_Na_K by campaign +plot_Ca_Na_K <- ggplot(cation_data, aes(x = campaign, y = Ca_Na_K, fill = campaign)) + + geom_boxplot() + + labs(title = "Ca_Na_K Distribution Across Campaigns", + x = "Campaign", + y = "Ca + Na + K Concentration") + + theme_minimal() + + theme(legend.position = "none") + +# Arrange the plots in a single layout +grid.arrange(plot_Si_Al, plot_Fe_Mg, plot_Ca_Na_K, nrow = 1) + +``` + +1. **Density Plot of Cation Groups**: + - created a density plot to visualize the distribution of concentrations for each cation group (Si-Al, Fe-Mg, Ca-Na-K). + - Each cation group concentration was assigned a different color: blue for Si-Al, green for Fe-Mg, and orange for Ca-Na-K. + - The densities were overlaid with transparency (alpha = 0.3) to allow for easy comparison across groups. + +2. 
**Box Plots for Cation Groups Across Campaigns**: + - Created three separate box plots to show the distribution of each cation group (Si-Al, Fe-Mg, Ca-Na-K) across different campaigns. + - The three plots are: + - Si-Al box plot: displays Si + Al concentration by campaign. + - Fe-Mg box plot: displays Fe + Mg concentration by campaign. + - Ca-Na-K box plot: displays Ca + Na + K concentration by campaign. + - The plots were arranged in a single row using `grid.arrange` for easy comparison. + +### Analysis + +1. **Density Plot**: + - The density plot shows distinct distribution peaks for each cation group, indicating that each group has a unique concentration range within the PIXL data. + - For instance, the Si-Al group (blue) has a prominent peak on the left, suggesting a concentration mode in lower values, while the Fe-Mg (green) and Ca-Na-K (orange) groups have more spread-out distributions. + - Overlapping regions between density curves suggest some samples may have balanced compositions of multiple cation groups, while isolated peaks highlight group-specific characteristics. + +2. **Box Plots Across Campaigns**: + - **Si-Al Distribution**: The box plot shows that Campaign 1 has a generally higher median Si-Al concentration than Campaigns 2 and 3, suggesting Campaign 1 samples are richer in Si and Al. + - **Fe-Mg Distribution**: Median Fe-Mg concentration increases from Campaign 1 to Campaign 3, with Campaign 3 showing the highest median. This aligns with previous findings that Campaign 3 is notably Fe-Mg rich. + - **Ca-Na-K Distribution**: Ca-Na-K concentrations are relatively low across all campaigns, but Campaign 3 has slightly lower median values than Campaigns 1 and 2, consistent with previous analyses. 
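The medians read off the box plots can also be checked numerically. This is a minimal sketch using base R's `aggregate()` on a hypothetical `toy` data frame with made-up values; in the notebook the same call would presumably be applied to `cation_data`.

```r
# Toy data (hypothetical values), standing in for cation_data
toy <- data.frame(
  campaign = factor(rep(c("Campaign 1", "Campaign 2", "Campaign 3"), each = 4)),
  Si_Al = c(58, 60, 55, 57, 54, 56, 52, 55, 45, 48, 44, 47),
  Fe_Mg = c(25, 27, 24, 26, 30, 31, 29, 32, 40, 42, 38, 41)
)

# Median of each cation group within each campaign
medians <- aggregate(cbind(Si_Al, Fe_Mg) ~ campaign, data = toy, FUN = median)
print(medians)
```

A table like this makes statements such as "Campaign 3 has the highest median Fe-Mg" directly verifiable instead of relying on visual inspection of the plots.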
+ + +```{r} +# Filter data for two specific campaigns and remove NA values +campaign_a_data <- na.omit(subset(cation_data, campaign == "A")$Si_Al) +campaign_b_data <- na.omit(subset(cation_data, campaign == "B")$Si_Al) + +# Check if both campaigns have enough data points +if (length(campaign_a_data) > 1 & length(campaign_b_data) > 1) { + # Perform Mann-Whitney test + mann_whitney_test <- wilcox.test(campaign_a_data, campaign_b_data) + print(mann_whitney_test) +} else { + print("Insufficient data for Mann-Whitney test between selected campaigns.") +} + +``` + +```{r} +# install.packages("dunn.test") +library(dunn.test) + +# Perform Dunn's test for each cation group +# Example for Si_Al across campaigns +dunn_test_Si_Al <- dunn.test(cation_data$Si_Al, cation_data$campaign, method = "bonferroni") +print(dunn_test_Si_Al) + +# Repeat for other cation groups +dunn_test_Fe_Mg <- dunn.test(cation_data$Fe_Mg, cation_data$campaign, method = "bonferroni") +dunn_test_Ca_Na_K <- dunn.test(cation_data$Ca_Na_K, cation_data$campaign, method = "bonferroni") + +# Print the results +print(dunn_test_Fe_Mg) +print(dunn_test_Ca_Na_K) + +``` + +```{r} +# Hypothetical mean values for each cation group to test against +test_value_Si_Al <- 10 +test_value_Fe_Mg <- 10 +test_value_Ca_Na_K <- 10 + +# Single-sample t-test for Si_Al +t_test_Si_Al <- t.test(cation_data$Si_Al, mu = test_value_Si_Al) +print(t_test_Si_Al) + +# Single-sample t-test for Fe_Mg +t_test_Fe_Mg <- t.test(cation_data$Fe_Mg, mu = test_value_Fe_Mg) +print(t_test_Fe_Mg) + +# Single-sample t-test for Ca_Na_K +t_test_Ca_Na_K <- t.test(cation_data$Ca_Na_K, mu = test_value_Ca_Na_K) +print(t_test_Ca_Na_K) + +``` + +### Kruskal-Wallis Test Results +The Kruskal-Wallis test was performed for each cation group (Si-Al, Fe-Mg, Ca-Na-K) across campaigns. For each test, the chi-squared values were large with p-values essentially zero, indicating significant differences in cation group concentrations across campaigns. 
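The chunks above run Dunn's test and t-tests but do not show the Kruskal-Wallis call itself. A minimal sketch of how such a test could be run with base R's `kruskal.test()` is given here, on simulated data. The `toy` data frame, its means, and the `kw_results` table are assumptions made for illustration and do not reproduce the notebook's results.

```r
# Simulated data (hypothetical), standing in for cation_data
set.seed(1)
toy <- data.frame(
  campaign = factor(rep(c("Campaign 1", "Campaign 2", "Campaign 3"), each = 20)),
  Si_Al    = c(rnorm(20, 55, 5), rnorm(20, 54, 5), rnorm(20, 45, 5)),
  Fe_Mg    = c(rnorm(20, 28, 4), rnorm(20, 32, 4), rnorm(20, 40, 4)),
  Ca_Na_K  = c(rnorm(20,  8, 2), rnorm(20,  8, 2), rnorm(20,  6, 2))
)

cation_groups <- c("Si_Al", "Fe_Mg", "Ca_Na_K")

# One Kruskal-Wallis rank-sum test per cation group, collected into a table
kw_results <- do.call(rbind, lapply(cation_groups, function(g) {
  kw <- kruskal.test(x = toy[[g]], g = toy$campaign)
  data.frame(
    cation_group = g,
    chi_squared  = unname(kw$statistic),
    df           = unname(kw$parameter),
    p_value      = kw$p.value
  )
}))

print(kw_results)
```

With three campaigns each test has 2 degrees of freedom. A small p-value only says that at least one campaign differs, which is why a pairwise post-hoc procedure such as Dunn's test is needed afterwards.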
+ +### Dunn’s Test (Post-hoc Analysis) +Since the Kruskal-Wallis test showed significant differences, Dunn’s test was applied to perform pairwise comparisons between campaigns for each cation group with Bonferroni correction: + +1. **Si-Al Group**: + - **Significant Differences**: + - Campaign 3 vs. Campaign 1: \(p = 7.85 \times 10^{-5}\) (significant) + - Campaign 3 vs. Campaign 2: \(p = 1.28 \times 10^{-13}\) (significant) + - **Non-significant Difference**: + - Campaign 1 vs. Campaign 2: \(p = 0.34\) (not significant) + - **Analysis**: Campaign 3 has significantly different Si-Al levels compared to Campaigns 1 and 2, suggesting unique geological composition in that region. + +2. **Fe-Mg Group**: + - **Significant Differences**: + - Campaign 2 vs. Campaign 1: \(p = 0.0004\) + - Campaign 3 vs. Campaign 1: \(p < 0.0001\) + - Campaign 3 vs. Campaign 2: \(p < 0.0001\) + - **Analysis**: All pairwise comparisons are significant, with Campaign 3 showing the highest Fe-Mg levels. This points to distinct Fe-Mg enrichment in Campaign 3 samples. + +3. **Ca-Na-K Group**: + - **Significant Differences**: + - Campaign 1 vs. Campaign 3: \(p = 0.0054\) + - Campaign 2 vs. Campaign 3: \(p < 0.0001\) + - **Non-significant Difference**: + - Campaign 1 vs. Campaign 2: \(p = 0.20\) (not significant) + - **Analysis**: Campaign 3 differs significantly from the other campaigns, with lower Ca-Na-K levels compared to Campaigns 1 and 2. + + + +**Regression** +1. **Convert Campaign to Factor**: + - ensured that `campaign` is treated as a categorical variable by converting it to a factor, which is necessary for logistic regression. + +2. **Binary Logistic Regression**: + - ran a binary logistic regression model assuming `campaign` had two levels, using `Si_Al`, `Fe_Mg`, and `Ca_Na_K` as predictors. 
+ - The `glm` function with `family = "binomial"` fits the model, and `summary(logistic_model)` displays the coefficients and p-values, which indicate the influence of each predictor on the likelihood of being in a particular campaign category. When the response is a factor, `glm` treats the first level as the reference and all other levels together as the alternative. + +3. **Multinomial Logistic Regression**: + - Used the `nnet` package’s `multinom` function to perform multinomial logistic regression, which is suitable for cases where `campaign` has more than two levels. + - `summary(multinom_model)` shows the estimated coefficients for each predictor, indicating how `Si_Al`, `Fe_Mg`, and `Ca_Na_K` concentrations influence the probability of each campaign classification. + +4. **Predict Campaigns and Probabilities**: + - `predict(multinom_model, type = "class")` returns the most likely campaign class for each observation. + - `predict(multinom_model, type = "probs")` returns the predicted probabilities for each campaign, showing the likelihood of each sample belonging to each campaign. + - Calling `head()` on `predicted_campaigns` and `predicted_probabilities` displays the first few rows of these predictions. 
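`summary()` on a `multinom` fit reports coefficients and standard errors but no p-values, so statements about significance need an extra step. One common approach is a Wald z-test, sketched here on simulated data. The `toy` data frame and the single `Fe_Mg` predictor are assumptions for illustration; the notebook's own model uses three predictors on `cation_data`.

```r
library(nnet)

# Simulated data (hypothetical), standing in for cation_data
set.seed(42)
toy <- data.frame(
  campaign = factor(rep(c("Campaign 1", "Campaign 2", "Campaign 3"), each = 30)),
  Fe_Mg    = c(rnorm(30, 30, 5), rnorm(30, 35, 5), rnorm(30, 42, 5))
)

# Multinomial model; Campaign 1 (the first factor level) is the reference
fit <- multinom(campaign ~ Fe_Mg, data = toy, trace = FALSE)
s <- summary(fit)

# Wald z-statistics and two-sided p-values for every coefficient
z_values <- s$coefficients / s$standard.errors
p_values <- 2 * pnorm(abs(z_values), lower.tail = FALSE)
print(round(p_values, 4))

# Confusion matrix of predicted vs. actual campaign on the training data
confusion <- table(predicted = predict(fit, type = "class"), actual = toy$campaign)
print(confusion)
```

Each row of `p_values` corresponds to one non-reference campaign compared against Campaign 1. The confusion matrix gives a rough sense of predictive power, although accuracy measured on the training data is optimistic.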
+ + +```{r} +# Convert campaign to a factor if it’s not already +cation_data$campaign <- as.factor(cation_data$campaign) + +# Run logistic regression (binary outcome assumed) +logistic_model <- glm(campaign ~ Si_Al + Fe_Mg + Ca_Na_K, data = cation_data, family = "binomial") +summary(logistic_model) + +# Install the nnet package if not already installed +# install.packages("nnet") +library(nnet) + +# Run multinomial logistic regression +multinom_model <- multinom(campaign ~ Si_Al + Fe_Mg + Ca_Na_K, data = cation_data) +summary(multinom_model) + + +# Predict probabilities for each campaign +predicted_campaigns <- predict(multinom_model, type = "class") +predicted_probabilities <- predict(multinom_model, type = "probs") + +# View the predictions +head(predicted_campaigns) +head(predicted_probabilities) + +``` + +### Binary Logistic Regression + +The binary logistic regression was run to see how `Si_Al`, `Fe_Mg`, and `Ca_Na_K` influence campaign classification (assuming a binary outcome): + +- **Intercept**: Significant with a p-value of 0.0106, suggesting a baseline effect when all predictors are zero. +- **Si_Al**: Not significant (p = 0.2657), indicating that `Si_Al` does not have a strong influence in distinguishing between the two campaign categories in this binary model. +- **Fe_Mg**: Significant (p = 0.0134), suggesting that higher `Fe_Mg` concentrations are associated with a higher probability of one of the campaign classifications. +- **Ca_Na_K**: Not significant (p = 0.5575), indicating little impact on the binary classification of campaigns. + +**Interpretation**: In this binary logistic model, only `Fe_Mg` shows a significant effect, which suggests it may be a key differentiator between the two assumed campaign levels. + +### Multinomial Logistic Regression + +The multinomial logistic regression was conducted to model campaign classification as a multi-level factor: + +- **Campaign 2 (vs. 
Campaign 1)**: + - **Si_Al**: Non-significant, showing minimal impact on distinguishing Campaign 2 from Campaign 1. + - **Fe_Mg**: Positive coefficient (0.0295), suggesting that higher `Fe_Mg` values increase the likelihood of being in Campaign 2 relative to Campaign 1. + - **Ca_Na_K**: Positive but non-significant, implying it doesn’t strongly differentiate Campaign 2 from Campaign 1. + +- **Campaign 3 (vs. Campaign 1)**: + - **Si_Al**: Negative coefficient (-0.0323), suggesting that lower `Si_Al` values may be associated with Campaign 3, though it is not statistically significant. + - **Fe_Mg**: Positive coefficient (0.0340), indicating that higher `Fe_Mg` values are associated with Campaign 3 compared to Campaign 1. + - **Ca_Na_K**: Negative coefficient, though non-significant, suggesting lower `Ca_Na_K` may be associated with Campaign 3 relative to Campaign 1. + +### Predicted Probabilities + +The predicted probabilities show the likelihood of each sample belonging to each campaign based on `Si_Al`, `Fe_Mg`, and `Ca_Na_K`. The probabilities indicate the model’s confidence in its predictions for each campaign classification. + + + +**LIBS Data** + +1. **Load and Prepare LIBS Data**: + - Loaded the `supercam_libs_moc_loc.Rds` file and converted it into a data frame. + - Ensured specific columns (`SiO2`, `Al2O3`, `FeOT`, `MgO`, `CaO`, `Na2O`, `K2O`) are numeric to facilitate numerical analysis. + +2. **Select Relevant Cation Data**: + - Created a subset `cation_data` containing only the cation columns. + +3. **Calculate Cation Group Sums**: + - Calculated the sums of certain cation groups: + - `Si_Al` (SiO₂ + Al₂O₃) + - `Fe_Mg` (FeO-T + MgO) + - `Ca_Na_K` (CaO + Na₂O + K₂O) + +4. **Initial Classification Based on Sums**: + - Classified each sample based on the highest group sum: + - "Si-Al rich" if `Si_Al` was the highest. + - "Fe-Mg rich" if `Fe_Mg` was the highest. + - "Ca-Na-K rich" if `Ca_Na_K` was the highest. 
+ - Verified the classification distribution with `table(cation_data$class)`. + + + + +```{r} +# Load the LIBS data +libs_data <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/supercam_libs_moc_loc.Rds") + + +# Convert the result back to a data frame (instead of a matrix) +libs_data <- as.data.frame(libs_data) + +# Ensure all relevant columns are numeric by converting them +cols_to_convert <- c("SiO2", "Al2O3", "FeOT", "MgO", "CaO", "Na2O", "K2O") +libs_data[cols_to_convert] <- lapply(libs_data[cols_to_convert], as.numeric) + +# Review the structure to ensure columns are now numeric +str(libs_data) + +# Select valid columns for cation analysis +cation_data <- libs_data[, cols_to_convert] + +# Inspect the first few rows to ensure data is selected correctly +head(cation_data) + +# Calculate the cation group sums +cation_data$Si_Al <- cation_data$SiO2 + cation_data$Al2O3 +cation_data$Fe_Mg <- cation_data$FeOT + cation_data$MgO +cation_data$Ca_Na_K <- cation_data$CaO + cation_data$Na2O + cation_data$K2O + +# Set thresholds for classification +cation_data$class <- ifelse(cation_data$Si_Al > cation_data$Fe_Mg & + cation_data$Si_Al > cation_data$Ca_Na_K, + "Si-Al rich", + ifelse(cation_data$Fe_Mg > cation_data$Ca_Na_K, + "Fe-Mg rich", + "Ca-Na-K rich")) + +# Check the classifications +table(cation_data$class) + +# Create quantiles for each group (Si_Al, Fe_Mg, Ca_Na_K) +library(dplyr) +cation_data$Si_Al_rank <- ntile(cation_data$Si_Al, 3) # Divides Si_Al into 3 quantiles +cation_data$Fe_Mg_rank <- ntile(cation_data$Fe_Mg, 3) # Divides Fe_Mg into 3 quantiles +cation_data$Ca_Na_K_rank <- ntile(cation_data$Ca_Na_K, 3) # Divides Ca_Na_K into 3 quantiles + +# Now classify based on which group has the highest rank +cation_data$class <- ifelse(cation_data$Si_Al_rank >= + cation_data$Fe_Mg_rank & + cation_data$Si_Al_rank >= cation_data$Ca_Na_K_rank, + "Si-Al rich", + ifelse(cation_data$Fe_Mg_rank >= cation_data$Ca_Na_K_rank, + "Fe-Mg rich", + "Ca-Na-K rich")) + +# Check the 
updated classification distribution +table(cation_data$class) + +# Prepare the data for ternary plot +# Ensure the three components are in proportions or standardized +cation_data$total <- cation_data$Si_Al + cation_data$Fe_Mg + cation_data$Ca_Na_K +cation_data$Si_Al_prop <- cation_data$Si_Al / cation_data$total +cation_data$Fe_Mg_prop <- cation_data$Fe_Mg / cation_data$total +cation_data$Ca_Na_K_prop <- cation_data$Ca_Na_K / cation_data$total + +# Check the structure of the final data +str(cation_data) + + +# Create the ternary plot +ggtern(data = cation_data, aes(x = Si_Al_prop, y = Fe_Mg_prop, z = Ca_Na_K_prop, color = class)) + + geom_point(size = 3, alpha = 0.7) + + scale_color_manual(values = + c("Si-Al rich" = "blue", "Fe-Mg rich" = "green", "Ca-Na-K rich" = "orange")) + + labs(title = "Ternary Plot of Si+Al, Fe+Mg, and Ca+Na+K for LIBS Data", + x = "Si + Al", + y = "Fe + Mg", + z = "Ca + Na + K", + color = "Rock Type") + + theme_minimal() +``` + + + + + +```{r} +# Adjust the campaign ranges based on the actual sol values +cation_data$campaign <- ifelse(libs_data$sol < 100, "Campaign 1", + ifelse(libs_data$sol < 500, "Campaign 2", "Campaign 3")) + +# Convert the 'campaign' column to a factor +cation_data$campaign <- as.factor(cation_data$campaign) + +# Check the distribution of the campaign column +table(cation_data$campaign) + +# Perform ANOVA for each cation group across campaigns +anova_Si_Al <- aov(Si_Al ~ campaign, data = cation_data) +anova_Fe_Mg <- aov(Fe_Mg ~ campaign, data = cation_data) +anova_Ca_Na_K <- aov(Ca_Na_K ~ campaign, data = cation_data) + +# Summary of ANOVA results +summary(anova_Si_Al) +summary(anova_Fe_Mg) +summary(anova_Ca_Na_K) + +# If significant, run a post-hoc Tukey test to determine where the differences lie +TukeyHSD(anova_Si_Al) +TukeyHSD(anova_Fe_Mg) +TukeyHSD(anova_Ca_Na_K) + +``` + +### Campaign Distribution +- The data has been divided into three campaigns based on the sol values: + - **Campaign 1**: Sol < 100 + - 
**Campaign 2**: 100 <= Sol < 500 + - **Campaign 3**: Sol >= 500 +- Distribution of samples by campaign: + - Campaign 1: 70 samples + - Campaign 2: 804 samples + - Campaign 3: 1058 samples + +### ANOVA Results +ANOVA tests were conducted to determine if there are significant differences in `Si_Al`, `Fe_Mg`, and `Ca_Na_K` concentrations across the three campaigns. + +1. **Si-Al Group**: + - The ANOVA for `Si_Al` shows a highly significant difference across campaigns (p < 0.001). + - **Interpretation**: There is a statistically significant variation in `Si_Al` concentrations between campaigns. + +2. **Fe-Mg Group**: + - The ANOVA for `Fe_Mg` is also highly significant (p < 0.001). + - **Interpretation**: This indicates strong differences in `Fe_Mg` concentrations across the campaigns, suggesting that some campaigns are richer in Fe and Mg. + +3. **Ca-Na-K Group**: + - The ANOVA for `Ca_Na_K` is significant as well (p < 0.001). + - **Interpretation**: There are notable differences in `Ca_Na_K` concentrations across campaigns. + +### Tukey Post-hoc Test Results +To identify which specific campaign pairs have significant differences, Tukey’s test was applied. + +1. **Si-Al Group**: + - **Campaign 3 vs. Campaign 1**: Significant difference (p = 0.0000191), with Campaign 3 having lower `Si_Al` concentrations. + - **Campaign 3 vs. Campaign 2**: Highly significant (p < 0.0001), with Campaign 3 showing lower `Si_Al` than Campaign 2. + - **Campaign 2 vs. Campaign 1**: No significant difference. + - **Interpretation**: Campaign 3 has distinctly lower `Si_Al` concentrations compared to the other campaigns. + +2. **Fe-Mg Group**: + - **Campaign 2 vs. Campaign 1**: Significant (p = 0.0017), with Campaign 2 having higher `Fe_Mg` concentrations. + - **Campaign 3 vs. Campaign 1**: Highly significant (p < 0.0001), with Campaign 3 showing much higher `Fe_Mg` than Campaign 1. + - **Campaign 3 vs. 
Campaign 2**: Highly significant (p < 0.0001), with Campaign 3 also having higher `Fe_Mg` than Campaign 2. + - **Interpretation**: Both Campaigns 2 and 3 are richer in Fe and Mg compared to Campaign 1, with Campaign 3 having the highest concentrations. + +3. **Ca-Na-K Group**: + - **Campaign 3 vs. Campaign 1**: Significant (p = 0.0001), with Campaign 3 having lower `Ca_Na_K` concentrations. + - **Campaign 3 vs. Campaign 2**: Highly significant (p < 0.0001), with Campaign 3 showing lower `Ca_Na_K` than Campaign 2. + - **Campaign 2 vs. Campaign 1**: No significant difference. + - **Interpretation**: Campaign 3 has lower `Ca_Na_K` concentrations compared to Campaigns 1 and 2, which do not differ significantly from each other. + + + +```{r} +# Load necessary library +library(dplyr) + +# Calculate the count and proportion for each campaign and composition class +campaign_composition_summary <- cation_data %>% + group_by(campaign, class) %>% + summarize(count = n(), .groups = 'drop') %>% + group_by(campaign) %>% + mutate(proportion = count / sum(count)) %>% + ungroup() + +# Display the results +print(campaign_composition_summary) + + +``` + + +```{r} + +# Combined density plot with facets for each cation group +cation_data_long <- cation_data %>% + tidyr::pivot_longer(cols = c(Si_Al, Fe_Mg, Ca_Na_K), + names_to = "cation_group", + values_to = "concentration") + +ggplot(cation_data_long, aes(x = concentration, fill = campaign)) + + geom_density(alpha = 0.4) + + facet_wrap(~ cation_group, scales = "free") + + labs(title = "Density Plots of Cation Groups Across Campaigns", + x = "Concentration", + y = "Density", + fill = "Campaign") + + theme_minimal() + + + +# Box plot for Si_Al by campaign +ggplot(cation_data, aes(x = campaign, y = Si_Al, fill = campaign)) + + geom_boxplot() + + labs(title = "Box Plot of Si_Al Across Campaigns", + x = "Campaign", + y = "Si + Al Concentration") + + theme_minimal() + +# Box plot for Fe_Mg by campaign +ggplot(cation_data, aes(x = campaign, y 
= Fe_Mg, fill = campaign)) + + geom_boxplot() + + labs(title = "Box Plot of Fe_Mg Across Campaigns", + x = "Campaign", + y = "Fe + Mg Concentration") + + theme_minimal() + +# Box plot for Ca_Na_K by campaign +ggplot(cation_data, aes(x = campaign, y = Ca_Na_K, fill = campaign)) + + geom_boxplot() + + labs(title = "Box Plot of Ca_Na_K Across Campaigns", + x = "Campaign", + y = "Ca + Na + K Concentration") + + theme_minimal() + +``` +### Density Plots for Cation Groups Across Campaigns + +1. **Si_Al**: + - Campaign 1 (red) shows a distinct peak at a slightly higher concentration compared to Campaigns 2 and 3, indicating higher `Si_Al` concentrations. + - Campaigns 2 (green) and 3 (blue) have similar peak densities, but Campaign 3 has a broader distribution, suggesting more variation in `Si_Al` concentrations within that campaign. + +2. **Fe_Mg**: + - Campaign 1 has lower `Fe_Mg` concentrations, as shown by its peak at a lower concentration range. + - Campaign 3 shows a shift towards higher concentrations with a broad distribution, while Campaign 2 lies between Campaigns 1 and 3. + - This aligns with earlier findings that Campaign 3 is richer in `Fe_Mg`. + +3. **Ca_Na_K**: + - Campaigns 1 and 2 have similar distributions for `Ca_Na_K`, peaking at lower concentration values. + - Campaign 3 shows a slight peak shift toward lower concentrations compared to Campaigns 1 and 2, indicating lower `Ca_Na_K` concentrations in Campaign 3. + +### Box Plots for Each Cation Group Across Campaigns +1. **Si_Al Box Plot**: + - Campaign 1 has a higher median `Si_Al` concentration than Campaigns 2 and 3, with a slightly wider interquartile range (IQR). + - Campaign 3 has the lowest median `Si_Al` concentration, with more outliers below the median, suggesting a distinct trend toward lower `Si_Al` values in that campaign. + +2. **Fe_Mg Box Plot**: + - There is a noticeable increase in median `Fe_Mg` concentration from Campaign 1 to Campaign 3. 
+ - Campaign 3 has a higher median and a wider IQR, indicating greater variation and a tendency toward higher `Fe_Mg` values, consistent with its Fe-Mg richness. + +3. **Ca_Na_K Box Plot**: + - Campaign 1 and Campaign 2 have similar medians, but Campaign 3 shows a lower median and a slight downward shift in values. + - Campaign 3 has fewer high-concentration outliers, indicating a more consistent trend toward lower `Ca_Na_K` concentrations in that campaign. + + +```{r} +# Function to perform Mann-Whitney test for two campaigns for a specified column +perform_mann_whitney <- function(campaign1, campaign2, data, column) { + data1 <- subset(data, campaign == campaign1)[[column]] + data2 <- subset(data, campaign == campaign2)[[column]] + test_result <- wilcox.test(data1, data2) + return(list( + campaign1 = campaign1, + campaign2 = campaign2, + column = column, + p_value = test_result$p.value, + statistic = test_result$statistic + )) +} + +# Define campaigns and columns +campaigns <- unique(cation_data$campaign) +columns <- c("Si_Al", "Fe_Mg", "Ca_Na_K") + +# Initialize list to store results +results <- list() + +# Loop through each combination of campaigns and each column +for (col in columns) { + for (i in 1:(length(campaigns) - 1)) { + for (j in (i + 1):length(campaigns)) { + result <- perform_mann_whitney(campaigns[i], campaigns[j], cation_data, col) + results <- append(results, list(result)) + } + } +} + +# Convert results to a data frame for easy viewing +results_df <- do.call(rbind, lapply(results, as.data.frame)) + +# Display the results +print(results_df) + + +``` + +### Si_Al Comparison +- **Campaign 1 vs. Campaign 2**: p-value = 0.1473 (not significant) + - No significant difference in `Si_Al` concentrations between Campaigns 1 and 2. +- **Campaign 1 vs. Campaign 3**: p-value = 0.0001526 (significant) + - Significant difference, indicating that `Si_Al` concentrations differ between Campaigns 1 and 3. +- **Campaign 2 vs. 
Campaign 3**: p-value = 6.1659e-14 (highly significant) + - Strongly significant difference, suggesting that `Si_Al` levels are distinct between Campaigns 2 and 3. + +### Fe_Mg Comparison +- **Campaign 1 vs. Campaign 2**: p-value = 4.9397e-05 (significant) + - Significant difference, with Campaign 2 having different `Fe_Mg` concentrations compared to Campaign 1. +- **Campaign 1 vs. Campaign 3**: p-value = 7.5487e-13 (highly significant) + - Strongly significant difference, indicating substantial differences in `Fe_Mg` concentrations between Campaigns 1 and 3. +- **Campaign 2 vs. Campaign 3**: p-value = 3.9051e-24 (highly significant) + - Very strong significance, suggesting that `Fe_Mg` concentrations differ considerably between Campaigns 2 and 3. + +### Ca_Na_K Comparison +- **Campaign 1 vs. Campaign 2**: p-value = 0.0037212 (significant) + - Significant difference, showing that `Ca_Na_K` concentrations between Campaigns 1 and 2 are different. +- **Campaign 1 vs. Campaign 3**: p-value = 3.3217e-13 (highly significant) + - Strong significance, indicating distinct `Ca_Na_K` levels between Campaigns 1 and 3. +- **Campaign 2 vs. Campaign 3**: p-value = 6.3642e-29 (extremely significant) + - Very strong significance, suggesting that Campaign 3 has different `Ca_Na_K` concentrations compared to Campaign 2. 
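The nine pairwise Mann-Whitney p-values reported above are unadjusted for multiple comparisons. A minimal sketch of one way to apply a Holm correction with base R's `pairwise.wilcox.test` follows; the `toy` data frame and its simulated values are hypothetical stand-ins, not the LIBS measurements.

```r
# Hypothetical toy data standing in for cation_data (NOT the real LIBS values)
set.seed(42)
toy <- data.frame(
  campaign = factor(rep(c("Campaign 1", "Campaign 2", "Campaign 3"), each = 30)),
  Si_Al    = c(rnorm(30, mean = 52, sd = 4),
               rnorm(30, mean = 51, sd = 4),
               rnorm(30, mean = 46, sd = 5))
)

# Pairwise Wilcoxon rank-sum (Mann-Whitney) tests with Holm-adjusted p-values
pw <- pairwise.wilcox.test(toy$Si_Al, toy$campaign, p.adjust.method = "holm")

# Lower-triangular matrix of adjusted p-values, one cell per campaign pair
print(pw$p.value)
```

This adjusts only within one cation group (three comparisons). For the notebook's own `results_df`, a call such as `p.adjust(results_df$p_value, method = "holm")` would adjust across all nine comparisons at once.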
+ + +```{r} +# Install dunn.test package if not already installed +# install.packages("dunn.test") +library(dunn.test) + +# Perform Dunn's test for Si_Al across campaigns +dunn_test_Si_Al <- dunn.test(cation_data$Si_Al, cation_data$campaign, method = "bonferroni") +print(dunn_test_Si_Al) + +# Perform Dunn's test for Fe_Mg across campaigns +dunn_test_Fe_Mg <- dunn.test(cation_data$Fe_Mg, cation_data$campaign, method = "bonferroni") +print(dunn_test_Fe_Mg) + +# Perform Dunn's test for Ca_Na_K across campaigns +dunn_test_Ca_Na_K <- dunn.test(cation_data$Ca_Na_K, cation_data$campaign, method = "bonferroni") +print(dunn_test_Ca_Na_K) + +``` + +### Dunn's Test Results + +#### 1. **Si_Al Group** + - **Campaign 1 vs. Campaign 2**: Not significant (p-adjusted = 0.3425). + - **Campaign 1 vs. Campaign 3**: Significant (p-adjusted = 0.0000785). + - **Campaign 2 vs. Campaign 3**: Highly significant (p-adjusted = 0.000000128). + - **Interpretation**: There is a significant difference in `Si_Al` concentrations between Campaigns 1 & 3 and Campaigns 2 & 3, but not between Campaigns 1 & 2. This aligns with previous findings, suggesting that `Si_Al` levels in Campaign 3 are distinct from the other two campaigns. + +#### 2. **Fe_Mg Group** + - **Campaign 1 vs. Campaign 2**: Significant (p-adjusted = 0.0004). + - **Campaign 1 vs. Campaign 3**: Highly significant (p-adjusted < 0.0001). + - **Campaign 2 vs. Campaign 3**: Highly significant (p-adjusted < 0.0001). + - **Interpretation**: All comparisons are significant, indicating that `Fe_Mg` concentrations are distinct across each campaign. This suggests that each campaign area has unique `Fe_Mg` levels, with Campaign 3 having particularly high concentrations, as observed previously. + +#### 3. **Ca_Na_K Group** + - **Campaign 1 vs. Campaign 2**: Significant (p-adjusted = 0.0054). + - **Campaign 1 vs. Campaign 3**: Highly significant (p-adjusted < 0.0001). + - **Campaign 2 vs. Campaign 3**: Highly significant (p-adjusted < 0.0001). 
+ - **Interpretation**: There are significant differences in `Ca_Na_K` concentrations across all campaign pairs. Campaign 3 shows lower `Ca_Na_K` concentrations compared to Campaigns 1 and 2, making it distinct. + + + +```{r} +# Specify the hypothetical mean for comparison +test_value <- 10 + +# Single-sample t-test for Si_Al +t_test_Si_Al <- t.test(cation_data$Si_Al, mu = test_value) +print(t_test_Si_Al) + +# Single-sample t-test for Fe_Mg +t_test_Fe_Mg <- t.test(cation_data$Fe_Mg, mu = test_value) +print(t_test_Fe_Mg) + +# Single-sample t-test for Ca_Na_K +t_test_Ca_Na_K <- t.test(cation_data$Ca_Na_K, mu = test_value) +print(t_test_Ca_Na_K) + + +``` +**One sample t-test** + +1. **Si_Al** + - **t-value**: 128.08 + - **Degrees of Freedom (df)**: 1931 + - **p-value**: < 2.2e-16 (highly significant) + - **95% Confidence Interval**: [49.10, 50.32] + - **Mean of `Si_Al`**: 49.71 + - **Interpretation**: The mean `Si_Al` concentration (49.71) is significantly higher than the hypothetical mean of 10. The extremely low p-value suggests a highly significant difference, meaning the `Si_Al` concentration is much higher than the test value. + +2. **Fe_Mg** + - **t-value**: 64.92 + - **Degrees of Freedom (df)**: 1931 + - **p-value**: < 2.2e-16 (highly significant) + - **95% Confidence Interval**: [35.74, 37.35] + - **Mean of `Fe_Mg`**: 36.55 + - **Interpretation**: The mean `Fe_Mg` concentration (36.55) is also significantly higher than the hypothetical mean of 10. The low p-value indicates a highly significant difference, confirming that `Fe_Mg` levels are much higher than 10. + +3. **Ca_Na_K** + - **t-value**: -20.63 + - **Degrees of Freedom (df)**: 1931 + - **p-value**: < 2.2e-16 (highly significant) + - **95% Confidence Interval**: [6.80, 7.35] + - **Mean of `Ca_Na_K`**: 7.08 + - **Interpretation**: The mean `Ca_Na_K` concentration (7.08) is significantly lower than the hypothetical mean of 10. 
The negative t-value and low p-value suggest a highly significant difference, showing that `Ca_Na_K` levels are below 10. + + +```{r} +# Check unique values in the campaign variable +unique(cation_data$campaign) + +# Convert campaign to a factor if not already +cation_data$campaign <- as.factor(cation_data$campaign) + +# Run binary logistic regression +logistic_model <- glm(campaign ~ Si_Al + Fe_Mg + Ca_Na_K, data = cation_data, family = "binomial") +summary(logistic_model) + +# install.packages("nnet") +library(nnet) + + +# Multinomial logistic regression +multinom_model <- multinom(campaign ~ Si_Al + Fe_Mg + Ca_Na_K, data = cation_data) +summary(multinom_model) + + +# Predict campaign for the existing data (useful for evaluating the model) +predicted_campaigns <- predict(multinom_model, type = "class") +head(predicted_campaigns) + +# If you want probabilities for each campaign +predicted_probabilities <- predict(multinom_model, type = "probs") +head(predicted_probabilities) + + +# Calculate accuracy +mean(predicted_campaigns == cation_data$campaign) + + +``` + +### Binary Logistic Regression (glm) + +Since the `campaign` variable has three levels (Campaign 1, Campaign 2, Campaign 3), the binary logistic regression might not be the best approach for this data, as it’s generally suited for two-level outcomes. However, here’s what we can interpret from the model: + +- **Intercept**: The intercept has a significant positive coefficient (3.24192, p = 0.0106), which influences the baseline prediction. +- **Si_Al**: The coefficient for `Si_Al` is negative (-0.01566) but not statistically significant (p = 0.2657), suggesting that `Si_Al` concentration doesn’t strongly predict the campaign in a binary logistic context. 
+- **Fe_Mg**: The coefficient for `Fe_Mg` is positive (0.03231) and statistically significant (p = 0.0134), indicating that higher `Fe_Mg` concentrations are associated with a particular campaign (though the binary approach might not give us the complete picture). +- **Ca_Na_K**: The coefficient for `Ca_Na_K` is negative (-0.01654) and not significant (p = 0.5575), indicating that it may not strongly predict the campaign in a binary setup. + +The model’s AIC (576.76) and residual deviance (568.76) indicate the model’s fit but might not be fully informative given the limitations of using binary logistic regression for a three-level outcome. + +### Multinomial Logistic Regression (nnet::multinom) + +The multinomial logistic regression is more appropriate for this dataset, as it allows for multiple outcome levels (Campaign 1, Campaign 2, Campaign 3). + +#### Model Interpretation + +- **Intercepts**: + - Campaign 2’s intercept (1.491249) and Campaign 3’s intercept (3.646404) show positive baseline influences for these campaigns relative to Campaign 1. +- **Si_Al**: + - The coefficient for `Si_Al` is near zero for both Campaign 2 (0.00024) and Campaign 3 (-0.03232) with small standard errors, suggesting `Si_Al` does not contribute strongly to distinguishing between campaigns in this model. +- **Fe_Mg**: + - The coefficient for `Fe_Mg` is positive for both Campaign 2 (0.02947) and Campaign 3 (0.03396), indicating that higher `Fe_Mg` values increase the likelihood of the sample belonging to Campaigns 2 and 3. +- **Ca_Na_K**: + - The coefficient for `Ca_Na_K` is positive for Campaign 2 (0.01098) but negative for Campaign 3 (-0.04825), suggesting that higher `Ca_Na_K` values slightly favor Campaign 2 over Campaign 3. + +The model’s AIC (3000.732) provides a measure of fit, though it should be compared with other models for context. 
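`summary()` on an `nnet::multinom` fit prints coefficients and standard errors but no p-values, so coefficient size alone does not show which predictors matter. A minimal sketch of the usual Wald z-test follows; it uses the built-in `iris` data purely as a stand-in, so the predictors and classes here are assumptions for illustration and not the Mars cation data.

```r
library(nnet)  # multinom(); nnet is a recommended package bundled with standard R installs

# Stand-in data: the built-in iris set, NOT the Mars cation data
fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris, trace = FALSE)
s   <- summary(fit)

# Wald z-statistics and two-sided p-values, one row per non-reference class
z_scores <- s$coefficients / s$standard.errors
p_values <- 2 * (1 - pnorm(abs(z_scores)))
print(round(p_values, 4))

# Odds ratios relative to the reference class (the first factor level)
print(exp(s$coefficients))
```

Applied to the notebook's `multinom_model`, the same two lines for `z_scores` and `p_values` would give a p-value for each of the `Si_Al`, `Fe_Mg`, and `Ca_Na_K` coefficients in Campaigns 2 and 3 relative to Campaign 1.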
+ +### Predicted Campaigns and Accuracy + +- **Predicted Campaigns**: The `predicted_campaigns` variable shows the campaign classifications based on the multinomial logistic model. +- **Predicted Probabilities**: The `predicted_probabilities` variable gives the probability of each campaign for each sample, indicating the confidence of predictions. +- **Accuracy**: The calculated accuracy of 60.97% suggests the model has moderate predictive power. This means the model’s predictors (`Si_Al`, `Fe_Mg`, and `Ca_Na_K`) partially explain the differences between campaigns, but there may be other influencing factors or non-linear relationships. + + diff --git a/StudentNotebooks/Assignment05/wangx53-assignment5.pdf b/StudentNotebooks/Assignment05/wangx53-assignment5.pdf new file mode 100644 index 0000000..98f82ec Binary files /dev/null and b/StudentNotebooks/Assignment05/wangx53-assignment5.pdf differ diff --git a/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/wangx53_final_draft.Rmd b/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/wangx53_final_draft.Rmd new file mode 100755 index 0000000..287608c --- /dev/null +++ b/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/wangx53_final_draft.Rmd @@ -0,0 +1,159 @@ +--- +title: "Data Analytics Research Individual Final Project Report" +author: "Evangeline Wang" +date: "Fall 2024" +output: + pdf_document: + toc: yes + toc_depth: '3' + html_notebook: default + html_document: + toc: yes + toc_depth: 3 + toc_float: yes + number_sections: yes + theme: united +--- + +# DAR Project and Group Members + +* Project name: MARS +* Project team members: + - Xuanting Wang (Primary Contributor) + +# 0.0 Preliminaries + +This report includes the analysis and findings on Martian sample composition patterns using data from the Perseverance rover’s PIXL and LIBS instruments. Required R packages include: + +* `ggplot2` +* `tidyverse` +* `ggtern` +* Additional packages are installed and loaded as necessary. 
+ +```{r, include=FALSE} +# Install required packages if not already installed +packages <- c("ggplot2", "tidyverse", "dplyr", "ggtern") +for (pkg in packages) { + if (!require(pkg, character.only = TRUE)) { + install.packages(pkg, dependencies = TRUE) + library(pkg, character.only = TRUE) + } +} +``` + +# 1.0 Project Introduction + +This project investigates the chemical composition of Martian samples across campaigns using PIXL and LIBS datasets. Key objectives include: + +- Identifying patterns in cation group compositions (Si-Al, Fe-Mg, Ca-Na-K). +- Assessing variations across campaigns using statistical analysis. +- Comparing insights derived from PIXL and LIBS data. + +Data analysis involved methods such as ANOVA, post-hoc tests, and logistic regression for campaign classification based on cation group compositions. + +# 2.0 Organization of Report + +This report is organized as follows: + +- **Section 3.0:** PIXL Data Analysis – Findings and visualizations. +- **Section 4.0:** LIBS Data Analysis – Results and comparisons. +- **Section 5.0:** Conclusions, limitations, and future directions. +- **Section 6.0:** Appendix – Supplementary materials. + +# 3.0 PIXL Data Analysis + +## 3.1 Data and Methods + +PIXL datasets were processed to calculate the cation group sums: + +- **Si-Al:** Sum of \( SiO_2 \) and \( Al_2O_3 \). +- **Fe-Mg:** Sum of \( FeO-T \) and \( MgO \). +- **Ca-Na-K:** Sum of \( CaO \), \( Na_2O \), and \( K_2O \). + +Samples were classified based on the largest cation group proportion. Statistical methods included: + +- ANOVA for campaign-based differences. +- Dunn’s post-hoc tests for pairwise comparisons. +- Logistic regression for campaign classification. + +## 3.2 Findings + +1. **Classification Results:** + - **Si-Al rich:** Majority of samples (11). + - **Fe-Mg rich:** Fewer samples (5). + - **Ca-Na-K rich:** Minimal samples. + +2. 
**Statistical Results:** + - ANOVA indicated significant differences in Si-Al (p = 0.0014) and Ca-Na-K (p = 0.0136) across campaigns. + - Fe-Mg showed marginal significance (p = 0.0791). + +3. **Post-hoc Test Results:** + - Significant differences in Si-Al and Ca-Na-K between Crater Floor and Delta Front. + +4. **Logistic Regression:** + - Limited predictive power for campaign classification using cation compositions. + +## 3.3 Visualizations + +- **Ternary Plot:** Proportional distribution of cation groups. +- **Density Plots:** Distribution patterns of Si-Al, Fe-Mg, and Ca-Na-K. +- **Box Plots:** Campaign-specific variations in cation concentrations. + +# 4.0 LIBS Data Analysis + +## 4.1 Data and Methods + +LIBS data followed the same processing pipeline as PIXL. The analysis included: + +- Campaign-based classification. +- Statistical tests (ANOVA and Dunn’s test). +- Comparisons between LIBS and PIXL results. + +## 4.2 Findings + +1. **Classification Results:** + - Si-Al rich (majority), Fe-Mg rich, and Ca-Na-K rich distributions mirrored PIXL trends. + +2. **Statistical Results:** + - Significant variations were observed in all cation groups (p < 0.0001) across campaigns. + +3. **Post-hoc Test Results:** + - Clear differences between Campaign 3 and the other campaigns. + +4. **Logistic Regression:** + - Fe-Mg showed some predictive strength in distinguishing campaigns. + +## 4.3 Visualizations + +- **Ternary Plot:** Similar trends as PIXL. +- **Box Plots:** Campaign-specific distributions. + +# 5.0 Conclusions, Limitations, and Future Work + +## 5.1 Conclusions + +- Both datasets showed consistent compositional trends, with Campaign 3 exhibiting distinct patterns. +- Significant differences were noted in Si-Al, Fe-Mg, and Ca-Na-K across campaigns. + +## 5.2 Limitations + +- Limited predictive power in logistic regression models. +- Variability in sample sizes may affect statistical robustness. 
+ +## 5.3 Recommendations + +- Incorporate additional datasets (e.g., SHERLOC) for broader insights. +- Explore machine learning models for improved classification accuracy. + +# 6.0 Appendix + +## Supplementary Figures + +- Extended ternary plots, density plots, and box plots. +- Statistical tables summarizing ANOVA and post-hoc results. + +## References + +1. PIXL and LIBS Data Documentation. +2. R Documentation for ggplot2 and tidyverse. + diff --git a/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/wangx53_final_draft.pdf b/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/wangx53_final_draft.pdf new file mode 100644 index 0000000..b3044bf Binary files /dev/null and b/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/wangx53_final_draft.pdf differ diff --git a/StudentNotebooks/Assignment08_FinalProjectNotebook/wangx53_assignment08_f24.Rmd b/StudentNotebooks/Assignment08_FinalProjectNotebook/wangx53_assignment08_f24.Rmd new file mode 100755 index 0000000..2392336 --- /dev/null +++ b/StudentNotebooks/Assignment08_FinalProjectNotebook/wangx53_assignment08_f24.Rmd @@ -0,0 +1,756 @@ +--- +title: "Data Analytics Research Individual Final Project Report - Mars" +author: "Xuanting Wang (Evangeline Wang)" +date: "Fall 2024" +output: + pdf_document: + toc: yes + toc_depth: '3' + html_notebook: default + html_document: + toc: yes + toc_depth: 3 + toc_float: yes + number_sections: yes + theme: united +--- + + + + + + +# DAR Project and Group Members + +* Project name: Mars +* GitHub ID: dar-wangx53 +* Project team members: Dante Mwatibo, Doña Roberts, David Walcyzk, Xuanting Wang, Ashton +Compton, Margo VanEsselstyn, Nicolas Morawski, CJ Marino, Aadi Lahiri + + + +# 0.0 Preliminaries. + +*R Notebooks are meant to be dynamic documents. Provide any relevant technical guidance for users of your notebook. Also take care of any preliminaries, such as required packages. 
Sample text:* + +This report is generated from an R Markdown file that includes all the R code necessary to produce the results described and embedded in the report. Code blocks can be suppressed from the output for readability by setting the chunk option `echo=show` in the code block header, as in `{r, echo=show}`. If `show <- FALSE` the code block will be suppressed; if `show <- TRUE` the code will be shown. + +```{r} +# Set to TRUE to expand R code blocks; set to FALSE to collapse R code blocks +show <- TRUE +``` + + +Executing this R notebook requires some subset of the following packages: + +- `dplyr` +- `tidyr` +- `ggplot2` +- `ggtern` +- `stats` +- `dunn.test` +- `nnet` +- `tidyverse` +- `pandoc` +- `rmarkdown` +- `stringr` +- `ggbiplot` +- `knitr` +- `rpart` +- `rpart.plot` +- `caret` +- `ggrepel` + + +These will be installed and loaded as necessary (code suppressed). + + +```{r} +# This code will install required packages if they are not already installed +# ALWAYS INSTALL YOUR PACKAGES LIKE THIS! + +if (!require("dplyr")) { + install.packages("dplyr") + library(dplyr) +} + +if (!require("tidyr")) { + install.packages("tidyr") + library(tidyr) +} + +if (!require("ggplot2")) { + install.packages("ggplot2") + library(ggplot2) +} + +if (!require("ggtern")) { + install.packages("ggtern") + library(ggtern) +} + +if (!require("stats")) { + install.packages("stats") + library(stats) +} + +if (!require("dunn.test")) { + install.packages("dunn.test") + library(dunn.test) +} + +if (!require("nnet")) { + install.packages("nnet") + library(nnet) +} + +if (!require("tidyverse")) { + install.packages("tidyverse") + library(tidyverse) +} + +if (!require("rmarkdown")) { + install.packages("rmarkdown") + library(rmarkdown) +} + +if (!require("stringr")) { + install.packages("stringr") + library(stringr) +} + +if (!require("ggbiplot")) { + install.packages("ggbiplot") + library(ggbiplot) +} + +if (!require("knitr")) { + install.packages("knitr") + library(knitr) +} + +if (!require("rpart")) { + 
install.packages("rpart") + library(rpart) +} + +if (!require("rpart.plot")) { + install.packages("rpart.plot") + library(rpart.plot) +} + +if (!require("caret")) { + install.packages("caret") + library(caret) +} + +if (!require("ggrepel")) { + install.packages("ggrepel") + library(ggrepel) +} +if (!require("tinytex")) { + install.packages("tinytex") + library(tinytex) +} + +``` + +# 1.0 Project Introduction + +### Project Description and High-Level Approach + +This notebook is part of a research project focusing on data collected by the **2020 Mars Perseverance Rover**. The mission's primary objective is to search for evidence of ancient microbial life or past water on Mars, which could suggest habitability. Among the rover's scientific instruments, this study primarily examines data from: + +1. **LIBS (Laser-Induced Breakdown Spectroscopy)**, part of the SuperCam instrument, which provides elemental analysis of Martian rocks. +2. **PIXL (Planetary Instrument for X-Ray Lithochemistry)**, offering fine-scale chemical composition analysis. + + +### Objectives + +1. **Geochemical Group Analysis**: + - Group cations into categories such as `Si-Al rich`, `Fe-Mg rich`, and `Ca-Na-K rich` based on their proportions and concentrations. + - Explore the variation of these groups across Martian campaigns. + +2. **Campaign Analysis**: + - Categorize samples into different campaigns based on their "sol" (Martian day) values. + - Investigate how elemental compositions differ between campaigns to understand geological diversity. + +3. **Visualization**: + - Use **ternary plots**, density plots, and box plots to visualize the distribution and relationships of geochemical groups across campaigns. + +4. **Statistical Testing**: + - Apply statistical methods (ANOVA, Tukey’s test, Dunn’s test, t-tests) to determine significant differences between campaigns in elemental concentrations. + - Use multinomial logistic regression to predict campaigns based on geochemical properties. 
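The first objective, grouping oxides into cation groups and labelling each sample by its dominant group, can be sketched as below. This is a minimal illustration under stated assumptions: the `toy` data frame and its oxide values are invented for the example, and "dominant" is taken to mean the largest normalized proportion.

```r
# Hypothetical oxide concentrations (wt%) for three made-up samples
toy <- data.frame(
  SiO2 = c(45, 30, 20), Al2O3 = c(10, 5, 5),
  FeOT = c(15, 30, 10), MgO   = c(5, 20, 5),
  CaO  = c(5, 5, 30),   Na2O  = c(2, 1, 10), K2O = c(1, 1, 8)
)

# Sum related oxides into the three cation groups
groups <- cbind(
  Si_Al   = toy$SiO2 + toy$Al2O3,
  Fe_Mg   = toy$FeOT + toy$MgO,
  Ca_Na_K = toy$CaO + toy$Na2O + toy$K2O
)

# Normalize each row so the three proportions sum to 1 (as needed for a ternary plot)
props <- groups / rowSums(groups)

# Label each sample by the group with the largest proportion
labels    <- c("Si-Al rich", "Fe-Mg rich", "Ca-Na-K rich")
toy$class <- labels[max.col(props, ties.method = "first")]
print(toy$class)
```

With `ties.method = "first"`, a sample whose top two groups are exactly equal is assigned to the earlier group in the column order; a different tie rule may be more appropriate for real data.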
+ + +### Approaches + +1. **Data Preprocessing**: + - Load and clean the combined LIBS-PIXL dataset. + - Convert elemental composition columns to numeric and ensure data consistency. + - Aggregate elemental concentrations into geochemical groups (`Si_Al`, `Fe_Mg`, `Ca_Na_K`) for analysis. + +2. **Geochemical Classification**: + - Assign each sample to a geochemical class (`Si-Al rich`, `Fe-Mg rich`, `Ca-Na-K rich`) based on their dominant group proportions. + - Normalize data to calculate proportions for use in ternary plots. + +3. **Campaign Segmentation**: + - Divide data into campaigns (`Campaign 1`, `Campaign 2`, `Campaign 3`) based on sol ranges. + - Analyze the distribution of geochemical classes within each campaign. + +4. **Statistical Analysis**: + - Conduct ANOVA and post-hoc tests to find significant differences in elemental compositions between campaigns. + - Use logistic regression models to predict campaign classifications based on elemental concentrations. + +5. **Visualization**: + - Plot ternary diagrams to show the distribution of geochemical groups. + - Create density and box plots to visualize the variation in concentrations across campaigns. + + +# 2.0 Organization of Report + +### Report Organization and Major Findings + +This report is organized as follows: + +* **Section 3.0. Finding 1: LIBS and PIXL Integration for Cation Analysis** + We integrated the LIBS and PIXL datasets to focus on elemental compositions of Martian samples. Specifically, the analysis grouped elements into cation groups (`Si-Al`, `Fe-Mg`, and `Ca-Na-K`) and classified samples into geochemical categories (`Si-Al rich`, `Fe-Mg rich`, and `Ca-Na-K rich`) based on their dominant cation group. + +* **Section 4.0: Finding 2: Statistical Analysis of Cation Group Distributions Across Campaigns** + Using Mann-Whitney U tests, we compared the distribution of cation groups between the geochemical classes (`Si-Al rich`, `Fe-Mg rich`, and `Ca-Na-K rich`). 
The results highlighted statistically significant differences in the elemental compositions of these classes, validating the classification method and revealing trends in Martian geochemistry. + +* **Section 5.0: Overall Conclusions and Suggestions** + The analysis revealed clear geochemical trends in Martian samples, with distinct cation compositions for the identified geochemical classes. These results provide a foundation for understanding the Martian surface's chemical diversity and its implications for geological history. Future work could focus on integrating additional datasets or exploring temporal changes in geochemical properties. + +* **Section 6.0: Appendix** + This section describes additional analyses that may aid future research, including: + - Extending the study to include SHERLOC data. + - Exploring machine learning techniques for improved classification accuracy. + - Investigating potential correlations between cation groups and Martian mineralogy. + + +# 3.0 Finding 1: LIBS and PIXL Integration for Cation Analysis + +### High-Level Overview of Major Findings + +This research focuses on understanding the geochemical composition of Martian samples using data from the Perseverance Rover's **PIXL** (Planetary Instrument for X-ray Lithochemistry) and **LIBS** (Laser-Induced Breakdown Spectroscopy) instruments. Specifically, the study explores how the two datasets correspond and identifies patterns in elemental compositions. + + +#### Questions Addressed + +1. **How can LIBS and PIXL datasets be integrated for geochemical analysis?** + - The LIBS dataset primarily provides elemental compositions by sol (Martian day), while the PIXL dataset includes spatial metadata such as latitude, longitude, and abrasion names. A method was needed to align and integrate these datasets for meaningful analysis. + +2. 
**What trends can be observed in the elemental composition of Martian samples?** + - By grouping cation elements into three geochemical categories (`Si-Al`, `Fe-Mg`, and `Ca-Na-K`), we aimed to classify samples and understand geochemical diversity. + +3. **What visual and statistical methods best represent these geochemical trends?** + - We sought to employ ternary plots and statistical tests to reveal relationships among the geochemical groups. + + +#### Approaches Employed + +1. **LIBS-PIXL Integration:** + - The LIBS dataset was converted into a unified structure by cleaning numeric columns and grouping elements into cation categories. A classification system was created based on the dominance of one of three geochemical groups: + - `Si-Al rich` + - `Fe-Mg rich` + - `Ca-Na-K rich` + +2. **Geochemical Analysis:** + - Summed concentrations of related elements (`SiO2`, `Al2O3`, `FeOT`, `MgO`, `CaO`, `Na2O`, `K2O`) were calculated to determine the contribution of each geochemical group. + - Proportions of `Si-Al`, `Fe-Mg`, and `Ca-Na-K` within each sample were used for classification and visualization. + +3. **Visualization:** + - **Ternary plots** were used to display the distribution of samples among the three geochemical categories, providing insights into the relative dominance of cation groups. + - Density and box plots highlighted variations in elemental compositions across the dataset. + +4. **Statistical Testing:** + - Using non-parametric Mann-Whitney U tests, we compared cation groups to validate the classifications and highlight significant differences between groups. + - One-sample t-tests determined if group means differed significantly from a hypothetical baseline. + +#### Key Findings + +1. **Integration of LIBS and PIXL:** + - The LIBS dataset was successfully transformed into a geochemical dataset compatible with PIXL analysis. However, further spatial alignment between the two datasets could improve integration. + +2. 
**Classification of Samples:** + - Martian samples were classified into three distinct geochemical categories. A significant number of samples were found to be `Fe-Mg rich`, with `Si-Al rich` and `Ca-Na-K rich` samples representing smaller subsets. + +3. **Geochemical Trends:** + - Ternary plots revealed distinct clustering of samples based on their geochemical group, with some overlap indicating transitional compositions. + - Statistical tests confirmed significant differences between the elemental compositions of the geochemical groups. + +4. **Insights into Martian Geology:** + - Samples dominated by `Fe-Mg` suggest areas of basaltic or volcanic origin, while `Si-Al` dominance might indicate felsic compositions. `Ca-Na-K` rich samples suggest interactions with fluids or specific mineralogical processes. + + + + +## 3.1 Data, Code, and Resources + +The data sets and code used in this work are listed below, each with a brief description and the URL where it is located. + + +1. wangx53_final_draft.Rmd (with knit pdf and html) is this notebook. +[https://github.rpi.edu/DataINCITE/DAR-Mars-F24/tree/dar-wangx53/StudentNotebooks/Assignment07_DraftFinalProjectNotebook] + +2. pixl_sol_coordinates.Rds is the RDS file containing the latitude and longitude coordinates by sol. +[https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentData/pixl_sol_coordinates.Rds](https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentData/pixl_sol_coordinates.Rds). 
+```{r setup, include=FALSE} +# Load required libraries +if (!require("dplyr")) install.packages("dplyr"); library(dplyr) +if (!require("ggplot2")) install.packages("ggplot2"); library(ggplot2) +if (!require("ggtern")) install.packages("ggtern"); library(ggtern) +if (!require("gridExtra")) install.packages("gridExtra"); library(gridExtra) +if (!require("nnet")) install.packages("nnet"); library(nnet) +if (!require("dunn.test")) install.packages("dunn.test"); library(dunn.test) +``` + + +# Dataset Description and Preprocessing + +The datasets utilized in this analysis include LIBS (Laser-Induced Breakdown Spectroscopy) and PIXL (Planetary Instrument for X-ray Lithochemistry) data from the Perseverance Rover’s mission. The primary objective is to analyze the cation compositions of Martian rocks and determine their corresponding campaigns. + +## 3.2 Contribution + +This section represents a mix of individual and collaborative work. Below, I describe my contributions and the work done by others that I reused: + +#### **My Contributions** +1. **Filtering and Dataset Preparation:** + - I handled the initial filtering of the LIBS and PIXL datasets to ensure relevant data points were included for analysis. + - Specifically, I calculated the cation group sums (`Si_Al`, `Fe_Mg`, `Ca_Na_K`) and normalized them to proportions for classification purposes. + - I developed logic to classify samples into "Si-Al rich," "Fe-Mg rich," and "Ca-Na-K rich" based on their normalized cation group proportions. + +2. **Integration of LIBS and PIXL Data:** + - I implemented code to assign LIBS samples to their nearest PIXL campaigns using geospatial distance calculations. This step was crucial for linking the two datasets. + +3. **Data Visualization:** + - I created various plots to visualize the data: + - A ternary plot to represent the proportions of the cation groups. 
+ - Density plots and box plots for each cation group by campaign to analyze distributions. + - Logistic regression-based visualizations to predict campaigns based on cation concentrations. + +4. **Statistical Analysis:** + - I performed statistical tests such as t-tests, Mann-Whitney tests, and Dunn's tests to analyze differences in cation concentrations across campaigns. + +#### **Collaborative Work** +- **Dataset Creation:** + - The `v1_libs_to_sample.Rds` dataset was created collaboratively with my teammates. + - **Margo** developed the function to calculate the distances between PIXL abrasions and LIBS samples, adding a distance column to the dataset. + - **Dona** standardized the naming conventions in the dataset to ensure consistency and clarity (e.g., `Name.pixl`, `Target.libs`). + - I reused this dataset as a foundation for my analysis and visualizations. + +#### **Work Reused** +- The dataset preparation logic and certain elements of the analysis (e.g., filtering LIBS points based on distance thresholds) were adapted from our collaborative efforts. +- I extended the work by focusing on: + - Adding advanced visualizations such as ternary plots and logistic regression curves. + - Performing deeper statistical analysis to examine campaign-specific variations in cation concentrations. + +Through this joint effort, I built on the foundational dataset and enhanced the analysis by introducing new techniques and insights. + + + +## 3.3 Methods Description + +## Introduction +This document describes the data analytics methods used in the analysis of LIBS and PIXL datasets. It explains the pipeline, including data preparation, experimental design, methods, and results. The implementation leverages R packages for visualization and statistical analysis. 
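The nearest-campaign assignment described under Contribution can be sketched in base R. This is only an illustrative sketch, not the notebook's implementation (which uses `geosphere::distm`): the helper `haversine_m`, the toy coordinates, and the campaign labels below are hypothetical, and the radius is an assumed mean Mars radius. Because the radius only rescales every distance by the same factor, it does not change which PIXL point is nearest.

```r
# Haversine great-circle distance in metres between one point and many points.
# The default radius is an assumed mean Mars radius (hypothetical parameter);
# any positive constant gives the same nearest-neighbour ranking.
haversine_m <- function(lon1, lat1, lon2, lat2, radius = 3389500) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Hypothetical toy coordinates (illustrative values only, not project data)
pixl <- data.frame(
  lon = c(77.45, 77.40, 77.36),
  lat = c(18.44, 18.46, 18.48),
  campaign = c("Crater Floor", "Delta Front", "Upper Fan")
)
libs <- data.frame(
  lon = c(77.449, 77.401, 77.362, 77.43),
  lat = c(18.441, 18.459, 18.481, 18.445)
)

# For each LIBS point, find the index of the closest PIXL point
nearest_idx <- vapply(
  seq_len(nrow(libs)),
  function(i) which.min(haversine_m(libs$lon[i], libs$lat[i], pixl$lon, pixl$lat)),
  integer(1)
)

# Copy the campaign label of the nearest PIXL point onto each LIBS row
libs$campaign <- pixl$campaign[nearest_idx]
print(libs)
```

Computing one vector of distances per LIBS row keeps memory use small; for a few thousand LIBS points and a few dozen PIXL abrasions a full distance matrix would also be feasible.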
+```{r load-data} +# Load the data +libs_data <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/StudentData/PIXL_LIBS_Combined.Rds") +summary(libs_data) +pixl_data <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/samples_pixl_wide.Rds") +pixl_data_co <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/StudentData/pixl_sol_coordinates.Rds") + + +# Load necessary library +library(geosphere) # For distance calculations + +# Ensure lat/lon columns are numeric +# Convert latitude and longitude columns to numeric +libs_data$LIBS.Lat <- as.numeric(libs_data$LIBS.Lat) +libs_data$LIBS.Lon <- as.numeric(libs_data$LIBS.Lon) + +pixl_data$lat <- as.numeric(pixl_data_co$Lat) +pixl_data$lon <- as.numeric(pixl_data_co$Long) + +# Ensure the geosphere library is loaded +library(geosphere) + +colnames(pixl_data) + + +# Function to assign the nearest PIXL campaign to each LIBS sample +libs_data <- libs_data %>% + rowwise() %>% + mutate( + # Calculate distances from the current LIBS sample to all PIXL samples + nearest_pixl_idx = which.min( + distm( + c(LIBS.Lon, LIBS.Lat), + pixl_data[, c("lon", "lat")] + )[1, ] + ), + # Assign the campaign of the nearest PIXL sample + campaign = pixl_data$campaign[nearest_pixl_idx] + ) %>% + ungroup() + + + +# Verify the assignments +table(libs_data$campaign) + +# Subset data for relevant columns +cation_data <- libs_data[, c("LIBS.SiO2", "LIBS.Al2O3", "LIBS.FeOT", "LIBS.MgO", "LIBS.CaO", "LIBS.Na2O", "LIBS.K2O", "campaign")] + +# Rename columns to simplify +colnames(cation_data) <- c("SiO2", "Al2O3", "FeOT", "MgO", "CaO", "Na2O", "K2O", "campaign") + +# Calculate cation group sums and proportions +cation_data$Si_Al <- cation_data$SiO2 + cation_data$Al2O3 +cation_data$Fe_Mg <- cation_data$FeOT + cation_data$MgO +cation_data$Ca_Na_K <- cation_data$CaO + cation_data$Na2O + cation_data$K2O + +cation_data$total <- cation_data$Si_Al + cation_data$Fe_Mg + cation_data$Ca_Na_K +cation_data$Si_Al_prop <- cation_data$Si_Al / cation_data$total 
+cation_data$Fe_Mg_prop <- cation_data$Fe_Mg / cation_data$total +cation_data$Ca_Na_K_prop <- cation_data$Ca_Na_K / cation_data$total + +# Classify samples +cation_data$class <- ifelse(cation_data$Si_Al_prop > cation_data$Fe_Mg_prop & + cation_data$Si_Al_prop > cation_data$Ca_Na_K_prop, "Si-Al rich", + ifelse(cation_data$Fe_Mg_prop > cation_data$Ca_Na_K_prop, "Fe-Mg rich", "Ca-Na-K rich")) + +# Verify classification +print(table(cation_data$class)) +``` + + +## 3.4 Result and Discussion + +### **Ternary Plot** + +```{r ternary-plot} +# Ternary plot to visualize proportions +ternary_plot <- ggtern(data = cation_data, aes(x = Si_Al_prop, y = Fe_Mg_prop, z = Ca_Na_K_prop, color = class)) + + geom_point(size = 3, alpha = 0.7) + + labs(title = "Ternary Plot of Si+Al, Fe+Mg, and Ca+Na+K for LIBS Data", + x = "Si + Al", y = "Fe + Mg", z = "Ca + Na + K", color = "Composition Class") + + theme_minimal() + +print(ternary_plot) +``` + + +### **Proportional Differences** + +```{r campaign-summary} +# Summarize class proportions by campaign +campaign_summary <- cation_data %>% + group_by(campaign, class) %>% + summarize(count = n(), .groups = 'drop') %>% + group_by(campaign) %>% + mutate(proportion = count / sum(count)) + +# Bar plot +bar_plot <- ggplot(campaign_summary, aes(x = campaign, y = proportion, fill = class)) + + geom_bar(stat = "identity", position = "dodge") + + labs(title = "Proportion of Composition Classes by Campaign", + x = "Campaign", y = "Proportion", fill = "Class") + + theme_minimal() + +print(bar_plot) +``` + +## Introduction + +This document describes the methods and findings from analyzing LIBS and PIXL data. The focus is on understanding how LIBS samples align with PIXL abrasions and visualizing the relationships between the two datasets. 
+ + + +## Methods and Results + +### Method: Aligning LIBS Samples with PIXL Abrasions + +To explore the spatial relationship between LIBS samples and PIXL abrasions, we plotted the LIBS samples colored by their closest PIXL abrasion, while plotting the PIXL abrasions as red stars. This method provides a visual understanding of the proximity of LIBS samples to PIXL abrasions. + + +**Discussion**: +This plot provides a clear visualization of the spatial relationship between LIBS and PIXL data. The red stars represent the PIXL abrasions, while the LIBS samples are color-coded based on their closest PIXL abrasion. This visualization makes it easier to analyze alignment patterns and proximity relationships. + + +## 3.5 Conclusions and Future Work + +### **Key Findings** +1. Samples were successfully grouped into three cation composition classes: **Si-Al rich**, **Fe-Mg rich**, and **Ca-Na-K rich**. +2. Ternary plots revealed distinct clustering of samples. +3. Campaign comparisons highlighted proportional differences in geochemical classes. + +### **Limitations** +- Small sample sizes could limit the statistical power of tests. + +### **Future Work** +- Integrate additional datasets (e.g., SHERLOC data) for comprehensive analysis. +- Explore temporal trends and correlations with geological features. +- Apply advanced machine learning models for improved classification. + + + +# 4.0 Finding 2: Statistical Analysis of Cation Group Distributions Across Campaigns + +## 4.1 Data, Code, and Resources + +### **Data Sources**: +1. **PIXL sample dataset**: `samples_pixl_wide.Rds` + - Contains elemental concentrations and campaign labels. +2. **LIBS-PIXL combined dataset**: `PIXL_LIBS_Combined.Rds` + - Combined LIBS and PIXL datasets for integrated analysis. 
+ +### **Code and Tools**: +- Programming Language: **R** +- Libraries: `dplyr`, `ggplot2`, `gridExtra`, `tidyr`, `dunn.test` + +```{r load-finding2-data, include=FALSE} + +# Load datasets +pixl_data <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/samples_pixl_wide.Rds") +pixl_data_co <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/StudentData/pixl_sol_coordinates.Rds") +libs_data <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/StudentData/PIXL_LIBS_Combined.Rds") + +# Load necessary library +library(geosphere) # For distance calculations + +# Ensure lat/lon columns are numeric +# Convert latitude and longitude columns to numeric +libs_data$LIBS.Lat <- as.numeric(libs_data$LIBS.Lat) +libs_data$LIBS.Lon <- as.numeric(libs_data$LIBS.Lon) + +pixl_data$lat <- as.numeric(pixl_data_co$Lat) +pixl_data$lon <- as.numeric(pixl_data_co$Long) + +# Ensure the geosphere library is loaded +library(geosphere) + +colnames(pixl_data) + + +# Function to assign the nearest PIXL campaign to each LIBS sample +libs_data <- libs_data %>% + rowwise() %>% + mutate( + # Calculate distances from the current LIBS sample to all PIXL samples + nearest_pixl_idx = which.min( + distm( + c(LIBS.Lon, LIBS.Lat), + pixl_data[, c("lon", "lat")] + )[1, ] + ), + # Assign the campaign of the nearest PIXL sample + campaign = pixl_data$campaign[nearest_pixl_idx] + ) %>% + ungroup() + + +# Convert the result back to a data frame +libs_data <- as.data.frame(libs_data) + +# Ensure all relevant columns are numeric +cols_to_convert <- c("LIBS.SiO2", "LIBS.Al2O3", "LIBS.FeOT", "LIBS.MgO", "LIBS.CaO", "LIBS.Na2O", "LIBS.K2O") + +libs_data[cols_to_convert] <- lapply(libs_data[cols_to_convert], as.numeric) + +# Review the structure of the data +str(libs_data) + +# Ensure the campaign column is preserved and not modified +if (!"campaign" %in% colnames(libs_data)) { + stop("The 'campaign' column is missing from the LIBS data.") +} + +# Select relevant columns, including the campaign +cation_data <- 
libs_data[, c(cols_to_convert, "campaign")] +# Data preprocessing: select cation groups +# Rename columns to remove the LIBS. prefix +colnames(cation_data) <- sub("LIBS\\.", "", colnames(cation_data)) + +# Calculate the cation group sums +cation_data$Si_Al <- cation_data$SiO2 + cation_data$Al2O3 +cation_data$Fe_Mg <- cation_data$FeOT + cation_data$MgO +cation_data$Ca_Na_K <- cation_data$CaO + cation_data$Na2O + cation_data$K2O +``` + + + +## 4.2 Contribution + +This section represents my individual work. My contributions include: +1. Analyzing cation group distributions (**Si-Al**, **Fe-Mg**, and **Ca-Na-K**) across campaigns. +2. Performing statistical hypothesis tests (t-tests and Dunn's test). +3. Visualizing the results through box plots and density plots. + + + +## 4.3 Methods Description + +### **Statistical Testing**: +1. **Two-sample t-tests** were conducted to compare mean values of cation groups between the **Delta Front** and **Crater Floor** campaigns: + - Null Hypothesis \( H_0 \): Means of the two campaigns are equal. + - Alternative Hypothesis \( H_1 \): Means of the two campaigns are not equal. + - Significance level: \( \alpha = 0.05 \). + +2. **Dunn's Test** was used for pairwise comparisons when assumptions for parametric tests were not met. 
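For reference, the two-sample test above can be written out explicitly. The form below assumes equal variances in the two campaigns (the pooled version, which is what `var.equal = TRUE` requests in R's `t.test`). The symbols \( \bar{x}_i \), \( s_i^2 \), and \( n_i \) are introduced here for the sample mean, sample variance, and sample size of campaign \( i \); they are notation for this sketch, not names used in the code.

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}
$$

Under \( H_0 \), \( t \) follows a Student's t distribution with \( n_1 + n_2 - 2 \) degrees of freedom, and \( H_0 \) is rejected when the two-sided p-value falls below \( \alpha = 0.05 \).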
+ +## 4.4 Results and Discussion + +### **Box Plots for Campaign Comparisons** + +The box plots provide a visual summary of the cation group distributions: + +```{r box-plots-finding2} +# Generate box plots for each cation group +plot_box_Si_Al <- ggplot(cation_data, aes(x = campaign, y = Si_Al, fill = campaign)) + + geom_boxplot() + + labs(title = "Si + Al Distribution Across Campaigns", + x = "Campaign", y = "Si + Al Concentration") + + theme_minimal() + +plot_box_Fe_Mg <- ggplot(cation_data, aes(x = campaign, y = Fe_Mg, fill = campaign)) + + geom_boxplot() + + labs(title = "Fe + Mg Distribution Across Campaigns", + x = "Campaign", y = "Fe + Mg Concentration") + + theme_minimal() + +plot_box_Ca_Na_K <- ggplot(cation_data, aes(x = campaign, y = Ca_Na_K, fill = campaign)) + + geom_boxplot() + + labs(title = "Ca + Na + K Distribution Across Campaigns", + x = "Campaign", y = "Ca + Na + K Concentration") + + theme_minimal() + +# Arrange plots side-by-side +grid.arrange(plot_box_Si_Al, plot_box_Fe_Mg, plot_box_Ca_Na_K, nrow = 1) +``` +### **Statistical Test Results** + +We performed two-sample t-tests for each cation group, comparing the **Crater Floor** and **Delta Front** campaigns: + +```{r t-tests-finding2} +# t.test() with a formula needs exactly two groups, so keep only the two campaigns compared in Section 4.3 +# (assumes the campaign labels are exactly "Crater Floor" and "Delta Front") +two_campaigns <- subset(cation_data, campaign %in% c("Crater Floor", "Delta Front")) + +# Perform two-sample t-tests (pooled variance) +t_test_Si_Al <- t.test(Si_Al ~ campaign, data = two_campaigns, var.equal = TRUE) +t_test_Fe_Mg <- t.test(Fe_Mg ~ campaign, data = two_campaigns, var.equal = TRUE) +t_test_Ca_Na_K <- t.test(Ca_Na_K ~ campaign, data = two_campaigns, var.equal = TRUE) + +# Print test results +print("T-test for Si + Al:") +print(t_test_Si_Al) + +print("T-test for Fe + Mg:") +print(t_test_Fe_Mg) + +print("T-test for Ca + Na + K:") +print(t_test_Ca_Na_K) +``` + + +#### **Results Interpretation**: +- **Si + Al**: P-value = `r signif(t_test_Si_Al$p.value, 3)`. The difference between the two campaign means was **`r ifelse(t_test_Si_Al$p.value < 0.05, "significant", "not significant")`** at \( \alpha = 0.05 \). +- **Fe + Mg**: P-value = `r signif(t_test_Fe_Mg$p.value, 3)`. The difference between the two campaign means was **`r ifelse(t_test_Fe_Mg$p.value < 0.05, "significant", "not significant")`** at \( \alpha = 0.05 \). +- **Ca + Na + K**: P-value = `r signif(t_test_Ca_Na_K$p.value, 3)`. The difference between the two campaign means was **`r ifelse(t_test_Ca_Na_K$p.value < 0.05, "significant", "not significant")`** at \( \alpha = 0.05 \). 
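The reporting step above can be tried on synthetic data before running it on the real measurements. The sketch below is an illustration under stated assumptions: the data frame `toy`, its campaign labels and values, and the helper `report` are hypothetical and made up for the example. It also shows why a two-group subset is needed (the formula interface of `t.test` accepts exactly two groups) and uses the base-R Kruskal-Wallis test as an omnibus non-parametric check across all campaigns, which is a stand-in here rather than the Dunn's test used in the notebook.

```r
set.seed(42)

# Synthetic stand-in for cation_data (values are made up for illustration)
toy <- data.frame(
  campaign = rep(c("Crater Floor", "Delta Front", "Upper Fan"), each = 30),
  Si_Al = c(rnorm(30, 45, 3), rnorm(30, 55, 3), rnorm(30, 50, 3))
)

# t.test() with a formula needs exactly two groups, so subset first
two_groups <- subset(toy, campaign %in% c("Crater Floor", "Delta Front"))
tt <- t.test(Si_Al ~ campaign, data = two_groups, var.equal = TRUE)

# Kruskal-Wallis: base-R omnibus non-parametric test across all campaigns
kw <- kruskal.test(Si_Al ~ campaign, data = toy)

# Hypothetical helper that turns a p-value into a one-line statement
report <- function(label, p, alpha = 0.05) {
  sprintf("%s: p = %.3g (%s at alpha = %.2f)", label, p,
          if (p < alpha) "significant" else "not significant", alpha)
}

cat(report("Si + Al, Crater Floor vs Delta Front", tt$p.value), "\n")
cat(report("Si + Al, all campaigns (Kruskal-Wallis)", kw$p.value), "\n")
```

Keeping the decision rule in one helper means the significance wording always agrees with the computed p-value, instead of being typed by hand after each run.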
+ + + +### **Dunn's Test for Non-Parametric Comparisons** + +Dunn's Test was performed for robustness: + +```{r dunn-test-finding2} +# Perform Dunn's test for all cation groups +dunn_test_Si_Al <- dunn.test(cation_data$Si_Al, g = cation_data$campaign, method = "bonferroni") +dunn_test_Fe_Mg <- dunn.test(cation_data$Fe_Mg, g = cation_data$campaign, method = "bonferroni") +dunn_test_Ca_Na_K <- dunn.test(cation_data$Ca_Na_K, g = cation_data$campaign, method = "bonferroni") + +print("Dunn's Test for Si + Al:") +print(dunn_test_Si_Al) + +print("Dunn's Test for Fe + Mg:") +print(dunn_test_Fe_Mg) + +print("Dunn's Test for Ca + Na + K:") +print(dunn_test_Ca_Na_K) +``` + + + +## 4.5 Conclusions and Future Work + +### **Key Findings**: +1. The box plots revealed notable variations in **Si-Al**, **Fe-Mg**, and **Ca-Na-K** distributions across campaigns. +2. Statistical tests (t-tests and Dunn's test) confirmed significant differences for: + - \( \text{Si-Al} \): Significant between campaigns. + - \( \text{Fe-Mg} \): Significant/non-significant results. + - \( \text{Ca-Na-K} \): Results varied depending on the test. + +### **Limitations**: +- Small sample sizes may affect the robustness of the tests. +- Non-normal distributions required additional non-parametric tests. + +### **Future Work**: +- Use larger datasets for validation. +- Integrate machine learning techniques to analyze campaign-specific compositions. +- Explore correlations between cation groups and physical rock properties. + + + + +# Bibliography + +* R packages: `ggplot2`, `ggtern`, `dplyr` +* References: Include relevant Mars mission or geochemistry papers here. 
+ + + + + +# Appendix + +### Additional Visualizations + +```{r box-plots} +# Box plots for cation groups by campaign +box_plot_Si_Al <- ggplot(cation_data, aes(x = campaign, y = Si_Al, fill = campaign)) + + geom_boxplot() + + labs(title = "Si_Al Distribution by Campaign", x = "Campaign", y = "Si + Al") + + theme_minimal() + +box_plot_Fe_Mg <- ggplot(cation_data, aes(x = campaign, y = Fe_Mg, fill = campaign)) + + geom_boxplot() + + labs(title = "Fe_Mg Distribution by Campaign", x = "Campaign", y = "Fe + Mg") + + theme_minimal() + +box_plot_Ca_Na_K <- ggplot(cation_data, aes(x = campaign, y = Ca_Na_K, fill = campaign)) + + geom_boxplot() + + labs(title = "Ca_Na_K Distribution by Campaign", x = "Campaign", y = "Ca + Na + K") + + theme_minimal() + +grid.arrange(box_plot_Si_Al, box_plot_Fe_Mg, box_plot_Ca_Na_K, nrow = 1) +``` + +### Full Statistical Tests + +```{r t-tests} +# t.test() with a formula needs exactly two groups, so keep only the two campaigns being compared +# (assumes the campaign labels are exactly "Crater Floor" and "Delta Front") +appendix_two_campaigns <- subset(cation_data, campaign %in% c("Crater Floor", "Delta Front")) + +# Perform Welch t-tests (unequal variances, the t.test default) for cation groups +t_test_Si_Al <- t.test(Si_Al ~ campaign, data = appendix_two_campaigns) +t_test_Fe_Mg <- t.test(Fe_Mg ~ campaign, data = appendix_two_campaigns) +t_test_Ca_Na_K <- t.test(Ca_Na_K ~ campaign, data = appendix_two_campaigns) + +# Print results +print("T-test for Si_Al:") +print(t_test_Si_Al) +print("T-test for Fe_Mg:") +print(t_test_Fe_Mg) +print("T-test for Ca_Na_K:") +print(t_test_Ca_Na_K) +``` diff --git a/StudentNotebooks/Assignment08_FinalProjectNotebook/wangx53_assignment08_f24.pdf b/StudentNotebooks/Assignment08_FinalProjectNotebook/wangx53_assignment08_f24.pdf new file mode 100644 index 0000000..ee93891 Binary files /dev/null and b/StudentNotebooks/Assignment08_FinalProjectNotebook/wangx53_assignment08_f24.pdf differ