diff --git a/StudentNotebooks/Assignment06/mwatid-assignment06.Rmd b/StudentNotebooks/Assignment06/mwatid-assignment06.Rmd
new file mode 100644
index 0000000..743e498
--- /dev/null
+++ b/StudentNotebooks/Assignment06/mwatid-assignment06.Rmd
@@ -0,0 +1,177 @@
+---
+title: "DAR F24 Project Status Notebook Template"
+author: "Dante Mwatibo"
+date: "`r Sys.Date()`"
+output:
+  pdf_document:
+    toc: yes
+  html_document:
+    toc: yes
+subtitle: "Mars"
+---
+## Weekly Work Summary
+
+* RCS ID: mwatid
+* Project Name: Mars
+* Summary of work since last week
+
+    Ran logistic regression on the PIXL data to try to predict the campaign
+
+* Summary of github issues added and worked
+
+    * Logistic Regression on PIXL
+
+* Summary of github commits
+
+    * branch: dar-mwatid
+    * commit links:
+
+* List of presentations, papers, or other outputs
+
+    * N/A
+
+* List of references (if necessary) N/A
+* Indicate any use of group shared code base N/A
+* Indicate which parts of your described work were done by you or as part of joint efforts N/A
+
+## Personal Contribution
+
+* Clearly defined, unique contribution(s) done by you: code, ideas, writing... All code and analysis done by myself
+* Include github issues you've addressed if any: Logistic Regression on PIXL #147d
+
+## Analysis: Logistic Regression (PIXL)
+
+Is it possible to use logistic regression to determine whether there is any relationship between the amounts of a subset of minerals in the PIXL data and the campaign of a rock sample?
+
+### Data Preparation
+
+The data I will be using for this analysis is a subset of minerals in the PIXL data.
+
+1.) Load in the PIXL and SHERLOC data
+2.) Scale the PIXL data
+```{r, result01_data}
+# loading the proper libraries
+library(ggplot2)
+library(ggtern)
+library(magrittr)
+library(dbplyr)
+library(tidyr)
+library(CCA)
+
+# Load the saved PIXL data with locations added
+pixl.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/samples_pixl_wide.Rds")
+
+# Convert to factors
+pixl.df[sapply(pixl.df, is.character)] <- lapply(pixl.df[sapply(pixl.df, is.character)], as.factor)
+
+# Make the matrix of just mineral percentage measurements
+pixl.matrix <- pixl.df[,2:14] %>% scale()
+
+## LOADING IN THE SHERLOC DATA
+# Read in data as provided.
+sherloc_abrasion_raw <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/abrasions_sherloc_samples.Rds")
+
+# Clean up data types
+sherloc_abrasion_raw$Mineral<-as.factor(sherloc_abrasion_raw$Mineral)
+sherloc_abrasion_raw[sapply(sherloc_abrasion_raw, is.character)] <- lapply(sherloc_abrasion_raw[sapply(sherloc_abrasion_raw, is.character)],
+                                                                           as.numeric)
+# Transform NA's to 0
+sherloc_abrasion_raw <- sherloc_abrasion_raw %>% replace(is.na(.), 0)
+
+# Reformat data so that rows are "abrasions" and columns list the presence of minerals.
+# Do this by "pivoting" to a long format, and then back to the desired wide format.
+
+sherloc_long <- sherloc_abrasion_raw %>%
+  pivot_longer(!Mineral, names_to = "Name", values_to = "Presence")
+
+# Make abrasion a factor
+sherloc_long$Name <- as.factor(sherloc_long$Name)
+
+# Pivot back to wide format (one row per abrasion, one column per mineral)
+sherloc.matrix <- sherloc_long %>%
+  pivot_wider(names_from = Mineral, values_from = Presence)
+
+# Get sample information from PIXL and add to measurements -- assumes order is the same
+sherloc.df <- cbind(pixl.df[,c("sample","type","campaign","abrasion")],sherloc.matrix)
+```
+
+### Analysis: Methods and results
+
+I'll be performing logistic regression on the PIXL data. The goal of this analysis is to see whether it is possible to predict the campaign of a sample based purely on the PIXL mineral data.
+
+
+```{r, result01_analysis}
+# fit the logistic regression model (binomial GLM)
+pixl.lm <- glm(campaign ~., data=pixl.df[,c(2:14,17)], family=binomial)
+# absolute values of the first eight predictor coefficients (intercept excluded)
+coef.df <- data.frame(elements = names(pixl.lm$coefficients[2:9]),
+                      coeff = abs(pixl.lm$coefficients[2:9]))
+# bar chart of those coefficient magnitudes
+ggplot(coef.df, aes(x=elements, y=coeff)) +
+  geom_bar(stat = "identity", fill = "lightblue") +
+  coord_flip() + # Flip coordinates for horizontal bars
+  labs(title = "Logistic Regression Coefficients",
+       x = "Predictors",
+       y = "Coefficient Value") +
+  theme_minimal()
+
+# linear combination of the first eight predictors with those magnitudes, plus the intercept
+lm.results <- data.frame(campaign = as.matrix(pixl.df$campaign),
+                         linear_combinations = (as.matrix(pixl.df[,2:9]) %*% as.matrix(coef.df[,2])) + pixl.lm$coefficients[1])
+
+# plotting the results
+ggplot(lm.results) +
+  geom_point(aes(x=linear_combinations, y=campaign))
+```
+
+### Discussion of results
+
+The goal of this analysis was to see whether I could derive the campaign of any rock sample from the value obtained by multiplying the sample's PIXL mineral data by the logistic regression coefficients. If that value falls above or below a certain threshold, you could reasonably assume the sample comes from a particular campaign. Because CCA worked well previously, I expected the logistic regression model to outperform the CCA model. That did not seem to be the case: there was very little separation between the two campaigns compared with CCA.
+
+
+## Analysis: Logistic Regression & PCA
+
+### Question being asked
+
+Is it possible to use logistic regression with PCA pre-processing to determine whether there is any relationship between the amounts of a subset of minerals in the PIXL data and the campaign of a rock sample?
+
+### Data Preparation
+
+1.) Reuse the PIXL dataset that has been loaded in before
+2.) Perform PCA on the dataset
+
+```{r, result02_data}
+# perform PCA on the dataset
+pixl.pca <- prcomp(pixl.df[,2:14])
+
+# create a separate dataset holding the principal component scores
+pixl.pca.df <- data.frame(campaign = pixl.df$campaign, pixl.pca$x)
+```
+
+### Analysis: Methods and Results
+
+I'll be performing logistic regression on the principal components of the PIXL dataset. The goal of this analysis is to see whether using PCA gives better results than using the raw PIXL data for logistic regression.
+
+```{r, result02_analysis}
+# fit the logistic regression model (binomial GLM) on the principal components
+pixl.lm.pca <- glm(campaign ~., data=pixl.pca.df, family=binomial)
+# absolute values of the coefficients for the first five principal components
+coef.pca.df <- data.frame(elements = names(pixl.lm.pca$coefficients[2:6]),
+                          coeff = abs(pixl.lm.pca$coefficients[2:6]))
+
+# linear combination of the first five components with those magnitudes, plus the intercept
+lm.pca.results <- data.frame(campaign = as.matrix(pixl.df$campaign),
+                             linear_combinations = (as.matrix(pixl.pca.df[,2:6]) %*% as.matrix(coef.pca.df[,2])) + pixl.lm.pca$coefficients[1])
+#lm.pca.results["linear_combinations"] <- round(lm.pca.results["linear_combinations"], 4)
+# plotting the results
+ggplot(lm.pca.results) +
+  geom_point(aes(x=linear_combinations, y=campaign))
+```
+
+### Discussion of results
+It appears that performing PCA as a form of preprocessing does not significantly increase the predictive ability of the logistic regression model.
+
+
+## Summary and next steps
+In summary, it does not appear possible to use logistic regression to predict the campaign from the PIXL data. Next steps would be to try this with the combined PIXL and SHERLOC data and see whether it yields different results.
+
diff --git a/StudentNotebooks/Assignment06/mwatid-assignment06.pdf b/StudentNotebooks/Assignment06/mwatid-assignment06.pdf
new file mode 100644
index 0000000..b2b945b
Binary files /dev/null and b/StudentNotebooks/Assignment06/mwatid-assignment06.pdf differ
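
The notebook scores each sample by multiplying predictor values by fitted coefficients and adding the intercept. For reference, here is a minimal sketch of how that score relates to a binomial `glm` fit. It runs on simulated data, not the PIXL data, and every name in it (`sim.df`, `x1`, `x2`, `fit`, `eta`) is hypothetical. The fitted log-odds equal the intercept plus the predictors times the signed coefficients, so a score built from `abs()` of the coefficients, or from only some of them, will generally differ from the value the model itself uses to separate the classes.

```r
# Sketch on simulated data (NOT the PIXL data): two hypothetical predictors
# and a two-level outcome standing in for `campaign`.
set.seed(1)
n <- 200
sim.df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))

# Simulated truth: log-odds = 0.5 + 1.5 * x1 - 2 * x2 (note the negative coefficient)
p <- plogis(0.5 + 1.5 * sim.df$x1 - 2 * sim.df$x2)
sim.df$campaign <- factor(ifelse(rbinom(n, 1, p) == 1, "B", "A"))

# Binomial GLM: the first factor level ("A") is treated as the reference class
fit <- glm(campaign ~ ., data = sim.df, family = binomial)

# Linear predictor (log-odds of the second level) straight from the model
eta <- predict(fit, type = "link")

# The same value by hand: intercept + X %*% signed coefficients
beta <- coef(fit)
eta.manual <- beta[1] + as.matrix(sim.df[, c("x1", "x2")]) %*% beta[-1]
stopifnot(isTRUE(all.equal(as.numeric(eta), as.numeric(eta.manual))))

# Classify with a threshold of 0 on the log-odds (probability 0.5)
lv <- levels(sim.df$campaign)
pred <- ifelse(eta > 0, lv[2], lv[1])
accuracy <- mean(pred == sim.df$campaign)
print(accuracy)
```

On the notebook's objects the analogous call would presumably be `predict(pixl.lm, type = "link")`; that is an assumption about how the fitted model could be queried, not something the commit does. With very few samples and many predictors, training accuracy computed this way can look better than the model would do on new samples.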