Assignment 05, tested Logistic Regression #175

Merged
merged 1 commit into from Nov 13, 2024
177 changes: 177 additions & 0 deletions StudentNotebooks/Assignment06/mwatid-assignment06.Rmd
@@ -0,0 +1,177 @@
---
title: "DAR F24 Project Status Notebook Template"
author: "Dante Mwatibo"
date: "`r Sys.Date()`"
output:
pdf_document:
toc: yes
html_document:
toc: yes
subtitle: "Mars"
---
## Weekly Work Summary

* RCS ID: mwatid
* Project Name: Mars
* Summary of work since last week

Ran logistic regression on the PIXL data to try to predict each sample's campaign

* Summary of github issues added and worked

* Logistic Regression on PIXL

* Summary of github commits

* branch: dar-mwatid
* commit links:

* List of presentations, papers, or other outputs

* N/A

* List of references (if necessary) N/A
* Indicate any use of group shared code base N/A
* Indicate which parts of your described work were done by you or as part of joint efforts N/A

## Personal Contribution

* Clearly defined, unique contribution(s) done by you: code, ideas, writing... All code and analysis done by myself
* Include github issues you've addressed if any: Logistic Regression on PIXL #147d

## Analysis: Logistic Regression (PIXL)

### Question being asked

Is it possible to use logistic regression to determine whether there is any relationship between the amounts of a subset of minerals in PIXL and the campaign of a rock sample?

### Data Preparation

The data I will be using for this analysis is a subset of minerals in the PIXL data.

1.) Load in the PIXL and SHERLOC data
2.) Scale the PIXL data
```{r, result01_data}
# loading the proper libraries
library(ggplot2)
library(ggtern)
library(magrittr)
library(dplyr)
library(tidyr)
library(CCA)
# Load the saved PIXL data with locations added
pixl.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/samples_pixl_wide.Rds")
# Convert to factors
pixl.df[sapply(pixl.df, is.character)] <- lapply(pixl.df[sapply(pixl.df, is.character)], as.factor)
# Make the matrix of just mineral percentage measurements
pixl.matrix <- pixl.df[,2:14] %>% scale()
## LOADING IN THE SHERLOC DATA
# Read in data as provided.
sherloc_abrasion_raw <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/abrasions_sherloc_samples.Rds")
# Clean up data types
sherloc_abrasion_raw$Mineral<-as.factor(sherloc_abrasion_raw$Mineral)
sherloc_abrasion_raw[sapply(sherloc_abrasion_raw, is.character)] <- lapply(sherloc_abrasion_raw[sapply(sherloc_abrasion_raw, is.character)],
as.numeric)
# Transform NA's to 0
sherloc_abrasion_raw <- sherloc_abrasion_raw %>% replace(is.na(.), 0)
# Reformat data so that rows are "abrasions" and columns list the presence of minerals.
# Do this by "pivoting" to a long format, and then back to the desired wide format.
sherloc_long <- sherloc_abrasion_raw %>%
pivot_longer(!Mineral, names_to = "Name", values_to = "Presence")
# Make abrasion a factor
sherloc_long$Name <- as.factor(sherloc_long$Name)
# Pivot back to wide format: one row per abrasion, one column per mineral
sherloc.matrix <- sherloc_long %>%
pivot_wider(names_from = Mineral, values_from = Presence)
# Get sample information from PIXL and add to measurements -- assumes order is the same
sherloc.df <- cbind(pixl.df[,c("sample","type","campaign","abrasion")],sherloc.matrix)
```
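
The long-then-wide pivot used for the SHERLOC table is effectively a transpose that keeps the labels. A minimal sketch on a made-up table follows; the mineral names, abrasion names, and values are invented for illustration and are not taken from the dataset.

```r
library(tidyr)

# Hypothetical input: one row per mineral, one column per abrasion
toy <- data.frame(
  Mineral   = c("Olivine", "Pyroxene"),
  Abrasion1 = c(1, 0),
  Abrasion2 = c(NA, 1)
)

# Treat missing detections as absent, as the notebook does
toy[is.na(toy)] <- 0

# Long format: one row per (mineral, abrasion) pair
toy_long <- pivot_longer(toy, !Mineral, names_to = "Name", values_to = "Presence")

# Wide format: one row per abrasion, one column per mineral
toy_wide <- pivot_wider(toy_long, names_from = Mineral, values_from = Presence)
print(toy_wide)
```

Binding the wide table to the sample information with `cbind()` only works if both tables list the abrasions in the same order, so checking `toy_wide$Name` against the sample table before binding is a cheap safeguard.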

### Analysis: Methods and results

I'll be performing Logistic Regression using the PIXL data. The goal of this analysis is to see whether or not it is possible to predict the campaign of a sample based purely on the PIXL mineral data.


```{r, result01_analysis}
# fit the logistic regression model (binomial GLM) for campaign
pixl.lm <- glm(campaign ~., data=pixl.df[,c(2:14,17)], family=binomial)
# absolute values of the coefficients for the first eight predictors
coef.df <- data.frame(elements = names(pixl.lm$coefficients[2:9]),
                      coeff = abs(pixl.lm$coefficients[2:9]))
# bar chart of the coefficient magnitudes
ggplot(coef.df, aes(x=elements, y=coeff)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  coord_flip() + # Flip coordinates for horizontal bars
  labs(title = "Logistic Regression Coefficients",
       x = "Predictors",
       y = "Absolute Coefficient Value") +
  theme_minimal()
# linear combination of the first eight predictors with those magnitudes, plus the intercept
lm.results <- data.frame(campaign = as.matrix(pixl.df$campaign),
                         linear_combinations = (as.matrix(pixl.df[,2:9]) %*% as.matrix(coef.df[,2])) + pixl.lm$coefficients[1])
# plot the linear combination against the campaign
ggplot(lm.results) +
  geom_point(aes(x=linear_combinations, y=campaign))
```
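
For comparison, the linear predictor of a fitted logistic regression is normally built from the signed coefficients of all predictors, which is what `predict(..., type = "link")` returns. The sketch below checks that equivalence on simulated data; the data frame `toy`, its columns, and the simulated effect sizes are assumptions for illustration, not the PIXL data.

```r
set.seed(1)

# Simulated two-class data standing in for mineral measurements
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$campaign <- factor(
  ifelse(rbinom(n, 1, plogis(0.8 * toy$x1 - 0.5 * toy$x2)) == 1, "B", "A")
)

# Binomial GLM: models the log-odds of the second factor level ("B")
fit <- glm(campaign ~ ., data = toy, family = binomial)
beta <- coef(fit)

# Manual linear predictor: intercept plus signed coefficients times predictors
manual <- as.vector(as.matrix(toy[, c("x1", "x2")]) %*% beta[-1] + beta[1])

# The same quantity as computed by predict()
link <- as.vector(predict(fit, type = "link"))

print(all.equal(manual, link))
```

Taking absolute values of the coefficients before the matrix product gives a different quantity whenever a coefficient is negative, so the signed version is the one that corresponds to the model's log-odds.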

### Discussion of results

The goal of this analysis was to see whether the campaign of any rock sample could be derived from a single value: the matrix product of the sample's PIXL mineral data and the logistic regression coefficients. If that value falls above or below a certain threshold, the sample could reasonably be assumed to come from a particular campaign. Because CCA worked well previously, I expected the logistic regression model to outperform it; however, that did not seem to be the case, as there was very little separation between the two campaigns compared with CCA.
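
The threshold idea can be made concrete: a linear predictor above 0 corresponds to a predicted probability above 0.5 for the second factor level. A minimal sketch on simulated data (the data, the class labels "A" and "B", and the cutoff of 0 are assumptions for illustration) tabulates how well such a rule separates two classes.

```r
set.seed(3)

# Simulated two-class data standing in for the campaign labels
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$campaign <- factor(
  ifelse(rbinom(n, 1, plogis(1.5 * toy$x1 - toy$x2)) == 1, "B", "A")
)

fit <- glm(campaign ~ ., data = toy, family = binomial)

# Log-odds and probability of the second level ("B") for each sample
link <- predict(fit, type = "link")
prob <- predict(fit, type = "response")

# Classify by thresholding the linear predictor at 0
predicted <- factor(ifelse(link > 0, "B", "A"), levels = levels(toy$campaign))

# Confusion table and in-sample accuracy
confusion <- table(actual = toy$campaign, predicted = predicted)
accuracy <- sum(diag(confusion)) / n
print(confusion)
print(accuracy)
```

A confusion table of this kind gives a number to compare against the CCA result, rather than judging separation from a scatter plot alone. With a small number of samples, in-sample accuracy is optimistic, so a held-out or cross-validated estimate would be more trustworthy.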


## Analysis: Logistic Regression & PCA

### Question being asked

Is it possible to use logistic regression to determine whether there is any relationship between the amounts of a subset of minerals in PIXL, after PCA pre-processing, and the campaign of a rock sample?

### Data Preparation

1.) Reuse the PIXL dataset that has been loaded in before
2.) Perform PCA on the dataset

```{r, result02_data}
# perform PCA on the dataset
pixl.pca <- prcomp(pixl.df[,2:14])
# create a separate dataset holding the principal component scores
pixl.pca.df <- data.frame(campaign = pixl.df$campaign, pixl.pca$x)
```
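
As a rough illustration of what `prcomp()` returns, the sketch below runs PCA on simulated columns with very different spreads and inspects the share of variance per component. The data and column names are invented, and `scale. = TRUE` is shown as one option for columns on different scales, not as the setting used in this notebook.

```r
set.seed(2)

# Simulated measurements; column "a" has a much larger spread than the others
toy <- data.frame(a = rnorm(50, sd = 10), b = rnorm(50), c = rnorm(50))

# Centre and scale each column before rotating
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of total variance captured by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# The scores in pca$x are the scaled data multiplied by the rotation matrix
manual_scores <- scale(toy) %*% pca$rotation
print(all.equal(unname(manual_scores), unname(pca$x)))
```

Without scaling, a column with a large variance dominates the first component, which matters when the measurements span very different ranges.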

### Analysis: Methods and Results

I'll be performing logistic regression on the principal components of the PIXL dataset. The goal of this analysis is to see whether using PCA gives better results than using the raw PIXL data for logistic regression.

```{r, result02_analysis}
# fit the logistic regression model (binomial GLM) on the principal component scores
pixl.lm.pca <- glm(campaign ~., data=pixl.pca.df, family=binomial)
# absolute values of the coefficients for the first five principal components
coef.pca.df <- data.frame(elements = names(pixl.lm.pca$coefficients[2:6]),
                          coeff = abs(pixl.lm.pca$coefficients[2:6]))
# linear combination of the first five components with those magnitudes, plus the intercept
lm.pca.results <- data.frame(campaign = as.matrix(pixl.df$campaign),
                             linear_combinations = (as.matrix(pixl.pca.df[,2:6]) %*% as.matrix(coef.pca.df[,2])) + pixl.lm.pca$coefficients[1])
#lm.pca.results["linear_combinations"] <- round(lm.pca.results["linear_combinations"], 4)
# plot the linear combination against the campaign
ggplot(lm.pca.results) +
  geom_point(aes(x=linear_combinations, y=campaign))
```
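
One way to use PCA as pre-processing is to keep only the leading components before fitting, so the model has fewer parameters than the raw data would require. The following is a minimal sketch on simulated data; the choice `k = 2`, the column names, and the data are assumptions for illustration.

```r
set.seed(4)

# Simulated measurements and a two-level campaign label
n <- 150
toy <- data.frame(m1 = rnorm(n), m2 = rnorm(n), m3 = rnorm(n), m4 = rnorm(n))
toy_campaign <- factor(
  ifelse(rbinom(n, 1, plogis(toy$m1 - toy$m2)) == 1, "B", "A")
)

# PCA on the centred and scaled measurements
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Keep only the first k principal component scores as predictors
k <- 2
pc_df <- data.frame(campaign = toy_campaign, pca$x[, 1:k])

# Logistic regression on the reduced set of components
fit <- glm(campaign ~ ., data = pc_df, family = binomial)

# Signed linear predictor (log-odds of "B") for each sample
link <- predict(fit, type = "link")
print(coef(fit))
```

Fitting on all components is a rotation of the original predictors and so carries the same information as the raw data; any benefit from PCA comes from dropping the trailing components.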

### Discussion of results
It appears that performing PCA as a form of preprocessing does not significantly increase the predictive ability of the logistic regression model.


## Summary and next steps
In summary, it does not appear possible to use logistic regression to predict the campaign from the PIXL data. The next step would be to try this with the combined PIXL and SHERLOC data and see whether it yields different results.

Binary file not shown.