assignment02 #115

merged 1 commit into from Sep 27, 2024
218 changes: 121 additions & 97 deletions StudentNotebooks/Assignment02/lint5-dar-f24-assignment2.Rmd
@@ -12,27 +12,37 @@ output:
---
```{r setup, include=FALSE}
# Required R package installation; RUN THIS BLOCK BEFORE ATTEMPTING TO KNIT THIS NOTEBOOK!!!
# This section installs packages if they are not already installed.
#Required R package installation; RUN THIS BLOCK BEFORE ATTEMPTING TO KNIT THIS NOTEBOOK!!!
#This section installs packages if they are not already installed.
# This block will not be shown in the knit file.
knitr::opts_chunk$set(echo = TRUE)
#knitr::opts_chunk$set(echo = TRUE)
# Set the default CRAN repository
local({r <- getOption("repos")
r["CRAN"] <- "http://cran.r-project.org"
options(repos=r)
})
if (!require("kableExtra")) {
install.packages("kableExtra")
library(kableExtra)
}
if (!require("pandoc")) {
install.packages("pandoc")
library(pandoc)
}
if (!require("reshape2")) {
install.packages("reshape2")
library(reshape2)
}
# Required packages for M20 LIBS analysis
if (!require("rmarkdown")) {
install.packages("rmarkdown")
library(rmarkdown)
}
library(rmarkdown)
}
if (!require("tidyverse")) {
install.packages("tidyverse")
library(tidyverse)
@@ -51,7 +61,10 @@ if (!require("pheatmap")) {
install.packages("pheatmap")
library(pheatmap)
}
if (!require("randomForest")) {
install.packages("randomForest")
library(randomForest)
}
```

# DAR ASSIGNMENT 2 (Introduction): Introductory DAR Notebook
@@ -186,14 +199,9 @@ lithology.df[sapply(lithology.df, is.character)] <-
# Keep only first 16 samples because the data for the rest of the samples is not available yet
lithology.df<-lithology.df[1:16,]
# Look at summary of cleaned data frame
summary(lithology.df)
# Create a matrix containing only the numeric measurements. The remaining features are metadata about the sample.
lithology.matrix <- sapply(lithology.df[,6:40],as.numeric)-1
# Review the structure of our matrix
str(lithology.matrix)
```


@@ -209,14 +217,8 @@ pixl.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/samples_pixl_wide
pixl.df[sapply(pixl.df, is.character)] <- lapply(pixl.df[sapply(pixl.df, is.character)],
as.factor)
# Review our dataframe
summary(pixl.df)
# Make the matrix of just mineral percentage measurements
pixl.matrix <- pixl.df[,2:14] %>% scale()
# Review the structure
str(pixl.matrix)
```

## Data Set C: Load the LIBS Data
@@ -235,14 +237,8 @@ libs.df <- libs.df %>%
# Convert the points to numeric
libs.df$point <- as.numeric(libs.df$point)
# Review what we have
summary(libs.df)
# Make a matrix containing only the LIBS measurements for each mineral
libs.matrix <- as.matrix(libs.df[,6:13])
# Check to see scaling
str(libs.matrix)
```


@@ -282,116 +278,144 @@ sherloc.matrix <- sherloc_long %>%
sherloc.df <- cbind(pixl.df[,c("sample","type","campaign","abrasion")],sherloc.matrix)
# Review what we have
summary(sherloc.df)
# Measurements are everything except first column
sherloc.matrix<-as.matrix(sherloc.matrix[,-1])
# SHERLOC measurement matrix
# Review the structure
str(sherloc.matrix)
```
## Data Set E: PIXL + Sherloc
```{r}
# Combine PIXL and SHERLOC dataframes
pixl_sherloc.df <- cbind(pixl.df,sherloc.df )
# Review what we have
summary(pixl_sherloc.df)
# Combine PIXL and SHERLOC matrices
pixl_sherloc.matrix<-cbind(pixl.matrix,sherloc.matrix)
# Review the structure of our matrix
str(pixl_sherloc.matrix)
```


## Data Set F: PIXL + Lithology
## Data Set H: Sherloc + Lithology + PIXL

Create data and matrix from prior datasets
Create the data frame and matrix from prior datasets by making the appropriate combinations.

```{r}
# Combine our PIXL and Lithology dataframes
pixl_lithology.df <- cbind(pixl.df,lithology.df )
# Review what we have
summary(pixl_lithology.df)
# Combine the SHERLOC, Lithology and PIXL dataframes
sherloc_lithology_pixl.df <- cbind(sherloc.df, lithology.df, pixl.df )
# Combine PIXL and Lithology matrices
pixl_lithology.matrix<-cbind(pixl.matrix,lithology.matrix)
# Combine the SHERLOC, Lithology and PIXL matrices
sherloc_lithology_pixl.matrix <- cbind(sherloc.matrix,lithology.matrix,pixl.matrix)
# Review the structure
str(pixl_lithology.matrix)
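# pixl.matrix is already z-scored by scale() above; apply a logistic (sigmoid)
# transform so the PIXL values lie in (0, 1), like the binary Sherloc/Lithology data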
pixl.matrix_z <- 1 / (1+ exp(-pixl.matrix))
sherloc_lithology_pixl_z.matrix <- cbind(sherloc.matrix,lithology.matrix,pixl.matrix_z)
```

## Data Set G: Sherloc + Lithology
# Analysis of Data (Part 3)

Create Data and matrix from prior datasets by taking on appropriate combinations.
Dataset H: PIXL + Sherloc + Lithology (with appropriate scaling as necessary. Not scaled yet.)

```{r}
# Combine the Lithology and SHERLOC dataframes
sherloc_lithology.df <- cbind(sherloc.df,lithology.df )
1. _Describe the data set contained in the data frame and matrix:_ How many rows does it have and how many features? Which features are measurements and which features are metadata about the samples? (3 pts)

# Review what we have
summary(sherloc_lithology.df)
The data frame for data set H has 16 rows and 99 features.

# Combine the Lithology and SHERLOC matrices
sherloc_lithology.matrix<-cbind(sherloc.matrix,lithology.matrix)
Measurement features: chemical oxides such as "Na20", "Mgo", "Si02", etc., and mineral/compound phases such as "Sulfate", "Quartz", "Halite", etc.

# Review the resulting matrix
str(sherloc_lithology.matrix)
Metadata features: "sample", "type", "location", etc.

```
## Data Set H: Sherloc + Lithology + PIXL
2. _Scale this data appropriately (you can choose the scaling method or decide to not scale data):_ Explain why you chose a scaling method or to not scale. (3 pts)

Create the data frame and matrix from prior datasets by making the appropriate combinations.
I chose to scale the PIXL data but not the Sherloc or Lithology data, because only the PIXL data is non-binary. The PIXL matrix is z-scored with `scale()` and then passed through a logistic function, 1 / (1 + exp(-x)), so that its values fall between 0 and 1 and match the range of the binary data. The Sherloc and Lithology data need no scaling since they are binary. I still run the whole process on both versions of the matrix, without and with this extra PIXL transformation, to compare the results.
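
As a hedged illustration of the scaling step described above, the sketch below contrasts z-scoring with a logistic transform of the z-scores. The matrix `toy` is made-up data standing in for the PIXL measurements (an assumption, not the real data), and the variable names are hypothetical.

```r
# Illustrative sketch only: `toy` is synthetic, not the PIXL measurements.
set.seed(1)
toy <- matrix(rnorm(16 * 3, mean = 10, sd = 4), nrow = 16,
              dimnames = list(NULL, c("a", "b", "c")))

# Z-score scaling: each column ends up with mean 0 and standard deviation 1.
toy_z <- scale(toy)

# Logistic (sigmoid) transform of the z-scores: every value lands in (0, 1).
toy_logistic <- 1 / (1 + exp(-toy_z))

range(toy_z)         # extends below 0, so z-scores alone are not in [0, 1]
range(toy_logistic)  # strictly between 0 and 1
```

Z-scores alone are centered at 0 and take negative values, so the mapping into (0, 1) comes from the logistic step rather than from the z-scoring.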

```{r}
# Combine the Lithology and SHERLOC dataframes
sherloc_lithology_pixl.df <- cbind(sherloc.df,lithology.df, pixl.df )
3. _Cluster the data using k-means or your favorite clustering method (like hierarchical clustering):_ Describe how you picked the best number of clusters. Indicate the number of points in each clusters. (6 pts)

# Review what we have
summary(sherloc_lithology_pixl.df)
I used k-means clustering and picked K from the "elbow" of the `wssplot` curve. With the scaled data, I chose K = 7; the numbers of points in clusters 1 to 7 are "2 2 2 4 3 2 1". For the data set without the scaling, I chose K = 6; the numbers of points in clusters 1 to 6 are "1 3 7 1 2 2". Comparing the two clusterings, the one with 7 clusters spreads the samples more evenly across groups, so the final choice is K = 7.

# Combine the Lithology, SHERLOC and PIXL matrices
sherloc_lithology_pixl.matrix<-cbind(sherloc.matrix,lithology.matrix,pixl.matrix)

# Review the resulting matrix
str(sherloc_lithology_pixl.matrix)
```{r}
set.seed(400)
# insert wssplot function
wssplot <- function(data, nc=15, seed=55){
wss <- data.frame(cluster=1:nc, quality=c(0))
for (i in 1:nc){
set.seed(seed)
wss[i,2] <- kmeans(data, centers=i)$tot.withinss}
ggplot(data=wss,aes(x=cluster,y=quality)) +
geom_line() +
ggtitle("Quality of k-means by Cluster")
}
# Apply `wssplot()` to our data (z-score on PIXL only)
wssplot(sherloc_lithology_pixl_z.matrix, nc=11, seed=400)
wssplot(sherloc_lithology_pixl.matrix, nc=11, seed=400)
# Use our chosen 'k' to perform k-means clustering
set.seed(400)
# PIXL z-scored and logistic-transformed; Sherloc and Lithology left as binary
k <- 7
km <- kmeans(sherloc_lithology_pixl_z.matrix,k)
# Same matrix without the extra logistic transform on PIXL
k1 <- 6
km1 <- kmeans(sherloc_lithology_pixl.matrix,k1)
# Heatmaps of the cluster centers for both clusterings
pheatmap(km$centers,scale="none")
pheatmap(km1$centers,scale="none")
# Cluster sizes for both clusterings
cluster.df<-data.frame(cluster=1:7, size=km$size)
kable(cluster.df,caption="Samples per cluster")
cluster.df<-data.frame(cluster=1:6, size=km1$size)
kable(cluster.df,caption="Samples per cluster")
```
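
The elbow choice can also be checked numerically instead of by eye. The following is a minimal sketch on synthetic data with two planted groups (an assumption; `toy`, `wss`, and `rel_drop` are hypothetical names, not objects from this notebook). It uses `nstart` so that each k-means fit is the best of several random starts, which makes the curve less sensitive to a single seed.

```r
# Illustrative sketch only: synthetic data with two well-separated groups.
set.seed(400)
toy <- rbind(matrix(rnorm(8 * 5, mean = 0), nrow = 8),
             matrix(rnorm(8 * 5, mean = 6), nrow = 8))

# Total within-cluster sum of squares for k = 1..6, best of 25 random starts each.
wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

# Relative drop in WSS from k to k + 1; a large drop followed by small ones
# marks the elbow.
rel_drop <- -diff(wss) / wss[-length(wss)]
round(rel_drop, 2)

# Cluster sizes for the chosen k, tabulated from the assignments.
km_toy <- kmeans(toy, centers = 2, nstart = 25)
table(km_toy$cluster)
```

On real data the drops are rarely this clear-cut, so the table of relative drops is a complement to the plot, not a replacement for judgment.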

# Analysis of Data (Part 3)

Each team has been assigned one of six datasets:

1. Dataset B: PIXL: The PIXL team's goal is to understand and explain how scaling changes results from Assignment 1. The matrix version was scaled above but not in Assignment 1.

2. Dataset C: LIBS (with appropriate scaling as necessary. Not scaled yet.)
4. _Perform a **creative analysis** that provides insights into what one or more of the clusters are and what they tell you about the MARS data:_ Alternatively, do another creative analysis of your datasets that leads to one or more findings. Make sure to explain your analysis and discuss your results.

3. Dataset D: Sherloc (with appropriate scaling as necessary. Not scaled yet.)
```{r}
slp.pca<-prcomp(sherloc_lithology_pixl_z.matrix,scale=FALSE)
ggscreeplot(slp.pca)
summary(slp.pca)
```
Together, the first three components explain about 69.09% of the variance, which is most of the variance in the data.
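
The cumulative figure reported by `summary()` can be reproduced directly from the component standard deviations. This is a small sketch on synthetic data (an assumption; `toy` and the derived names are hypothetical), showing the arithmetic behind a "first three components" percentage.

```r
# Illustrative sketch only: synthetic data, not the Mars matrices.
set.seed(2)
toy <- matrix(rnorm(16 * 6), nrow = 16)
toy_pca <- prcomp(toy, center = TRUE, scale. = FALSE)

# Variance of each component is its squared standard deviation.
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
cum_var <- cumsum(var_explained)

# Cumulative share of variance for the first three components.
round(cum_var[1:3], 4)

# Smallest number of components that reaches at least 70% of the variance.
which(cum_var >= 0.70)[1]
```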

4. Dataset E: PIXL + Sherloc (with appropriate scaling as necessary. Not scaled yet.)
```{r}
filtered <- apply(sherloc_lithology_pixl_z.matrix, 2, function(x) var(x) == 0)
filtered_slp <- sherloc_lithology_pixl_z.matrix[, !filtered]
pca_result <- prcomp(filtered_slp, center = TRUE, scale. = TRUE)
pca_data <- data.frame(PC1 = pca_result$x[,1], PC2 = pca_result$x[,2],
Cluster = factor(km$cluster))
ggplot(pca_data, aes(x = PC1, y = PC2, color = Cluster)) +
geom_point(size = 3) +
labs(title = "PCA by Cluster", x = "PC1", y = "PC2") +
theme_minimal()
pca_data <- data.frame(PC1 = pca_result$x[,1], PC3 = pca_result$x[,3],
Cluster = factor(km$cluster))
ggplot(pca_data, aes(x = PC1, y = PC3, color = Cluster)) +
geom_point(size = 3) +
labs(title = "PCA by Cluster", x = "PC1", y = "PC3") +
theme_minimal()
```

5. Dataset F: PIXL + Lithology (with appropriate scaling as necessary. Not scaled yet.)
1. Cluster 1 (Pink):
PC1 vs. PC2 plot: Cluster 1 appears as two data points clustered tightly together near (PC1 = 4, PC2 = 0). This suggests that these points have very similar features in the first two principal components.
PC1 vs. PC3 plot: In the second plot, these same two points are also tightly clustered, but closer to the bottom near (PC1 = 4, PC3 = -3). This indicates a relatively consistent relationship in PC1 but a significant difference in PC3.

6. Dataset G: Sherloc + Lithology (with appropriate scaling as necessary. Not scaled yet.)
2. Cluster 2 (Brown):
PC1 vs. PC2 plot: Cluster 2 has a single point near (PC1 = 0, PC2 = -1), which means it lies near the origin, showing moderate values in both principal components.
PC1 vs. PC3 plot: The position remains relatively central around (PC1 = 0, PC3 = 0), meaning this data point doesn't exhibit much variability in either PC2 or PC3 compared to other clusters.

7. Dataset H: PIXL + Sherloc + Lithology (with appropriate scaling as necessary. Not scaled yet.)
3. Cluster 3 (Green):
PC1 vs. PC2 plot: Cluster 3 is further spread out along the PC1 axis, with its point near (PC1 = 5, PC2 = -2). This suggests that Cluster 3 has unique characteristics with high variance in PC1.
PC1 vs. PC3 plot: This cluster stays distant from the others, with a significant value in PC3 around (PC1 = 4, PC3 = 5), emphasizing the distinct separation in both PC1 and PC3.

**For the data set assigned to your team, perform the following steps.** Feel free to use the methods/code from Assignment 1 as desired. Communicate with your teammates. Make sure that you are doing different variations of below analysis so that no team member does the exact same analysis. If you want to use the same clustering for your team (which is okay but then vary rest), make sure you use the same random seeds.
4. Cluster 4 (Cyan):
PC1 vs. PC2 plot: Points in Cluster 4 are somewhat scattered around (PC1 = -2 to 0, PC2 = 0 to 3). This shows moderate variability in PC2 but not much in PC1.
PC1 vs. PC3 plot: The points remain moderately spread in PC3, but with a slight upward shift near (PC1 = -1 to 0, PC3 = 0 to 3). This suggests moderate variation in PC3 as well.

1. _Describe the data set contained in the data frame and matrix:_ How many rows does it have and how many features? Which features are measurements and which features are metadata about the samples? (3 pts)
5. Cluster 5 (Blue):
PC1 vs. PC2 plot: Cluster 5 shows several points dispersed in the upper-left area, between (PC1 = -4 to -3, PC2 = 4 to 6). This wide spread in both PC1 and PC2 suggests considerable variation.
PC1 vs. PC3 plot: The blue points are spread out between (PC1 = -4 to -2, PC3 = 3 to 4), indicating variance in both PC1 and PC3, though less so in PC3.

2. _Scale this data appropriately (you can choose the scaling method or decide to not scale data):_ Explain why you chose a scaling method or to not scale. (3 pts)
6. Cluster 6 (Purple):
PC1 vs. PC2 plot: Cluster 6 is an outlier in the bottom-left quadrant with a point near (PC1 = -5, PC2 = -5). This suggests it is very distinct in both PC1 and PC2.
PC1 vs. PC3 plot: This point is also an outlier on the lower side with (PC1 = -5, PC3 = -3), reinforcing that Cluster 6 is unique in all three components.

3. _Cluster the data using k-means or your favorite clustering method (like hierarchical clustering):_ Describe how you picked the best number of clusters. Indicate the number of points in each clusters. Coordinate with your team so you try different approaches. If you want to share results with your team mates, make sure to use the same random seeds. (6 pts)
7. Cluster 7 (Red):
PC1 vs. PC2 plot: Cluster 7 lies close to Cluster 1, near (PC1 = 3, PC2 = 0), indicating a similarity in PC1 but with slight variation in PC2.
PC1 vs. PC3 plot: However, the separation becomes more visible here, as it’s positioned around (PC1 = 3, PC3 = -2), suggesting divergence in PC3.

4. _Perform a **creative analysis** that provides insights into what one or more of the clusters are and what they tell you about the MARS data:_ Alternatively, do another creative analysis of your datasets that leads to one or more findings. Make sure to explain your analysis and discuss your results.
Summary:
Clusters 5 (blue) and 6 (purple) exhibit the most distinct separation from the other clusters, with high variability in both PC1 and PC2, as well as in PC3.
Clusters 1 (pink) and 7 (red) are closely clustered, with minimal separation in PC1, and their difference becomes more apparent in PC3.
Clusters 2 (brown) and 4 (cyan) occupy more central positions in the PCA plots, showing less variation compared to the other clusters.
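
Cluster positions read off the plots can be backed by a table of cluster centroids in principal-component space. The sketch below shows one way to compute such a table; it uses synthetic data (an assumption), and `toy`, `scores`, and `centroids` are hypothetical names.

```r
# Illustrative sketch only: synthetic data with two planted groups.
set.seed(3)
toy <- rbind(matrix(rnorm(8 * 4, mean = 0), nrow = 8),
             matrix(rnorm(8 * 4, mean = 5), nrow = 8))

toy_km  <- kmeans(toy, centers = 2, nstart = 25)
toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# PC scores for the first three components, labelled by k-means cluster.
scores <- data.frame(toy_pca$x[, 1:3], Cluster = factor(toy_km$cluster))

# Mean PC1, PC2 and PC3 per cluster: one row per cluster.
centroids <- aggregate(cbind(PC1, PC2, PC3) ~ Cluster, data = scores, FUN = mean)
centroids
```

A centroid table like this makes statements such as "near (PC1 = 4, PC2 = 0)" checkable against exact numbers. Note that a cluster with a single sample has a centroid but no meaningful spread.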


# Preparation of Team Presentation (Part 4)
Binary file modified StudentNotebooks/Assignment02/lint5-dar-f24-assignment2.pdf
Binary file not shown.