DataINCITE · erickj4 · Sep 27, 2024 · Sep 25, 2024
diff --git a/StudentNotebooks/Assignment02/lint5-dar-f24-assignment2.Rmd b/StudentNotebooks/Assignment02/lint5-dar-f24-assignment2.Rmd
@@ -12,27 +12,37 @@ output:
 ---
 ```{r setup, include=FALSE}
 
-# Required R package installation; RUN THIS BLOCK BEFORE ATTEMPTING TO KNIT THIS NOTEBOOK!!!
-# This section  install packages if they are not already installed. 
+ #Required R package installation; RUN THIS BLOCK BEFORE ATTEMPTING TO KNIT THIS NOTEBOOK!!!
+ #This section installs packages if they are not already installed.
 # This block will not be shown in the knit file.
-knitr::opts_chunk$set(echo = TRUE)
+#knitr::opts_chunk$set(echo = TRUE)
 
 # Set the default CRAN repository
 local({r <- getOption("repos")
        r["CRAN"] <- "http://cran.r-project.org" 
        options(repos=r)
 })
 
+if (!require("kableExtra")) {
+  install.packages("kableExtra")
+  library(kableExtra)
+}
+
 if (!require("pandoc")) {
   install.packages("pandoc")
   library(pandoc)
 }
 
+if (!require("reshape2")) {
+  install.packages("reshape2")
+  library(reshape2)
+}
+
 # Required packages for M20 LIBS analysis
 if (!require("rmarkdown")) {
   install.packages("rmarkdown")
-  library(rmarkdown)
-}
+  library(rmarkdown) 
+}  
 if (!require("tidyverse")) {
   install.packages("tidyverse")
   library(tidyverse)
@@ -51,7 +61,10 @@ if (!require("pheatmap")) {
   install.packages("pheatmap")
   library(pheatmap)
 }
-
+if (!require("randomForest")) {
+  install.packages("randomForest")
+  library(randomForest)
+}
 ```
 
 # DAR ASSIGNMENT 2 (Introduction): Introductory DAR Notebook
@@ -186,14 +199,9 @@ lithology.df[sapply(lithology.df, is.character)] <-
 # Keep only first 16 samples because the data for the rest of the samples is not available yet
 lithology.df<-lithology.df[1:16,]
 
-# Look at summary of cleaned data frame
-summary(lithology.df)
-
 # Create a matrix containing only the numeric measurements.  The remaining features are metadata about the sample. 
 lithology.matrix <- sapply(lithology.df[,6:40],as.numeric)-1            
 
-# Review the structure of our matrix
-str(lithology.matrix)
 ```
 
 
@@ -209,14 +217,8 @@ pixl.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/samples_pixl_wide
 pixl.df[sapply(pixl.df, is.character)] <- lapply(pixl.df[sapply(pixl.df, is.character)], 
                                        as.factor)
 
-# Review our dataframe
-summary(pixl.df)
-
 # Make the matrix of just mineral percentage measurements
 pixl.matrix <- pixl.df[,2:14] %>% scale()
-
-# Review the structure
-str(pixl.matrix)
 ```
 
 ## Data Set C: Load the LIBS Data
@@ -235,14 +237,8 @@ libs.df <- libs.df %>%
 # Convert the points to numeric
 libs.df$point <- as.numeric(libs.df$point)
 
-# Review what we have
-summary(libs.df)
-
 # Make the a matrix contain only the libs measurements for each mineral
 libs.matrix <- as.matrix(libs.df[,6:13]) 
-
-# Check to see scaling
-str(libs.matrix)
 ```
 
 
@@ -282,116 +278,144 @@ sherloc.matrix <- sherloc_long %>%
 
 sherloc.df <- cbind(pixl.df[,c("sample","type","campaign","abrasion")],sherloc.matrix)
 
-# Review what we have
-summary(sherloc.df)
-
 # Measurements are everything except first column
 sherloc.matrix<-as.matrix(sherloc.matrix[,-1])
-
-# Sherlock measurement matrix
-# Review the structure 
-str(sherloc.matrix)
-```
-## Data Set E: PIXL + Sherloc 
-```{r}
-# Combine PIXL and SHERLOC dataframes 
-pixl_sherloc.df <- cbind(pixl.df,sherloc.df )
-
-# Review what we have
-summary(pixl_sherloc.df)
-
-# Combine PIXL and SHERLOC matrices
-pixl_sherloc.matrix<-cbind(pixl.matrix,sherloc.matrix)
-
-# Review the structure of our matrix
-str(pixl_sherloc.matrix)
-
 ```
-
 
-## Data Set F: PIXL + Lithography 
+## Data Set H: Sherloc + Lithology + PIXL 
 
-Create data and matrix from prior datasets 
+Create the data frame and matrix from the prior datasets by making the appropriate combinations.
 
 ```{r}
-# Combine our PIXL and Lithology dataframes
-pixl_lithology.df <- cbind(pixl.df,lithology.df )
-
-# Review what we have
-summary(pixl_lithology.df)
+# Combine the SHERLOC, Lithology and PIXL dataframes
+sherloc_lithology_pixl.df <- cbind(sherloc.df, lithology.df, pixl.df )
 
-# Combine PIXL and Lithology matrices
-pixl_lithology.matrix<-cbind(pixl.matrix,lithology.matrix)
+# Combine the SHERLOC, Lithology and PIXL matrices
+sherloc_lithology_pixl.matrix <- cbind(sherloc.matrix,lithology.matrix,pixl.matrix)
 
-# Review the structure
-str(pixl_lithology.matrix)
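+# pixl.matrix is already z-scored by scale() above; apply a logistic (sigmoid)
+# transform so the PIXL values fall in (0, 1), like the binary SHERLOC and Lithology data
+pixl.matrix_z <- 1 / (1 + exp(-pixl.matrix))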
 
+sherloc_lithology_pixl_z.matrix <- cbind(sherloc.matrix,lithology.matrix,pixl.matrix_z)
 ```
 
-## Data Set G: Sherloc + Lithology 
+# Analysis of Data (Part 3)
 
-Create Data and matrix from prior datasets by taking on appropriate combinations.
+Dataset H: PIXL + Sherloc + Lithology (with appropriate scaling as necessary. Not scaled yet.)
 
-```{r}
-# Combine the Lithology and SHERLOC dataframes
-sherloc_lithology.df <- cbind(sherloc.df,lithology.df )
+1. _Describe the data set contained in the data frame and matrix:_ How many rows does it have and how many features?  Which features are measurements and which features are metadata about the samples?  (3 pts)
 
-# Review what we have
-summary(sherloc_lithology.df)
+The data frame for data set H has 16 rows and 99 features.
 
-# Combine the Lithology and SHERLOC matrices
-sherloc_lithology.matrix<-cbind(sherloc.matrix,lithology.matrix)
+Measurement features: chemical oxides such as "Na20", "Mgo", "Si02", and mineral/compound phases such as "Sulfate", "Quartz", "Halite".
 
-# Review the resulting matrix
-str(sherloc_lithology.matrix)
+Metadata features: "sample", "type", "location", etc.
 
-```
-## Data Set H: Sherloc + Lithology + PIXL 
+2. _Scale this data appropriately (you can choose the scaling method or decide to not scale data):_ Explain why you chose a scaling method or to not scale. (3 pts) 
 
-Create data frame and matrix from prior datasets by making on appropriate combinations.
+I chose to transform the PIXL data but not the Sherloc or Lithology data, because only the PIXL data is non-binary. The PIXL matrix is already z-scored when it is loaded (via `scale()`); I then apply a logistic (sigmoid) transform to those z-scores so that the PIXL values fall between 0 and 1, the same range as the binary data. The Sherloc and Lithology data need no scaling because they are binary. I still run every step on both versions, with and without the logistic transform, to compare the results.
 
-```{r}
-# Combine the Lithology and SHERLOC dataframes
-sherloc_lithology_pixl.df <- cbind(sherloc.df,lithology.df, pixl.df )
+3. _Cluster the data using k-means or your favorite clustering method (like hierarchical clustering):_   Describe how you picked the best number of clusters.  Indicate the number of points in each cluster. (6 pts)
 
-# Review what we have
-summary(sherloc_lithology_pixl.df)
+I used k-means clustering and picked K from the "elbow" of the `wssplot` curve. With the PIXL data mapped to the 0 to 1 range, I chose K = 7; the cluster sizes for clusters 1 to 7 are "2 2 2 4 3 2 1". Without that transform, I chose K = 6; the cluster sizes for clusters 1 to 6 are "1 3 7 1 2 2". Comparing the two clusterings, the 7-cluster solution spreads the samples more evenly across clusters, so my final choice is K = 7.
 
-# Combine the Lithology, SHERLOC and PIXLmatrices
-sherloc_lithology_pixl.matrix<-cbind(sherloc.matrix,lithology.matrix,pixl.matrix)
 
-# Review the resulting matrix
-str(sherloc_lithology_pixl.matrix)
+```{r}
+set.seed(400)
+# insert wssplot function
+wssplot <- function(data, nc=15, seed=55){
+  wss <- data.frame(cluster=1:nc, quality=c(0))
+  for (i in 1:nc){
+    set.seed(seed)
+    wss[i,2] <- kmeans(data, centers=i)$tot.withinss}
+  ggplot(data=wss,aes(x=cluster,y=quality)) + 
+    geom_line() + 
+    ggtitle("Quality of k-means by Cluster")
+}
 
+# Apply `wssplot()` to both versions: PIXL mapped to (0, 1) first, then PIXL z-scored only
+wssplot(sherloc_lithology_pixl_z.matrix, nc=11, seed=400) 
+wssplot(sherloc_lithology_pixl.matrix, nc=11, seed=400) 
+# Use our chosen 'k' to perform k-means clustering
+set.seed(400)
+# Version 1: PIXL mapped to (0, 1) with the logistic transform; SHERLOC and Lithology unchanged
+k <- 7
+
+km <- kmeans(sherloc_lithology_pixl_z.matrix,k)
+# Version 2: PIXL z-scored only (no logistic transform)
+k1 <- 6
+km1 <- kmeans(sherloc_lithology_pixl.matrix,k1)
+# Heatmaps of the cluster centers for both versions
+pheatmap(km$centers,scale="none")
+pheatmap(km1$centers,scale="none")
+
+# Number of samples in each cluster for both versions
+cluster.df<-data.frame(cluster=1:k, size=km$size)
+kable(cluster.df,caption="Samples per cluster (k = 7)")
+cluster.df<-data.frame(cluster=1:k1, size=km1$size)
+kable(cluster.df,caption="Samples per cluster (k = 6)")
 ```
 
-# Analysis of Data (Part 3)
-
-Each team has been assigned one of six datasets:   
-
-1. Dataset B: PIXL: The PIXL team's goal is to understand and explain how scaling changes results from Assignment 1.   The matrix version was scaled above but not in Assignment 1.  
-
-2. Dataset C: LIBS (with appropriate scaling as necessary. Not scaled yet.) 
+4. _Perform a **creative analysis** that provides insights into what one or more of the clusters are and what they tell you about the MARS data:_  Alternatively, do another creative analysis of your datasets that leads to one or more findings.  Make sure to explain your analysis and discuss your results.  
 
-3. Dataset D: Sherloc (with appropriate scaling as necessary. Not scaled yet.)
+```{r}
+slp.pca<-prcomp(sherloc_lithology_pixl_z.matrix,scale=FALSE)
+ggscreeplot(slp.pca)
+summary(slp.pca)
+```
+Together, the first three components explain about 69.09% of the variance, which is most of the variance in the data.
 
-4. Dataset E: PIXL + Sherloc (with appropriate scaling as necessary. Not scaled yet.)
+```{r}
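+# Flag constant (zero-variance) columns, which prcomp() cannot rescale when scale. = TRUE
+filtered <- apply(sherloc_lithology_pixl_z.matrix, 2, function(x) var(x) == 0)
+# Drop the flagged columns before running PCA
+filtered_slp <- sherloc_lithology_pixl_z.matrix[, !filtered]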
+
+pca_result <- prcomp(filtered_slp, center = TRUE, scale. = TRUE)
+pca_data <- data.frame(PC1 = pca_result$x[,1], PC2 = pca_result$x[,2], 
+                       Cluster = factor(km$cluster))
+ggplot(pca_data, aes(x = PC1, y = PC2, color = Cluster)) +
+geom_point(size = 3) +
+labs(title = "PCA by Cluster", x = "PC1", y = "PC2") +
+theme_minimal()
+
+pca_data <- data.frame(PC1 = pca_result$x[,1], PC3 = pca_result$x[,3], 
+                       Cluster = factor(km$cluster))
+ggplot(pca_data, aes(x = PC1, y = PC3, color = Cluster)) +
+geom_point(size = 3) +
+labs(title = "PCA by Cluster", x = "PC1", y = "PC3") +
+theme_minimal()
+```
 
-5. Dataset F: PIXL + Lithography (with appropriate scaling as necessary. Not scaled yet.) 
+1. Cluster 1 (Pink):
+    - PC1 vs. PC2 plot: Cluster 1 appears as two data points clustered tightly together near (PC1 = 4, PC2 = 0). This suggests that these points have very similar features in the first two principal components.
+    - PC1 vs. PC3 plot: In the second plot, these same two points are also tightly clustered, but closer to the bottom near (PC1 = 4, PC3 = -3). This indicates a relatively consistent relationship in PC1 but a significant difference in PC3.
 
-6. Dataset G: Sherloc + Lithograpy (with appropriate scaling as necessary. Not scaled yet.)
+2. Cluster 2 (Brown):
+    - PC1 vs. PC2 plot: Cluster 2 has a single point near (PC1 = 0, PC2 = -1), which means it lies near the origin, showing moderate values in both principal components.
+    - PC1 vs. PC3 plot: The position remains relatively central around (PC1 = 0, PC3 = 0), meaning this data point doesn't exhibit much variability in either PC2 or PC3 compared to other clusters.
 
-7. Dataset H: PIXL + Sherloc + Lithograpy  (with appropriate scaling as necessary. Not scaled yet.)
+3. Cluster 3 (Green):
+    - PC1 vs. PC2 plot: Cluster 3 is further spread out along the PC1 axis, with its point near (PC1 = 5, PC2 = -2). This suggests that Cluster 3 has unique characteristics with high variance in PC1.
+    - PC1 vs. PC3 plot: This cluster stays distant from the others, with a significant value in PC3 around (PC1 = 4, PC3 = 5), emphasizing the distinct separation in both PC1 and PC3.
 
-**For the data set assigned to your team, perform the following steps.** Feel free to use the methods/code from Assignment 1 as desired.  Communicate with your teammates. Make sure that you are doing different variations of below analysis so that no team member does the exact same analysis. If you want to use the same  clustering for your team (which is okay but then vary rest), make sure you use the same random seeds. 
+4. Cluster 4 (Cyan):
+    - PC1 vs. PC2 plot: Points in Cluster 4 are somewhat scattered around (PC1 = -2 to 0, PC2 = 0 to 3). This shows moderate variability in PC2 but not much in PC1.
+    - PC1 vs. PC3 plot: The points remain moderately spread in PC3, but with a slight upward shift near (PC1 = -1 to 0, PC3 = 0 to 3). This suggests moderate variation in PC3 as well.
 
-1. _Describe the data set contained in the data frame and matrix:_ How many rows does it have and how many features?  Which features are measurements and which features are metadata about the samples?  (3 pts)
+5. Cluster 5 (Blue):
+    - PC1 vs. PC2 plot: Cluster 5 shows several points dispersed in the upper-left area, between (PC1 = -4 to -3, PC2 = 4 to 6). This wide spread in both PC1 and PC2 suggests considerable variation.
+    - PC1 vs. PC3 plot: The blue points are spread out between (PC1 = -4 to -2, PC3 = 3 to 4), indicating variance in both PC1 and PC3, though less so in PC3.
 
-2. _Scale this data appropriately (you can choose the scaling method or decide to not scale data):_ Explain why you chose  a scaling method or to not scale. (3 pts) 
+6. Cluster 6 (Purple):
+    - PC1 vs. PC2 plot: Cluster 6 is an outlier in the bottom-left quadrant with a point near (PC1 = -5, PC2 = -5). This suggests it is very distinct in both PC1 and PC2.
+    - PC1 vs. PC3 plot: This point is also an outlier on the lower side with (PC1 = -5, PC3 = -3), reinforcing that Cluster 6 is unique in all three components.
 
-3. _Cluster the data using k-means or your favorite clustering method (like hierarchical clustering):_   Describe how you picked the best number of clusters.  Indicate the number of points in each clusters. Coordinate with your team so you try different approaches.   If you want to share results with your team mates, make sure to use the same random seeds.  (6 pts)
+7. Cluster 7 (Red):
+    - PC1 vs. PC2 plot: Cluster 7 lies close to Cluster 1, near (PC1 = 3, PC2 = 0), indicating a similarity in PC1 but with slight variation in PC2.
+    - PC1 vs. PC3 plot: However, the separation becomes more visible here, as it is positioned around (PC1 = 3, PC3 = -2), suggesting divergence in PC3.
 
-4. _Perform a **creative analysis** that provides insights into what one or more of the clusters are and what they tell you about the MARS data:  Alternatively do another creative analysis of your datasets that leads to one of more findings.  Make sure to explain what your analysis and discuss your the results.  
+Summary:
+
+- Clusters 5 (blue) and 6 (purple) exhibit the most distinct separation from the other clusters, with high variability in both PC1 and PC2, as well as in PC3.
+- Clusters 1 (pink) and 7 (red) are closely clustered, with minimal separation in PC1, and their difference becomes more apparent in PC3.
+- Clusters 2 (brown) and 4 (cyan) occupy more central positions in the PCA plots, showing less variation compared to the other clusters.
 
 
 # Preparation of Team Presentation (Part 4) 

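Note on the scaling step in Data Set H above: the notebook applies `1 / (1 + exp(-x))` to a matrix that `scale()` has already z-scored. The sketch below illustrates, on made-up numbers, how z-scoring, the logistic transform, and min-max scaling differ in the range they produce. The matrix `toy` and its column names are hypothetical stand-ins, not the course's PIXL data, and min-max scaling is shown only as a point of comparison; it is not something the notebook uses.

```r
# Toy stand-in for a measurement matrix (hypothetical values, not the PIXL data)
set.seed(1)
toy <- matrix(rexp(16 * 3, rate = 0.2), nrow = 16,
              dimnames = list(NULL, c("oxide_a", "oxide_b", "oxide_c")))

# Z-score: centre each column to mean 0 and rescale to standard deviation 1
toy_z <- scale(toy)

# Logistic (sigmoid) transform of the z-scores: every value lands inside (0, 1)
toy_sigmoid <- 1 / (1 + exp(-toy_z))

# stats::plogis() computes the same function
stopifnot(isTRUE(all.equal(toy_sigmoid, plogis(toy_z), check.attributes = FALSE)))

# Min-max scaling, column by column: another way to reach the range [0, 1]
toy_minmax <- apply(toy, 2, function(x) (x - min(x)) / (max(x) - min(x)))

range(toy_z)        # z-scores are not confined to 0..1
range(toy_sigmoid)  # strictly between 0 and 1
range(toy_minmax)   # from 0 to 1
```

Z-scores alone include negative values, so a transform such as the logistic one is what actually brings the values into the same 0 to 1 range as binary columns.
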
diff --git a/StudentNotebooks/Assignment02/lint5-dar-f24-assignment2.pdf b/StudentNotebooks/Assignment02/lint5-dar-f24-assignment2.pdf