assignment two has been completed I believe #76

Merged
merged 1 commit into from Sep 11, 2024
89 changes: 85 additions & 4 deletions StudentNotebooks/Assignment02/peterc8-dar-f24-assignment2.Rmd
@@ -4,11 +4,11 @@ subtitle: "peterc8 - DAR Assignment 2 (Fall 2024)"
author: "Charlotte Peterson"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  html_document:
    toc: true
    number_sections: true
    df_print: paged
  pdf_document: default
---
```{r setup, include=FALSE}
@@ -52,6 +52,18 @@ if (!require("pheatmap")) {
  library(pheatmap)
}
if (!require("randomForest")) {
  install.packages("randomForest")
  library(randomForest)
}
if (!require("knitr")) {
  install.packages("knitr")
  library(knitr)
}
```

# DAR ASSIGNMENT 2 (Introduction): Introductory DAR Notebook
@@ -231,15 +243,19 @@ libs.df$point <- as.numeric(libs.df$point)
# Review what we have
summary(libs.df)
head(libs.df)
# Make a matrix containing only the LIBS measurements for each mineral
libs.matrix <- as.matrix(libs.df[,6:13])
# Review the structure
str(libs.matrix)
```

```{r}
head(libs.matrix)

# Scale the LIBS data appropriately (z-score each column)
libs.scaled <- libs.matrix %>% scale()
str(libs.scaled)
head(libs.scaled)
```

## Dataset D: Load the SHERLOC Data

@@ -361,12 +377,77 @@ Each team has been assigned one of six datasets:

1. _Describe the data set contained in the data frame and matrix:_ How many rows does it have and how many features? Which features are measurements and which features are metadata about the samples? (3 pts)

There are 1932 rows of sample data and 13 features. Five of the features are metadata about the samples (sol, latitude, longitude, target, and point), and the remaining eight are chemical measurements.
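
A quick way to confirm this split (a minimal sketch; the 5/8 division follows the `libs.df[,6:13]` subsetting used to build `libs.matrix` above):

```{r}
# Confirm dimensions and the metadata vs. measurement split
dim(libs.df)
names(libs.df)[1:5]   # metadata columns
names(libs.df)[6:13]  # chemical measurement columns
```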

2. _Scale this data appropriately (you can choose the scaling method):_ Explain why you chose that scaling method. (3 pts)

I used standard z-score scaling: applying the `scale()` function to the original LIBS matrix mean-centers each column and divides it by that column's standard deviation. This puts all of the chemical measurements on a comparable scale, making the data more uniform and easier to interpret.
```{r}
# Scale the LIBS data (z-score: center each column, divide by its SD)
libs.scaled <- scale(libs.matrix)
str(libs.scaled)
head(libs.scaled)
```
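
As a quick sanity check (a sketch, assuming `libs.scaled` from the chunk above), the scaled columns should have means near 0 and standard deviations of 1:

```{r}
# Verify the z-score scaling: column means ~0, column SDs ~1
round(colMeans(libs.scaled), 10)
apply(libs.scaled, 2, sd)
```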

3. _Cluster the data using k-means or your favorite clustering method (like hierarchical clustering):_ Describe how you picked the best number of clusters. Indicate the number of points in each clusters. Coordinate with your team so you try different approaches. If you want to share results with your team mates, make sure to use the same random seeds. (6 pts)

I picked the best number of clusters using the elbow test; for this data set I believe it is 4. One of the tables below shows the number of points in each cluster: cluster 1 has 164, cluster 2 has 29, cluster 3 has 805, and cluster 4 has 933.
```{r}
# A user-defined function to examine clusters and plot the results
wssplot <- function(data, nc = 15, seed = 10) {
  wss <- data.frame(cluster = 1:nc, quality = c(0))
  for (i in 1:nc) {
    set.seed(seed)
    wss[i, 2] <- kmeans(data, centers = i)$tot.withinss
  }
  ggplot(data = wss, aes(x = cluster, y = quality)) +
    geom_line() +
    ggtitle("Quality of k-means by Cluster")
}

# Apply `wssplot()` to our LIBS data
wssplot(libs.scaled, nc = 8, seed = 500)

# Cluster with the k chosen from the elbow plot
set.seed(500)
k <- 4
km <- kmeans(libs.scaled, k)
clusters <- km$cluster

# Heatmap of the cluster centers
pheatmap(km$centers, scale = "none")

# Perform the PCA on the scaled matrix
lib.matrix.scaled.pca <- prcomp(libs.scaled, scale = FALSE)

# Generate the scree plot
ggscreeplot(lib.matrix.scaled.pca)

# Cluster sizes are in the km object produced by kmeans
cluster.df <- data.frame(cluster = 1:4, size = km$size)
kable(cluster.df, caption = "Samples per cluster")

# Biplot of the first two PCs, colored by cluster
ggbiplot::ggbiplot(lib.matrix.scaled.pca,
                   groups = as.factor(km$cluster)) +
  xlim(-5, 2.5) + ylim(-2.5, 7.5) +
  theme_bw()
```
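
As a cross-check on the elbow test (a minimal sketch; using the `cluster` package here is an assumption on my part), the average silhouette width can be compared across candidate values of k, where higher is better:

```{r}
# Alternative check on k: average silhouette width for k = 2..8
if (!require("cluster")) {
  install.packages("cluster")
  library(cluster)
}
d <- dist(libs.scaled)
set.seed(500)
sil.width <- sapply(2:8, function(k) {
  km.k <- kmeans(libs.scaled, centers = k)
  mean(silhouette(km.k$cluster, d)[, "sil_width"])
})
kable(data.frame(k = 2:8, avg.sil.width = sil.width),
      caption = "Average silhouette width by k")
```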

4. _Perform a **creative analysis** that provides insights into what one or more of the clusters are and what they tell you about the MARS data:_

With some help from ChatGPT, I applied my previous random forest knowledge to this data set. The resulting plot provides insight into the relationships among samples based on the chemical composition at different sample sites. Combined with my earlier clustering, points that group closely together have similar compositions. Clusters 1 and 3 overlap quite a bit, meaning their compositions are similar. This supports classifying the data into 4 clusters, where cluster 3 has a fairly neutral composition with CaO and SiO2 as significant components, and cluster 4 has large amounts of MgO and FeO2. I'm not sure exactly what cluster 2 represents: it is a much smaller cluster with an extremely high percentage of CaO. Scaling helped level out the SiO2 levels and made the chemicals easier to compare, but as the heatmap shows, cluster 2 is still strongly red for CaO, which I assume corresponds to cluster 2 displaying oddly in the random forest plot. I tried to improve this by changing the seed values and modifying my random forest code, but this was the best result I could get. This clustering also shows that different locations have very different chemical compositions.

```{r}
# Set seed for reproducibility of the random forest analysis
set.seed(42)

# Unsupervised random forest on the scaled LIBS matrix
rf <- randomForest(x = libs.scaled, ntree = 500, proximity = TRUE)

# Proximity matrix: how often each pair of samples lands in the same terminal node
proximity_matrix <- rf$proximity

# Multidimensional scaling (MDS) on 1 - proximity, down to 2 dimensions
mds <- cmdscale(1 - proximity_matrix, k = 2)

plot(mds, col = clusters, pch = 19,
     main = "MDS Plot of Random Forest Proximities for LIBS")
legend("topright", legend = unique(clusters), col = unique(clusters),
       pch = 19, title = "Clusters")
```
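
As a possible follow-up (a sketch; treating the k-means clusters as class labels is an assumption on my part), a supervised random forest can rank which measurements best separate the clusters, which might help explain what cluster 2 represents:

```{r}
# Supervised random forest: predict cluster membership from the scaled measurements
set.seed(42)
rf.sup <- randomForest(x = libs.scaled, y = as.factor(clusters),
                       ntree = 500, importance = TRUE)

# Variable importance: which measurements drive the cluster separation
varImpPlot(rf.sup, main = "Measurements separating the LIBS clusters")
```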


# Preparation of Team Presentation (Part 4)
