DataINCITE · erickj4 · Sep 11, 2024 · Sep 11, 2024
diff --git a/StudentNotebooks/Assignment02/compta-assignment2-f24.Rmd b/StudentNotebooks/Assignment02/compta-assignment2-f24.Rmd
@@ -385,14 +385,72 @@ Each team has been assigned one of six datasets:
 
 **For the data set assigned to your team, perform the following steps.** Feel free to use the methods/code from Assignment 1 as desired.  Communicate with your teammates. Make sure that you are doing different variations of below analysis so that no team member does the exact same analysis. If you want to use the same  clustering for your team (which is okay but then vary rest), make sure you use the same random seeds. 
 
+I worked on dataset H.
+
 1. _Describe the data set contained in the data frame and matrix:_ How many rows does it have and how many features?  Which features are measurements and which features are metadata about the samples?  (3 pts)
 
+The data set has 16 rows. Ten of the features are metadata (these are also duplicated), which leaves 89 features that are measurements.
+
 2. _Scale this data appropriately (you can choose the scaling method or decide to not scale data):_ Explain why you chose  a scaling method or to not scale. (3 pts) 
 
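+The data should be scaled before clustering (and possibly a later PCA) so that features with large ranges do not dominate the distance calculations. I use `scale()`, which centers each column to mean 0 and scales it to standard deviation 1.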
+
+```{R}
+sherloc_lithology_pixl_scaled.matrix <- scale(sherloc_lithology_pixl.matrix)
+summary(sherloc_lithology_pixl_scaled.matrix)
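+# Prepare the matrix for the cluster plot by dropping two columns.
+# Note: the second index is applied after the first drop, so it removes original column 61.
+sherloc_lithology_pixl_scaled.matrix <- sherloc_lithology_pixl_scaled.matrix[, -16]
+sherloc_lithology_pixl_scaled.matrix <- sherloc_lithology_pixl_scaled.matrix[, -60]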
+
+# s_l_p_scaled.df <- data.frame(sherloc_lithology_pixl_scaled.matrix)
+# Sample <- 1:16
+#  s_l_p_scaled.df <- cbind(Sample,s_l_p_scaled.df)
+#  
+# ggplot(sherloc_lithology_pixl_scaled.df, aes(x=Sample), colour = Sample) +
+#   geom_line()
+
+```
+
 3. _Cluster the data using k-means or your favorite clustering method (like hierarchical clustering):_   Describe how you picked the best number of clusters.  Indicate the number of points in each clusters. Coordinate with your team so you try different approaches.   If you want to share results with your team mates, make sure to use the same random seeds.  (6 pts)
 
+```{R}
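+# Elbow plot: total within-cluster sum of squares for k = 1..nc.
+# The seed is reset before every kmeans() call so each k is reproducible.
+wssplot <- function(data, nc = 15, seed = 10) {
+  wss <- data.frame(cluster = 1:nc, quality = 0)
+  for (i in 1:nc) {
+    set.seed(seed)
+    wss[i, 2] <- kmeans(data, centers = i)$tot.withinss
+  }
+  ggplot(data = wss, aes(x = cluster, y = quality)) +
+    geom_line() +
+    ggtitle("Quality of k-means by Cluster")
+}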
+
+wssplot(sherloc_lithology_pixl_scaled.matrix, nc=8, seed=2469)
+
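+# Select 4 clusters based on the elbow in the plot above.
+# No seed is set here, so the result depends on the RNG state left by wssplot();
+# run this chunk top to bottom to reproduce the clustering.
+
+kmean <- kmeans(sherloc_lithology_pixl_scaled.matrix, centers = 4)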
+```
+
 4. _Perform a **creative analysis** that provides insights into what one or more of the clusters are and what they tell you about the MARS data:  Alternatively do another creative analysis of your datasets that leads to one of more findings.  Make sure to explain what your analysis and discuss your the results.  
 
+```{R}
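+# Split the cluster centers across two heatmaps (columns 1-41 and 42-81)
+# and look for features that separate the clusters.
+# scale = "none" because the data were already scaled with scale().
+pheatmap(kmean$centers[, 1:41], scale = "none")
+pheatmap(kmean$centers[, 42:81], scale = "none")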
+# H.pca <- prcomp(sherloc_lithology_pixl_scaled.matrix,scale=FALSE)
+# 
+# ggbiplot::ggbiplot(H.pca,
+#                    groups = as.factor(kmean$cluster),varname.size=1, var.axes = 0)+
+#   xlim(-3,3) + ylim(-3,3) 
+
+
+# Cluster assignment of each sample
+kmean[["cluster"]]
+# Number of samples in each cluster
+kmean$size
+# Cluster 1: 3 samples
+# Cluster 2: 2 samples
+# Cluster 3: 1 sample
+# Cluster 4: 10 samples
+```
 
 # Preparation of Team Presentation (Part 4)