Completed Assignment! #77

Merged
merged 1 commit into from Sep 11, 2024
58 changes: 58 additions & 0 deletions StudentNotebooks/Assignment02/compta-assignment2-f24.Rmd
@@ -385,14 +385,72 @@ Each team has been assigned one of six datasets:

**For the data set assigned to your team, perform the following steps.** Feel free to use the methods/code from Assignment 1 as desired. Communicate with your teammates. Make sure that you are doing different variations of the analysis below so that no team member does the exact same analysis. If you want to use the same clustering for your team (which is okay, but then vary the rest), make sure you use the same random seeds.

I worked on data set H.

1. _Describe the data set contained in the data frame and matrix:_ How many rows does it have and how many features? Which features are measurements and which features are metadata about the samples? (3 pts)

The data set has 16 rows. 10 of the features are metadata, and these are duplicated. That leaves 89 features that are measurements.
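A minimal sketch of how counts like these could be checked. The data frame `toy.df` and its column names are hypothetical stand-ins, and the sketch assumes that measurements are stored as numeric columns while metadata columns are not, which may not hold for the real data set.

```r
# Hypothetical toy data frame: 2 metadata columns, 3 measurement columns
toy.df <- data.frame(
  sample_id = c("s1", "s2", "s3", "s4"),
  lithology = c("a", "b", "a", "b"),
  Fe = c(1.2, 3.4, 2.2, 0.9),
  Mg = c(0.5, 0.7, 0.1, 0.3),
  Ca = c(10, 12, 9, 11),
  stringsAsFactors = FALSE
)

n_rows <- nrow(toy.df)                    # number of samples (rows)
is_measure <- sapply(toy.df, is.numeric)  # assumption: numeric = measurement
n_measure <- sum(is_measure)              # count of measurement features
n_meta <- sum(!is_measure)                # count of metadata features

# Names of columns whose contents exactly repeat an earlier column
duplicated_cols <- names(toy.df)[duplicated(as.list(toy.df))]
```

If metadata happen to be numeric (for example a numeric sample ID), the columns would need to be picked by name instead.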

2. _Scale this data appropriately (you can choose the scaling method or decide to not scale data):_ Explain why you chose a scaling method or to not scale. (3 pts)

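We should scale the data before clustering (and possibly a later PCA) so that features with larger numeric ranges do not dominate the distance calculations. Let's use `scale()`, which centers each column to mean 0 and scales it to standard deviation 1.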

```{R}
sherloc_lithology_pixl_scaled.matrix <- scale(sherloc_lithology_pixl.matrix)
summary(sherloc_lithology_pixl_scaled.matrix)
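# Prepare matrix for cluster plot: drop column 16, then column 60 of the
# remaining matrix (which was column 61 before the first removal)
sherloc_lithology_pixl_scaled.matrix <- sherloc_lithology_pixl_scaled.matrix[, -16]
sherloc_lithology_pixl_scaled.matrix <- sherloc_lithology_pixl_scaled.matrix[, -60]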

# s_l_p_scaled.df <- data.frame(sherloc_lithology_pixl_scaled.matrix)
# Sample <- 1:16
# s_l_p_scaled.df <- cbind(Sample,s_l_p_scaled.df)
#
# ggplot(sherloc_lithology_pixl_scaled.df, aes(x=Sample), colour = Sample) +
# geom_line()

```
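A small sketch of what `scale()` does and of one common reason to drop columns afterwards: a constant column has standard deviation 0, so scaling turns it into `NaN`. Whether that is why the two columns were removed above is an assumption; `toy.matrix` and the other names here are hypothetical.

```r
# Hypothetical toy matrix; column "c" is constant (zero variance)
toy.matrix <- cbind(a = c(1, 2, 3, 4),
                    b = c(10, 20, 30, 40),
                    c = c(5, 5, 5, 5))

toy_scaled <- scale(toy.matrix)  # center to mean 0, scale to sd 1

# Constant columns become NaN (0 / 0) after scaling
bad_cols <- which(colSums(is.na(toy_scaled)) > 0)

# Drop the affected columns by computed index instead of a hard-coded position
keep <- setdiff(seq_len(ncol(toy_scaled)), bad_cols)
toy_clean <- toy_scaled[, keep, drop = FALSE]

col_means <- colMeans(toy_clean)    # approximately 0 for every column
col_sds <- apply(toy_clean, 2, sd)  # 1 for every column
```

Computing the indices this way avoids the off-by-one shift that happens when columns are removed one at a time by position.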

3. _Cluster the data using k-means or your favorite clustering method (like hierarchical clustering):_ Describe how you picked the best number of clusters. Indicate the number of points in each cluster. Coordinate with your team so you try different approaches. If you want to share results with your teammates, make sure to use the same random seeds. (6 pts)

```{R}
wssplot <- function(data, nc = 15, seed = 10) {
  # Total within-cluster sum of squares for k = 1..nc
  wss <- data.frame(cluster = 1:nc, quality = 0)
  for (i in 1:nc) {
    set.seed(seed)  # same seed for every k
    wss[i, 2] <- kmeans(data, centers = i)$tot.withinss
  }
  ggplot(data = wss, aes(x = cluster, y = quality)) +
    geom_line() +
    ggtitle("Quality of k-means by Cluster")
}

wssplot(sherloc_lithology_pixl_scaled.matrix, nc=8, seed=2469)

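# Select 4 clusters based on the plot

# Fix the seed so the final clustering is reproducible (same seed as the plot)
set.seed(2469)
kmean <- kmeans(sherloc_lithology_pixl_scaled.matrix, centers = 4)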
```
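A short sketch of why the seed matters when sharing a clustering with teammates: `kmeans()` picks random starting centers, so setting the same seed immediately before each call gives the same assignment. The toy data and the names `toy`, `fit1`, and `fit2` are hypothetical.

```r
# Hypothetical toy data: two well-separated groups of 10 points each
set.seed(1)  # only used to generate the toy data
toy <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2))

# Same seed directly before each call gives the same starting centers
set.seed(2469)
fit1 <- kmeans(toy, centers = 2)
set.seed(2469)
fit2 <- kmeans(toy, centers = 2)

same_assignment <- identical(fit1$cluster, fit2$cluster)
cluster_sizes <- fit1$size  # number of points in each cluster
```

The `size` component of the fitted object reports the number of points per cluster directly, which answers the "number of points in each cluster" part of the question.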

4. _Perform a **creative analysis** that provides insights into what one or more of the clusters are and what they tell you about the MARS data:_ Alternatively, do another creative analysis of your datasets that leads to one or more findings. Make sure to explain your analysis and discuss your results.

```{R}
# Make two heatmaps of the cluster centers and look for connections
pheatmap(kmean$centers[,1:41],scale="none")
pheatmap(kmean$centers[,42:81],scale="none")
# H.pca <- prcomp(sherloc_lithology_pixl_scaled.matrix,scale=FALSE)
#
# ggbiplot::ggbiplot(H.pca,
# groups = as.factor(kmean$cluster),varname.size=1, var.axes = 0)+
# xlim(-3,3) + ylim(-3,3)


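# Determine cluster sizes
kmean[["cluster"]]         # cluster assignment for each sample
table(kmean[["cluster"]])  # number of samples per cluster
# Sizes observed:
# 1: 3 samples
# 2: 2 samples
# 3: 1 sample
# 4: 10 samples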
```
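One possible way to read the heatmaps more systematically, sketched on made-up numbers: because the data were scaled, each cluster center is expressed in standard deviations from the overall mean, so the features with the largest absolute center values are the ones that most set a cluster apart. The `centers` matrix, its feature names, and the helper `top_features()` are hypothetical and are not part of the analysis above.

```r
# Hypothetical cluster centers on scaled data: 2 clusters x 4 features
centers <- rbind(
  c(Fe = 1.5, Mg = -0.2, Ca = 0.1, Si = -2.0),
  c(Fe = -0.3, Mg = 0.9, Ca = -0.1, Si = 0.4)
)
rownames(centers) <- c("1", "2")

# For each cluster, return the n features with the largest |center value|
top_features <- function(centers, n = 2) {
  apply(centers, 1, function(row) {
    names(sort(abs(row), decreasing = TRUE))[seq_len(n)]
  })
}

top <- top_features(centers, n = 2)  # one column per cluster
```

With a real fit, passing `kmean$centers` in place of the toy matrix would list candidate features to look at for each cluster.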

# Preparation of Team Presentation (Part 4)
