assignment two has been completed I believe #76

Merged
merged 1 commit into from Sep 11, 2024
89 changes: 85 additions & 4 deletions StudentNotebooks/Assignment02/peterc8-dar-f24-assignment2.Rmd
@@ -4,11 +4,11 @@ subtitle: "peterc8 - DAR Assignment 2 (Fall 2024)"
author: "Charlotte Peterson"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  html_document:
    toc: true
    number_sections: true
    df_print: paged
  pdf_document: default
---
```{r setup, include=FALSE}
@@ -52,6 +52,18 @@ if (!require("pheatmap")) {
  library(pheatmap)
}
if (!require("randomForest")) {
  install.packages("randomForest")
  library(randomForest)
}
if (!require("knitr")) {
  install.packages("knitr")
  library(knitr)
}
```

# DAR ASSIGNMENT 2 (Introduction): Introductory DAR Notebook
@@ -231,15 +243,19 @@ libs.df$point <- as.numeric(libs.df$point)
# Review what we have
summary(libs.df)
head(libs.df)
# Make a matrix containing only the LIBS measurements for each mineral
libs.matrix <- as.matrix(libs.df[,6:13])
# Review the structure
str(libs.matrix)
```

```{r}
head(libs.matrix)

# Scale the LIBS data appropriately (z-score each column)
libs.scaled <- libs.matrix %>% scale()
str(libs.scaled)
head(libs.scaled)
```

## Dataset D: Load the SHERLOC Data

@@ -361,12 +377,77 @@ Each team has been assigned one of six datasets:

1. _Describe the data set contained in the data frame and matrix:_ How many rows does it have and how many features? Which features are measurements and which features are metadata about the samples? (3 pts)

There are 1932 rows of sample data and 13 features. Five of the features are metadata about the samples (sol, latitude, longitude, target, and point), and the remaining eight are chemical measurements.
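
A quick way to confirm this split (a minimal sketch; the 5/8 division follows the `libs.df[,6:13]` subsetting used to build `libs.matrix` above):

```{r}
# Confirm dimensions and the metadata vs. measurement split
dim(libs.df)
names(libs.df)[1:5]   # metadata columns
names(libs.df)[6:13]  # chemical measurement columns
```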

2. _Scale this data appropriately (you can choose the scaling method):_ Explain why you chose that scaling method. (3 pts)

I used standard z-score scaling: applying the `scale()` function to the original LIBS matrix mean-centers each column and divides it by that column's standard deviation. This puts all of the chemical measurements on a comparable scale, making the data more uniform and easier to interpret.
```{r}
# Scale the LIBS data (z-score: center each column, divide by its SD)
libs.scaled <- scale(libs.matrix)
str(libs.scaled)
head(libs.scaled)
```
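
As a quick sanity check (a sketch, assuming `libs.scaled` from the chunk above), the scaled columns should have means near 0 and standard deviations of 1:

```{r}
# Verify the z-score scaling: column means ~0, column SDs ~1
round(colMeans(libs.scaled), 10)
apply(libs.scaled, 2, sd)
```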

3. _Cluster the data using k-means or your favorite clustering method (like hierarchical clustering):_ Describe how you picked the best number of clusters. Indicate the number of points in each clusters. Coordinate with your team so you try different approaches. If you want to share results with your team mates, make sure to use the same random seeds. (6 pts)

I picked the best number of clusters using the elbow test; for this data set I believe it is 4. One of the tables below shows the number of points in each cluster: cluster 1 has 164, cluster 2 has 29, cluster 3 has 805, and cluster 4 has 933.
```{r}
# A user-defined function to examine clusters and plot the results
wssplot <- function(data, nc = 15, seed = 10) {
  wss <- data.frame(cluster = 1:nc, quality = c(0))
  for (i in 1:nc) {
    set.seed(seed)
    wss[i, 2] <- kmeans(data, centers = i)$tot.withinss
  }
  ggplot(data = wss, aes(x = cluster, y = quality)) +
    geom_line() +
    ggtitle("Quality of k-means by Cluster")
}

# Apply `wssplot()` to our LIBS data
wssplot(libs.scaled, nc = 8, seed = 500)

# Cluster with the k chosen from the elbow plot
set.seed(500)
k <- 4
km <- kmeans(libs.scaled, k)
clusters <- km$cluster

# Heatmap of the cluster centers
pheatmap(km$centers, scale = "none")

# Perform the PCA on the scaled matrix
lib.matrix.scaled.pca <- prcomp(libs.scaled, scale = FALSE)

# Generate the scree plot
ggscreeplot(lib.matrix.scaled.pca)

# Cluster sizes are in the km object produced by kmeans
cluster.df <- data.frame(cluster = 1:4, size = km$size)
kable(cluster.df, caption = "Samples per cluster")

# Biplot of the first two PCs, colored by cluster
ggbiplot::ggbiplot(lib.matrix.scaled.pca,
                   groups = as.factor(km$cluster)) +
  xlim(-5, 2.5) + ylim(-2.5, 7.5) +
  theme_bw()
```
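
As a cross-check on the elbow test (a minimal sketch; using the `cluster` package here is an assumption on my part), the average silhouette width can be compared across candidate values of k, where higher is better:

```{r}
# Alternative check on k: average silhouette width for k = 2..8
if (!require("cluster")) {
  install.packages("cluster")
  library(cluster)
}
d <- dist(libs.scaled)
set.seed(500)
sil.width <- sapply(2:8, function(k) {
  km.k <- kmeans(libs.scaled, centers = k)
  mean(silhouette(km.k$cluster, d)[, "sil_width"])
})
kable(data.frame(k = 2:8, avg.sil.width = sil.width),
      caption = "Average silhouette width by k")
```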

4. _Perform a **creative analysis** that provides insights into what one or more of the clusters are and what they tell you about the MARS data:_

With some help from ChatGPT, I applied my previous random forest knowledge to this data set. The resulting plot provides insight into the relationships among samples based on the chemical composition at different sample sites. Combined with my earlier clustering, points that group closely together have similar compositions. Clusters 1 and 3 overlap quite a bit, meaning their compositions are similar. This supports classifying the data into 4 clusters, where cluster 3 has a fairly neutral composition with CaO and SiO2 as significant components, and cluster 4 has large amounts of MgO and FeO2. I'm not sure exactly what cluster 2 represents: it is a much smaller cluster with an extremely high percentage of CaO. Scaling helped level out the SiO2 levels and made the chemicals easier to compare, but as the heatmap shows, cluster 2 is still strongly red for CaO, which I assume corresponds to cluster 2 displaying oddly in the random forest plot. I tried to improve this by changing the seed values and modifying my random forest code, but this was the best result I could get. This clustering also shows that different locations have very different chemical compositions.

```{r}
# Set seed for reproducibility of the random forest analysis
set.seed(42)

# Unsupervised random forest on the scaled LIBS matrix
rf <- randomForest(x = libs.scaled, ntree = 500, proximity = TRUE)

# Proximity matrix: how often each pair of samples lands in the same terminal node
proximity_matrix <- rf$proximity

# Multidimensional scaling (MDS) on 1 - proximity, down to 2 dimensions
mds <- cmdscale(1 - proximity_matrix, k = 2)

plot(mds, col = clusters, pch = 19,
     main = "MDS Plot of Random Forest Proximities for LIBS")
legend("topright", legend = unique(clusters), col = unique(clusters),
       pch = 19, title = "Clusters")
```
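
As a possible follow-up (a sketch; treating the k-means clusters as class labels is an assumption on my part), a supervised random forest can rank which measurements best separate the clusters, which might help explain what cluster 2 represents:

```{r}
# Supervised random forest: predict cluster membership from the scaled measurements
set.seed(42)
rf.sup <- randomForest(x = libs.scaled, y = as.factor(clusters),
                       ntree = 500, importance = TRUE)

# Variable importance: which measurements drive the cluster separation
varImpPlot(rf.sup, main = "Measurements separating the LIBS clusters")
```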


# Preparation of Team Presentation (Part 4)
