diff --git a/StudentNotebooks/Assignment02/changk2-assignment2-f24.Rmd b/StudentNotebooks/Assignment02/changk2-assignment2-f24.Rmd index b030acd..1c54b03 100644 --- a/StudentNotebooks/Assignment02/changk2-assignment2-f24.Rmd +++ b/StudentNotebooks/Assignment02/changk2-assignment2-f24.Rmd @@ -1,15 +1,15 @@ --- title: "Mars 2020 Mission Data Notebook:" subtitle: "DAR Assignment 2 (Fall 2024)" -author: "Your Name Here" +author: "Kaiyang Chang" date: "`r format(Sys.time(), '%d %B %Y')`" output: + pdf_document: + latex_engine: xelatex html_document: toc: true number_sections: true df_print: paged - pdf_document: - latex_engine: xelatex --- ```{r setup, include=FALSE} @@ -202,6 +202,30 @@ str(lithology.matrix) The PIXL data provides summaries of the mineral compositions measured at selected sample sites by the PIXL instrument. Note that here we scale pixl.mat so features have mean 0 and standard deviation 1, so the results will differ from those in Assignment 1. +### Declare: +I used GPT-4o for part of the analysis. +I used it to generate the analysis of the k-means quality-by-cluster plot. + +For the heatmap and PCA biplot: +I used GPT to identify some features of the heatmap and PCA biplot. +The analysis and comparison results are my own, but I used GPT to polish the sentences. + +### Description for data set B: +The dataset contains 16 sample points, and 13 mineral components were measured at each sample point, including sodium oxide (Na2O), magnesium oxide (MgO), aluminum oxide (Al2O3), silicon dioxide (SiO2), phosphorus pentoxide (P2O5), sulfur trioxide (SO3), chlorine (Cl), potassium oxide (K2O), calcium oxide (CaO), titanium dioxide (TiO2), chromium(III) oxide (Cr2O3), manganese oxide (MnO) and total iron oxide (FeO-T). There are three metadata features: name (sample name), type (sample type, such as igneous or sedimentary rock), and campaign (geographic area where the sample was taken). 
+ +The samples are mainly igneous or sedimentary rocks: 8 igneous samples, 7 sedimentary samples, and 1 sample whose type is undefined (N/A). The samples are also grouped by campaign: 9 from the Crater Floor and 7 from the Delta Front. + +### Scaling +The PIXL team’s goal is to understand and explain how scaling changes the results from Assignment 1. The matrix was scaled above but not in Assignment 1. + +### Clustering data method +I clustered the data with k-means and chose the optimal number of clusters. Because the results must be compared with the plots from Assignment 1, all analytical methods stay the same; only the data (now scaled) changes. + +By plotting cluster quality with the wssplot() function, which shows how the total within-cluster sum of squares changes with the number of clusters, I chose three cluster centers (as in Assignment 1). The curve begins to level off after three cluster centers, indicating that adding more centers does little to improve cluster quality. I then performed k-means clustering and began the comparison. + +### Creative Analysis!!!! +Let's start! 
+ ```{r} # Load the saved PIXL data with locations added pixl.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/samples_pixl_wide.Rds") @@ -210,12 +234,10 @@ pixl.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/samples_pixl_wide pixl.df[sapply(pixl.df, is.character)] <- lapply(pixl.df[sapply(pixl.df, is.character)], as.factor) -# Review our dataframe -summary(pixl.df) - # Make the matrix of just mineral percentage measurements pixl.matrix <- pixl.df[,2:14] %>% scale() + # Review the structure str(pixl.matrix) @@ -233,13 +255,14 @@ wssplot <- function(data, nc=15, seed=10){ # Apply `wssplot()` to our PIXL data wssplot(pixl.matrix, nc=8, seed=2) ``` +Comparison with the k-means quality-by-cluster plot in Assignment 1: Smaller quality values: After scaling, the quality values (or WSS values) are significantly lower, starting at around 200. This indicates that the features have been normalized, ensuring that all features contribute equally to the clustering process. This prevents high-value features from dominating the distance calculations and ensures a more balanced approach. Smoother WSS curve: The reduction in quality values is much smoother, indicating that the features have a more balanced influence on the clustering. Each feature contributes equally after scaling, leading to more consistent and reasonable clustering results. ```{r} set.seed(2) -k <- 5 +k <- 3 km <- kmeans(pixl.matrix,k) @@ -254,6 +277,10 @@ More balanced color distribution: After scaling, all features are normalized to More reasonable clustering results: By standardizing the features to the same scale, the clustering results are no longer dominated by high-value features. Each feature contributes more equally to the cluster centers, leading to smoother transitions in the feature values and a more reliable clustering outcome. +In the Assignment 1 heatmap, certain features with larger raw values, like SiO2, dominate the clustering process. 
These features disproportionately influence the cluster centers, overshadowing the impact of smaller-value features. As a result, the cluster centers are unbalanced, reflecting more of the high-value features and less of the overall data structure. + +In the Assignment 2 heatmap, the color distribution is more balanced, indicating that features with previously smaller values now have a more significant role in determining the cluster centers. The clustering results are more balanced and accurately reflect the combined influence of all features, leading to a more representative and fair clustering outcome. + ```{r} pca_result <- prcomp(pixl.matrix, scale=TRUE) pca_data <- data.frame(pca_result$x[, 1:2], cluster=km$cluster) @@ -274,7 +301,7 @@ ggscreeplot(pixl.matrix.pca) ```{r} # clusters sizes are in the km object produced by kmeans -cluster.df<-data.frame(cluster= 1:5, size=km$size) +cluster.df<-data.frame(cluster= 1:3, size=km$size) ggbiplot::ggbiplot(pixl.matrix.pca, labels = pixl.df$type, @@ -283,6 +310,22 @@ ggbiplot::ggbiplot(pixl.matrix.pca, ``` Variable Direction and Distribution: The variables (e.g., SiO2, FeO-T, MnO, etc.) are spread out in a balanced radial pattern. The arrows representing the variables have relatively equal lengths, indicating that each feature has a balanced influence on the results. Since the data has been scaled, each variable contributes similarly in the PCA calculation, preventing certain features from dominating the clustering due to their value range. +In the Assignment 1 PCA biplot, certain features with larger raw values dominate the distribution of the principal components. As a result, these features heavily influence the direction and length of the arrows, skewing the PCA results. + +In the Assignment 2 PCA biplot, all features contribute more equally to the principal components, indicating a balanced influence of all features on the PCA results. 
This provides a more accurate and holistic view of the relationships between the different features and the resulting cluster centers. + +Analysis of new graphs: + +Combining PCA and heatmap analyses, we can infer significant chemical composition differences within the data, which are likely related to rock types. The scree plot and K-means clustering analysis indicate that the variance in the dataset is primarily concentrated in the first few principal components, and fewer clusters can effectively categorize the samples. The clustering analysis results support the use of fewer clusters for data classification and complement the PCA findings. This helps to better understand the chemical composition structure and classification of rock samples. + +New results from the final PCA biplot in Assignment 2: + +Principal Component Interpretation: The X-axis of the PCA graph (standardized PC1) explains 49.2% of the data variance, while the Y-axis (standardized PC2) explains 18.9%. Together, these two principal components explain about 68.1% of the data variance. This is a much smaller share than the first two components explained in Assignment 1. + +Variable Vector Orientation: The vector orientation of individual elements such as FeO-T, MgO, Na2O, etc. shows their correlation with the principal components. For example, FeO-T and MgO have a strong positive correlation with the first principal component, while Al2O3, SiO2 and K2O have a strong negative correlation with the second principal component. +By contrast, a large number of elements could not be analyzed in the Assignment 1 biplot. + +Rock type classification: The points labeled “igneous” are mainly concentrated in the extension direction of FeO-T and MgO, while the points labeled “sedimentary” lean towards the extension direction of Al2O3 and SiO2. This may indicate that the igneous rocks are higher in FeO-T and MgO, while the sedimentary rocks may be rich in Al2O3 and SiO2. 
## Data Set C: Load the LIBS Data diff --git a/StudentNotebooks/Assignment02/changk2-assignment2-f24.html b/StudentNotebooks/Assignment02/changk2-assignment2-f24.html index 9f4ac13..88c548e 100644 --- a/StudentNotebooks/Assignment02/changk2-assignment2-f24.html +++ b/StudentNotebooks/Assignment02/changk2-assignment2-f24.html @@ -9,9 +9,9 @@ - + - +
The PIXL data provides summaries of the mineral compositions measured at selected sample sites by the PIXL instrument. Note that here we scale pixl.mat so features have mean 0 and standard deviation 1, so the results will differ from those in Assignment 1.
+I used GPT-4o for part of the analysis. I used it to generate the analysis of the k-means quality-by-cluster plot.
+For the heatmap and PCA biplot: I used GPT to identify some features of the heatmap and PCA biplot. The analysis and comparison results are my own, but I used GPT to polish the sentences.
+The dataset contains 16 sample points, and 13 mineral components were measured at each sample point, including sodium oxide (Na2O), magnesium oxide (MgO), aluminum oxide (Al2O3), silicon dioxide (SiO2), phosphorus pentoxide (P2O5), sulfur trioxide (SO3), chlorine (Cl), potassium oxide (K2O), calcium oxide (CaO), titanium dioxide (TiO2), chromium(III) oxide (Cr2O3), manganese oxide (MnO) and total iron oxide (FeO-T). There are three metadata features: name (sample name), type (sample type, such as igneous or sedimentary rock), and campaign (geographic area where the sample was taken).
+The samples are mainly igneous or sedimentary rocks: 8 igneous samples, 7 sedimentary samples, and 1 sample whose type is undefined (N/A). The samples are also grouped by campaign: 9 from the Crater Floor and 7 from the Delta Front.
+The PIXL team’s goal is to understand and explain how scaling changes the results from Assignment 1. The matrix was scaled above but not in Assignment 1.
+I clustered the data with k-means and chose the optimal number of clusters. Because the results must be compared with the plots from Assignment 1, all analytical methods stay the same; only the data (now scaled) changes.
+By plotting cluster quality with the wssplot() function, which shows how the total within-cluster sum of squares changes with the number of clusters, I chose three cluster centers (as in Assignment 1). The curve begins to level off after three cluster centers, indicating that adding more centers does little to improve cluster quality. I then performed k-means clustering and began the comparison.
+Let’s start!
# Load the saved PIXL data with locations added
pixl.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/samples_pixl_wide.Rds")
@@ -440,51 +469,10 @@ 3.2 Data Set B: Load the PIXL Dat
pixl.df[sapply(pixl.df, is.character)] <- lapply(pixl.df[sapply(pixl.df, is.character)],
as.factor)
-# Review our dataframe
-summary(pixl.df)
-## sample Na20 Mgo Al203
-## Min. : 1.00 Min. :1.000 Min. : 0.730 Min. : 1.700
-## 1st Qu.: 4.75 1st Qu.:1.853 1st Qu.: 2.533 1st Qu.: 2.220
-## Median : 8.50 Median :1.900 Median :12.800 Median : 3.710
-## Mean : 8.50 Mean :2.672 Mean :11.682 Mean : 5.072
-## 3rd Qu.:12.25 3rd Qu.:4.500 3rd Qu.:19.100 3rd Qu.: 7.117
-## Max. :16.00 Max. :5.550 Max. :22.700 Max. :11.600
-##
-## Si02 P205 S03 Cl
-## Min. :22.60 Min. :0.1000 Min. : 0.780 Min. :0.400
-## 1st Qu.:31.22 1st Qu.:0.2350 1st Qu.: 1.495 1st Qu.:0.940
-## Median :38.85 Median :0.5250 Median : 2.600 Median :1.740
-## Mean :38.55 Mean :0.6512 Mean : 5.562 Mean :1.846
-## 3rd Qu.:41.17 3rd Qu.:0.8400 3rd Qu.: 3.800 3rd Qu.:2.080
-## Max. :57.10 Max. :2.7600 Max. :21.530 Max. :4.500
-##
-## K20 Cao Ti02 Cr203
-## Min. :0.0000 Min. :1.500 Min. :0.2000 Min. :0.000
-## 1st Qu.:0.1600 1st Qu.:2.655 1st Qu.:0.5900 1st Qu.:0.025
-## Median :0.2000 Median :3.120 Median :0.7000 Median :0.155
-## Mean :0.5800 Mean :3.688 Mean :0.8194 Mean :0.355
-## 3rd Qu.:0.8275 3rd Qu.:4.310 3rd Qu.:0.9900 3rd Qu.:0.290
-## Max. :1.9000 Max. :7.770 Max. :2.4900 Max. :1.900
-##
-## Mno FeO-T name type
-## Min. :0.1000 Min. :13.24 Atsah : 1 Igneous :8
-## 1st Qu.:0.2800 1st Qu.:16.71 Bearwallow: 1 N/A :1
-## Median :0.4000 Median :23.86 Coulettes : 1 Sedimentary:7
-## Mean :0.3812 Mean :21.45 Hahonih : 1
-## 3rd Qu.:0.4900 3rd Qu.:25.70 Hazeltop : 1
-## Max. :0.6900 Max. :30.05 Kukaklek : 1
-## (Other) :10
-## campaign location abrasion
-## Crater Floor:9 01 : 1 Alfalfa :2
-## Delta Front :7 02 : 1 Bellegrade :2
-## 03 : 1 Berry Hollow:2
-## 04 : 1 Dourbes :2
-## 05 : 1 Novarupta :2
-## 06 : 1 Quartier :2
-## (Other):10 (Other) :4
-# Make the matrix of just mineral percentage measurements
+# Make the matrix of just mineral percentage measurements
pixl.matrix <- pixl.df[,2:14] %>% scale()
+
# Review the structure
str(pixl.matrix)
## num [1:16, 1:13] 1.928 1.338 -0.498 -0.538 1.225 ...
@@ -508,41 +496,33 @@ 3.2 Data Set B: Load the PIXL Dat
# Apply `wssplot()` to our PIXL data
wssplot(pixl.matrix, nc=8, seed=2)
- Smaller quality values: After scaling, the quality values (or WSS values) are significantly lower, starting at around 200. This indicates that the features have been normalized, ensuring that all features contribute equally to the clustering process. This prevents high-value features from dominating the distance calculations and ensures a more balanced approach.
Comparison with the k-means quality-by-cluster plot in Assignment 1: Smaller quality values: After scaling, the quality values (or WSS values) are significantly lower, starting at around 200. This indicates that the features have been normalized, ensuring that all features contribute equally to the clustering process. This prevents high-value features from dominating the distance calculations and ensures a more balanced approach.
Smoother WSS curve: The reduction in quality values is much smoother, indicating that the features have a more balanced influence on the clustering. Each feature contributes equally after scaling, leading to more consistent and reasonable clustering results.
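The starting value of roughly 200 follows from the scaling itself. The sketch below is only an illustration on a random stand-in matrix (the names `raw` and `scaled` and the random values are assumptions, not the PIXL data): once every column has variance 1, the one-cluster WSS is (n - 1) * p, which is 195 for 16 samples and 13 features.

```r
# Hypothetical stand-in for the 16 x 13 PIXL matrix (random values, not mission data)
set.seed(1)
n <- 16
p <- 13
raw <- matrix(rexp(n * p, rate = 0.1), nrow = n, ncol = p)

# Standardize each column to mean 0 and standard deviation 1
scaled <- scale(raw)

# One-cluster WSS: total sum of squares about the column means
wss_one <- (nrow(scaled) - 1) * sum(apply(scaled, 2, var))

print(wss_one)      # equals (n - 1) * p up to floating-point error
print((n - 1) * p)  # 195
```

Because this value depends only on the matrix dimensions, the height of the scaled WSS curve at one cluster says nothing about the minerals themselves; only the shape of the curve after that point is informative.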
set.seed(2)
-k <- 5
+k <- 3
km <- kmeans(pixl.matrix,k)
pheatmap(km$centers,scale="none")
-# cluster result
print(km$cluster)
-## [1] 1 1 4 4 2 4 4 2 2 5 5 3 3 5 5 3
+## [1] 1 1 3 3 1 3 3 1 1 2 2 3 3 2 2 3
print(km$centers)
-## Na20 Mgo Al203 Si02 P205 S03
-## 1 1.6332256 -1.1634882 0.584802916 0.2254886 2.23932342 -0.3489424
-## 2 1.2245004 -1.3765168 1.740742791 1.6820388 0.27203366 -0.6094593
-## 3 -0.4729375 0.1505025 0.009277295 -0.5835043 -0.08827582 1.9970458
-## 4 -0.5276575 0.3427938 -0.719129262 0.1384221 -0.46299768 -0.5506760
-## 5 -0.8526275 1.1584610 -0.885787260 -1.0750673 -0.79448240 -0.3155428
-## Cl K20 Cao Ti02 Cr203 Mno
-## 1 0.4563046 0.4589407 2.0620376 2.12133788 -0.5464895 0.3855410
-## 2 0.1796399 1.8640052 0.3203040 -0.41924125 -0.5791157 -0.5677967
-## 3 -0.9218954 -0.6354563 -0.4017736 -0.13289300 -0.4105468 -1.5211343
-## 4 -0.7502608 -0.5436682 -0.3846221 0.02855867 -0.1182701 1.2687802
-## 5 1.0788001 -0.6072138 -0.5852945 -0.67512692 1.1337618 0.1051475
-## FeO-T
-## 1 -0.07982037
-## 2 -1.38478322
-## 3 -0.84322786
-## 4 0.90079975
-## 5 0.81011875
+## Na20 Mgo Al203 Si02 P205 S03 Cl
+## 1 1.3879905 -1.2913054 1.2783668 1.0994187 1.0589496 -0.5052525 0.2903058
+## 2 -0.8526275 1.1584610 -0.8857873 -1.0750673 -0.7944824 -0.3155428 1.0788001
+## 3 -0.5042060 0.2603833 -0.4069550 -0.1709749 -0.3024026 0.5412048 -0.8238185
+## K20 Cao Ti02 Cr203 Mno FeO-T
+## 1 1.3019794 1.0169975 0.59699040 -0.5660653 -0.18646163 -0.8627981
+## 2 -0.6072138 -0.5852945 -0.67512692 1.1337618 0.10514753 0.8101187
+## 3 -0.5830060 -0.3919728 -0.04063491 -0.2435316 0.07310257 0.1533593
print(km$tot.withinss)
-## [1] 25.17972
+## [1] 83.3454
More balanced color distribution: After scaling, all features are normalized to a similar range. This results in a more balanced contribution from each feature to the clustering process, as reflected by the more uniform color distribution. Now, features such as K2O also show notable color differences, indicating they play a more significant role in clustering after scaling.
More reasonable clustering results: By standardizing the features to the same scale, the clustering results are no longer dominated by high-value features. Each feature contributes more equally to the cluster centers, leading to smoother transitions in the feature values and a more reliable clustering outcome.
+In the Assignment 1 heatmap, certain features with larger raw values, like SiO2, dominate the clustering process. These features disproportionately influence the cluster centers, overshadowing the impact of smaller-value features. As a result, the cluster centers are unbalanced, reflecting more of the high-value features and less of the overall data structure.
+In the Assignment 2 heatmap, the color distribution is more balanced, indicating that features with previously smaller values now have a more significant role in determining the cluster centers. The clustering results are more balanced and accurately reflect the combined influence of all features, leading to a more representative and fair clustering outcome.
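The dominance effect described above can be illustrated with a small sketch. The two-column matrix `toy` and the helper `share()` are hypothetical; the value ranges only loosely mimic SiO2 and K2O and are not the measured data. The helper reports each feature's share of the total sum of squares, which is what Euclidean distance in k-means responds to.

```r
# Hypothetical two-feature example: one large-range and one small-range column
set.seed(1)
toy <- cbind(SiO2 = runif(16, 22, 57), K2O = runif(16, 0, 1.9))

# Share of the total (centered) sum of squares contributed by each feature
share <- function(m) {
  ss <- colSums(scale(m, center = TRUE, scale = FALSE)^2)
  ss / sum(ss)
}

print(round(share(toy), 3))         # unscaled: the large-range column dominates
print(round(share(scale(toy)), 3))  # scaled: each column contributes 0.5
```

After `scale()`, every column has the same sum of squares, so no single feature can drive the distances purely because of its units or range.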
pca_result <- prcomp(pixl.matrix, scale=TRUE)
pca_data <- data.frame(pca_result$x[, 1:2], cluster=km$cluster)
@@ -551,7 +531,7 @@ 3.2 Data Set B: Load the PIXL Dat
geom_point(size=4) +
ggtitle("K-means Clustering Results (PCA Reduced Data)") +
theme_minimal()
-# Perform the PCA on the matrix `pixl_trim.mat` we created earlier
pixl.matrix.pca <- prcomp(pixl.matrix, scale=FALSE)
@@ -560,13 +540,22 @@ 3.2 Data Set B: Load the PIXL Dat
ggscreeplot(pixl.matrix.pca)
# clusters sizes are in the km object produced by kmeans
-cluster.df<-data.frame(cluster= 1:5, size=km$size)
+cluster.df<-data.frame(cluster= 1:3, size=km$size)
ggbiplot::ggbiplot(pixl.matrix.pca,
labels = pixl.df$type,
groups = as.factor(km$cluster)) +
xlim(-2,2) + ylim(-2,2)
- Variable Direction and Distribution: The variables (e.g., SiO2, FeO-T, Mno, etc.) are spread out in a balanced radial pattern. The arrows representing the variables have relatively equal lengths, indicating that each feature has a balanced influence on the results. Since the data has been scaled, each variable contributes similarly in the PCA calculation, preventing certain features from dominating the clustering due to their value range..
Variable Direction and Distribution: The variables (e.g., SiO2, FeO-T, MnO, etc.) are spread out in a balanced radial pattern. The arrows representing the variables have relatively equal lengths, indicating that each feature has a balanced influence on the results. Since the data has been scaled, each variable contributes similarly in the PCA calculation, preventing certain features from dominating the clustering due to their value range.
In the Assignment 1 PCA biplot, certain features with larger raw values dominate the distribution of the principal components. As a result, these features heavily influence the direction and length of the arrows, skewing the PCA results.
+In the Assignment 2 PCA biplot, all features contribute more equally to the principal components, indicating a balanced influence of all features on the PCA results. This provides a more accurate and holistic view of the relationships between the different features and the resulting cluster centers.
+Analysis of new graphs:
+Combining PCA and heatmap analyses, we can infer significant chemical composition differences within the data, which are likely related to rock types. The scree plot and K-means clustering analysis indicate that the variance in the dataset is primarily concentrated in the first few principal components, and fewer clusters can effectively categorize the samples. The clustering analysis results support the use of fewer clusters for data classification and complement the PCA findings. This helps to better understand the chemical composition structure and classification of rock samples.
+New results from the final PCA biplot in Assignment 2:
+Principal Component Interpretation: The X-axis of the PCA graph (standardized PC1) explains 49.2% of the data variance, while the Y-axis (standardized PC2) explains 18.9%. Together, these two principal components explain about 68.1% of the data variance. This is a much smaller share than the first two components explained in Assignment 1.
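The percentages on the biplot axes are each component's share of the total variance. As a hedged sketch of where such numbers come from, the code below runs `prcomp()` on a random stand-in matrix (`toy` is an assumption, so its percentages will not match the PIXL values) and then checks the arithmetic on the figures reported above.

```r
# Hypothetical stand-in for the scaled 16 x 13 matrix (random values, not mission data)
set.seed(1)
toy <- scale(matrix(rnorm(16 * 13), nrow = 16))

# PCA on already-scaled data, so no further scaling is requested
pca <- prcomp(toy, scale. = FALSE)

# Share of total variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(100 * var_explained[1:2], 1))  # PC1 and PC2 shares, in percent
print(round(100 * cumsum(var_explained)[2], 1))  # combined share of PC1 + PC2

# Arithmetic check on the values reported in the text
reported <- c(PC1 = 49.2, PC2 = 18.9)
print(sum(reported))  # 68.1
```

Applying the same calculation to the notebook's `pixl.matrix.pca` object should reproduce the axis labels, assuming it was created with `prcomp()` as in the chunk above.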
+Variable Vector Orientation: The vector orientation of individual elements such as FeO-T, MgO, Na2O, etc. shows their correlation with the principal components. For example, FeO-T and MgO have a strong positive correlation with the first principal component, while Al2O3, SiO2 and K2O have a strong negative correlation with the second principal component. By contrast, a large number of elements could not be analyzed in the Assignment 1 biplot.
+Rock type classification: The points labeled “igneous” are mainly concentrated in the extension direction of FeO-T and MgO, while the points labeled “sedimentary” lean towards the extension direction of Al2O3 and SiO2. This may indicate that the igneous rocks are higher in FeO-T and MgO, while the sedimentary rocks may be rich in Al2O3 and SiO2.
+